ADVANCING IMAGE RECONSTRUCTION AND RESTORATION THROUGH ROBUST SUPERVISED AND GENERATIVE MODELS

By

Shijun Liang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Biomedical Engineering—Doctor of Philosophy

2025

ABSTRACT

Medical imaging is integral to modern clinical workflows, extensively used in diagnosis, prognosis, and treatment planning for numerous diseases. Magnetic Resonance Imaging (MRI) is particularly valuable because it avoids ionizing radiation and provides excellent soft tissue contrast. However, MRI has limitations, such as prolonged scan times that raise imaging costs and increase susceptibility to artifacts like motion. Furthermore, challenges in medical imaging data acquisition and distribution arise due to patient privacy concerns and strict ethical and legal restrictions. Additionally, medical data is highly heterogeneous, varying across institutions, imaging devices, protocols, and patient demographics. Harmonizing data from diverse sources requires extensive preprocessing, complicating data acquisition.

This thesis presents algorithms that can be categorized into two main areas. The first area explores MRI reconstruction using limited-data or data-free machine learning techniques to bridge gaps in data acquisition. Specifically, for cases with limited training data, we propose the LONDN-MRI method, which trains on a small set of adaptively chosen neighboring images that are similar to the target image. For data-free scenarios, we advance the Deep Image Prior (DIP), introducing self-guided DIP, a self-regularized method that leverages a denoising-based regularization: the network input is continually perturbed with random noise, and the network output is smoothed to enhance generalization. Inspired by self-guided DIP, we further improve efficiency by building on an insight into the impact of the DIP network input: we introduce Autoencoding Sequential DIP (aSeqDIP), which employs a U-Net architecture whose weights are updated sequentially, while the input is simply updated in a feed-forward fashion with autoencoding regularization. Our findings indicate that LONDN-MRI outperforms the supervised MoDL model by approximately 0.4 dB for MRI reconstruction from limited measurements. On a more challenging dataset, the Stanford FSE MRI dataset, our method achieves a 0.8 dB improvement. Both self-guided DIP and aSeqDIP outperform the state-of-the-art generative Score-MRI model by approximately 0.45 dB and 0.76 dB, respectively.

Second, we explore means to improve the reconstruction network's generalization capabilities. We introduce an unrolling method (SMUG) combined with randomized smoothing to counteract the effects of worst-case and other perturbations. This approach combines model-based deep unrolling with randomized smoothing, helping to mitigate worst-case perturbations as well as variations such as sampling pattern shifts, differing acceleration factors, and Gaussian noise. Furthermore, a pre-trained diffusion model can act as an effective purifier placed before an unrolled network: by incrementally adding Gaussian noise and subsequently removing it, it serves as a robust noise-removal method for image purification.
We implemented the developed Diffusion Purification method in various experiments, particularly on biomedical lesion data, and found that it outperforms common robustness approaches, such as adversarial training, randomized smoothing, and baseline methods without robustness enhancements. However, the primary drawback of diffusion models lies in their slow image generation and denoising processes, posing a significant challenge to balancing processing speed and output quality. To handle this issue, we propose SITCOM, which exploits three conditions for achieving measurement-consistent diffusion trajectories with expanded Denoising Diffusion Implicit Models (DDIM). Building on these conditions, we propose a new optimization-based diffusion reverse sampling method that not only enforces the standard data-manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates backward diffusion consistency, which maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. It outperforms the state-of-the-art method DAPS on most tasks by 1.2 dB in terms of image quality.

Copyright by
SHIJUN LIANG
2025

To all warriors exploring in the darkness. "There is only one heroism in the world: to see the world as it is and to love it." — Romain Rolland

ACKNOWLEDGMENTS

First and foremost, I wish to express my gratitude to my advisor, Prof. Saiprasad Ravishankar, whose indefatigable mentorship and steady support have been pivotal in my pursuit of a doctoral degree. His guidance, patience, and unwavering dedication continually galvanized my resolve in the face of daunting challenges, and I am deeply indebted to his expertise throughout this academic journey. Moreover, I would like to extend special thanks to Michael T. McCann, whose insightful counsel in programming and mathematics anchored me during the early stages of my Ph.D. I am equally grateful to Zhishen Huang for easing my transition into doctoral life and offering invaluable advice and moral support. I also owe a debt of appreciation to Ismail Alkhouri, whose insights illuminated promising directions for diffusion model research and sharpened my writing skills. Finally, I would like to acknowledge Prof. Rongrong Wang and Prof. Qing Qu for their enriching collaborations, which broadened the horizons of my research.

I consider myself extraordinarily fortunate to have walked this path alongside a cohort of remarkable labmates who have profoundly shaped my personal growth. The thoughtful exchanges and steadfast encouragement of Avrajit Ghosh, Siddhant Gautam, Gabriel Maliakal, Evan Bell, Angqi Li, and Tiffany Owen have infused this journey with memories I will forever cherish. The synergy we cultivated has been a constant source of motivation, and I feel truly privileged to have had such a supportive team by my side.

My gratitude likewise extends to my family: my mom, Yawen Chen, and my dad, Xiwen Liang, whose boundless love and unwavering support form the bedrock upon which all my achievements rest. To my parents, whose belief in my potential never wavered and whose sacrifices paved the way for every accomplishment I now claim—I am immeasurably thankful. Indeed, their quiet resilience and constant reassurance fueled my perseverance, reminding me of the values that guide my aspirations. Lastly, I owe heartfelt thanks to the friends and extended family who steadfastly stood by me throughout this odyssey.
Their presence offered respite from the most arduous trials, their laughter assuaged my worries, and their loyalty rekindled my resolve whenever it flickered. I am eternally grateful for the solidarity and warmth they have shown, and their support will forever remain an integral part of my story.

TABLE OF CONTENTS

CHAPTER 1  OVERVIEW
    1.1 Background
CHAPTER 2  BASIC IMAGE PROCESSING BACKGROUND
    2.1 Magnetic Resonance Imaging
    2.2 CT
    2.3 Deep Image Prior
    2.4 Diffusion Model
    2.5 Robustness
    2.6 Definition for Common Metrics
CHAPTER 3  LONDN-MRI
    3.1 Introduction
    3.2 Introduction
    3.3 Method
    3.4 Experiments
    3.5 Discussion
    3.6 Conclusions
CHAPTER 4  SELF-GUIDED DEEP IMAGE PRIOR
    4.1 Introduction
    4.2 Methodology
    4.3 Experiments and Results
    4.4 Discussion of Results
    4.5 Conclusions
CHAPTER 5  AUTOENCODING SEQUENTIAL DEEP IMAGE PRIOR
    5.1 Introduction
    5.2 Method
    5.3 Experimental Results
    5.4 Conclusions & Future Work
CHAPTER 6  MRI RECONSTRUCTION BY SMOOTHED UNROLLING
    6.1 Introduction
    6.2 Preliminaries and Problem Statement
    6.3 Methodology
    6.4 Experiments
    6.5 Discussion and Conclusion
CHAPTER 7  MRI RECONSTRUCTION VIA DIFFUSION PURIFICATION
    7.1 Introduction
    7.2 Lack of Robustness in DL-based MRI Reconstruction & Score-based DMs
    7.3 Diffusion Purification for Robust DL-based MRI Reconstruction
    7.4 Experimental Results
    7.5 Conclusion
    7.6 Proof of Theorem 1
CHAPTER 8  STEP-WISE TRIPLE-CONSISTENT DIFFUSION SAMPLING
    8.1 Introduction
    8.2 Background: Diffusion Models & Their Usage in Solving IPs
    8.3 SITCOM: Step-wise Triple-Consistent Sampling
    8.4 Experimental Results
    8.5 Conclusion
CHAPTER 9  CONCLUSION
BIBLIOGRAPHY
APPENDIX A  APPENDIX FOR SELF-GUIDED DIP
APPENDIX B  APPENDIX FOR AUTOENCODING SEQUENTIAL DEEP IMAGE PRIOR
APPENDIX C  APPENDIX FOR ROBUST MRI RECONSTRUCTION BY SMOOTHED UNROLLING
APPENDIX D  APPENDIX FOR STEP-WISE TRIPLE-CONSISTENT DIFFUSION SAMPLING FOR INVERSE PROBLEMS

CHAPTER 1
OVERVIEW

1.1 Background

Magnetic Resonance Imaging (MRI) (Fessler, 2010) utilizes strong magnetic fields and radio-frequency (RF) waves to generate high-resolution, detailed images of tissues and anatomical structures within the body. It has gained widespread use in clinical practice due to advantages such as excellent soft tissue contrast, the absence of ionizing radiation, and the capability to capture a wide range of physiological phenomena through various imaging techniques. MRI is instrumental in diagnosing numerous disorders, including cerebral aneurysms, ocular and inner ear conditions, multiple sclerosis, spinal cord disorders, stroke, tumors, and traumatic brain injuries. However, a significant hurdle is the prolonged time required to acquire images. Some scanning procedures can last up to an hour, necessitating that the subject remain confined in a cramped space for extended periods. This not only poses discomfort for patients but also contributes to higher procedural costs, making MRI an expensive imaging option. Additionally, the lengthy acquisition times limit its applicability in situations requiring immediate diagnosis.

In MRI, measurements are acquired by sampling the transverse spins in the object after excitation by radiofrequency waves. The application of spatially varying magnetic field gradients allows only spins at specific resonant frequencies to be sampled during acquisition, enabling the localization of spins based on their spatial frequencies. As a result, raw MR measurements are obtained in the frequency domain, known as k-space, unlike other imaging modalities such as X-ray imaging where acquisition occurs in the image domain. The scan duration in MRI depends on the number of measurements collected in the frequency domain. One method to reduce scan time is to acquire fewer k-space measurements than traditionally necessary, effectively adopting sub-Nyquist sampling. Moreover, in dynamic MRI applications like cardiac imaging, acquiring fully sampled measurements may be impractical or impossible.
However, aggressive undersampling can lead to the loss of critical information necessary for accurate reconstruction. Therefore, there is a pressing need to develop algorithms that can reconstruct high-quality images from limited measurements, mitigating the trade-off between scan duration and image quality (Wen et al., 2023). The rising demand for quantitative information over qualitative assessments in medical imaging has become increasingly apparent in recent years. Healthcare professionals are progressively relying on precise quantitative data regarding patient health, rather than generic indicators, to make more informed, case-specific decisions for monitoring, diagnosis, and treatment planning (Gatenby et al., 2013; Lee et al., 2008; Rosenthal et al., 1992; Ramani et al., 2006). This shift underscores the importance of advanced imaging techniques and reconstruction algorithms that can provide detailed, quantifiable insights while addressing the practical challenges associated with MRI.

On the other hand, Computed Tomography (CT) (Elbakri and Fessler, 2002) employs X-rays to produce cross-sectional images of the body, allowing clinicians to visualize internal structures with high spatial resolution. In CT imaging, an X-ray source rotates around the patient while a detector array captures the attenuated X-rays that pass through the body. This process generates multiple 2D X-ray projections, which are then reconstructed into a 3D image using advanced algorithms. CT's capability to visualize intricate anatomical details has led to its widespread use in diagnosing various conditions, including head trauma, fractures, pulmonary diseases, abdominal pathologies, and cardiovascular disorders (Hsieh, 2003; McCollough et al., 2009; Wintermark et al., 2015). CT imaging offers several distinct advantages, primarily its speed and availability, making it a preferred modality in emergency settings where rapid diagnosis is essential. Additionally, it can capture high-resolution images within seconds, proving crucial in trauma cases, stroke evaluation, and other time-sensitive conditions. The relatively fast scan times also reduce the likelihood of motion artifacts caused by patient movement, which can otherwise degrade image quality. However, CT has its limitations, most notably its reliance on ionizing radiation, which poses a risk to patients, particularly with repeated exposure. The cumulative effects of radiation have led to efforts aimed at optimizing dose levels while maintaining image quality, a challenging balance that has driven advancements in CT technology and image reconstruction algorithms (McCollough et al., 2015). CT measurements are typically acquired in the image domain by capturing the varying levels of X-ray attenuation through tissues of different densities. This enables clear differentiation between structures such as bone, soft tissue, and air-filled cavities. However, because CT image quality is tied to radiation dose, there is an ongoing effort to minimize exposure by developing dose-reduction techniques, such as iterative reconstruction and artificial intelligence (AI)-based methods that enable diagnostic-quality images from low-dose scans. These techniques seek to reduce radiation while preserving, or even enhancing, the clarity and detail of the images. Recently, the focus in CT has expanded beyond mere structural imaging to include functional imaging, where parameters like blood flow or tissue perfusion can provide valuable insights into physiological processes.
Functional CT imaging, though in its early stages, holds promise for more comprehensive assessments of complex conditions, such as coronary artery disease and cancer. Like MRI, CT is increasingly driven by the demand for quantitative, rather than qualitative, data. This shift is evident in the growing emphasis on tools that can extract precise measurements from CT images, supporting more tailored and accurate clinical decision-making.

Our work in this thesis focuses on addressing some of the problems in both the aforementioned areas by developing dataless machine learning algorithms that allow for better reconstruction of MR and CT images from limited sampling of measurements. We also aim to improve the generalization of MRI reconstruction models.

Chapter II briefly provides background on the concepts pertinent to the algorithms formulated and developed in the subsequent chapters.

Chapter III focuses on adaptive LOcal NeighborhooD-based Networks for MRI (LONDN-MRI) reconstruction, which address the fact that deep CNNs usually require enormous datasets for offline training to ensure adequate performance. The approach efficiently learns reconstruction networks from small clusters in a training set, directly at reconstruction time. We show connections of this algorithm to a challenging bilevel optimization problem. Our algorithm for image reconstruction alternates between finding a small set of similar images to a current reconstruction, training the network locally on such neighbors, and updating the reconstruction. The proposed local learning approach is flexible and can be seamlessly integrated with various existing deep learning frameworks for MRI, such as unrolled networks and image-domain denoisers, to enhance their performance. Our experimental results on multiple datasets (fastMRI, Stanford FSE, and fastMRI+) and across multiple k-space undersampling factors showed that the proposed local adaptation techniques surpass networks trained globally on larger datasets. We demonstrated improved performance against scan-specific deep learning methods such as deep image prior, RAKI, and LORAKI, even when using a small number of neighbors for training.

Chapter IV explores the direction from local learning to dataless learning and focuses on a self-guided DIP method, which eliminates the need for separate reference images (for the network input) and gives much better image reconstruction quality than the prior reference-guided method as well as several other related and competing schemes. The proposed method relies on a crucial denoising-based regularization. Also, to gain a deeper understanding of image reconstruction using DIP, we conduct an analysis of gradient descent-trained CNNs in the over-parameterized regime. We employ a realistic imaging forward operator instead of a Gaussian measurement matrix for our analysis of the case of compressed sensing. Our primary finding is that as the number of gradient descent steps used to optimize the standard DIP objective function approaches infinity, the difference between the network estimate and the ground truth will reside in a subspace related to the null space of the forward operator and the network's neural tangent kernel. We empirically demonstrated that this method yields promising results for MRI reconstruction and image inpainting on different datasets. Notably, our approach does not involve any pre-training, and can thus readily handle changes in the measured data.
Moreover, this self-guided method showed better performance than the same model trained in a supervised manner on a large dataset (with lengthy training times). This shows that highly adaptive learning approaches may have the potential to outperform traditional data-driven learning approaches in image reconstruction.

Chapter V further improves the training-data-free, unsupervised self-guided DIP method by building upon an insight about the impact of the DIP network input; we introduce Autoencoding Sequential DIP (aSeqDIP), which incorporates a U-Net architecture whose weights are updated sequentially. These updates are based on objective functions that consist of an input-adaptive data consistency term and an autoencoding regularization term used for noise overfitting mitigation. Our extensive experimental evaluations, in terms of standard image reconstruction metrics and required run-time, highlight the superior (or competitive) performance of aSeqDIP compared to DIP-based and leading DM-based methods for the tasks of MRI and CT reconstruction, denoising, in-painting, and non-linear deblurring.

Chapter VI pursues another direction of the thesis, on the generalization and robustness of deep learning-based MRI reconstruction; in this direction, we propose integrating the randomized smoothing (RS) approach within the Model-Based Deep Learning (MoDL) framework for the problem of MR image reconstruction. This is accomplished by using RS in each unrolling step and at the intermediate unrolled denoisers in MoDL. This strategy is underpinned by the 'pre-training + fine-tuning' technique. We empirically showed that this approach is effective. We provide an analysis and conditions under which the proposed smoothed unrolling (SMUG) technique is robust against perturbations. Furthermore, we introduce a novel weighted smoothed unrolling scheme that learns image-wise weights during smoothing, unlike conventional RS. This approach further improves the reconstruction performance. Furthermore, in this work, we evaluate worst-case additive perturbations in k-space (measurement space), whereas prior work considered image-space perturbations.

Chapter VII addresses the problem of adversarial robustness more comprehensively; we introduce a general robustification framework designed to enhance the resilience of DL-based MRI reconstructors against a variety of instabilities and improve their generalization performance when faced with out-of-distribution samples. This is accomplished through integrating purification via pre-trained DMs into existing DL-based models. We present a novel approach to select a process-switching time step, a critical parameter within our DM-based purification method. This eliminates the necessity of treating it as a hyper-parameter.

Chapter VIII focuses on the reverse sampling steps of diffusion models, such as those used in diffusion purification, which are often lengthy and computationally expensive. Moreover, errors tend to accumulate when appropriate conditions are not established. In this chapter, we identify three critical conditions necessary for achieving measurement-consistent diffusion trajectories. Building upon these foundational conditions, we propose a novel optimization-based sampling method.
Unlike traditional approaches, which primarily focus on ensuring data-manifold measurement consistency and forward diffusion consistency, our method introduces an additional key element: backward diffusion consistency. This new element ensures the preservation of the diffusion trajectory by optimizing the input of the pre-trained model at every sampling step. This integrated approach significantly enhances the efficiency and accuracy of reverse sampling in diffusion models.

CHAPTER 2
BASIC IMAGE PROCESSING BACKGROUND

This chapter provides an overview of the core applications and foundational concepts used throughout this thesis. First, it reviews Magnetic Resonance Imaging (MRI) and the role of Compressive Sensing in MRI, along with Computed Tomography (CT) reconstruction. It also introduces the Deep Image Prior and diffusion models, offering insight into the deep learning methods utilized in this work, and discusses robustness and common evaluation metrics.

2.1 Magnetic Resonance Imaging

Magnetic Resonance Imaging (MRI) (Tran et al., 2021) utilizes a powerful magnetic field (denoted as B_0) to align the proton spins in hydrogen atoms (primarily found in water molecules within the body). By applying Radio Frequency (RF) waves, these spins are tipped, initiating a process known as precession around the B_0 field, which generates a detectable signal or voltage in a receiver coil. Unlike imaging methods such as X-ray radiography or photography, where measurements are captured in the image or spatial domain, MRI data is collected in the frequency domain, referred to as k-space. Spatial localization within k-space is achieved through magnetic field gradients along the x and y axes, known as frequency and phase encoding, respectively. In 2D imaging, gradients in the z-direction are employed to selectively excite spins within a specific cross-sectional "slab" of the subject. Once enough k-space data has been obtained, image reconstruction occurs through a Fourier transform that translates frequency-domain data into the image domain (Donoho, 2006b). For 3D or volumetric imaging, the entire volume is excited, with additional spatial encoding achieved by incorporating phase encoding along the z-axis, complementing the standard phase and frequency encoding used in 2D acquisitions.

Image reconstruction from undersampled data is an ill-posed inverse problem, which can be formulated as

\hat{x} = \arg\min_{x} \; \|Ax - y\|_2^2 + \mathcal{R}(x),    (2.1)

where A is a linear measurement operator, y \in \mathbb{R}^p are the measurements, and \hat{x} \in \mathbb{R}^q is the reconstructed image. The first term in the minimization is referred to as a data-fidelity function and can also take on alternative forms depending on the imaging setup. In classical image inpainting, A is a binary masking operator. For the task of reconstructing a multi-coil MRI image, represented by x \in \mathbb{C}^q, the optimization problem is

\hat{x} = \arg\min_{x} \; \sum_{c=1}^{N_c} \|A_c x - y_c\|_2^2 + \lambda \mathcal{R}(x),    (2.2)

where the k-space measurements taken from N_c coils are represented by y_c \in \mathbb{C}^p, c = 1, \ldots, N_c. The coil-wise forward operator is denoted as A_c = M F S_c, where M \in \{0, 1\}^{p \times q} is a masking operator that captures the data sampling pattern in k-space, F \in \mathbb{C}^{q \times q} is the Fourier transform operator, and S_c \in \mathbb{C}^{q \times q} represents the c-th coil-sensitivity map (a diagonal matrix) (Lustig et al., 2007). An explicit regularizer \mathcal{R}(\cdot) is employed to limit the solutions to the domain of desirable images. Various regularizers have been used in image reconstruction.
For example, the regularizer can be the \ell_1 penalty on wavelet coefficients, a total variation penalty, or patch-based sparsity in learned dictionaries or transforms (Wen et al., 2020; Ravishankar and Bresler, 2010), where the regularizer exploits the learned transform-domain sparsity of reconstructed image patches or assumes that patches in the reconstructed image can be expressed as sparse linear combinations of the atoms of a learned dictionary:

R(x) = \min_{D, Z} \; \|Px - DZ\|_2^2 + \lambda \|Z\|_0,    (2.3)

or

R(x) = \min_{W, \alpha} \; \|WPx - \alpha\|_2^2 + \lambda \|\alpha\|_0.    (2.4)

Here, (2.3) and (2.4) correspond to dictionary- and transform-based regularization, respectively. P is a patch extraction operator, and D and W are the dictionary and transform matrices, respectively (typically, additional constraints or penalties are exploited for learning them). These can either be learned or fixed beforehand. Z and \alpha represent sparse representation coefficients. Eqn. (2.3) adopts the synthesis model, which tries to express each patch in the image as the sum of a few fundamental components, while Eqn. (2.4) adopts the analysis model, which posits that image patches can be decomposed into a few significant coefficients if an appropriate transform is applied.

The success of deep learning in domains like computer vision and image processing, together with the availability of pairwise training data (consisting of fully sampled reconstructions and corresponding undersampled reconstructions), has ushered in the use of deep neural networks in compressed-sensing MRI, where a majority of techniques rely upon the richness of CNNs and GANs in their ability to learn features from training data. Typically, in these algorithms, the output of a deep network is used to regularize the MRI reconstruction problem, i.e., R(x) = \|x - V_\theta(x_0)\|_2^2, where V is a deep CNN whose weights are denoted by \theta, and x_0 is an initial estimate of the image being reconstructed, such as a zero-filled reconstruction. These deep networks are typically trained in a supervised fashion using pairwise training data consisting of a fully sampled ground truth reconstruction as the target and the corresponding undersampled reconstruction as an input to the network (often, a zero-filled reconstruction is chosen for this purpose). Where such pairwise training data is not available, generative adversarial networks are often used instead. In these settings, R(x) = \|x - G_\phi(x_0)\|_2^2, where G is a CNN with weights \phi, trained using unpaired or partially paired training data and an (additional) adversarial objective, and x_0 is an initial estimate of the image being reconstructed, such as a zero-filled reconstruction (Lei et al., 2020; Yang et al., 2017). Other than the reduced demands for fully-sampled training data, an advantage of using GANs for regularized reconstruction is that they yield images that have more realistic texture.

The category of supervised algorithms that has found the most success in reconstructing MR images from limited measurements is the class of unrolled algorithms (Monga et al., 2021). A trademark of such algorithms is that they usually extend iterative approaches to image reconstruction to incorporate pairwise training data. Usually, this involves replacing one or multiple stages in a single iteration of an image reconstruction algorithm with a deep CNN. While unrolled algorithms have demonstrated their superiority amongst supervised algorithms, and are often treated as a replacement for traditional prior-based iterative reconstruction algorithms (Aggarwal et al., 2019a; Zhang and Ghanem, 2018; Hammernik et al., 2018; Schlemper et al., 2017), there has been little investigation into whether the features learned by unrolled algorithms subsume those enforced in traditional priors like dictionary or transform learning priors, or even Total Variation (TV)-based methods.
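To make the multi-coil forward model A_c = M F S_c in (2.2) concrete, the following is a minimal NumPy sketch of the forward operator, its adjoint (the zero-filled reconstruction), and the data-fidelity gradient. The array shapes, random sensitivity maps, and sampling mask are illustrative assumptions, not part of the thesis pipeline.

```python
import numpy as np

def forward_op(x, mask, smaps):
    """A_c x = M F S_c x for all coils: x (H, W), mask (H, W), smaps (Nc, H, W)."""
    coil_images = smaps * x                              # S_c x
    kspace = np.fft.fft2(coil_images, norm="ortho")      # F S_c x (per coil)
    return mask * kspace                                 # M F S_c x

def adjoint_op(y, mask, smaps):
    """A^H y = sum_c S_c^H F^H M^H y_c, i.e., the zero-filled reconstruction."""
    coil_images = np.fft.ifft2(mask * y, norm="ortho")
    return np.sum(np.conj(smaps) * coil_images, axis=0)

def data_fidelity_grad(x, y, mask, smaps):
    """Gradient of sum_c ||A_c x - y_c||_2^2 in (2.2) with respect to x."""
    return 2.0 * adjoint_op(forward_op(x, mask, smaps) - y, mask, smaps)

# Illustrative usage with random coil maps and a random sampling mask.
rng = np.random.default_rng(0)
H, W, Nc = 64, 64, 8
smaps = rng.standard_normal((Nc, H, W)) + 1j * rng.standard_normal((Nc, H, W))
mask = (rng.random((H, W)) < 0.25).astype(float)
x_true = rng.standard_normal((H, W)) + 0j
y = forward_op(x_true, mask, smaps)
x0 = adjoint_op(y, mask, smaps)   # initial (aliased) estimate
```

The same adjoint and gradient routines are the basic building blocks of the iterative and unrolled reconstruction schemes discussed above.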
2.2 CT

CT reconstruction (Shete and Jadhav, 2023) shares the same problem setup as MRI reconstruction,

\hat{x} = \arg\min_{x} \; \|Ax - y\|_2^2 + \mathcal{R}(x),    (2.5)

where A is now the Radon transform, which takes a function defined on a 2D space (often an image) and maps it to a new function that represents the integrals of the original function over straight lines. For an arbitrary 2D object f(x, y), the corresponding Radon transform with a parallel-beam X-ray CT imaging geometry can be written as follows:

p(s, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, \delta(x \cos\theta + y \sin\theta - s) \, dx \, dy.    (2.6)

Here, p(s, \theta) denotes a Radon projection of f(x, y) at a certain view angle \theta, \delta(\cdot) is the Dirac delta function, and s is the position of a detector unit relative to the geometric center of the X-ray imaging system. Thus, f(x, y) \delta(x \cos\theta + y \sin\theta - s) represents the intersection of an X-ray beam with f(x, y). The Radon projections are usually collected within a rotation interval of 180 degrees, namely, 0 \le \theta \le \pi.

To reconstruct f(x, y), one can use the FBP algorithm. Let us denote the two-dimensional Fourier transform of f(x, y) as F(\omega, \theta) and the one-dimensional Fourier transform of p(s, \theta) as P(\omega, \theta). According to the inverse Fourier transform and the Central Slice Theorem, f(x, y) can be expressed as follows:

f(x, y) = \int_{0}^{\pi} \int_{0}^{\infty} F(\omega, \theta) \, e^{2\pi i \omega (x \cos\theta + y \sin\theta)} \, \omega \, d\omega \, d\theta = \int_{0}^{\pi} \int_{-\infty}^{\infty} P(\omega, \theta) \, |\omega| \, e^{2\pi i \omega (x \cos\theta + y \sin\theta)} \, d\omega \, d\theta.    (2.7)

Here, |\omega| is the transfer function of the ramp filter. Introducing Q(\omega, \theta) = |\omega| P(\omega, \theta) and denoting the inverse Fourier transform of Q(\omega, \theta) as q(s, \theta), the reconstruction of an arbitrary 2D object f(x, y) via the FBP algorithm can be obtained with the following two steps:

1. Apply a ramp filtering operation to p(s, \theta) with respect to the variable s in the Fourier domain:

q(s, \theta) = \mathcal{F}^{-1}\{ |\omega| \cdot \mathcal{F}\{ p(s, \theta) \} \};    (2.8)

2. Back-project q(s, \theta) to obtain the reconstruction:

f(x, y) = \int_{0}^{\pi} q(s, \theta) \big|_{s = x \cos\theta + y \sin\theta} \, d\theta.    (2.9)

Here, s = x \cos\theta + y \sin\theta denotes a sinusoidal track, from which the Radon projection points are related to the reconstructed point (x, y). The reconstruction by the FBP algorithm (Kak and Slaney, 2001) might suffer from noise-induced artifacts due to the degradation of the Radon projections. To obtain a promising reconstruction, one can apply off-the-shelf restoration algorithms in the Radon projection and/or image domain to further improve the FBP results. This indicates that the Radon inversion can be approximated by several successive operations, each of which is highly dependent on the results of the previous operation.
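The two FBP steps above can be reproduced in a few lines with scikit-image's radon/iradon routines. This is a minimal sketch under stated assumptions: the phantom, the number of views, and the filter_name keyword (whose spelling varies across scikit-image versions) are chosen for illustration.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, rescale

# Simulate parallel-beam projections p(s, theta) over 180 degrees (Eq. 2.6).
image = rescale(shepp_logan_phantom(), 0.5)
thetas = np.linspace(0.0, 180.0, 180, endpoint=False)
sinogram = radon(image, theta=thetas)

# FBP: ramp-filter each projection, then back-project (Eqs. 2.8 and 2.9).
recon = iradon(sinogram, theta=thetas, filter_name="ramp")
print("relative error:", np.linalg.norm(recon - image) / np.linalg.norm(image))
```

With fewer view angles (the sparse-views setting studied later in this thesis), the same pipeline exhibits the streak artifacts that learned reconstruction methods aim to suppress.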
2.3 Deep Image Prior

Image reconstruction is an ill-posed inverse problem that seeks to recover an n-dimensional image x^* from an m-dimensional measurement vector y, where m < n. The forward model can be formulated in different applications as y \approx A x^*, where A is the forward operator. For multi-coil MRI, A = MFS, where M denotes coil-wise undersampling, F is the coil-by-coil Fourier transform, and S represents sensitivity encoding with multiple coils. For CT, we use a simplified forward operator to study the sparse-views setting: A = CR, where C selects specific projection views or angles, and R is the Radon transform (corresponding to parallel-beam CT).

Deep image prior (DIP) was introduced by (Ulyanov et al., 2018), showing that a U-Net generator network's architecture alone can capture substantial low-level image statistics even without prior learning. Specifically, the DIP image reconstruction is obtained through:

\hat{\theta} = \arg\min_{\theta} \; \|A f_\theta(z) - y\|_2^2, \qquad \hat{x} = f_{\hat{\theta}}(z),    (2.10)

where \hat{x} is the reconstructed image, and \theta corresponds to the parameters of a network f. The input to the network, z, is randomly selected and remains fixed during optimization. Although standard DIP performs well on many tasks, determining the optimal number of iterations is challenging, as the network may eventually fit noise in y or undesired images from the null space of A. To mitigate the problem of noise overfitting, previous studies considered different approaches such as regularization, early stopping (ES), and network pruning (Ghosh et al., 2024). For regularization-based methods, the work in (Liu et al., 2019b) enhanced the standard DIP by introducing a total variation (TV) regularization term for denoising and deblurring tasks, whereas the study in (Cheng et al., 2019) proposed combining DIP with stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). The authors in (Wang et al., 2023a) use running variance as the criterion for ES, whereas the authors of (Li et al., 2021) propose combining self-validation and training to apply ES.

The input to the standard DIP (or vanilla DIP) network is a random noise vector that, in most works, remains fixed during the optimization. Nevertheless, other works, such as those in (Zhao et al., 2020a) and (Tachella et al., 2021), have explored cases where the input contains some structure of the ground truth. The approach employed in reference-guided DIP (Ref-Guided DIP) (Zhao et al., 2020a) follows the same objective as standard DIP in (2.10). However, instead of using a fixed random noise vector as input, it utilizes a reference image closely resembling the one undergoing reconstruction. This method was applied to the task of MRI reconstruction. This methodology proves particularly effective when datasets comprising structurally similar data points are available. The reference required here makes this method a data-dependent approach. In Chapters IV and V, we introduce methods that properly address this limitation.
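For concreteness, below is a minimal PyTorch sketch of the vanilla DIP objective in (2.10). The tiny convolutional generator, the identity forward operator (plain denoising), and the fixed iteration count standing in for early stopping are all illustrative assumptions rather than the configurations used in this thesis.

```python
import torch
import torch.nn as nn

def dip_reconstruct(net, A, y, n_iters=500, lr=1e-3):
    """Vanilla DIP (Eq. 2.10): min_theta ||A f_theta(z) - y||_2^2, then x_hat = f_theta_hat(z).

    The random input z stays fixed; only the network weights are optimized.
    A small n_iters acts as crude early stopping against noise overfitting.
    """
    z = torch.randn_like(y)                      # fixed random network input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = ((A(net(z)) - y) ** 2).sum()      # data fidelity on the measurements
        loss.backward()
        opt.step()
    return net(z).detach()

# Toy usage: denoising, where A is the identity operator.
net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 1, 3, padding=1))
y = torch.randn(1, 1, 64, 64)                    # noisy "measurements" (illustrative)
x_hat = dip_reconstruct(net, lambda v: v, y)
```

Replacing the identity with the MRI or CT forward operators above turns the same loop into a scan-specific reconstructor, which is the setting considered in Chapters IV and V.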
2.4 Diffusion Model

Pre-trained Diffusion Models (DMs) generate images by applying a pre-defined iterative denoising process (Ho et al., 2020). In the Variance-Preserving Stochastic Differential Equation (SDE) setting (Song et al., 2021b,a), DMs are formulated using the forward and reverse processes

dx_t = -\frac{\beta_t}{2} x_t \, dt + \sqrt{\beta_t} \, dw, \qquad dx_t = -\beta_t \Big[ \frac{1}{2} x_t + \nabla_{x_t} \log p_t(x_t) \Big] dt + \sqrt{\beta_t} \, d\bar{w},    (2.11)

where \beta : \{0, \ldots, T\} \to (0, 1) is a pre-defined function that controls the amount of additive perturbations at time t, w (resp. \bar{w}) is the forward (resp. reverse) Wiener process (Anderson, 1982), p_t(x_t) is the distribution of x_t at time t, and \nabla_{x_t} \log p_t(x_t) is the score function, which is replaced by a neural network (typically a time-encoded U-Net (Ronneberger et al., 2015a)) s_\theta : \mathbb{R}^n \times \{0, \ldots, T\} \to \mathbb{R}^n, parameterized by \theta. In practice, given the score function s_\theta, the SDEs in (2.11) can be discretized as in (2.12), where \eta_t, \eta_{t-1} \sim \mathcal{N}(0, I):

x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \eta_{t-1}, \qquad x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}} \big[ x_t + \beta_t s_\theta(x_t, t) \big] + \sqrt{\beta_t} \, \eta_t.    (2.12)

When employed to solve inverse problems, the score function in (2.11) is replaced by a conditional score function which, by Bayes' rule, is \nabla_{x_t} \log p_t(x_t | y) = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y | x_t). Solving the SDE in (2.11) with the conditional score is referred to as posterior sampling (Chung et al., 2023b). As there does not exist a closed-form expression for the term \nabla_{x_t} \log p_t(y | x_t) (termed the measurements matching term in (Daras et al., 2024)), previous works have explored different approaches, which we briefly discuss below. We refer the reader to the recent survey in (Daras et al., 2024) for an overview of DM-based methods for solving IPs.

A well-known method is Diffusion Posterior Sampling (DPS) (Chung et al., 2023b), which uses the approximation p(y | x_t) \approx p(y | \hat{x}_0), where \hat{x}_0(x_t) (or simply \hat{x}_0) is the estimated image at time t as a function of the pre-trained model and x_t (Tweedie's formula (Vincent, 2011)), given as

\hat{x}_0(x_t) = \frac{1}{\sqrt{\bar{\alpha}_t}} \big[ x_t - \sqrt{1 - \bar{\alpha}_t} \, \epsilon_\theta(x_t, t) \big] =: f(x_t; t, \epsilon_\theta),    (2.13)

where \bar{\alpha}_t = \prod_{j=1}^{t} \alpha_j and \alpha_t = 1 - \beta_t. We call the function f, defined in (2.13), the 'Tweedie-network denoiser' (Chen et al., 2024). Here, \epsilon_\theta(x_t, t) = -\sqrt{1 - \bar{\alpha}_t} \, s_\theta(x_t, t) (Luo, 2022) outputs the noise in x_t. Tweedie's formula is also adopted in other DM-based IP solvers such as (Rout et al., 2023; Chung et al., 2023d; Wang et al., 2022). The drawback of these methods is that they require a large number of sampling steps.

The work in ReSample (Song et al., 2023a) solves an optimization problem on the estimated posterior mean in the latent space to enforce a step-wise measurement consistency, requiring many sampling and optimization steps. The work in (Mardani et al., 2023) introduced RED-Diff, a variational Bayesian method that fits a Gaussian distribution to the posterior distribution of the clean image given the measurements. This approach involves solving an optimization problem using stochastic gradient descent (SGD) to minimize a data-fitting term while maximizing the likelihood of the reconstructed image under the denoising diffusion prior (as a regularizer). However, the SGD process requires multiple iterations, each involving evaluations of the pre-trained DM on a different noisy image at some randomly selected time. While RED-Diff reduces the run-time, its qualitative results are not competitive on several image restoration tasks. Recently, Decoupling Consistency with Diffusion Purification (DCDP) (Li et al., 2024) proposed separating diffusion sampling steps from measurement consistency by using DMs as diffusion purifiers (Nie et al., 2022a; Alkhouri et al., 2024), with the goal of reducing the run-time. However, for every task, DCDP requires tuning the number of forward diffusion steps for purification for each sampling step. Shortly after, Decoupled Annealing Posterior Sampling (DAPS) (Zhang et al., 2024a) introduced another decoupled approach, incorporating gradient descent noise annealing via Langevin dynamics. DAPS, similar to DPS, also requires a large number of sampling and optimization steps.
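The discrete updates in (2.12) and the Tweedie estimate in (2.13) translate directly into code. The sketch below assumes a pre-trained score network score_net (or noise predictor eps_net) and a precomputed cumulative-product schedule alpha_bar; these names are placeholders for illustration, not an existing API.

```python
import math
import torch

def forward_step(x_prev, beta_t):
    """Forward discretization in (2.12): x_t = sqrt(1-beta_t) x_{t-1} + sqrt(beta_t) eta."""
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * torch.randn_like(x_prev)

def reverse_step(x_t, t, beta_t, score_net):
    """Reverse discretization in (2.12), using the learned score s_theta(x_t, t)."""
    mean = (x_t + beta_t * score_net(x_t, t)) / math.sqrt(1.0 - beta_t)
    return mean + math.sqrt(beta_t) * torch.randn_like(x_t)

def tweedie_x0(x_t, t, eps_net, alpha_bar):
    """Tweedie-network denoiser f in (2.13):
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(abar_t)."""
    ab = float(alpha_bar[t])
    return (x_t - math.sqrt(1.0 - ab) * eps_net(x_t, t)) / math.sqrt(ab)
```

Posterior-sampling solvers such as DPS insert a measurement-consistency correction (a gradient step on the data-fidelity of tweedie_x0) between consecutive reverse_step calls.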
2.5 Robustness

Robustness in MRI reconstruction refers to the ability of a deep learning-based reconstruction model to maintain accurate and stable reconstructions under various perturbations and distribution shifts. These perturbations can arise due to (1) additive noise in k-space, (2) variations in sampling protocols, such as changes in undersampling rates or k-space sampling locations, and (3) unseen anatomies and pathologies encountered at test time. A robust MRI reconstructor should generalize well across these conditions without requiring extensive retraining, ensuring reliable image reconstruction in real-world clinical settings.

2.5.0.1 K-space Additive Noise

Given a trained deep MRI reconstruction NN and an aliased image z = A^H y, recent studies have shown that these NNs are not robust to additive perturbations \delta to y (Li et al., 2023). The study in (Jia et al., 2022b) presents an approach to generate worst-case additive noise that employs norm constraints, in line with the attack strategies utilized in image classification. This approach aims to produce a form of worst-case imperceptible additive noise against a reconstructor in the image domain. Given a perturbation budget \epsilon > 0, the worst-case additive perturbations can be obtained using the following optimization problem:

\max_{\|\delta\|_\infty \le \epsilon} \; \mathcal{L}\big( \mathrm{CNN}_\theta(A^H y), \; \mathrm{CNN}_\theta(A^H (y + \delta)) \big),    (2.14)

where \mathrm{CNN}_\theta denotes the reconstruction network, \|\cdot\|_\infty is the \ell_\infty norm, and \mathcal{L} is a differentiable loss function that computes the reconstruction loss. Given the original image x^*, generating the perturbations can also be achieved by replacing the first argument of \mathcal{L} in (2.14) with x^*. A solution of (2.14) can be obtained using the Projected Gradient Descent (PGD) method (Madry et al., 2017). In this work, we also use z_{\mathrm{pert}} = A^H (y + \delta) = A^H y_{\mathrm{pert}}, which relates perturbations in k-space and image space. In addition to the worst-case perturbations, random/realistic additive measurement noise could also impact the performance of a reconstructor.
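A minimal PyTorch sketch of solving (2.14) with PGD is shown below. It assumes a real-valued (e.g., channel-stacked) representation of the k-space data so that the sign step is well-defined; recon_net and AH are placeholders for a trained reconstructor and the adjoint operator, and the step size and iteration count are illustrative.

```python
import torch

def kspace_pgd_attack(recon_net, AH, y, eps=0.02, alpha=0.005, n_steps=20):
    """Worst-case additive k-space perturbation (Eq. 2.14) via projected gradient descent:
    maximize ||recon(AH y) - recon(AH (y + delta))||^2 subject to ||delta||_inf <= eps."""
    with torch.no_grad():
        target = recon_net(AH(y))                 # reconstruction from clean data
    delta = torch.zeros_like(y, requires_grad=True)
    for _ in range(n_steps):
        loss = ((recon_net(AH(y + delta)) - target) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()    # ascent step on the attack objective
            delta.clamp_(-eps, eps)               # project onto the l_inf ball
        delta.grad.zero_()
    return delta.detach()
```

Replacing target with the ground-truth image x^* yields the second attack variant described above.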
2.5.0.2 Training/Testing Sampling Protocol & Undersampling Rate Disparities

In addition to additive perturbations, the study presented in (Li et al., 2023) underscores an additional potential source of instability that MoDL (and other DL-based reconstructors) may face during testing. This source stems from changes in the measurement sampling rate, leading to perturbations in the sparsity of the sampling mask within A (Antun et al., 2020a). Furthermore, in this work, we consider another variation that these NNs could encounter during the testing phase, involving a shift or variation in the k-space sampling locations within the matrix M, resulting in a nonidentical forward operator at testing. For this case, z_{\mathrm{pert}} = A_{\mathrm{test}}^H y, where A_{\mathrm{test}} \neq A.

We remark that ensuring the robustness of a reconstruction model to variations in the sampling protocol, undersampling rate, scan contrast, etc., is crucial, as it mitigates the need for re-training for all practical scenarios and variations that are common in imaging. Re-training models for new setups is expensive. Moreover, the relatively limited availability of training data (supervised learning requires fully-sampled measurements as labels) in reconstruction applications also warrants learning models that remain significantly robust.

2.5.0.3 Unseen Anatomies & Pathologies at Testing Time

A lesion (or anatomical change) denotes an anomaly or impairment within a tissue or organ of the body, arising from diverse factors such as injuries, diseases, or pathological conditions. In the medical domain, the term commonly characterizes regions of abnormal or diseased tissue observed through MR imaging. In this work, we study the practical case where the DL-based image reconstructor is trained on some data points but tested with measurements containing unseen lesions. Figure 2.1 illustrates reconstructed images that exhibit these instabilities and generalization challenges.

Figure 2.1 Vulnerabilities and generalization challenges of DL-based MRI reconstruction models, shown by evaluating a trained MoDL reconstructor (trained at 4x undersampling). (a) Reconstructed image from clean measurements (PSNR = 30.8 dB). (b) Reconstructed image from measurements with worst-case additive perturbations, i.e., Equation (2.14) with \epsilon = 0.02 (PSNR = 23.21 dB). (c) Reconstructed image from measurements with a 2x undersampling rate during testing (PSNR = 22.18 dB). (d) Reconstructed image from a different test-time sampling mask with 4x undersampling (PSNR = 24.15 dB). (e) Reconstructed image from measurements with an unseen lesion during testing (PSNR = 27.26 dB).

2.6 Definition for Common Metrics

For performance evaluation, we employed three widely used metrics to assess the reconstruction quality of different methods. These metrics quantify the similarity between the reconstructed images and the ground truth images derived from fully sampled k-space data. The chosen metrics were:

• Peak Signal-to-Noise Ratio (PSNR): PSNR, measured in decibels (dB), is a standard metric for evaluating image reconstruction quality. It is defined as:

\mathrm{PSNR} = 10 \log_{10} \left( \frac{\max(I_{\mathrm{gt}})^2}{\mathrm{MSE}} \right),    (2.15)

where I_{\mathrm{gt}} is the ground truth image, and MSE (mean squared error) represents the average squared intensity difference between the reconstructed image and the ground truth. A higher PSNR value indicates better reconstruction quality, as it suggests lower error between the reconstructed and ground truth images.

• Structural Similarity Index (SSIM): SSIM is a perceptual metric that evaluates the similarity between two images by considering luminance, contrast, and structural information. It is computed as:

\mathrm{SSIM}(I_r, I_{\mathrm{gt}}) = \frac{(2\mu_{I_r}\mu_{I_{\mathrm{gt}}} + C_1)(2\sigma_{I_r I_{\mathrm{gt}}} + C_2)}{(\mu_{I_r}^2 + \mu_{I_{\mathrm{gt}}}^2 + C_1)(\sigma_{I_r}^2 + \sigma_{I_{\mathrm{gt}}}^2 + C_2)},    (2.16)

where I_r and I_{\mathrm{gt}} denote the reconstructed and ground truth images, respectively, \mu represents the mean intensity, \sigma denotes the standard deviation, and \sigma_{I_r I_{\mathrm{gt}}} is the covariance. C_1 and C_2 are small constants to stabilize the computation. SSIM values range from -1 to 1, with a value closer to 1 indicating higher structural similarity.

• High-Frequency Error Norm (HFEN): HFEN is designed to measure the preservation of high-frequency details, which are often crucial in medical image reconstruction. It quantifies the error in fine details by comparing the reconstructed image and the ground truth after applying a Laplacian of Gaussian (LoG) filter. The HFEN metric is computed as:

\mathrm{HFEN} = \frac{\| \mathrm{LoG}(I_r) - \mathrm{LoG}(I_{\mathrm{gt}}) \|_2}{\| \mathrm{LoG}(I_{\mathrm{gt}}) \|_2},    (2.17)

where \mathrm{LoG}(\cdot) represents the Laplacian of Gaussian filtering operation. The numerator measures the difference between the LoG-filtered reconstructed and ground truth images using the \ell_2-norm, while the denominator normalizes this difference by the \ell_2-norm of the LoG-filtered ground truth. Lower HFEN values indicate better reconstruction performance, as they imply reduced high-frequency reconstruction errors.

These three metrics collectively provide a comprehensive assessment of reconstruction quality: PSNR captures overall signal fidelity, SSIM measures structural consistency, and HFEN evaluates the preservation of fine details.
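The three metrics are straightforward to compute. Below is a minimal NumPy/SciPy sketch of PSNR and HFEN following (2.15) and (2.17); the LoG filter width sigma is an assumption, since the kernel is not specified here. For SSIM, scikit-image's structural_similarity implements (2.16).

```python
import numpy as np
from scipy.ndimage import gaussian_laplace
from skimage.metrics import structural_similarity  # SSIM per Eq. (2.16)

def psnr(recon, gt):
    """Peak signal-to-noise ratio in dB (Eq. 2.15)."""
    mse = np.mean((recon - gt) ** 2)
    return 10.0 * np.log10(gt.max() ** 2 / mse)

def hfen(recon, gt, sigma=1.5):
    """High-frequency error norm (Eq. 2.17): relative l2 error after LoG filtering."""
    log_r = gaussian_laplace(recon, sigma)
    log_gt = gaussian_laplace(gt, sigma)
    return np.linalg.norm(log_r - log_gt) / np.linalg.norm(log_gt)

# Illustrative usage on a noisy copy of a random "ground truth" image.
gt = np.abs(np.random.default_rng(0).standard_normal((64, 64)))
recon = gt + 0.05 * np.random.default_rng(1).standard_normal((64, 64))
print(psnr(recon, gt),
      structural_similarity(recon, gt, data_range=gt.max()),
      hfen(recon, gt))
```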
CHAPTER 3
LONDN-MRI

3.1 Introduction

Recent medical image reconstruction techniques focus on generating high-quality medical images suitable for clinical use at the lowest possible cost and with the fewest possible adverse effects on patients. Recent works have shown significant promise for reconstructing MR images from sparsely sampled k-space data using deep learning. In this work, we propose a technique that rapidly estimates deep neural networks directly at reconstruction time by fitting them on small, adaptively estimated neighborhoods of a training set. In brief, our algorithm alternates between searching for neighbors in a data set that are similar to the test reconstruction, and training a local network on these neighbors followed by updating the test reconstruction. Because our reconstruction model is learned on a dataset that is in some sense similar to the image being reconstructed, rather than being fit on a large, diverse training set, it is more adaptive to new scans. It can also handle changes in training sets and flexible scan settings, while being relatively fast. Our approach, dubbed LONDN-MRI, was validated on multiple data sets using deep unrolled reconstruction networks. Reconstructions were performed at fourfold and eightfold undersampling of k-space with 1D variable-density random phase-encode undersampling masks. Our results demonstrate that our proposed locally-trained method produces higher-quality reconstructions compared to models trained globally on larger datasets as well as other scan-adaptive methods.

3.2 Introduction

In applications like X-ray computed tomography (CT) (Elbakri and Fessler, 2002) and magnetic resonance imaging (MRI) (Fessler, 2010), reconstructing images from undersampled or corrupted observations is of critical importance. For example, this is necessary to reduce a patient's exposure to radiation in CT or to reduce the time spent acquiring MRI data. MRI scans involve sequential data acquisition, resulting in long acquisition times that are not only a burden for patients and hospitals, but also make MRI susceptible to motion artifacts. Reconstructing images from limited measurements can speed up the MRI scan, but usually entails solving an ill-posed inverse problem. Recent approaches to accelerating MRI acquisition such as compressed sensing (CS) (Donoho, 2006a) reduce scan time by collecting fewer measurements while preserving image quality by exploiting image priors or regularizers. Historically, regularization in CS-MRI has been based on sparsity of wavelet coefficients (Mihcak et al., 1999) or on total variation (Ma et al., 2008). While conventional CS assumes sparse or incoherent signals, approaches based on learned image models have been shown to be more effective for MRI reconstruction, starting with learned synthesis dictionaries (Ravishankar and Bresler, 2011; Lingala and Jacob, 2013). The dictionary parameters could be learned from unpaired clean image patches from a dataset and used for reconstruction, or learned simultaneously with image reconstruction (Ravishankar et al., 2015; Xu et al., 2012; Ye et al., 2021; Ravishankar and Bresler, 2011). Additionally, recent advances in sparsifying transform learning have resulted in efficient or inexpensive data-adaptive sparsity-based reconstruction frameworks for MRI (Ravishankar and Bresler, 2012; Ravishankar et al., 2020; Wen et al., 2020). Other contemporary techniques could allow learning explicit regularizers in a supervised manner (Ghosh et al., 2022) for improved image restoration.
Deep learning (DL) has emerged as a potent methodology for tackling large-scale inverse problems, notably in enhancing image reconstruction techniques in MRI and CT. Predominantly, end-to-end CNNs, as exemplified by the U-Net model (Ronneberger et al., 2015b; Jin et al., 2017), have been employed to mitigate artifacts arising from undersampling in MRI datasets. Additionally, a plethora of alternative network models, such as the Transformer (Feng et al., 2021) and Generative Adversarial Networks (GANs) (Lei et al., 2021), have demonstrated their effectiveness in MRI reconstruction, as detailed in comprehensive reviews like (Ravishankar et al., 2020). Furthermore, transfer learning (Dar et al., 2017) has also been used with neural networks for MRI reconstruction to achieve domain transfer.

To enhance both stability and performance, hybrid-domain approaches such as (Aggarwal et al., 2019a) enforce data consistency (i.e., the reconstruction is enforced to be consistent with the measurement model) all through training and reconstruction. Networks incorporating data consistency layers are pivotal in MR imaging, maintaining alignment between the reconstructed image and the original data in k-space (Zheng et al., 2019; Schlemper et al., 2018). This category encompasses various methodologies, including deep unrolling-based methods (Yang et al., 2016; Hammernik et al., 2018) (which adapt traditional iterative algorithms to learn regularization parameters), regularization-by-denoising approaches (Romano et al., 2017), and plug-and-play methods (Buzzard et al., 2018), among others. Distinctively, the ADMM-CSNet (Yang et al., 2016) utilizes neural networks for the optimization of ADMM parameters, diverging from ISTA-Net (Zhang and Ghanem, 2018), which focuses on refining CS reconstruction models grounded in the Iterative Shrinkage-Thresholding Algorithm. While these CNN-based reconstruction methods have demonstrated superiority over traditional CS techniques, concerns regarding their stability and interpretability persist, as highlighted in (Antun et al., 2020b).

Apart from algorithmic advances, another driving force behind deep learning-based reconstruction is the rapid growth of publicly available training datasets. The availability of (paired or unpaired) training data sets made possible by efforts like OCMR (Chen et al., 2020) and fastMRI (et al., 2019) has enabled rapidly demonstrating the capacity of deep learning-based algorithms for improved image reconstruction or denoising quality in MRI applications. However, one major drawback of these learned approaches is that they typically require large training datasets, such as fully sampled MRI data, to be effective. A recent scan-specific deep learning method is the deep image prior (Ulyanov et al., 2018), which has been applied to MRI (Darestani and Heckel, 2021) and learns a neural network for reconstruction in an unsupervised fashion from a single image's measurements. Other scan-adaptive methods include RAKI (Akçakaya et al., 2019), which is a nonlinear deep learning-based auto-regressive auto-calibrated reconstruction method. RAKI could be viewed as a deep neural network-based version of the parallel imaging scheme GRAPPA (Deshmane et al., 2012). LORAKI (Kim et al., 2019) is another scheme that trains an autocalibrated recurrent neural network (RNN) to recover missing k-space data.
The 1D deep low-rank and sparse network (ODLS) (Wang et al., 2023c) demonstrates enhanced robustness for 2D MR image reconstruction, particularly in scenarios characterized by a limited number of training samples. All these methods learn scan-specific networks without requiring large datasets. A related approach, dubbed self-supervised learning, has also shown promise for MRI (Yaman et al., 2020) and uses a large unpaired data set.

3.2.1 Contributions

While deep learning approaches have gained popularity for MRI reconstruction due to their ability to model complex data sets, they often have difficulties generalizing to new data or distinct experimental situations at test time. Deep CNNs usually require enormous datasets for offline training to ensure adequate performance trade-offs. In this work, we propose to learn adaptive LOcal NeighborhooD-based Networks for MRI (LONDN-MRI) reconstruction. The approach efficiently learns reconstruction networks from small clusters in a training set, directly at reconstruction time.

• The proposed models are trained using a small number of adaptively chosen neighbors that are in proximity (or are similar in a sense) to the underlying (to be reconstructed) image (cf. (Lahiri et al., 2020) for a slightly related approach in the context of patch-based dictionary learning).

• We show connections of this algorithm to a challenging bilevel optimization problem. Our algorithm for image reconstruction alternates between finding a small set of similar images to a current reconstruction, training the network locally on such neighbors, and updating the reconstruction.

• The proposed local learning approach is flexible and can be seamlessly integrated with various existing deep learning frameworks for MRI, such as unrolled networks and image-domain denoisers, to enhance their performance.

• Our experimental results on multiple datasets (fastMRI, Stanford FSE, and fastMRI+) and across multiple k-space undersampling factors showed that the proposed local adaptation techniques surpass networks trained globally on larger datasets. We demonstrated improved performance against scan-specific deep learning methods such as deep image prior, RAKI, and LORAKI, even when using a small number of neighbors for training.

• We have shown the method's generalizability under different scenarios, including different sampling patterns and testing on data with artificial as well as natural lesions, when the training dataset did not include such lesions. To establish clinical utility, we also conducted tests under different MR scan contrast settings and varying signal-to-noise ratios at test time, where the proposed method showed promise. Our study also encompassed an analysis of image quality versus time consumption trade-offs involving different networks and numbers of selected neighbors, and our method compared favorably with related approaches.

3.3 Method

Our approach relies on finding images in a data set that are in a sense similar to the one being reconstructed. The similarity may be defined using a metric such as the Euclidean distance or other metrics.
Assume we have a data set $\{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^{N}$ with $N$ reference or ground-truth images $\mathbf{x}_n$ and their corresponding k-space measurements $\mathbf{y}_n$ (with multi-coil data). We use a distance metric $d$ to find the $k$ nearest neighbors to an (estimated/reconstructed) image $\mathbf{x}$ as follows:

$$\hat{C}_{\mathbf{x}} = \underset{C \in \mathcal{C},\, |C| = k}{\arg\min} \; \sum_{n \in C} d(\mathbf{x}, \mathbf{x}_n), \tag{3.1}$$

where $C$ is a set of cardinality $k$ containing indices of feasible neighbors, and $\mathcal{C}$ denotes the set of all such sets with $k$ elements. Different distance functions could produce a different set of similar neighbors, which could then affect the outcome of the reconstruction algorithm, as our network modeling depends on the choice of the local data set. As a result, we used different metrics for evaluating our approach in this work. The distances serve as a proxy for data similarity, with nearby data considered similar and distant data considered dissimilar. We used the Euclidean distance, the Manhattan distance, and normalized cross-correlation as distance metrics, as follows:

$$d_{L1}(\mathbf{x}, \mathbf{x}_n) = \|\mathbf{x} - \mathbf{x}_n\|_1 \tag{3.2}$$

$$d_{L2}(\mathbf{x}, \mathbf{x}_n) = \|\mathbf{x} - \mathbf{x}_n\|_2 \tag{3.3}$$

$$d_{NCC}(\mathbf{x}, \mathbf{x}_n) = \frac{\left|\mathbf{x}^{H}\mathbf{x}_n\right|}{\|\mathbf{x}\|_2 \,\|\mathbf{x}_n\|_2} \tag{3.4}$$

Figure 3.1 Flowchart of the proposed LONDN-MRI scheme with a specific unrolled reconstruction network. The denoising network could be, for example, a U-Net or the recent DIDN.

In all cases, we select the top $k$ most similar neighbors from a set that corresponds to the $k$ smallest distances in (3.1). The indices of the chosen images are in the set $\hat{C}_{\mathbf{x}}$, i.e., they are the minimizer in (3.1). These neighbors can be used to train the local model. They are expected to capture structures most similar to the image being reconstructed, enabling a highly effective reconstruction model to be learned.

3.3.1 Proposed Method

Our primary objective is to learn an adaptive neural network for MRI reconstruction, in which the model's free parameters are fitted using training data that are similar in a sense to the current scan. We emphasize that the proposed model is local in the sense that it changes in response to the input. The advantage of the proposed method is that the model is fit for every scan and can thus be adaptive to the scan, readily handling changes in sampling masks, for example. The algorithm begins by obtaining an initial estimate of the underlying image, denoted $\mathbf{x}^0$, from undersampled measurements $\mathbf{y}$. Our proposed strategy then alternates between computing the closest neighbors to the reconstruction in the training set and performing CNN-based supervised learning on the estimated local dataset. During supervised learning, the network weights could be randomly initialized or could be warm-started with the weights of a pre-trained (e.g., state-of-the-art) network. In the latter case, the pre-trained network would adapt to the features of images similar to the one being reconstructed (akin to transfer learning (Dar et al., 2017)). In each iteration, the nearest ground truth images in the training set are computed in relation to the reconstruction (estimate) predicted by the locally learned network, except in the first iteration, when the nearest neighbors are computed in relation to the (typically highly aliased) initial $\mathbf{x}^0$ (we used corresponding aliased images in the dataset for computing distances in the first iteration).
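To make the neighbor search concrete, the following is a minimal NumPy sketch of computing the distances (3.2)-(3.4) and selecting the $k$ nearest training images as in (3.1). The function names, and the negation used to turn the NCC similarity into a distance so that "smaller is more similar" holds for all three metrics, are our own illustrative choices, not the exact thesis implementation.

```python
import numpy as np

def distance(x, xn, metric="ncc"):
    # Distances (3.2)-(3.4). NCC is a similarity, so this sketch negates
    # it so that smaller values always mean "more similar" (an assumption,
    # since (3.1) selects the k smallest distances).
    if metric == "l1":
        return np.abs(x - xn).sum()
    if metric == "l2":
        return np.linalg.norm(x - xn)
    num = np.abs(np.vdot(x, xn))
    return -num / (np.linalg.norm(x) * np.linalg.norm(xn) + 1e-12)

def find_neighbors(x, train_images, k=30, metric="ncc"):
    """Return indices of the k training images closest to x, i.e., the
    minimizing index set in (3.1)."""
    dists = np.array([distance(x, xn, metric) for xn in train_images])
    return np.argsort(dists)[:k]
```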
In practice, pairwise distances to even a large number of images can be computed very efficiently (in parallel), after which the local network can be rapidly learned on a small set of neighbors (typically a shallow network, or with early stopping). The network weights for deep reconstruction are repeatedly updated to map the initial images in the local data set to their target (ground truth) versions. To demonstrate our approach, we used the state-of-the-art deep CNN reconstruction model MoDL (Aggarwal et al., 2019a), which is trained locally in our scheme. Additionally, we trained it globally, i.e., once on a larger dataset, in order to compare it to our on-the-fly neighborhood-based learning scheme. For completeness, we briefly recap the MoDL scheme in the following and discuss its local training within our framework. MoDL is similar to the plug-and-play approach, except that instead of pre-trained denoiser networks, end-to-end training is used to learn the shared network weights across iterations in the architecture.

3.3.2 Network Model and Training

The proposed approach is compatible with any network architecture. We use MoDL, which has shown promise for MR image reconstruction, and which combines a denoising network with a data consistency (DC) module in each iteration of an unrolled architecture. MoDL unrolls alternating minimization for the following problem:

$$\mathcal{L}_a(\mathbf{z}, \mathbf{x}) := \nu \sum_{c=1}^{N_c} \|\mathbf{A}_c \mathbf{x} - \mathbf{y}_c\|_2^2 + \mathcal{R}(\mathbf{z}) + \mu \|\mathbf{x} - \mathbf{z}\|_2^2. \tag{3.5}$$

We denote the initial image in the process as $\mathbf{x}^0$; $\nu \geq 0$ weights the data-consistency term above, and $\mu \geq 0$ weights the proximity of $\mathbf{x}$ to $\mathbf{z}$. By decomposing the optimization into two subproblems over $\mathbf{z}$ and $\mathbf{x}$, the explicit regularizer-based update for $\mathbf{z}$ can be solved by replacing it with a CNN-based denoiser $D_{\theta}(\cdot)$, and the denoised estimate is then used to update $\mathbf{x}$. The $\mathbf{x}$ update in the MoDL scheme involves the data-consistency term and is performed using Conjugate Gradient (CG) descent. Thus, $\mathbf{z}$ is obtained as the output of the CNN-based denoiser $D_{\theta}$, and $\mathbf{x}$ is updated by CG. This alternating scheme is repeated $L$ times (unrolling), with the initial input image $\mathbf{x}^0$ being passed through $L$ blocks of denoising CNN + CG updates. Now, if $\mathcal{S}_{\theta}^{l}(\cdot)$ is the function capturing the $l$th iteration of the algorithm, then the MoDL output for the $l$th block is given as

$$\mathbf{x}^{l+1} = \mathcal{S}_{\theta}^{l}(\mathbf{x}^{l}) = \mathcal{S}\big(\mathbf{x}^{l}, \theta, \nu_l, \{\mathbf{A}_c, \mathbf{y}_c\}_{c=1}^{N_c}\big), \quad \text{where} \quad \mathcal{S}\big(\bar{\mathbf{x}}, \theta, \nu, \{\mathbf{A}_c, \mathbf{y}_c\}_{c=1}^{N_c}\big) \triangleq \arg\min_{\mathbf{x}} \; \nu \sum_{c=1}^{N_c} \|\mathbf{A}_c \mathbf{x} - \mathbf{y}_c\|_2^2 + \|\mathbf{x} - D_{\theta}(\bar{\mathbf{x}})\|_2^2. \tag{3.6}$$

After $L$ iterations, the final output is

$$\mathbf{x}_{\text{supervised}} = \mathbf{x}^{L} = \Big( \bigcirc_{l=0}^{L-1} \mathcal{S}_{\theta}^{l} \Big)(\mathbf{x}^0) \triangleq \mathcal{M}_{\theta}(\mathbf{x}^0), \tag{3.7}$$

where $\bigcirc_{i=0}^{L-1} f^i$ represents the composition of $L$ functions $f^{L-1} \circ f^{L-2} \circ \dots \circ f^{0}$, $\mathbf{x}^0$ is the initial image, and $\mathcal{M}_{\theta}$ denotes the resulting end-to-end unrolled reconstruction mapping. The weights of the denoiser $D_{\theta}$ are shared across the $L$ blocks. The network parameters $\theta$ are learned in a supervised manner so that $\mathbf{x}_{\text{supervised}}$ matches known ground truths (in mean squared error or another metric) on a (large/global or local) training set. This involves the following optimization for training:

$$\hat{\theta} = \arg\min_{\theta} \sum_{n \in S} C_{\beta}\big(\mathcal{M}_{\theta}(\mathbf{x}^0_n); \mathbf{x}_n\big) = \arg\min_{\theta} \sum_{n \in S} \big\|\mathbf{x}_n - \mathcal{M}_{\theta}(\mathbf{x}^0_n)\big\|_2^2,$$

where $n$ indexes the samples from the data set used for training, with $\mathbf{x}_n$ denoting the $n$th target (or ground truth) image reconstructed from fully-sampled k-space measurements, and $\mathbf{x}^0_n$ denoting the initial image estimate from undersampled measurements. The cost $C_{\beta}(\hat{\mathbf{x}}_n; \mathbf{x}_n)$ denotes the training loss.
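For concreteness, the following is a minimal PyTorch-style sketch of the MoDL forward pass in (3.5)-(3.7), assuming a stacked forward operator passed as callables `A` and `At` (its adjoint) and a denoiser module; the data-consistency update of (3.6) is approximated by running a few conjugate-gradient (CG) iterations on the normal equations $(\nu \mathbf{A}^H\mathbf{A} + \mathbf{I})\mathbf{x} = \nu\mathbf{A}^H\mathbf{y} + \mathbf{z}$. All names and defaults here are illustrative assumptions, not the exact thesis code (which handles multi-coil operators and trainable weights).

```python
import torch

def modl_forward(x0, denoiser, A, At, y, L=5, nu=1.0, cg_iters=10):
    """Sketch of an L-block MoDL unrolling: each block applies the shared
    denoiser D_theta (z-update), then solves the data-consistency
    subproblem min_x nu*||Ax - y||^2 + ||x - z||^2 via CG."""
    x = x0
    for _ in range(L):
        z = denoiser(x)                         # z-update: CNN denoiser
        b = nu * At(y) + z                      # right-hand side of normal eqs.
        op = lambda v: nu * At(A(v)) + v        # (nu*A^H A + I) v
        # CG from x = 0: residual r = b - op(0) = b
        x, r = torch.zeros_like(b), b.clone()
        p = r.clone()
        rs = torch.vdot(r.flatten(), r.flatten()).real
        for _ in range(cg_iters):
            Ap = op(p)
            alpha = rs / torch.vdot(p.flatten(), Ap.flatten()).real
            x = x + alpha * p
            r = r - alpha * Ap
            rs_new = torch.vdot(r.flatten(), r.flatten()).real
            p = r + (rs_new / rs) * p
            rs = rs_new
    return x
```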
The main difference between a globally learned and a locally learned network is the choice of the set $S$ of training indices. For the proposed local approach, we fit the network based on the $k$ training samples closest to the current test image estimate, whereas conventional (or global) training would fit networks to a large dataset. The initial image estimate $\mathbf{x}^0_n$ is obtained from the undersampled measurements $\mathbf{y}_n$ using a simple analytical reconstruction scheme, such as applying the adjoint of the forward model to the measurements. In each iteration, the network is updated (Fig. 3.1), and the initial estimate of the underlying unknown image is passed through the network to obtain a new estimate. In Fig. 3.1, we illustrate the iterative process of neighbor fine-tuning and local network updating. Local learning may have the advantage of accommodating changes in experimental conditions (e.g., undersampling pattern) at test time, provided that such modified measurements and initial images for the small local training set can be easily simulated from the existing $\mathbf{x}_n$ or $\mathbf{y}_n$. Our overall algorithm is summarized in Algorithm 3.1.

Algorithm 3.1 LONDN-MRI Algorithm
Require: Initial image $\mathbf{x}^0$, number of neighbors $k$, k-space undersampling mask $\mathbf{M}$, regularization parameters $\nu$ and $\mu$, number of training epochs $T$, number of iterations $S$ of the alternating algorithm.
1: Initialize reconstruction network parameters $\theta$ with pre-learned network weights $\hat{\theta}$ or randomly initialized weights. Set $\mathbf{x} = \mathbf{x}^0$.
2: for iteration $< S$ do
3:   Compute the set of $k$ similar neighbors $\hat{C}_{\mathbf{x}}$ to the current reconstruction estimate $\mathbf{x}$ using metric $d$.
4:   for epoch $< T$ do
5:     For each batch of neighbor data, compute the gradient of the training loss with respect to the network parameters $\theta$ and perform one update step on $\theta$.
6:   end for
7:   Update $\mathbf{x} \leftarrow \mathcal{M}_{\theta}(\mathbf{x}^0)$.
8: end for
9: return reconstruction $\mathbf{x}$ and learned network parameters $\theta$.

3.3.3 Regularization

In order to avoid over-fitting when training networks on small sets, we also adopted regularization of the network weights during training as follows:

$$\hat{\theta} = \arg\min_{\theta} \sum_{n \in S} \big\|\mathbf{x}_n - \mathcal{M}_{\theta}(\mathbf{x}^0_n)\big\|_2^2 + \lambda\, \mathcal{R}(\theta), \tag{3.8}$$

where $\mathcal{R}(\cdot)$ denotes the regularization term on the network weights. We primarily used the $\ell_1$ norm regularizer to enforce sparsity of the network weights and thus learn simpler models. We observed that regularizing the local model enables it to converge more easily, and shrinks weights for less important or noisy features to zero. We provide more discussion in the experiments section.

3.3.4 Connections to Bilevel Optimization

The alternating training algorithm, involving a neighbor search step and a local network update step, can be viewed as a heuristic for the following bilevel optimization problem:

$$\min_{C \in \mathcal{C},\, |C| = k} \sum_{i \in C} \|f_{\theta(C)}(\mathbf{y}) - \mathbf{x}_i\|_2^2, \quad \text{s.t.} \quad \theta(C) = \arg\min_{\theta} \sum_{i \in C} \|\mathbf{x}_i - f_{\theta}(\mathbf{y}_i)\|_2^2. \tag{3.9}$$

Here, $f_{\theta(C)}$ denotes a deep neural network learned on a subset $C$ of a data set that maps the current k-space measurements $\mathbf{y}$ to a reconstruction. The network is akin to $\mathcal{M}_{\theta}(\mathbf{x}^0)$ in (3.8), but with $\mathbf{x}^0$ assumed to be generated from $\mathbf{y}$ (e.g., via the well-known sum of squares of coil-wise inverse Fourier transforms, or via SENSE reconstruction, etc.). Problem (3.9) aims to find the best neighborhood or cluster among the training data to which the reconstructed image belongs (with closest distances to neighbors; we assumed the Euclidean distance here), with the network weights for reconstruction estimated on the data in that cluster.
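Before analyzing (3.9) further, here is a compact sketch of Algorithm 3.1 viewed as the alternating heuristic for (3.9): the neighbor search plays the role of the upper level, and local network training with the $\ell_1$ weight regularization of (3.8) plays the role of the lower level. The names, the per-sample updates, and the simplified first-iteration neighbor search are illustrative assumptions, not the exact thesis implementation.

```python
import torch

def londn_mri(x0, train_pairs, model, k=30, S=2, T=200, lam=1e-9):
    """Alternating LONDN-MRI heuristic (Algorithm 3.1). `train_pairs` is a
    list of (x0_n, x_n) initial/ground-truth pairs; `model` maps an initial
    image to a reconstruction (e.g., unrolled MoDL)."""
    x = x0
    opt = torch.optim.Adam(model.parameters(), lr=6e-5)
    for _ in range(S):
        # Upper level: choose the k training pairs closest to the current
        # estimate. (The thesis uses aliased training images in the first
        # iteration; that refinement is omitted in this sketch.)
        d = torch.stack([torch.norm(x - xn) for (_, xn) in train_pairs])
        idx = torch.argsort(d)[:k].tolist()
        # Lower level: fit the network on the local neighborhood with
        # l1 weight regularization, as in (3.8).
        for _ in range(T):
            for i in idx:
                x0n, xn = train_pairs[i]
                loss = torch.sum(torch.abs(model(x0n) - xn) ** 2)
                loss = loss + lam * sum(p.abs().sum() for p in model.parameters())
                opt.zero_grad(); loss.backward(); opt.step()
        x = model(x0).detach()   # update the reconstruction estimate
    return x, model
```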
Problem (3.9) is a bilevel optimization problem, with the cluster optimization forming the upper-level cost and the network optimization forming the lower-level cost. Bilevel problems are known to be quite challenging (Crockett and Fessler, 2021; Ghosh et al., 2022). It is also a combinatorial problem, because we would have to sweep through all possible choices of clusters of $k$ training samples, with reconstruction networks trained on each such cluster, to determine the best cluster choice. The proposed algorithm is akin to optimizing the bilevel problem by optimizing for the network weights $\theta$ with the clustering $C$ fixed (the lower-level problem) and then optimizing for the clustering $C$ (upper-level minimization) with the network weights fixed. This is a heuristic because the optimized variables in each step are coupled; however, such an approach has been used in prior work (Ye et al., 2021) and shown empirically to be approximately convergent for the bilevel cost. In this work, we performed an empirical evaluation of convergence in the experiments section, where the alternating algorithm is shown to reduce the upper-level cost in (3.9).

Figure 3.2 Comparison of MoDL with UNet denoiser trained globally vs. using the proposed LONDN-MRI scheme (1 iteration). Reconstruction metrics are shown across training set sizes at 4x and 8x undersampling.

3.4 Experiments

We first present the overall experimental setup in Section 3.4.1. Key results and comparisons are presented in Section 3.4.2. The intricacies and behavior of LONDN-MRI are analyzed in Section 3.4.3, and its generalizability is investigated in Section 3.4.4.

3.4.1 Experimental Setup

Datasets & Models: We evaluated the effectiveness of the proposed LONDN-MRI reconstruction method on multiple datasets: the multi-coil fastMRI knee and brain datasets (et al, 2019, 2020), the fastMRI+ dataset (an annotated version of fastMRI indicating pathologies), and the Stanford 2D FSE (Cheng, 2019) dataset. The results obtained on the fastMRI knee dataset and the Stanford FSE data are described in Section 3.4.2. The fastMRI brain and fastMRI+ data are used in the studies in Section 3.4.4. For training, we randomly selected a subset of 3000 images from the fastMRI knee and brain datasets, and likewise for the fastMRI+ case. We used 2000 training images for the smaller Stanford FSE dataset. We used 15 or 20 randomly chosen images for testing in different scenarios. In some experiments, we evaluated the effect of training set size, where we worked with fewer or more images in the training set. Coil sensitivity maps for model-based reconstruction were generated for each scan using the BART toolbox (Uecker, 2018). We tested obtaining these using either the fully-sampled k-space data or only the central k-space region and noticed very little difference in reconstruction quality between the two approaches.
Since the proposed LONDN-MRI framework is quite general and can be combined with any supervised deep learning based reconstruction approach, we chose the recent popular model-based deep learning (MoDL) reconstruction network and compared globally (over a large set of training samples) and locally (over a very small matched set of samples) learned versions of the model for different choices of deep denoisers in the network. We performed reconstructions at fourfold or 4x acceleration (25.0% sampling) as well as at eightfold or 8x acceleration (12.5% sampling) of the k-space acquisition. In all cases, variable-density 1D random Cartesian (phase-encode) undersampling of k-space was performed. The initial image estimates for MoDL were obtained by applying the adjoint of the measurement operator to the subsampled k-space data, and were then used to train both the local and global versions of the MoDL networks. In our local versions (LONDN-MRI), we used 30 images for training (searched from, e.g., 3000 images), while the global versions used the full set of training images.

Network Architectures & Training: We trained two types of MoDL models at 4x and 8x k-space undersampling, respectively. One used the well-known UNet denoiser, with a two-channel input and two-channel output, where the real and imaginary parts of an image are separated into two channels. The network weights during training were initialized randomly (normally distributed). The ADAM optimizer was utilized for training the network weights. For LONDN-MRI, we used an initial learning rate of 6 × 10⁻⁵ with a multi-step learning rate scheduler, which decreases the learning rate at 100 and 150 epochs with learning rate decay 0.65. For training globally, we used an initial learning rate of 1 × 10⁻⁴ with 150 epochs of training and a multi-step learning rate scheduler that decreased the learning rate at 50 and 100 epochs with learning rate decay 0.6. For LONDN-MRI, MoDL with 5 iterations was used with a shallow UNet that had 2 layers in the encoder and decoder, respectively. We used a shallow network with dropout for the local model to avoid over-fitting to the very small training set. For the MoDL network trained globally (on a large dataset) for comparison, we utilized 4 layers in the decoder and encoder of the UNet and 6 MoDL blocks. We used a batch size of 2 during training for both the global and local cases. Furthermore, for the data-consistency term, we used a tolerance of 10⁻⁵ in CG and a μ/ν ratio of 0.1. Also, we chose the regularization weight λ as 10⁻⁹ for LONDN-MRI, unless specified otherwise.

For the second MoDL architecture, we used the recent state-of-the-art denoising network DIDN (Yu et al., 2019a; Lahiri et al., 2021). Due to the high complexity of the DIDN network, we first pre-trained it on the larger (global) dataset (with learning rate, etc., similar to the UNet case) before adapting the weights within LONDN-MRI for each scan.
This is an alternative to constructing shallower versions of a network for local adaptation. The ADAM optimizer was utilized for training, with a learning rate of 5 × 10⁻⁵ in LONDN-MRI. We used 6 iterations of MoDL with the DIDN denoiser, for which we used 3 down-up blocks (DUBs). The number of epochs for training was 30 in LONDN-MRI. The remaining training parameters were chosen similarly to the previous UNet-based case. Using a pre-trained state-of-the-art denoiser allows the local adaptation to converge faster.

Comparison to Scan-adaptive Methods: We compared the performance of our schemes to recent related scan-specific methods such as deep image prior (DIP) (Darestani and Heckel, 2021) (using the public package but additionally incorporating coil sensitivity maps), RAKI (Akçakaya et al., 2019), SOUP-DIL (Ravishankar et al., 2015) (code extracted from the publicly available package), and LORAKI (Kim et al., 2019) (modified from the RAKI code). In our experiments, we used the parameters specified in the authors' original implementations, which we observed worked well.

Sampling Masks & Performance Metrics: We used binary masks for fourfold and eightfold Cartesian undersampling of k-space. Fig. 3.3 shows the sampling masks primarily used in our experiments, which include a fully-sampled central region (31 central lines at 4x acceleration and 15 central lines at 8x acceleration), with the remaining phase-encode lines sampled uniformly at random. For the performance metrics, we used three common metrics to quantify the reconstruction quality of the different methods: the peak signal-to-noise ratio (PSNR) in decibels (dB), the structural similarity index (SSIM) (Wang et al., 2004), and the high frequency error norm (HFEN) (Ravishankar and Bresler, 2011), each computed between the reconstruction and the ground truth obtained from fully-sampled k-space data. The HFEN was computed as the ℓ2 norm of the difference between the Laplacian of Gaussian (LoG) filtered reconstructed and ground truth images, normalized by the ℓ2 norm of the LoG filtered ground truth.

Figure 3.3 Undersampling masks used in our experiments: (a) fourfold undersampled 1D Cartesian phase-encoded; and (b) eightfold undersampled 1D Cartesian phase-encoded. The masks were zero-padded for slightly larger images.

Figure 3.4 Comparison of image reconstructions with different methods (ground truth, initial, global, LONDN-MRI with 1 and 2 iterations, oracle, DIP, SOUP-DIL, RAKI, and LORAKI) at 8x undersampling. The global and LONDN-MRI methods use the MoDL architecture with UNet denoiser with 1000 training images. The inset panel on the top left in each image corresponds to a section of interest in the image (shown by the red bounding box), while the inset panel on the top right corresponds to the error map with respect to the ground truth.
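As a concrete reference for the metrics above, here is a minimal sketch of the HFEN computation. The LoG filter width `sigma` is an assumed value (a 15x15 kernel with sigma = 1.5 is common in the literature), not necessarily what was used in these experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def hfen(recon, gt, sigma=1.5):
    """HFEN: l2 norm of the difference between LoG-filtered reconstruction
    and ground truth, normalized by the l2 norm of the LoG-filtered ground
    truth (magnitude images assumed here)."""
    log_gt = gaussian_laplace(np.abs(gt), sigma)
    err = gaussian_laplace(np.abs(recon), sigma) - log_gt
    return np.linalg.norm(err) / np.linalg.norm(log_gt)
```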
3.4.2 Results and Comparisons

Results for the UNet-based Reconstructor: Table 3.1 compares the average PSNR values for reconstruction over the fastMRI knee testing set at both 4x and 8x undersampling. We varied the number of images in the training set for a more comprehensive study. We compare learning networks over a small set of similar images to learning networks over the larger datasets (global), as well as to an oracle LONDN scheme, where the neighbors in the training set were computed based on each ground truth test image. The oracle scheme would ideally provide an upper bound on the performance of the iterative LONDN-MRI scheme. Moreover, LONDN-MRI outperforms DIP, RAKI, SOUP-DIL, and LORAKI with U-Net (Table 3.1). Note that DIP, RAKI, SOUP-DIL, and LORAKI do not use information beyond the test scan (scan-adaptive). Later, we show how LONDN-MRI performs when the overall dataset it uses is very limited. When varying the size of the training set, the global approach was trained on the full set each time, whereas the local approach performed training on small subsets of 30 training pairs selected from the larger datasets. The iterations of the LONDN-MRI scheme quickly improve reconstruction performance, and even with only 2 LONDN-MRI alternations, the PSNR values begin approaching the oracle setting. The LONDN schemes (oracle or iterative) consistently outperform the globally trained networks across the different training set sizes considered. We note that the results for the globally trained model with many (6000) training scans closely match the LONDN-MRI results when LONDN-MRI uses a smaller overall training set (3000 scans) for neighbor search. This illustrates the potential of our approach with limited training data, when compared with models trained on larger sets. Figure 3.2 compares the SSIM and HFEN reconstruction metrics using bar graphs, where a similar trend is observed as with PSNR. Figs. 3.4 and 3.5 show images reconstructed by different methods at 8x and 4x undersampling, respectively. The LONDN-MRI reconstructions (either iterative or oracle) show fewer artifacts, sharper features, and fewer errors than the global MoDL and initial aliased reconstructions. The iterative LONDN-MRI results are also quite close to the oracle result.

Ax | Data size | Global | LONDN-MRI (1 iteration) | LONDN-MRI (2 iterations) | Oracle
4x | 1000 | 32.63 | 32.78 | 32.87 | 32.99
4x | 2000 | 33.00 | 33.28 | 33.31 | 33.35
4x | 3000 | 33.17 | 33.46 | 33.51 | 33.54
4x | 6000 | 33.48 | 33.58 | 33.65 | 33.69
8x | 1000 | 29.78 | 30.15 | 30.26 | 30.34
8x | 2000 | 30.21 | 30.53 | 30.58 | 30.64
8x | 3000 | 30.47 | 30.76 | 30.80 | 30.85
8x | 6000 | 30.78 | 30.94 | 31.04 | 31.09

Scan-adaptive baselines (independent of training set size): DIP 30.10 (4x), 28.90 (8x); RAKI 30.25 (4x), 29.01 (8x); LORAKI 31.35 (4x), 29.71 (8x); SOUP-DIL 30.97 (4x), 29.47 (8x).

Table 3.1 Average reconstruction PSNRs (in dB) for 15 images at 4x and 8x k-space undersampling. The proposed LONDN-MRI (with 1 or 2 alternations) is compared to training a global reconstructor for different training set sizes and to scan-adaptive methods. We also compare to an oracle local reconstructor, where neighbors are found with respect to known ground truth test images.

Figure 3.5 Same comparisons/setup as Fig. 3.4, but at 4x undersampling. The supervised methods used the MoDL architecture with UNet denoiser (3000 training images).

Results for the DIDN-based Reconstructor: To demonstrate adaptability to different network architectures, Table 3.2 compares reconstruction performance on the test set with the DIDN denoiser-based MoDL architecture. Average PSNR values with LONDN-MRI are compared to those with networks trained globally at different training set sizes.
Acceleration | Data size | Global | LONDN-MRI (1 iteration) | Oracle
4x | 1000 | 33.66 | 33.92 | 33.96
4x | 2000 | 34.01 | 34.23 | 34.31
4x | 3000 | 34.15 | 34.39 | 34.42
8x | 1000 | 31.02 | 31.33 | 31.37
8x | 2000 | 31.34 | 31.64 | 31.68
8x | 3000 | 31.79 | 32.08 | 32.12

Table 3.2 Average reconstruction PSNR values (in dB) on the testing set at 4x and 8x undersampling for various training set sizes. The MoDL reconstructor with DIDN denoiser is used.

We ran only 1 iteration of LONDN-MRI, where the reconstruction with a pre-trained (global) network was used to find neighbors. PSNR values for the oracle LONDN-MRI reconstructor are also shown. The overall performances with the DIDN-based architectures are better than with the UNet-based unrolled networks. The PSNRs for LONDN-MRI are consistently and similarly better than for the globally trained network across the different training set sizes considered, indicating potential for LONDN-MRI in improving state-of-the-art models. Fig. 3.6 visually compares reconstructions and reconstruction errors (in a zoomed-in region) for the different methods. We can see that the LONDN reconstructors capture the original image features more sharply and accurately than the globally learned reconstruction.

Figure 3.6 Comparison of image reconstructions at 4x undersampling for the MoDL network with DIDN denoiser and 3000 training images, when compared to LONDN-MRI. A region of interest and its error are also shown.

Performance on the Stanford FSE Dataset: We also performed image reconstructions with the Stanford multi-coil FSE dataset, which is a smaller dataset. We used the same settings for the networks and training as in Section 3.4.1. Table 3.3 shows that LONDN-MRI significantly outperforms the globally learned MoDL network at both 4x and 8x acceleration. This indicates benefits of the proposed framework for smaller, more diverse datasets. Figs. 3.7 and 3.8 display visual comparisons that show the LONDN-MRI scheme recovering sharper features than the globally learned network.

Figure 3.7 Comparison of image reconstructions with different methods at 4x undersampling using the MoDL architecture with UNet denoiser with 2000 training scans. The test slice and training data were from the Stanford FSE dataset.

Acceleration | Global | LONDN-MRI (1 iteration) | LONDN-MRI (2 iterations) | Oracle
4x | 29.45 | 31.49 | 31.56 | 31.67
8x | 27.25 | 29.35 | 29.43 | 29.60

Table 3.3 Average reconstruction PSNR values (in dB) for the Stanford FSE test set at 4x and 8x undersampling. The LONDN-MRI results are compared to a model globally trained on the FSE dataset.

3.4.3 Behavior of LONDN-MRI

Here, we explore the intricacies and workings of LONDN-MRI in more detail.

Figure 3.8 Same comparisons/setup as Fig. 3.7, but at 8x undersampling.

Performance with Different Distance Metrics: To determine a suitable distance metric for our method, we analyzed a few popular distance metrics. This study focused on evaluating their effectiveness in selecting the appropriate matching dataset for training in the context of LONDN-MRI (oracle scheme).
We tested the performance of MoDL with the UNet denoiser using the L1 and L2 distance metrics as well as normalized cross-correlation (NCC) to find the matched training set from among 3000 images, which were all normalized. From the results in Table 3.4, we see that the different distance functions offer only slight differences in reconstruction performance, with NCC offering the best results with respect to all reconstruction metrics.

Acceleration | Metric | L1 | L2 | NCC
4x | SSIM | 0.850 | 0.849 | 0.852
4x | PSNR (dB) | 33.49 | 33.44 | 33.54
4x | HFEN | 0.552 | 0.560 | 0.542
8x | SSIM | 0.803 | 0.802 | 0.804
8x | PSNR (dB) | 30.79 | 30.71 | 30.85
8x | HFEN | 0.664 | 0.674 | 0.658

Table 3.4 Average PSNR, SSIM, and HFEN values over 15 testing images for LONDN-MRI with neighbor search performed using the L1 distance, L2 distance, and normalized cross-correlation (NCC).

Evaluating the Accuracy of Neighbor Search: Here, we study how the neighbor search proceeds across the iterations or alternations of LONDN-MRI. We are interested in whether our locally learned reconstructor can improve the neighbor-finding process over iterations. We used all images from the test set. First, we find the $k$ closest neighbors (in terms of Euclidean distance) for each ground truth test image amongst the ground truth training images. The set $C^*_r$ contains the indices of these oracle neighbors for a test image indexed $r$. The set $\hat{C}_r$ contains the indices of the closest neighbors from a certain iteration of LONDN-MRI. The neighbor matching accuracy (NMA) metric below computes the average (over the test set indices $\mathcal{T}$) percentage match between the two sets:

$$\text{NMA} := \frac{100}{|\mathcal{T}|} \sum_{r \in \mathcal{T}} \frac{|\hat{C}_r \cap C^*_r|}{k}. \tag{3.10}$$

The accuracy of the neighbor search at both 4x and 8x undersampling is shown in Fig. 3.9. The accuracy of the initial search (based on $\mathbf{x}^0$) and after 1 or 2 iterations of LONDN-MRI are shown. We find nearest neighbors for the initial highly aliased $\mathbf{x}^0$ with respect to the corresponding aliased images in the training set (based on the same k-space undersampling mask as at testing time), rather than based on the ground truth training images, because the latter resulted in lower neighbor search accuracy for $\mathbf{x}^0$. It is clear from Fig. 3.9 that the accuracy improves quickly and tapers off within a few iterations.

Figure 3.9 Average accuracy (over the test set) of neighbor search in LONDN-MRI (MoDL with UNet denoiser) at 4x undersampling in (a) the first iteration (neighbors found with respect to the initial input images $\mathbf{x}^0$) and after the (b) first and (c) second iterations. (d)-(f) are the corresponding results at 8x undersampling.

Effect of Weight Regularization in LONDN-MRI: Here, we vary the strength of the regularization penalty weight in (3.8) and run LONDN-MRI over the test set at 4x k-space undersampling. Fig. 3.10 plots the average PSNR as a function of the penalty weight for the MoDL network with UNet denoiser. The normalized cross-correlation distance was used during neighbor search, with other parameters as before. The result shows slight benefits for choosing the regularization weight carefully.

Figure 3.10 Average reconstruction PSNR on the test set at 4x undersampling for different regularization penalty parameters. We used ℓ1 norm regularization of network weights for an MoDL network with UNet denoiser.

Convergence of Loss in Bilevel Optimization: Next, we study the behavior of the alternating LONDN-MRI algorithm as a heuristic for the bilevel optimization formulation in (3.9).
Here, we used an MoDL network with the UNet denoiser, and $k = 30$ training pairs were chosen (from 3000 cases) for the local dataset in each iteration of LONDN-MRI. The UNet weights were randomly initialized to begin with, and the neighbor search in the first iteration of LONDN-MRI was performed using $\mathbf{x}^0$ and correspondingly generated aliased training images. Fig. 3.11 plots the upper-level loss in (3.9) (in a root mean squared error form) after each iteration of LONDN-MRI for a test image. Here, we ran many iterations to verify convergence. We observe that the loss changes very little after a few iterations and stabilizes. This matches the behavior of the neighbor search accuracy bar plots. The result indicates that the proposed alternating scheme could be a reasonable heuristic for reducing the loss in the challenging problem (3.9). Finally, we compare the loss values in Fig. 3.11 with an oracle loss, where the upper-level loss in (3.9) is computed using the ground truth test image and its $k$ nearest neighbors. It is clear that the loss values in LONDN-MRI converge very close to the oracle loss, indicating the potential of our scheme.

Figure 3.11 Upper-level loss in the bilevel optimization formulation (3.9) plotted over the iterations (after the network update step) of the LONDN-MRI scheme at 4x undersampling. We used MoDL with a UNet denoiser and $k = 30$ for neighbor search. In addition, the red line shows an oracle upper-level loss computed using the ground truth test image and its $k$ nearest neighbors.

Effect of Number of Nearest Neighbors on Image Quality: Next, we investigate how the LONDN-MRI algorithm behaves when the number of nearest neighbors is varied, to see how it affects the effectiveness of the reconstruction. To test our method, we selected from 10 to 1000 images for the closest neighbors (with the NCC metric). The average test reconstruction PSNR for the different cases is shown in Fig. 3.12. Too few local neighbors can make the method prone to overfitting, while too many neighbors lead to a lack of scan-specificity and worse performance. Using 30-50 neighbors provided similar performance.

Time Consumption Trade-offs: To further understand the time efficiency of our method across different neighborhood sizes for practical applicability, we conducted comparative analyses using three models: an image-domain UNet denoiser, MoDL with the UNet denoiser, and MoDL with the DIDN denoiser. The experiments were run on an NVIDIA GeForce RTX A5000 GPU. The PSNR vs. runtime trade-offs depicted in Figure 3.13 shed light on the time consumption for each model configuration. It is observed that some decrease in the number of neighbors leads to reduced time consumption without significantly compromising image quality. In addition, the results show the effectiveness of starting with a pre-trained DIDN model to improve the reconstruction, as it enhances the efficiency of the reconstruction process, reducing the runtime to the order of seconds.

3.4.4 Generalizability of LONDN-MRI

Here, we present a series of studies to evaluate the generalizability of LONDN-MRI in diverse testing settings.

Figure 3.12 Average reconstruction PSNR on the (a) fastMRI and (b) FSE MRI test sets at 4x undersampling for different numbers of nearest neighbors.
Figure 3.13 PSNR vs. runtime trade-offs of various LONDN-MRI models for the fastMRI knee dataset at 4x k-space undersampling. The models include MoDL networks with UNet or DIDN denoisers, as well as a standalone image-domain UNet. The performance was evaluated across different neighbor sizes, which are shown next to each data point. The processing time for these models ranged from 6 seconds to 3 minutes, depending on the neighbor size. Unrolled networks provided better image quality than the UNet denoiser.

Performance in the Presence of Planted Features: To assess the capability of LONDN-MRI for accurately reproducing image attributes not found in the training set (a common scenario when detecting pathologies, etc.), we embedded artificial features into a knee image from the fastMRI dataset, drawing inspiration from recent work (Lahiri et al., 2021). We performed 4x undersampling in k-space and reconstructed with the MoDL network (with UNet denoiser) that was trained using 3000 images. In Fig. 3.14, we observe that LONDN-MRI produces sharper reconstruction of image features and better PSNR compared to the globally trained network. The details and edges of the planted features are better preserved by LONDN-MRI. Moreover, LONDN-MRI provides similar image quality with and without the planted features (Fig. 3.5), whereas the globally trained network degrades significantly. This indicates the relatively improved stability and generalizability of the proposed method.

Performance on Data with Lesions: While the previous experiment allowed comparing reconstruction quality with or without planted features, here we test our method on MRI scans with lesions, which are often regions of abnormal or diseased tissue. We utilize the annotated fastMRI+ data to evaluate our method's image reconstruction capabilities, and compare its outcomes with established baselines. For the training phase, the non-lesion dataset was employed for the global training approach with 3000 images, whereas LONDN-MRI used 30 adaptively selected images for training (searched from 3000 images). In contrast, during the testing phase, we used 20 scans with lesions. The results, displayed in Table 3.5, indicate that our method achieves substantially higher PSNR values in comparison to the globally trained baseline as well as the LORAKI method. Furthermore, visualizations in Figure 3.16 clearly demonstrate the superiority of our method, particularly in the nonspecific white matter lesion areas. Thus, both in terms of visual assessment and PSNR values, our approach outperforms the existing baselines and aligns more closely with the ground truth.

Acceleration | Global | LONDN-MRI (1 iteration) | LONDN-MRI (2 iterations) | Oracle | LORAKI
4x | 34.37 | 34.89 | 35.10 | 35.21 | 32.89
8x | 32.05 | 32.65 | 32.72 | 32.77 | 30.89

Table 3.5 Average reconstruction PSNR values (in dB) for the lesion fastMRI+ test set at 4x and 8x k-space undersampling. LONDN-MRI and the global model were trained on the non-lesion dataset.

Figure 3.14 Visualization of ground truth and reconstructed images using different methods at 4x k-space undersampling.
The central portion (with the planted feature) and its reconstruction error map are shown in the top panels of the images.

Performance without Well-Matched Neighbors: Another natural question is how sensitive the proposed method is to using a 'well matched' (to the test scan) subset of images in the global training set. One might consider this restrictive. To better evaluate the workings of LONDN-MRI, we switched its training with the UNet denoiser from using the 30 closest neighbors to using the 31st to 60th closest (or less similar) neighbors. Fig. 3.20 shows an example with the different near-neighbors that are chosen from the 3000-image global training set, ranked based on the NCC distance. While the nearest neighbors look quite similar to the test image, the farther ones can be relatively dissimilar in practice. In this case, LONDN-MRI (with 1 iteration) using the 31st to 60th closest neighbors still reconstructs the test scans well, with an average PSNR of 33.34 dB (at 4x k-space undersampling and 15 test images), which is only slightly worse than when using the 30 closest neighbors (33.46 dB). This indicates the proposed approach may not be very sensitive to the availability of highly visually matched training data. Indeed, the Stanford FSE data has more variability than fastMRI, and our approach performs well on that dataset.

Evaluating Generalization with Limited Training Sets: To facilitate a fairer comparison with scan-adaptive methods such as DIP, LORAKI, and RAKI, we conducted experiments utilizing much smaller subsets of the original fastMRI knee dataset, from which the neighbors in LONDN-MRI are selected. We randomly selected 5 to 100 slices for the overall training set in LONDN-MRI. These were chosen from a small random set of volumes/patients. The goal is to emulate comparisons with DIP, LORAKI, and RAKI when LONDN-MRI operates in a very limited dataset regime. For each overall training set size, we selected the top $k$ similar neighbors at testing time, where $k$ was adjusted based on the dataset size. For example, for a dataset with 5 slices, we selected the top 3 similar scans at test time, and for a dataset with 100 samples, we selected the top 10 neighbors in the search. The average reconstruction PSNR for the testing scans, plotted in Fig. 3.15, reveals that although there is some decline in performance with decreasing dataset size, the results still surpass those achieved by DIP, RAKI, and LORAKI, indicating potential for LONDN-MRI with very limited training sets. While DIP, RAKI, and LORAKI adapt purely to the individual test scans without supervision, the LONDN-MRI approach would not be applicable in the zero paired-data regime. In future work, we plan to study hybrid methods leveraging both LONDN-MRI and DIP, i.e., adapting the network based on both similar paired data and the current test scan's measurements (as in DIP).

Figure 3.15 Average PSNR on the test set (from fastMRI) for LONDN-MRI (MoDL network with UNet denoiser) at 4x k-space undersampling for various dataset sizes. Subsets of the dataset are chosen as neighbors in LONDN-MRI at test time. The average PSNR values with DIP, LORAKI, and RAKI, which require no training data, are shown as horizontal lines.

Effect of Varying Scan Settings at Test Time: Since the reconstruction network in LONDN-MRI is trained for each scan, we would like to better understand the benefits this provides in terms of letting the network adapt to distinct scan settings.
So we chose the MoDL reconstructor with UNet denoiser (with the same hyperparameters for training as before) and trained it on the 3000-image set in two ways: with a fixed sampling mask across the images (the mask was padded with zeros to account for slight variations in matrix sizes), and with a different random sampling mask for each image. The first setting was used in the previous subsections. For LONDN-MRI, here, we used a different random sampling mask for each test scan, but the network was adapted locally with the same mask used across each (small) local training set. Table 3.6 shows the average PSNR values on the test set with these different strategies as well as with the oracle LONDN-MRI scheme. It is clear that the globally learned model with a fixed sampling mask struggles to generalize to the different scan settings at test time. But training the global model with random sampling masks leads to improved reconstruction PSNRs. Importantly, the LONDN-MRI schemes that adapt the reconstruction model to the settings as well as the data for each scan provide marked improvements over both globally learned network settings.

Figure 3.16 Visualization of ground truth and reconstructed images using different methods at 4x k-space undersampling for an annotated image from the fastMRI+ dataset, where the area of interest is a nonspecific white matter lesion (in the green box).

Figure 3.17 Visualization of ground-truth and reconstructed images using different methods at 4x k-space undersampling for a T2 contrast MRI scan (with training on T1 contrast scans). A region of interest (in the green box) and its error map are also shown.

Acceleration | Global model (fixed mask) | Global model (random masks) | LONDN-MRI (2 iterations) | Oracle LONDN
4x | 33.03 | 33.19 | 33.56 | 33.64
8x | 30.62 | 30.84 | 31.14 | 31.22

Table 3.6 Average reconstruction PSNR values (in dB) on the test set at 4x and 8x undersampling. The LONDN-MRI results are compared to training a global model with a fixed sampling mask or with random masks.

Results with Different Contrasts: To delve deeper into the clinical applicability of our method, we conducted further tests to ascertain its adaptability to different contrasts or weightings in scans. Conventional deep learning reconstruction techniques may need consistency in contrast between training and testing to achieve optimal results and could struggle with generalization across varied experimental settings. Our method, being scan-specific, could offer some flexibility because of its adaptivity to features in test scans. To further study this, we conducted a test where the global model was trained exclusively on T1 MRI data at 4x and 8x undersampling using 3000 training scans. Subsequent testing was done on T2 contrast MRI data with 20 images. For LONDN-MRI, we used 30 images for local training (searched from 3000 images) for each test scan. The results, presented as box plots in Fig. 3.18 and visualized with one example in Fig. 3.17, highlight our method's reconstruction performance in comparison to the globally trained MoDL network and the scan-specific LORAKI scheme.
Our method exhibits notably better performance, underscoring its effectiveness in diverse imaging contexts.

Figure 3.18 Box plots of average reconstruction PSNR values (in dB) for different methods on the T2 fastMRI brain test set at 4x and 8x undersampling. LONDN-MRI (trained on the T1 contrast fastMRI dataset) results are compared to a model trained globally (on 3000 T1 contrast scans) and to LORAKI.

Performance with Different Signal-to-Noise Ratios: To assess the performance of LONDN-MRI when the training and test data have different signal-to-noise ratios (SNRs), we conducted tests on scans from the fastMRI knee dataset that were subjected to additive random Gaussian noise with a variance of 0.01 for the real and imaginary parts of the noise. The globally and locally trained models at 4x and 8x undersampling used data without added noise, and the training settings were the same as before in Section 3.4.1. Our findings revealed a general decline in reconstruction performance across all methods, attributable to the different SNRs between training and testing. Despite this, LONDN-MRI displays a better capability for handling noise perturbations, with a wider performance gap over the globally trained model. This is clear from the PSNR values depicted in the corresponding box plots in Fig. 3.19.

3.5 Discussion

We proposed a novel LONDN-MRI reconstruction technique that efficiently matches test reconstructions to a cluster of a dataset, where networks are adaptively estimated on images most related to a current scan. Our results on the multi-coil fastMRI brain and knee datasets, fastMRI+, and the Stanford FSE dataset showed promise for our patient-adaptive network estimation scheme. The approach does not require pre-training and can thus readily handle changes in the training set. Additionally, the networks in LONDN-MRI can be randomly initialized and trained adaptively on very small datasets, and such networks outperformed models trained globally on much larger datasets (with lengthy training times). For example, for fastMRI knee scans, LONDN-MRI with 2 alternations involving MoDL with a randomly initialized UNet denoiser took 5 minutes to run on an NVIDIA GeForce RTX A5000 GPU (with a batch size of 6 and 200 epochs each time to update networks locally).

Figure 3.19 Box plots of average reconstruction PSNR values (in dB) for different methods on the fastMRI knee test set at 4x and 8x undersampling. For the test dataset, we added zero-mean Gaussian noise to the measurements with standard deviation σ = 0.01 for the real and imaginary parts of the noise. The training data used did not include additional noise.

Figure 3.20 An image is shown along with its different nearest neighbors from the fastMRI dataset (original image; 1st, 3rd, 10th, 20th, 30th, 40th, and 50th nearest neighbors).

While LONDN-MRI outperformed the scan-adaptive methods such as DIP, RAKI, and LORAKI in image quality, the runtimes for the methods were somewhat similar. DIP takes about 5 minutes to reach peak performance (over iterations) with the same GPU, while RAKI and LORAKI took 3 minutes and 4 minutes, respectively.
LONDN-MRI requires only a few images (e.g., 30) to train networks, with often 200-250 epochs for locally updating randomly initialized networks such as the UNet. Fewer epochs (often 10 suffice) were needed with pre-trained networks such as the pre-trained DIDN, resulting in runtimes of only 18 seconds per iteration of LONDN-MRI (Fig. 3.13). Of course, a globally trained model would run faster at inference time. For example, MoDL with the pre-trained DIDN denoiser takes 8 seconds on average to reconstruct fastMRI knee images. Note that the neighbor search process in the proposed method is highly efficient. We find 20-30 images from 3000 images to train the model in about 10 seconds, while the overall algorithm takes minutes. The neighbor search is also highly parallelizable. When compared to the supervised global model, the proposed method offers consistently improved reconstruction quality in terms of the PSNR, SSIM, and HFEN metrics. Additionally, we demonstrated that the local model adapts better to test-time changes (such as changes to the sampling mask, scan contrast, SNR, presence of anomalies, etc.) compared to a globally learned (and fixed) model. Our approach produced marked improvements for the Stanford FSE dataset, and noticeable improvements for fastMRI/fastMRI+. Additionally, our study with different distance metrics revealed that they have only a slight effect on reconstruction quality. The NCC metric provided the best reconstruction quality and was thus used in our studies. We conjecture that a learned distance metric (Kaya and Bilge, 2019) could further enhance the performance of LONDN-MRI.

3.6 Conclusions

This chapter examined supervised learning of deep unrolled networks at reconstruction time for MRI by exploiting training sets along with local modeling and clustering. We showed advantages for this approach at different k-space undersampling factors over networks learned in a global manner on larger data sets. The training may be connected to a bilevel optimization problem. We also compared different distance metrics for finding neighbors in our approach, and used regularization to reduce local overfitting. We intend to expand our studies in the future by incorporating non-Cartesian undersampling patterns, such as radial and spiral patterns, as well as deploying the method in 3D settings and other imaging modalities. Additionally, the method's generalizability will be further examined, with a particular emphasis on heterogeneous datasets. To handle more extreme training-test data variations, such as unseen anatomies, we plan to explore patch-based neighbors in local learning schemes in future work. We showed benefits both for randomly seeded training of simple models and for fine-tuning of sophisticated pre-trained models, and we believe our methodology could be applied effectively to a variety of deep learning-based tasks (even beyond image reconstruction) to improve overall performance. Finally, metric learning (Kaya and Bilge, 2019) to improve local clustering and subsequent network adaptation will be an important future direction.

CHAPTER 4
SELF-GUIDED DEEP IMAGE PRIOR

4.1 Introduction

In the last chapter, we described local learning for MRI reconstruction to address settings with limited training data.
In the extreme case, however, we may need to reconstruct the MR image without any training dataset at all. To handle this situation, we propose an unsupervised, zero-shot approach based on the deep image prior (DIP). The ability of the deep image prior to recover high-quality images from incomplete or corrupted measurements has made it popular for inverse problems in image restoration and medical imaging, including magnetic resonance imaging (MRI). However, conventional DIP suffers from severe overfitting and spectral bias effects. In this work, we first provide an analysis of how DIP recovers information from undersampled imaging measurements by analyzing the training dynamics of the underlying networks in the kernel regime for different architectures. This study sheds light on important underlying properties of DIP-based recovery. Current research suggests that incorporating a reference image as the network input can enhance DIP's performance in image reconstruction compared to using random inputs. However, obtaining suitable reference images requires supervision and raises practical difficulties. In an attempt to overcome this obstacle, we further introduce a self-driven reconstruction process that concurrently optimizes both the network weights and the input while eliminating the need for training data. Our method incorporates a novel denoiser regularization term which enables robust and stable joint estimation of both the network input and the reconstructed image. We demonstrate that our self-guided method surpasses both the original DIP and modern supervised methods in terms of MR image reconstruction performance, and outperforms previous DIP-based schemes for image inpainting.

A recent study (Zhao et al., 2020b) demonstrated the effectiveness of incorporating additional guidance into DIP-based restoration by using a strategically chosen reference image as the network input during training. This reference-guided technique considerably enhances reconstruction quality and stability while obviating the need for fully supervised training. Nevertheless, this approach depends on the availability of an appropriate reference image, which may not always be the case. Additionally, it remains uncertain from (Zhao et al., 2020b) how to effectively select a suitable reference based solely on undersampled measurements of an unknown test image. Inspired by the ability of reference-based guidance to improve the performance of DIP reconstruction, we consider the setting where absolutely no reference or training data is available.

4.1.1 Contributions

We summarize this chapter's main contributions as follows.

• To gain a deeper understanding of image reconstruction using DIP, we conduct an analysis of gradient descent-trained CNNs in the over-parameterized regime. We employ a realistic imaging forward operator instead of a Gaussian measurement matrix for our analysis of the compressed sensing case. Our primary finding is that, as the number of gradient descent steps used to optimize the standard DIP objective function approaches infinity, the difference between the network estimate and the ground truth will reside in a subspace related to the null space of the forward operator and the network's neural tangent kernel.

• The choice of network architecture significantly affects the ability of DIP to recover the image in compressed sensing tasks.
We demonstrate both theoretically and empirically that certain generator architectures will have greater difficulty recovering missing information or frequencies than others.

• We propose a self-guided DIP method, which eliminates the need for separate reference images (for the network input) and gives much better image reconstruction quality than the prior reference-guided method, as well as several other related and competing schemes. The proposed method relies on a crucial denoising-based regularization.

4.1.2 Image reconstruction problem

An ill-posed inverse problem for image reconstruction can be formulated as:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \mathcal{R}(\mathbf{x}), \tag{4.1}$$

where $\mathbf{A}$ is a linear measurement operator, $\mathbf{y} \in \mathbb{R}^p$ are the measurements, and $\hat{\mathbf{x}} \in \mathbb{R}^q$ is the reconstructed image. The first term in the minimization is referred to as a data-fidelity function and can also take on alternative forms depending on the imaging setup. In classical image inpainting, $\mathbf{A}$ is a binary masking operator. For the task of reconstructing a multi-coil MRI image, represented by $\mathbf{x} \in \mathbb{C}^q$, the optimization problem is

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \sum_{c=1}^{N_c} \|\mathbf{A}_c\mathbf{x} - \mathbf{y}_c\|_2^2 + \lambda \mathcal{R}(\mathbf{x}), \tag{4.2}$$

where the k-space measurements taken from the $N_c$ coils are represented by $\mathbf{y}_c \in \mathbb{C}^p$, $c = 1, \dots, N_c$. The coil-wise forward operator is denoted $\mathbf{A}_c = \mathbf{M}\mathbf{F}\mathbf{S}_c$, where $\mathbf{M} \in \{0,1\}^{p \times q}$ is a masking operator that captures the data sampling pattern in k-space, $\mathbf{F} \in \mathbb{C}^{q \times q}$ is the Fourier transform operator, and $\mathbf{S}_c \in \mathbb{C}^{q \times q}$ represents the $c$th coil-sensitivity map (a diagonal matrix). An explicit regularizer $\mathcal{R}(\cdot)$ is employed to limit the solutions to the domain of desirable images. Various regularizers have been used in image reconstruction; for example, $\mathcal{R}(\cdot)$ can be the $\ell_1$ penalty on wavelet coefficients, a total variation penalty, patch-based sparsity in learned dictionaries, or, as in our technique, a denoising-type regularization involving, e.g., a convolutional neural network (CNN).

4.1.3 Deep Image Prior for Image Reconstruction

Image reconstruction using DIP is typically formulated as:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \|\mathbf{A}\, \boldsymbol{f}_{\boldsymbol{\theta}}(\mathbf{z}) - \mathbf{y}\|_2^2, \qquad \hat{\mathbf{x}} = \boldsymbol{f}_{\hat{\boldsymbol{\theta}}}(\mathbf{z}). \tag{4.3}$$

Here, $\boldsymbol{f}$ is a neural network with parameters $\boldsymbol{\theta}$, and $\mathbf{z}$ is a typically fixed network input that is randomly chosen (e.g., a random Gaussian vector or tensor). We will refer to this formulation as "vanilla DIP" in this work.
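The vanilla DIP formulation (4.3) can be implemented in a few lines. The following PyTorch sketch assumes a callable forward operator `A`, an untrained network `net` (e.g., a U-Net), and a fixed random input `z`; these names and defaults are placeholders rather than the exact experimental setup.

```python
import torch

def vanilla_dip(A, y, net, z, iters=2000, lr=1e-4):
    """Fit network weights so A(f_theta(z)) matches the measurements y,
    with the input z held fixed, as in (4.3)."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        loss = torch.sum(torch.abs(A(net(z)) - y) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Early stopping matters in practice: run too long and the network
    # overfits noise/aliasing in y (the overfitting analyzed below).
    return net(z).detach()
```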
The critical assumption of NTK theory is that the matrix W := \nabla z(w)^T \nabla z(w), called the neural tangent kernel, remains fixed throughout training. In this regime, equation (4.8) can be rediscretized to show that the training dynamics of DIP for MRI reconstruction reduce to

z_{t+1} = z_t + \eta W (A^T y - A^T A z_t).   (4.9)

We start gradient descent from a random initialization \theta_0 \sim \mathcal{N}(0, \omega I). In our analysis, we make the simplifying assumption that z_0 = 0. This assumption is not unique to our analysis and is consistent with prior literature (Tachella et al., 2020). To understand this assumption, we first note that since all network parameters w are initialized from zero-mean distributions, the initial output z_0 is 0 in expectation over this initialization. However, z_0 may not be 0 for any particular instantiation of w. To correct for this, we can consider a slightly modified network such that the assumption holds. Namely, any particular f_\theta with random input z has initial output z_{init} = f_\theta(z). We can then define the slightly modified network \tilde{f}_\theta(z) = f_\theta(z) - z_{init}. This modification ensures that z_0 = 0. Moreover, it has little other effect on the analysis, since f and \tilde{f} have the same NTK, because z_{init} is a constant. With these preliminaries, we now state our first theorem on the training dynamics of DIP for image reconstruction.

Theorem 4.1.1. Let A \in \mathbb{R}^{p \times q} be a full row rank forward operator. Assume z_0 = 0 and let z_\infty be the reconstruction as the number of training iterations approaches infinity. Let x \in \mathbb{R}^q be the ground truth image and let W be the NTK of the reconstructor network. Further suppose that the acquired measurements are free of noise, so that y = Ax. If the learning rate satisfies \eta < 2/\|B\|, where B := W^{1/2} A^T A W^{1/2}, then:

• If the NTK W is non-singular, then the difference between z_\infty and x lies in the null space N(A) of A, i.e.,

z_\infty - x \in N(A).   (4.10)

Moreover, as long as P_{N(A)} x \neq 0, the reconstruction error z_\infty - x \neq 0. Here, P_{N(A)} is the projector onto the subspace N(A).

• If the NTK W is singular, and P_{N(A) \cap R(W)} x = 0 with R(W) denoting the column (range) space of W, then the difference z_\infty - x is linear in P_{N(W)} x, the component of x lying in the null space of W:

z_\infty - x = -P_{N(W)} x + W^{1/2} (A W^{1/2})^\dagger A P_{N(W)} x.   (4.11)

• If the NTK W is singular, P_{N(A) \cap R(W)} x = 0, and x \in R(W), then the reconstruction is exact, i.e.,

z_\infty = x.   (4.12)

In practice, the NTK matrix is not precisely singular (or low-rank). Nevertheless, the above theorem can be extended to the nearly singular case with some error-correction terms. The principal message, however, remains consistent, as outlined below. When the NTK W is non-singular, it can result in inaccurate reconstruction (4.10) at convergence. On the other hand, if the NTK W is singular, or low-rank, then surprisingly there is a possibility of exact reconstruction. Specifically, (4.12) indicates that exact recovery is possible if the NTK operator effectively represents the underlying image, meaning x \in R(W), and if the measurement matrix A exhibits sufficient incoherence with the NTK, in the sense that N(A) \cap R(W) = \{0\} or P_{N(A) \cap R(W)} x = 0. An example of a situation that meets these criteria is when the true image x consists of a few non-bandlimited wavelet elements, the NTK W is sufficient to represent this x, and A includes a range of low-frequency Fourier modes.
Provided that the wavelet elements constituting x cannot be linearly combined to form a band-constrained signal, they are not contained in the null space of A, which consists of such signals. Consequently, the condition P_{N(A) \cap R(W)} x = 0 would be met. Now consider the more practical scenario where the NTK W is almost (but not exactly) singular. We note that empirical studies suggest that the NTK of reasonably sized networks is generally poorly conditioned (with a condition number of 10^3 or greater) (Liu and Hui, 2023). Taking Theorem 4.1.1 into account, we can anticipate certain interesting outcomes. First of all, despite W being nearly singular, it retains full rank. Therefore, per the result for the non-singular NTK in equation (4.10), the reconstruction will incur a non-zero (likely non-negligible) error in N(A) (e.g., MRI images invariably contain frequency content outside the sampled frequencies). However, this substantial reconstruction error will only emerge if the algorithm is allowed to converge fully over a sufficiently long duration. In the early iterations, the near-singular nature of W and the use of gradient descent imply that the larger elements of W will predominantly influence the gradient directions. As a result, the initial reconstructions will closely resemble those in scenarios with a singular or low-rank W. According to the third statement of Theorem 4.1.1, under certain conditions this leads to minimal reconstruction errors. Therefore, it is plausible to observe a pattern where reconstruction errors initially decrease significantly in the early iterations, before increasing (towards the level indicated by (4.10)) after a prolonged period. A full proof of the theorem is provided in Appendix A.1.

4.1.5 Extending Our Analysis to the Noisy Setting

The proof of Theorem 4.1.1 requires the assumption that the measurements y are free of noise, i.e., that y = Ax exactly. In this section, we extend our analysis to the more realistic setting where the acquired imaging measurements are corrupted by noise, i.e., y = Ax + n with n \sim \mathcal{N}(0, \sigma^2 I). In this case, we estimate the mean squared error (MSE) of the reconstruction through the decomposition MSE = \|\text{Bias}\|_2^2 + \text{Variance} (Tachella et al., 2020). We first investigate the bias term. In Appendix B, we use the recursion in equation (4.9) to show that

\|\text{Bias}_t\|_2^2 = \|\mathbb{E}_n[z_t] - x\|_2^2 = \|(I - \eta W A^T A)^t x\|_2^2.   (4.13)

We also compute the covariance of z_t, Cov_t, and find that

\text{Cov}_t = \mathbb{E}_n[z_t z_t^T] - \mathbb{E}_n[z_t] \mathbb{E}_n[z_t]^T = \sigma^2 (I - (I - \eta W A^T A)^t) A^\dagger (A^\dagger)^T (I - (I - \eta A^T A W)^t).   (4.14)

If we define

Q_t := (I - (I - \eta W A^T A)^t) A^\dagger,   (4.15)

then we can write

\text{Var}_t = \text{tr}(\text{Cov}_t) = \sigma^2 \text{tr}(Q_t Q_t^T) = \sigma^2 \sum_{i=1}^{p} \nu_{t,i}^2,   (4.16)

where the \nu_{t,i} are the singular values of Q_t.

Theorem 4.1.2. Let A \in \mathbb{R}^{p \times q} be a full row rank measurement operator. Suppose that the acquired measurements are y = Ax + n, where x is the ground truth image and n \in \mathbb{R}^p with n \sim \mathcal{N}(0, \sigma^2 I). Then the MSE for DIP-based reconstruction at iteration t is given by

\text{MSE}_t = \|(I - \eta W A^T A)^t x\|_2^2 + \sigma^2 \sum_{i=1}^{p} \nu_{t,i}^2,   (4.17)

where the \nu_{t,i} are the singular values of the matrix (I - (I - \eta W A^T A)^t) A^\dagger. A full proof of the theorem is provided in Appendix A.2.
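To build intuition for this result, the following self-contained NumPy sketch simulates the NTK-regime recursion (4.9) and evaluates the bias and variance terms of (4.17) over iterations. The subsampling operator, the random positive semi-definite stand-in for the NTK, and all sizes and constants are toy assumptions for illustration, not the settings used in our experiments:

```python
import numpy as np

# Toy illustration of Theorem 4.1.2: as t grows, the bias term shrinks while the
# variance term grows, so the MSE is minimized at an intermediate iteration.
rng = np.random.default_rng(0)
q, p, sigma = 64, 32, 0.1
A = np.eye(q)[rng.choice(q, p, replace=False)]      # full-row-rank subsampling operator
G = rng.standard_normal((q, q))
W = G @ G.T / q                                     # random PSD surrogate for the NTK
x = rng.standard_normal(q)

eta = 1.0 / np.linalg.eigvalsh(W).max()             # satisfies eta < 2 / ||B|| (||A|| = 1)
M = np.eye(q) - eta * W @ A.T @ A                   # one-step map of the recursion (4.9)
A_pinv = np.linalg.pinv(A)

Mt = np.eye(q)
for t in range(1, 401):
    Mt = Mt @ M
    bias2 = np.sum((Mt @ x) ** 2)                   # ||(I - eta W A^T A)^t x||_2^2
    Q_t = (np.eye(q) - Mt) @ A_pinv                 # matrix whose singular values are nu_{t,i}
    var = sigma ** 2 * np.sum(np.linalg.svd(Q_t, compute_uv=False) ** 2)
    if t % 80 == 0:
        print(f"t={t:3d}  bias^2={bias2:8.3f}  var={var:6.3f}  MSE_t={bias2 + var:8.3f}")
```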
As a corollary to Theorem 4.1.2, we consider a special case in the setting of MRI reconstruction. Since MR images are complex-valued, in practice it is common to use a network in DIP with real-valued input, real-valued weights, and a two-channel output representing the real and imaginary components of the reconstructed image. The following corollary considers this setting with single-coil MRI. Note that the typical MRI measurement operator mapping a complex-valued image to complex-valued measurements can readily be rewritten as a mapping from/to the stacked real and imaginary parts of the vectors.

Corollary 1. We consider the single-coil MRI forward operator A = MF. Suppose the network outputs vectors in \mathbb{R}^{2q}, representing the real and imaginary parts of the reconstruction concatenated together. Further suppose that the NTK, which we write as \tilde{W} \in \mathbb{R}^{2q \times 2q}, has an eigendecomposition of the form

\tilde{W} = \tilde{F}^T \tilde{\Lambda} \tilde{F}, \quad \tilde{\Lambda} = \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda \end{bmatrix},   (4.18)

where \tilde{F} = \begin{bmatrix} F_R & -F_I \\ F_I & F_R \end{bmatrix}, and F_R and F_I are the real and imaginary parts of the Fourier transform operator. Then the MSE at iteration t is given by

\text{MSE}_t = \sum_{i=1}^{q} \left[ (1 - \eta \lambda_i m_i)^{2t} |(Fx)_i|^2 + \sigma^2 (1 - (1 - \eta \lambda_i m_i)^t)^2 \right],   (4.19)

where the \lambda_i are the diagonal entries of \Lambda, m_i denotes the i-th diagonal entry of M^T M, and (Fx)_i is the i-th entry of Fx.

The above structure for \tilde{W} has a natural interpretation: applying \tilde{W} to a vector in \mathbb{R}^{2q} is equivalent to applying the matrix W = F^H \Lambda F to the corresponding complex vector in \mathbb{C}^q. Thus, the setting corresponds to an equivalent circulant W whose eigenvectors are fully coherent with the Fourier forward operator. Furthermore, we can interpret equation (4.19) in the limit as t \to \infty. In this limit, the first term in the sum tends to 0 for all sampled frequencies, provided \eta is sufficiently small, and is a constant |(Fx)_i|^2 at nonsampled frequencies (a result of coherence between W and the measurement operator F, similar to what is described in Section 4.1.6). On the other hand, the second term is 0 for all unsampled frequencies and tends to \sigma^2 for the sampled frequencies. This behavior indicates a bias-variance tradeoff, where the bias decreases as t \to \infty, the variance increases as t \to \infty, and the optimal performance is achieved at some intermediate t. A full proof of the corollary is provided in Appendix A.3.

4.1.6 Example of the Relationship Between the NTK and the Forward Operator

Theorem 4.1.2 and Corollary 1 show that the training dynamics of DIP for inverse problems such as MRI are largely governed by the relationship between the forward operator A and the NTK W. In this section, we analyze a simple network architecture to theoretically demonstrate how this relationship affects the network's ability to recover missing frequency content. In (Heckel and Soltanolkotabi, 2020), the authors analyze simple generator networks G of the form

G_C(\cdot) = \text{ReLU}(U C(\cdot)) v,   (4.20)

where C \in \mathbb{R}^{n \times k} is a weight matrix, U \in \mathbb{R}^{n \times n} is a convolution operator, and v \in \mathbb{R}^k is a fixed last-layer weight vector v = \frac{1}{\sqrt{k}} [1, \dots, 1, -1, \dots, -1]^T with half of its entries 1 and half -1. It is then proven that, in expectation,

\mathbb{E}[W] = \sum_{l=1}^{k} v_l^2 \, \mathbb{E}\left[ \sigma'(U c^{(l)}) \, \sigma'(U c^{(l)})^T \right] \odot U U^T,   (4.21)

where \sigma' is the derivative of the ReLU activation, \odot denotes the entry-wise product, v_l is the l-th entry of v, and c^{(l)} is the l-th column of C. It is then shown that

[\mathbb{E}[W]]_{i,j} = \frac{1}{2} \left( 1 - \cos^{-1}\left( \frac{\langle u_i, u_j \rangle}{\|u_i\|_2 \|u_j\|_2} \right) \Big/ \pi \right),   (4.22)

where u_r denotes the r-th row of U. In this case, with circulant U, it is possible to show that W is also circulant, and hence is diagonalized by Fourier operators. Thus, networks of this form are related (in expectation) to the case in Corollary 1.
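As a quick empirical companion to this analysis, the sketch below forms the NTK of a small, randomly initialized 1-D convolutional generator via automatic differentiation and measures the coherence of its eigenvectors with the Fourier basis. The architecture, sizes, and use of PyTorch's torch.func utilities are illustrative assumptions rather than the networks studied later in Section 4.1.7:

```python
import torch
import torch.nn as nn
from torch.func import functional_call, jacrev

# Empirically form the NTK W = J J^T of a small 1-D generator at initialization,
# where J is the Jacobian of the output w.r.t. the parameters, then measure how
# concentrated the eigenvectors of W are in the Fourier basis.
torch.manual_seed(0)
n = 32
net = nn.Sequential(
    nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(),
    nn.Conv1d(16, 1, 5, padding=2))
z = torch.randn(1, 1, n)                            # fixed random input

params = dict(net.named_parameters())

def output(p):
    return functional_call(net, p, (z,)).flatten()

jac = jacrev(output)(params)                        # per-parameter Jacobian blocks
J = torch.cat([j.reshape(n, -1) for j in jac.values()], dim=1)
W = J @ J.T                                         # empirical NTK (n x n, PSD)

evals, evecs = torch.linalg.eigh(W)
F = torch.fft.fft(torch.eye(n), dim=0) / n ** 0.5   # unitary DFT matrix
coherence = (F @ evecs.to(torch.complex64)).abs().max()
print(f"largest NTK eigenvalue: {evals[-1].item():.3f}, "
      f"max Fourier coherence of eigenvectors: {coherence.item():.3f}")
```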
4.1.7 Understanding DIP-MRI with Different Networks

In Section 4.1.6, we saw that Corollary 1 suggests that (in expectation) generator networks of the form (4.20) will not be able to effectively recover missing measurement frequencies when used for DIP reconstruction. To empirically validate this claim and compare network designs, we present a simple experiment comparing two neural network architectures for a 1D signal reconstruction task.

Figure 4.1 Top row: The performance of reconstructing a 1D square signal using a Wavelet CNN and Deep Decoder. The top row shows the sampling mask applied in Fourier space (top left), and the Fourier transform of the signals recovered by the Deep Decoder (top middle) and Wavelet CNN (top right), where the y-axis is the magnitude and the x-axis indexes the entries of the signal. Bottom row left: The RMSE of the reconstructed signals vs. the number of training iterations. Bottom row right: The two figures at the bottom display the Fourier transform of the left eigenvector matrix for both the Deep Decoder (bottom left) and the WCNN (bottom right).

We compare the Deep Decoder, a simple generator network described in (Heckel and Hand, 2019), and a U-Net architecture in which the upsampling and downsampling filters are replaced by wavelet transformations. The results of this experiment are shown in Figure 4.1. We found that, in line with the assumption of training in the NTK regime, the networks' tangent kernels showed little change over training iterations in practice. We also observed that the signal was very close to the subspace R(W) when a low-rank approximation of W (obtained by truncating small singular values) was used. We find that the reconstruction performance of the Deep Decoder quickly plateaus with the A = MF sampling operator, and it is not able to recover significant missing frequency content. In contrast, the error of the wavelet-based U-Net reconstruction slowly decreases, then increases after many training iterations because of overfitting. We also plot the magnitude of the Fourier transform of each network's NTK eigenvectors. We can see that the Deep Decoder's NTK eigenvectors are highly coherent with the Fourier basis, whereas the wavelet U-Net's NTK is less so. This experiment demonstrates that the analysis and discussion presented in Section 4.1.4 is based on reasonable assumptions, and our conclusions hold when applied to real networks.

4.1.8 Overfitting and Spectral Bias in Deep Image Prior

Because DIP typically uses corrupted and/or limited data for network training, any related distortions will inevitably manifest in the network's output if it is trained until the loss function reaches equilibrium.
This issue impacts not only DIP's performance in well-researched areas like image denoising, but also in inverse problems such as MRI reconstruction, where forward operators may have high-dimensional null spaces. Fig. 4.2 quantitatively demonstrates the overfitting phenomenon in MRI reconstruction. One can see that the reconstruction reaches peak performance quickly and then slowly degrades as training persists. This highlights the necessity of implementing an early stopping criterion when using vanilla DIP to solve inverse problems.

4.1.9 Understanding Spectral Bias and Overfitting for DIP MRI

To gain insights into the spectral bias inherent in vanilla DIP MRI image reconstruction, we utilize a frequency-band metric to explore the disparity between the reconstructed frequencies and the actual ones. We compare the multi-coil k-space of the output image f_\theta(z) at every stage of network training to the fully sampled k-space y_c, c = 1, \dots, N_c, of the ground truth image x to study how various frequency components converge (refer to Fig. 4.2). We execute this by calculating a normalized error metric for low, medium, and high-frequency bands:

\text{NMSE} := \frac{\sum_{c=1}^{N_c} \| M_{\text{freq}} F S_c f_\theta(z) - M_{\text{freq}} y_c \|_2^2}{\sum_{c=1}^{N_c} \| M_{\text{freq}} y_c \|_2^2},   (4.23)

where M_{\text{freq}} is the frequency-band mask. Intuitively, the above metric measures the consistency between the reconstructed image f_\theta(z) and the true k-space y_c in the frequency domain. Fig. 4.2 plots this metric computed across three frequency bands for vanilla DIP MRI reconstruction. The result shows that the low frequencies are learned more quickly and with lower error, confirming that spectral bias is present in MRI reconstruction using DIP.

Figure 4.2 Top row: the three masks used to compute the frequency band-based metric. Bottom row: the reconstruction PSNR plot on the left illustrates the overfitting issue that occurs during MRI reconstruction. Spectral bias also affects the performance of DIP for MRI reconstruction (right plot), as different frequency bands are reconstructed at different rates.

4.2 Methodology

To more effectively address the overfitting issue inherent in the vanilla deep image prior (DIP), certain methods have been introduced, including the use of matched reference images. In contrast to the approach of using a reference image, we propose a self-regularization method as an enhancement.

4.2.1 Reference-Guided DIP

The reference-guided DIP formulation was proposed in (Zhao et al., 2020b) as

\hat{\theta} = \arg\min_{\theta} \|A f_{\theta}(z) - y\|_2^2, \quad \hat{x} = f_{\hat{\theta}}(z).   (4.24)

This formulation is identical to the problem in (4.3), except that the input to the network is no longer fixed random noise, but is instead a reference image that is very similar to the one being reconstructed. The input to the network introduces some additional structural information, and we can consider the network as essentially performing image refinement or style transfer rather than image generation from scratch. This method is quite reasonable in cases where a dataset of structurally similar images is available and there is a systematic way to choose the network input image from the dataset based only on undersampled k-space observations at testing time. In (Zhao et al., 2020b), the input image appears to be chosen by hand.
As a more realistic modification of this method, we propose an approach similar to the recent LONDN-MRI (Liang et al., 2024b) method to search for the reference image (using a distance metric such as the Euclidean distance) that is most similar to an estimated test reconstruction from undersampled data. In our experiments, we used A^H y as the estimated test image and used corresponding versions of the reference images to find the closest neighbor.

4.2.2 Self-Guided DIP

To circumvent the need for a pre-chosen reference to guide DIP, we introduce the following method, which adaptively estimates such a reference and which we call self-guided DIP:

\hat{\theta}, \hat{z} = \arg\min_{\theta, z} \underbrace{\| A \, \mathbb{E}_{\eta}[f_{\theta}(z + \eta)] - y \|_2^2}_{\text{data consistency}} + \alpha \underbrace{\| \mathbb{E}_{\eta}[f_{\theta}(z + \eta)] - z \|_2^2}_{\text{denoiser regularization}},   (4.25)

\hat{x} = \mathbb{E}_{\eta}\left[ f_{\hat{\theta}}(\hat{z} + \eta) \right].   (4.26)

In this optimization, z is no longer a reference image, but is instead initialized appropriately and updated. The search space of z is not constrained, although z must have the same dimension as x. For example, in multi-coil MRI, the initialization can be a zero-filled (for missing k-space) reconstruction \sum_{c=1}^{N_c} A_c^H y_c. Furthermore, \eta is random noise drawn from some distribution P_\eta (either uniform or Gaussian in our experiments). The first term in the optimization enforces data consistency, while the second term is a regularization penalty. The input z is optimized here, in contrast to both vanilla and reference-guided DIP. Hence, we call this method "self-guided" because at each iteration (of an algorithm) the network's "reference" is updated, with the regularization also guiding the process. Another intriguing feature of this method that we have observed is that the optimal performance is obtained when the magnitude of \eta is quite large. The proposed regularization smooths the network output over input perturbations. This strategy has been exploited in approaches such as randomized smoothing and makes the network mapping more stable. The regularizer attempts to match the smoothed output to the unperturbed input, mimicking a denoiser. A somewhat simpler form of the objective would place the expectation outside the norms rather than inside. In that case, the regularization term would push the network to act as a usual denoiser, i.e., ensure f_\theta(z + \eta) \approx z. We instead place the expectation inside the norms (with the reconstruction being \mathbb{E}_\eta[f_\theta(z + \eta)]). This offers the learned network some more flexibility and yielded slightly better image reconstructions in our studies. For example, with the expectation inside, the regularization loss would be 0 for zero-mean \eta if the network were a denoising autoencoder or even just an autoencoder. The proposed loss is optimized using the Adam optimizer; a sketch of the optimization follows.
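To make this concrete, the following is a minimal PyTorch sketch of the self-guided DIP optimization in (4.25). The binary masking operator standing in for A, the small CNN standing in for the U-Net, and the toy sizes and iteration counts are illustrative assumptions; as in our experiments, the expectation is approximated with a few noise realizations and the noise magnitude is tied to the scale of z:

```python
import torch
import torch.nn as nn

# Self-guided DIP per (4.25): both the network weights and the input z are
# optimized; the expectation over eta is estimated with a few noise samples.
torch.manual_seed(0)
n, alpha, n_noise = 64, 1.0, 4
x_true = torch.rand(1, 1, n, n)                     # stand-in ground truth
mask = (torch.rand_like(x_true) < 0.25).float()     # toy masking operator in place of A
y = mask * x_true                                   # undersampled measurements

net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
z = y.clone().requires_grad_(True)                  # z initialized from the measurements
opt = torch.optim.Adam([{"params": net.parameters(), "lr": 3e-4},
                        {"params": [z], "lr": 1e-1}])

for it in range(1000):
    opt.zero_grad()
    m = 0.5 * z.detach().abs().max()                # noise magnitude tied to the scale of z
    outs = [net(z + m * torch.rand_like(z)) for _ in range(n_noise)]
    mean_out = torch.stack(outs).mean(dim=0)        # Monte Carlo estimate of E_eta[f(z + eta)]
    loss = ((mask * mean_out - y) ** 2).sum() \
         + alpha * ((mean_out - z) ** 2).sum()      # data consistency + denoiser regularization
    loss.backward()
    opt.step()

with torch.no_grad():
    x_hat = torch.stack([net(z + m * torch.rand_like(z))
                         for _ in range(n_noise)]).mean(dim=0)
```

The two learning rates mirror the settings described later in Section 4.3 (3 × 10^{-4} for the network and 10^{-1} for the input z).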
In Fig. 4.3, the important role of the regularization component in the optimization process of a U-Net network (for multi-coil MRI with 4x and 8x undersampling masks) is underscored. In the absence of this element, z fails to be updated correctly, resulting in unstable training and inferior performance. This illustrates the efficacy of leveraging smoothing- and denoising-based regularization. Next, we conduct an ablation study on the impact of additive noise in the network input. Specifically, at every optimization iteration, we add noise vectors \eta to the DIP network input with magnitude controlled by \sigma, i.e., \eta \sim \mathcal{N}(0, \sigma I). Fig. 4.4 presents the results for different values of \sigma. The results demonstrate a correlation between the input noise magnitude and the reconstruction quality. Specifically, performance improves as the noise intensity increases, reaching an optimal point, after which further increases in noise lead to a gradual decline in performance. Fig. 4.5 shows how the network's input z + \eta evolves throughout the self-guided DIP optimization. It is observable that the input z progressively acquires more feature information, which facilitates the network's learning process, but the input continues to change because of the added noise \eta. In each iteration, the network and input are updated as in Fig. 4.6.

Figure 4.3 Self-guided deep image prior: effect of regularization.

Figure 4.4 Self-guided deep image prior: effect of added noise in the network input.

Figure 4.5 Evolution of the network input in self-guided DIP during training for MRI reconstruction at 4x undersampling. As the loss from (4.25) diminishes, the self-guided input supplies additional information, enabling the neural network to enhance its reconstruction capabilities.

Figure 4.6 Flow chart of the proposed self-guided DIP algorithm.

4.2.3 Post-processing Data Correction

In some applications, it may be desirable to ensure that the reconstructed image is completely consistent with the acquired measurements. This could be the case in compressed sensing problems when signal-to-noise ratios are good. For example, consider y = M \Psi x, where M \in \mathbb{R}^{p \times q} is a subsampling matrix and \Psi \in \mathbb{C}^{q \times q} is a full measurement matrix. Then the matrix M' := M^T M \in \mathbb{R}^{q \times q} subsamples the same measurements, but has zero rows for measurements that are not sampled. Define \bar{M} = I - M'. Then, for any reconstruction \hat{x}, we can construct new, "fully sampled" measurements y_{\text{new}} \in \mathbb{C}^q as

y_{\text{new}} = M^T y + \bar{M} \Psi \hat{x}.

With these measurements, we can obtain a corrected (data-consistent) reconstruction by solving

\hat{x}_{\text{corrected}} = \arg\min_x \| \Psi x - y_{\text{new}} \|_2^2.

For multi-coil MRI, assuming appropriately normalized coil sensitivity maps (\sum_{c=1}^{N_c} S_c^H S_c = I), this yields

y_{c,\text{new}} = M^T y_c + \bar{M} F S_c \hat{x}, \quad \hat{x}_{\text{corrected}} = \sum_{c=1}^{N_c} S_c^H F^H y_{c,\text{new}}.
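To illustrate this correction step, here is a minimal NumPy sketch for the single-coil case (\Psi = F, the unitary 2-D DFT); the sizes, sampling mask, and test images are toy assumptions:

```python
import numpy as np

# Post-processing data correction of Section 4.2.3 for the single-coil case,
# where Psi = F is the unitary 2-D DFT. Sizes, mask, and images are toy stand-ins.
rng = np.random.default_rng(0)
n = 64
x_true = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
x_hat = x_true + 0.1 * rng.standard_normal((n, n))           # any imperfect reconstruction

sampled = rng.random((n, n)) < 0.25                          # k-space sampling pattern (M'M)
y = np.fft.fft2(x_true, norm="ortho") * sampled              # acquired (zero-filled) k-space

# y_new = M^T y + (I - M^T M) Psi x_hat: keep acquired samples, fill the rest from x_hat
y_new = np.where(sampled, y, np.fft.fft2(x_hat, norm="ortho"))
x_corrected = np.fft.ifft2(y_new, norm="ortho")              # exact minimizer (Psi unitary)

# The corrected image reproduces the acquired k-space samples exactly:
assert np.allclose(np.fft.fft2(x_corrected, norm="ortho")[sampled], y[sampled])
```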
4.3 Experiments and Results

We tested the proposed method and alternatives for MRI reconstruction from undersampled measurements and for image inpainting.

Dataset. We tested methods for MRI reconstruction using the multi-coil fastMRI knee and brain datasets (et al., 2019, 2020) and the Stanford 2D FSE (Cheng, 2019) dataset. The coil sensitivity maps for all cases were obtained using the BART toolbox (Uecker, 2018); the maps were estimated from the under-sampled center of k-space. We also tested our method on image inpainting using the CBSD68 dataset (Roth and Black, 2005), which is shown in the supplementary material.

Training setup. In our experiments, we compare to related reconstruction methods, which include vanilla DIP; RAKI (Akçakaya et al., 2019), a nonlinear deep learning-based auto-regressive, auto-calibrated reconstruction method; reference-guided DIP; DIP with total variation (TV) regularization (Liu et al., 2019a); self-guided DIP; compressed sensing with wavelet regularization; ZS-SSL (zero-shot self-supervised learning); TRPA (Truncated Residual Based Plug-and-Play ADMM); and a neural network trained in an end-to-end supervised manner (on a set of 3000 images). For compressed sensing MRI, we used the SigPy package, and the regularization parameter was tuned and set to \lambda = 10^{-6}. During training, network weights were initialized randomly (normally distributed). For all of the deep network methods, the network architecture used was a deep U-Net (~3 × 10^8 parameters). The network parameters were optimized using Adam with a learning rate of 3 × 10^{-4}. For TV-regularized DIP, the parameters used are the same as those in the original paper (Liu et al., 2019a), which worked well. For ZS-SSL (Yaman et al., 2022), we used the settings in the original paper, with 300 epochs, 10 unrolling blocks, 10 CG iterations, and a learning rate of 5 × 10^{-4}. Finally, we compared to the plug-and-play method TRPA (Hou et al., 2022) using the default settings and training the network on 3000 images. For the self-guided method, we observed that the noise \eta can be drawn from different distributions, such as the normal or uniform distribution, with essentially identical performance. For our experiments, we drew \eta from U(0, m), where m is 1/2 of the maximum magnitude of any real or imaginary component of z. In this case, z is also optimized using Adam, with a learning rate of 1 × 10^{-1}. At each iteration, we estimated the expectation inside the loss function using 4 realizations of \eta. For all unsupervised methods besides compressed sensing, the data correction outlined in Section 4.2.3 was applied. Among supervised methods, we tested the U-Net and the unrolled MoDL network (Aggarwal et al., 2019a), for which no post-processing was undertaken, as it did not yield significant improvements.

Evaluation. We tested each of the MRI reconstruction methods at 4x acceleration (25.0% sampling) and 8x acceleration (12.5% sampling). Variable-density 1-D random Cartesian (phase-encode) undersampling was performed in most cases, unless uniform sampling is specified. We quantified the reconstruction quality of the different methods using the peak signal-to-noise ratio (PSNR) in decibels (dB). We also computed the frequency-band metric in equation (4.23) to study the spectral bias and overfitting in each method.

4.3.1 Reconstruction Results for the fastMRI Dataset

Table 4.1 provides a quantitative comparison of the average PSNR values for knee (test) data with 4x and 8x sampling acceleration.
The proposed self-guided DIP outperforms vanilla DIP, reference-guided DIP, compressed sensing reconstruction, and a corresponding supervised model that was trained on a paired dataset. We also compare to the zero-shot self-supervised learning method ZS-SSL (Yaman et al., 2022) and the plug-and-play based method TRPA (Hou et al., 2022), and find that self-guided DIP yields better performance. A similar comparison for the fastMRI brain dataset can be found in Table 4.2. The benefits of self-guided DIP are also evident in the visual comparisons in Figs. 4.9, 4.10, and 4.11, which show qualitative comparisons for 8x and 4x accelerated knee images, and a 4x accelerated brain image.

Ax | Vanilla DIP | RAKI | TV DIP | Ref-Guided | CS Recon | Ours | ZS-SSL | Supervised U-Net | TRPA
4x | 30.22 | 30.47 | 30.52 | 33.18 | 29.32 | 33.61 | 33.01 | 33.17 | 33.21
8x | 28.77 | 29.03 | 28.98 | 30.24 | 27.82 | 30.73 | 30.34 | 30.28 | 30.31

Table 4.1 Average reconstruction PSNR values (in dB) for 25 images from the fastMRI knee dataset at 4x and 8x undersampling or acceleration (Ax), including the ZS-SSL method.

Ax | Vanilla DIP | RAKI | TV DIP | Ref-Guided | CS Recon | Ours | ZS-SSL | Supervised U-Net | TRPA
4x | 30.72 | 30.99 | 31.04 | 33.56 | 29.84 | 34.12 | 33.44 | 33.74 | 33.64
8x | 29.03 | 29.27 | 29.25 | 30.54 | 28.12 | 31.04 | 30.34 | 30.57 | 30.45

Table 4.2 Average reconstruction PSNR values (in dB) for 25 images from the fastMRI brain dataset at 4x and 8x undersampling or acceleration (Ax).

We also evaluate the performance of the reconstruction methods using uniform sampling masks. A quantitative comparison in this setting for fastMRI knee data is given in Table 4.3. The reconstruction results with the uniform undersampling mask indicate that our method significantly outperforms the reference-guided DIP and the compared self-supervised and supervised approaches.

Ax | Vanilla DIP | RAKI | TV DIP | Ref-Guided | CS Recon | Ours | ZS-SSL | Supervised U-Net | TRPA
4x | 30.34 | 30.55 | 30.67 | 33.24 | 29.45 | 34.45 | 33.67 | 33.74 | 33.45
8x | 29.25 | 29.56 | 29.75 | 30.84 | 28.32 | 31.88 | 30.74 | 30.87 | 30.78

Table 4.3 Average reconstruction PSNR values (in dB) for 25 images from the fastMRI knee dataset using 4x and 8x uniform undersampling masks.

Ax | Vanilla DIP | RAKI | TV DIP | Ref-Guided | CS Recon | Ours | ZS-SSL | Supervised U-Net | TRPA
4x | 1.12 | 1.45 | 1.15 | 2.02 | 0.45 | 2.05 | 15.22 | 0.24 | 0.56

Table 4.4 Average run-time (minutes) for 30 images from the fastMRI knee dataset at 4x undersampling.

Furthermore, we observed in Table 4.4 that our method requires slightly more computation time than both vanilla DIP and the supervised approach. We note that the run-times for the supervised U-Net and TRPA exclude training time, i.e., they are inference time only. However, our method demonstrates significantly better performance than vanilla DIP and operates without the need for any training data (dataless), which supervised methods require. We also conducted experiments to understand the reconstruction of different frequencies across the three DIP-based methods. To do this, we used the same frequency-band metric introduced previously. We computed this metric over 25 images for 4x k-space undersampling, and the average metric is shown in Fig. 4.7. We observe that the self-guided method shows reduced spectral bias (high frequencies are reconstructed sooner and more accurately) and shows less overfitting in both frequency bands considered, especially compared to vanilla DIP. To further compare the presence of overfitting in the vanilla, reference-guided, and self-guided methods, Fig. 4.8 shows the average reconstruction PSNR for 25 images throughout training.
The self-guided DIP shows essentially no overfitting, compared to vanilla DIP and reference-guided DIP. The PSNR increases somewhat more gradually for self-guided DIP due to its reference/input optimization; however, it quickly outperforms the other compared DIP methods. Our hypothesis is that as the input undergoes continuous optimization, it accrues more high-frequency details (see Fig. 4.5). This enrichment facilitates the network's ability to better assimilate high-frequency details in the output without overfitting.

4.3.2 Reconstruction Results for the Stanford FSE Dataset

Here, we evaluate the reconstruction performance on the Stanford multi-coil FSE dataset (Cheng, 2019). The FSE dataset is relatively more challenging because it has more diversity in anatomical structures compared to fastMRI; however, the number of samples is smaller than in fastMRI. As illustrated in Table 4.5, the self-guided DIP method outperforms other methods, including the well-known unrolling-based MoDL (Aggarwal et al., 2019a) reconstructor. We note that the U-Net in MoDL was trained with supervision on a set of 2000 scans, comprising most of the dataset. We used 6 iterations/unrollings within MoDL. For the End-to-End VarNet, we used a sigmoid with a slope of 10 and 5 cascades, which is the default setting in the paper. We note that MoDL and End-to-End VarNet were each trained separately for the 4x and 8x acceleration factors. A visual comparison is presented in Fig. 4.12. The results demonstrate the ability of self-guided DIP to restore sharper features compared to the MoDL reconstructor, despite using no training data or references.

Figure 4.7 Error in the low and high frequencies of the reconstructions, with different methods plotted over iterations at 4x undersampling.

Figure 4.8 PSNR plotted over iterations at 4x undersampling.

Figure 4.9 Comparison of reconstructions of a knee image using the proposed self-guided DIP method at 8x k-space undersampling or acceleration compared to supervised learning, vanilla DIP, compressed sensing, ZS-SSL, and reference-guided DIP reconstruction. A region of interest is shown with the green box and its error (magnitude) is shown in the panel on the top right. (Panel PSNRs, left to right: Ground Truth ∞; U-Net 30.45 dB; ZS-SSL 30.62 dB; Self-Guided 31.01 dB; CS Recon 26.8 dB; Ref-Guided 30.32 dB; Vanilla DIP 28.26 dB.)

Figure 4.10 Same comparisons/setup as Fig. 4.9, but for 4x acceleration. (Panel PSNRs, left to right: Ground Truth ∞; U-Net 35.14 dB; ZS-SSL 34.94 dB; Self-Guided 35.75 dB; CS Recon 31.2 dB; Ref-Guided 35.45 dB; Vanilla DIP 32.2 dB.)
Figure 4.11 Comparison of reconstructions of a brain image using the proposed self-guided method at 4x acceleration versus supervised learning, vanilla DIP, compressed sensing, ZS-SSL, and reference-guided reconstruction. (Panel PSNRs, left to right: Ground Truth ∞; U-Net 35.2 dB; ZS-SSL 34.78 dB; Self-Guided 36.7 dB; CS Recon 30.9 dB; Ref-Guided 34.5 dB; Vanilla DIP 31.25 dB.)

Ax | Vanilla DIP | RAKI | Reference-Guided | CS Recon | Self-Guided | Supervised VarNet | Supervised MoDL
4x | 29.9 | 30.11 | 32.57 | 29.75 | 33.15 | 32.78 | 32.89
8x | 27.45 | 27.56 | 29.81 | 28.1 | 30.45 | 30.12 | 29.88

Table 4.5 Average PSNR values (in dB) on the Stanford FSE test set at 4x and 8x undersampling for 15 images for different reconstructors.

4.3.3 Generalization of Self-Guided DIP

Because self-guided DIP trains a network that accepts different kinds of (noise-perturbed) inputs, we anticipate that a network trained in this manner will exhibit superior generalization to unseen data compared to a network trained using the conventional DIP method. To test this hypothesis, we train the network to reconstruct the nearest neighbor (in terms of ℓ2 distance) of the target for both vanilla DIP and self-guided DIP. Subsequently, we optimize only the network input while keeping the network parameters fixed (i.e., to the network that was trained on the nearest neighbor), using the loss function from (4.24) for DIP and (4.25) for self-guided DIP, to reconstruct the target. We executed the same experiment for 4x and 8x data undersampling scenarios for 15 fastMRI knee images. The results in Table 4.6 show that self-guided DIP displays much higher benefits in terms of generalization compared to conventional DIP.

Figure 4.12 Comparison of reconstructions of an FSE dataset image from fourfold undersampled data using the proposed self-guided method versus supervised learning, vanilla DIP, compressed sensing, E2E-VarNet, and reference-guided DIP. A region of interest and its error are also shown. (Panel PSNRs, left to right: Ground Truth ∞; MoDL 34.05 dB; E2E-VarNet 34.12 dB; Self-Guided 34.5 dB; CS Recon 28.7 dB; Ref-Guided 33.7 dB; Vanilla DIP 29.1 dB.)

Figure 4.13 Comparison of image reconstructions at 4x k-space undersampling. The methods shown are the proposed self-guided method, supervised learning, vanilla DIP, compressed sensing, and reference-guided reconstruction. (Panel PSNRs, left to right: Ground Truth ∞; U-Net 32.75 dB; Self-Guided 35.17 dB; CS Recon 32.1 dB; Ref-Guided 34.67 dB; Vanilla DIP 31.5 dB.)

Figure 4.14 Visualization of ground truth and reconstructed images using different methods at 4x k-space undersampling for an annotated image from the fastMRI+ dataset, where the area of interest is a nonspecific white matter lesion (in the green box). Self-guided DIP produces sharper image features with reduced artifacts compared to other methods. The top right box shows the error (magnitude) of each reconstruction in the region of interest. (Panel PSNRs, left to right: Ground Truth ∞; U-Net 34.22 dB; ZS-SSL 34.42 dB; Self-Guided 35.01 dB; CS Recon 32.01 dB; Ref-Guided 34.13 dB; Vanilla DIP 32.23 dB.)

Ax | Vanilla DIP Generalized | Self-Guided Generalized
4x | 28.2 | 31.77
8x | 26.65 | 29.11

Table 4.6 Average reconstruction PSNR values (in dB) for 15 images from the fastMRI knee test set at 4x and 8x undersampling.

Figure 4.15 Box plots of reconstruction PSNR values (in dB) for different methods for the fastMRI lesion test set at 4x and 8x undersampling. Our (self-guided DIP) results are compared to vanilla DIP, reference-guided DIP, ZS-SSL, a supervised U-Net trained on 3000 non-lesion scans, and CS reconstruction.
Furthermore, to evaluate the ability of self-guided DIP to accurately reconstruct fine image details, especially in common scenarios like pathology detection, we incorporated synthetic features into a knee image from the fastMRI dataset, similar to recent work (Lahiri et al., 2021). By undersampling at a 4x rate in k-space and using the U-Net, we observe (Fig. 4.13) that self-guided DIP renders a clearer image with better PSNR compared to supervised learning techniques. The intricacies and boundaries of the added features were more effectively maintained with the self-guided DIP scheme. Distinctively, the image quality offered by self-guided DIP remains similar irrespective of whether the features are included or not (see Fig. 4.10). In contrast, the quality of the supervised method dipped notably, highlighting the superior stability and adaptability of a robust DIP-based approach. In Figure 4.14, we provide an additional comparison of these methods for reconstructing an image from the fastMRI+ dataset that contains a real brain lesion. This comparison shows the superiority of our method in reconstructing the white matter lesion. For the training phase of the supervised U-Net, a non-lesion dataset of 3000 scans with 4x and 8x undersampling was employed (as in Section 4.3.1). Also, ZS-SSL was employed using the same settings as in the previous section. To provide a quantitative comparison, we tested the methods on 15 scans with lesions. The results, displayed in boxplots in Figure 4.15, show that our method also achieves higher PSNR values on this data compared to other methods, including the supervised U-Net.

4.4 Discussion of Results

We have introduced a novel self-guided image reconstruction method that requires no training data and iteratively optimizes the reconstructor network and its input. This approach is completely unsupervised and instance-adaptive, and it demonstrated strong reconstruction performance on the multi-coil fastMRI knee and brain datasets and the Stanford FSE dataset. The approach does not require pre-training and can easily accommodate variations in most MRI reconstruction settings. Additionally, it was found to outperform supervised methods like the image-domain U-Net and the hybrid-domain MoDL and E2E-VarNet, especially on smaller, more diverse datasets. We note that, given enough matched training data, these powerful supervised methods should outperform self-guided DIP, although the required number of samples may be very large. Indeed, previous studies (Klug and Heckel, 2023) have demonstrated significantly diminishing returns for datasets larger than a few thousand images for medical image reconstruction. Since acquiring large paired datasets is challenging, particularly in medical imaging, we emphasize the importance of developing effective zero-shot methods. We also showed that the networks learned in self-guided DIP demonstrate better stability and generalizability compared to those learned in vanilla DIP. Finally, we demonstrated the effectiveness of self-guided DIP for image inpainting on the CBSD68 dataset. While the self-guided DIP algorithm does require more optimization steps than vanilla DIP, because the network's input must also be optimized, this additional cost is not detrimental.
For example, self-guided DIP with a randomly initialized U-Net took about 2 minutes to run on an NVIDIA GeForce RTX A5000 GPU (with a batch size of 2 and 1500 training iterations), whereas vanilla DIP took about 1 minute.

4.5 Conclusions

In this study, we first presented theoretical results that help explain the training dynamics of unsupervised neural networks for general image reconstruction. We empirically validated our findings using some simple example problems. We then proposed a novel self-guided deep image prior based MRI reconstruction technique that iteratively optimizes the network input while also training the model to be robust to large random perturbations of its input. This was achieved by introducing a new regularization term that encourages the reconstructor to act as a denoiser. We empirically demonstrated that this method yields promising results for MRI reconstruction and image inpainting on different datasets. Notably, our approach does not involve any pre-training and can thus readily handle changes in the measured data. Moreover, this self-guided method showed better performance than the same model trained in a supervised manner on a large dataset (with lengthy training times). This shows that highly adaptive learning approaches may have the potential to outperform traditional data-driven learning approaches in image reconstruction. In the future, we hope to carry out more theoretical analyses to better understand the performance of self-guided DIP for image reconstruction and to analyze how the optimization of the network's input improves reconstruction performance. We also plan to study whether similar self-guided schemes could improve the performance of DIP for other imaging modalities and restoration tasks such as deblurring and super-resolution.

CHAPTER 5
AUTOENCODING SEQUENTIAL DEEP IMAGE PRIOR

5.1 Introduction

In the previous chapter, we introduced Self-Guided Deep Image Prior (Self-Guided DIP) to alleviate the overfitting issue commonly encountered in Deep Image Prior (DIP) methods. While Self-Guided DIP significantly enhances reconstruction quality by guiding the optimization with a carefully designed self-supervisory signal, it also introduces additional computational overhead. This longer runtime partly stems from the extra steps required for the self-guidance mechanism to refine the network's predictions. Our goal, therefore, is to preserve the benefits of Self-Guided DIP, namely its ability to discourage overfitting and improve reconstruction fidelity, while addressing its slower convergence. Drawing inspiration from the progressive denoising strategy found in recent diffusion-based generative models, we propose a novel approach, Autoencoding Sequential DIP (aSeqDIP), to achieve more efficient image reconstruction. Compared to diffusion models, our method does not require training data, and it outperforms other DIP-based methods in mitigating noise overfitting while maintaining a similar number of parameter updates as vanilla DIP. Through extensive experiments, we validate the effectiveness of our method in various image reconstruction tasks, such as MRI and CT reconstruction, as well as in image restoration tasks like image denoising, inpainting, and non-linear deblurring.

5.1.1 Related Work

DIP-based Methods: Deep Image Prior (DIP) was first introduced by (Ulyanov et al., 2018).
The authors demonstrated that the architecture of a generator network alone is capable of capturing a significant amount of low-level image statistics even before any learning takes place. Specifically, the DIP image reconstruction is obtained through the minimization of the following objective:

\hat{\theta} = \arg\min_{\theta} \|A f_{\theta}(z) - y\|_2^2, \quad \hat{x} = f_{\hat{\theta}}(z),   (5.1)

where \hat{x} is the reconstructed image and \theta corresponds to the parameters of the network f : \mathbb{R}^n \to \mathbb{R}^n, which is typically implemented using a U-Net architecture (Ronneberger et al., 2015a). The input to the network, z \in \mathbb{R}^n, is randomly chosen and remains fixed throughout the optimization process. While standard DIP was shown to perform well in many tasks, selecting the number of iterations for optimizing objective (5.1) poses a challenge, as the network will eventually fit the noise present in y or could fit undesired images lying in the null space of A. To mitigate the problem of noise overfitting, previous studies considered different approaches such as regularization, early stopping (ES), and network pruning (Ghosh et al., 2024). Among regularization-based methods, the work in (Liu et al., 2019b) enhanced standard DIP by introducing a total variation (TV) regularization term for denoising and deblurring tasks, whereas the study in (Cheng et al., 2019) proposed combining DIP with stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011). The authors in (Wang et al., 2023a) use running variance as the criterion for ES, whereas the authors of (Li et al., 2021) propose combining self-validation and training to apply ES. The input to the standard DIP (or vanilla DIP) network is a random noise vector that, in most works, remains fixed during the optimization. Nevertheless, other works, such as those in (Zhao et al., 2020a) and (Tachella et al., 2021), have explored cases where the input contains some structure of the ground truth. The approach employed in reference-guided DIP (Ref-Guided DIP) (Zhao et al., 2020a) follows the same objective as standard DIP in (5.1); however, instead of using a fixed random noise vector as input, it utilizes a reference image closely resembling the one undergoing reconstruction. This method was applied to the task of MRI reconstruction. It proves particularly effective when datasets comprising structurally similar data points are available. The reference required here makes this a data-dependent approach. Inspired by the departure from using a random fixed input, the authors in (Liang et al., 2024a) recently introduced Self-Guided DIP. Unlike Ref-Guided DIP (Zhao et al., 2020a), a prior image that closely resembles the unknown (to-be-estimated) image is not needed, and the optimization occurs simultaneously with respect to both the input and the parameters of the network. Specifically, Self-Guided DIP employs the following objective:

\hat{\theta}, \hat{z} = \arg\min_{\theta, z} \| A \, \mathbb{E}_{\eta}[f_{\theta}(z + \eta)] - y \|_2^2 + \alpha \| \mathbb{E}_{\eta}[f_{\theta}(z + \eta)] - z \|_2^2,   (5.2)

where \eta is random noise and \alpha is a regularization parameter. The first (resp. second) term is used for data consistency (resp. denoising regularization), and the final reconstruction is obtained as \hat{x} = \mathbb{E}_{\eta}[f_{\hat{\theta}}(\hat{z} + \eta)]. aSeqDIP differs from Self-Guided DIP in that our method does not require gradient-based updates for the input, making it computationally less expensive. Self-Guided DIP has demonstrated superior performance compared to vanilla DIP, TV-DIP, and SGLD-DIP, thus serving as a primary baseline for comparison.
DM-based Methods: In recent years, there has been an abundance of DM-based methods proposed to address inverse imaging problems (Chung and Ye, 2022; Chung et al., 2022, 2023c; Li et al., 2024; Song et al., 2024; Daras et al., 2024). A well-known method for natural images is Diffusion Posterior Sampling (DPS) (Chung et al., 2023c). DPS incorporates a gradient step into the reverse sampling process of pre-trained DMs, ensuring data consistency and enabling sampling from the conditional distribution. In the context of image reconstruction and restoration tasks, numerous diffusion-based approaches have emerged, as evidenced by works such as (Xie and Li, 2022; Güngör et al., 2023; Peng et al., 2022). Notably, the authors in (Chung and Ye, 2022) and (Chung et al., 2022) introduced SOTA DM-based approaches for addressing the MRI and CT reconstruction inverse problems, respectively. They propose incorporating the predictor-corrector sampling algorithm (Song et al., 2021c) for data consistency, akin to DPS, thereby facilitating sampling from a conditional distribution. One clear distinction between aSeqDIP and DM-based methods is that our approach does not necessitate pre-trained models. For our experiments in MRI, CT, and denoising (as well as in-painting and deblurring) tasks, we will utilize Score-MRI (Chung and Ye, 2022), Manifold Constrained Gradient (MCG) (Chung et al., 2022), and DPS (Chung et al., 2023c), respectively, as DM-based baselines.

5.2 Method

In this section, we begin by investigating the impact of the input on DIP. Then, we introduce our method, aSeqDIP. We note that while we consider linear and non-linear inverse problems, in our formulations we use a linear forward model to simplify notation.

5.2.1 Motivation of aSeqDIP: The Impact of the Network Input in Vanilla DIP

Here, we aim to address the question: how does employing a noisy version of the ground truth image, which retains some structure of the ground truth, as the fixed input to the vanilla DIP objective in (5.1) affect performance? To investigate, we conduct the following experiment. Consider the MRI task defined as y ≈ Ax*. Let the input to the standard DIP objective in (5.1) be denoted as z = x* + δ, where δ ∼ N(0, σ²I). Here, σ controls the magnitude of the perturbations added to the ground truth image, indicating that a larger σ results in a greater deviation between z and x*. We optimize (5.1) for various values of σ, recording the best possible PSNR compared to the ground truth, i.e., prior to the start of the noise-overfitting decay. Figure 5.1 displays the average results for 8 images. Notably, for all images, a closer similarity of the DIP network input to x*, as indicated by σ, corresponds to higher reconstruction quality, measured by PSNR. Larger variance in the standard Gaussian distribution corresponds to larger additive perturbations even for the case of x* = 0 (the red curve). We conjecture that this still leads to larger distances from the ground truth and hence worse performance. Based on this discussion, a notable insight emerges: the proximity of the DIP network input to the ground truth correlates with the quality of the reconstruction. This prompts the question: can we develop an input-adaptive DIP method that mitigates noise overfitting? We proceed to address this question by proposing our method, which we refer to as Autoencoding Sequential DIP (aSeqDIP).
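For concreteness, the following is a minimal PyTorch sketch of this input-perturbation experiment. The masking operator standing in for the MRI forward model, the small CNN standing in for the U-Net, and the sweep values are illustrative assumptions, not the exact setup behind Figure 5.1:

```python
import torch
import torch.nn as nn

# Sketch of the sigma sweep above: run vanilla DIP (5.1) with input z = x* + delta
# and record the best PSNR reached before the overfitting decay sets in.
torch.manual_seed(0)
n = 64
x_star = torch.rand(1, 1, n, n)
mask = (torch.rand_like(x_star) < 0.25).float()     # toy stand-in for the operator A
y = mask * x_star

def small_cnn():
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1))

for sigma in [0.2, 0.8, 1.4, 2.0]:
    z = x_star + sigma * torch.randn_like(x_star)   # perturbed ground-truth input
    net, best = small_cnn(), 0.0
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(500):
        opt.zero_grad()
        out = net(z)
        loss = ((mask * out - y) ** 2).sum()        # vanilla DIP objective (5.1)
        loss.backward()
        opt.step()
        with torch.no_grad():
            psnr = 10 * torch.log10(1.0 / ((out - x_star) ** 2).mean())
            best = max(best, psnr.item())           # best possible PSNR for this sigma
    print(f"sigma={sigma:.1f}  best PSNR={best:.2f} dB")
```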
In Appendix B.1, we provide a case study and theory on the impact of the DIP network input through the lens of the Neural Tangent Kernel in residual networks. The onset of severe noise overfitting therein is delayed for better inputs (Appendix B.1.2).

Figure 5.1 Average best possible PSNR values (in dB) obtained from standard DIP in (5.1) for 8 MRI scans (with 4x acceleration factor) (y-axis), where the network input z is either a perturbed version of the ground truth or pure noise. The noise is zero-mean additive Gaussian noise with strength determined by σ (x-axis).

5.2.2 The Proposed aSeqDIP Algorithm

Consider a U-Net architecture defined by f : \mathbb{R}^n \to \mathbb{R}^n whose weights are given by \phi_k, where k \in [K] and [K] := \{1, \dots, K\}. Each set of parameters in f_{\phi_k} takes an input z_k and outputs f_{\phi_k}(z_k). Based on the insight from the previous subsection, we initially set z_0 to y (resp. A^H y) for denoising, in-painting, and deblurring (resp. MRI and CT). The initialization of \phi_1 follows the same initialization as any other DIP-based method. The parameters in f_{\phi_k} and the input z_k are then updated sequentially through

\phi_k \leftarrow \arg\min_{\phi_k} \|A f_{\phi_k}(z_{k-1}) - y\|_2^2 + \lambda \| f_{\phi_k}(z_{k-1}) - z_{k-1} \|_2^2,   (5.3)

z_k \leftarrow f_{\phi_k}(z_{k-1}),   (5.4)

where \lambda \in \mathbb{R}_+ is a regularization parameter, and \phi_k is initialized with the optimized \phi_{k-1} from (5.3). The final reconstruction is given as

\hat{x} = z_K = f_{\phi_K}(z_{K-1}).   (5.5)

The proposed procedure outlined in (5.3) and (5.4) consists of two key components. First, the optimization of each set of weights in f_{\phi_k} using an objective that consists of the data-consistency term and a second, autoencoding term that aims to alleviate noise overfitting. Second, the update of the input z_k after optimizing each set of weights f_{\phi_k}, so that our method is input-adaptive. Algorithm 5.1 presents the procedure of our proposed approach. As inputs, the algorithm takes y, A, K, N, \lambda, and the learning rate \beta. Apart from the measurements and the forward operator, the remaining parameters are considered hyper-parameters, as is typical in most DIP-based methods. The parameters in f_{\phi_k} are set to \phi_{k-1} (step 2) and subsequently optimized for N iterations using a gradient-based optimizer, such as gradient descent (as depicted in Algorithm 5.1) or Adam (Kingma and Ba, 2014). A block diagram of our proposed aSeqDIP method is presented in Figure 5.2. In the following remarks, we provide insights into our proposed aSeqDIP method.

Remark 1 (Differences from Vanilla DIP (Ulyanov et al., 2018)). Assume that the iterates of (5.3) and (5.4) converge, i.e., as k \to \infty, z_k \to z^* and \phi_k \to \phi^*. Then, according to (5.4), for a continuous mapping f, we have z^* = f_{\phi^*}(z^*). Substituting this into (5.3) in the limit, we get

\phi^* = \{ \arg\min_{\phi} \|A f_{\phi}(z^*) - y\|_2^2 : f_{\phi}(z^*) = z^* \},   (5.6)

which corresponds to the minimizer of

\min_{\phi} \|A f_{\phi}(z) - y\|_2^2 \quad \text{s.t.} \quad z = f_{\phi}(z).   (5.7)

The limit points of aSeqDIP correspond to the solution of a constrained version of the vanilla DIP objective in (5.1). The constraint enforces an additional prior that could alleviate overfitting. While it is not straightforward to use a gradient-based algorithm for (5.7), given the hard constraint, the aSeqDIP scheme's limit points nevertheless minimize (5.7). Furthermore, aSeqDIP automatically estimates the network input by a sequential feed-forward process without needing expensive updates.
The main point is to show that aSeqDIP solves the optimization problem in (5.7), which is different from the vanilla DIP problem in (5.1).

Remark 2 (Differences from Self-Guided DIP (Liang et al., 2024a)). While both aSeqDIP and Self-Guided DIP (Liang et al., 2024a) update the input and network parameters simultaneously, there are fundamental differences. Firstly, Self-Guided DIP solves the optimization problem in (5.2), which does not strictly enforce the auto-encoder constraint z = f_\phi(z) as in (5.7). Secondly, aSeqDIP only requires a network forward pass to update z, resulting in significantly fewer computations, as will be further demonstrated in our experimental results. Thirdly, the second term in the aSeqDIP objective does not require computing an expectation, as it is an auto-encoder rather than a denoiser, which results in higher resistance to noise overfitting. Lastly, our method does not require initializing z randomly or generating random vectors (\eta in (5.2)). The selection of \eta introduces an additional hyper-parameter that we avoid, focusing solely on selecting NK (the total number of iterations) and \lambda (the regularization strength), which are necessary in most DIP-based methods.

Remark 3 (Computational Requirements). The computational requirements of aSeqDIP are determined by two factors: (i) the NK gradient-based parameter updates, and (ii) the number of function evaluations necessary for updating z, which is K. In our experiments, we have found that setting N = 2 and K = 2000 is generally sufficient. This configuration makes aSeqDIP nearly as efficient as vanilla DIP.

Figure 5.2 Illustrative block diagram of the proposed aSeqDIP procedure. Each trapezoid corresponds to the updates of f_{\phi_k}, which takes z_{k-1} as input and is initialized with the optimized parameters \phi_{k-1} for k \in \{2, \dots, K\} (or randomly for k = 1). The optimization of each set of weights follows (5.3) and is run for N steps. The final reconstruction is f_{\phi_K}(z_{K-1}).

Figure 5.3 An overview of differences between aSeqDIP and prior arts in terms of data dependency, network architecture(s), and procedural requisites. 'Data-Dependency' here indicates whether a method depends on a prior reference image or pre-trained models.

Algorithm 5.1 Autoencoding Sequential Deep Image Prior (aSeqDIP).
Input: Measurements y, forward operator A, number of input updates K, number of gradient updates N per input update, regularization parameter \lambda, and learning rate \beta.
Output: Reconstructed image \hat{x}.
Initialization: z_0 = A^H y; \phi_0 \sim \mathcal{N}(0, I).
1: For each k \in [K]:
2:   Initialize \phi_k^{(0)} \leftarrow \phi_{k-1}^{(N)} for k \in \{2, \dots, K\}, and \phi_k^{(0)} \leftarrow \phi_0 for k = 1.
3:   For each i \in [N] (network parameters update):
4:     \phi_k^{(i)} = \phi_k^{(i-1)} - \beta \nabla_{\phi_k} \left[ \|A f_{\phi_k}(z_{k-1}) - y\|_2^2 + \lambda \|f_{\phi_k}(z_{k-1}) - z_{k-1}\|_2^2 \right] \big|_{\phi_k = \phi_k^{(i-1)}}.
5:   Obtain z_k := f_{\phi_k^{(N)}}(z_{k-1}) (network input update).
6: Reconstructed image: \hat{x} = z_K = f_{\phi_K^{(N)}}(z_{K-1}).
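For concreteness, the following is a minimal PyTorch sketch of Algorithm 5.1. The binary masking operator standing in for A (so that A^H y is simply the zero-filled image) and the small CNN standing in for the U-Net are illustrative assumptions, not our experimental configuration:

```python
import torch
import torch.nn as nn

# Minimal sketch of Algorithm 5.1 (aSeqDIP).
# N = 2, K = 2000, lambda = 1 follow the settings reported in Section 5.3.
torch.manual_seed(0)
n, K, N, lam = 64, 2000, 2, 1.0
x_true = torch.rand(1, 1, n, n)
mask = (torch.rand_like(x_true) < 0.25).float()     # toy stand-in for the operator A
y = mask * x_true

net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)   # beta = 1e-4 as in Section 5.3

z = y.clone()                                       # z_0 = A^H y (zero-filled input)
for k in range(K):
    for _ in range(N):                              # N parameter updates per input update
        opt.zero_grad()
        out = net(z)
        loss = ((mask * out - y) ** 2).sum() \
             + lam * ((out - z) ** 2).sum()         # data consistency + autoencoding term
        loss.backward()
        opt.step()
    with torch.no_grad():
        z = net(z)                                  # z_k = f_{phi_k}(z_{k-1}): one forward pass
x_hat = z                                           # final reconstruction z_K
```

Note that, unlike the self-guided sketch in the previous chapter, the input update here is a single forward pass with no gradient step on z.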
Figure 5.3 An overview of differences between aSeqDIP and prior art in terms of data dependency, network architecture(s), and procedural requisites. 'Data-Dependency' here indicates whether a method depends on a prior reference image or pre-trained models.

Remark 4 (Relationship to DMs). aSeqDIP bears resemblance to the reverse process in DMs due to their shared gradual denoising steps. However, despite these similarities, several distinctions emerge. Firstly, unlike the DM network, aSeqDIP does not require encoding a scalar representing time $t$. Secondly, and perhaps most significantly, aSeqDIP operates without requiring any training data or pre-trained networks. Thirdly, aSeqDIP operates in a truly sequential manner in terms of time, whereas in DMs, whether it is training (e.g., denoising score matching (Vincent, 2011)) or sampling, the prevalent technique involves sampling from time $t \sim \mathcal{U}[0, 1]$ (uniform distribution), which allows for non-sequential time points. Figure 5.3 illustrates how different approaches compare to aSeqDIP.

5.2.2.1 Mitigating Noise Overfitting in aSeqDIP

In DIP-based approaches, noise overfitting occurs as the network attempts to fit its output to the noisy or subsampled measurements $\mathbf{y}$ as $k$ increases during training. However, the specific value of $k$ at which this PSNR decay begins is uncertain and varies across tasks, and even among images within the same task and distribution. In aSeqDIP, when the output of network $f_{\phi_k}$ improves compared to that of $f_{\phi_{k-1}}$, the autoencoder term enforces similarity between the input and output of the network, thus delaying the onset of noise overfitting. This occurs because we are not only enforcing the network output to be measurement-consistent, but also enforcing that the output and input become similar. Consequently, as $k$ increases, noise fitting is delayed, and utilizing the autoencoder provides regularization against noise overfitting. In Section 5.3, we will demonstrate how the proposed autoencoding term effectively regulates noise overfitting.

One might expect that incorporating the autoencoder could negatively impact reconstruction quality. However, empirical observations reveal that not only is noise overfitting delayed with the autoencoder term, but image reconstruction quality is also enhanced. To further support this statement, in Appendix B.1.3, we investigate whether an autoencoder trained on clean images can act as a reconstructor at testing time by optimizing over its input.

5.3 Experimental Results

5.3.1 Settings, Datasets, and Baselines

In our experiments, we consider five tasks: MRI reconstruction from undersampled measurements, sparse-view CT image reconstruction, denoising, non-linear deblurring, and in-painting. For MRI, we use the fastMRI dataset. The forward model is $\mathbf{y} \approx \mathbf{A}\mathbf{x}^*$. The multi-coil data is obtained using 15 coils and is cropped to a resolution of 320 × 320 pixels. To simulate undersampling of the MRI k-space, we use a Cartesian mask with 4x and 8x accelerations. Sensitivity maps for the coils are obtained using the BART toolbox (Tamir et al., 2016). For CT, we use the AAPM dataset. For parallel-beam CT, the input image with 512 × 512 pixels is transformed into its sinogram representation using a Radon transform (the operator $\mathbf{A}$). Assuming a monoenergetic source and no scatter or noise, the forward model is $y_i = I_0 e^{-[\mathbf{A}\mathbf{x}^*]_i}$, with $I_0$ denoting the number of incident photons per ray (assumed to be 1 for simplicity) and $i$ indexing the $i$-th measurement or detector pixel. We use the post-log measurements for reconstruction.
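As an illustration of this CT forward model, the snippet below simulates sparse-view post-log measurements with scikit-image's Radon transform; the function name and defaults are ours, and scatter and noise are omitted as stated above.

```python
import numpy as np
from skimage.transform import radon

def post_log_ct_measurements(x, n_views=18, I0=1.0):
    """Simulate sparse-view parallel-beam CT data for a 2D image `x`.
    Returns the post-log sinogram, i.e., [Ax]_i recovered from
    y_i = I0 * exp(-[Ax]_i) (monoenergetic source, no scatter or noise)."""
    theta = np.linspace(0.0, 180.0, n_views, endpoint=False)  # equispaced views
    sino = radon(x, theta=theta)        # line integrals [Ax]_i
    y = I0 * np.exp(-sino)              # raw photon-count model
    return -np.log(y / I0)              # post-log data used for reconstruction
```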
We use a full set of 180 projection angles and simulate two different sparse-view acquisition scenarios (with equispaced angles). Specifically, we created cases with 18 and 30 angles/views. The image resolution is kept at a fixed size. For the tasks of denoising, in-painting, and non-linear deblurring, we use the CBSD68 dataset. For each task, we use 20 measurements/corrupted images. To evaluate the reconstruction quality, we use the Peak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index (Wang et al., 2004).

For experimental settings and baselines, see Table 5.1 and its caption. Note that we consider data-dependent and data-independent baselines, as shown in the third and fourth columns of Table 5.1.

| Task | Setting | Data-independent baselines | Data-dependent baselines |
|---|---|---|---|
| MRI | Ax ∈ {4x, 8x} | Vanilla DIP (Ulyanov et al., 2018), ES-DIP (Wang et al., 2023a), TV-DIP (Liu et al., 2019b), Self-Guided DIP (Liang et al., 2024a) | Ref-Guided DIP (Zhao et al., 2020a), Score-MRI (Chung and Ye, 2022) |
| CT | views ∈ {18, 30} | Vanilla DIP (Ulyanov et al., 2018), Self-Guided DIP (Liang et al., 2024a), Filtered Back Projection (FBP) (Zeng, 2020) | Ref-Guided DIP (Zhao et al., 2020a), MCG (Chung et al., 2022) |
| Denoising | σd ∈ {15, 30} | Vanilla DIP (Ulyanov et al., 2018), ES-DIP (Wang et al., 2023a), Self-Guided DIP (Liang et al., 2024a), TV-DIP (Liu et al., 2019b), Rethinking-DIP (Jo et al., 2021), SGLD-DIP (Cheng et al., 2019) | DPS (Chung et al., 2023c) |
| In-painting | HIAR ∈ {0.1, 0.25} | Vanilla DIP (Ulyanov et al., 2018), ES-DIP (Wang et al., 2023a), Self-Guided DIP (Liang et al., 2024a), SGLD-DIP (Cheng et al., 2019), TV-DIP (Liu et al., 2019b) | DPS (Chung et al., 2023c) |
| Deblurring | BKSE (Tran et al., 2021) | Self-Guided DIP (Liang et al., 2024a), SGLD-DIP (Cheng et al., 2019) | DPS (Chung et al., 2023c) |

Table 5.1 Tasks, settings, and baselines considered in our experiments. For MRI, we consider two acceleration (Ax) factors, 4x and 8x, that determine the subsampling of the measurements. For 2D CT (parallel-beam geometry), we use two sparse-view settings: 18 and 30 views. For denoising, we perturb the ground truth images using two noise levels determined by σd. In in-painting, we use two hole-to-image area ratios (HIAR), 0.1 and 0.25. For non-linear deblurring, we use the Blurring Kernel Space Exploring (BKSE) setting (Tran et al., 2021), described in Equations (56) to (59) of (Chung et al., 2023c). Each baseline that utilizes pre-trained models or a reference image is considered data-dependent. Further details are provided in Appendix B.2.

All the experiments are conducted on a single RTX5000 GPU machine. Further implementation details are provided in Appendix B.2. For the proposed aSeqDIP method in Algorithm 5.1, we use the Adam optimizer with a learning rate of β = 0.0001. Furthermore, the regularization parameter is set to λ = 1 following the ablation study in Appendix B.2.4. We select N = 2 and K = 2000 following the ablation study in Appendix B.2.5.

5.3.2 Impact of the Autoencoding term on Noise Overfitting

In this subsection, we showcase the impact of the proposed autoencoding regularization in aSeqDIP on noise or null-space (nuisance) overfitting. We conducted experiments using 20 MRI scans and 20 CT scans, considering two cases of aSeqDIP as outlined in Algorithm 5.1. The first case sets λ = 1, consistent with the remainder of the paper, while the second case sets λ = 0, effectively disabling the autoencoding regularization term in (5.3).
Additionally, for comparison, we report results for Vanilla DIP and Self-Guided DIP. The average PSNR results for these cases are depicted in Figure 5.4. As observed, when the autoencoder term is disabled in aSeqDIP (black dashed lines), noise overfitting in MRI, akin to Self-Guided DIP, begins after nearly 1600 iterations. For CT, we note that aSeqDIP without regularization starts noise overfitting at around iteration 3800, whereas Self-Guided DIP experiences PSNR decay earlier, after approximately 1250 iterations. Importantly, when the autoencoding term is utilized (black solid lines), not only does the PSNR decay due to noise overfitting not commence until after iteration 4000, but the reconstruction quality (measured by PSNR) also improves. As expected, PSNR decay in Vanilla DIP begins early, at around iterations 500 and 750 for MRI and CT, respectively. In Appendix B.2.4, we provide an ablation study to better show the impact of the value of λ in aSeqDIP.

Figure 5.4 Average PSNR results w.r.t. iteration $i$ for 20 MRI (with 4x) scans (left) and 20 CT (with 18 views) scans (right), showing the impact of the proposed autoencoding regularization term on noise overfitting in aSeqDIP. Furthermore, average results of Vanilla DIP and Self-Guided DIP are also reported for comparison. For aSeqDIP, iteration $i \in [NK]$, where $N = 2$. Vertical lines approximately indicate the start of the PSNR decay for every case. In Appendix B.2.1, we include the PSNR curves of aSeqDIP and other DIP-based methods for the task of denoising.

5.3.3 Main Results

Here, we present our primary results regarding the reconstruction quality, measured by PSNR and SSIM, as well as the associated run-time. Table 5.2 presents the results for the considered tasks in this paper. Column 3 indicates whether the baselines depend on prior data or pre-trained models. The last three columns provide the PSNR, SSIM, and run-time results, where the arrows indicate the favorable direction. For PSNR and SSIM, the settings correspond to the second column of Table 5.1; within each parenthesized pair, the first (resp. second) value corresponds to the first (resp. second) setting. Values after the ± sign indicate standard deviation. Subsequently, we offer observations on the main results.

Compared to data-independent methods, i.e., the baselines that do not depend on a reference image or pre-trained models, aSeqDIP demonstrates improved PSNR and SSIM scores.
For example, for MRI at 8x acceleration, aSeqDIP shows nearly a 1 dB improvement over all conventional baselines other than Self-Guided DIP. For the task of 30-views CT, aSeqDIP reports an SSIM score of 0.92, which is 5% more than the second best, Self-Guided DIP, with an SSIM of 0.872.

| Task | Method | Data Independency | PSNR (dB) (↑) (Setting 1, Setting 2) | SSIM ∈ [0, 1] (↑) (Setting 1, Setting 2) | Run-time (↓) (minutes) |
|---|---|---|---|---|---|
| MRI | Score-MRI | × | (31.51±0.45, 29.61±0.44) | (0.891±0.012, 0.862±0.014) | 6.2±0.12 |
| MRI | Ref-Guided DIP | × | (33.17±0.27, 30.23±0.24) | (0.912±0.021, 0.873±0.016) | 2.5±0.2 |
| MRI | TV-DIP | ✓ | (30.52±0.25, 29.20±0.37) | (0.872±0.022, 0.852±0.022) | 2.5±0.1 |
| MRI | ES-DIP | ✓ | (31.02±0.34, 29.44±0.45) | (0.882±0.031, 0.858±0.028) | 1.56±0.34 |
| MRI | Vanilla DIP | ✓ | (30.21±0.42, 28.75±0.33) | (0.865±0.02, 0.842±0.022) | 1.5±0.12 |
| MRI | Self-Guided DIP | ✓ | (33.6±0.23, 30.75±0.25) | (0.922±0.008, 0.874±0.006) | 4.5±0.67 |
| MRI | aSeqDIP (Ours) | ✓ | (34.08±0.41, 31.34±0.47) | (0.929±0.008, 0.887±0.009) | 2.2±0.12 |
| CT | MCG | × | (32.82±0.52, 31.35±0.49) | (0.912±0.08, 0.852±0.09) | 6.4±0.2 |
| CT | FBP | ✓ | (22.92±0.22, 19.52±0.32) | (0.75±0.021, 0.68±0.023) | 0.2±0.01 |
| CT | Ref-Guided DIP | × | (31.21±0.24, 28.31±0.42) | (0.892±0.023, 0.842±0.021) | 2.5±0.42 |
| CT | Vanilla DIP | ✓ | (26.21±0.12, 24.31±0.34) | (0.791±0.021, 0.772±0.012) | 1.5±0.21 |
| CT | Self-Guided DIP | ✓ | (33.95±0.32, 31.95±0.32) | (0.918±0.02, 0.872±0.031) | 4.5±0.56 |
| CT | aSeqDIP (Ours) | ✓ | (34.88±0.36, 33.09±0.39) | (0.941±0.026, 0.92±0.022) | 2.2±0.42 |
| Denoising | DPS | × | (31.02±0.25, 28.2±0.31) | (0.912±0.02, 0.882±0.021) | 2.5±0.17 |
| Denoising | Vanilla DIP | ✓ | (30.48±0.28, 27.84±0.32) | (0.905±0.021, 0.871±0.030) | 1.5±0.22 |
| Denoising | SGLD DIP | ✓ | (30.58±0.34, 28.12±0.42) | (0.908±0.021, 0.877±0.017) | 3.2±0.24 |
| Denoising | TV-DIP | ✓ | (30.57±0.31, 28.47±0.26) | (0.914±0.022, 0.882±0.014) | 2.5±0.24 |
| Denoising | Rethinking-DIP | ✓ | (30.98±0.31, 28.67±0.25) | (0.912±0.02, 0.887±0.03) | 2.5±0.34 |
| Denoising | ES-DIP | ✓ | (31.11±0.23, 28.12±0.41) | (0.914±0.017, 0.886±0.024) | 1.45±0.44 |
| Denoising | Self-Guided DIP | ✓ | (31.21±0.26, 28.31±0.35) | (0.916±0.02, 0.891±0.03) | 3.5±0.45 |
| Denoising | aSeqDIP (Ours) | ✓ | (31.51±0.34, 28.97±0.44) | (0.926±0.021, 0.908±0.031) | 2.4±0.45 |
| In-Painting | DPS | × | (23.9±0.45, 22.03±0.36) | (0.817±0.023, 0.762±0.021) | 2.5±0.3 |
| In-Painting | Vanilla DIP | ✓ | (22.56±0.31, 21.32±0.67) | (0.754±0.023, 0.721±0.012) | 1.5±0.35 |
| In-Painting | SGLD DIP | ✓ | (23.09±0.55, 21.41±0.45) | (0.772±0.023, 0.732±0.041) | 2.5±0.45 |
| In-Painting | TV-DIP | ✓ | (22.87±0.45, 21.64±0.51) | (0.774±0.04, 0.742±0.042) | 2.5±0.31 |
| In-Painting | ES-DIP | ✓ | (23.33±0.44, 21.89±0.28) | (0.781±0.034, 0.745±0.041) | 1.25±0.55 |
| In-Painting | Self-Guided DIP | ✓ | (23.84±0.43, 21.78±0.52) | (0.792±0.042, 0.752±0.064) | 3.5±0.45 |
| In-Painting | aSeqDIP (Ours) | ✓ | (24.56±0.45, 22.57±0.47) | (0.838±0.051, 0.778±0.045) | 2.4±0.54 |
| Deblurring | DPS | × | (23.40±0.56) | (0.776±0.032) | 2.24±0.65 |
| Deblurring | SGLD DIP | ✓ | (19.80±0.43) | (0.720±0.03) | 3.24±0.55 |
| Deblurring | Self-Guided DIP | ✓ | (20.34±0.55) | (0.732±0.025) | 3.4±1.02 |
| Deblurring | aSeqDIP (Ours) | ✓ | (23.89±0.40) | (0.792±0.033) | 2.5±0.78 |

Table 5.2 Average PSNR, SSIM, and run-time results reported by our method against the selected baselines for the tasks of MRI reconstruction, CT reconstruction, image denoising, in-painting, and non-linear deblurring. 'Data Independency' in column 3 indicates whether the methods depend on prior data or pre-trained models. Setting 1 and Setting 2 in the fourth and fifth columns correspond to the scenarios in the second column of Table 5.1. For tasks with two settings, the run-time results are averaged over the two settings. Values past ± represent the standard deviation. See Appendix B.2.2 and Appendix B.2.3 for more comparison results.
Although improvements over Self-Guided DIP are generally marginal in terms of reconstruction quality, our method proves to be 2x faster for MRI and CT reconstruction and requires 1 minute less than Self-Guided DIP for denoising and in-painting. This speed-up is attributed to updating the input using one forward pass of the trained network at each iteration $k$, instead of computing gradients with respect to the input for the update. Compared to Vanilla DIP, our method, on average, only requires an additional 30 to 60 seconds. When compared to ES-DIP (Wang et al., 2023a), our method requires a longer run-time, but on average achieves better reconstruction results across three tasks and different settings.

Figure 5.5 Reconstructed/recovered images using our proposed approach, aSeqDIP, and the baselines for the considered tasks. The ground truth (GT) and degraded images are shown in the first and second columns, respectively, followed by three or four baselines per task. The last column presents our method. PSNR results are given at the bottom of each reconstructed image. For MRI (8x undersampling) and CT (18 views), the top right box shows the absolute difference between the center region box of the reconstructed image and the same region in the GT image. Denoising and in-painting used $\sigma_d = 25$ and HIAR = 0.25. For the task of deblurring, aSeqDIP contains artifacts when compared to DPS. However, DPS generates a perceptually different image when compared to the GT. For all other tasks, aSeqDIP reconstructions contain sharper and clearer image features than other methods.

In comparison to data-dependent methods such as Score-MRI and MCG, our approach not only yields the best PSNR and SSIM, but also requires reduced run-time, all without requiring any training data or pre-trained models. For instance, on average, aSeqDIP achieves nearly a 2 dB improvement in 30-views CT compared to MCG while being 2x faster. In comparison to DPS, on average, our method reports higher SSIM. Our method requires slightly less run-time on average, while enhancing the PSNR by approximately 0.6 dB for both denoising and in-painting. Notably, our method is an optimization-based approach, whereas DM-based methods only require function evaluations. However, the generally larger run-time reported for DM-based methods is due to the necessity of running a large number of reverse sampling steps. When compared to Ref-Guided DIP, our method achieves higher PSNR and SSIM results without the need for any prior (or reference) image.

5.3.4 Visualizations

Figure 5.5 shows reconstructed images for the five considered tasks using aSeqDIP and the other baselines. Each row corresponds to a task. The first column displays the ground truth (GT) image, whereas the second column shows the degraded image.
Columns 3 through the second-to-last present the reconstructed images from the baselines, while the last column shows the reconstructed images from aSeqDIP. PSNR values are provided at the bottom of each reconstructed image. As observed, aSeqDIP achieves the highest PSNR scores. Additionally, the top right green boxes, which show the difference between the central region of the reconstructed and GT images, indicate that for MRI and CT, our method visually exhibits the least difference, making it the closest to the GT. A similar observation holds for the denoising task in the zoomed-in bottom box. For in-painting, we note that aSeqDIP introduces the fewest unwanted artifacts, as observed in the clouds (for DPS) and the left wing of the plane. While aSeqDIP contains artifacts for the task of deblurring when compared to DPS, the latter generates a perceptually different image when compared to the GT. Similar observations are noticed in the additional visualizations provided in Appendix B.3.

5.4 Conclusions & Future Work

In this paper, we introduced Autoencoding Sequential Deep Image Prior (aSeqDIP), a new unsupervised image recovery algorithm. Notably, aSeqDIP operates without the need for pre-trained models, relying solely on a sequential update of network parameters. These parameters are optimized using an input-adaptive data consistency objective combined with autoencoding regularization, effectively mitigating noise overfitting. Our experimental results across various tasks highlight the competitive performance of the proposed algorithm, matching (or outperforming) diffusion-based methods in terms of reconstruction quality and required run-time, all without the need for pre-trained models. For future directions, we aim to explore the applicability of aSeqDIP to other image recovery problems, thereby expanding its versatility and potential impact across diverse domains. Additionally, we are interested in investigating the integration of a network input update mechanism to dynamically adjust the autoencoding regularization parameter and the number of gradient updates per iteration.

CHAPTER 6
MRI RECONSTRUCTION BY SMOOTHED UNROLLING

6.1 Introduction

Following the last chapter, we now explore another direction of this research, which focuses on enhancing the robustness of deep learning methods. As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case or random additive perturbations. This sensitivity often leads to unstable, aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that are robust to these variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noise, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of DL-based MRI reconstruction models.
We theoretically analyze the robustness of our method in the presence of perturbations. Compared to vanilla RS and other recent approaches, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps.

6.2 Preliminaries and Problem Statement

6.2.1 Setup of MRI Reconstruction

Many medical imaging approaches involve ill-posed inverse problems, such as the work in (Donoho, 2006a), where the aim is to reconstruct the original signal $\mathbf{x} \in \mathbb{C}^q$ (vectorized image) from undersampled k-space (Fourier domain) measurements $\mathbf{y} \in \mathbb{C}^p$ with $p < q$. The imaging system in MRI can be modeled as a linear system $\mathbf{y} \approx \mathbf{A}\mathbf{x}$, where $\mathbf{A}$ may take on different forms for single-coil or parallel (multi-coil) MRI, etc. For example, in the single-coil Cartesian MRI acquisition setting, $\mathbf{A} = \mathbf{M}\mathbf{F}$, where $\mathbf{F}$ is the 2D discrete Fourier transform and $\mathbf{M}$ is a masking operator that implements undersampling. With the linear observation model, MRI reconstruction is often formulated as

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \mathcal{R}(\mathbf{x}), \quad (6.1)$$

where $\mathcal{R}(\cdot)$ is a regularization function (e.g., the $\ell_1$ norm in the wavelet domain to impose a sparsity prior (Mihcak et al., 1999)), and $\lambda > 0$ is the regularization parameter.

MoDL (Aggarwal et al., 2019a) is a recent popular supervised deep learning approach inspired by the MR image reconstruction optimization problem in (6.1). MoDL combines a denoising network with a data-consistency (DC) module in each iteration of an unrolled architecture. In MoDL, the hand-crafted regularizer $\mathcal{R}$ is replaced by a learned network-based prior $\|\mathbf{x} - \mathcal{D}_\theta(\mathbf{x})\|_2^2$ involving a network $\mathcal{D}_\theta$. MoDL attempts to optimize this loss by initializing $\mathbf{x}_0 = \mathbf{A}^H\mathbf{y}$ and then iterating the following process for a number of unrolling steps indexed by $n \in \{0, \dots, N-1\}$. Specifically, the MoDL iterations are given by

$$\mathbf{x}_{n+1} = \arg\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{x} - \mathcal{D}_\theta(\mathbf{x}_n)\|_2^2. \quad (6.2)$$

After $N$ iterations, we denote the final output of MoDL as $\mathbf{x}_N = F_{\text{MoDL}}(\mathbf{x}_0)$. The weights of the denoiser are shared across the $N$ blocks and are learned in an end-to-end supervised manner (Aggarwal et al., 2019a).

6.2.2 Lack of Robustness of DL-based Reconstructors

In (Antun et al., 2020a), it was demonstrated that deep learning-based MRI reconstruction can exhibit instability when confronted with subtle, nearly imperceptible input perturbations. These perturbations are commonly referred to as 'adversarial perturbations' and have been extensively investigated in the context of DL-based image classification tasks, as outlined in (I et al., 2015). In the context of MRI, these perturbations represent the worst-case additive perturbations, which can be used to evaluate method sensitivity and robustness (Antun et al., 2020a; Jia et al., 2022a).

Let $\boldsymbol{\delta}$ denote a small perturbation of the measurements that falls in an $\ell_\infty$ ball of radius $\epsilon$, i.e., $\|\boldsymbol{\delta}\|_\infty \le \epsilon$. Adversarial disturbances then correspond to the worst-case input perturbation vector $\boldsymbol{\delta}$ that maximizes the reconstruction error, i.e.,

$$\max_{\|\boldsymbol{\delta}\|_\infty \le \epsilon} \|F_{\text{MoDL}}(\mathbf{A}^H(\mathbf{y} + \boldsymbol{\delta})) - \mathbf{t}\|_2^2, \quad (6.3)$$

where $\mathbf{t}$ is a ground truth target image from the training set (i.e., label). The operator $\mathbf{A}^H$ transforms the measurements $\mathbf{y}$ to the image domain, and $\mathbf{A}^H\mathbf{y}$ is the input (aliased) image to the reconstruction model. The optimization problem in (6.3) can be effectively solved using the iterative projected gradient descent (PGD) method (Madry et al., 2017).
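For concreteness, a minimal PyTorch sketch of the PGD attack in (6.3) is given below; `recon` stands for a trained reconstructor (e.g., $F_{\text{MoDL}}$) and `AH` for the adjoint operator (both illustrative names), and the data is assumed real-valued with real/imaginary parts stacked as channels. The step-size rule is a common heuristic, not taken from this chapter.

```python
import torch

def pgd_perturbation(y, t, recon, AH, eps=0.02, steps=10):
    """Find an l_inf-bounded delta that (approximately) maximizes the
    reconstruction error in (6.3) via projected gradient ascent."""
    delta = torch.zeros_like(y, requires_grad=True)
    alpha = 2.5 * eps / steps                    # heuristic PGD step size
    for _ in range(steps):
        loss = ((recon(AH(y + delta)) - t) ** 2).sum()
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()         # ascend the error
            delta.clamp_(-eps, eps)              # project onto the l_inf ball
    return delta.detach()
```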
In Fig. 6.1-(a) and (b), we show reconstructed images using MoDL originating from a benign (i.e., undisturbed) input and a PGD scheme-perturbed input, respectively. It is evident that the worst-case input disturbance significantly deteriorates the quality of the reconstructed image.

Figure 6.1 MoDL's instabilities resulting from perturbations to input data, the measurement sampling rate, and the number of unrolling steps used at the testing phase, shown on an image from the fastMRI dataset (Zbontar et al., 2018). We refer readers to Section 6.4 for further details about the experimental settings. (a) MoDL reconstruction from benign (i.e., without additional noise/perturbation) measurements with 4× acceleration (i.e., 25% sampling rate) and 8 unrolling steps. (b) MoDL reconstruction from disturbed input with perturbation strength $\epsilon = 0.02$ (see Section 6.4.1). (c) MoDL reconstruction from clean measurements with 2× acceleration (i.e., 50% sampling), using 8 unrolling steps. (d) MoDL reconstruction from clean or unperturbed measurements with 4× acceleration and 16 unrolling steps. In (b), (c), and (d), the network trained in (a) is used.

While one focus of this work is to enhance robustness against input perturbations, Fig. 6.1-(c) and (d) highlight two additional potential sources of instability that the reconstructor (MoDL) can encounter during testing: variations in the measurement sampling rate (resulting in "perturbations" to the sparsity of the sampling mask in $\mathbf{A}$) (Antun et al., 2020a), and changes in the number of unrolling steps (Gilton et al., 2021a). In scenarios where the sampling mask (Fig. 6.1-(c)) or the number of unrolling steps (Fig. 6.1-(d)) deviates from the settings used during MoDL training, we observe a significant degradation in performance compared to the original setup (Fig. 6.1-(a)), even in the absence of additive measurement perturbations. In Section 6.4, we demonstrate how our method improves the reconstruction robustness in the presence of different types of perturbations, including those in Fig. 6.1.

6.2.3 Randomized Smoothing (RS)

Randomized smoothing, introduced in (Cohen et al., 2019), enhances the robustness of DL models against noisy inputs. It is implemented by generating multiple randomly modified versions of the input data and subsequently calculating an averaged output from this diverse set of inputs. Given some function $f(\mathbf{x})$, RS formally replaces $f$ with a smoothed version

$$g(\mathbf{x}) := \mathbb{E}_{\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})} [f(\mathbf{x} + \boldsymbol{\eta})], \quad (6.4)$$

where $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$ denotes a Gaussian distribution with zero mean and element-wise variance $\sigma^2$, and $\mathbf{I}$ denotes the identity matrix of appropriate size. Prior research has shown that RS is effective as an adversarial defense approach in DL-based image classification tasks (Cohen et al., 2019; Salman et al., 2020; Zhang et al., 2022). However, the question of whether RS can significantly improve the robustness of MoDL and other image reconstructors has not been thoroughly explored. A preliminary investigation in this area was conducted by (Wolf, 2019), which demonstrated the integration of RS into MR image reconstruction in an end-to-end (E2E) setting. We can formulate image reconstruction using RS-E2E as

$$\mathbf{x}_{\text{RS-E2E}} = \mathbb{E}_{\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})} [F_{\text{MoDL}}(\mathbf{A}^H(\mathbf{y} + \boldsymbol{\eta}))]. \quad \text{(RS-E2E)}$$

This formulation aligns with the one used in (Wolf, 2019), where the random noise vector $\boldsymbol{\eta}$ is directly added to $\mathbf{y}$ in the frequency domain (complex-valued), followed by multiplication with $\mathbf{A}^H$ to obtain the input image for MoDL. The noisy measurements are also utilized in each iteration of MoDL.
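In practice, the expectation in (6.4) is approximated with Monte Carlo samples (10 draws are used in the experiments of Section 6.4). A minimal sketch, with illustrative names:

```python
import torch

def smooth(f, x, sigma=0.01, n_samples=10):
    """Monte Carlo estimate of g(x) = E_{eta ~ N(0, sigma^2 I)}[f(x + eta)]
    from (6.4). `f` can be a denoiser or an end-to-end reconstructor as in
    (RS-E2E)."""
    outs = [f(x + sigma * torch.randn_like(x)) for _ in range(n_samples)]
    return torch.stack(outs).mean(dim=0)
```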
RS-E2E can be identically formulated for alternative reconstruction models. Fig. 6.2 shows a block diagram of RS-E2E-backed MoDL. This RS-integrated MoDL is trained with supervision in the standard manner. Although RS-E2E represents a straightforward application of RS to MoDL, it remains unclear whether the formulation in (RS-E2E) is the most effective way to incorporate RS into unrolled algorithms such as MoDL, considering the latter's specialties, e.g., the involved denoising and data-consistency (DC) steps. As such, for the rest of the paper, we focus on studying the following questions (Q1)–(Q4).

• (Q1): How should RS be integrated into an unrolled algorithm such as MoDL?
• (Q2): How do we learn the network $\mathcal{D}_\theta(\cdot)$ in the presence of RS operations?
• (Q3): Can we prove the robustness of SMUG in the presence of data perturbations?
• (Q4): Can we further improve the RS operation in SMUG for enhanced image quality or sharpness?

6.3 Methodology

In this section, we address questions (Q1)–(Q4) by taking the unrolling characteristics of MoDL into the design of an RS-based MRI reconstruction. The proposed novel integration of RS with MoDL is termed Smoothed Unrolling (SMUG). We also explore an extension to SMUG. We note that while we develop our methods based on MoDL, in the last subsection we discuss incorporating our approaches within other unrolling methods such as ISTA-Net.

6.3.1 Solution to (Q1): RS at intermediate unrolled denoisers

As illustrated in Fig. 6.2, the RS operation in RS-E2E is typically applied to MoDL in an end-to-end manner. This does not shed light on which component of MoDL needs to be made more robust. Here, we explore integrating RS at each intermediate unrolling step of MoDL. In this subsection, we present SMUG, which applies RS to the denoising network. This seemingly simple modification is related to a robustness certification technique known as "denoised smoothing" (Salman et al., 2020), in which a smoothed denoiser is used and proves to be sufficient for establishing robustness in the model. We use $\mathbf{x}^n_S$ to denote the $n$-th iterate of SMUG. Starting from $\mathbf{x}^0_S = \mathbf{A}^H\mathbf{y}$, the procedure is given by

$$\mathbf{x}^{n+1}_S = \arg\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \|\mathbf{x} - \mathbb{E}_{\boldsymbol{\eta}}[\mathcal{D}_\theta(\mathbf{x}^n_S + \boldsymbol{\eta})]\|_2^2, \quad (6.5)$$

where $\boldsymbol{\eta}$ is drawn from $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. After $N$ iterations, the final output of SMUG is denoted by $\mathbf{x}^N_S = F_{\text{SMUG}}(\mathbf{x}^0)$. Fig. 6.3 presents the architecture of SMUG.

Figure 6.2 A schematic overview of RS-E2E. Here, iterative unrolling takes place between the data consistency and denoising blocks for multiple noisy versions of the input.

Figure 6.3 Architecture of SMUG. Here, for every unrolling step, after applying the denoiser to each noisy version of the input, data consistency is applied to the average of the denoised images.
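The sketch below illustrates one SMUG iteration (6.5) under the assumption of real-valued tensors: the denoiser is smoothed by Monte Carlo averaging, and the data-consistency minimization is solved with a few conjugate-gradient steps, mirroring MoDL's DC block. All names (`denoiser`, `AH_A`, `AH_y`) are illustrative.

```python
import torch

def smug_step(x_n, AH_y, AH_A, denoiser, lam=1.0, sigma=0.01,
              n_mc=10, cg_iters=10):
    """One SMUG unrolling step (6.5). `AH_A` is a callable applying A^H A;
    `AH_y` is the precomputed A^H y. The DC minimization is equivalent to
    solving (A^H A + lam*I) x = A^H y + lam * z."""
    z = torch.stack([denoiser(x_n + sigma * torch.randn_like(x_n))
                     for _ in range(n_mc)]).mean(dim=0)   # E_eta[D(x^n + eta)]
    b = AH_y + lam * z
    x = torch.zeros_like(b)              # conjugate gradient on the SPD system
    r = b.clone(); p = r.clone(); rs = (r * r).sum()
    for _ in range(cg_iters):
        Ap = AH_A(p) + lam * p
        a = rs / (p * Ap).sum()
        x = x + a * p
        r = r - a * Ap
        rs_new = (r * r).sum()
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```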
6.3.2 Solution to (Q2): SMUG's pre-training & fine-tuning

In this subsection, we develop the training scheme of SMUG. Inspired by the currently celebrated "pre-training + fine-tuning" technique (Zoph et al., 2020; Salman et al., 2020), we propose to train SMUG following this learning paradigm. Our rationale is that pre-training can provide a robustness-aware initialization of the DL-based denoising network for fine-tuning. To pre-train the denoising network $\mathcal{D}_\theta$, we consider a mean squared error (MSE) loss that measures the Euclidean distance between images denoised by $\mathcal{D}_\theta$ and the target (ground truth) images, denoted by $\mathbf{t}$. This leads to the pre-training step

$$\theta_{\text{pre}} = \arg\min_\theta \mathbb{E}_{\mathbf{t} \in \mathcal{T}} \left[ \mathbb{E}_{\boldsymbol{\eta}} \|\mathcal{D}_\theta(\mathbf{t} + \boldsymbol{\eta}) - \mathbf{t}\|_2^2 \right], \quad (6.6)$$

where $\mathcal{T}$ is the set of ground truth images in the training dataset.

Next, we develop the fine-tuning scheme to improve $\theta_{\text{pre}}$ based on the labeled/paired MRI dataset. Since RS in SMUG (Fig. 6.3) is applied to every unrolling step, we propose an unrolled stability (UStab) loss for fine-tuning $\mathcal{D}_\theta$:

$$\ell_{\text{UStab}}(\theta; \mathbf{y}, \mathbf{t}) = \sum_{n=0}^{N-1} \mathbb{E}_{\boldsymbol{\eta}} \|\mathcal{D}_\theta(\mathbf{x}^n + \boldsymbol{\eta}) - \mathcal{D}_\theta(\mathbf{t})\|_2^2. \quad (6.7)$$

The UStab loss in (6.7) relies on the target images, bringing in a key benefit: the denoising stability is guided by the reconstruction accuracy of the ground-truth image, yielding a graceful trade-off between robustness and accuracy. Integrating the UStab loss defined in (6.7) with the standard reconstruction loss, we obtain the fine-tuned $\theta$ by minimizing $\mathbb{E}_{(\mathbf{y},\mathbf{t})}[\ell(\theta; \mathbf{y}, \mathbf{t})]$, where

$$\ell(\theta; \mathbf{y}, \mathbf{t}) = \ell_{\text{UStab}}(\theta; \mathbf{y}, \mathbf{t}) + \lambda_\ell \|F_{\text{SMUG}}(\mathbf{A}^H\mathbf{y}) - \mathbf{t}\|_2^2, \quad (6.8)$$

with $\lambda_\ell > 0$ representing a regularization parameter that strikes a balance between the reconstruction error (for accuracy) and the denoising stability (for robustness) terms. We initialize $\theta$ as $\theta_{\text{pre}}$ when optimizing (6.8) using standard optimizers such as Adam (Kingma and Ba, 2015a). In practice, the same dataset is used for fine-tuning as for pre-training, because the pre-trained model is initially trained solely as a denoiser, while the fine-tuning process aims at integrating the entire regularization strategy applied to the MoDL framework. This approach ensures that the fine-tuning optimally adapts the model to the specific enhancements introduced by our robustification strategies.
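A minimal sketch of the UStab loss (6.7) and the combined fine-tuning objective (6.8), again with Monte Carlo approximation of the expectation over $\boldsymbol{\eta}$; `xs` holds the intermediate iterates $\mathbf{x}^0, \dots, \mathbf{x}^{N-1}$, and all names are illustrative.

```python
import torch

def ustab_loss(xs, t, denoiser, sigma=0.01, n_mc=10):
    """Unrolled stability loss (6.7): denoised noisy iterates are pulled
    toward the adaptively denoised target D_theta(t)."""
    target = denoiser(t)
    loss = 0.0
    for x_n in xs:  # x^0, ..., x^{N-1}
        for _ in range(n_mc):
            noisy = x_n + sigma * torch.randn_like(x_n)
            loss = loss + ((denoiser(noisy) - target) ** 2).sum() / n_mc
    return loss

def finetune_loss(xs, recon_out, t, denoiser, lam_ell=1.0):
    """Combined objective (6.8), where `recon_out` = F_SMUG(A^H y) is the
    output of the full SMUG forward pass."""
    return ustab_loss(xs, t, denoiser) + lam_ell * ((recon_out - t) ** 2).sum()
```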
6.3.3 Answer to (Q3): Analyzing the robustness of SMUG in the presence of data perturbations

The following theorem discusses the robustness (i.e., sensitivity to input perturbations) achieved with SMUG. Note that all norms on vectors (resp. matrices) denote the $\ell_2$ norm (resp. spectral norm) unless indicated otherwise.

Theorem 6.3.1. Assume the denoiser network's output is bounded in norm. Given the initial input image $\mathbf{A}^H\mathbf{y}$ obtained from measurements $\mathbf{y}$, let the SMUG reconstructed image at the $n$-th unrolling step be $\mathbf{x}^n_S(\mathbf{A}^H\mathbf{y})$ with RS variance $\sigma^2$. Let $\boldsymbol{\delta}$ denote an additive perturbation to the measurements $\mathbf{y}$. Then,

$$\|\mathbf{x}^n_S(\mathbf{A}^H\mathbf{y}) - \mathbf{x}^n_S(\mathbf{A}^H(\mathbf{y} + \boldsymbol{\delta}))\| \le C_n \|\boldsymbol{\delta}\|, \quad (6.9)$$

where

$$C_n = \alpha \|\mathbf{A}\|_2 \left( \frac{1 - \left( \frac{M\alpha}{\sqrt{2\pi}\,\sigma} \right)^n}{1 - \frac{M\alpha}{\sqrt{2\pi}\,\sigma}} \right) + \|\mathbf{A}\|_2 \left( \frac{M\alpha}{\sqrt{2\pi}\,\sigma} \right)^n,$$

with $\alpha = \|(\mathbf{A}^H\mathbf{A} + \mathbf{I})^{-1}\|_2$ and $M = 2\max_{\mathbf{x}} \|\mathcal{D}_\theta(\mathbf{x})\|$.

The proof is provided in the Appendix. Note that the output of SMUG $\mathbf{x}^n_S(\cdot)$ depends on both the initial input (here $\mathbf{A}^H\mathbf{y}$) and the measurements $\mathbf{y}$. We abbreviated it to $\mathbf{x}^n_S(\mathbf{A}^H\mathbf{y})$ in the theorem and proof for notational simplicity. The constant $C_n$ depends on the number of iterations or unrolling steps $n$ as well as the RS standard deviation parameter $\sigma$. For large $\sigma$, the robustness error bound for SMUG clearly decreases as the number of iterations $n$ increases. In particular, if $\sigma > M\alpha/\sqrt{2\pi}$, then as $n \to \infty$, $C_n \to \alpha\|\mathbf{A}\|_2 / \big(1 - \frac{M\alpha}{\sqrt{2\pi}\,\sigma}\big)$. Furthermore, as $\sigma \to \infty$, $C_n \to C \triangleq \alpha\|\mathbf{A}\|_2$. Clearly, if $\alpha \le 1$ and $\|\mathbf{A}\|_2 \le 1$ (normalized), then $C \le 1$. Thus, for sufficient smoothing, the error introduced in the SMUG output due to an input perturbation never gets worse than the size of the input perturbation. Therefore, the output is stable with respect to (w.r.t.) perturbations.

These results corroborate the experimental results in Section 6.4 on how SMUG remains robust (whereas other methods, such as vanilla MoDL, break down) when increasing the number of unrolling steps at test time, and is also more robust for larger $\sigma$ (with a good accuracy-robustness trade-off).

The only assumption in our analysis is that the denoiser network output is bounded in norm. This consideration is handled readily when the denoiser network incorporates bounded activation functions such as the sigmoid or hyperbolic tangent. Alternatively, if we expect image intensities to lie within a certain range, a simple clipping operation at the network output would ensure boundedness for the analysis. A key distinction between SMUG and prior works, such as RS-E2E (Wolf, 2019), is that smoothing is performed in every iteration. Moreover, while (Wolf, 2019) assumes the end-to-end mapping is bounded, in MoDL or SMUG it clearly is not, because the data-consistency step's output is unbounded as $\mathbf{y}$ grows. We remark that our intention with Theorem 6.3.1 is to establish a baseline of robustness intrinsic to models with unrolling architectures.

6.3.4 Solution to (Q4): Weighted Smoothing

In this subsection, we present a modified formulation of randomized smoothing to improve its performance in SMUG. Randomized smoothing in practice involves uniformly averaging images denoised with random perturbations. This can be viewed as a type of mean filter, which leads to oversmoothing of structural information in practice. As such, we propose weighted randomized smoothing, which employs an encoder to assess a weighting (scalar) for each denoised image and subsequently applies the optimal weightings while aggregating images to enhance the reconstruction performance. Our method not only surpasses the SMUG technique but also excels at enhancing image sharpness across various types of perturbation sources. This allows for a more versatile, flexible, and effective approach for improving image quality under different conditions. The weighted randomized smoothing operation applied to a function $f(\cdot)$ is as follows:

$$g_w(\mathbf{x}) := \frac{\mathbb{E}_{\boldsymbol{\eta}}[w(\mathbf{x} + \boldsymbol{\eta}) f(\mathbf{x} + \boldsymbol{\eta})]}{\mathbb{E}_{\boldsymbol{\eta}}[w(\mathbf{x} + \boldsymbol{\eta})]}, \quad (6.10)$$

where $w(\cdot)$ is an input-dependent weighting function.

Based on the weighted smoothing in (6.10), we introduce Weighted SMUG. This approach involves applying weighted RS at each denoising step, and the weighting encoder is trained in conjunction with the denoiser during the fine-tuning stage. For the weighting encoder in our experiments, we use a simple architecture consisting of five successive convolution, batch normalization, and ReLU activation layers, followed by a linear layer and sigmoid activation. Specifically, in the $n$-th unrolling step, we use a weighting encoder $\mathcal{E}_\phi$, parameterized by $\phi$, to learn the weight of each image used for (weighted) averaging. Here, we use $\mathbf{x}^n_W$ to denote the output of the $n$-th block. Initializing $\mathbf{x}^0_W = \mathbf{A}^H\mathbf{y}$, the output of Weighted SMUG w.r.t. $n$ is

$$\mathbf{x}^{n+1}_W = \arg\min_{\mathbf{x}} \|\mathbf{A}\mathbf{x} - \mathbf{y}\|_2^2 + \lambda \left\| \mathbf{x} - \frac{\mathbb{E}_{\boldsymbol{\eta}}[\mathcal{E}_\phi(\mathbf{x}^n_W + \boldsymbol{\eta}) \, \mathcal{D}_\theta(\mathbf{x}^n_W + \boldsymbol{\eta})]}{\mathbb{E}_{\boldsymbol{\eta}}[\mathcal{E}_\phi(\mathbf{x}^n_W + \boldsymbol{\eta})]} \right\|_2^2. \quad (6.11)$$

After $N$ iterations, the final output of Weighted SMUG is $\mathbf{x}^N_W = F_{\text{wSMUG}}(\mathbf{x}^0)$. Figure 6.4 illustrates the block diagram of Weighted SMUG.

Figure 6.4 Architecture of weighted SMUG. Here, we extend SMUG by including the weight encoder and the use of weighted randomized smoothing.

Furthermore, we extend the "pre-training + fine-tuning" approach proposed in Section 6.3.2 to the Weighted SMUG method. In this case, we obtain the fine-tuned $\theta$ and $\phi$ by solving

$$\min_{\theta, \phi} \ \mathbb{E}_{(\mathbf{y},\mathbf{t})} \left[ \lambda_\ell \|F_{\text{wSMUG}}(\mathbf{A}^H\mathbf{y}) - \mathbf{t}\|_2^2 + \ell_{\text{UStab}}(\theta; \mathbf{y}, \mathbf{t}) \right]. \quad (6.12)$$
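The following sketch implements the weighted smoothing operation (6.10) used inside each Weighted SMUG step (6.11); `weight_enc` stands for the weighting encoder $\mathcal{E}_\phi$ (a hypothetical module returning a positive scalar per image, e.g., via a final sigmoid).

```python
import torch

def weighted_smooth(x, denoiser, weight_enc, sigma=0.01, n_mc=10):
    """Monte Carlo estimate of (6.10): a weighted average of denoised noisy
    samples, with weights produced by the encoder E_phi."""
    num, den = 0.0, 0.0
    for _ in range(n_mc):
        noisy = x + sigma * torch.randn_like(x)
        w = weight_enc(noisy)            # scalar weight for this sample
        num = num + w * denoiser(noisy)
        den = den + w
    return num / den
```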
6.3.5 Integrating RS into Other Unrolled Networks

In this subsection, we further discuss the extension of our SMUG schemes to other unrolling-based reconstructors, using ISTA-Net (Zhang and Ghanem, 2018) as an example. The goal is to demonstrate the generality of our proposed approaches for deep unrolled models. ISTA-Net uses a training loss function composed of discrepancy and constraint terms. In particular, it performs the following for $N$ unrolling steps:

$$\mathbf{r}^n = \mathbf{x}^{n-1} - \lambda^{(n)} \mathbf{A}^H(\mathbf{A}\mathbf{x}^{n-1} - \mathbf{y}), \quad (6.13)$$
$$\mathbf{x}^n = \hat{\mathcal{F}}^n(\text{Soft}(\mathcal{F}^n(\mathbf{r}^n), \theta_n)), \quad (6.14)$$

where $\hat{\mathcal{F}}$ and $\mathcal{F}$ involve two linear convolutional layers (without bias terms) separated by ReLU activations, and $\hat{\mathcal{F}}^n \circ \mathcal{F}^n$ is constrained to be close to the identity operator. The function Soft performs soft-thresholding with parameter $\theta_n$ (Zhang and Ghanem, 2018).

Similar to SMUG for MoDL, we integrate RS into the network-based regularization (denoising) component of ISTA-Net. This results in the following modification to (6.14):

$$\mathbf{x}^n = \mathbb{E}_{\boldsymbol{\eta}}[\hat{\mathcal{F}}^n(\text{Soft}(\mathcal{F}^n(\mathbf{r}^n + \boldsymbol{\eta}), \theta_n))], \quad (6.15)$$

where $\boldsymbol{\eta}$ is drawn from $\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$. For Weighted SMUG, (6.14) becomes

$$\mathbf{x}^n = \frac{\mathbb{E}_{\boldsymbol{\eta}}[\mathcal{E}_\phi(\mathbf{r}^n + \boldsymbol{\eta}) \, \hat{\mathcal{F}}^n(\text{Soft}(\mathcal{F}^n(\mathbf{r}^n + \boldsymbol{\eta}), \theta_n))]}{\mathbb{E}_{\boldsymbol{\eta}}[\mathcal{E}_\phi(\mathbf{r}^n + \boldsymbol{\eta})]}. \quad (6.16)$$
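A sketch of one RS-ISTA-Net iteration combining (6.13) and (6.15) is given below; `F_fwd` and `F_inv` stand for the learned transforms $\mathcal{F}^n$ and $\hat{\mathcal{F}}^n$ (hypothetical modules), and the soft-thresholding helper is standard.

```python
import torch

def soft(x, theta):
    """Soft-thresholding nonlinearity used by ISTA-Net."""
    return torch.sign(x) * torch.relu(x.abs() - theta)

def rs_istanet_step(x_prev, y, A, AH, F_fwd, F_inv, theta_n, step,
                    sigma=0.01, n_mc=10):
    """Gradient step on the data term (6.13) followed by the smoothed learned
    shrinkage (6.15); the expectation over eta uses n_mc Monte Carlo draws."""
    r = x_prev - step * AH(A(x_prev) - y)
    outs = [F_inv(soft(F_fwd(r + sigma * torch.randn_like(r)), theta_n))
            for _ in range(n_mc)]
    return torch.stack(outs).mean(dim=0)
```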
6.4 Experiments

6.4.1 Experimental Setup

6.4.1.1 Models & Sampling Masks

For the MoDL architecture, we use the recent state-of-the-art denoising network, the Deep Iterative Down-Up Network, which consists of 3 down-up blocks (DUBs) and 64 channels (Yu et al., 2019b). Additionally, for MoDL, we use $N = 8$ unrolling steps with denoising regularization parameter $\lambda = 1$. The conjugate gradient method (Aggarwal et al., 2019b), with a tolerance level of $10^{-6}$, is utilized to execute the DC block. We used variable-density Cartesian random undersampling masks in k-space, one for each undersampling factor; each mask includes a fully-sampled central k-space region, with the remaining phase-encode lines sampled uniformly at random. The coil sensitivity maps for all scenarios were generated with the BART toolbox (Tamir et al., 2016). Extension to the ISTA-Net model is discussed in Section 6.4.7.

6.4.1.2 Baselines

We consider two robustification approaches: the first is the RS-E2E method (Jia et al., 2022a) presented in (RS-E2E), and the second is Adversarial Training (AT) (Jia et al., 2022b). Furthermore, we consider other recent reconstruction models, specifically, the Deep Equilibrium (Deep-Eq) method (Gilton et al., 2021b) and a leading diffusion-based MRI reconstruction model from (Chung and Ye, 2022), which we denote as Score-MRI.

Figure 6.5 Reconstruction accuracy box plots for the fastMRI brain dataset with 4x acceleration factor. The additive random Gaussian noise of the second-column plots is obtained using a standard deviation of 0.01. The worst-case additive noise of the third column is obtained using the PGD method with $\epsilon = 0.02$.

Figure 6.6 Visualization of ground truth and reconstructed images using different methods for 4x k-space undersampling, evaluated on PGD-generated worst-case inputs of perturbation strength $\epsilon = 0.02$.

6.4.1.3 Datasets & Training

For our study, we execute two experimental cases. For the first case, we utilize the fastMRI knee dataset, with 32 scans for validation and 64 unseen scans/slices for testing. In the second case, we employ our method on the fastMRI brain dataset. We used 3000 training scans in both cases. The k-space data are normalized so that the real and imaginary components are in the range [−1, 1].

Figure 6.7 Reconstruction accuracy box plots for the fastMRI knee dataset with 4x acceleration factor. The additive random Gaussian noise of the second-column plots is obtained using a standard deviation of 0.01. The worst-case additive noise of the third column is obtained using the PGD method with $\epsilon = 0.02$.

We use a batch size of 2 and 60 training epochs. The experiments are run using two A5000 GPUs. The ADAM optimizer (Kingma and Ba, 2014) is utilized for training the network weights with momentum parameters (0.5, 0.999) and a learning rate of $10^{-4}$. The stability parameter $\lambda_\ell$ in (6.8) (and (6.12)) is tuned so that the standard accuracy of the learned model is comparable to vanilla MoDL. For RS-E2E, we set the standard deviation of the Gaussian noise to $\sigma = 0.01$ and use 10 Monte Carlo samplings to implement the smoothing operation. Note that in our experiments, Gaussian noise and corruptions are added to the real and imaginary parts of the data with the indicated $\sigma$. For AT, we implemented a 30-step PGD procedure within its minimax formulation with $\epsilon = 0.02$. For Score-MRI, we used 150 steps for the reverse diffusion process with the pre-trained model. We fine-tuned a pre-trained Deep-Eq model with the same data as the proposed schemes. Unless specified, training parameters were similar across the compared methods.

6.4.1.4 Testing

We evaluate our methods on clean data (without additional perturbations), data with randomly injected noise, and data contaminated with worst-case additive perturbations. The worst-case disturbances reveal worst-case method sensitivity and are generated by the $\ell_\infty$-norm based PGD scheme with 10 steps (Antun et al., 2020a), corresponding to $\|\boldsymbol{\delta}\|_\infty \le \epsilon$, where $\epsilon$ is set
nominally as the maximum underlying k-space real and imaginary part magnitude scaled by 0.05. We will indicate the scaling for $\epsilon$ (e.g., 0.05) in the results and plots that follow. The quality of reconstructed images is measured using the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) (Wang et al., 2004). In addition to the worst-case perturbations and random noise, we evaluate the performance of our methods in the presence of additional instability sources, such as (i) different undersampling rates and (ii) different numbers of unrolling steps.

Figure 6.8 Reconstruction accuracy box plots for the fastMRI knee dataset with 8x acceleration factor. The additive random Gaussian noise in the second-column plots is obtained using a standard deviation of 0.01. The worst-case additive noise in the third column is obtained using the PGD method with $\epsilon = 0.02$.

6.4.2 Robustness with Additive Perturbations

In this subsection, we present the robustness results of the proposed approaches w.r.t. additive noise. In particular, the evaluation is conducted on the clean, noisy (with added Gaussian noise), and worst-case perturbed (using PGD for each method) measurements. Fig. 6.5 presents testing-set PSNR and SSIM values as box plots for the different smoothing architectures, along with vanilla MoDL and the other baselines, using the brain dataset. The clean accuracies of Weighted SMUG and SMUG are similar to vanilla MoDL, indicating a good clean accuracy vs. robustness trade-off. As indicated by the PSNR and SSIM values, we observe that Weighted SMUG, on average, outperforms all other baselines in robust accuracy (the second and third sets of box plots in the two rows of Fig. 6.5). This observation is consistent with the visualization of reconstructed images for the brain dataset in Fig. 6.6.

Figure 6.9 Visualization of ground-truth and reconstructed images using different methods for 4x k-space undersampling, evaluated on PGD-generated worst-case inputs of perturbation strength $\epsilon = 0.02$.

Figure 6.10 Visualization of ground truth and reconstructed images using different methods for 8x k-space undersampling, evaluated on PGD-generated worst-case inputs of perturbation scaling $\epsilon = 0.02$.

We note that Weighted SMUG requires a longer training time, which represents a trade-off. When comparing to AT, we observe that AT is comparable to SMUG in the case of robust (or worst-case noise) accuracy. However, the drop in clean accuracy (without perturbations) for AT is significantly larger than for SMUG. Furthermore, AT takes a much longer training time, as it requires solving an optimization problem (PGD) for every training data sample at every iteration to obtain the worst-case perturbations. We also observe that its effectiveness is degraded for other perturbations, including random noise as well as the modified sampling rates shown in the next subsection.
Importantly, the proposed SMUG and Weighted SMUG are not trained to be robust to any specific perturbations or instabilities, but are nevertheless effective in several scenarios. In comparison to the diffusion-based Score-MRI, the proposed methods perform better in terms of both clean accuracy and random-noise accuracy. Although for worst-case perturbations the PSNR values of Score-MRI are only slightly worse than SMUG, it is important to note that not only does the training of diffusion-based models take longer than our method, but the inference time is also longer: Score-MRI requires nearly 150 sampling steps to process one scan and takes nearly 5 minutes on a single RTX5000 GPU, whereas our method takes only about 25 seconds per scan. The SMUG schemes also substantially outperform the deep equilibrium model in the presence of perturbations.

In Figs. 6.7 and 6.8, we report the PSNR and SSIM results of the different methods at two sampling acceleration factors for the knee dataset. Therein, we observe quite similar outcomes to those reported in Fig. 6.5. Figs. 6.9 and 6.10 show reconstructed images by the different methods for knee scans at 4x and 8x undersampling, respectively. We observe that SMUG and Weighted SMUG show fewer artifacts, sharper features, and fewer errors when compared to vanilla MoDL and the other baselines in the presence of worst-case perturbations.

Fig. 6.11 presents average PSNR results over the test dataset for the considered models under different levels of worst-case perturbations (i.e., attack strength $\epsilon$). We used the knee dataset for this experiment. We observe that SMUG and Weighted SMUG outperform RS-E2E, vanilla MoDL, and Deep-Eq across all perturbation strengths. When compared to Score-MRI and AT, our proposed methods consistently maintain higher PSNR values for moderate to large perturbations (less than $\epsilon = 0.08$). For instance, when $\epsilon = 0.02$, Weighted SMUG reports more than a 1 dB improvement over AT and Score-MRI.

Figure 6.11 PSNR of baseline methods and the proposed method versus perturbation strength (i.e., scaling) $\epsilon$ used in PGD-generated worst-case examples at testing time with 4x k-space undersampling. $\epsilon = 0$ corresponds to clean accuracy.

6.4.3 Robustness for Varying Sampling Rates and Unrolling Steps

In this subsection, we evaluate the robustness of our proposed approaches and the considered baselines at varying sampling rates and unrolling steps. For our first experiment, during training, a k-space undersampling or acceleration factor of 4x is used for our methods and the considered baselines. At testing time, we evaluate performance (in terms of PSNR) with acceleration factors ranging from 2x to 8x. The results are presented in Fig. 6.12. It is clear that when the acceleration factor during testing matches that of the training phase (4x), all methods achieve their highest PSNR results. Conversely, performance generally declines when the acceleration factors differ. For acceleration factors 3x to 8x (ignoring 4x, where the models were trained), we observe that our methods outperform all the considered baselines. For the 2x case, our methods report higher PSNR values compared to RS-E2E, vanilla MoDL, and Deep-Eq and slightly underperform AT, while Score-MRI shows more resilience at 2x.

For the second experiment, we study the performance of varying unrolling steps. More specifically, during training, we utilize 8 unrolling steps to train our methods and the baselines. At testing time, we report the results of utilizing 1 to 16 unrolling steps.
The PSNR results of all the considered cases are given in Fig. 6.13. The results show that both SMUG and Weighted SMUG maintain performance comparable to the Deep Equilibrium model. Furthermore, when using different unrolling steps and faced with additive measurement perturbations, the SMUG methods' PSNR values are stable and close to the unperturbed case (indicating robustness), whereas the other methods see a more drastic drop in performance. This behavior of SMUG also agrees with the theoretical bounds in Section 6.3. Although we do not intentionally design our method to mitigate MoDL's instabilities against different sampling rates and unrolling steps, the SMUG approaches nevertheless provide improved PSNRs over other baselines. This indicates broader value for the robustification strategies incorporated in our schemes.

Figure 6.12 PSNR results for different MRI reconstruction methods versus different measurement sampling rates (models trained at 4× acceleration).

Figure 6.13 PSNR results for different MRI reconstruction methods at 4x k-space undersampling versus the number of unrolling steps (8 steps used in training). "Clean" and "Robust" denote the cases without and with added worst-case (for each method) measurement perturbations.

6.4.4 Importance of the UStab Loss

We conduct additional studies on the unrolled stability loss in our scheme to show the importance of integrating target-image denoising into SMUG's training pipeline in (6.7). Fig. 6.14 presents PSNR values versus perturbation strength/scaling ($\epsilon$) when using different alternatives to $\mathcal{D}_\theta(\mathbf{t})$ in (6.7), including $\mathbf{t}$ (the original target image), $\mathcal{D}_\theta(\mathbf{x}^n)$ (the denoised output of each unrolling step), and variants using the fixed, vanilla MoDL denoiser $\mathcal{D}_{\theta_{\text{MoDL}}}$ instead. As we can see, the performance of SMUG varies when the UStab loss (6.7) is configured differently. The proposed $\mathcal{D}_\theta(\mathbf{t})$ outperforms the other alternatives. A possible reason is that it infuses supervision from the target images in an adaptive, denoising-friendly manner, i.e., taking the influence of $\mathcal{D}_\theta$ into consideration.

Figure 6.14 PSNR vs. worst-case perturbation strength ($\epsilon$) for SMUG for different configurations of the UStab loss (6.7).

6.4.5 Impact of the Noise Smoothing

To comprehensively assess the influence of the noise introduced during smoothing, denoted as $\boldsymbol{\eta}$, on the efficacy of the suggested approaches, we undertake an experiment involving varying noise standard deviations $\sigma$. The outcomes, documented in terms of RMSE, are showcased in Fig. 6.15. The accuracy (reconstruction quality w.r.t. the ground truth) and the robustness error (error between the cases with and without measurement perturbation) are shown for both SMUG and RS-E2E. We notice a notable trend: as the noise level $\sigma$ increases, the accuracy for both methods improves before beginning to degrade. Importantly, SMUG consistently outperforms end-to-end smoothing. Furthermore, the robustness error continually drops as $\sigma$ increases (corroborating our analysis/bound in Section 6.3), with a more rapid decrease for SMUG.

Figure 6.15 Left: Norm of the difference between SMUG and RS-E2E reconstructions and the ground truth for different choices of $\sigma$ in the smoothing process.
A worst-case PGD perturbation $\boldsymbol{\delta}$ computed at $\epsilon = 0.01$ was added to the measurements in all cases. Right: Robustness error for SMUG and RS-E2E at various $\sigma$, i.e., the norm of the difference between the output with the perturbation $\boldsymbol{\delta}$ and without it.

6.4.6 Empirical Analysis of the Behavior of Weighted SMUG

In a subsequent final study, we analyze the behavior of the Weighted SMUG algorithm. We delve into the nuances of weighted smoothing, which can assign different weights to different images during the smoothing process. The aim is to gauge how the superior performance of Weighted SMUG arises from the variations in the learned weights. Our findings indicate that among the 10 Monte Carlo samplings implemented for the smoothing operation, those with lower denoising RMSE when compared to the ground truth images generally receive higher weights, as illustrated in Fig. 6.16.

Figure 6.16 Weights predicted by the weight encoder network in Weighted SMUG (from the final layer of unrolling) plotted against the root mean squared error (RMSE) of the corresponding denoised images for 5 randomly selected scans (with 4x undersampling).

6.4.7 Results of Applying Our Methods to ISTA-Net

In our concluding study, we investigate whether our robustification methods can be effective with an alternative unrolling technique, ISTA-Net (Zhang and Ghanem, 2018). For ISTA-Net, we adopted the default architecture, utilizing the ADAM optimizer with a learning rate of $10^{-4}$. The network was configured with 9 phases (unrolling iterations) and trained for 100 epochs on the fastMRI knee dataset comprising 3000 scans at 4x undersampling. Similar to previous experiments, we used 64 scans for testing. Other settings for training the vanilla ISTA-Net were set to default values. The settings for the RS-E2E version and the SMUG and Weighted SMUG versions of ISTA-Net were similar to the MoDL case.

The results, as presented in Figure 6.17, demonstrate that the clean-accuracy performance of the SMUG and Weighted SMUG versions of ISTA-Net is comparable to vanilla ISTA-Net. Notably, under conditions of random noise (Gaussian noise with $\sigma = 0.01$ added) and PGD-attack (30 steps with $\epsilon = 0.02$) perturbed measurements, our method surpasses both the original ISTA-Net and the RS-E2E version. The comparative results reveal that the performance closely aligns with the outcomes previously observed when unrolled smoothing was combined with the MoDL network.

Figure 6.17 Reconstruction accuracy box plots for the fastMRI knee dataset with 4x acceleration factor for the case of ISTA-Net. The additive random Gaussian noise in the second-column plots is obtained using a standard deviation of 0.01. The worst-case additive noise in the third column is obtained using the PGD method with $\epsilon = 0.02$.

6.5 Discussion and Conclusion

In this work, we proposed a scheme for improving the robustness of DL-based MRI reconstruction. In particular, we investigated deep unrolled reconstruction's weaknesses in robustness against worst-case or noise-like additive perturbations, sampling rates, and unrolling steps. To improve the robustness of the unrolled scheme, we proposed SMUG with a novel unrolled smoothing loss. We also provided a theoretical analysis of the robustness achieved by our proposed method.
6.5 Discussion and Conclusion

In this work, we proposed a scheme for improving the robustness of DL-based MRI reconstruction. In particular, we investigated deep unrolled reconstruction's weaknesses in robustness against worst-case or noise-like additive perturbations, sampling rates, and unrolling steps. To improve the robustness of the unrolled scheme, we proposed SMUG with a novel unrolled smoothing loss. We also provided a theoretical analysis of the robustness achieved by our proposed method. Compared to the vanilla MoDL approach and other schemes, we empirically showed that our approach is effective and can significantly improve the robustness of a deep unrolled scheme against a diverse set of external perturbations. We also further improved SMUG's robustness by introducing weighted smoothing as an alternative to conventional RS, which adaptively weights different images when aggregating them. In future work, we hope to apply the proposed schemes to other imaging modalities and to evaluate robustness against additional types of realistic perturbations. While we theoretically characterized the robustness error for SMUG, we hope to further analyze its accuracy-robustness trade-off under perturbations.

CHAPTER 7
MRI RECONSTRUCTION VIA DIFFUSION PURIFICATION

7.1 Introduction

In the last chapter, we introduced several methods to improve model generalization and robustness. However, a primary weakness of SMUG is that it can be difficult to integrate into models other than unrolling models. Furthermore, SMUG does not demonstrate significantly better performance than adversarial training or other existing methods. To address these shortcomings, we now aim to refine and enhance this approach. A recent study by Nie et al. (Nie et al., 2022b) introduced a robustification strategy that effectively mitigates the impact of additive worst-case perturbations by harnessing the power of diffusion models (DMs) (Chung and Ye, 2022; Chung et al., 2023c; Karras et al., 2022). Drawing inspiration from this methodology and benefiting from the generalization capabilities of DMs, we investigate the application of a similar approach to enhance the resilience of DL-based MRI reconstruction. Our approach centers on the application of pre-trained diffusion models as noise purifiers. More precisely, this purification process entails a gradual introduction of noise, followed by the removal of the noise through the utilization of the pre-trained DM.

7.1.1 Contributions

• We introduce a general robustification framework designed to enhance the resilience of DL-based MRI reconstructors against a variety of instabilities, and to improve their generalization performance when faced with out-of-distribution samples. This is accomplished by integrating purification via pre-trained DMs into existing DL-based models.

• We prove that the perturbed and clean images' distributions (and conditional distributions) get closer to each other as time increases in the forward diffusion stage.

• We present a novel approach to select the process-switching time step, a critical parameter within our DM-based purification method. This eliminates the necessity of treating it as a hyper-parameter.
• We use fine-tuning to further improve the DL-based reconstructors' performance, which, unlike the well-known state-of-the-art (SOTA) robustification method AT, neither requires solving a minimax problem nor involves generating worst-case examples.

• In our experimental results, we demonstrate the effectiveness of our proposed approach by assessing it with standard evaluation metrics, surpassing the performance of AT, RS, and diffusion-based MRI reconstruction in the presence of several sources of instabilities. Furthermore, we illustrate that, after being trained on the knee fastMRI dataset, the purification process using DMs extends its benefits to other MRI datasets, including a brain MRI dataset and data with unseen lesions. Additionally, we show that our robustification approach can be applied to multiple DL-based supervised methods, such as the well-known MoDL and the recent Recurrent Variational Network (RecurrentVarNet) (Yiasemis et al., 2022).

7.2 Lack of Robustness in DL-based MRI Reconstruction & Score-based DMs

In this section, we first introduce the inverse problem formulation for deep MRI reconstruction. Second, we shed light on the lack of robustness in these models. Then, we present the formulation of the score-based DM used in this paper.

7.2.1 DL-based MRI Reconstruction

MRI reconstruction is a challenging ill-posed inverse problem (Donoho, 2006a). Its objective is to recover the original signal x ∈ Cⁿ from observed measurements y ∈ Cᵐ, with m < n. For multi-coil MRI, this task can be formulated as a linear inverse problem denoted y ≈ Ax, where A = MFS, with S denoting the sensitivity encoding with multiple coils, F denoting the coil-by-coil Fourier transform, and M denoting coil-wise undersampling. Typically, the reconstruction process involves solving the optimization problem

$\min_{x} \; \|Ax - y\|_2^2 + \lambda \mathcal{R}(x)$,

where R(·) (resp. λ > 0) is a regularization term (resp. parameter). There are several methods that use unrolling steps to train deep MRI image reconstruction. While for the major part of this paper we focus on the popular MoDL framework (Aggarwal et al., 2018), our proposed method can be applied to other DL-based reconstruction models, as illustrated in the last subsection of the experimental results. In MoDL, the traditional regularization term is substituted with a denoising neural network (NN) represented as f : Cⁿ → Cⁿ, parameterized by θ. This denoising NN is trained in a supervised learning framework using a dataset of multiple pairs of measurements y and their corresponding ground truth images x. For each pair (y, x) in the training set D, the MoDL training process initializes x₀ (e.g., as Aᴴy) and then iterates through the subsequent steps for a specified number of unrolling iterations indexed by j ∈ {0, ..., N − 1}. This process can be described as follows:

$z_j \leftarrow f_\theta(x_j)$,   (7.1)
$x_{j+1} \leftarrow \arg\min_{x} \|Ax - y\|_2^2 + \lambda \|x - z_j\|_2^2$.   (7.2)

The parameters of f_θ are updated end-to-end in a supervised manner following (Aggarwal et al., 2018). Equation (7.1) corresponds to the denoising step, while Equation (7.2) pertains to the data consistency (DC) step. Equation (7.2) has the closed-form solution

$x_{j+1} \leftarrow (A^H A + \lambda I)^{-1}(A^H y + \lambda z_j)$.

During the testing phase, when presented with an aliased image (e.g., Aᴴy), a trained MoDL model reconstructs x by applying the procedure described in Equations (7.1) and (7.2) for a specified number of unrolling steps. For the remainder of this paper, we use MoDL_θ(Aᴴy) to denote the image reconstructed by MoDL.
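To make the MoDL iterations concrete, below is a minimal, hedged PyTorch sketch of (7.1)–(7.2); `f_theta` (the denoising network) and `AHA` (a function applying AᴴA) are assumed callables, and the DC step is solved with a few conjugate-gradient iterations rather than an explicit matrix inverse, as is standard for large-scale MRI operators.

```python
import torch

def data_consistency(AHA, AHy, z, lam, n_cg=10):
    """Solve (A^H A + lam I) x = A^H y + lam z with conjugate gradients (7.2)."""
    x = z.clone()
    b = AHy + lam * z
    r = b - (AHA(x) + lam * x)
    p = r.clone()
    rs = torch.vdot(r.flatten(), r.flatten()).real
    for _ in range(n_cg):
        Ap = AHA(p) + lam * p
        alpha = rs / torch.vdot(p.flatten(), Ap.flatten()).real
        x, r = x + alpha * p, r - alpha * Ap
        rs_new = torch.vdot(r.flatten(), r.flatten()).real
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def modl_reconstruct(f_theta, AHA, AHy, lam=1.0, n_unroll=6):
    """Alternate the denoising step (7.1) and the DC step (7.2)."""
    x = AHy                                   # initialize with aliased image
    for _ in range(n_unroll):
        z = f_theta(x)                        # denoising step (7.1)
        x = data_consistency(AHA, AHy, z, lam)  # data consistency (7.2)
    return x
```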
7.2.2 Vulnerabilities & Challenges of DL-based MRI Reconstructors

7.2.2.1 K-space Additive Noise

Given a trained deep MRI reconstruction NN and an aliased image z = Aᴴy, recent studies have shown that these NNs are not robust to additive perturbations δ to y (Li et al., 2023). The study in (Jia et al., 2022b) presents an approach to generate worst-case additive noise that employs norm constraints, in line with the attack strategies utilized in image classification. This approach aims to produce a form of worst-case imperceptible additive noise against a reconstructor in the image domain. Given a perturbation budget 𝜖 > 0, the worst-case additive perturbations can be obtained using the following optimization problem:

$\max_{\|\delta\|_\infty \le \epsilon} \; \mathcal{L}\big(\mathrm{MoDL}_\theta(A^H y),\; \mathrm{MoDL}_\theta(A^H (y + \delta))\big)$,   (7.3)

where ‖·‖∞ is the ℓ∞ norm and L is a differentiable loss function that computes the reconstruction loss. Given the original image x*, generating the perturbations can also be achieved by replacing the first argument of L in (7.3) with x*. A solution of (7.3) can be obtained using the Projected Gradient Descent (PGD) method (Madry et al., 2017). In this paper, we also use z_pert = Aᴴ(y + δ) = Aᴴy_pert, which relates perturbations in k-space and image space. In addition to the worst-case perturbations, random/realistic additive measurement noise could also impact the performance of a reconstructor.

7.2.2.2 Training/Testing Sampling Protocol & Undersampling Rate Disparities

In addition to additive perturbations, the study presented in (Li et al., 2023) underscores an additional potential source of instability that MoDL (and other DL-based reconstructors) may face during testing. This source stems from changes in the measurement sampling rate, leading to perturbations in the sparsity of the sampling mask within A (Antun et al., 2020a). Furthermore, in this paper, we consider another variation that these NNs could encounter during the testing phase, involving a shift or variation of the k-space sampling locations within the matrix M, resulting in a nonidentical forward operator at testing time. For this case, z_pert = A_testᴴ y, where A_test ≠ A. We remark that ensuring the robustness of a reconstruction model to variations in the sampling protocol, undersampling rate, scan contrast, etc., is crucial, as it mitigates the need for re-training for all possible practical scenarios and variations common in imaging. Re-training models for new setups is expensive. Moreover, the relatively limited availability of training data in reconstruction applications (supervised learning requires fully-sampled measurements as labels) also warrants learning models that remain significantly robust.

7.2.2.3 Unseen Anatomies & Pathologies at Testing Time

A lesion (or anatomy change) denotes an anomaly or impairment within a tissue or organ of the body, arising from diverse factors such as injuries, diseases, or pathological conditions. In the medical domain, the term commonly characterizes regions of abnormal or diseased tissue observed through MR imaging. In this paper, we study the practical case where the DL-based image reconstructor is trained on some data points but tested with measurements containing unseen lesions. Figure 7.1 illustrates reconstructed images under the instabilities and generalization challenges considered in this paper.

Figure 7.1 Vulnerabilities and generalization challenges of DL-based MRI reconstruction models, shown by evaluating a trained MoDL reconstructor (trained at 4x undersampling) with the cases considered in Section 7.2.2. (a) Reconstruction from clean measurements (PSNR = 30.8 dB). (b) Reconstruction from measurements with worst-case additive perturbations (Equation (7.3) with 𝜖 = 0.02; PSNR = 23.21 dB). (c) Reconstruction from measurements with a 2x undersampling rate during testing (PSNR = 22.18 dB). (d) Reconstruction from a different test-time sampling mask with 4x undersampling (PSNR = 24.15 dB). (e) Reconstruction from measurements with an unseen lesion during testing (PSNR = 27.26 dB).
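Returning to the worst-case perturbations of Section 7.2.2.1, the following is a hedged PyTorch sketch of solving (7.3) with ℓ∞-constrained PGD; `recon` (the reconstructor) and `AH` (application of Aᴴ) are assumed callables, tensors are assumed real-valued (complex data would be handled as two channels in practice), and the step size is an illustrative choice.

```python
import torch

def pgd_on_measurements(recon, AH, y, eps=0.004, steps=30, alpha=None):
    """l_inf PGD for (7.3): find delta maximizing the reconstruction discrepancy."""
    alpha = alpha or 2.5 * eps / steps        # heuristic step size (assumption)
    ref = recon(AH(y)).detach()               # unperturbed reconstruction
    delta = torch.zeros_like(y, requires_grad=True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(recon(AH(y + delta)), ref)
        loss.backward()                       # ascend on the loss
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)           # project onto the l_inf ball
        delta.grad.zero_()
    return delta.detach()
```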
7.2.3 Score-based Diffusion Models

Diffusion models (DMs) have shown great potential for solving many hard computer vision tasks and have recently been extended to medical imaging applications. The Bayesian framework of DMs, introduced in (Ho et al., 2020; Sohl-Dickstein et al., 2015), consists of a discrete Markov chain. The forward direction is constructed by sampling from $p(z_i \mid z_{i-1}) = \mathcal{N}(z_i;\, \sqrt{1 - \beta_i}\, z_{i-1},\, \beta_i I)$, where β_i ∈ (0, 1) is an entry of a sequence of monotonically increasing positive noise scales w.r.t. i. Score-based DMs were introduced in (Song et al., 2021c) and shown to be equivalent to the Bayesian framework. Score-based DMs can be formulated by the following forward and reverse Stochastic Differential Equations (SDEs):

$dz = f(z, t)\,dt + g(t)\,dw$,   (7.4)
$dz = \big[f(z, t) - g^2(t)\,\nabla_z \log p_t(z)\big]\,dt + g(t)\,d\bar{w}$,   (7.5)

where f and g are the drift and diffusion coefficients, respectively, t spans the interval [0, 1] and represents the time index, and dw and d$\bar{w}$ represent standard Brownian motion evolving forward and backward in time, respectively. The term p_t(z) denotes the distribution of z at time t, while ∇_z log p_t(z) represents the score function. By employing the formulation of the Variance Exploding (VE) SDE (VE-SDE) (Song et al., 2021c), for which f = 0 and $g(t) = \sqrt{d\sigma^2(t)/dt}$, we can re-write the forward and reverse SDEs as

$dz = \sqrt{\frac{d\sigma^2(t)}{dt}}\,dw$,   (7.6)
$dz = -\frac{d\sigma^2(t)}{dt}\,\nabla_z \log p_t(z)\,dt + \sqrt{\frac{d\sigma^2(t)}{dt}}\,d\bar{w}$.   (7.7)

In Equations (7.6) and (7.7), the function $\sigma(t) = \sigma_l(\sigma_u/\sigma_l)^t$ is monotonically increasing w.r.t. t, where σ_l ∈ (0, 1) and σ_u > 1 are constants. In practice, the score function is replaced by a neural network denoted s : Cⁿ × [0, 1] → Cⁿ, parameterized by φ, which is trained using the denoising score matching technique (Chung and Ye, 2022) as

$\min_\phi \; \mathbb{E}\left[\left\| \sigma(t)\, s_\phi(z(t), t) - \frac{z(t) - z}{\sigma(t)} \right\|_2^2\right]$.   (7.8)

The expectation in (7.8) is taken over t ∼ U[0, 1], z ∼ p(z), and z(t) ∼ N(z, σ²(t)I), where p(z) = p₀(z) is the distribution of the training data. Having obtained a trained DM with parameters φ, the task of sampling ẑ(0) at the time instant t = 0 is realized through the solution of the reverse process SDE in (7.7). In this step, the score function is substituted with the learned function s_φ. There exist various techniques for sampling from DMs, which involve solving the reverse SDE in (7.7).
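To make the training objective concrete, the following minimal sketch draws one stochastic estimate of (7.8), mirroring the sign convention of (7.8) exactly as written above; `s_phi` and `sigma_fn` are assumed callables, and this is an illustration rather than the training code used for the adopted pre-trained model.

```python
import torch

def score_matching_loss(s_phi, z0, sigma_fn):
    """One stochastic estimate of the VE-SDE score-matching objective (7.8)."""
    t = torch.rand(())                      # t ~ U[0, 1]
    sigma = sigma_fn(t)
    noise = torch.randn_like(z0)
    zt = z0 + sigma * noise                 # z(t) ~ N(z, sigma^2(t) I)
    # (z(t) - z) / sigma(t) equals `noise`; match sigma(t) * s_phi against it
    return ((sigma * s_phi(zt, t) - noise) ** 2).mean()
```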
In this paper, the Euler method (Platen and Bruti-Liberati, 2010) and the Predictor-Corrector (PC) scheme (Allgower and Georg, 2012) are used. Following the work in (Chung and Ye, 2022), a data consistency step is included to allow sampling from the conditional distribution p(z|y). In practice, the continuous time index t ∈ [0, 1] is discretized into i ∈ [N_r], where [N_r] := {1, ..., N_r}. The PC sampling technique consists of N_r prediction reverse steps. In each prediction iteration, M_r correction steps are applied (Song et al., 2021c). The full procedure is outlined in Algorithm 7.1.

Algorithm 7.1 Predictor-Corrector Sampling with DC (Chung and Ye, 2022)
Input: Image z = Aᴴy, trained DM s_φ, discretized time step N_r, and noise schedule 𝜖_i.
Function: ẑ = PCDC(s_φ(z(N_r), N_r), y, A, N_r, 0).
1: Initialize z(N_r) ∼ N(0, σ²(N_r) I).
2: For i ∈ {N_r − 1, ..., 0}  \\ Prediction
3:   z′(i) ← z(i+1) + (σ²(i+1) − σ²(i)) s_φ(z(i+1), i+1)
4:   z(i) ← z′(i) + √(σ²(i+1) − σ²(i)) η, η ∼ N(0, I)
5:   z(i) ← z(i) + Aᴴ(y − Az(i))  \\ Data Consistency
6:   For M_r steps do  \\ Correction
7:     z′(i) ← z(i) + 𝜖_i s_φ(z(i), i)
8:     z′(i) ← z′(i) + √(2𝜖_i) η, η ∼ N(0, I)
9:     z(i) ← z′(i) + Aᴴ(y − Az(i))  \\ Data Consistency
10: Output ẑ = z(0)

7.3 Diffusion Purification for Robust DL-based MRI Reconstruction

In this section, we begin by outlining the key components of the proposed Diffusion Purification (DP) pipeline. Subsequently, we introduce our approach for obtaining the PST step. Following that, we elaborate on our fine-tuning strategy for MoDL, leveraging the purified samples.

7.3.1 DM-based Purification

Here, we present our DP approach, which consists of the following two stages.

Diffusion Stage: Given measurements y, let z_pert denote the perturbed version of z = Aᴴy. As illustrated in the previous section, this perturbed version can be due to various causes, such as random measurement noise, noise and artifacts that are not well modeled (for which it may make sense to consider the worst-case additive noise from (7.3)), and different k-space undersampling factors or sampling patterns/masks at testing time.

Figure 7.2 A schematic block diagram illustrating the standard pipeline and our proposed reconstruction pipeline. The functions MoDL_θ(·) and MoDL_θFT(·) represent the application of the standard pre-trained MoDL procedure and our 'pre-trained + fine-tuned' robust MoDL procedure, respectively. Here, MoDL can be replaced with other DL-based reconstruction models.

The first stage of the DP approach involves diffusing z(0) = z_pert from t = 0 to t = t*, where t* ∈ (0, 1) indicates the diffusion time index at which the forward process stops. We term t* the Process-Switching Time (PST) step. The PST step and σ(·) control the amount of noise added to z_pert. This stage corresponds to

$z_{\mathrm{pert}}(t^*) = z_{\mathrm{pert}} + \sqrt{\sigma^2(t^*) - \sigma^2(0)}\;\eta_{t^*}, \quad \eta_{t^*} \sim \mathcal{N}(0, I)$.   (7.9)

Purification Stage: After obtaining the diffused perturbed image, denoted z_pert(t*), the objective of the second stage is to derive the purified sample, denoted z_pertᵖᵘʳ, from z_pert(t*). This is achieved by employing the PC reverse process with data consistency (DC). In other words, we use the PC-with-DC procedure of Algorithm 7.1 as:

$z^{\mathrm{pur}}_{\mathrm{pert}}(0) = \mathrm{PCDC}\big(s_\phi(z_{\mathrm{pert}}(t^*), t^*),\; y_{\mathrm{pert}},\; A,\; t^*,\; 0\big)$.   (7.10)

In practice, we use N_t*, which represents the discrete PST step. We remark that N_t* is less than the total number of steps N_r available in the standard reverse sampling process. Algorithm 7.2 illustrates the diffusion purification procedure.

Intuition: Starting with a perturbed image z_pert, which is assumed to be drawn from a distribution q(z), our approach initiates with z(0) = z_pert and gradually introduces noise.
If the aliased image z follows a distribution p(z), then as t → 1, these two distributions get closer. This signifies that the perturbations are progressively diminished by the incremental noise incorporated during the forward process of (7.9). To emphasize this point, we present the following Theorem, whose proof is deferred to the Appendix.

Algorithm 7.2 Diffusion Purification
Input: Perturbed measurements y_pert, operator A, trained DM s_φ, and PST step N_t*.
Function: z_pertᵖᵘʳ = DP_φ(y_pert, A, N_t*).
1: Initialize z(0) = z_pert
2: For i ∈ {1, ..., N_t*}  \\ Diffusion steps
3:   Obtain z(i) ← z(i−1) + √(σ²(i) − σ²(i−1)) η, η ∼ N(0, I)
4: For i ∈ {N_t*, ..., 1}  \\ Purification steps
5:   Obtain z(i−1) ← PCDC(s_φ(z(i), i), y_pert, A, i, i−1)
6: Obtain z_pertᵖᵘʳ = z(0).

Theorem 7.3.1. Let p_t(z) and p₀ₜ(z(t) | z) be the distribution and the conditional distribution of z(t) given that the VE-SDE forward process of (7.6) starts at the unperturbed image z. Similarly, let q_t(z) and q₀ₜ(z(t) | z_pert) be the distribution and the conditional distribution of z(t) given that the VE-SDE forward process of (7.6) starts at the perturbed image z_pert = Aᴴy_pert = Aᴴ(y + δ). Then, as t moves forward from t = 0 to t = 1:

1. The KL divergence between p₀ₜ and q₀ₜ, given in (7.11), monotonically decreases:

$D_{\mathrm{KL}}(p_{0t} \,\|\, q_{0t}) = \frac{\|A^H\delta\|^2}{2(\sigma^2(t) - \sigma^2(0))}, \quad t \in (0, 1]$.   (7.11)

2. The KL divergence between p_t and q_t monotonically decreases, i.e.,

$\frac{d\, D_{\mathrm{KL}}(p_t \,\|\, q_t)}{dt} \le 0$.   (7.12)

It is important to highlight that our Theorem uses the VE-SDE, where the probability distributions are from the standard Bayesian framework of DMs (Ho et al., 2020).

7.3.2 Selection of the Process-Switching Time Step

Here, we present an approximate method to obtain t* < 1 (or N_t* < N_r) based on the Maximum Mean Discrepancy (MMD) metric (Gretton et al., 2006). The MMD metric measures the dissimilarity between two distributions by comparing their mean embeddings in a reproducing kernel Hilbert space. It is commonly employed in machine learning and statistics for various tasks, including domain adaptation (Guan and Liu, 2021) and kernel methods (Hofmann et al., 2008). We utilize the MMD metric to approximately quantify the empirical distribution shift between the original distribution p(z) and the perturbed images' distribution q(z). During the forward diffusion process, let Z(i) and Z_p(i) (with |Z(i)| = |Z_p(i)|) represent the sets of unperturbed and perturbed images, respectively, at discrete time step i, where | · | denotes the cardinality of a set. Since we lack access to the exact distributions, we approximate MMD(p_i, q_i) using empirical distributions as follows:

$\mathrm{MMD}(p_i, q_i) \approx C\Big(\sum_{\substack{z(i), z'(i) \in Z(i) \\ z(i) \neq z'(i)}} k(z(i), z'(i)) + \sum_{\substack{z(i), z'(i) \in Z_p(i) \\ z(i) \neq z'(i)}} k(z(i), z'(i))\Big) - \frac{2}{|Z(i)|^2} \sum_{\substack{z(i) \in Z(i) \\ z'(i) \in Z_p(i)}} k(z(i), z'(i))$,   (7.13)

where C = 1/(|Z(i)|(|Z(i)| − 1)) is used for brevity, and $k(z(i), z'(i)) = \exp(-\|z(i) - z'(i)\|^2 / 2v^2)$ is the Gaussian kernel parameterized by v > 0.
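The empirical estimate (7.13) can be computed directly from the two sample sets. Below is a minimal PyTorch sketch in which `Z` and `Zp` hold flattened images row-wise and `v` is the Gaussian kernel parameter; the diagonal kernel entries are removed to implement the z(i) ≠ z′(i) constraints.

```python
import torch

def mmd_gaussian(Z, Zp, v):
    """Empirical MMD (7.13) between sample sets Z and Zp (each of shape (m, n))."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * v ** 2))

    m = Z.shape[0]
    Kxx, Kyy, Kxy = k(Z, Z), k(Zp, Zp), k(Z, Zp)
    c = 1.0 / (m * (m - 1))
    # within-set terms exclude the diagonal (z != z'); cross term uses 1/m^2
    return (c * (Kxx.sum() - Kxx.diagonal().sum())
            + c * (Kyy.sum() - Kyy.diagonal().sum())
            - (2.0 / m ** 2) * Kxy.sum())
```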
Considering the balance between purifying additive perturbations (achieved with a larger t*) and preserving global structures (achieved with a smaller t*) within perturbed samples, there exists an ideal value of t* that yields robust reconstruction accuracy. In the case of worst-case additive perturbations, the changes are usually small and can be rectified with a small t*. It was shown in (Nie et al., 2022b) that the most efficient choice of t* for adversarial robustness tends to be on the smaller side. As such, our objective is to find the minimum value of i ∈ [N_r] for which MMD(p_i, q_i) ≈ 0. Consequently, we formulate the following optimization problem to determine the near-optimal discrete PST step N_t*:

$N_{t^*} := \arg\min_{i \in [N_r]} \; i \quad \text{s.t.} \quad \mathrm{MMD}(p_i, q_i) = 0$.   (7.14)

To obtain the solution of (7.14), it is required to perform the forward diffusion (steps 2 and 3 in Algorithm 7.2) on the unperturbed and perturbed samples until the constraint is satisfied. Since we have knowledge of the source of perturbations, which allows us to obtain Z_p from Z, we remark that the proposed PST step selection method can be applied to any diffusion purification task.

7.3.3 Fine-tuning with Purified Perturbed Examples

In this subsection, drawing inspiration from the widely used 'pre-training + fine-tuning' approach (Zoph et al., 2020; Salman et al., 2020), we propose fine-tuning the parameters of MoDL, obtained through the process outlined in Section 7.2.1, using contaminated purified examples. We start with the pre-trained parameters θ and utilize noised purified examples for fine-tuning. Let θ_FT represent the fine-tuned parameters specific to MoDL. Initially, we set θ_FT equal to θ. Then, for each measurement y within dataset D, we generate a noisy version of the aliased reconstruction, Aᴴ(y + v), where v is drawn from a normal distribution N(0, σ_FT I). Subsequently, for every (y, x), we follow the procedure outlined in (Aggarwal et al., 2018), while initializing x₀ as

$x_0 = \mathrm{DP}_\phi(y + v, A, N_{t^*})$.   (7.15)

Having trained θ_FT, which maps x₀ to fully-sampled reconstructions, at the testing phase the robust MoDL MRI reconstruction using diffusion purification proceeds as in Algorithm 7.3. A block diagram of the proposed approach is given in Figure 7.2.

Algorithm 7.3 Our Robust MoDL Pipeline
Input: Perturbed measurements y_pert, operator A, trained DM s_φ, PST step N_t*, number of unrolling steps N, and fine-tuned MoDL parameters θ_FT.
Output: Reconstructed image after purification x.
1: Obtain z_pertᵖᵘʳ = DP_φ(y_pert, A, N_t*).
2: Initialize the MoDL reconstructed image as x₀ = z_pertᵖᵘʳ
3: For j ∈ {0, ..., N − 1}  \\ MoDL unrolling steps
4:   Obtain z_j ← f_θFT(x_j)
5:   Obtain x_{j+1} ← (AᴴA + λI)⁻¹(z_pertᵖᵘʳ + λ z_j)
6: Obtain x ← x_N

We emphasize that while our primary focus is on the formulation of MoDL, under which we develop our proposed approach, in the last subsection of our experimental results we demonstrate the versatility of our approach by showcasing its applicability to other DL-based supervised MRI reconstruction models.
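Putting the pieces together, the following is a hedged sketch of Algorithm 7.2 followed by the fine-tuned reconstructor as in Algorithm 7.3; `pcdc_step` (one PC-with-DC reverse step, per Algorithm 7.1), `sigma` (the discrete noise schedule), and `modl_ft` (the fine-tuned unrolled MoDL, taking the purified initialization) are assumed callables rather than the exact interfaces used in the experiments.

```python
import torch

def diffusion_purify(z_pert, y_pert, pcdc_step, sigma, N_tstar):
    """Algorithm 7.2 sketch: forward-diffuse to step N_t*, then reverse with PC+DC."""
    z = z_pert.clone()
    for i in range(1, N_tstar + 1):                 # diffusion stage, Eq. (7.9)
        z = z + (sigma(i) ** 2 - sigma(i - 1) ** 2) ** 0.5 * torch.randn_like(z)
    for i in range(N_tstar, 0, -1):                 # purification stage, Eq. (7.10)
        z = pcdc_step(z, i, y_pert)                 # one PC reverse step with DC
    return z

def robust_pipeline(y_pert, AH, pcdc_step, sigma, N_tstar, modl_ft):
    """Algorithm 7.3 sketch: purify the aliased image, then run fine-tuned MoDL."""
    z_pur = diffusion_purify(AH(y_pert), y_pert, pcdc_step, sigma, N_tstar)
    return modl_ft(z_pur)                           # unrolling initialized at z_pur
```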
7.4 Experimental Results

In this section, we start by illustrating our experimental setup, baselines, and the instability sources and generalization challenges considered in this work. Subsequently, we present results for the process-switching time (PST) step selection through our MMD-based method. Following this, we present the primary results showcasing the robustness of our approach. Furthermore, we present visualizations illustrating knee and brain MRI reconstructions.

7.4.1 Experimental Setup

In the case of MoDL, we employ a configuration with N = 6 unrolling steps and a regularization parameter λ = 1. The architecture of f_θ is selected as the Deep Iterative Down-Up Network (Yu et al., 2019b). Additionally, we set the convergence threshold for the conjugate gradient optimization used in the data consistency step of (7.2) to 10⁻⁶. In the DM setting, t ∈ [0, 1] is discretized into 500 steps. We adopt a pre-trained DM model from (Chung and Ye, 2022), where σ(i) is the geometric series $\sigma(i) = 0.01 \cdot (37800)^{i/(N_r - 1)}$. We note that the DM model was trained on the knee training dataset. We conduct our experiments on the fastMRI dataset (Zbontar et al., 2018), using 3000 purified images for fine-tuning the pre-trained MoDL network. Additionally, 20 images are reserved for validation, and 64 images are used for testing. Moreover, we use σ_FT = 0.01. The multi-coil image data is acquired using 15 coils and is cropped to a resolution of 320 × 320 pixels for MRI reconstruction. To simulate undersampling of the MRI k-space, we adopt a Cartesian mask with 4x acceleration (equivalent to a 25% sampling rate). Sensitivity maps for the coils, which are incorporated into the operator A for all scenarios, are obtained using the BART toolbox (Tamir et al., 2016). Rather than employing the root-sum-of-squares reconstruction method, we apply sensitivity map-based reconstruction. The quality of the reconstructed images is evaluated using the Peak Signal-to-Noise Ratio (PSNR) in dB and the Structural Similarity Index Measure (SSIM), which returns values in [0, 1], with 1 indicating identical images. All experiments are conducted on a single RTX5000 GPU machine.

Baselines: Here, we list the baselines used in our experiments.

Figure 7.3 Selection of the PST step. Top: estimated MMD (using (7.13)) w.r.t. the discrete steps i ∈ [N_r]. Bottom: ablation study comparing with the ground truth.

7.4.1.1 Vanilla DL-based MRI Reconstructors

Here, we consider standalone MoDL and RecurrentVarNet. These are also the models incorporated within our proposed framework.

7.4.1.2 Adversarial Training

For AT, we implemented a 30-step PGD procedure within its minimax formulation.

7.4.1.3 E2E Randomized Smoothing

For E2E-RS, we introduced Gaussian noise with a standard deviation of 0.01, and to perform the smoothing operation, we employed 10 Monte Carlo samplings.

7.4.1.4 Score-MRI

We compare our proposed approach with a diffusion-based method, namely the Score-MRI work in (Chung et al., 2023c).

7.4.1.5 Standalone Diffusion Purification

We report results of using only the diffusion purifier with data consistency (Algorithm 7.2). We use 'DP' to refer to this case.

7.4.1.6 LORAKI

LORAKI is an unsupervised recurrent neural network tailored for MRI reconstruction in k-space, representing a scan-specific method. Here, we utilize the modification in (Akçakaya et al., 2019), for which a publicly available code is used. For generating worst-case additive noise, we use the same approach as in (7.17).

7.4.2 Implementation Details for the Sources of Instabilities & Generalization Settings

7.4.2.1 k-space Additive Noise

Here, we consider additive perturbations applied to the measurements y. Recall that, for example in unrolled MoDL, y is both an input and is used in the conjugate gradients (CG) scheme of the data consistency step. We consider two types of additive noise: a zero-mean complex Gaussian random vector with a variance of 0.01, and worst-case additive perturbations. For the latter, we employed two gradient-based optimization techniques.
The first method is the conventional ℓ∞-norm PGD (Madry et al., 2017) with 30 iterations and a perturbation budget of 𝜖 = 0.004. The second approach utilizes the advanced momentum-based AUTO attack (Croce and Hein, 2020), configured similarly to PGD. To generate perturbations using PGD or AUTO, it is necessary to calculate the gradients w.r.t. the input of our model. In this paper, we consider an additional case where we apply the method from (Nie et al., 2022b) and calculate the gradients propagated through both MoDL and the SDE of the DP. This represents the worst-case additive perturbations w.r.t. the DP and MoDL jointly. In this case, the perturbations are generated as:

$\max_{\|\delta\|_\infty \le \epsilon} \; \mathcal{L}\big(\mathrm{MoDL}_{\theta_{\mathrm{FT}}}(\mathrm{DP}_\phi(A^H y, N_{t^*})),\; \mathrm{MoDL}_{\theta_{\mathrm{FT}}}(\mathrm{DP}_\phi(A^H (y + \delta), N_{t^*}))\big)$.   (7.16)

Worst-case additive noise for AT, E2E-RS, RecurrentVarNet, and LORAKI is generated using the optimization problem in (7.3), with the structure of the network changed accordingly. For Score-MRI and standalone diffusion purification, we use Equations (7.17) and (7.18), respectively, which are modified versions of (7.3):

$\max_{\|\delta\|_\infty \le \epsilon} \; \mathcal{L}\big(\mathrm{PCDC}(s_\phi(z(N_r), N_r), y + \delta, A, N_r, 0),\; \mathrm{PCDC}(s_\phi(z(N_r), N_r), y, A, N_r, 0)\big)$,   (7.17)
$\max_{\|\delta\|_\infty \le \epsilon} \; \mathcal{L}\big(\mathrm{DP}_\phi(A^H y, N_{t^*}),\; \mathrm{DP}_\phi(A^H (y + \delta), N_{t^*})\big)$.   (7.18)

Figure 7.4 Reconstruction accuracy box plots for the knee fastMRI dataset with 4x acceleration factor. The additive Gaussian random noise in the second column of plots is obtained using a variance of 0.01. The worst-case additive noise in the third and fourth columns is obtained using the PGD and AUTO methods with 𝜖 = 0.02.

Figure 7.5 Average inference run-time of our proposed approach and the baselines for the experimental setting of the top-right box plot of Figure 7.4.

7.4.2.2 Training/Testing Sampling Protocol and Undersampling Rate Disparities

Here, we consider two variations in the construction of the forward operator A between the training and testing phases. In other words, we train MoDL with A and evaluate it with a different A_test. The first variation involves using a different acceleration factor (sampling rate), while the second involves shifts in the locations of the k-space samples. In particular, for the first variation, we train MoDL with 4x undersampling and test it with {2x, 3x, 4x, 5x, 6x, 7x, 8x}. For the second variation, we train MoDL using a 4x mask and then evaluate it using various shifted versions of the original mask. Specifically, the central part of the mask (low frequencies) remains constant, whereas the higher-frequency phase encodes are shifted by {5%, 10%, 15%, 20%, 25%}.
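As an illustration of these two test-time variations, the sketch below builds a hypothetical 1-D Cartesian phase-encode mask with a fully sampled center, a target acceleration factor, and an optional shift applied only to the randomly chosen high-frequency lines; the exact mask construction used in our experiments may differ.

```python
import numpy as np

def cartesian_mask(n_pe=320, accel=4, center_frac=0.08, shift_frac=0.0, seed=0):
    """1-D Cartesian phase-encode mask: fully sampled low-frequency band plus
    randomly chosen high-frequency lines, optionally shifted by shift_frac."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_pe, dtype=bool)
    n_center = int(center_frac * n_pe)
    lo = n_pe // 2 - n_center // 2
    mask[lo:lo + n_center] = True                      # fixed central band
    n_keep = max(n_pe // accel - n_center, 0)          # lines left in the budget
    outer = np.setdiff1d(np.arange(n_pe), np.arange(lo, lo + n_center))
    lines = rng.choice(outer, size=n_keep, replace=False)
    lines = (lines + int(shift_frac * n_pe)) % n_pe    # test-time shift (crude)
    mask[lines] = True
    return mask

# e.g., training mask vs. a 25%-shifted test mask at the same 4x acceleration:
train_mask = cartesian_mask(accel=4, shift_frac=0.0)
test_mask = cartesian_mask(accel=4, shift_frac=0.25)
```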
7.4.2.3 Unseen Anatomies & Pathologies at Testing Phase

We evaluate our method's performance in the presence of white nonspecific lesions using the fastMRI+ dataset. In particular, the DL-based image reconstructor is trained on the lesion-free fastMRI dataset and evaluated on the fastMRI+ dataset. Furthermore, we evaluate the performance of the proposed method with testing brain measurements, but wherein the diffusion purifier was pre-trained on knee data (i.e., a different anatomy).

7.4.3 Selection of the PST Step

In this section, we conduct an experiment to evaluate the effectiveness of the proposed MMD-based method in determining the near-optimal PST step, denoted N_t*. The experiment is depicted in Figure 7.3 (top), where we present the MMD values computed using (7.13). Additionally, Figure 7.3 (bottom) displays the results obtained when applying various values of N_t* within our pipeline, with the corresponding PSNR values computed against ground truth images. In this experiment, we calculate the MMD values by setting the Gaussian kernel parameter v as the mean of the magnitudes of the images in the set Z, which comprises images Aᴴy for 20 scans y ∈ D. For the perturbed images, we utilize the worst-case additive perturbations δ calculated from (7.3). Consequently, the set Z_p encompasses Aᴴ(y + δ) for the same measurements used in Z. The results of Figure 7.3 (bottom) show that, in comparison to the ground truth, the optimal PSNR is achieved at N_t* = 150, consistent with the observed approximate MMD values in Figure 7.3 (top). Furthermore, it is evident that although the MMD values for N_t* in the range (150, 500] are also close to zero, the PSNR values begin to deteriorate. This observation aligns with the intuition that increasing the value of N_t* effectively removes perturbations but runs the risk of losing image structure. Consequently, for the remainder of this paper, we adopt N_t* = 150 as our chosen setting. Furthermore, we remark that the number of reverse (purification) steps chosen for our robustification task, 150, is notably lower than the requirement of the diffusion-based image reconstruction task presented in (Chung and Ye, 2022), where 500 steps were used.

Figure 7.6 Robustness evaluation against variations in: (a) acceleration factors, (b) locations of k-space sampling, (c) variance level of the Gaussian random additive noise, and (d) perturbation budget of the worst-case additive disturbances generated by the PGD and AUTO methods.
The 'PGD E2E' and 'AUTO E2E' curves in (d) correspond to the cases of generating end-to-end perturbations while calculating gradients through propagating the DP and MoDL. Furthermore, 'Ours PGD w/out FT' corresponds to the case where no MoDL fine-tuning is applied. This figure is best viewed in color.

Figure 7.7 Visualization of ground-truth and reconstructed images using different methods, evaluated on the knee fastMRI testing set with 8x acceleration factor (PSNR: MoDL 32.28 dB, RS-E2E 31.13 dB, AT 31.07 dB, Score-MRI 30.87 dB, Ours 32.67 dB).

Figure 7.8 Visualization of ground-truth and reconstructed images using different methods, evaluated under PGD-based worst-case additive perturbations with 𝜖 = 0.02 (PSNR, first row: MoDL 22.28 dB, RS-E2E 25.34 dB, AT 29.47 dB, Score-MRI 29.28 dB, Ours 32.88 dB; second row: MoDL 22.23 dB, RS-E2E 24.56 dB, AT 29.25 dB, Score-MRI 29.35 dB, Ours 33.18 dB).

Figure 7.9 Reconstruction accuracy box plots for the brain fastMRI dataset with 4x acceleration factor. The additive Gaussian random noise in the second column of plots is obtained using a variance of 0.01. The worst-case additive noise in the third and fourth columns is obtained using the PGD and AUTO methods with 𝜖 = 0.02.

Figure 7.10 Visualization of ground-truth and reconstructed images using different methods, trained with fastMRI (without lesions) and evaluated on the fastMRI+ dataset (with lesions) (PSNR: MoDL 34.28 dB, RS-E2E 33.13 dB, AT 34.07 dB, Score-MRI 32.27 dB, Ours 34.92 dB).

7.4.4 Robustness Results

7.4.4.1 Robustness to Additive Perturbations

Figure 7.4 presents box plots for a comprehensive view of the performance of our robustification method, as well as that of vanilla MoDL, AT, E2E-RS, LORAKI, and Score-MRI, assessed through PSNR (top) and SSIM (bottom) metrics using the knee dataset. We evaluate these methods across multiple scenarios, including benign aliased images (first plots, top and bottom), images subjected to additive random Gaussian noise with a variance of 0.01 (second plots), and images with additive worst-case perturbations generated using the PGD and AUTO methods with 𝜖 = 0.02 (last two plots). While AT, RS, and Score-MRI show improvements when compared to vanilla MoDL, we observe that, on average, our robustification approach reports the highest values of PSNR and SSIM. For the example of the rightmost plot, our method achieves an average PSNR that is approximately 3 dB higher than Score-MRI and nearly 9 dB higher than vanilla MoDL.
Additionally, the PSNR and SSIM results in the first plots (top and bottom) indicate an improvement from our proposed approach (DP+MoDL) even in the absence of any perturbations. It is important to highlight that although our proposed approach reports the highest PSNR values in terms of reconstruction, it requires a larger inference run-time compared to AT, RS, and vanilla MoDL. In Figure 7.5, we present inference run-times for the setting of the top-right box plot of Figure 7.4. As observed, on average, our method and Score-MRI need nearly 3 minutes per image, whereas the other methods require only 60 seconds or less. The increased run-time is attributed to the application of the proposed diffusion purification prior to the DL-based reconstructor, representing a trade-off.

In Figure 7.6 (c), we present the PSNR values of AT, E2E-RS, DP, Score-MRI, and our approach, evaluated under different levels of added Gaussian noise during testing. Notably, as the noise level (indicated by the variance) increases, the reported PSNR values decrease for all methods. However, our approach consistently reports higher PSNR values when compared to the other baselines across all tested noise levels. For instance, when faced with a variance of 0.05, our method reports nearly 33 dB, whereas the second best (in this case AT) reports a PSNR of 30.5 dB.

In Figure 7.6 (d), we present the PSNR performance of our approach and the considered baselines, evaluated under varying perturbation budgets given by the values of 𝜖. The evaluation encompasses both the PGD and AUTO methods. Additionally, we explore the PGD E2E and AUTO E2E scenarios, which involve generating end-to-end perturbations using (7.16). As the perturbation budget increases, all methods experience a decline in their PSNR values, which is expected. However, we observe that our approach consistently returns the highest PSNR values across the entire range of perturbation budgets. We also observe that employing the E2E attack results in slightly lower PSNR values compared to the case of generating perturbations solely w.r.t. MoDL. Finally, we observe that the AUTO results are marginally lower than those of PGD, which aligns with expectations, since AUTO represents a more advanced approach for generating worst-case additive noise.

Moreover, in Figure 7.6 (d), we illustrate the effect of fine-tuning on the robustness of our method. Specifically, we compare PSNR values for our approach when exposed to PGD-based worst-case additive perturbations under two scenarios: with fine-tuning MoDL using perturbed purified training samples (i.e., f_θFT) and without fine-tuning, relying solely on the pre-trained MoDL (i.e., f_θ). These two cases are represented by the two blue curves in Figure 7.6 (d). The results clearly highlight that the pre-trained + fine-tuned MoDL enhances robustness, as evidenced by the higher PSNR values compared to the pre-trained MoDL. We also note that the results obtained without fine-tuning are slightly higher than those achieved using AT (see the solid green curve in Figure 7.6 (d)). This indicates that MoDL+DP without fine-tuning still exhibits improvements when compared to AT, vanilla MoDL, and RS-E2E.

Figure 7.8 presents a visual comparison of image reconstructions and their associated reconstruction errors within a closely examined region. Each image in the figure includes two inset panels in the bottom-left and bottom-right corners.
The bottom-left inset panel, enclosed within a green bounding box, serves as a reference for the region of interest in the image. In contrast, the bottom-right inset panel depicts an error map relative to the ground truth. Notably, our method stands out in its ability to capture more features from the original image, surpassing the performance of the alternative methods (as also evident from the reported PSNR values).

7.4.4.2 Robustness to Different Sampling Protocols & Undersampling Rates

In Figure 7.6 (a), we illustrate the performance across different acceleration factors. During training, a k-space undersampling or acceleration factor of 4x was employed. However, during testing, we assess performance with various acceleration factors ranging from 2x to 8x. It is evident that when the acceleration factor matches the training phase (4x), all methods exhibit their highest PSNR results compared to when different acceleration factors are used. Nevertheless, when compared to the other methods, our approach consistently reports the highest PSNR values when tested with acceleration factors other than 4x. For instance, at 2x acceleration, AT and E2E-RS report PSNR values of 21 dB or lower, while our approach achieves nearly 32 dB. Additionally, in Figure 7.6 (a), we report results of using LORAKI with different acceleration factors. As observed, LORAKI reports lower PSNR values when compared to our proposed approach.

Table 7.1 Reconstruction accuracy for fastMRI knee data using the testing portion of the dataset with an acceleration of 8x.

Model                  Vanilla MoDL   E2E-RS   AT      Score-MRI   DP+MoDL
PSNR ↑                 33.25          33.12    32.17   33.5        33.67
SSIM ↑                 0.920          0.917    0.913   0.899       0.922
Training Acceleration  8x             8x       8x      4x          4x

Figure 7.6 (b) shows the PSNR values of our proposed approach and the considered baselines, assessed under varying percentages of shifts in the locations of the k-space samples during testing. The shifts were applied to the high-frequency phase-encode locations in the original sampling pattern or mask. This helps in understanding reconstruction robustness when the sampling masks change substantially at a fixed k-space undersampling factor. We observe that as the percentage of shifts increases, the reported PSNR values decrease across all methods. However, our method consistently outperforms the other approaches across all tested percentages, exhibiting the highest PSNR values. For instance, when the mask at testing time contains a 25% shift relative to the training mask, our method achieves 32 dB, whereas all other methods report PSNR values of 31.2 dB or less.

To further underscore the generalization and robustness of our proposed approach, we designed an experiment with different training and testing settings across the different methods. Specifically, we trained the vanilla MoDL, AT, and RS models using an 8x acceleration factor, while our method and Score-MRI were trained with a 4x acceleration factor. Subsequently, we subjected benign measurements to testing with an 8x acceleration factor, aligning with the training settings of MoDL, AT, and RS, rather than 4x. The results, given in Table 7.1, showcase that our method, despite being tested with a different acceleration setting, reports slightly higher PSNR (33.67 dB) and SSIM (0.922) values when compared to the other methods.
Moreover, the visualizations in Figure 7.7 show that, when tested with an 8x acceleration factor despite being trained on 4x, our proposed approach outperforms the considered baselines, for which both the training and testing acceleration factors are 8x.

Table 7.2 Brain fastMRI+ (with lesion) results.

Model     Vanilla MoDL   E2E-RS   AT      Score-MRI   DP+MoDL (Ours)
PSNR ↑    31.25          31.12    30.87   30.22       32.4
SSIM ↑    0.915          0.912    0.910   0.885       0.919

Table 7.3 Brain dataset reconstruction accuracy using the Recurrent Variational Network as our DL-based image reconstructor.

                              Clean Accuracy     Robust (random noise)   Robust (PGD)
Model                         PSNR ↑   SSIM ↑    PSNR ↑   SSIM ↑         PSNR ↑   SSIM ↑
Vanilla RecurrentVarNet       32.89    0.925     33.78    0.91           26.5     0.793
AT+RecurrentVarNet            33.01    0.919     33.19    0.914          31.67    0.892
E2E-RS+RecurrentVarNet        33.12    0.922     33.67    0.915          30.20    0.875
DP+RecurrentVarNet (Ours)     34.33    0.941     34.07    0.938          33.64    0.935

Figure 7.11 Visualization of ground-truth and reconstructed images using the RecurrentVarNet and RecurrentVarNet+DP (Ours) methods, evaluated under PGD-based worst-case additive perturbations with 𝜖 = 0.02 (PSNR: RecurrentVarNet 29.82 dB, E2E-RS 32.82 dB, AT 33.28 dB, Ours 35.78 dB).

7.4.4.3 Robustness to Anatomical Variations

In Figure 7.9, we replicate the experiment conducted in Figure 7.4, this time utilizing the brain dataset. Notably, MoDL underwent fine-tuning using perturbed purified examples sourced from the training set of the brain dataset. When comparing the results of our proposed method with the other approaches, we find that the observations of Figure 7.4 remain consistent. For the PGD case (third column), our method reports an average SSIM of nearly 0.91, whereas vanilla MoDL (the DL-reconstructor considered in this experiment) reports an average SSIM of approximately 0.775. An important point to highlight is that the pre-trained DM employed in our purification stage for this experiment was originally trained exclusively on knee data, without any exposure to brain data. This underscores the strong generalization capabilities of the diffusion purification process within our approach, extending its effectiveness to previously unseen MRI datasets. It is worth mentioning that similar diffusion model generalization capabilities were also observed in the study conducted by Chung et al. (Chung and Ye, 2022). However, further thorough investigation is required to precisely determine the limitations of these generalization capabilities, and this remains a promising direction for future research.

Here, we employ the fastMRI+ dataset to assess our approach's image reconstruction capability, contrasting the outcomes with the relevant baselines. For the training phase, we employ the original fastMRI brain dataset, which excludes lesion cases, as the basis for training all methods. During the testing phase, however, we utilize the lesion dataset. Table 7.2 shows the results, where our method reports the highest PSNR and SSIM values compared to the other baselines. It is important to highlight that, unlike the cases of additive k-space noise and training/testing sampling protocol and undersampling rate disparities, the improvements from utilizing our method with unseen lesions are more modest, as seen from the average PSNR and SSIM results (at least a 1.2 dB PSNR improvement when compared to the second-best results).
Additionally, visualizations are provided in Figure 7.10, where we highlight the nonspecific white matter lesion area. As observed, both visually and in terms of PSNR values, our approach reports improved results when compared to the other baselines.

7.4.5 Applying Our Method to Other DL-based MRI Reconstruction Models

Here, we demonstrate the applicability of our diffusion purification strategy to other DL-based supervised MRI reconstructors. Specifically, we explore the Recurrent Variational Network (RecurrentVarNet) (Yiasemis et al., 2022), presenting results both with and without perturbations, as well as with and without the integration of our diffusion purification technique. The results are summarized in Table 7.3. As depicted in the table, when the standalone RecurrentVarNet (or RecurrentVarNet integrated with AT and/or RS) encounters additive worst-case perturbations in the measurement space, the reported PSNR and SSIM scores (last two columns of the first three rows) experience a significant drop (for example, the vanilla RecurrentVarNet encounters a PSNR drop of nearly 7 dB). However, upon employing our diffusion purification (last row), we observe only a marginal decrease in performance (of 0.69 dB). These findings illustrate that our strategy can be integrated well with general DL-based reconstructors. The visualizations in Figure 7.11 provide additional support for our claim.

7.5 Conclusion

Recent studies have unmasked vulnerabilities in DL-based MRI reconstruction methods, namely, susceptibility to additive perturbations and variations in training/testing settings, such as acceleration factors and k-space sampling patterns. This paper has addressed these challenges by harnessing the power of diffusion models. Our robustification strategy enhanced the resilience of DL-based MRI reconstruction models by integrating pre-trained diffusion models as noise purifiers. Unlike conventional robustification techniques such as adversarial training (AT), our method eliminated the need for complex minimax optimization problems. Instead, it simply requires fine-tuning on perturbed purified examples. Our extensive experiments have illustrated the efficacy of our approach in mitigating different instabilities when compared to utilizing diffusion-based MRI reconstructors and leading robustification methods, including AT and randomized smoothing. We also evaluated the robustness of our approach using an MRI dataset with lesions. Moreover, we illustrated the adaptability of our strategy to multiple reconstruction models. These findings underscore the promise of leveraging diffusion models to enhance the robustness and reliability of DL-based MRI reconstruction, paving the way for more dependable and accurate medical imaging technologies in the future.

7.6 Proof of Theorem 1

Proof of Theorem 1: For the first part, we begin by establishing the result in (7.11). Utilizing the VE-SDE formulation of DMs, the conditional distributions p₀ₜ and q₀ₜ are expressed as per the following equations (Song et al., 2021c):

$p_{0t}(z(t) \mid z) = \mathcal{N}\big(z(t);\, z,\, (\sigma^2(t) - \sigma^2(0))\, I\big)$,   (7.19a)
$q_{0t}(z(t) \mid z_{\mathrm{pert}}) = \mathcal{N}\big(z(t);\, z_{\mathrm{pert}},\, (\sigma^2(t) - \sigma^2(0))\, I\big)$.   (7.19b)

Notably, these two distributions have different means but share the same covariance. Consequently, the KL divergence can be obtained as

$D_{\mathrm{KL}}(p_{0t} \,\|\, q_{0t}) = \frac{1}{2}\Big(\log\frac{\det(\bar{\sigma}^2 I)}{\det(\bar{\sigma}^2 I)} + \mathrm{Tr}\big((\bar{\sigma}^2 I)^{-1}(\bar{\sigma}^2 I)\big) + (z_{\mathrm{pert}} - z)^T (\bar{\sigma}^2 I)^{-1} (z_{\mathrm{pert}} - z) - n\Big)$,
where det(·) (resp. Tr(·)) denotes the determinant (resp. trace) of a matrix, and $\bar{\sigma}^2 = \sigma^2(t) - \sigma^2(0)$ is used for brevity. Since log(1) = 0, the first term is zero. Given the definition of the trace and the properties of the identity matrix, the second term reduces to n and cancels the last term. Since $A^H\delta = z_{\mathrm{pert}} - z$ and $(A^H\delta)^T A^H\delta \ge 0$, Equation (7.11) holds. Subsequently, the numerator in (7.11) is greater than or equal to 0 (it can only be zero if δ = 0) and is not a function of t. Moreover, since $\sigma(t) = \sigma_l(\sigma_u/\sigma_l)^t$, where σ_l ∈ (0, 1) and σ_u > 1 are constants, it is evident that the denominator monotonically increases as t increases. In conclusion, the rate of change of $D_{\mathrm{KL}}(p_{0t} \| q_{0t})$ w.r.t. t (as long as δ ≠ 0) is less than 0. Given that the derivative of σ(t) w.r.t. t is $\frac{d\sigma(t)}{dt} = \sigma_l \log(\sigma_u/\sigma_l)(\sigma_u/\sigma_l)^t$, this is supported by

$\frac{d\, D_{\mathrm{KL}}(p_{0t} \,\|\, q_{0t})}{dt} = -\frac{\|A^H\delta\|^2\, \sigma_l^2 \log(\sigma_u/\sigma_l)\,(\sigma_u/\sigma_l)^{2t}}{\big(\sigma^2(t) - \sigma_l^2\big)^2} < 0$.

This inequality establishes that $D_{\mathrm{KL}}(p_{0t} \| q_{0t})$ monotonically decreases as time travels from t = 0 to t = 1 while employing the forward process defined in (7.6). Consequently, the proof of the first part is complete.

The proof of the second part follows from (Song et al., 2021c) and (Nie et al., 2022b). Using the Fokker-Planck-Kolmogorov representation (Särkkä and Solin, 2019) for the forward process in (7.6), we write

$\frac{d p_t(z)}{dt} = \frac{1}{2}\nabla_z \cdot \Big(p_t(z)\,\frac{d\sigma^2(t)}{dt}\,\nabla_z \log p_t(z)\Big)$,   (7.20a)
$\frac{d q_t(z)}{dt} = \frac{1}{2}\nabla_z \cdot \Big(q_t(z)\,\frac{d\sigma^2(t)}{dt}\,\nabla_z \log q_t(z)\Big)$.   (7.20b)

Employing the definition of the KL divergence, Equation (7.20), integration by parts, and assuming the smoothness and fast decay of p_t(z) and q_t(z), we can derive the derivative of the KL divergence w.r.t. t:

$\frac{d\, D_{\mathrm{KL}}(p_t \,\|\, q_t)}{dt} = -\frac{1}{2}\frac{d\sigma^2(t)}{dt}\, D_{\mathrm{F}}(p_t \,\|\, q_t) \le 0$,   (7.21)

where

$D_{\mathrm{F}}(p_t \,\|\, q_t) = \int p_t(z)\, \|\nabla_z \log p_t(z) - \nabla_z \log q_t(z)\|^2\, dz \ge 0$

denotes the Fisher divergence. Given that $\frac{d\sigma^2(t)}{dt} > 0$, the proof of the second part is thereby established.

CHAPTER 8
STEP-WISE TRIPLE-CONSISTENT DIFFUSION SAMPLING

8.1 Introduction

In the previous chapter, we introduced the diffusion model as a purifier for image reconstruction in order to handle different kinds of noise. However, a key bottleneck of DMs is their computational speed, as they are slower than other generative models due to the large number of sampling steps. Although various methods have been proposed to reduce the sampling frequency (e.g., (Song et al., 2023b)), these improvements have yet to be fully realized for DMs applied to IPs. Most existing methods still require dense sampling, which continues to pose speed challenges.

Contributions: In this chapter, we: (i) identify key issues in accelerating DMs for IPs, (ii) propose three conditions that could fully leverage the information from the measurements and the pre-trained diffusion model to effectively address these issues, and (iii) present a new optimization-based method in the pixel space that satisfies these conditions. We refer to our accelerated sampling method as Step-wise Triple-Consistent Sampling (SITCOM). We evaluate our method on several image restoration tasks: Super Resolution, Box In-painting, Random In-painting, Motion Deblurring, Gaussian Deblurring, Non-linear Deblurring, High Dynamic Range, and Phase Retrieval. Compared to leading baselines, our approach consistently achieves either state-of-the-art or highly competitive quantitative results, while also reducing the number of sampling steps and, consequently, the computational time. See Figure 8.1 for examples.
8.2 Background: Diffusion Models & Their Usage in Solving IPs

Pre-trained Diffusion Models (DMs) generate images by applying a pre-defined iterative denoising process (Ho et al., 2020). In the Variance-Preserving Stochastic Differential Equations (SDEs) setting (Song et al., 2021b,a), DMs are formulated using the forward and reverse processes

$dx_t = -\frac{\beta_t}{2}x_t\,dt + \sqrt{\beta_t}\,dw$, $\quad dx_t = -\beta_t\Big[\frac{1}{2}x_t + \nabla_{x_t}\log p_t(x_t)\Big]dt + \sqrt{\beta_t}\,d\bar{w}$,   (8.1)

where β : {0, ..., T} → (0, 1) is a pre-defined function that controls the amount of additive perturbation at time t, and w (resp. w̄) is the forward (resp. reverse) Wiener process (Anderson, 1982). The term p_t(x_t) is the distribution of x_t at time t, and ∇_{x_t} log p_t(x_t) is the score function, which is replaced by a neural network (typically a time-encoded U-Net (Ronneberger et al., 2015a)) s : Rⁿ × {0, ..., T} → Rⁿ, parameterized by θ.

Figure 8.1 Qualitative results on the FFHQ dataset on two linear tasks (top: super resolution and motion deblurring) and two non-linear tasks (bottom: non-linear deblurring and phase retrieval) under measurement noise of σ_y = 0.05. The PSNR and LPIPS values are given below each restored image (PSNR/LPIPS, super resolution: DPS 24.66/0.251, DAPS 30.22/0.172, SITCOM 32.39/0.156; motion deblurring: DPS 22.98/0.289, DAPS 31.46/0.131, SITCOM 33.26/0.097; non-linear deblurring: DPS 23.12/0.267, DAPS 27.65/0.167, SITCOM 29.22/0.145; phase retrieval: DPS 17.88/0.401, DAPS 30.89/0.118, SITCOM 32.67/0.112). Zoomed-in regions show how SITCOM captures greater image detail when compared to two general (non)linear DM-based methods (DPS (Chung et al., 2023b) and DAPS (Zhang et al., 2024a)).

In practice, given the score function s_θ, the SDEs in (8.1) can be discretized as in (8.2), where η_t, η_{t−1} ∼ N(0, I):

$x_t = \sqrt{1 - \beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\eta_{t-1}$, $\quad x_{t-1} = \frac{1}{\sqrt{1 - \beta_t}}\big[x_t + \beta_t s_\theta(x_t, t)\big] + \sqrt{\beta_t}\,\eta_t$.   (8.2)

When employed to solve inverse problems, the score function in (8.1) is replaced by a conditional score function which, by Bayes' rule, is $\nabla_{x_t}\log p_t(x_t | y) = \nabla_{x_t}\log p_t(x_t) + \nabla_{x_t}\log p_t(y | x_t)$. Solving the SDE in (8.1) with the conditional score is referred to as posterior sampling (Chung et al., 2023b). As there does not exist a closed-form expression for the term $\nabla_{x_t}\log p_t(y | x_t)$ (termed the measurement matching term in (Daras et al., 2024)), previous works have explored different approaches, which we briefly discuss below. We refer the reader to the recent survey in (Daras et al., 2024) for an overview of DM-based methods for solving IPs. A well-known method is Diffusion Posterior Sampling (DPS) (Chung et al., 2023b), which uses the approximation p(y|x_t) ≈ p(y|x̂₀), where x̂₀(x_t) (or simply x̂₀) is the estimated image at time t as a function of the pre-trained model and x_t (Tweedie's formula (Vincent, 2011)), given as

$\hat{x}_0(x_t) = \frac{1}{\sqrt{\bar{\alpha}_t}}\big[x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\big] =: f(x_t; t, \epsilon_\theta)$,   (8.3)

where $\bar{\alpha}_t = \prod_{j=1}^{t}\alpha_j$ and α_t = 1 − β_t. We call the function f, defined in (8.3), the 'Tweedie-network denoiser' (also termed the 'posterior mean predictor' in (Chen et al., 2024)). Here, $\epsilon_\theta(x_t, t) = -\sqrt{1 - \bar{\alpha}_t}\,s_\theta(x_t, t)$ (Luo, 2022) outputs the noise in x_t.
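For concreteness, a minimal sketch of the Tweedie-network denoiser f in (8.3) follows; `eps_theta` (the pre-trained noise-prediction network) and `alpha_bar` (a precomputed tensor of cumulative products ᾱ) are assumed inputs.

```python
import torch

def tweedie_x0(xt, t, eps_theta, alpha_bar):
    """Posterior-mean estimate of x0 from x_t via Tweedie's formula (8.3)."""
    ab = alpha_bar[t]                                  # scalar tensor
    return (xt - (1.0 - ab).sqrt() * eps_theta(xt, t)) / ab.sqrt()
```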
The drawback of these methods is that they require a large number of sampling steps. ReSample (Song et al., 2023a) solves an optimization problem on the estimated posterior mean in the latent space for many steps to enforce measurement consistency, requiring many sampling and optimization steps. The work in (Mardani et al., 2023) introduced RED-Diff, a variational Bayesian method that fits a Gaussian distribution to the posterior distribution of the clean image conditioned on the measurements. This approach involves solving an optimization problem using stochastic gradient descent (SGD) to minimize a data-fitting term while maximizing the likelihood of the reconstructed image under the denoising diffusion prior (as a regularizer). However, the SGD process requires multiple iterations, each involving evaluations of the pre-trained DM on a different noisy image at some randomly selected time, making it quite computationally expensive. Recently, Decoupling Consistency with Diffusion Purification (DCDP) (Li et al., 2024) proposed separating diffusion sampling steps from measurement consistency by using DMs as diffusion purifiers (Nie et al., 2022a; Alkhouri et al., 2024), with the goal of reducing the run-time. However, DCDP requires tuning the number of forward diffusion steps used for purification. Shortly after, Decoupled Annealing Posterior Sampling (DAPS) (Zhang et al., 2024a) introduced another decoupled approach, incorporating gradient descent noise annealing via Langevin dynamics. DAPS, similar to DPS and RED-Diff, also requires a large number of sampling and optimization steps. Under measurement noise, DCDP achieves SOTA run-time across various linear restoration tasks, while DAPS sets the SOTA in restoration quality. Both will serve as primary baselines in our experiments.

8.3 SITCOM: Step-wise Triple-Consistent Sampling
8.3.1 Motivation: Addressing the Challenges in Applying DMs to IPs
Most inverse problems are ill-conditioned and undersampled. DMs, when trained on a dataset that closely resembles the target image, can provide critical information to alleviate ill-conditioning and improve recovery. Despite various previous efforts, a key challenge remains: How to efficiently integrate DMs into the framework of inverse problems? We now elaborate on this challenge in detail.

The standard reverse sampling procedure in DMs consists of applying the backward discrete steps in (8.2) for t ∈ {T, T−1, ..., 1}, forming the standard diffusion trajectory for which x_0 is the generated image. To incorporate the measurements y into these steps, a common approach adopted in previous works that demonstrate superior performance (e.g., (Song et al., 2023a; Zhang et al., 2024a; Li et al., 2024)) is to modify the x̂_0 computed via (8.3) as follows:

$$\hat{\mathbf{x}}'_0(\mathbf{x}_t) = \arg\min_{\mathbf{x}}\; \|\mathcal{A}(\mathbf{x}) - \mathbf{y}\|^2 + \lambda\,\|\mathbf{x} - \hat{\mathbf{x}}_0(\mathbf{x}_t)\|^2\,, \qquad (8.4)$$

where λ ∈ R_+ is a regularization parameter. The x̂'_0(x_t) obtained from (8.4) is close to x̂_0(x_t) while also remaining consistent with the measurements. When using x̂'_0(x_t) to sample x_{t−1}, the second formula in (8.2) can be rewritten as in (8.5), where the derivation is provided in Appendix D.1:

$$\mathbf{x}_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{\mathbf{x}}_0(\mathbf{x}_t) + \sqrt{\beta_t}\,\boldsymbol{\eta}_t\,. \qquad (8.5)$$

By substituting x̂_0(x_t) in (8.5) with the measurement-consistent x̂'_0(x_t), the modified sampling formula becomes:

$$\mathbf{x}_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\hat{\mathbf{x}}'_0(\mathbf{x}_t) + \sqrt{\beta_t}\,\boldsymbol{\eta}_t\,. \qquad (8.6)$$
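As an illustration of this common refinement step, the sketch below approximately solves (8.4) with a few Adam iterations for a generic differentiable forward operator; the toy in-painting operator, iteration count, and step size are assumptions for demonstration (for a linear A, a closed-form or conjugate-gradient solution can be used instead).

```python
import torch

def refine_x0(x0_hat, y, A, lam=1.0, n_steps=50, lr=1e-2):
    """Approximately solve Eq. (8.4):
    x0' = argmin_x ||A(x) - y||^2 + lam * ||x - x0_hat||^2."""
    x = x0_hat.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = (A(x) - y).pow(2).sum() + lam * (x - x0_hat).pow(2).sum()
        loss.backward()
        opt.step()
    return x.detach()

# Toy example with a random in-painting operator (70% of pixels masked).
mask = (torch.rand(1, 3, 64, 64) > 0.7).float()
A = lambda img: mask * img
x0_hat = torch.rand(1, 3, 64, 64)  # posterior-mean estimate from Tweedie's formula
y = A(torch.rand(1, 3, 64, 64))    # observed measurements
x0_refined = refine_x0(x0_hat, y, A)
```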
While this approach effectively ensures data consistency at each step, it inevitably causes x̂'_0 to deviate from the diffusion trajectory, leading to two major issues:

(I1) The image x̂_0(x_t), initially constructed through Tweedie's formula, usually appears quite natural (e.g., columns 3 to 5 of Figure 8.2); however, the modified version, x̂'_0(x_t), is likely to exhibit severe artifacts (e.g., columns 6 to 8 of Figure 8.2).

(I2) Since the DM network ε_θ is trained by minimizing the objective function $\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\,\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t)\|^2$ (denoising score matching (Vincent, 2011)) on a finite dataset, it performs best on noisy images lying in the high-density regions of the training distribution N(x_t; √ᾱ_t x_0, (1 − ᾱ_t)I), x_0 ∼ p(x_0). We define an algorithm as forward-consistent if it applies ε_θ only to in-distribution inputs (i.e., those from the same distribution used for training) with high probability. For example, if the forward diffusion used to train ε_θ adds Gaussian noise, the in-distribution input to ε_θ should ideally be sampled from a Gaussian with specific parameters. If Poisson noise is used in the forward process, inputs drawn from suitable Poisson distributions are more likely to fall within the well-trained region of the network. In summary, forward consistency requires that inputs to ε_θ during sampling align with the forward process. While the x_{t−1} generated from (8.5) is forward-consistent by design, the one generated from the modified formula (8.6) is not. Therefore, in the latter case, the DM network ε_θ may be applied to many out-of-distribution inputs, leading to degraded performance.

We pause to verify the claimed Issue (I1) through a box-inpainting experiment. Columns 6 to 8 of Figure 8.2 show x̂'_0(x_t) at various t. The results clearly demonstrate successful enforcement of data consistency, as the region outside the box aligns with the original image. However, this enforcement compromises the natural appearance of the image, introducing significant artifacts in the reconstructed area inside the box. Details about the setting of the results in Figure 8.2 are given in Appendix D.3.

Issue (I2) was previously observed in (Lugmayr et al., 2022), which proposed a remedy known as 'resampling'. In this approach, the sampling formula in (8.6) is replaced by

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\eta}_t\,. \qquad (8.7)$$

Provided x̂_0 is close to the ground truth x_0, the x_{t−1} generated this way will stay in-distribution with high probability. For a more detailed explanation of the rationale behind this remedy, we refer the reader to (Lugmayr et al., 2022). This method has since been adopted by subsequent works, such as (Song et al., 2023a; Zhang et al., 2024a), and we will also employ it to address (I2).

Figure 8.2 Effects of enforcing backward-consistency in box-inpainting: Results of using Tweedie's formula without measurement consistency (columns 3 to 5), enforcing measurement-consistency via (8.4) (columns 6 to 9), and enforcing both measurement-consistency and backward-consistency via (8.11) (columns 10 to 12) at different time steps t'. Experimental details are given in Appendix D.3.

8.3.2 Network Regularization & Backward Diffusion Consistency
Previous studies, such as (Song et al., 2023a; Zhang et al., 2024a), mitigate issue (I1) by using a large number of sampling steps, which inevitably increases the computational burden. In contrast, this paper proposes employing a network regularization to resolve issue (I1).
This approach not only accelerates convergence but also enhances reconstruction quality. Let us first clarify the underlying intuition. It is widely observed that U-Net architectures and trained transformers exhibit an effective image bias (Ulyanov et al., 2018; Liang et al., 2024a; Ghosh et al., 2024; Hatamizadeh et al., 2024). From columns 3 to 5 of Figure 8.2, we observe that, without enforcing data consistency, the reconstructed x̂_0, derived directly from the Tweedie-network denoiser f(x_t; t, ε_θ) for each time t, exhibits natural textures. This indicates that reconstruction using the combination of Tweedie's formula and the DM network has a natural regularizing effect on the image.

By definition, the output of f(x_t; t, ε_θ) in (8.3) represents the denoised version of x_t at time t using Tweedie's formula and the DM denoiser ε_θ. Due to the implicit bias of ε_θ, this denoised image tends to align with the clean image manifold, even if x_t does not correspond to a training image, as shown in columns 3 to 5 of Figure 8.2. We refer to this regularization effect of f(x_t; t, ε_θ), which arises from network bias, as "network regularization". By employing network regularization, we can address (I1) by ensuring that the data-consistent x̂'_0 is also network-consistent. We refer to the latter condition as Backward Consistency and define it formally as follows.

Definition 1 (Backward Consistency). We say that x̂'_0 is backward-consistent with Tweedie's formula and the DM neural network ε_θ at time t if there exists some v_t such that x̂'_0 = f(v_t; t, ε_θ). In other words, backward consistency requires x̂'_0 to be a 'denoised version' of some noisy image v_t via the Tweedie-network denoiser f at time t.

The subset of images that are in the range of the function f (i.e., backward-consistent) is denoted by C_t and defined as

$$\mathcal{C}_t := \{\, f(\mathbf{v}_t; t, \boldsymbol{\epsilon}_\theta) \;:\; \mathbf{v}_t \in \mathbb{R}^n \,\}\,. \qquad (8.8)$$

Enforcing x̂'_0 to be both measurement- and backward-consistent involves solving the following optimization problem:

$$\hat{\mathbf{x}}'_0, \hat{\mathbf{v}}_t := \arg\min_{\mathbf{v}'_t, \mathbf{x}'_0}\; \Big\{ \|\mathcal{A}(\mathbf{x}'_0) - \mathbf{y}\|_2^2 \;\;\text{subject to}\;\; \mathbf{x}'_0 = f(\mathbf{v}'_t; t, \boldsymbol{\epsilon}_\theta) \Big\}\,. \qquad (8.9)$$

However, (8.9) may violate forward consistency, as v̂_t could possibly be far from x_t. Therefore, we propose adding a regularization term, with which (8.9) becomes

$$\hat{\mathbf{x}}'_0, \hat{\mathbf{v}}_t := \arg\min_{\mathbf{v}'_t, \mathbf{x}'_0}\; \Big\{ \|\mathcal{A}(\mathbf{x}'_0) - \mathbf{y}\|_2^2 + \lambda\,\|\mathbf{x}_t - \mathbf{v}'_t\|_2^2 \;\;\text{subject to}\;\; \mathbf{x}'_0 = f(\mathbf{v}'_t; t, \boldsymbol{\epsilon}_\theta) \Big\}\,. \qquad (8.10)$$

During the reverse sampling process, at each time t, with the given x_t, we seek a v'_t in the nearby region (i.e., ‖x_t − v'_t‖ is small) such that v'_t can be denoised by f to produce a clean image x'_0 (i.e., x'_0 = f(v'_t; t, ε_θ)) that is also consistent with the measurements y (i.e., ‖A(x'_0) − y‖²₂ is small). We need to identify such a v'_t because x_t itself cannot be directly denoised by f to yield an image consistent with the measurements. By substituting the constraint into the objective function, the optimization problem in (8.10) reduces to

$$\hat{\mathbf{v}}_t := \arg\min_{\mathbf{v}'_t}\; \Big\{ \|\mathcal{A}\big(f(\mathbf{v}'_t; t, \boldsymbol{\epsilon}_\theta)\big) - \mathbf{y}\|_2^2 + \lambda\,\|\mathbf{x}_t - \mathbf{v}'_t\|_2^2 \Big\}\,, \qquad \hat{\mathbf{x}}'_0 = f(\hat{\mathbf{v}}_t; t, \boldsymbol{\epsilon}_\theta)\,. \qquad (8.11)$$

The benefit of the considered backward consistency constraint is shown in columns 10 to 12 of Figure 8.2. After obtaining x̂'_0, the resampling formula in (8.7) is used to obtain x_{t−1}.
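The following is a minimal sketch of how (8.11) can be solved by optimizing over the network input with automatic differentiation. Here, `eps_theta` is a hypothetical placeholder for the pre-trained DM network, and the loop settings are illustrative; a sketch of the full per-step procedure appears with Algorithm 8.1 below.

```python
import torch

def eps_theta(v, t):                 # hypothetical pre-trained noise predictor
    return torch.zeros_like(v)       # placeholder output

def backward_consistent_x0(x_t, y, A, t, abar_t, lam=1.0, K=30, lr=1e-2):
    """Solve Eq. (8.11): optimize the network *input* v_t so that the
    Tweedie-denoised image f(v_t; t, eps_theta) matches the measurements,
    while v_t stays close to x_t."""
    v = x_t.clone().requires_grad_(True)     # initialize v_t at x_t
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(K):
        opt.zero_grad()
        x0 = (v - (1 - abar_t) ** 0.5 * eps_theta(v, t)) / abar_t ** 0.5
        loss = (A(x0) - y).pow(2).sum() + lam * (x_t - v).pow(2).sum()
        loss.backward()
        opt.step()
    v_hat = v.detach()
    x0_hat = (v_hat - (1 - abar_t) ** 0.5 * eps_theta(v_hat, t)) / abar_t ** 0.5
    return x0_hat, v_hat
```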
8.3.3 Triple Consistency Conditions
We now summarize the three key conditions that apply at each sampling step.

C1 Measurement Consistency: The reconstruction x̂'_0 is consistent with the measurements, i.e., A(x̂'_0) ≈ y.

C2 Backward Consistency: The reconstruction x̂'_0 is a denoised image produced by the Tweedie-network denoiser f. More generally, we define backward consistency to include any form of DM network regularization (e.g., using the DM probability-flow (PF) ODE (Karras et al., 2022)) applied to x̂'_0.

C3 Forward Consistency: The pre-trained DM network ε_θ is provided with in-distribution inputs with high probability. To ensure this, we apply the resampling formula in (8.7) and enforce that v̂_t remains close to x_t.

We emphasize that C1-C3 aim to ensure that all intermediate reconstructions x̂'_0(x_t) (with t > 0) are as accurate as possible, allowing us to effectively reduce the number of sampling steps. If reducing sampling steps is not necessary, these conditions become less critical, as the final reconstruction at t = 0 can still be accurate with a large number of sampling steps, even if the intermediate reconstructions are less precise. Previous works, such as (Song et al., 2023a; Zhang et al., 2024a), enforce measurement consistency by applying A(x̂_0) = y exactly, whereas DPS (Chung et al., 2023b) does not ensure consistency along the diffusion trajectory.

8.3.4 The Proposed Sampler
Given x_t and ε_θ, and towards satisfying the above conditions, our method, at sampling time t, consists of the following three steps:

$$\hat{\mathbf{v}}_t := \arg\min_{\mathbf{v}'_t}\; \Big\|\mathcal{A}\Big(\frac{1}{\sqrt{\bar{\alpha}_t}}\big[\mathbf{v}'_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{v}'_t, t)\big]\Big) - \mathbf{y}\Big\|_2^2 + \lambda\,\|\mathbf{x}_t - \mathbf{v}'_t\|_2^2\,, \qquad (S1)$$
$$\hat{\mathbf{x}}'_0 = f(\hat{\mathbf{v}}_t; t, \boldsymbol{\epsilon}_\theta) \equiv \frac{1}{\sqrt{\bar{\alpha}_t}}\big[\hat{\mathbf{v}}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\hat{\mathbf{v}}_t, t)\big]\,, \qquad (S2)$$
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}'_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\eta}_t\,, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\,. \qquad (S3)$$

The minimization in the first step optimizes over the input v'_t of the pre-trained diffusion model at time t, where the first term of the objective enforces measurement consistency for the posterior-mean estimated image, satisfying condition C1. The second term serves as a regularization term, implicitly promoting closeness between v̂_t and x_t (i.e., condition C3), with λ > 0 acting as the regularization parameter. The argument of the forward operator in (S1) and the second step, (S2), enforce that v̂_t and x̂'_0, respectively, maintain the diffusion trajectory by obeying Tweedie's formula, thereby satisfying the backward consistency condition, C2. After obtaining the measurement-consistent estimate x̂'_0, as given in (S2), it must be mapped back to time t − 1 to generate x_{t−1}. This is achieved through the forward diffusion step in (S3), as outlined in the forward consistency condition, C3. A diagram of the SITCOM procedure is provided in Figure 8.3 (left).

Figure 8.3 Illustrative diagram of the proposed procedure in SITCOM (left). Conceptual illustration of SITCOM, where M_t is the DM generative manifold at time t and C_t is the subset of images that are backward-consistent, defined in (8.8) (right). Step (1) (solid arrow), Step (2) (dotted arrow), and Step (3) (dashed arrow) correspond to (S1), (S2), and (S3), respectively.

Remark 5. Obtaining the estimated image at time 0 given some x_t using the standard DM PF-ODE (Karras et al., 2022) is more accurate compared to the one-step Tweedie's formula. However, since the PF-ODE is an iterative procedure, it requires more computational time. In SITCOM, the PF-ODE could replace Tweedie's formula in (S2).
Nevertheless, we chose not to use it, as this would increase the run time, and our empirical results are already highly competitive using Tweedie's formula.

A conceptual illustration of SITCOM is shown in Figure 8.3 (right). The DM generative manifold, M_t, is defined as the set of all x_t sampled from q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t)I), with x_0 ∼ p_0(x). This set coincides with the entire space R^n equipped with the probability measure induced by the distribution of x_t, which we denote as P_t. In Figure 8.3 (right), the variation of color around each M_t indicates the concentration of the measure P_t, with darker colors representing higher concentration. SITCOM's Step (1) and Step (2) enforce measurement consistency and backward consistency, and thus map x_t to x̂'_0 = f(v̂_t; t, ε_θ), which lies within the intersection of (i) the measurement-consistent set {x̂'_0 : A(x̂'_0) ≈ y} (the shaded black line) and (ii) the backward-consistent set C_t (the yellow ellipsoid) defined in (8.8). Subsequently, x_{t−1} is generated by inserting x̂'_0 into the resampling formula, which enforces the forward consistency.

Handling Measurement Noise: To avoid the case where the first term of the objective in (S1) reaches small values, yielding noise overfitting (i.e., when additive Gaussian noise with σ_y > 0 is considered), we propose refraining from enforcing strict measurement fitting A(x) = y. Instead, we use the stopping criterion

$$\Big\|\mathcal{A}\Big(\frac{1}{\sqrt{\bar{\alpha}_t}}\big[\mathbf{v}'_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{v}'_t, t)\big]\Big) - \mathbf{y}\Big\|_2^2 < \delta^2\,,$$

where δ ∈ R_+ is a hyper-parameter that indicates the level of tolerance for noise and helps prevent overfitting. This is equivalent to enforcing an ℓ2 constraint, and is in spirit similar to (Wang et al., 2024). Since the noise level cannot be accurately estimated, in our experiments we use a δ that is slightly larger than the actual level of noise in the measurements, i.e., δ > σ_y √m.

8.3.5 SITCOM with Arbitrary Stepsizes
In this subsection, we explain how to apply SITCOM with a large stepsize and present the final algorithm. The pre-trained DM is trained with T diffusion steps. Given that our method is designed to satisfy measurement and diffusion consistency, SITCOM requires only N ≪ T sampling iterations, using a step size of Δt := ⌊T/N⌋. Thus, we introduce the index i instead of t, with the relation t = iΔt. The procedure of SITCOM is outlined in Algorithm 8.1.

Algorithm 8.1 Step-wise Triple-Consistent Sampling (SITCOM).
Input: Measurements y, forward operator A(·), pre-trained DM ε_θ(·, ·), number of diffusion steps N, DM noise schedule ᾱ_i for i ∈ {1, ..., N}, number of gradient updates K, stopping criterion δ, learning rate γ, and regularization parameter λ.
Output: Restored image x̂.
Initialization: x_N ∼ N(0, I), Δt = ⌊T/N⌋.
1: For each i ∈ {N, N−1, ..., 1}: (reducing diffusion sampling steps)
2:   Initialize v_i^(0) ← x_i. (initialization to ensure closeness: C3)
3:   For each k ∈ {1, ..., K}: (gradient updates for measurement & backward consistency: C1, C2)
4:     v_i^(k) = v_i^(k−1) − γ ∇_{v_i} [ ‖A((1/√ᾱ_i)[v_i − √(1 − ᾱ_i) ε_θ(v_i, iΔt)]) − y‖²₂ + λ‖x_i − v_i‖²₂ ], evaluated at v_i = v_i^(k−1).
5:     If ‖A((1/√ᾱ_i)[v_i^(k) − √(1 − ᾱ_i) ε_θ(v_i^(k), iΔt)]) − y‖²₂ < δ²: (stopping criterion)
6:       Break the For loop in step 3. (preventing noise overfitting)
7:   Assign v̂_i ← v_i^(k). (backward diffusion consistency of v̂_i: C2)
8:   Obtain x̂'_0 = f(v̂_i; iΔt, ε_θ) = (1/√ᾱ_i)[v̂_i − √(1 − ᾱ_i) ε_θ(v̂_i, iΔt)]. (backward consistency of x̂'_0: C2)
9:   Obtain x_{i−1} = √ᾱ_{i−1} x̂'_0 + √(1 − ᾱ_{i−1}) η_i, η_i ∼ N(0, I). (forward diffusion consistency: C3)
10: Restored image: x̂ = x_0.

As inputs, SITCOM takes y, A(·), ε_θ, the number of sampling steps N, ᾱ_i for all i ∈ {1, ..., N}, the number of optimization steps K per sampling step, the stopping criterion δ, and the learning rate γ. Starting by initializing v_i^(0) as x_i (satisfying condition C3), lines 3 through 6 correspond to the first step of SITCOM, where (S1) is solved via either gradient descent (as shown in the algorithm) or the Adam optimizer (Kingma and Ba, 2015b). In lines 5 and 6, the stopping criterion is applied to prevent strict data fidelity (avoiding noise overfitting). Following the gradient updates in the inner loop, v̂_i is obtained in line 7, which is then used in line 8 to obtain x̂'_0 as specified in (S2),
satisfying condition C2. Note that line 8 requires no additional computation, as the x̂'_0 calculated there was already obtained while checking the stopping condition in line 6. After obtaining the doubly consistent x̂'_0, resampling is applied to map the image back to time t − 1 while ensuring that x_{t−1} is in-distribution, as indicated in line 9 of the algorithm. In the next iteration, the requirement that v̂_{t−1} is close to x_{t−1} ensures that the input v̂_{t−1} to the DM network ε_θ is also in-distribution, thus satisfying the forward-consistency condition, C3.

The computational requirements of SITCOM are determined by (i) the number of sampling steps N and (ii) the number of gradient steps K required for each sampling iteration. Given the proposed stopping criterion, this results in at most NK Number of Function Evaluations (NFEs) of the pre-trained model (forward passes), NK backward passes through the pre-trained model, and NK applications each of the forward operator and its adjoint to solve the optimization problem in (S1). With early stopping, the computational cost is lower. For example, for a linear operator A with dimensions m × n, the cost of applying it (or its adjoint) to a vector is O(mn). For a network of width M and depth L, the cost of a forward pass is O(LM²). The gradients are computed w.r.t. the input of the DM network, requiring an additional backward pass, which has the same computational cost as the forward pass. Consequently, this procedure is significantly more efficient than network training, where the network weights are updated instead of the input.
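For concreteness, below is a compact PyTorch-style sketch of Algorithm 8.1. The placeholder `eps_theta` stands in for the pre-trained noise-prediction network, the defaults (N = 20, K = 30, γ = 0.01, λ = 0) mirror the settings reported in Section 8.4, and Adam is used in place of plain gradient descent, as the algorithm permits; the stopping tolerance δ is a task-dependent assumption.

```python
import torch

def eps_theta(v, t):                 # hypothetical pre-trained DM network
    return torch.zeros_like(v)       # placeholder output

def sitcom(y, A, alpha_bar, T=1000, N=20, K=30, delta=0.05,
           gamma=1e-2, lam=0.0, shape=(1, 3, 256, 256)):
    """Sketch of Algorithm 8.1 (SITCOM). alpha_bar is the length-T tensor of
    cumulative products of (1 - beta_t) for the schedule the DM was trained with."""
    dt = T // N                                   # step size Delta_t
    x = torch.randn(shape)                        # x_N ~ N(0, I)
    for i in range(N, 0, -1):
        t = i * dt - 1
        abar = alpha_bar[t]
        v = x.clone().requires_grad_(True)        # initialize v_i at x_i (C3)
        opt = torch.optim.Adam([v], lr=gamma)
        for _ in range(K):                        # measurement & backward consistency (C1, C2)
            opt.zero_grad()
            x0 = (v - (1 - abar).sqrt() * eps_theta(v, t)) / abar.sqrt()
            data_fit = (A(x0) - y).pow(2).sum()
            if data_fit.item() < delta ** 2:      # stopping criterion: avoid noise overfitting
                break
            (data_fit + lam * (x - v).pow(2).sum()).backward()
            opt.step()
        with torch.no_grad():
            v_hat = v.detach()
            x0_hat = (v_hat - (1 - abar).sqrt() * eps_theta(v_hat, t)) / abar.sqrt()  # (S2)
            abar_prev = alpha_bar[t - dt] if i > 1 else torch.tensor(1.0)
            x = abar_prev.sqrt() * x0_hat + (1 - abar_prev).sqrt() * torch.randn_like(x0_hat)  # (S3)
    return x
```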
8.3.6 Relation with Existing Approaches
While SITCOM and DPS (Chung et al., 2023b) both use Tweedie's formula, there are two major differences. First, DPS does not enforce backward consistency: it performs only one gradient descent step of the optimization in (S1), whereas our method performs multiple steps, initialized with x_t. Second, DPS does not enforce forward diffusion consistency; namely, it does not use resampling (S3). This means that DPS does not enforce C1-C3 step-wise.

Both SITCOM and the works in (Song et al., 2023a; Zhang et al., 2024a) are optimization-based methods that modify the sampling steps to enforce measurement consistency, and both involve mapping back to time t − 1 (as in step 3 of SITCOM). However, there is a major difference between them: the optimization variable in those works is the estimated image at time t (the output of the DM network), whereas in SITCOM, it is the noisy image at time t (the input of the network). This means that those studies enforce C1 and C3, but not C2.

8.4 Experimental Results
Tasks: Our experimental setup for IPs and noise levels largely follows DPS (Chung et al., 2023b). For linear IPs, we evaluate five tasks: super resolution, Gaussian deblurring, motion deblurring, box in-painting, and random in-painting. For Gaussian deblurring and motion deblurring, we use 61×61 kernels with standard deviations of 3 and 0.5, respectively. In the super-resolution task, a bicubic resizer downscales images by a factor of 4. For box in-painting, a random 128×128 box is applied to mask image pixels, and for random in-painting, the mask is generated with each pixel masked with probability 0.7, as described in (Song et al., 2023a). For non-linear IPs, we consider three tasks: phase retrieval, high dynamic range (HDR) reconstruction, and non-linear (non-uniform) deblurring. For phase retrieval, an oversampling rate of 2 is applied in the frequency domain, and we report the best result out of four independent samples, consistent with (Chung et al., 2023b; Zhang et al., 2024a) (see Appendix D.4 for more discussion on phase retrieval). In HDR reconstruction, the goal is to restore a higher dynamic range image from a lower dynamic range image (with a factor of 2). Non-linear deblurring follows the setup in (Tran et al., 2021). For measurement noise, we use σ_y ∈ {0.01, 0.05} for all tasks.

Baselines & Datasets: As baselines, in this section, we use DPS (Chung et al., 2023b), DDNM (Wang et al., 2022), DCDP (Li et al., 2024), and DAPS (Zhang et al., 2024a). The selection criterion is these baselines' competitive performance on several linear and non-linear inverse problems under measurement noise. Additionally, we provide comparison results with three other baselines in Table D.2 of Appendix D.5. We evaluate SITCOM and the baselines using 100 test images from the validation set of FFHQ (Karras et al., 2019) and 100 test images from the validation set of ImageNet (Deng et al., 2009), for which the FFHQ-trained and ImageNet-trained DMs are given in (Chung et al., 2023b) and (Dhariwal and Nichol, 2021), respectively, following previous convention. For evaluation metrics, we use PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018).

SITCOM Settings: For Algorithm 8.1, we set N = 20 and K = 30 for most tasks. We show the impact of N and K in Appendix D.6.1. The parameter λ is set to 0 for all tasks other than phase retrieval, where we use λ = 1, following the ablation study in Appendix D.6.2. The impact of the stopping criterion under the noisy setting is given in Appendix D.6.3. The learning rate for (S1) is set to γ = 0.01 across all measurement noise levels, datasets, and tasks. Table D.5 in Appendix D.6.4 lists all the hyper-parameters used for every task. We note that the same set of hyper-parameters is used for both the FFHQ and ImageNet datasets. Our code is available online.
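As an example of the measurement models described above, the snippet below synthesizes noisy random in-painting measurements (each pixel masked with probability 0.7, σ_y = 0.05); the ground-truth image here is a random placeholder.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1, 3, 256, 256)                   # ground-truth image in [0, 1] (placeholder)

# Random in-painting: each pixel is masked with probability 0.7.
mask = (torch.rand(1, 1, 256, 256) > 0.7).float()
A = lambda img: mask * img

sigma_y = 0.05                                   # measurement noise level
y = A(x) + sigma_y * torch.randn_like(x)         # noisy measurements
```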
Main Results: In Table 8.1, we present the quantitative results in terms of the average PSNR, SSIM, LPIPS, and run-time (minutes). For each method, the first four result columns correspond to the FFHQ dataset and the last four to the ImageNet dataset. The table covers 8 tasks, 4 evaluation metrics, and 2 datasets, totaling 64 comparisons. Among these, SITCOM reports the best performance in 58 out of 64 cases.

Task / Method        FFHQ: PSNR(↑)  SSIM(↑)      LPIPS(↓)     Time(↓)    |  ImageNet: PSNR(↑)  SSIM(↑)      LPIPS(↓)     Time(↓)

Super Resolution 4×
  DPS                24.44±0.56  0.801±0.032  0.26±0.022   1.26±0.52  |  23.86±0.34  0.76±0.041   0.357±0.069  2.38±1.02
  DAPS               29.24±0.42  0.851±0.024  0.135±0.039  1.24±0.22  |  25.67±0.73  0.802±0.045  0.256±0.067  2.16±0.45
  DDNM               28.02±0.78  0.842±0.034  0.197±0.034  1.07±0.42  |  23.96±0.89  0.767±0.045  0.475±0.044  1.27±0.55
  DCDP               27.88±1.34  0.825±0.07   0.211±0.05   0.52±0.34  |  24.12±1.24  0.772±0.000  0.351±0.00   1.45±0.00
  SITCOM (ours)      30.68±1.02  0.867±0.045  0.142±0.056  0.45±0.58  |  26.35±1.21  0.812±0.021  0.232±0.038  1.12±0.52

Box In-Painting
  DPS                23.20±0.89  0.754±0.023  0.196±0.032  1.57±0.55  |  19.78±0.78  0.691±0.052  0.312±0.025  2.28±1.02
  DAPS               24.17±1.02  0.787±0.032  0.135±0.032  1.35±0.45  |  21.43±0.40  0.736±0.020  0.218±0.021  2.54±1.02
  DDNM               24.37±0.45  0.792±0.024  0.232±0.026  1.02±0.032 |  21.64±0.66  0.732±0.028  0.319±0.015  1.45±1.02
  DCDP               23.66±1.67  0.762±0.07   0.144±0.05   0.56±0.25  |  20.45±1.22  0.712±0.07   0.298±0.04   1.127±0.25
  SITCOM (ours)      24.68±0.78  0.801±0.042  0.121±0.08   0.35±0.25  |  21.88±0.92  0.742±0.032  0.214±0.021  1.12±0.35

Random In-Painting
  DPS                28.39±0.82  0.844±0.042  0.194±0.021  1.52±0.30  |  24.26±0.42  0.772±0.02   0.326±0.034  2.27±0.25
  DAPS               31.02±0.45  0.902±0.015  0.098±0.017  1.56±0.40  |  28.44±0.45  0.872±0.024  0.135±0.052  2.14±0.45
  DDNM               29.93±0.67  0.889±0.032  0.122±0.056  1.45±0.35  |  29.22±0.55  0.912±0.034  0.191±0.048  1.54±0.52
  DCDP               28.59±0.95  0.852±0.06   0.202±0.04   0.55±0.25  |  26.22±1.13  0.791±0.06   0.289±0.03   1.44±0.34
  SITCOM (ours)      32.05±1.02  0.909±0.09   0.095±0.025  0.45±0.50  |  29.60±0.78  0.915±0.028  0.127±0.039  1.14±0.45

Gaussian Deblurring
  DPS                25.52±0.78  0.826±0.052  0.211±0.017  1.50±0.50  |  21.86±0.45  0.772±0.08   0.362±0.034  2.55±0.45
  DAPS               29.22±0.50  0.884±0.056  0.164±0.032  1.40±0.52  |  26.12±0.78  0.832±0.092  0.245±0.022  2.23±0.52
  DDNM               28.22±0.52  0.867±0.056  0.216±0.042  1.56±0.45  |  28.06±0.52  0.879±0.072  0.278±0.089  1.75±0.63
  DCDP               26.67±0.78  0.835±0.08   0.196±0.04   0.56±0.23  |  23.24±1.18  0.781±0.06   0.343±0.04   1.34±0.43
  SITCOM (ours)      30.25±0.89  0.892±0.032  0.135±0.078  0.46±0.25  |  27.40±0.45  0.854±0.045  0.236±0.039  1.10±0.42

Motion Deblurring
  DPS                23.40±1.42  0.737±0.024  0.270±0.025  2.40±0.55  |  21.86±2.05  0.724±0.022  0.357±0.032  2.56±0.40
  DAPS               29.66±0.50  0.872±0.027  0.157±0.012  1.86±0.12  |  27.86±1.20  0.862±0.032  0.196±0.021  2.3±0.45
  SITCOM (ours)      30.34±0.67  0.902±0.037  0.148±0.041  0.5±0.45   |  28.65±0.34  0.876±0.021  0.189±0.036  1.48±0.35

Phase Retrieval
  DPS                17.34±2.67  0.67±0.045   0.41±0.08    1.50±0.34  |  16.82±1.22  0.64±0.08    0.447±0.032  2.17±0.24
  DAPS               30.67±3.12  0.908±0.041  0.122±0.084  1.34±0.78  |  25.76±2.33  0.797±0.045  0.255±0.095  2.24±0.25
  DCDP               28.52±2.50  0.892±0.19   0.167±0.92   3.30±0.45  |  24.25±2.25  0.778±0.14   0.287±0.089  3.49±0.52
  SITCOM (ours)      30.97±3.10  0.915±0.064  0.112±0.102  0.52±0.34  |  25.45±2.78  0.808±0.065  0.246±0.088  1.40±0.40

Non-Uniform Deblurring
  DPS                23.42±2.15  0.757±0.042  0.279±0.067  1.55±0.44  |  22.57±0.67  0.778±0.067  0.310±0.102  2.35±0.45
  DAPS               28.23±1.55  0.833±0.052  0.155±0.041  1.42±0.41  |  27.65±1.2   0.822±0.056  0.169±0.044  2.14±0.45
  DCDP               28.78±1.44  0.827±0.08   0.162±0.04   3.30±0.45  |  26.56±1.09  0.803±0.06   0.182±0.05   3.70±0.36
  SITCOM (ours)      30.12±0.68  0.902±0.042  0.145±0.037  0.52±0.45  |  28.78±0.79  0.832±0.056  0.16±0.048   1.25±0.45

High Dynamic Range
  DPS                22.88±1.25  0.722±0.056  0.264±0.089  1.45±0.34  |  19.33±1.45  0.688±0.067  0.503±0.132  2.42±0.46
  DAPS               27.12±0.89  0.825±0.056  0.166±0.078  1.25±0.35  |  26.30±1.02  0.792±0.046  0.177±0.089  2.18±0.55
  SITCOM (ours)      27.98±1.06  0.832±0.052  0.158±0.032  0.52±0.30  |  26.97±0.87  0.821±0.045  0.167±0.052  1.54±0.35

Table 8.1 Average PSNR, SSIM, LPIPS, and run-time (minutes) of SITCOM and baselines using 100 test images from the FFHQ dataset and 100 test images from the ImageNet dataset, with a measurement noise level of σ_y = 0.05. The results for the σ_y = 0.01 case are given in Table D.1 of Appendix D.5. The first five tasks are linear, while the last three tasks are non-linear. Values after ± represent the standard deviation. All results were obtained using a single RTX5000 GPU machine. For phase retrieval, the run-time is reported for the best result out of four independent runs; this applies to SITCOM and all baselines. More discussion about phase retrieval is given in Appendix D.4.

On average, SITCOM demonstrates strong reconstruction capabilities across most tasks. For the FFHQ dataset, SITCOM reports a PSNR improvement of over 1 dB in Super Resolution, Random In-painting, and Gaussian Deblurring compared to the second-best method. On ImageNet, we observe more than a 1 dB improvement in Random In-painting. Other than ImageNet Gaussian Deblurring and ImageNet Phase Retrieval, for which we under-perform by 0.66 dB and 0.31 dB, respectively, our PSNR improvements over the second-best results are less than 1 dB. However, in terms of run-time, SITCOM consistently requires less computational time across all tasks. For FFHQ, SITCOM is over 3× faster in Box In-painting and Motion Deblurring, and more than 2× faster in the remaining tasks, whereas on ImageNet, the run-time improvement ranges from 36 seconds (for HDR) to 62.4 seconds (for Super Resolution) when compared to DPS, DDNM, and DAPS. For linear tasks, SITCOM requires slightly less run-time than DCDP on both datasets. However, across the two datasets, SITCOM achieves PSNR improvements of more than 1 dB, 2 dB, and 3 dB for the tasks of box in-painting, super resolution, and random in-painting (and Gaussian deblurring), respectively, as compared to DCDP. For non-linear tasks, SITCOM not only provides PSNR improvements over DCDP but also significantly reduces run-time. In summary, the results in Table 8.1 demonstrate that SITCOM either provides a notable improvement in restoration quality (e.g., cases where we report PSNR improvements of over 1 dB) or delivers comparable results to the baselines, all while significantly reducing computation time. In Appendix D.5, we present the results for the σ_y = 0.01 case (Table D.1). Additionally, Table D.2 includes quantitative results for three more baselines. In addition to the FFHQ restored images in Figure 8.1, we also provide additional samples from both datasets in the figures in Appendix D.8.

8.5 Conclusion
In this chapter, we proposed three conditions to achieve measurement- and diffusion-consistent trajectories for linear and non-linear inverse imaging problems using diffusion models (DMs) as priors. These conditions form the basis of our optimization-based sampling method, which optimizes the input of the diffusion model at each step. This approach allows for greater control over the diffusion process and enhances data consistency with the given measurements.
Through extensive experiments across eight image restoration tasks, we evaluated the effectiveness of our method. The results showed that our sampler consistently delivers improved or comparable quantitative performance relative to state-of-the-art baselines, even under measurement noise. Notably, our method is efficient, requiring significantly less run-time than leading baselines, making it practical for real-world applications.

CHAPTER 9
CONCLUSION

This chapter lists some of the possible extensions to the work presented in this thesis.

• In Chapter 3, we examined supervised learning of deep unrolled networks at reconstruction time for MRI by exploiting training sets along with local modeling and clustering. We intend to expand our studies in the future by incorporating non-Cartesian undersampling patterns, such as radial and spiral patterns, as well as deploying them in 3D settings and other imaging modalities.

• Additionally, the method's generalizability will be further examined, with a particular emphasis on heterogeneous datasets. To handle more extreme training-test data variations, such as unseen anatomies, we plan to explore patch-based neighbors in local learning schemes in future work.

• In Chapter 4, we introduced a self-guided deep image prior-based MRI reconstruction technique that iteratively optimizes the network input while also training the model to be robust to large random perturbations of its input. This was achieved by introducing a new regularization term that encourages the reconstructor to act as a denoiser. However, the main disadvantage is the time cost associated with gradient updates for the network input.

• In Chapter 5, to address this limitation, we proposed aSeqDIP, which relies solely on a sequential update of the network parameters. These parameters are optimized using an input-adaptive data consistency objective combined with autoencoding regularization, effectively mitigating noise overfitting. For future directions, we aim to explore the applicability of aSeqDIP to other image recovery problems, thereby expanding its versatility and potential impact across diverse domains. Additionally, we are interested in investigating the integration of a network input update mechanism to dynamically adjust the autoencoding regularization parameter and the number of gradient updates per iteration. An analysis of the convergence of self-guided DIP and aSeqDIP is also needed and is left for future work.

• In Chapter 6, we proposed a scheme for improving the robustness of DL-based MRI reconstruction. In particular, we investigated deep unrolled reconstruction's weaknesses in robustness against worst-case or noise-like additive perturbations, sampling rates, and unrolling steps. To improve the robustness of the unrolled scheme, we proposed SMUG with a novel unrolled smoothing loss. In future work, we hope to apply the proposed schemes to other imaging modalities and to evaluate robustness against additional types of realistic perturbations. While we theoretically characterized the robustness error for SMUG, we hope to further analyze its accuracy-robustness trade-off under perturbations.

• In Chapter 7, we addressed the challenge of unseen noise by harnessing the power of diffusion models. Our robustification strategy enhanced the resilience of DL-based MRI reconstruction models by integrating pre-trained diffusion models as noise purifiers.
• In Chapter 8, we improved the speed and performance of the diffusion purifier by introducing a better reverse sampling method based on triple-consistency regularization.

• In the future, we will focus on pruning unnecessary network weights in the diffusion model to further improve computational speed.

• In conclusion, this thesis addresses the dual challenges of deep learning model-based approaches, namely data scarcity and limited robustness. To alleviate data scarcity, we introduce three methods (LONDN-MRI, Self-Guided DIP, and aSeqDIP) that employ adaptive strategies to work effectively with limited datasets. In parallel, we tackle robustness issues through SMUG and diffusion purification, which mitigate vulnerabilities such as noise and adversarial perturbations. Furthermore, to improve the efficiency of diffusion models, we propose SITCOM, an approach that accelerates the reverse sampling process without sacrificing result quality. Collectively, these contributions push the boundaries of deep learning in constrained settings while strengthening the reliability of model-based solutions.

BIBLIOGRAPHY

Aggarwal, H. K., Mani, M. P., and Jacob, M. (2018). MoDL: Model-based deep learning architecture for inverse problems. IEEE Transactions on Medical Imaging, 38(2):394–405.
Aggarwal, H. K., Mani, M. P., and Jacob, M. (2019a). MoDL: Model-based deep learning architecture for inverse problems. IEEE Transactions on Medical Imaging, 38(2):394–405.
Aggarwal, H. K., Mani, M. P., and Jacob, M. (2019b). MoDL: Model-based deep learning architecture for inverse problems. IEEE Transactions on Medical Imaging, 38(2):394–405.
Akçakaya, M., Moeller, S., Weingärtner, S., and Uğurbil, K. (2019). Scan-specific robust artificial-neural-networks for k-space interpolation (RAKI) reconstruction: Database-free deep learning for fast imaging. Magnetic Resonance in Medicine, 81(2):439–453.
Alkhouri, I., Liang, S., Wang, R., Qu, Q., and Ravishankar, S. (2024). Diffusion-based adversarial purification for robust deep MRI reconstruction. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12841–12845. IEEE.
Allgower, E. L. and Georg, K. (2012). Numerical continuation methods: an introduction, volume 13. Springer Science & Business Media.
Anderson, B. D. (1982). Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326.
Antun, V., Renna, F., Poon, C., Adcock, B., and Hansen, A. (2020a). On instabilities of deep learning in image reconstruction and the potential costs of AI. Proceedings of the National Academy of Sciences, 117(48):30088–30095.
Antun, V., Renna, F., Poon, C., Adcock, B., and Hansen, A. C. (2020b). On instabilities of deep learning in image reconstruction and the potential costs of AI. Proceedings of the National Academy of Sciences, 117(48):30088–30095.
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. (2019). On exact computation with an infinitely wide neural net. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Buzzard, G. T., Chan, S. H., Sreehari, S., and Bouman, C. A. (2018). Plug-and-play unplugged: optimization-free reconstruction using consensus equilibrium. SIAM Journal on Imaging Sciences, 11(3):2001–2020.
Chan, S. H., Wang, X., and Elgendy, O. A. (2016). Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging, 3(1):84–98.
Chen, C., Liu, Y., Schniter, P., Tong, M., Zareba, K., Simonetti, O., Potter, L., and Ahmad, R. (2020). OCMR (v1.0): open-access multi-coil k-space dataset for cardiovascular magnetic resonance imaging. arXiv preprint arXiv:2008.03410.
Chen, G., Zhu, F., and Ann Heng, P. (2015). An efficient statistical method for image noise level estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 477–485.
Chen, S., Zhang, H., Guo, M., Lu, Y., Wang, P., and Qu, Q. (2024). Exploring low-dimensional subspaces in diffusion models for controllable image editing. arXiv preprint arXiv:2409.02374.
Cheng, J. (2019). Stanford 2D FSE.
Cheng, Z., Gadelha, M., Maji, S., and Sheldon, D. (2019). A Bayesian perspective on the deep image prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5443–5451.
Chung, H., Kim, J., Kim, S., and Ye, J. C. (2023a). Parallel diffusion models of operator and image for blind inverse problems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6059–6069.
Chung, H., Kim, J., McCann, M. T., Klasky, M. L., and Ye, J. C. (2023b). Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations.
Chung, H., Kim, J., McCann, M. T., Klasky, M. L., and Ye, J. C. (2023c). Diffusion posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning Representations.
Chung, H., Kim, J., and Ye, J. C. (2023d). Direct diffusion bridge using data consistency for inverse problems. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S., editors, Advances in Neural Information Processing Systems, volume 36, pages 7158–7169. Curran Associates, Inc.
Chung, H., Sim, B., Ryu, D., and Ye, J. C. (2022). Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696.
Chung, H. and Ye, J. C. (2022). Score-based diffusion models for accelerated MRI. Medical Image Analysis, 80:102479.
Cohen, J., Rosenfeld, E., and Kolter, Z. (2019). Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR.
Croce, F. and Hein, M. (2020). Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks.
Crockett, C. and Fessler, J. A. (2021). Bilevel methods for image reconstruction. arXiv preprint arXiv:2109.09610.
Dar, S. U. H., Özbey, M., Çatlı, A. B., and Çukur, T. (2017). A transfer-learning approach for accelerated MRI using deep neural networks. arXiv preprint arXiv:1710.02615.
Daras, G., Chung, H., Lai, C.-H., Mitsufuji, Y., Ye, J. C., Milanfar, P., Dimakis, A. G., and Delbracio, M. (2024). A survey on diffusion models for inverse problems. arXiv preprint arXiv:2410.00083.
Darestani, M. Z. and Heckel, R. (2021). Accelerated MRI with un-trained neural networks. arXiv preprint arXiv:2007.02471.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Deshmane, A., Gulani, V., Griswold, M. A., and Seiberlich, N. (2012). Parallel MR imaging. Journal of Magnetic Resonance Imaging, 36(1):55–72.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794.
Donoho, D. (2006a). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.
Donoho, D. L. (2006b). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.
Elbakri, I. A. and Fessler, J. A. (2002). Statistical image reconstruction for polyenergetic X-ray computed tomography. IEEE Transactions on Medical Imaging, 21(2):89–99.
Knoll, F., et al. (2020). fastMRI: A publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiology: Artificial Intelligence, 2(1):e190007.
Zbontar, J., et al. (2019). fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839.
Feng, C., Yan, Y., Fu, H., Chen, L., and Xu, Y. (2021). Task transformer network for joint MRI reconstruction and super-resolution. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI).
Fessler, J. A. (2010). Model-based image reconstruction for MRI. IEEE Signal Processing Magazine, 27(4):81–89.
Gatenby, R. A., Grove, O., and Gillies, R. J. (2013). Quantitative imaging in cancer evolution and ecology. Radiology, 269(1):8–14.
Ghosh, A., McCann, M., and Ravishankar, S. (2022). Bilevel learning of l1 regularizers with closed-form gradients (BLORC). In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1491–1495.
Ghosh, A., Zhang, X., Sun, K. K., Qu, Q., Ravishankar, S., and Wang, R. (2024). Optimal eye surgeon: Finding image priors through sparse generators at initialization. In Forty-first International Conference on Machine Learning.
Gilton, D., Ongie, G., and Willett, R. (2021a). Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123–1133.
Gilton, D., Ongie, G., and Willett, R. (2021b). Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123–1133.
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. (2006). A kernel method for the two-sample-problem. Advances in Neural Information Processing Systems, 19.
Guan, H. and Liu, M. (2021). Domain adaptation for medical image analysis: a survey. IEEE Transactions on Biomedical Engineering, 69(3):1173–1185.
Güngör, A., Dar, S. U., Öztürk, Ş., Korkmaz, Y., Bedel, H. A., Elmas, G., Ozbey, M., and Çukur, T. (2023). Adaptive diffusion priors for accelerated MRI reconstruction. Medical Image Analysis, page 102872.
Hammernik, K., Klatzer, T., Kobler, E., Recht, M. P., Sodickson, D. K., Pock, T., and Knoll, F. (2018). Learning a variational network for reconstruction of accelerated MRI data. Magnetic Resonance in Medicine, 79(6):3055–3071.
Hatamizadeh, A., Song, J., Liu, G., Kautz, J., and Vahdat, A. (2024). Diffit: Diffusion vision transformers for image generation. In European Conference on Computer Vision, pages 37–55. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034.
Heckel, R. and Hand, P. (2019). Deep decoder: Concise image representations from untrained non-convolutional networks. In ICLR.
Heckel, R. and Soltanolkotabi, M. (2020). Denoising and regularization via exploiting the structural bias of convolutional generators. In ICLR.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851.
Hofmann, T., Schölkopf, B., and Smola, A. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3):1171–1220.
Hou, R., Li, F., and Zhang, G. (2022). Truncated residual based plug-and-play ADMM algorithm for MRI reconstruction. IEEE Transactions on Computational Imaging, 8:96–108.
Hsieh, J. (2003). Computed tomography: principles, design, artifacts, and recent advances. SPIE Journal of Medical Imaging, 42(6):1234–1245.
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. In ICLR, arXiv preprint arXiv:1412.6572.
Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS), pages 8580–8589. Curran Associates Inc.
Jia, J., Hong, M., Zhang, Y., Akcakaya, M., and Liu, S. (2022a). On the robustness of deep learning-based MRI reconstruction to image transformations. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022.
Jia, J., Hong, M., Zhang, Y., Akcakaya, M., and Liu, S. (2022b). On the robustness of deep learning-based MRI reconstruction to image transformations. In Workshop on Trustworthy and Socially Responsible Machine Learning, NeurIPS 2022.
Jin, K. H., McCann, M. T., Froustey, E., and Unser, M. (2017). Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522.
Jo, Y., Chun, S. Y., and Choi, J. (2021). Rethinking deep image prior for denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5087–5096.
Kak, A. C. and Slaney, M. (2001). Principles of computerized tomographic imaging. SIAM.
Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577.
Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.
Kawar, B., Elad, M., Ermon, S., and Song, J. (2022). Denoising diffusion restoration models. Advances in Neural Information Processing Systems, 35:23593–23606.
Kaya, M. and Bilge, H. Ş. (2019). Deep metric learning: A survey. Symmetry, 11(9).
Kim, T. H., Garg, P., and Haldar, J. P. (2019). LORAKI: Autocalibrated recurrent neural networks for autoregressive MRI reconstruction in k-space. arXiv preprint arXiv:1904.09390.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P. and Ba, J. (2015a). Adam: A method for stochastic optimization. In ICLR, arXiv preprint arXiv:1412.6980.
Kingma, D. P. and Ba, J. (2015b). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Klug, T. and Heckel, R. (2023). Scaling laws for deep learning based image reconstruction. In ICLR.
Lahiri, A., Ravishankar, S., and Fessler, J. A. (2020). Combining supervised and semi-blind dictionary (Super-BReD) learning for MRI reconstruction. In Proc. Intl. Soc. Mag. Res. Med., page 3456.
Lahiri, A., Wang, G., Ravishankar, S., and Fessler, J. (2021). Blind Primed Supervised (BLIPS) learning for MR image reconstruction. IEEE Transactions on Medical Imaging, 40(11):3113–3124.
Lakshmanan, H. and de Farias, D. P. (2008). Decentralized resource allocation in dynamic networks of agents. SIAM Journal on Optimization, 19(2):911–940.
Lee, S. S., Byun, J. H., Park, B. J., Park, S. H., Kim, N., Park, B., Kim, J. K., and Lee, M.-G. (2008). Quantitative analysis of diffusion-weighted magnetic resonance imaging of the pancreas: usefulness in characterizing solid pancreatic masses. Journal of Magnetic Resonance Imaging, 28(4):928–936.
Lei, K., Mardani, M., Pauly, J. M., and Vasanawala, S. (2021). Wasserstein GANs for MR imaging: From paired to unpaired training. IEEE Transactions on Medical Imaging, 40(1):105–115.
Lei, K., Mardani, M., Pauly, J. M., and Vasanawala, S. S. (2020). Wasserstein GANs for MR imaging: from paired to unpaired training. IEEE Transactions on Medical Imaging, 40(1):105–115.
Li, H., Jia, J., Liang, S., Yao, Y., Ravishankar, S., and Liu, S. (2023). SMUG: Towards robust MRI reconstruction by smoothed unrolling. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Li, T., Zhuang, Z., Liang, H., Peng, L., Wang, H., and Sun, J. (2021). Self-validation: Early stopping for single-instance deep generative priors. In Proceedings of the British Machine Vision Conference (BMVC), 2021.
Li, X., Kwon, S. M., Alkhouri, I. R., Ravishankar, S., and Qu, Q. (2024). Decoupled data consistency with diffusion purification for image restoration. arXiv preprint arXiv:2403.06054.
Liang, S., Bell, E., Qu, Q., Wang, R., and Ravishankar, S. (2024a). Analysis of deep image prior and exploiting self-guidance for image reconstruction. arXiv preprint arXiv:2402.04097.
Liang, S., Lahiri, A., and Ravishankar, S. (2024b). Adaptive local neighborhood-based neural networks for MR image reconstruction from undersampled data. IEEE Transactions on Computational Imaging. To appear.
Lingala, S. G. and Jacob, M. (2013). Blind compressive sensing dynamic MRI. IEEE Transactions on Medical Imaging, 32(6):1132–1145.
Liu, C., Freeman, W., Szeliski, R., and Kang, S. B. (2006). Noise estimation from a single image. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 1, pages 901–908.
Liu, C. and Hui, L. (2023). ReLU soothes the NTK condition number and accelerates optimization for wide neural networks. arXiv e-prints, pages arXiv–2305.
Liu, J., Sun, Y., Xu, X., and Kamilov, U. (2019a). Image restoration using total variation regularized deep image prior. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Liu, J., Sun, Y., Xu, X., and Kamilov, U. S. (2019b). Image restoration using total variation regularized deep image prior. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7715–7719. IEEE.
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., and Van Gool, L. (2022). Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471.
Luo, C. (2022). Understanding diffusion models: A unified perspective. arXiv preprint arXiv:2208.11970.
Lustig, M., Donoho, D., and Pauly, J. M. (2007). Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195.
Ma, S., Yin, W., Zhang, Y., and Chakraborty, A. (2008). An efficient algorithm for compressed MR imaging using total variation and wavelets. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Mardani, M., Song, J., Kautz, J., and Vahdat, A. (2023). A variational perspective on solving inverse problems with diffusion models. arXiv preprint arXiv:2305.04391.
McCollough, C. H., Leng, S., Yu, L., and Fletcher, J. G. (2015). Dual- and multi-energy CT: principles, technical approaches, and clinical applications. Radiology, 276(3):637–653.
McCollough, C. H., Primak, A. N., Braun, N., Kofler, J., Yu, L., and Christner, J. (2009). Strategies for reducing radiation dose in CT. Radiologic Clinics, 47(1):27–40.
Mihcak, M. K., Kozintsev, I., Ramchandran, K., and Moulin, P. (1999). Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 6(12):300–303.
Monga, V., Li, Y., and Eldar, Y. C. (2021). Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18–44.
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. (2022a). Diffusion models for adversarial purification. In International Conference on Machine Learning, pages 16805–16827. PMLR.
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. (2022b). Diffusion models for adversarial purification. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 16805–16827. PMLR.
Peng, C., Guo, P., Zhou, S. K., Patel, V. M., and Chellappa, R. (2022). Towards performant and reliable undersampled MR reconstruction via diffusion model sampling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 623–633. Springer.
Platen, E. and Bruti-Liberati, N. (2010). Numerical solution of stochastic differential equations with jumps in finance, volume 64. Springer Science & Business Media.
Ramani, A., Jensen, J. H., and Helpern, J. A. (2006). Quantitative MR imaging in Alzheimer disease. Radiology, 241(1):26–44.
Ravishankar, S. and Bresler, Y. (2010). MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging, 30(5):1028–1041.
Ravishankar, S. and Bresler, Y. (2011). MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Transactions on Medical Imaging, 30(5):1028–1041.
Ravishankar, S. and Bresler, Y. (2012). Learning sparsifying transforms. IEEE Transactions on Signal Processing, 61(5):1072–1086.
Ravishankar, S., Nadakuditi, R. R., and Fessler, J. A. (2015). Efficient sum of sparse outer products dictionary learning (SOUP-DIL). CoRR, abs/1511.06333.
Ravishankar, S., Ye, J. C., and Fessler, J. A. (2020). Image reconstruction: From sparsity to data-adaptive methods and machine learning. Proceedings of the IEEE, 108(1):86–109.
Romano, Y., Elad, M., and Milanfar, P. (2017). The little engine that could: Regularization by denoising (RED). SIAM Journal on Imaging Sciences, 10(4):1804–1844.
Ronneberger, O., Fischer, P., and Brox, T. (2015a). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pages 234–241. Springer.
Ronneberger, O., Fischer, P., and Brox, T. (2015b). U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015, pages 234–241.
Rosenthal, D. I., Barton, N. W., McKusick, K. A., Rosen, B., Hill, S., Castronovo, F., Brady, R., Doppelt, S., and Mankin, H. (1992). Quantitative imaging of Gaucher disease. Radiology, 185(3):841–845.
Roth, S. and Black, M. J. (2005). Fields of experts: a framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 860–867.
Rout, L., Raoof, N., Daras, G., Caramanis, C., Dimakis, A. G., and Shakkottai, S. (2023). Solving linear inverse problems provably via posterior sampling with latent diffusion models. arXiv preprint arXiv:2307.00619.
Salman, H., Sun, M., Yang, G., Kapoor, A., and Kolter, J. Z. (2020). Denoised smoothing: A provable defense for pretrained classifiers. Advances in Neural Information Processing Systems, 33.
Särkkä, S. and Solin, A. (2019). Applied stochastic differential equations, volume 10. Cambridge University Press.
Schlemper, J., Caballero, J., Hajnal, J. V., Price, A., and Rueckert, D. (2017). A deep cascade of convolutional neural networks for MR image reconstruction. In International Conference on Information Processing in Medical Imaging, pages 647–658. Springer.
Schlemper, J., Caballero, J., Hajnal, J. V., Price, A., and Rueckert, D. (2018). A deep cascade of convolutional neural networks for dynamic MR image reconstruction. IEEE Transactions on Medical Imaging, 37(2):491–503.
Shete, M. M. and Jadhav, C. R. (2023). Advancements in CT image reconstruction: An exploration of conventional and deep learning-driven approaches. In International Conference on Computational Intelligence, pages 77–88. Springer.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR.
Song, B., Kwon, S. M., Zhang, Z., Hu, X., Qu, Q., and Shen, L. (2023a). Solving inverse problems with latent diffusion models via hard data consistency. In The Twelfth International Conference on Learning Representations.
Song, B., Kwon, S. M., Zhang, Z., Hu, X., Qu, Q., and Shen, L. (2024). Solving inverse problems with latent diffusion models via hard data consistency. In The Twelfth International Conference on Learning Representations.
Song, J., Meng, C., and Ermon, S. (2021a). Denoising diffusion implicit models. In International Conference on Learning Representations.
Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. (2023b). Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR.
Song, Y., Durkan, C., Murray, I., and Ermon, S. (2021b). Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021c). Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations.
Sriram, A., Zbontar, J., Murrell, T., Defazio, A., Zitnick, C. L., Yakubova, N., Knoll, F., and Johnson, P. (2020). End-to-end variational networks for accelerated MRI reconstruction. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2020, pages 64–73. Springer.
End-to-end variational networks for accelerated MRI reconstruction. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part II, pages 64–73. Springer.
Tachella, J., Tang, J., and Davies, M. (2021). The neural tangent link between CNN denoisers and non-local filters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8618–8627.
Tachella, J., Tang, J., and Davies, M. E. (2020). CNN denoisers as non-local filters: The neural tangent denoiser. CoRR, abs/2006.02379.
Tamir, J. I., Ong, F., Cheng, J. Y., Uecker, M., and Lustig, M. (2016). Generalized magnetic resonance image reconstruction using the Berkeley Advanced Reconstruction Toolbox. In ISMRM Workshop on Data Sampling & Image Reconstruction, Sedona, AZ.
Tran, P., Tran, A. T., Phung, Q., and Hoai, M. (2021). Explore image deblurring via encoded blur kernel space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11956–11965.
Uecker, M. (2018). mrirecon/bart: version 0.4.03.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454.
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674.
Wang, H., Li, T., Zhuang, Z., Chen, T., Liang, H., and Sun, J. (2023a). Early stopping for deep image prior. Transactions on Machine Learning Research.
Wang, H., Zhang, X., Li, T., Wan, Y., Chen, T., and Sun, J. (2024). DMPlug: A plug-in method for solving inverse problems with diffusion models. arXiv preprint arXiv:2405.16749.
Wang, Y., Yu, J., and Zhang, J. (2022). Zero-shot image restoration using denoising diffusion null-space model. arXiv preprint arXiv:2212.00490.
Wang, Y., Yu, J., and Zhang, J. (2023b). Zero-shot image restoration using denoising diffusion null-space model. In The Eleventh International Conference on Learning Representations.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612.
Wang, Z., Qian, C., Guo, D., Sun, H., Li, R., Zhao, B., and Qu, X. (2023c). One-dimensional deep low-rank and sparse network for accelerated MRI. IEEE Transactions on Medical Imaging, 42(1):79–90.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688. Citeseer.
Wen, B., Li, Y., and Bresler, Y. (2020). Image recovery via transform learning and low-rank modeling: The power of complementary regularizers. IEEE Transactions on Image Processing, 29:5310–5323.
Wen, B., Ravishankar, S., Pfister, L., and Bresler, Y. (2020). Transform learning for magnetic resonance image reconstruction: From model-based learning to building neural networks. IEEE Signal Processing Magazine, 37(1):41–53.
Wen, B., Ravishankar, S., Zhao, Z., Giryes, R., and Ye, J. C. (2023). Physics-driven machine learning for computational imaging [from the guest editor]. IEEE Signal Processing Magazine, 40(1):28–30.
Wintermark, M., Sanelli, P. C., Anzai, Y., Tsiouris, A. J., Whitlow, C. T., Druzgal, T. J., Gean, A. D., Lui, Y. W., Norbash, A. M., Raji, C., et al. (2015).
Imaging evidence and recommendations for traumatic brain injury: Conventional neuroimaging techniques. Journal of the American College of Radiology, 12(2):e1–e14.
Wolf, A. (2019). Making medical image reconstruction adversarially robust. Online report: https://cs229.stanford.edu/proj2019spr/report/97.pdf.
Xie, Y. and Li, Q. (2022). Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 655–664. Springer.
Xu, Q., Yu, H., Mou, X., Zhang, L., Hsieh, J., and Wang, G. (2012). Low-dose X-ray CT reconstruction via dictionary learning. IEEE Transactions on Medical Imaging, 31(9):1682–1697.
Yaman, B., Hosseini, S. A. H., Moeller, S., Ellermann, J., Ugurbil, K., and Akcakaya, M. (2020). Self-supervised learning of physics-guided reconstruction neural networks without fully sampled reference data. Magnetic Resonance in Medicine, 84(6):3172–3191.
Yaman, B., Hosseini, S. A. H., and Akcakaya, M. (2022). Zero-shot self-supervised learning for MRI reconstruction. In International Conference on Learning Representations.
Yang, G., Yu, S., Dong, H., Slabaugh, G., Dragotti, P. L., Ye, X., Liu, F., Arridge, S., Keegan, J., Guo, Y., et al. (2017). DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Transactions on Medical Imaging, 37(6):1310–1321.
Yang, Y., Sun, J., Li, H., and Xu, Z. (2016). Deep ADMM-Net for compressive sensing MRI. In Advances in Neural Information Processing Systems, pages 10–18.
Ye, S., Li, Z., McCann, M. T., Long, Y., and Ravishankar, S. (2021). Unified Supervised-Unsupervised (SUPER) learning for X-ray CT image reconstruction. IEEE Transactions on Medical Imaging, 40(11):2986–3001.
Yiasemis, G., Sonke, J.-J., Sánchez, C., and Teuwen, J. (2022). Recurrent variational network: A deep learning inverse problem solver applied to the task of accelerated MRI reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 732–741.
Yu, S., Park, B., and Jeong, J. (2019a). Deep iterative down-up CNN for image denoising. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2095–2103.
Yu, S., Park, B., and Jeong, J. (2019b). Deep iterative down-up CNN for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
Zbontar, J., Knoll, F., Sriram, A., Murrell, T., Huang, Z., Muckley, M. J., Defazio, A., Stern, R., Johnson, P., Bruno, M., et al. (2018). fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839.
Zeng, G. L. (2020). Fast filtered backprojection algorithm for low-dose computed tomography. Journal of Radiology and Imaging, 4(7):45.
Zhang, B., Chu, W., Berner, J., Meng, C., Anandkumar, A., and Song, Y. (2024a). Improving diffusion inverse problem solving with decoupled noise annealing. arXiv preprint arXiv:2407.01521.
Zhang, H., Zhou, J., Lu, Y., Guo, M., Wang, P., Shen, L., and Qu, Q. (2024b). The emergence of reproducibility and consistency in diffusion models. In Forty-first International Conference on Machine Learning.
Zhang, J. and Ghanem, B. (2018). ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. arXiv preprint arXiv:1706.07929.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. (2018).
The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595.
Zhang, Y., Yao, Y., Jia, J., Yi, J., Hong, M., Chang, S., and Liu, S. (2022). How to robustify black-box ML models? A zeroth-order optimization perspective. In International Conference on Learning Representations.
Zhao, D., Zhao, F., and Gan, Y. (2020a). Reference-driven compressed sensing MR image reconstruction using deep convolutional neural networks without pre-training. Sensors, 20(1):308.
Zhao, D., Zhao, F., and Gan, Y. (2020b). Reference-driven compressed sensing MR image reconstruction using deep convolutional neural networks without pre-training. Sensors, 20(1).
Zheng, H., Fang, F., and Zhang, G. (2019). Cascaded dilated dense network with two-step data consistency for MRI reconstruction. In NeurIPS.
Zoph, B., Ghiasi, G., Lin, T., Cui, Y., Liu, H., Cubuk, E. D., and Le, Q. (2020). Rethinking pre-training and self-training. Advances in Neural Information Processing Systems, 33.

APPENDIX A

APPENDIX FOR SELF-GUIDED DIP

In this Appendix, we provide additional intuition and detailed proofs for Theorems 4.1.1 and 4.1.2, as well as the Corollary introduced in the previous section.

A.1 Proof of Theorem 4.1.1

We start from the following update step for the estimate z_t:

z_{t+1} = z_t + η W (A^T y − A^T A z_t).  (A.1)

We make the change of variables p_t(w) = W^{−1/2} z_t(w), where W^{−1/2} is the pseudo-inverse of W^{1/2}, and W^{1/2} is the positive semidefinite matrix whose square equals W. With this change of variables, and using the decomposition x = x_⊥ + P_{R(W)} x with x_⊥ := P_{N(W)} x, the recursion (A.1) becomes

p_{t+1} = p_t + η W^{1/2} (A^T y − A^T A W^{1/2} p_t)
        = (I − η W^{1/2} A^T A W^{1/2}) p_t + η W^{1/2} A^T A (x_⊥ + W^{1/2} (W^{−1/2} x))
        = (I − η B) p_t + η B (W^{−1/2} x) + η W^{1/2} A^T A x_⊥,

where z_t(w) = W^{1/2} p_t(w) holds because z_0 = 0, so z_t(w) ∈ R(W) = R(W^{1/2}) here. We have also set B := W^{1/2} A^T A W^{1/2}. Since we hope z_t to converge to x, we expect p_t to converge to x̃ := W^{−1/2} x.

Now we keep track of the errors ε_t := p_t − x̃ and e_t := z_t − x. Subtracting x̃ from the above recursion for p_t, we obtain

ε_{t+1} = (I − ηB) ε_t + η W^{1/2} A^T A x_⊥.

This implies the following result; the proof of the second equality is included in Appendix D in the supplement:

ε_t = (I − ηB)^t ε_0 + η [ Σ_{k=0}^{t−1} (I − ηB)^k ] W^{1/2} A^T A x_⊥
    = (I − ηB)^t ε_0 + B^† (I − (I − ηB)^t) W^{1/2} A^T A x_⊥.  (A.2)

Invoking the relation between p_t and z_t, we can derive the following useful relation between the errors e_t and ε_t:

e_t = W^{1/2} p_t − x
    = W^{1/2} (ε_t + x̃) − x
    = W^{1/2} (ε_t + W^{−1/2} x) − x
    = W^{1/2} ε_t + W^{1/2} W^{−1/2} x − x
    = W^{1/2} ε_t + P_{R(W)} x − x
  (∗)
    = W^{1/2} ε_t − P_{N(W)} x
  (∗∗)
    = W^{1/2} ε_t + P_{N(W)} (W^{1/2} p_t − x)
    = W^{1/2} ε_t + P_{N(W)} e_t,  (A.3)–(A.10)

where P_{N(W)} denotes the projection onto the null space of W. Most steps in the derivation above follow from the definitions of the quantities p_t, e_t, and x̃. To obtain (∗), we have used the symmetry of W, and to obtain (∗∗) we have used the fact that R(W^{1/2}) is equal to R(W), which is orthogonal to N(W). In what follows, for simplicity of notation, we use P_W to denote the projection onto the range of W.
Using the above relation and the fact that R(W) and N(W) are orthogonal for symmetric W, we can write:

P_W e_t = W^{1/2} ε_t
        = W^{1/2} (I − ηB)^t ε_0 + W^{1/2} B^† (I − (I − ηB)^t) W^{1/2} A^T A x_⊥
        = W^{1/2} (I − ηB)^t W^{−1/2} e_0 + W^{1/2} B^† (I − (I − ηB)^t) W^{1/2} A^T A x_⊥.

On the other hand, subtracting x from (A.1) and then projecting both sides of the resulting equation onto N(W) yields

P_{N(W)} e_t = P_{N(W)} e_{t−1} = · · · = P_{N(W)} e_0.

Summing the above two equations yields

z_t − x = W^{1/2} (I − ηB)^t W^{−1/2} (z_0 − x) + P_{N(W)} (z_0 − x) + W^{1/2} B^† (I − (I − ηB)^t) W^{1/2} A^T A x_⊥.  (A.11)

If W is of full rank, then (A.11) reduces to

z_t − x = W^{1/2} (I − ηB)^t W^{−1/2} (z_0 − x).  (A.12)

(A.12) can be further rewritten as

z_t − x = W^{1/2} (I − ηB)^t P_{R(B)} W^{−1/2} (z_0 − x) + W^{1/2} (I − ηB)^t P_{N(B)} W^{−1/2} (z_0 − x)
        = W^{1/2} (I − ηB)^t P_{R(B)} W^{−1/2} (z_0 − x) + W^{1/2} P_{N(B)} W^{−1/2} (z_0 − x).  (A.13)

In order to make sure the operator I − ηB is non-expansive, we need to require the learning rate η to satisfy η < 2/∥B∥, where ∥B∥ is the spectral norm of B. Under this assumption, ∥I − ηB∥ ≤ ρ := max{1 − ησ_min(B), η∥B∥ − 1} < 1 on the range of B, meaning that the operator I − ηB is contractive there. Then as t → ∞, the first term on the right-hand side of (A.13) converges to 0, since

∥W^{1/2} (I − ηB)^t P_{R(B)} W^{−1/2} (z_0 − x)∥²₂ ≤ κ(W) ρ^{2t} ∥z_0 − x∥²₂,

where κ(W) is the condition number of W. (If W is low-rank, its condition number is defined as the ratio of the maximal and minimal non-zero singular values.) Therefore, (A.13) implies that

z_∞ − x = W^{1/2} P_{N(B)} W^{−1/2} (z_0 − x) = −W^{1/2} P_{N(B)} W^{−1/2} x,  (A.14)

where the last equality used the assumption z_0 = 0.

Let v := P_{N(B)} W^{−1/2} x. By this definition, we have v ∈ N(B), which is equivalent to W^{1/2} A^T A W^{1/2} v = 0, or A W^{1/2} v = 0. The latter implies W^{1/2} v ∈ N(A). This, when combined with the equation z_∞ − x = −W^{1/2} v from (A.14), yields z_∞ − x ∈ N(A). Moreover, for z_∞ − x to be 0, it is necessary that v = 0, which means W^{−1/2} x has to be orthogonal to N(B). Consequently, this necessitates that x be orthogonal to N(A), i.e., P_{N(A)} x = 0. This completes the proof for the full-rank portion of the theorem.

To prove the result for the singular W case, we rewrite the quantity W^{1/2} (I − ηB)^t W^{−1/2} in (A.11) as follows. Here, for simplicity of notation, we use P_B to denote the projection onto the range of B, and P_{B⊥} to denote the projection onto the kernel of B:

W^{1/2} (I − ηB)^t W^{−1/2} = W^{1/2} P_B (I − ηB)^t P_B W^{−1/2} + W^{1/2} (P_{B⊥} P_W P_{B⊥})^t W^{−1/2}.  (A.15)

The detailed proof of the above result is given in Supplement Appendix E. Taking t → ∞ in (A.15), we obtain

lim_{t→∞} W^{1/2} (I − ηB)^t W^{−1/2} = W^{1/2} lim_{t→∞} P_B (I − ηB)^t P_B W^{−1/2} + W^{1/2} lim_{t→∞} (P_{B⊥} P_W P_{B⊥})^t W^{−1/2}
                                     = 0 + W^{1/2} P_{N(B)∩R(W)} W^{−1/2},

where the last equality used the fact that lim_{n→∞} (P_A P_B P_A)^n = P_{A∩B}. Then (A.11) implies that

z_∞ − x = −W^{1/2} P_{N(B)∩R(W)} W^{−1/2} x − x_⊥ + W^{1/2} B^† W^{1/2} A^T A x_⊥
        = −W^{1/2} P_{N(B)∩R(W)} W^{−1/2} x − x_⊥ + W^{1/2} (A W^{1/2})^† A x_⊥,  (A.16)

where the last equality is based on the fact that (C C^H)^† C = (C^H)^† for any tall matrix C.

Now if

P_{N(B)∩R(W)} W^{−1/2} x = 0,  (A.17)

then the first term on the RHS of (A.16) is 0, and

z_∞ − x = −x_⊥ + W^{1/2} (A W^{1/2})^† A x_⊥.  (A.18)

Given that condition (A.17) can be inferred from the condition

P_{N(A)∩R(W)} x = 0,  (A.19)

it follows that (A.19) also implies (A.18), as stated in the theorem. The rationale behind (A.19) being a sufficient condition for (A.17) is:

P_{N(A)∩R(W)} x = 0
⇒ x ⊥ N(A) ∩ R(W)
⇒ ⟨x, a⟩ = 0, ∀ a ∈ N(A) ∩ R(W)
⇒ ⟨x, a⟩ = 0, ∀ a ∈ R(W) with A a = 0
⇒ ⟨W^{1/2} x, W^{−1/2} a⟩ = 0, ∀ a ∈ R(W) with A W^{1/2} W^{−1/2} a = 0
⇒ ⟨W^{1/2} x, b⟩ = 0, ∀ b ∈ R(W) with A W^{1/2} b = 0
⇒ ⟨W^{1/2} x, b⟩ = 0, ∀ b ∈ R(W) with B b = 0
⇒ W^{−1/2} x ⊥ N(B) ∩ R(W)
⇒ P_{N(B)∩R(W)} W^{−1/2} x = 0,

where b = W^{−1/2} a. Furthermore, if aside from (A.19) we also have x ∈ R(W), then (A.18) reduces to z_∞ − x = 0, which completes the proof of Theorem 4.1.1.
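The conclusion of Theorem 4.1.1 is easy to probe numerically. The following minimal NumPy sketch (ours, not from the dissertation) runs the iteration (A.1) for the full-rank W case with a random PSD surrogate for the NTK and checks that the limiting residual z_t − x lies in the null space of A:

```python
# Numerical check of Theorem 4.1.1 (full-rank W): iterating
# z_{t+1} = z_t + eta * W (A^T y - A^T A z_t) from z_0 = 0 drives the
# residual z_t - x into N(A), while exact recovery generally fails
# since P_{N(A)} x != 0 for a random ground truth.
import numpy as np

rng = np.random.default_rng(0)
p, q = 30, 50                          # p < q, so N(A) is nontrivial
A = rng.standard_normal((p, q))
x = rng.standard_normal(q)             # ground truth
y = A @ x                              # noiseless measurements

G = rng.standard_normal((q, q))
W = G @ G.T + 0.5 * np.eye(q)          # full-rank PSD surrogate for the NTK

# B = W^{1/2} A^T A W^{1/2} has the same nonzero spectrum as W A^T A, so
# eta = 1/||W A^T A|| satisfies the step-size condition eta < 2/||B||.
eta = 1.0 / np.linalg.norm(W @ A.T @ A, 2)

z = np.zeros(q)
WAtA, WAty = W @ A.T @ A, W @ (A.T @ y)
for _ in range(100_000):
    z += eta * (WAty - WAtA @ z)

res = z - x
print("||A(z - x)|| / ||z - x|| =", np.linalg.norm(A @ res) / np.linalg.norm(res))
# The ratio is ~0: the limiting error lies in N(A), matching
# z_inf - x = -W^{1/2} P_{N(B)} W^{-1/2} x from (A.14).
```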
A.2 Proof of Theorem 4.1.2

In this case, we suppose that the acquired measurements are y = Ax + n, where n ∈ R^p with n ∼ N(0, σ²I), and A ∈ R^{p×q} has full row rank. We first note that this can equivalently be written as y = A(x + A^† n), since A has full row rank, so A A^† n = n. Then, we start with the recursion (A.1), which in this case gives:

z_{t+1} = z_t + η W (A^T A (x + A^† n) − A^T A z_t) = z_t + η W A^T A ((x + A^† n) − z_t).

We set z_0 = 0 and define K := η W A^T A to ease notation. We can use this to derive a useful closed form for z_t:

z_t = (I − K) z_{t−1} + K (x + A^† n)
    = (I − K)((I − K) z_{t−2} + K (x + A^† n)) + K (x + A^† n)
    = (I − K)² z_{t−2} + (I + (I − K)) K (x + A^† n)
    = (I − K)²((I − K) z_{t−3} + K (x + A^† n)) + (I + (I − K)) K (x + A^† n)
    = (I − K)³ z_{t−3} + (I + (I − K) + (I − K)²) K (x + A^† n)
    ⋮
    = (I − K)^t z_0 + Σ_{i=0}^{t−1} (I − K)^i K (x + A^† n)
  (∗)
    = (I − (I − K)^t)(x + A^† n)
    = (I − (I − η W A^T A)^t)(x + A^† n),

where the first term vanishes since z_0 = 0. To obtain the equality (∗), we have used the algebraic identity Σ_{i=0}^{t−1} M^i (I − M) = I − M^t, which holds for any square matrix M; replacing M with I − K yields the identity used in (∗).

We can express the squared norm of the bias at iteration t as:

∥Bias_t∥²₂ = ∥E_n[z_t] − x∥²₂
           = ∥E_n[(I − (I − η W A^T A)^t)(x + A^† n)] − x∥²₂
  (∗∗)
           = ∥(I − (I − η W A^T A)^t) x − x∥²₂
           = ∥(I − η W A^T A)^t x∥²₂,

where (∗∗) follows by linearity of expectation and the assumption that n is zero mean. Next we compute the covariance matrix of z_t as:

Cov_t = E_n[z_t z_t^T] − E_n[z_t] E_n[z_t]^T.  (A.20)

To simplify notation, we define the matrix R_t := I − (I − η W A^T A)^t, so we get:

Cov_t = E_n[R_t (x + A^† n)(x + A^† n)^T R_t^T] − E_n[R_t (x + A^† n)] E_n[R_t (x + A^† n)]^T
      = E_n[R_t (x x^T + A^† n n^T (A^†)^T) R_t^T] − (R_t x)(R_t x)^T
      = E_n[R_t A^† n n^T (A^†)^T R_t^T]
      = σ² R_t A^† (A^†)^T R_t^T
      = σ² Q_t Q_t^T,

where we have defined Q_t := R_t A^† = (I − (I − η W A^T A)^t) A^†. Then, to compute the variance, we take the trace of Cov_t and use the fact that the trace of a matrix is the sum of its eigenvalues; the eigenvalues of Q_t Q_t^T are exactly the squares of the singular values of Q_t. This gives us:

Var_t = σ² Σ_{i=1}^{p} ν²_{t,i},  (A.21)

where ν_{t,i} are the singular values of Q_t. Summing these expressions for the bias and variance of the estimate exactly gives equation (4.17).
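The closed-form bias and variance above can be sanity-checked against a Monte Carlo estimate. The following NumPy sketch (ours, not from the dissertation) builds R_t and Q_t explicitly and compares the analytic expressions with empirical statistics over many noise draws:

```python
# Monte Carlo check of the bias/variance expressions in Theorem 4.1.2.
# Since z_t = R_t (x + A^+ n), the analytic bias is ||(I - R_t) x||^2 and
# the analytic variance is sigma^2 * sum of squared singular values of Q_t.
import numpy as np

rng = np.random.default_rng(1)
p, q, sigma, t = 20, 40, 0.1, 50
A = rng.standard_normal((p, q))       # full row rank with probability 1
x = rng.standard_normal(q)
G = rng.standard_normal((q, q))
W = G @ G.T + np.eye(q)
eta = 0.5 / np.linalg.norm(W @ A.T @ A, 2)

M = np.eye(q) - eta * W @ A.T @ A
Rt = np.eye(q) - np.linalg.matrix_power(M, t)       # R_t in the proof
A_pinv = np.linalg.pinv(A)
Qt = Rt @ A_pinv                                    # Q_t = R_t A^+

bias2 = np.linalg.norm(np.linalg.matrix_power(M, t) @ x) ** 2
var = sigma**2 * np.sum(np.linalg.svd(Qt, compute_uv=False) ** 2)

# Empirical statistics of z_t over 20000 noise realizations.
N = sigma * rng.standard_normal((20000, p))
zs = (x + N @ A_pinv.T) @ Rt.T
print("bias^2:", bias2, "vs MC:", np.linalg.norm(zs.mean(0) - x) ** 2)
print("var:   ", var,   "vs MC:", zs.var(0).sum())
```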
A.3 Proof of Corollary 1

We now consider the single-coil MRI forward operator A = M F, where F is the usual Fourier operator. Since A ∈ C^{p×q}, we introduce an equivalent real-valued operator Ã ∈ R^{2p×2q} (that maps between stacked real and imaginary parts of vectors) to ensure that everything is real-valued. Throughout, we use subscripts R and I to denote the real and imaginary parts of vectors or operators. We define the matrices M̃ ∈ R^{2p×2q} and F̃ ∈ R^{2q×2q} by:

M̃ = [ M  0 ;  0  M ],    F̃ = [ F_R  −F_I ;  F_I  F_R ].

We note that F̃ is orthogonal, i.e., F̃^T F̃ = F̃ F̃^T = I. Thus, we define Ã = M̃ F̃. It is straightforward to verify that applying Ã to a vector with stacked real and imaginary components is equivalent to applying A to a complex vector. We also rewrite x̃ = [x_R ; x_I] and ñ = [n_R ; n_I]. Supposing that n ∼ N(0, σ²I), we then have that n_R and n_I are iid ∼ N(0, (σ²/2) I).

We consider a network with a 2-channel output, i.e., a network that outputs z̃ ∈ R^{2q}, so that its NTK is W̃ ∈ R^{2q×2q}. We now suppose that W̃ is diagonalized by F̃ with the following structure:

W̃ = F̃^T Λ̃ F̃,    Λ̃ = [ Λ  0 ;  0  Λ ].

With this structure, applying W̃ to a vector with its real and imaginary parts concatenated is equivalent to applying the circulant matrix W = F^H Λ F to a complex vector. With this reformulation, the update equation for z̃_t becomes:

z̃_t = (I − (I − η W̃ Ã^T Ã)^t)(x̃ + Ã^† ñ)
    = (I − (I − η F̃^T Λ̃ F̃ (M̃ F̃)^T M̃ F̃)^t)(x̃ + Ã^† ñ)
    = (I − (I − η F̃^T Λ̃ M̃^T M̃ F̃)^t)(x̃ + Ã^† ñ)
    = F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃ (x̃ + Ã^† ñ).

In this case, the bias becomes:

∥Bias_t∥²₂ = ∥E_ñ[z̃_t] − x̃∥²₂
           = ∥E_ñ[F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃ (x̃ + Ã^† ñ)] − x̃∥²₂
           = ∥F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃ x̃ − x̃∥²₂
           = ∥F̃^T (I − η Λ̃ M̃^T M̃)^t F̃ x̃∥²₂
           = ∥(I − η Λ̃ M̃^T M̃)^t F̃ x̃∥²₂
           = Σ_{i=1}^{2q} (1 − η λ̃_i m̃_i)^{2t} |(F̃ x̃)_i|²
           = Σ_{i=1}^{q} (1 − η λ_i m_i)^{2t} |(F x)_i|²,

where λ̃_i are the diagonal entries of Λ̃, m̃_i are the diagonal entries of M̃^T M̃, and (F̃ x̃)_i is the i-th entry of F̃ x̃.

The computation of the covariance is similar to Theorem 2 with small modifications. First, we now have R_t = F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃. Also, we define Q_t = R_t Ã^T, which is valid since we have the identity Ã Ã^T ñ = ñ. We also note the additional factor of 1/2 introduced by separating n into n_R and n_I. Thus, we have:

Var_t = tr(Cov_t)
      = (σ²/2) tr(F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃ Ã^T Ã F̃^T (I − (I − η Λ̃ M̃^T M̃)^t) F̃)
      = (σ²/2) tr((I − (I − η Λ̃ M̃^T M̃)^t) M̃^T M̃ (I − (I − η Λ̃ M̃^T M̃)^t))
      = (σ²/2) tr(M̃^T M̃ (I − (I − η Λ̃ M̃^T M̃)^t)²)
      = (σ²/2) tr((I − (I − η Λ̃ M̃^T M̃)^t)²)     (since the mask entries m̃_i are binary, the factor m̃_i does not change each summand)
      = (σ²/2) Σ_{i=1}^{2q} (1 − (1 − η λ̃_i m̃_i)^t)²
      = σ² Σ_{i=1}^{q} (1 − (1 − η λ_i m_i)^t)².

Summing these expressions for the bias and variance yields the result of Corollary 1.

APPENDIX B

APPENDIX FOR AUTOENCODING SEQUENTIAL DEEP IMAGE PRIOR

In this Appendix, we first shed more light on the impact of the DIP network input by studying the training dynamics using the neural tangent kernel for CNNs with residual connections. Next, we show how autoencoders trained on clean images can be used as reconstructors at testing time. Lastly, we provide additional experimental results and visualizations.

B.1 Case Study: Impact of the DIP Network Input through the Lens of the Neural Tangent Kernel in Residual Networks

We show the impact of the DIP input through the lens of the Neural Tangent Kernel (NTK) (Tachella et al., 2021; Jacot et al., 2018) for residual networks¹. The NTK is a tool used to analyze the training dynamics of neural networks in the infinite width limit, where for CNNs the network width corresponds to the number of channels.
In this limit, the change of any individual parameter during training becomes very small, which means that the change in the network's output during training can be accurately approximated by a first-order Taylor expansion around its initialization. In the context of DIP, we consider training a neural network f with parameters θ and a fixed input z using gradient descent. At each training iteration, the network parameters are updated according to:

θ^{(t+1)} = θ^{(t)} − β ∇_θ L(f_{θ^{(t)}}(z)),  (B.1)

where L is the loss function and β is the learning rate. We also consider the resulting change in the network's output due to this parameter update using the first-order Taylor expansion:

f_{θ^{(t+1)}}(z) ≈ f_{θ^{(t)}}(z) + ∇_θ f_θ(z)|_{θ=θ^{(t)}} (θ^{(t+1)} − θ^{(t)}).  (B.2)

Substituting (B.1) into (B.2) and applying the chain rule to write

∇_θ L(f_{θ^{(t)}}(z)) = (∇_θ f_θ(z)|_{θ=θ^{(t)}})^T (∇_{f_{θ^{(t)}}(z)} L(f_{θ^{(t)}}(z)))  (B.3)

yields the equation:

f_{θ^{(t+1)}}(z) ≈ f_{θ^{(t)}}(z) − β Θ^{(t)} (∇_{f_{θ^{(t)}}(z)} L(f_{θ^{(t)}}(z))),  where  Θ^{(t)} := (∇_θ f_θ(z)|_{θ=θ^{(t)}})(∇_θ f_θ(z)|_{θ=θ^{(t)}})^T.  (B.4)

¹We note that skip and residual connections are not exactly the same, as skip represents concatenation (typically from the encoder to the decoder end) while residual represents adding the input to the output. However, both operations correspond to sending the initial input or features of a network to its latter portion or output.

In the infinite width limit, NTK theory states that the matrix Θ^{(t)} stays fixed throughout training, so that Θ^{(t)} = Θ^{(0)} for all t. This matrix is called the neural tangent kernel, and we denote it as Θ. Moreover, because the parameters θ^{(0)} are initialized randomly, in the infinite width limit the NTK Θ becomes deterministic (as a function of z) due to the law of large numbers (Tachella et al., 2021), and does not depend on the specific instantiation of θ^{(0)}.

In DIP, the loss function is the least squares loss given in (2.8). For simplicity, we consider the denoising case, where the forward operator A = I. Then, substituting the gradient of the loss into (B.4) shows explicitly how the output of deep image prior evolves during training:

f_{θ^{(t+1)}}(z) = f_{θ^{(t)}}(z) + β Θ (y − f_{θ^{(t)}}(z)).  (B.5)

Using this recursion relation, one can derive a closed form of the network output at iteration t in terms of the initial output and the NTK (Liang et al., 2024a; Tachella et al., 2021). The reconstruction at iteration t is given by:

f_{θ^{(t)}}(z) = y − (I − β Θ)^t (y − f_{θ^{(0)}}(z)).  (B.6)

It is evident from (B.6) that the initial reconstruction of the network, f_{θ^{(0)}}(z), has an important effect on the training dynamics of DIP.
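The closed form (B.6) can be simulated directly with any fixed PSD kernel standing in for the NTK. The following NumPy sketch (ours, not from the dissertation; the random Jacobian J is a stand-in, not an actual network Jacobian) illustrates how the initial output f_{θ^{(0)}}(z) shapes the trajectory:

```python
# Simulating the NTK dynamics of (B.6): the output at iteration t is
# y - (I - beta*Theta)^t (y - f0), so the initial output f0 determines
# how much residual remains to be fit.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.sin(np.linspace(0, 4 * np.pi, n))      # clean signal
y = x + 0.5 * rng.standard_normal(n)          # noisy target

J = rng.standard_normal((n, 200))             # stand-in "Jacobian"
Theta = J @ J.T                               # fixed PSD kernel playing the role of the NTK
beta = 1.0 / np.linalg.norm(Theta, 2)

def output_at(t, f0):
    M = np.eye(n) - beta * Theta
    return y - np.linalg.matrix_power(M, t) @ (y - f0)

for f0, name in [(np.zeros(n), "f0 = 0"), (x, "f0 = x (oracle)"), (y, "f0 = y")]:
    errs = [np.linalg.norm(output_at(t, f0) - x) for t in (0, 10, 100, 1000)]
    print(name, np.round(errs, 2))
# With f0 = y the output equals y at every t (no denoising occurs); with
# f0 = x the error starts at zero and grows as the trajectory fits y.
```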
Furthermore, networks used in DIP often feature skip connections from earlier layers to later ones, and it is natural to believe that these connections may cause the input z to have a large effect on f_{θ^{(0)}}(z). In the following theorem, we analyze the training dynamics of CNNs with a very similar architectural modification: a residual connection that adds the input directly to the network output.

Theorem B.1.1 (Dynamics of DIP with Residual Connections). Let g be a convolutional neural network with parameters θ. We consider the complementary residual network f defined by f_θ(z) = z + g_θ(z). Suppose that f is trained using gradient descent with the loss L(f_θ(z)) = (1/2)∥f_θ(z) − y∥²₂. Then, in the infinite width limit (number of channels), in expectation over the initialization of the parameters θ, the output at training iteration t is given by:

f_{θ^{(t)}}(z) = y − (I − β Θ)^t (y − z).  (B.7)

The proof of Theorem B.1.1 is provided in Appendix B.1.1, along with a precise statement of the assumptions on the network architecture and parameter initialization. Additionally, in Appendix B.1.2 we provide a simple experiment to validate that the training dynamics given in (B.7) hold for real networks.

Remark 6. Theorem B.1.1 can be used to understand how the choice of network input affects the performance of DIP for image denoising. To gain intuition, we consider two special cases. First, we consider using the noisy image y as the input z. In this case, equation (B.7) simplifies to f_{θ^{(t)}}(y) = y for all iterations t, so absolutely no denoising occurs. On the other hand, we consider the oracle case where the clean image x is used as z. This gives us f_{θ^{(t)}}(x) = y − (I − β Θ)^t (y − x). We see that at initialization (t = 0) we already expect perfect denoising, since f_{θ^{(0)}}(x) = y − (I − β Θ)^0 (y − x) = x. These two cases support the intuition that using a network input closer to the true image could result in better performance in fewer training iterations.

B.1.1 Proof of Theorem B.1.1

Setting of Theorem B.1.1. We first precisely state the conditions of Theorem B.1.1, in particular the network architectures considered and the corresponding parameter initializations. The present setting is very similar to the setting considered in (Tachella et al., 2021), but we provide the details here for completeness. We consider an L-layer CNN with c_in input channels, c_out output channels, and c hidden channels in all intermediate layers. We assume that c_in, c_out ≪ c. We assume all convolutions have a filter size of r. For simplicity, the network input and output are vectorized, so convolutions of any dimension are treated identically; for example, for a 2D CNN with 5 × 5 kernels, r = 25. Written explicitly, a network g with this architecture takes the form

g_θ(z) = C_L(φ(C_{L−1}(φ(· · · φ(C_1(z)))))),

where the operators C_i represent convolutions with an additive bias, and φ is a pointwise activation function such as ReLU. In this section, we also consider the residual network architecture defined by f_θ(z) = z + g_θ(z). We assume that the parameters are initialized using the He initialization (He et al., 2015). With this initialization, the first-layer convolutional filter weights are drawn from N(0, σ²_w/(c_in r)), and the filter weights for all other layers are drawn from N(0, σ²_w/(c r)), where the variance σ²_w depends on the non-linearity used in the network; for ReLU networks, σ²_w = 2 (He et al., 2015). All biases are initialized to 0.
Proof. In the setting described above, the NTK emerges in the limit c → ∞. A body of existing theory (Tachella et al., 2021; Jacot et al., 2018; Arora et al., 2019) establishes that in this limit the NTK is a deterministic matrix as a function of the network input z. This theory does not consider residual connections, but applies immediately to both g and f. For g, the NTK is given by

Θ := (∇_θ g_θ(z)|_{θ=θ^{(0)}})(∇_θ g_θ(z)|_{θ=θ^{(0)}})^T.  (B.8)

However, we can see that ∇_θ g_θ(z)|_{θ=θ^{(0)}} = ∇_θ f_θ(z)|_{θ=θ^{(0)}}. Therefore, the linearization given in equation (B.4) holds for f using the same kernel Θ, and equation (B.6) describes the training dynamics of f. Using equation (B.6), we can write:

f_{θ^{(t)}}(z) = y − (I − β Θ)^t (y − f_{θ^{(0)}}(z))  (B.9)
             = y − (I − β Θ)^t (y − z − g_{θ^{(0)}}(z)).  (B.10)

To prove Theorem B.1.1, we consider the output f_{θ^{(t)}}(z) in expectation over the initialization θ^{(0)}. Since all parameters θ^{(0)} are drawn from mean-zero Gaussian distributions, we find that E_{θ^{(0)}}[g_{θ^{(0)}}(z)] = 0 for any input z. Since Θ is deterministic in the limit c → ∞, in expectation over θ^{(0)} equation (B.10) reduces to f_{θ^{(t)}}(z) = y − (I − β Θ)^t (y − z), which proves Theorem B.1.1. □

B.1.2 Example to support the results of Theorem B.1.1

We now provide a simple example using real networks to support the validity of Theorem B.1.1. This experiment substantiates both of the special cases considered in Remark 6. Additionally, it shows that using the ground truth as the network input greatly inhibits overfitting. Indeed, in this experiment, we find that a residual network trained with the ground truth as input takes approximately 25 times more training iterations to completely learn the noisy signal than a residual network trained using a random noise input.

We use DIP for denoising a 1D sinusoidal signal. We denote the clean signal x. The noisy signal is y = x + n, where n ∼ N(0, I). The network used is a five-layer ReLU CNN with a residual connection. The full architecture can be written as f(z) = z + C_5(ReLU(C_4(· · · ReLU(C_2(ReLU(C_1(z))))))), where each C_i represents a convolution (with bias). The signal has a size of 100, and the convolutions each have a filter size of 3, with 64 hidden channels. In all cases, the network is trained using gradient descent with a learning rate of 5 × 10^{−4}. The same seed was used to initialize the network in all cases.
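A condensed PyTorch sketch of this setup is given below (ours, not the experiment's original code; the sinusoid frequency and number of iterations are illustrative assumptions, while the architecture, filter size, channel count, and learning rate follow the description above):

```python
# 1D DIP denoising with a residual five-layer ReLU CNN, trained with the
# three inputs considered in this experiment: x (oracle), y (noisy), and
# random noise z. The same init seed is reused for all three runs.
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 100
grid = torch.linspace(0, 4 * torch.pi, n)
x = torch.sin(grid).view(1, 1, n)              # clean 1D signal
y = x + torch.randn_like(x)                    # noisy signal, n ~ N(0, I)
z_noise = torch.randn_like(x)

def make_net():
    layers, c_in = [], 1
    for i in range(5):
        c_out = 1 if i == 4 else 64
        layers.append(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1))
        if i < 4:
            layers.append(nn.ReLU())
        c_in = c_out
    return nn.Sequential(*layers)

for z, name in [(x, "input = x"), (y, "input = y"), (z_noise, "input = noise")]:
    torch.manual_seed(1)                       # identical initialization per run
    g = make_net()
    opt = torch.optim.SGD(g.parameters(), lr=5e-4)
    for _ in range(2000):
        out = z + g(z)                         # residual connection: f(z) = z + g(z)
        loss = 0.5 * ((out - y) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    print(name, "| final ||f(z) - x|| =", float(((z + g(z)) - x).norm()))
```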
We trained this network using three different inputs: the true signal x, the noisy signal y, and noise z ∼ N(0, I). The results of training the network using these three inputs are shown in Figure B.1. We find that the results for training this real, reasonably sized network show the behavior predicted by Theorem B.1.1, which was obtained in the infinite width limit. Indeed, equation (B.7) predicts that when x is used as the input, the initial error will be small, with eventual overfitting to the noisy signal. This is observed in Figure B.1, where the error is lowest at initialization and steadily increases throughout training. With this input, the network is highly resistant to overfitting, requiring approximately 10^5 training iterations to completely fit the noisy signal. We also see that when y is used as the input, the error curve obtained is essentially flat and converges quickly to the error of the noisy signal. This agrees with the expectation that the network output will be y for all iterations t when y is used as the input. Finally, when random noise z is used as the input, the typical DIP behavior emerges.

Figure B.1 Ground truth signal and measurements (left), and the results of the denoising experiment in Appendix B.1.2 (right) to support the claims in Theorem B.1.1 and Remark 6. [Left panel: ground truth x and noisy signal y. Right panel: error of the network output (ℓ₂ norm) versus training iteration for the three inputs (ground truth x, noisy signal y, noise z ∼ N(0, I)), with the noisy-signal error ∥y − x∥₂ shown for reference.]

B.1.3 Trained Autoencoders as Reconstructors

Here, we investigate how the autoencoder term in aSeqDIP improves the reconstruction quality while mitigating the impact of noise overfitting. In particular, we try to answer the question: can an autoencoder trained on clean images operate as a reconstructor at testing time?

Figure B.2 Average PSNR (y-axis) of 8 MRI images (with 4x undersampling) obtained by optimizing the input of a trained autoencoder (using (B.11)) w.r.t. different values of the regularization parameter λ_a in (B.12) (x-axis).

To address this question, we perform the following steps: (i) train an autoencoder on fully sampled measurements or clean data images, and (ii) utilize the trained autoencoder with unseen subsampled or corrupted measurements, optimizing over the input using the DIP objective with the autoencoder term. This enables the autoencoder to function as an image reconstructor. Specifically, given a training dataset D, comprising unperturbed images or fully sampled MRI/CT data, denoted by x, we train an autoencoder U-Net g : R^n → R^n with parameters ψ. The training process seeks to obtain ψ̂ as

ψ̂ = argmin_ψ (1/|D|) Σ_{x∈D} ∥g_ψ(x) − x∥²₂.  (B.11)

Subsequently, given unseen measurements y and the learned autoencoder's parameters ψ̂, we test the reconstruction of

z ← argmin_z ∥A g_ψ̂(z) − y∥²₂ + λ_a ∥g_ψ̂(z) − z∥²₂,  (B.12)

where λ_a ∈ R₊ is a regularization parameter. We perform this experiment by training ψ using 3000 fully sampled scans from the fastMRI dataset (Zbontar et al., 2018). We then evaluate the reconstruction quality of the trained autoencoder using 8 scans from the fastMRI testing set. The average PSNR results for different values of λ_a are depicted in Figure B.2. As observed, a trained autoencoder effectively serves as a reconstructor, as evidenced by the achieved PSNR. Thus, we deduce that the autoencoder term in aSeqDIP not only mitigates noise overfitting but also enhances the reconstruction quality as an important prior.
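A schematic PyTorch sketch of this two-stage procedure is given below (ours, under stated assumptions: "autoencoder" stands in for the U-Net g_ψ, "dataset" for D, and "A" for a forward operator object with a callable and an adjoint; this is not the experiment's reference implementation):

```python
# Stage (B.11): train the autoencoder on clean images.
# Stage (B.12): freeze psi and optimize the *input* z for new measurements y.
import torch

def train_autoencoder(autoencoder, dataset, epochs=10, lr=1e-4):      # eq. (B.11)
    opt = torch.optim.Adam(autoencoder.parameters(), lr=lr)
    for _ in range(epochs):
        for x in dataset:
            loss = ((autoencoder(x) - x) ** 2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
    return autoencoder

def reconstruct(autoencoder, A, y, lam_a=1.0, steps=2000, lr=1e-2):    # eq. (B.12)
    for p in autoencoder.parameters():
        p.requires_grad_(False)                # psi is fixed at test time
    z = torch.zeros_like(A.adjoint(y)).requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        out = autoencoder(z)
        # data consistency + autoencoding regularization
        loss = ((A(out) - y) ** 2).sum() + lam_a * ((out - z) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return autoencoder(z.detach())
```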
B.2 Additional Experiments

B.2.1 Robustness to Noise Overfitting for the Denoising Task

Figure B.3 Average PSNR results of 20 images w.r.t. the optimization iteration using aSeqDIP and DIP-based baselines.

In this subsection, we illustrate aSeqDIP's robustness to the noise overfitting issue for the denoising task. The average PSNR curves for 20 images from the CBSD68 dataset, denoised using aSeqDIP and other DIP-based methods, are given in Figure B.3. We observe two key points. First, in addition to higher PSNR, aSeqDIP shows higher robustness against noise overfitting compared to other DIP-based methods, consistent with the MRI and CT results in Figure 5.4. Second, unlike MRI and CT, the onset of noise overfitting occurs earlier, but the subsequent decay is very small.

B.2.2 Comparison with VarNet: An End-to-End MRI Supervised Method

Here, we compare aSeqDIP with the End-to-End (E2E) MRI supervised model (Sriram et al., 2020) that uses the variational network (VarNet). Results are given in Table B.1. As observed, we only slightly under-perform VarNet (trained on 8000 data points) for the task of MRI reconstruction, all without requiring any labeled training data. It is important to note that, at inference, E2E models only require a few unrolling steps, whereas aSeqDIP is an optimization method that requires training the network parameters for each new set of measurements.

Task: MRI
Method                                                                    | Data Independency | PSNR (↑)
E2E VarNet (trained on 8000 data points from fastMRI) (Sriram et al., 2020) | ×               | 34.89
E2E VarNet (trained on 3000 data points from fastMRI) (Sriram et al., 2020) | ×               | 33.78
aSeqDIP (Ours)                                                            | ✓                 | 34.08

Table B.1 Average PSNR results (over 20 MRI scans at 4x undersampling from the testing set of fastMRI) reported by our method against E2E VarNet (Sriram et al., 2020) (pre-trained on fastMRI) for the task of MRI reconstruction.

B.2.3 Comparison with DM-based Methods on the FFHQ Dataset

Task                | Method                                       | Data Independency | PSNR (↑)
Denoising           | DPS (trained on FFHQ) (Chung et al., 2023c)  | ×                 | 31.45
Denoising           | DDNM (trained on FFHQ) (Wang et al., 2023b)  | ×                 | 31.65
Denoising           | aSeqDIP (Ours)                               | ✓                 | 31.77
Random In-Painting  | DPS (trained on FFHQ) (Chung et al., 2023c)  | ×                 | 24.54
Random In-Painting  | DDNM (trained on FFHQ) (Wang et al., 2023b)  | ×                 | 25.54
Random In-Painting  | aSeqDIP (Ours)                               | ✓                 | 25.76
Deblurring          | DPS (trained on FFHQ) (Chung et al., 2023c)  | ×                 | 23.67
Deblurring          | DDNM (trained on FFHQ) (Wang et al., 2023b)  | ×                 | 23.88
Deblurring          | aSeqDIP (Ours)                               | ✓                 | 24.02
Box In-Painting     | DPS (trained on FFHQ) (Chung et al., 2023c)  | ×                 | 22.67
Box In-Painting     | DDNM (trained on FFHQ) (Wang et al., 2023b)  | ×                 | 22.89
Box In-Painting     | aSeqDIP (Ours)                               | ✓                 | 22.3

Table B.2 Average PSNR results reported by our method against DPS as well as a more recent leading method, DDNM (Wang et al., 2023b), for four image restoration tasks: denoising, random in-painting, non-linear deblurring, and box in-painting.

Here, we present average PSNR results (averaged over 20 images) for the tasks of denoising, random inpainting (97% missing pixels), box-inpainting (with HIAR of 0.25), and non-linear deblurring, for our method versus the Denoising Diffusion Null-Space Model (DDNM) (Wang et al., 2023b) and DPS (Chung et al., 2023c) on the FFHQ testing dataset. For DPS and DDNM, we used a pre-trained model that was trained on the training set of FFHQ. As observed, our training-data-free method achieves competitive or slightly improved results when compared to data-intensive methods on all tasks other than box-inpainting (for which we under-perform by less than 1 dB), all without requiring a pre-trained model.

B.2.4 Ablation Study on the Regularization Parameter in aSeqDIP

Figure B.4 Average PSNR results of 20 MRI scans (with 4x undersampling) in aSeqDIP for the cases where λ ∈ {0.5, 1, 2} and i ∈ [10000].

In this section, we conduct an ablation study on the choice of the autoencoding regularization parameter, λ, in (5.3). Specifically, we conduct an experiment using 20 MRI scans to examine the impact of λ on the reconstruction quality and noise overfitting in aSeqDIP. We note that while the main results in Section 5.3 use NK = 4000, in these experiments we run our algorithm for an extended number of iterations to investigate the onset of noise overfitting; we set K = 5000 and N = 2 for this purpose.
In this experiment, we run aSeqDIP with values of λ ∈ {0.5, 1, 2}. Average PSNR results are given in Figure B.4. It is evident that, on average, using λ = 1 yields the most favorable results in terms of PSNR values, which is our selected choice. Furthermore, we observe that for λ = 0.5 (red), the start of the PSNR decay (the onset of noise overfitting) precedes that of λ = 1 and λ = 2 (blue and black).

B.2.5 Ablation Study on N and K in aSeqDIP

In this section, we conduct an ablation study to investigate the impact of the number of gradient updates (N) per set of network parameters (K) in aSeqDIP. Specifically, we report the PSNR results across the tasks of MRI, CT, denoising, and in-painting for the case of NK = 4000, considering combinations of (N, K) of (1, 4000), (2, 2000), and (4, 1000). A schematic sketch of how N and K interact is given below, followed by the results.
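The following schematic sketch (ours, not the authors' reference code) makes the roles of N and K concrete: K sequential input updates, each preceded by N gradient steps on the network weights. Here "net", "A", and "y" stand in for the U-Net, the forward operator, and the measurements, and the loss is the data-consistency plus autoencoding objective of (5.3):

```python
# Schematic aSeqDIP loop: N weight updates per input, repeated K times,
# with the input updated in a feed-forward fashion between stages.
import torch

def aseqdip(net, A, y, K=2000, N=2, lam=1.0, lr=1e-4):
    z = A.adjoint(y)                            # initial network input
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(K):
        for _ in range(N):                      # N gradient updates on the weights
            out = net(z)
            loss = ((A(out) - y) ** 2).sum() + lam * ((out - z) ** 2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            z = net(z)                          # feed-forward input update
    return z
```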
The results, presented in Figure B.5, reveal that across all tasks considered, the combination of N = 2 and K = 2000 consistently yields the most favorable results in terms of PSNR.

Figure B.5 Ablation study on the choice of the number of gradient updates N per U-Net and the number of U-Nets K, in terms of PSNR, using the four considered tasks for the case of NK = 4000.

B.2.6 Additional Implementation Details

For denoising, each noisy RGB image is 512 × 512 and is generated by adding additive white Gaussian noise with two noise levels, as described in Table 5.1. For in-painting, we consider a central region mask, and we evaluate two hole-to-image area ratios (HIAR) with image size 512 × 512. For Vanilla DIP, Self-Guided DIP, Reference-Guided DIP, TV-DIP, Rethinking DIP, and SGLD DIP, we use 4000 iterations. For TV-DIP, we set the regularization parameter to 1. For ES-DIP, we use the default configuration provided by the authors for the three considered tasks. The reference image in Reference-Guided DIP is chosen (using a distance metric such as the Euclidean distance) as the image most similar to an estimated test reconstruction from undersampled or sparse-view data. For DM-based approaches, we use the codes attached to the authors' papers: specifically, Score-MRI (Chung and Ye, 2022), MCG (Chung et al., 2022), and DPS (Chung et al., 2023c). For our experiments with natural images (denoising, in-painting, and non-linear deblurring) in Table 5.2, we used the CBSD68 dataset. As such, for DPS (the DM-based method), we utilized a pre-trained model that was trained on a very large and diverse dataset, ImageNet, at 128 × 128, 256 × 256, and 512 × 512. This pre-trained model is much more generalizable than the other option, which was trained on FFHQ (a dataset of faces); according to (Zhang et al., 2024b), the ImageNet pre-trained model has high generalizability. For the FFHQ comparison results in Appendix B.2.3, we used an FFHQ-pre-trained DM for DPS and DDNM. For MRI, the pre-trained model used in Score-MRI was originally trained on natural images and then fine-tuned using the training set of fastMRI. A similar approach was used for the CT pre-trained model used in MCG.

B.3 Additional Visualizations

Figures B.6 and B.7 present additional MRI visualizations, whereas Figures B.8 and B.9 present CT visualizations. Samples from the natural image restoration tasks are given in Figure B.10, Figure B.11, and Figure B.12 for box-inpainting, denoising, and non-linear deblurring, respectively.

Figure B.6 Visualization of ground-truth and reconstructed images using different methods for a knee image from the fastMRI dataset with 4x k-space undersampling. A region of interest is shown with a green box, and its error (magnitude) is shown in the panel on the top right. aSeqDIP provides the sharpest and clearest reconstruction of image features. [Panels: Ground Truth, Input, Self-Guided, Reference-Guided, Score MRI, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.7 Visualization of ground-truth and reconstructed images using different methods for a knee image from the fastMRI dataset with 8x k-space undersampling. A region of interest is shown with a green box, and its error (magnitude) is shown in the panel on the top right. aSeqDIP provides the clearest reconstruction of image features among the methods. [Panels: Ground Truth, Input, Self-Guided, Reference-Guided, Score MRI, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.8 Visualization of ground-truth and reconstructed images using different methods for a CT scan from the AAPM dataset with 18 views. A region of interest is shown with a green box, and its error (magnitude) is shown in the panel on the top right. aSeqDIP provides the sharpest and clearest reconstruction of image features. [Panels: Ground Truth, Input, Self-Guided, Reference-Guided, MCG, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.9 Visualization of ground-truth and reconstructed images using different methods for a CT scan from the AAPM dataset with 30 views. A region of interest is shown with a green box, and its error (magnitude) is shown in the panel on the top right. aSeqDIP provides better reconstruction of small and low-contrast image features. [Panels: Ground Truth, Input, Self-Guided, Reference-Guided, MCG, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.10 In-painting example with 0.1 HIAR, where image restorations of different methods are given using an example from the CBSD68 dataset. The diffusion-based DPS produces spurious (although sharp) content in the hole region, while aSeqDIP much better preserves features of the original ground truth. [Panels: Ground Truth, Input, Self-Guided, SGLD, DPS, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.11 Denoising example with σ_d = 25, where image restorations of different methods are given using an example from the CBSD68 dataset. aSeqDIP provides the sharpest and clearest reconstruction of image features. [Panels: Ground Truth, Input, Self-Guided, SGLD, DPS, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]

Figure B.12 Non-linear deblurring samples, where image restorations of different methods are given using an example from the FFHQ dataset. [Panels: Ground Truth, Input, Self-Guided, SGLD, DPS, aSeqDIP, Vanilla DIP; each reconstruction is labeled with its PSNR.]
APPENDIX C

APPENDIX FOR ROBUST MRI RECONSTRUCTION BY SMOOTHED UNROLLING

C.1 Preliminary of Theorem 6.3.1

Lemma 1. Let f : R^d → R^m be any bounded function, and let η ∼ N(0, σ²I). We define g : R^d → R^m as

g(x) = E_η[f(x + η)].

Then g is an (M/(√(2π) σ))-Lipschitz map, where M = 2 max_{x∈R^d} ∥f(x)∥₂. In particular, for any x, δ ∈ R^d:

∥g(x) − g(x + δ)∥₂ ≤ (M/(√(2π) σ)) ∥δ∥₂.

Proof. The proof of this bound follows recent work (Wolf, 2019), with a modification on M. Let μ be the probability density function of the random variable η. By the change of variables w = x + η and w = x + η + δ for the integrals constituting g(x) and g(x + δ), we have

∥g(x) − g(x + δ)∥₂ = ∥∫_{R^d} f(w) [μ(w − x) − μ(w − x − δ)] dw∥₂.

Then, we have

∥g(x) − g(x + δ)∥₂ ≤ ∫_{R^d} ∥f(w) [μ(w − x) − μ(w − x − δ)]∥₂ dw,

which is a standard result for the norm of an integral. We further apply Hölder's inequality to upper bound ∥g(x) − g(x + δ)∥₂ by

max_{x∈R^d} (∥f(x)∥₂) ∫_{R^d} |μ(w − x) − μ(w − x − δ)| dw.  (C.1)

Observe that μ(w − x) ≥ μ(w − x − δ) if ∥w − x∥₂ ≤ ∥w − x − δ∥₂. Let D = {w : ∥w − x∥₂ ≤ ∥w − x − δ∥₂}. Then, we can rewrite the above bound as

max_{x∈R^d} (∥f(x)∥₂) · 2 ∫_D [μ(w − x) − μ(w − x − δ)] dw
= (M/2) · ( 2 ∫_D μ(w − x) dw − 2 ∫_D μ(w − x − δ) dw ).  (C.2), (C.3)

Following Lemma 3 in (Lakshmanan et al., 2008), we obtain the bound

2 ∫_D μ(w − x) dw − 2 ∫_D μ(w − x − δ) dw ≤ (2/(√(2π) σ)) ∥δ∥₂,  (C.4)

which implies that

∥g(x) − g(x + δ)∥₂ ≤ (2 max_{x∈R^d} (∥f(x)∥₂)/(√(2π) σ)) ∥δ∥₂ = (M/(√(2π) σ)) ∥δ∥₂.

This completes the proof. □
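Lemma 1 is straightforward to verify numerically with a Monte Carlo estimate of the smoothed map. The sketch below (ours, not from the dissertation; the choice f(u) = tanh(3u) is an arbitrary bounded map used only for illustration) checks the Lipschitz bound:

```python
# Numerical illustration of Lemma 1: the Gaussian-smoothed map
# g(x) = E_eta[f(x + eta)] of a bounded f satisfies
# ||g(x) - g(x + delta)|| <= M/(sqrt(2*pi)*sigma) * ||delta||, M = 2*max||f||.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_mc = 3, 0.5, 100_000
f = lambda u: np.tanh(3.0 * u)                 # bounded map: ||f(u)||_2 <= sqrt(d)
M = 2.0 * np.sqrt(d)

eta = sigma * rng.standard_normal((n_mc, d))   # common noise draws for both points
g = lambda x: f(x + eta).mean(axis=0)          # Monte Carlo estimate of E_eta[f(x + eta)]

x = rng.standard_normal(d)
delta = 1e-2 * rng.standard_normal(d)
lhs = np.linalg.norm(g(x) - g(x + delta))
rhs = M / (np.sqrt(2.0 * np.pi) * sigma) * np.linalg.norm(delta)
print(f"{lhs:.2e} <= {rhs:.2e}")               # the bound of Lemma 1 holds
```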
C.2 Proof of Theorem 6.3.1

Proof. Let x^n_M(A^H y) denote the estimate after the data-consistency step in MoDL at iteration n. We will sometimes drop the input and the y dependence for notational simplicity. Then

x^1_M = (A^H A + I)^{−1} (A^H y + D_θ(A^H y)),  (C.5)
x^n_M = (A^H A + I)^{−1} (A^H y + D_θ(x^{n−1}_M)),  (C.6)

where D_θ is the denoiser function. For the sake of simplicity and consistency with the experiments, we use the weighting parameter λ = 1 in the data-consistency step; the proof works for arbitrary λ. SMUG introduces an iteration-wise smoothing step into MoDL as follows:

x^1_S = (A^H A + I)^{−1} (A^H y + E_{η₁}[D_θ(A^H y + η₁)]),  (C.7)
x^n_S = (A^H A + I)^{−1} (A^H y + E_{η_n}[D_θ(x^{n−1}_S + η_n)])
      = (A^H A + I)^{−1} (A^H y) + (A^H A + I)^{−1} E_{η_n}[D_θ(x^{n−1}_S + η_n)],  (C.8), (C.9)

where we apply the expectation to the denoiser D_θ at each iteration, and η_n denotes the smoothing noise at iteration n. The robustness error of SMUG after n iterations is ∥x^n_S(A^H y) − x^n_S(A^H (y + δ))∥. We apply Lemma 1 and properties of the norm (e.g., the triangle inequality) to bound ∥x^n_S(A^H y) − x^n_S(A^H (y + δ))∥ as

≤ ∥(A^H A + I)^{−1} A^H δ∥
  + ∥(A^H A + I)^{−1} (E_{η_n}[D_θ(x^{n−1}_S(A^H y) + η_n)] − E_{η_n}[D_θ(x^{n−1}_S(A^H (y + δ)) + η_n)])∥  (C.10)
≤ ∥(A^H A + I)^{−1}∥₂ ∥A^H δ∥₂
  + ∥(A^H A + I)^{−1}∥₂ ∥E_{η_n}[D_θ(x^{n−1}_S(A^H y) + η_n)] − E_{η_n}[D_θ(x^{n−1}_S(A^H (y + δ)) + η_n)]∥
≤ ∥(A^H A + I)^{−1}∥₂ ∥A^H δ∥₂
  + ∥(A^H A + I)^{−1}∥₂ (M/(√(2π) σ)) ∥x^{n−1}_S(A^H y) − x^{n−1}_S(A^H (y + δ))∥.  (C.11)

Here, M = 2 max_x (∥D_θ(x)∥). Then we plug in the expressions for x^{n−1}_S(A^H y) and x^{n−1}_S(A^H (y + δ)) (from (C.8)) and bound their normed difference with ∥(A^H A + I)^{−1} A^H δ∥ + ∥(A^H A + I)^{−1} (E_{η_{n−1}}[D_θ(x^{n−2}_S(A^H y) + η_{n−1})] − E_{η_{n−1}}[D_θ(x^{n−2}_S(A^H (y + δ)) + η_{n−1})])∥, which is bounded above similarly as for (C.10). We repeat this process until we reach the initial x^0_S on the right-hand side. This yields the following bound involving a geometric series:

∥x^n_S(A^H y) − x^n_S(A^H (y + δ))∥
≤ ∥A^H δ∥₂ Σ_{j=1}^{n} ∥(A^H A + I)^{−1}∥^j₂ (M/(√(2π) σ))^{j−1}
  + ∥(A^H A + I)^{−1}∥^n₂ (M/(√(2π) σ))^n ∥A^H δ∥₂  (C.12)
≤ ∥A∥₂ ∥δ∥₂ ∥(A^H A + I)^{−1}∥₂ · (1 − (M ∥(A^H A + I)^{−1}∥₂/(√(2π) σ))^n)/(1 − M ∥(A^H A + I)^{−1}∥₂/(√(2π) σ))
  + ∥A∥₂ ∥δ∥₂ ∥(A^H A + I)^{−1}∥^n₂ (M/(√(2π) σ))^n  (C.13)
≤ C_n ∥δ∥₂,  (C.14)

where we used the geometric series formula, and

C_n = α ∥A∥₂ (1 − (Mα/(√(2π) σ))^n)/(1 − Mα/(√(2π) σ)) + ∥A∥₂ (Mα/(√(2π) σ))^n,  with  α = ∥(A^H A + I)^{−1}∥₂. □
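One SMUG iteration, as analyzed above, can be sketched as follows (ours, under stated assumptions: "denoiser" stands in for D_θ and is assumed to accept a batch, "A" for the MRI forward operator with an adjoint, and "cg_solve" for a conjugate-gradient solver of (A^H A + I) u = b; this is a schematic, not the SMUG reference implementation):

```python
# One SMUG step: Monte Carlo smoothing of the denoiser, followed by the
# (A^H A + I)^{-1} data-consistency solve used in (C.8).
import torch

def smug_iteration(x, y, A, denoiser, cg_solve, sigma=0.01, n_mc=8):
    noisy = x.unsqueeze(0) + sigma * torch.randn((n_mc, *x.shape))
    d_smooth = denoiser(noisy).mean(dim=0)     # estimate of E_eta[D_theta(x + eta)]
    rhs = A.adjoint(y) + d_smooth              # b = A^H y + smoothed denoiser output
    return cg_solve(rhs)                       # x_next = (A^H A + I)^{-1} b
```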
APPENDIX D

APPENDIX FOR STEP-WISE TRIPLE-CONSISTENT DIFFUSION SAMPLING FOR INVERSE PROBLEMS

In this Appendix, we start by showing the equivalence between the second formula in (2.10) and (8.5) (Appendix D.1). Then, we discuss the known limitations and future extensions of SITCOM (Appendix D.2). Subsequently, we present experiments to highlight the impact of the proposed backward consistency (Appendix D.3). This is followed by a discussion on phase retrieval (Appendix D.4). In Appendix D.5, we provide further comparison results, and in Appendix D.6, we perform ablation studies to examine the effects of the stopping criterion and other components/hyper-parameters in SITCOM. Appendix D.7 covers the implementation details of tasks and baselines, followed by examples of restored images (Appendix D.8).

D.1 Derivation of (8.5)

From (Luo, 2022), we have

s_θ(x_t, t) = −(1/√(1 − ᾱ_t)) ε_θ(x_t, t).  (D.1)

Rearranging Tweedie's formula in (2.11) to solve for ε_θ(x_t, t) yields

ε_θ(x_t, t) = (x_t − √(ᾱ_t) x̂₀(x_t)) / √(1 − ᾱ_t).  (D.2)

Now, we substitute into the recursive equation for x_{t−1}:

x_{t−1} = (1/√(1 − β_t)) [x_t + β_t s_θ(x_t, t)] + √(β_t) η_t  (D.3)
        = (1/√(1 − β_t)) [x_t − (β_t/√(1 − ᾱ_t)) ε_θ(x_t, t)] + √(β_t) η_t  (D.4), (D.5)
        = (1/√(1 − β_t)) [x_t − (β_t/√(1 − ᾱ_t)) · (x_t − √(ᾱ_t) x̂₀(x_t))/√(1 − ᾱ_t)] + √(β_t) η_t  (D.6)
        = (1/√(1 − β_t)) [x_t − (β_t/(1 − ᾱ_t)) (x_t − √(ᾱ_t) x̂₀(x_t))] + √(β_t) η_t  (D.7)
        = (1/√(1 − β_t)) [(1 − β_t/(1 − ᾱ_t)) x_t + (√(ᾱ_t) β_t/(1 − ᾱ_t)) x̂₀(x_t)] + √(β_t) η_t  (D.8)
        = ((1 − ᾱ_t − β_t)/(√(1 − β_t)(1 − ᾱ_t))) x_t + (√(ᾱ_t) β_t/(√(1 − β_t)(1 − ᾱ_t))) x̂₀(x_t) + √(β_t) η_t  (D.9)
        = ((α_t − ᾱ_t)/(√(α_t)(1 − ᾱ_t))) x_t + (√(ᾱ_t) β_t/(√(α_t)(1 − ᾱ_t))) x̂₀(x_t) + √(β_t) η_t  (D.10)
        = (√(α_t)(1 − ᾱ_{t−1})/(1 − ᾱ_t)) x_t + (√(ᾱ_{t−1}) β_t/(1 − ᾱ_t)) x̂₀(x_t) + √(β_t) η_t,  (D.11), (D.12)

where we used 1 − β_t = α_t and ᾱ_t = α_t ᾱ_{t−1} (so that 1 − ᾱ_t − β_t = α_t − ᾱ_t = α_t (1 − ᾱ_{t−1}) and √(ᾱ_t)/√(α_t) = √(ᾱ_{t−1})). This is equivalent to the second formula in (2.10).
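The posterior-mean form (D.12) of the reverse update is easy to implement. The sketch below (ours, not from the dissertation; "eps_model" is an assumed ε-prediction network, and "alphas", "alpha_bars", and "betas" are the usual DDPM schedule tensors) shows one reverse step:

```python
# One reverse DDPM step in the posterior-mean form derived above:
# x_{t-1} = [sqrt(a_t)(1-ab_{t-1}) x_t + sqrt(ab_{t-1}) b_t x0_hat] / (1-ab_t) + sqrt(b_t) eta.
import torch

def reverse_step(x_t, t, eps_model, alphas, alpha_bars, betas):
    a_t, ab_t, b_t = alphas[t], alpha_bars[t], betas[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    # Tweedie's formula: posterior-mean estimate of x_0 given x_t, as in (D.2)
    x0_hat = (x_t - (1 - ab_t).sqrt() * eps_model(x_t, t)) / ab_t.sqrt()
    mean = (a_t.sqrt() * (1 - ab_prev) * x_t + ab_prev.sqrt() * b_t * x0_hat) / (1 - ab_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + b_t.sqrt() * noise
```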
D.2 Limitations & Future Work

In SITCOM, the stopping criterion parameter is set slightly higher than the level of measurement noise, determined by σ_y. As a result, our method requires access to (or an estimate of) the measurement noise prior to the restoration process. Knowledge of the noise level is also assumed in other works such as DAPS (Zhang et al., 2024a); in practice, classical approaches, such as (Liu et al., 2006; Chen et al., 2015), can be used to estimate the noise. Additionally, the stated conditions and proposed sampler are limited to the non-blind setting, as SITCOM assumes full access to the forward model, unlike works such as (Chung et al., 2023a), which perform both image restoration and forward-model estimation. For future work, in addition to addressing the aforementioned limitations, we aim to extend SITCOM to the latent space and explore its applicability in medical image reconstruction.

D.3 Impact of the proposed Backward Consistency

Here, we demonstrate the impact of the proposed backward diffusion consistency in SITCOM using two experiments.

Figure D.1 Results of applying optimization-based measurement consistency, for which the optimization variable is the DM output (resp. input), are shown in the first (resp. second) row for each task: Box Inpainting (top) and Gaussian Deblurring (bottom). [Panels show the ground truth, the degraded image, and results at t′ ∈ {800, 600, 400, 200}, optimizing over the output of the DM network versus over its input (ours), each for K = 20 iterations.]

First, for the box-inpainting task, we compare optimizing over the input to the DM (as in SITCOM) with optimizing over the output of the DM network (as is done in DCDP (Li et al., 2024) and DAPS (Zhang et al., 2024a)) at time steps t′ ∈ {200, 400, 600}. For each case (selection of t′), we start from t = T and run SITCOM with a step size of ⌊T/N⌋. At t = t′, given x_{t′}, we perform two separate optimizations, with the optimization variable initialized as x_{t′}: one iteratively over the DM network input (ours) and another iteratively over the DM network output (i.e., (8.4) but without the regularization), both running until convergence (i.e., when the loss stops decreasing). For our approach, the result of the optimization from (S1) is used as input to Tweedie's formula in (S2) to compute the posterior mean x̂′₀ = x̂₀(v_t). For the case of optimizing over the DM output, we use (8.4) without regularization. Figure 8.2 shows the results at different time steps. The consistency between the ground truth and the unmasked regions of the estimated images suggests the convergence of the measurement consistency. As observed, SITCOM produces significantly fewer artifacts in the masked region when compared to optimizing over the output. This is evident both at earlier time steps (t′ = 600) and later steps (t′ = 400 and t′ = 200).

For the second experiment, the goal is to show that SITCOM requires a much smaller number of optimization steps to remove the noise as compared to the case where the optimization variable is the output of the DM network. The results are given in Figure D.1, where we repeat the above experiment with two tasks, box-inpainting (top) and Gaussian deblurring (bottom), this time using a fixed number of optimization steps both for SITCOM and when optimizing over the DM output. Specifically, we run SITCOM from t = T to t = t′ + 1. Then, we apply K = 20 iterations (the setting in SITCOM) in (S1), and K = 20 iterations when optimizing (8.4) (without regularization), where the measurement noise is σ_y = 0.05. As shown, compared to optimizing over the DM output, SITCOM significantly reduces noise across all considered t′, underscoring the effect of the proposed backward diffusion consistency when optimizing over the DM input.
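A schematic sketch of the SITCOM inner loop discussed above is given below (ours, not the authors' reference implementation; "eps_model", "A", "y", and the schedule array are assumed available, and the simple proximity term λ∥v − x_t∥² stands in for the exact regularization of (S1)):

```python
# One SITCOM step: optimize over the DM *input* v so that the Tweedie
# estimate x0_hat(v) is measurement-consistent (backward consistency),
# then forward-diffuse the estimate back to t-1 (forward consistency).
import torch

def sitcom_step(x_t, t, eps_model, A, y, alpha_bars, K=20, lr=1e-2, lam=1.0):
    ab_t = alpha_bars[t]
    v = x_t.clone().requires_grad_(True)           # optimization variable: DM input
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(K):
        x0_hat = (v - (1 - ab_t).sqrt() * eps_model(v, t)) / ab_t.sqrt()   # (S2)
        loss = ((A(x0_hat) - y) ** 2).sum() + lam * ((v - x_t) ** 2).sum()  # (S1)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        x0_hat = (v - (1 - ab_t).sqrt() * eps_model(v, t)) / ab_t.sqrt()
        ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
        return ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * torch.randn_like(x0_hat)
```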
Overall, we observe similar trends to those discussed in Section 5.3 for Table 8.1. On the FFHQ dataset, SITCOM achieves higher average PSNR values compared to the baselines across all tasks, with improvements exceeding 1 dB in 5 out of 8 tasks. For the ImageNet dataset, we observe more than 1 dB improvement on the non-linear deblurring task, while for the remaining tasks, the improvement is less than 1 dB, except for Gaussian deblurring (where SITCOM underperforms by 0.22 dB) and phase retrieval (underperforming by 0.36 dB).

FFHQ:
Task                     Method   PSNR (↑)      SSIM (↑)       LPIPS (↓)       Run-time (↓)
Super Resolution 4×      DPS      25.20±1.22    0.806±0.044    0.242±0.102     1.31±0.44
                         DAPS     29.6±0.67     0.871±0.034    0.132±0.088     1.24±0.43
                         DDNM     28.82±0.67    0.851±0.043    0.188±0.13      1.07±0.35
                         Ours     30.95±0.89    0.872±0.045    0.137±0.046     0.50±0.34
Box In-Painting          DPS      23.56±0.78    0.762±0.034    0.191±0.087     1.52±0.43
                         DAPS     24.41±0.67    0.791±0.034    0.129±0.067     1.33±0.42
                         DDNM     24.67±0.067   0.788±0.024    0.229±0.055     1.02±0.42
                         Ours     24.97±0.55    0.804±0.045    0.118±0.022     0.37±0.34
Random In-Painting       DPS      28.77±0.56    0.847±0.034    0.191±0.023     1.55±0.34
                         DAPS     31.56±0.45    0.905±0.013    0.094±0.012     1.42±0.45
                         DDNM     30.56±0.56    0.902±0.013    0.116±0.023     1.25±0.42
                         Ours     33.02±0.44    0.919±0.012    0.0912±0.013    0.47±0.34
Gaussian Deblurring      DPS      25.78±0.68    0.831±0.034    0.202±0.014     1.33±0.44
                         DAPS     29.67±0.45    0.889±0.045    0.163±0.033     2.15±0.37
                         DDNM     28.56±0.45    0.872±0.024    0.211±0.034     1.24±0.34
                         Ours     32.12±0.34    0.913±0.024    0.139±0.045     0.45±0.25
Motion Deblurring        DPS      23.78±0.78    0.742±0.042    0.265±0.024     1.65±0.34
                         DAPS     30.78±0.56    0.892±0.034    0.146±0.023     1.44±0.34
                         Ours     32.34±0.44    0.908±0.028    0.135±0.028     0.52±0.34
Phase Retrieval          DPS      17.56±2.15    0.681±0.056    0.392±0.021     1.52±0.42
                         DAPS     31.45±2.78    0.909±0.035    0.109±0.044     1.85±0.32
                         Ours     31.88±2.89    0.921±0.067    0.102±0.078     0.54±0.45
Non-Uniform Deblurring   DPS      23.78±2.23    0.761±0.051    0.269±0.064     1.56±0.45
                         DAPS     28.89±1.67    0.845±0.057    0.150±0.056     1.41±0.37
                         Ours     31.09±0.89    0.911±0.056    0.132±0.45      0.56±0.37
High Dynamic Range       DPS      23.33±1.34    0.734±0.049    0.251±0.078     1.34±0.42
                         DAPS     27.58±0.829   0.828±0.00     0.161±0.067     1.26±0.44
                         Ours     28.52±0.89    0.844±0.045    0.148±0.035     0.51±0.42

ImageNet:
Task                     Method   PSNR (↑)      SSIM (↑)       LPIPS (↓)       Run-time (↓)
Super Resolution 4×      DPS      24.45±0.89    0.792±0.052    0.331±0.089     2.33±0.40
                         DAPS     25.98±0.74    0.794±0.09     0.234±0.089     2.10±1.02
                         DDNM     24.67±0.78    0.771±0.06     0.432±0.34      1.38±0.55
                         Ours     26.89±0.86    0.802±0.057    0.224±0.056     1.34±0.45
Box In-Painting          DPS      20.22±0.67    0.69±0.034     0.297±0.077     1.55±0.44
                         DAPS     21.79±0.34    0.734±0.045    0.214±0.034     2.44±0.34
                         DDNM     21.99±0.54    0.737±0.034    0.315±0.022     1.42±0.45
                         Ours     22.23±0.44    0.745±0.034    0.208±0.023     1.23±0.44
Random In-Painting       DPS      24.57±0.45    0.775±0.023    0.318±0.26      2.12±0.30
                         DAPS     28.86±0.67    0.877±0.021    0.131±0.044     2.01±0.34
                         DDNM     30.12±0.45    0.917±0.012    0.124±0.032     1.89±0.23
                         Ours     30.67±0.45    0.918±0.013    0.118±0.012     1.40±0.34
Gaussian Deblurring      DPS      22.45±0.42    0.778±0.067    0.344±0.041     2.12±0.44
                         DAPS     26.34±0.55    0.836±0.034    0.244±0.023     2.22±0.43
                         DDNM     28.44±0.021   0.882±0.021    0.267±0.00      1.76±0.33
                         Ours     28.22±0.45    0.891±0.014    0.216±0.021     1.34±0.25
Motion Deblurring        DPS      22.33±0.727   0.726±0.034    0.352±0.00      2.21±0.40
                         DAPS     28.24±0.62    0.867±0.023    0.191±0.017     2.12±0.44
                         Ours     29.12±0.38    0.882±0.025    0.182±0.025     1.45±0.31
Phase Retrieval          DPS      16.77±1.78    0.651±0.076    0.442±0.037     2.18±0.38
                         DAPS     26.12±2.12    0.802±0.023    0.247±0.034     2.32±0.35
                         Ours     25.76±1.78    0.813±0.032    0.238±0.067     1.31±0.45
Non-Uniform Deblurring   DPS      22.97±1.57    0.781±0.023    0.302±0.089     2.34±0.44
                         DAPS     28.02±1.15    0.831±0.082    0.162±0.034     2.23±0.56
                         Ours     29.56±0.78    0.844±0.045    0.147±0.042     1.34±0.44
High Dynamic Range       DPS      19.67±0.056   0.693±0.034    0.498±0.112     2.34±0.41
                         DAPS     26.71±0.088   0.802±0.032    0.172±0.066     2.12±0.32
                         Ours     27.56±0.78    0.825±0.037    0.162±0.046     1.45±0.41

Table D.1 Average PSNR, SSIM, LPIPS, and run-time (minutes) of SITCOM and baselines using 100 test images from FFHQ and 100 test images from ImageNet with a measurement noise level of σ_y = 0.01. The first five tasks are linear, while the last three tasks are non-linear. For each task and dataset combination, the best results are bolded and the second-best results are underlined in the original typeset table. Values after ± represent the standard deviation. All results were obtained using a single RTX5000 GPU machine. For phase retrieval, the run-time is reported for the best result out of four independent runs; this applies to SITCOM and all baselines.
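The exact evaluation script behind these tables is not reproduced here; the snippet below is one plausible way to compute the reported PSNR and LPIPS values, using the public `lpips` package with its common AlexNet backbone. The random placeholder images and the [0, 1] intensity range are assumptions for illustration.

```python
import numpy as np
import torch
import lpips  # pip install lpips

def psnr(x, ref, max_val=1.0):
    # Peak signal-to-noise ratio in dB for images in [0, max_val].
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# Hypothetical stand-ins for a restored image and its ground truth in [0, 1].
restored = np.random.rand(256, 256, 3).astype(np.float32)
reference = np.random.rand(256, 256, 3).astype(np.float32)

# LPIPS expects NCHW tensors scaled to [-1, 1].
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0) * 2 - 1
loss_fn = lpips.LPIPS(net='alex')  # AlexNet backbone, the common default
lpips_val = loss_fn(to_tensor(restored), to_tensor(reference)).item()

print(f"PSNR = {psnr(restored, reference):.2f} dB, LPIPS = {lpips_val:.3f}")
```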
In terms of run-time, SITCOM generally outperforms DDNM, DPS, and DAPS, with all methods evaluated on a single RTX5000 GPU. For the FFHQ dataset, SITCOM is at least twice as fast as the baselines. On ImageNet, SITCOM consistently requires much less run-time than DPS and DAPS. When compared to DDNM, SITCOM's run-time is similar or slightly lower. For example, on the super-resolution task, SITCOM and DDNM have similar average run-times (1.34 and 1.38 minutes, respectively), yet SITCOM achieves over a 2 dB PSNR improvement.

In Table D.2, we report the average PSNR and LPIPS results using three more baselines: Denoising Diffusion Restoration Models (DDRM) (Kawar et al., 2022), Plug-and-Play (PnP) ADMM (Chan et al., 2016) (a non-diffusion-based solver), and Regularization by Denoising with Diffusion (RED-Diff) (Mardani et al., 2023).

Task                     Method                          FFHQ PSNR (↑)  FFHQ LPIPS (↓)  ImageNet PSNR (↑)  ImageNet LPIPS (↓)
Super Resolution 4×      DDRM (Kawar et al., 2022)       27.65          0.210           25.21              0.284
                         PnP-ADMM (Chan et al., 2016)    23.48          0.725           22.18              0.724
                         SITCOM (ours)                   30.68          0.142           26.35              0.232
Box In-Painting          DDRM (Kawar et al., 2022)       22.37          0.159           19.45              0.229
                         PnP-ADMM (Chan et al., 2016)    13.39          0.775           12.61              0.702
                         SITCOM (ours)                   24.68          0.121           21.88              0.214
Random In-Painting       DDRM (Kawar et al., 2022)       25.75          0.218           23.23              0.325
                         PnP-ADMM (Chan et al., 2016)    20.94          0.724           20.03              0.680
                         SITCOM (ours)                   32.05          0.095           29.60              0.127
Gaussian Deblurring      DDRM (Kawar et al., 2022)       23.36          0.236           23.86              0.341
                         PnP-ADMM (Chan et al., 2016)    21.31          0.751           20.47              0.729
                         SITCOM (ours)                   30.25          0.235           27.40              0.236
Motion Deblurring        PnP-ADMM (Chan et al., 2016)    23.40          0.703           24.23              0.684
                         SITCOM (ours)                   30.34          0.148           28.65              0.189
Phase Retrieval          RED-Diff (Mardani et al., 2023) 15.60          0.596           14.98              0.536
                         SITCOM (ours)                   30.97          0.112           25.45              0.246
Non-Uniform Deblurring   RED-Diff (Mardani et al., 2023) 30.86          0.160           30.07              0.211
                         SITCOM (ours)                   30.12          0.145           28.78              0.160
High Dynamic Range       RED-Diff (Mardani et al., 2023) 22.16          0.258           22.03              0.274
                         SITCOM (ours)                   27.98          0.158           26.97              0.167

Table D.2 Average PSNR and LPIPS results of our method and other baselines over 100 FFHQ and 100 ImageNet test images. The measurement noise setting is σ_y = 0.05. The results of DDRM and PnP-ADMM (resp. RED-Diff) are sourced from Tables 1 and 3 (resp. 2 and 4) in (Zhang et al., 2024a). The remaining results are as given in Table 8.1 of Section 5.3.

The results of DDRM, PnP-ADMM, and RED-Diff are sourced from (Zhang et al., 2024a). DDRM and PnP-ADMM report results for the linear tasks, whereas RED-Diff is used for the non-linear tasks. The results of SITCOM are as reported in Table 8.1. When compared to DDRM and PnP-ADMM, SITCOM demonstrates notable improvements in both PSNR and LPIPS across all tasks and datasets. For instance, SITCOM achieves over a 5 dB improvement in random in-painting on both datasets. Compared to RED-Diff on phase retrieval, SITCOM outperforms by more than 15 dB on FFHQ and more than 10 dB on ImageNet. A similar trend is observed in the High Dynamic Range task.
For non-linear non-uniform deblurring, although SITCOM performs better in terms of LPIPS, it reports approximately 1 dB (FFHQ) and 2 dB (ImageNet) less PSNR than RED-Diff, all without requiring external denoisers.

D.6 Ablation Studies

D.6.1 Effect of the Number of Optimization Steps K and the Number of Sampling Steps N

In this subsection, we perform an ablation study on the number of optimization steps, K, and the number of sampling steps, N. Specifically, for the tasks of Super Resolution, Motion Deblurring, Random In-painting, and High Dynamic Range, we run SITCOM using combinations from N ∈ {10, 20, 30} and K ∈ {20, 30, 40}. The average PSNR results over 20 test images from the FFHQ dataset are presented in Table D.3. As shown, for the first three tasks, SITCOM consistently achieves strong PSNR scores across all (N, K) pairs, demonstrating that its performance is not very sensitive to variations in (N, K) within these ranges, as the results vary by roughly 1 dB. For the High Dynamic Range task, we observe that the best results are obtained with (N, K) = (20, 40).  The selected (N, K) values for our main results are listed in Table D.5 of Appendix D.6.4.

(N, K)                   (10,20)  (10,30)  (10,40)  (20,20)  (20,30)  (20,40)  (30,20)  (30,30)  (30,40)
Super Resolution 4×      29.654   29.771   29.815   29.913   29.952   29.961   30.009   30.027   30.033
Motion Deblurring        29.976   30.820   31.264   31.259   31.380   30.452   31.282   30.624   30.438
Random Inpainting        33.428   34.444   34.699   34.546   34.558   34.574   34.619   34.634   34.639
High Dynamic Range       25.902   26.957   27.873   26.290   27.874   27.171   27.127   27.104   26.806

Table D.3 Effect of the number of sampling steps (N) and optimization steps per sampling iteration (K) on the tasks listed in the first column for SITCOM. The reported PSNR values are averaged over 20 FFHQ test images.

D.6.2 Effect of the Regularization Parameter λ

In this subsection, we perform an ablation study to assess the impact of the regularization parameter, λ, in SITCOM. Table D.4 shows the results across four tasks using various λ values. Aside from phase retrieval, the effect of λ is minimal. We hypothesize that initializing the optimization variable in (S1) with x_t is sufficient to enforce forward diffusion consistency in C3. Therefore, we set λ = 1 for phase retrieval and λ = 0 for the other tasks. Additionally, for all tasks other than phase retrieval, we observed that when λ = 0, the restored images exhibit enhanced high-frequency details. For visual examples, see the results of λ = 0 versus λ = 1 in Figure D.3.

λ                        0        0.05     0.5      1        1.5
Super Resolution 4×      29.952   29.968   29.464   29.550   29.288
Motion Deblurring        31.380   31.393   31.429   31.382   31.150
Random Inpainting        34.559   34.537   34.523   34.500   34.301
Phase Retrieval          31.678   31.892   32.221   32.342   32.124

Table D.4 Ablation study on the impact of the regularization parameter λ (average PSNR in dB).

Figure D.3 Results of running SITCOM using different regularization parameters in (S1) for the task of Motion Deblurring.

D.6.3 Impact of the Stopping Criterion for Noisy Measurements

In this subsection, we demonstrate the impact of applying the stopping criterion in SITCOM when handling measurement noise. For the tasks of super resolution and motion deblurring, we run SITCOM with and without the stopping criterion for the case of σ_y = 0.05. The results are presented in Figure D.4. As shown, for both tasks, using the stopping criterion (i.e., δ > 0) not only improves PSNR values compared to the case of δ = 0, but also visually reduces additive noise in the restored images. This is because, without the stopping criterion, the measurement consistency enforced by the optimization in (S1) tends to fit the noise in the measurements.
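Concretely, the rule can be read as: terminate the (S1) iterations once the measurement residual drops to the expected noise floor E‖n‖₂ ≈ σ_y√m, with δ chosen slightly above it (cf. Table D.5). The sketch below is illustrative rather than our exact implementation; for readability it optimizes a generic variable, while in SITCOM the same check is applied while optimizing the DM input, as in the sketch of Appendix D.3.

```python
import math
import torch

def measurement_consistency_with_stopping(forward_op, y, x0_var, K=30,
                                          lr=1e-2, m=256 * 256 * 3):
    # Run at most K gradient steps on ||A(x0) - y||^2, but stop early once
    # the residual norm reaches delta, chosen slightly above the expected
    # noise norm sigma_y * sqrt(m); Table D.5 uses, e.g., delta =
    # 0.051 * sqrt(m) for sigma_y = 0.05. Iterating past this point mostly
    # fits the measurement noise. `x0_var` must be a leaf tensor with
    # requires_grad=True.
    delta = 0.051 * math.sqrt(m)  # stopping threshold for sigma_y = 0.05
    opt = torch.optim.Adam([x0_var], lr=lr)
    for _ in range(K):
        residual = forward_op(x0_var) - y
        if residual.detach().norm().item() <= delta:
            break  # noise floor reached: stop early
        opt.zero_grad()
        (residual ** 2).sum().backward()
        opt.step()
    return x0_var.detach()
```

With m = 256 × 256 × 3, this gives δ = 0.051√m ≈ 22.6 for σ_y = 0.05 (and 0.011√m ≈ 4.9 for σ_y = 0.01), matching the entries in Table D.5.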
Figure D.4 Impact of the stopping criterion in preventing noise overfitting. For the rightmost column, δ is set as in Table D.5.

D.6.4 Complete List of Hyper-parameters in SITCOM

Table D.5 summarizes the hyper-parameters used for each task in our experiments, as determined by the ablation studies in the previous subsections. Notably, the same set of hyper-parameters is applied to both the FFHQ and ImageNet datasets.

Task                     Sampling Steps N   Optimization Steps K   Regularization λ   Stopping criterion δ for σ_y ∈ {0.05, 0.01}
Super Resolution 4×      20                 20                     0                  {0.051√m_SR, 0.011√m_SR}
Box In-Painting          20                 20                     0                  {0.051√m, 0.011√m}
Random In-Painting       20                 30                     0                  {0.051√m, 0.011√m}
Gaussian Deblurring      20                 30                     0                  {0.051√m, 0.011√m}
Motion Deblurring        20                 30                     0                  {0.051√m, 0.011√m}
Phase Retrieval          20                 30                     1                  {0.051√m_PR, 0.011√m_PR}
Non-Uniform Deblurring   20                 30                     0                  {0.051√m, 0.011√m}
High Dynamic Range       20                 40                     0                  {0.051√m, 0.011√m}

Table D.5 Hyper-parameters of SITCOM for every task considered in this paper. The same set of hyper-parameters is used for FFHQ and ImageNet. The learning rate in Algorithm 8.1 is set to γ = 0.01 for all tasks, datasets, and measurement noise levels. For the stopping criterion column, m_SR = 64 × 64 × 3, m = 256 × 256 × 3, and m_PR = 384 × 384 × 3.

D.7 Detailed Implementation of Tasks and Baselines

The forward models of all tasks are adopted from DPS. We refer the reader to Appendix B of (Chung et al., 2023b) for details. For the baselines, we used the code provided by the authors of each paper: DPS, DDNM, DAPS, and DCDP. Default configurations are used for each task.

D.8 Qualitative Results

Figure D.5 presents results with SITCOM, DPS, and DAPS using ImageNet. See also Figure D.6, Figure D.7, Figure D.8, and Figure D.9 for more images.

Figure D.5 Qualitative results on the ImageNet dataset for five linear tasks and three non-linear tasks under measurement noise of σ_y = 0.05. The PSNR and LPIPS values are given below each restored image.

Figure D.6 Super resolution (left) and box inpainting (right) results. First (resp. last) three rows are for the FFHQ (resp. ImageNet) dataset.
Figure D.7 Motion deblurring (left) and Gaussian deblurring (right) results. First (resp. last) three rows are for the FFHQ (resp. ImageNet) dataset.

Figure D.8 Random inpainting (left) and non-linear (non-uniform) deblurring (right) results. First (resp. last) three rows are for the FFHQ (resp. ImageNet) dataset.

Figure D.9 Phase retrieval (left) and high dynamic range (right) results. First (resp. last) three rows are for the FFHQ (resp. ImageNet) dataset.