z . .5.
mail"
.u .__ . a

l.-
l

.2 .

Ewim a.»

X r.
.; 5
14!. . . .5‘ A .3
am“):

{.1

IS» . (v 3:?

a .Eyemx

v : a
w... . 4. «Sn. L
. 5.3,."
am

E

‘
a.

y.
..

M5,
L_.

5 ‘ . . ,
4.? mm": . ‘ m mug.
5 7ki¥

ﬁg

:3“?

mi
. V“

‘2 4 A
,_.m.:w.
.2 «‘1

-
1.5.

 

 

THESIS

2 cm

This is to certify that the
dissertation entitled

MULTISCALE MODELING AND ESTIMATION OF POISSON PROCESSES
WITH APPLICATIONS TO EMISSION COMPUTED TOMOGRAPHY

presented by

Klaus E. Timmermann

has been accepted towards fulﬁllment
of the requirements for

Ph.D degree in Electrical Eng

MajoEprofesso\

 

Dategl/Z 2’;/00

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0- 12771

 

 

 

LIBRARY

Michigan State

University

 

 

PLACE lN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

moo mm."

 

MULTISCALE MODELING AND ESTIMATION
OF POISSON PROCESSES WITH APPLICATIONS
TO EMISSION COMPUTED TOMOGRAPHY

By

Klaus Edmond Timmermann

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering

2000

ABSTRACT

MULTISCALE MODELING AND ESTIMATION
OF POISSON PROCESSES WITH APPLICATIONS
TO EMISSION COMPUTED TOMOGRAPHY

By

Klaus Edmond T immermann

Many important problems in engineering and science are well-modeled by Pois-
son processes, and in many instances it is of great interest to accurately estimate
the intensities underlying the observed Poisson data. This dissertation addresses the
problem of general Poisson process estimation, but the work is primarily motivated
by the photon-limited imaging problem. First, a Bayesian approach to Poisson in-
tensity estimation based on a multiscale framework is presented. It is shown that the
multiscale representation of signals provides a very natural and powerful framework
for this problem. Using this framework, a novel multiscale Bayesian prior to model
intensity functions is devised. The behavior of the new model is characterized by a
study of its correlation properties. The new prior leads to a simple, Bayesian intensity
estimation procedure. Practical fast shift-invariant algorithms for the new estimation
framework are presented and applied to photon-limited data.

We extend the modeling and estimation approaches for general Poisson processes
to the emission computed tomography (ECT) image reconstruction problem. Two

multiscale approaches are introduced. The ﬁrst approach is based on the geometrical

 

 

 

 

 

r. ' (ﬁlm i
put)” . . .

‘ . l
m. we «.

r... l

ETAIIIGII‘ fl

[Uln‘if'iV L"

¢

II
“viral-la

“A '.

run 'T" 7"“

\r v

.1.) .."..
.

7‘"'("‘ tr“
A“ ‘,"..‘._‘ ,li

Thi- .V

I0 3 Vif‘

3 ll
, YA.

ﬁl yl‘ -v -
‘hA\h.‘\,l;

‘9 OH}
in _

“4.3;‘i‘ l

H.101;—

:I PC";
a» I,
A:

9'

properties of the so-called natural pixels of the intensity image. Within this frame—
work, we develop a practical prior model for the sinogram image, which is used to
estimate the underlying sinogram intensity from the raw projection data prior re-
construction. The sinogram estimate is then used in conjunction with the standard
filtered-backprojection algorithm to produce an improved image reconstruction. The
superiority of the new approach over the conventional ﬁltered-backprojection based
reconstruction is illustrated with clinical and simulated data.

The second, more sophisticated'approach to ECT is based on a new multiscale-
based Radon inverse transform. It is shown that the new formulation lends itself
to a very natural discretization of the Radon inverse operator which is especially
amenable to numerical computation. Within this new reconstruction framework,
we develop a prior model for the cumulative sinogram image ——a new intermediate
image representation between a sinogram and the intensity image. The new model is
unique in that it exploits the high degree of redundancy of information present in the
sinogram.We demonstrate the superiority of the proposed method using real data.

As the multiscale framework to modeling and estimation represents the founda-
tion of the methods presented here, we introduce a new approach to characterizing
multiscale models and estimators in order to assess their qualities. This approach is
shown to be more general in nature than classical statistical descriptors ( e. g., bias-
ness, mean-square error, etc.) which are based on limited information. Towards this
end, we introduce the information-theoretic deﬁnitions of anomy, accuracy precision,
and resolution power. Based on criteria developed with these concepts, we explore
the advantages of multiscale Bayesian modeling and motivate its application to the

estimation problem.

COpyright © by
Klaus Edmond Timmermann

2000

 

 

T
“Tm ( C‘

In memory of my Father,
who instilled in me my love for nature,
who taught me the values for order, simplicity and social responsibility,
and who got me started in sciences.
To my Mother,
who nourished the spiritual side in my life,
and who taught me the importance of family and true friendship.
To my beautiful wife,
who has been there for me in times of trial,
and in every other moment to enjoy life together.
To my children, Karl and Derek,

who make it all worth it,

and who remind me day to day that I am not. there yet.

 

 

   

We dun ii
others for thr- :.
wish to Illahii I.
rife than all W

First. I w:
time. help and a
{hark my adv}
defined to :1.

t

. .
in n‘v
artisan?"

and he
patience as he
and Dr. Hat‘fil
ire: weigh ”we:
A150. my g
apported and
Jim x3115. .lr..
Wily and £11.:

FA

Es;-;.;uza. Azxi

r»- ‘ T 0

dr‘ .,,' i ‘ .

w .a.‘ TIL‘)
last that

ACKNOWLEDGMENTS

We often go about our day-to-day business of life not realizing how much we owe
others for the many good things that come to us. It is only in the rare occasions we
wish to thank them in writing that we come to realize the many they are and that.
cite them all we cannot. Below is my short list.

First, I would like to thank the members of my guidance committee for their
time, help and advice during the various stages of this long endeavor. In particular, I
thank my advisor, Dr. Robert Nowak for the many hours of technical discussions he
dedicated to me during our years of association; Dr. John Deller for his friendliness
and helpfulness throughout. these years; Dr. Michael Frazier for his kindness and
patience as he introduced me to the fascinating areas of real and multiscale analysis;
and Dr. Hassan Khalil for his openness and warmth during an interview which would
later weigh heavily in my decision to attend Michigan State.

Also, my gratitude and thanks go to the following very special people who
supported and encouraged me during my years of study: Jim and Lois Mills,
Jim Mills, Jr., Sue Vollmar, Ken and Jan Dart, and my very dear brother and sister
Willy and Emmy, and their spouses, Maria Esther Timmermann and Jesus Alonso
Espinoza. And to those whom I owe the most: my wife, Judy and parents, Leonor
and Guillermo Timmermann, I thank from the deepest of my heart.

I also thank the National Science Foundation for its very generous ﬁnancial sup-

port without which I would not have been able to pursue this higher academic degree.

vi

 

 

 

l IDIIOdUCIlL
1.1 OPRW

N‘IIA.£

N

The Multi:
'2.l Prelim

9') “Hip”
"- - 1.2“!

.....

o»:

co
dr-

go
x1; 0
9—4

bx.
Q"
b~

TABLE OF CONTENTS

1 Introduction

1.1 Organization and Summary of Contributions ..............

The Multiscale Modeling Paradigm

2.1 Preliminaries ...............................

2.2 Multiresolution through Wavelets ....................
2.2.1 The Fourier Transform and Scales ................
2.2.2 Wavelet Representation of Signals ................

2.3 Multiscale Modeling and Attributes ...................
2.3.1 The Components of a Model and their Notation ........
2.3.2 Anomy, Accuracy, Precision, and Resolution Power ......
2.3.3 Two Illustrative Examples ....................

2.4 The Multiscale Modeling and Estimation Advantage ..........
2.4.1 A/ P Models and their Properties ................
2.4.2 Bayesian Multiscale Models and Estimation ..........

2.5 Other Multiscale Modeling and Estimation Approaches ........
2.5.1 Threshold Smoothing Methods .................

Multiscale Modeling and Estimation of Poisson Processes

3.1 Preliminaries ...............................

3.2 Notation ..................................

3.3 Why the Unnormalized Haar Transform? ................

3.4 A New Probability Model for Intensity Images .............
3.4.1 Multiscale Signal Model Framework ...............
3.4.2 Multiscale Multiplicative Innovations Model ..........
3.4.3 Prior Distribution for Innovations ................

3.5 Estimation .................................
3.5.1 Bayesian Multiscale Intensity Estimator ............
3.5.2 Selection and Analysis of the Beta-Mixture Prior .......
3.5.3 Estimation of Prior Parameters .................

vii

7

7
10
10
13
27
27
30
42
47
48
56
63
63

7O
71
72
74
75
75
78
81
84
84

3.5.4 Optimal Estimation from Large Ensembles ........... 91

3.5.5 Example of Estimation from a Single Observation ....... 93
3.6 Stationary Intensity Models and Estimators .............. 94
3.6.1 A Shift-Invariant MMI Model .................. 95
3.6.2 A Fast Shift-Invariant MMI Estimator ............. 97
3.6.3 Autocorrelation Functions of MMI and SI-MMI Models . . . . 100
3.6.4 SI-MMI Model and 1 / f Processes ................ 105
3.7 Numerical Comparison of Wavelet-Based Intensity Estimators . . . . 107
3.8 Application to Photon-Limited Imaging ................. 109
3.8.1 Photon—Limited Imaging Simulation ............... 111
3.8.2 Application to Nuclear Medicine Imaging ............ 112
4 Emission Computed Tomography 116
4.1 Preliminaries ............................... 116
4.2 The Image Reconstruction Problem ................... 119
4.3 The Filtered-Backprojection Reconstruction Technique ........ 124
4.4 Some Limiting Aspects of Conventional Reconstruction Methods . . . 125
4.5 First Approach to Multiscale Modeling and Estimation of ECT Intensitie8127
4.5.1 The SI-MMI Method ....................... 128
4.5.2 Example .............................. 130
4.6 A New Multiscale-Based Tomographic Inversion Method ....... 134
4.6.1 The Multiscale Reconstruction Formula ............. 134
4.6.2 Computation of 52k and A610 ................... 143
4.6.3 A Fast Multiscale Radon-Inverse Transform Algorithm . . . . 145

4.7 Second Approach to Multiscale Modeling and Estimation of ECT In-
tensities .................................. 148
4.7.1 The CSI—MMI Method ...................... 149
4.7.2 Example .............................. 151
5 Conclusions 154
A Appendix to Chapter 2 158
A.1 Proof of Expression (2.2) ......................... 158
A2 On the Monotonicity of Ip(X; A) .................... 159
A3 An Alternate Interpratation for Precision ................ 161
A4 Only an MA Model has Anomy of Zero ................. 163
A5 Equivalence of Anomy Deﬁnitions .................... 163
A6 Proof of Expression (2.35) ........................ 165

viii

 

 

1.? A 8
AS C0111

   
  

B Appendix
Bl Pm"...
8.? 0pm

C Appendix
Cl The H
C2 Pltmfi

BIBLIOGRA

A7 A Sufﬁcient Condition for the A/ P Accuracy Inequality ........ 167

A8 Coarse-Scale-Data Limited Models ................... 171
B Appendix to Chapter 3 173
B.1 Posterior Distributions .......................... 173
B2 Optimal Estimation of the Multiplicative Innovation .......... 177
C Appendix to Chapter 4 179
C.1 The Hilbert Transform as a Continuous Averaging Process ...... 179
C.2 Proof of Expression (4.24) ........................ 180
BIBLIOGRAPHY 184

ix

3.1

3.2

LIST OF TABLES

AMSE results for various test intensities and estimation algorithms.

Peak intensity = 8. ............................ 109
AMSE results for various test intensities and estimation algorithms.
Peak intensity = 128 ............................ 109

 

 

 

 

 

 

 

gv
:‘c
~r~

lg
to
.
/
._.
._:
.___
_.
A
1

[Q a.)

.L. 0-3
f‘
" .
:x
,4
._

CO no
is '1
c F“
_ 'I‘

l r

1 I .

.1 1 \'
‘ C

O u

2.1

2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9

3.10
3.11
4.1
4.2
4.3
4.4
4.5
4.6

LIST OF FIGURES

The Haar and Daubechies 4-pt scaling and wavelet functions in the

interval ................................... 16
M ultiresolution analysis expressed as a sequence of direct sums . . . . 17
Wavelet-based Signal Processing ..................... 21
Time-frequency plane ........................... 22
M ultiscale Signal Analysis ........................ 24
M ultiscale representation of images ................... 26
Components of a Model of a Process .................. 29
The various probability densities deﬁning accuracy .......... 35
Multiscale representation of a 1-d model of size N = 8 ........ 49
Method for obtaining consecutively increasing EPRs p? +1, p; +1, . . . for

a Gaussian model ............................. 54
M ultiscale scaling coefﬁcients {CM} ................... 74
Structure of a Hear-based intensity estimator ............. 77
Histogram of perturbation variates (6 = B/A) .............. 78
MMI model interpreted as a probabilistic tree ............. 81
Three component Beta-mixture distribution .............. 84
Perturbation estimate ii as a function of d/c .............. 89
Poisson intensity estimation ....................... 96
Fast Poisson intensity estimation .................... 100
Correlation functions for MMI and SI-MMI 256-point (J = 8) intensity

priors .................................... 104
Photon-limited image estimation using MMI models .......... 113
Nuclear medicine image estimation using SI-MMI models ....... 115
Tomographic data collection geometry ................. 119
Projection geometry for the intensity f (x) ............... 121
Shepp—Logan head phantom ....................... 128
Natural Pixels ............................... 129
Shepp-Logan head phantom image reconstruction ........... 133
Pelvic bone study image reconstruction ................. 134

xi

i4“ 44“ 141. C

4.7
4.8
4.9
A.1
C.1

 

 

Wavelet functions for image reconstruction ............... 141

Tomography reconstruction-formula coeﬂicients ............ 149

Second pelvic bone study image reconstruction ............. 154

Simpliﬁed geometric representation of EPR cubes at consecutive scales 170
2_1r 9-1

Magnitude frequency content of (lfrizégi’gng .............. 183

xii

 

 

CHA

 

Introd

.1 treat Elwin

Prism pm w
gives I159 TO Il
is EHCO‘JIﬁPWl
3 and ne‘pp

{6115‘

r: of a 1*:

WE O‘DSPR‘E (in

and wish to e

onetime:

i ' r
151‘)“

'.

A.

I ,
«Lf‘

"Elan. In .

PET " n“ '

J1 €4)sd4&¢:/A
‘

“35 A PIJISVTI; i

l. P: (OIIlDdf'i T‘

in. ,V 1.
» .‘ v. .
N' f ‘7'? v
t ' ' (5(A..“‘;-.‘

m

,. , .
N“);

CHAPTER 1

Introduction

A great number of important phenomena in science and engineering are modeled as
Poisson processes. Often, it is of interest to estimate the underlying intensity which
gives rise to these phenomena. The intensity estimation problem of Poisson processes
is encountered in many ﬁelds including medicine [1], astronomy [2], communications
[3], and networks [4]. This dissertation considers the problem of estimating the in-
tensity of a general Poisson process from a single observation of the process. That is,

we observe counts c which obey1
CIA ~ Poisson()\), (1.1)

and wish to estimate the intensity A. The counts 0 and intensity A are typically

one-dimensional (1-d) signals or 2-d images, but may be of any other higher ﬁnite
dimension, in general.

For example, the basic photon-limited imaging process is widely regarded as obey-

ing a Poisson law. The problem is described as follows. We observe photon emissions

in a compact region of the plane. The photon emissions are the result of an underly-

ing two-dimensional intensity function. We are interested in estimating the intensity

 

13M denotes the random vector c given the vector A.

 

 

function if“

  
   

pilt'ciilt iLIIIIli. I
liiiiutioii ui .
its. due in p.
said): Baas
posa-ﬁl (ring ..
inﬁnity 3
hi this Iii»

ma 'lijf .MN‘Il I

of signals [39 w

ill

g this 113111“
is devised. \Vc
expression for i
and shown :0
UPI " "iii B11725

invariant alzuri

{ODEIIICJIII III).
.35 a seroziii

MC ‘féiimci I311

El 15 an '* *

Ml)

' I‘
l" h v‘ _ i \
.- I'LIIIIUIIEII ll

am “ UH [(I'Ir 3‘.

[I rddjr,é(o:,.. .
" ~ -a\‘J I14.

c, I =
IH'l'v .
».d.1(;‘;e<

rr vl.,
A W Adrift"

0‘19 if: ')i, I“

I."

7-0" Twp,“ '
uh; . , V ~.
1“ .4 (J.iili__

function from the counts of photon detections. Nuclear medicine imaging is one ap—
plication that motivates our study of the photon-limited imaging problem. The major
limitation of nuclear medicine imaging is the low-count levels acquired in typical stud-
ies, due in part to the limited level of radioactive dosage required to insure patient
safety. Because of the variability of low-count images, it is very common to employ a
post-ﬁltering or estimation procedure to obtain a “better” estimate of the underlying
intensity [5].

In this dissertation we introduce a Bayesian approach to Poisson intensity esti-
mation based on a multiscale framework. It is shown that multiscale representation
of signals provides a very natural and powerful framework for this problem. Us-
ing this framework, a novel multiscale Bayesian prior to model intensity functions
is devised. We look at the nature of the proposed prior by deriving a closed-form
expression for its autocorrelation. Some extensions to the basic model are developed
and shown to possess very desirable properties. With these new priors, a simple,
optimal Bayesian intensity estimation procedure is developed. A practical fast shift-
invariant algorithm for the new estimation framework is also presented and applied
to photon-limited data.

As a second contribution of this dissertation, we extend the above general modeling
and estimation approaches to the emission computed tomography imaging problem.
ECT is an important and very active area of research in many ﬁelds. For example,
in functional neuroimaging, ECT is used to map regions of activity in the brain
associated with physical [6] and intellectual [7] tasks; in nuclear waste management,

emission tomographic methods may be employed to determine the activity density
of radioactive material within cemented barrels [8]; and in electron microscopy, ECT
tecbniques make possible 3-d image reconstruction of chromosomes’ structures [9].

One important diagnosis method in nuclear medicine is single-photon emission

computed tomography (SPECT). Essentially, the process aims to reconstruct. density

 

 

 

   
 
  
  
   

maps iiinae
daia I‘Iiiit‘t'Ir
well-iiiotieieri
imaging of 11'
levels and. Iiu
porianee to no
sions of tom 171;:
However. maxi
other ECT op;

This dissvr‘
suspicion p
so called noiu
the sinogram i
Slhograzri inia;
the raw projei
confine-tion w
All llllpigyed I

r? . . -

maps (images) of radiopharmaceutical distribution within a patient from projection
data collected at many angles about the subject. Projection data in SPECT are
well-modeled to be the outcome of Poisson processes, and just as in the case of static
imaging of nuclear medicine discussed above, they are characterized by low-count
levels and, therefore, by low signal-to—noise ratios (SNR). Motivated by its great im-
portance to medicine, and by the challenge posed by its low-count nature, our discus-
sions of tomography focuses throughout on the SPECT problem of nuclear medicine.
However, many results obtained here are directly applicable or easily extendable to
other ECT applications.

This dissertation presents two new multiscale approaches to the ECT image re-
construction problem. The ﬁrst approach is based on geometrical properties of the
so called natural pixels of intensity sinograms as well as on the high structure of
the sinogram image. Within this framework, we develop a practical prior model for
sinogram images which are used to estimate the underlying sinogram intensity from
the raw projection data prior reconstruction. The sinogram estimate is then used in
conjunction with the standard filtered-backprojection (FBP) algorithm to produce
an improved image reconstruction. We illustrate the superiority of this method over
the conventional FBP-based approach using clinical and simulated data.

We also introduce a second and more sophisticated approach to ECT based on a
new multiscale-based approach to the Radon inverse transform. This new formulation
lends itself to a very natural “discretization” of the Radon inverse operator which is
especially amenable to numerical computation. It will be seen that in fact, in contrast
to Fourier-based numerical reconstruction methods widely used in practice, the new
inversion operator requires no discretization itself, and it is only the data that needs
to be sampled. Fourier-based numerical reconstruction methods require conflicting
approximations to the Radon operator by simultaneously discretizing the frequency

and space planes. In these methods, the intensities must be assumed to be of finite

 

 

 

with Y y
bu;nnlli in
Al

I

\
DTN‘UHI l
I

ELM lei '3‘];

II
M Hill

ue 51:1. Ii

‘
h""? ‘ru
ILLS: IllM

l

'I'v .

Lijﬁs (f .-.
.. .) I)
' . I
[hit {new
9}-

. ‘ .
‘1? {".. \

“‘—AA

7'

'l1."
dam“ "3 7

AA A
We.
»“A “7-
~‘ . \
in u“:
i
f

o
l

“.3. 3“"? 13'

D».
”1’76 ()1; (

support in both space and frequency, and such signals do not exist. The method
presented here avoids such conundrum and reconstructs intensity images without the
associated potential artifacts.

Within the new reconstruction framework, we develop a prior model for the cumu-
lative sinogram image (intensity) (C81). The new model is unique in that it exploits
the high degree of redundancy of information present. in a sinogram to create a very
robust tomographic reconstruction of photon-limited images. We demonstrate the
advantage of the proposed method using clinical data.

Multiscale representation and analysis of signals have been used in the past as
the basic framework in the modeling and estimation of Poisson, Gaussian, and other
types of processes [10, 11, 12, 13, 14, 15, 16]. While in general all these approaches
have been successful, except for the case of Gaussian processes, a clear justiﬁcation for
their choice has not been offered. In an attempt to explain the advantage of multiscale
modeling and estimation in a most general way, we introduce information-theoretic
measures that quantify the degree of goodness of models in various aspects. For
this, we introduce the concepts of anomy, accuracy, precision, and resolution power.
Based on criteria developed with these concepts we gain insight into the advantage of
Bayesian estimators within the framework of multiscale representation of processes.
The Bayesian approach is shown to give the means for a systematic and maximal
use of the available information at each scale of multiscale models. Additionally, we
give general guidelines for constructing new multiscale linear transformations which
are statistically motivated and applicable to the general Gaussian model. The trans-
formations are potentially better suited than conventional time/ frequency multiscale

analysis for estimation purposes, and may also be extended to other models as well.

1.1 Organization and Summary of Contributions

The main objective of Chapter 2 is to motivate the multiscale “paradigm” as an
approach to modeling for Bayesian estimation. Throughout this dissertation the
multiscale modeling of processes and the multiscale representation of signals play very
important roles as they provide the underlying framework for the Poisson models
and estimators presented in Chapters 3 and 4. Therefore, in Section 2.2 we ﬁrst
present an elementary review of wavelets. Much intuition about multiscale modeling
and estimation is gained from its understanding. Also, much of the notation use
throughout this dissertation is introduced here and in Section 2.3.

In addition, Chapter 2 offers the following three main contributions. First, in
Section 2.3 we present a new, unifying approach to characterize and qualify models
and estimators alike. For this, we introduce four new measures that are more general,
and we believe more natural, than the conventional statistical measures often used
for this purposes. Second, in Section 2.4 the advantages of modeling and estimating
within the multiscale framework in general, and within the multiscale Bayesian ap—
proach in particular, are established for a class of processes. Third, we give general
guidelines to constructing multiscale models which are statistically motivated, and
consequently, are better suited for the estimation problem. These guidelines are de-
veloped for Gaussian processes, but they can be easily extended to other processes.
lVIuch work needs to be done in this respect, but the criteria set forth here opens a
wide range of new and exciting possibilities. In Section 2.5 we conclude the chapter
with a brief review of some important existing multiscale and estimation approaches.

There are four major contributions in Chapter 3. First, in Section 3.4 we describe
a new, multiscale, prior probability model for non-negative intensity functions. This

model employs a multiplicative innovations structure in the scale-space domain. Sec-

ond, based on this new prior, in Section 3.5 we derive a simple and computationally

efﬁcient, Bayesian estimator of the intensity given an observation of counts, under
squared error loss. It is shown through examples that the Bayesian estimation proce-
dure signiﬁcantly outperforms existing wavelet-based methods. Third, we extend in
Section 3.6.1 the multiscale intensity prior to a shift-invariant one, and develop a fast
shift-invariant estimation procedure. Furthermore, we obtain closed-form expressions
for the correlation functions of both priors, and show that the correlation behavior of
the shift-invariant prior has 1 / f spectral characteristics and is more regular than that
of the shift-variant prior. Fourth, in Section 3.8 we apply the framework to photon-
limited imaging and examine its potential to improve nuclear medicine imaging.
The main contributions in Chapter 4 are three. First, in Section 4.5 we introduce
an extension to the prior modeling approach of Chapter 3 which is especially well
suited for modeling computed tomography sinograms. The new prior is based on geo-
metrical consideration of the inherent structure of sinograms. The excellent match of
the prior to real and synthetic data is seen in the examples. Second, in Section 4.6 we
present a new, multiscale-based, Radon-inverse transform algorithm. The transform
has three major qualities for ECT applications: it admits a very efﬁcient computa-
tional implementation; it allows reconstruction of images at any desired resolution
supported by the data with the corresponding computational savings; and it provides
a very robust reconstruction of images from photon-limited projection data. This last
quality is unique to the new method and is illustrated with an example. Third, in
Section 4.7 we introduce a third extension to the intensity modeling approach devel-
oped for general Poisson processes and apply it to the cumulative sinogram images
of emission-computed tomography. This prior is used in conjunction with the new
Radon-inverse transform to reconstruct highly reliable images from photon-limited
data. The superiority of this modeling, estimation, and reconstruction method is
illustrated with clinical data.

Finally, some comments and conclusions are given in Chapter 5.

CHAPTER 2

The Multiscale Modeling Paradigm

2. 1 Preliminaries

In his book “Conceptual Physics” [17], Hewitt identiﬁes a crucial factor that made
the evolution of human scientiﬁc knowledge possible when he writes: “Science had its
beginnings before recorded history when people ﬁrst discovered recurring relationships
around them. Through careful observations of these relationships, they began to know
nature and, because of nature’s dependability, found they could make predictions that
gave them some control over their surroundings. ”

Hewitt’s recurring relationships refer to the observed patterns of cause and effect
that appeared to dictate the course of nature in many instances. It is evident that
while our ancestors could not make predictions about the exact shape of a ﬂame in a
ﬁre, they could always expect the whole of the ﬂame and smoke to rise. Thus, looked
at on a large enough scale, the phenomenon could be satisfactorily predicted.

Probably one of the greatest successes in science before modern times was achieved
in predicting the movement of the planets. In their observations of the skies, the

Mayans did not have to contend with erratic or chaotic effects, but only had to discover
the patterns manifested at very large scales. It is no coincidence that nowadays

scientists still seek to discover patterns in whatever is under study in order to advance

their knowledge. After all, if the behavior of artiﬁcial neural networks is a hint of
how the human brain works, to understand nature means training our brains enough
so as to recognize patterns in our environment, for then we may predict its future
behavior.

As the subjects studied by people became more and more complex, the patterns
to be discovered were less evident and more difﬁcult to perceive. Mathematics then
became the primary tool in this quest. After the experimentation phase, models
were postulated and tested. At ﬁrst, the mathematical models were deterministic
in nature, but as scientists’ interest shifted towards natural phenomena “belonging”
to sufﬁciently small scales, probabilistic models had to be introduced. For example,
before the turbulent behavior of the flames in a ﬁre could be understood, statistical
thermodynamic models had to be postulated to explain the molecular behavior of
gases.

Clearly, the new probabilistic models represented more accurate but less precise1
models than their deterministic counterparts. They predicted the outcome of an ex-
periment more reliably while providing less detail about such outcome—gas molecules’
“typical” behavior could be predicted very successfully; however, very little could be
said about any given molecule’s state, e.g., its location and momentum. The reason
for this was that while small scale phenomena were being modeled, the experiments
carried out to undercover their hidden patterns were of much larger scales; conse-
quently, only the patterns displayed at these larger scales provided information about
the underlying phenomena. In the case of gas molecules, only their aggregate behav-
ior could be discerned and so, only probabilistic inferences about individual molecules

could be made.

 

1The terms accuracy and precision are used here in a general sense, meaning, respectively, the
accordance of an assertion with the truth, and the amount of information conveyed by that assertion.
Thus, by forecasting rain over the Paciﬁc Ocean this year, one makes a highly accurate assertion
but of very low precision for not much information about the time or place of the event is given, for
example. In Section 2.3.2 we give precise meaning to these terms.

As advances in technology allowed probing at smaller and smaller scales, the
models created became more and more precise. Nevertheless, to maintain accuracy,
they necessarily had to be probabilistic in nature, for if we accept that every event
in nature has its origins in the smallest of scales, the Heisenberg Principle prevents
us from observing the full state (e. g., displacement / momentum, time/ frequency, etc.)
of whatever “lives” at those scales, and so, prevents us from producing inﬁnitely
accurate deterministic models—never mind that our known deterministic “laws” of
nature do not apply at these extremely small scales.

Although in practice we are often content with deterministic coarse-scale averaged
representations of observed phenomena, the above discussion brings to light the in-
tuitive notion that while modeling of phenomena based on coarser scale information
alone may be more accurate, it can only be achieved at the expense of precision.
Alternatively, coarse-scale models may be constructed more accurately than their
ﬁne-scale counterparts, but the coarse-scale models are necessarily less precise.

These two statements are not just reworded versions of one another; however, once
the information-theoretic deﬁnitions for accuracy and precision have been introduced,
we will show them to be equivalent. At that point, we will also be able to gain a
greater insight as to their interpretation. Their signiﬁcance is as follows.

The ﬁrst assertion establishes that construction of more precise models requires
new information, and that such information may only be found in ﬁner scale patterns.
This implies the futility of trying to improve the precision of estimators beyond what
the observations’ scale supports.

The ability to trade precision for accuracy by modeling different scales of a process,
as established by the second assertion, is of great signiﬁcance to the point estimation

problem. We will show that under certain conditions the robustness2 of an estimator

 

2In the literature (see, for example [18]) robustness connotes the consistency of an estimator’s
performance under all possible distributions for the observations, that is, under typical observations

can be enhanced by leveraging the estimation process through this trade. The multi-
scale Bayesian estimation approach introduced in Chapter 3 will be shown to provide
an intrinsic way to achieve this.

Given the crucial roll that multiscale representation of signals plays throughout
this dissertation, we next give a brief introduction to the topic. To this end, we make
use of wavelet theory as it gives a natural perspective of multiscale analysis. We
avoid as much as possible the formalities associated with this topic, however, and

emphasize a motivational point of View, as this is the goal of the present chapter.

2.2 Multiresolution through Wavelets

2.2.1 The Fourier Transform and Scales

In the quest to identify the underlying patterns of phenomena and processes, ex-
pressing the signals of interest as a linear combination of more elemental functions
has often proved invaluable. With this approach, features in the signals which are
uniquely characteristic to the event under study have been brought to view and in
this manner have helped to identify the patterns.

For most problems encountered in engineering the sets of functions of greatest
utility have been those which are complete in L2(R) (or more generally, in L2(lRN))3
since they can represent any ﬁnite energy signal that might arise.4 Undoubtedly, the
Fourier system has had the greatest of inﬂuences in this respect since its discovery in
1807. One reason for this is the orthogonality of the set, which leads to its simplicity

and wide range of use. However, it is the fact that the elementary functions are

 

as well as those including outlayers. Here, we use the term to mean a degree of goodness. Later in
the chapter we will introduce the concept of resolution power and use it for this purpose.

3For the sake of simplicity, throughout this dissertation we focus our attention on signals on the
real line when this sufﬁces to make the desired point.

4Clearly, the ﬁdelity of such a representation is only in the mean-square error sense and so, it is
only accurate to within a set of measure zero. We ignore these technicalities for the most part.

10

eigenfunctions to linear translation invariant operators that has mostly contributed
to its great extent of applications.

The Fourier transform decomposes a function into its frequency components; this
is particularly easy to visualized in the case of periodic functions—clearly, not L2
functions. In this case, the components are denumerable and readily identiﬁable at
ﬁnite intervals along the frequency axis; thus, admitting a series representation for
the signals. For L2 functions, the situation is radically different as no one component
is present in the original signal, for if any one were, it would be so only through its
manifestation of energy in some form, no matter how small this is, but we know that
no energy is born at any given frequency.

An alternate view of the Fourier representation is that the transform decomposes a
signal into a countable number of scales corresponding to an arbitrary partitioning of
the frequency axis. The system in this case is still orthogonal, but the basic elements,
or atoms, are now the functions that the Fourier exponentials integrate to within each
of the frequency intervals. The advantage of this system from the temporal viewpoint
is that L2 signals may now be represented as linear combinations of such atoms, with
each atom contributing a ﬁnite amount of energy—if present at all, that is. Fiom the
frequency perspective, we have that the highly frequency-localized analysis of signals
is preserved, although not to the inﬁnite frequency resolution of the original Fourier
system.

As an example, this scale decomposition may be constructed with a set of Gabor-

like functions deﬁned on the frequency line:

. 1 _ ' . . j .
hj.k(w) E —l(w——i€—O) e’kuow‘ﬁ") for all j, k E Z, (2.1)
£0 50
where l() is the indicator function over the {—1/2, 1 / 2) interval, i2 = —1, and uo and

{0 are arbitrary parameters. Clearly, the set {lg-thy. partitions the frequency line

11

into adjacent non-overlapping uniform intervals of width £0, and include modulating
factors. Letting uoéo = 27r, it is easy to show that the frequency spectrum f of f

may be expressed as (see Appendix A.1)
fiw) = Z<f’ hjikliljidwli (22)

with (-, ) denoting inner product. By obtaining the inverse Fourier transform of both

sides of this equality and using Parseval formula we obtain,

The commutation of the summation and integral operators in these steps is assured

since every Fourier series is integrable term by term[19]. Here,

sin {29(t + kuo)
hj.k(t) : £9“ + kilo)
2

 

e’jEOt for all j, k E Z. (2.3)

The partition of the frequency axis could have been chosen to produce a more
convenient scale.5 The octave scale is especially meaningful in acoustics as it matches
the sensitivity scale of human hearing more naturally, and is obtained by replacing j
by 2i everywhere in the above expressions.

In many instances, the relevant information in a signal is transitory in nature. In

an image, for example, it is the edges and other singularities which often convey the

 

5Throughout, we use the term scale with two different but related meanings: one, to signify a
single element in the partition of the frequency axis, when we want to stress its place among the rest
of the elements in the partition; and two, the partition itself, when we want to stress the speciﬁc
structure of the partition.

12

information of interest. This explains why one can frequently discern the content of
an image of a real-world scene solely from its high-pass ﬁltered version; the contour
of a human face and the silhouette of a house convey much of the total informa-
tion. Consequently, parsimonious representations of an image with its short-duration
features require temporally narrow well-localized atoms. On the other hand, in or-
der to achieve ﬁdelity of representation while maintaining conciseness, long-duration
features need also to be represented accurately with only a few atoms. This would
insure, for instance, that the presence of texture in an image would not undo the
economy or the quality of the linear representation.

A most practical system, then, would consist of both arbitrarily narrow time-
localized elements and arbitrarily narrow frequency-localized (temporally long) ele-
ments, as well as everything in between. Clearly, neither the Fourier nor the system

(2.1)~—(2.3) possess these characteristics.

2.2.2 Wavelet Representation of Signals

The theory of wavelets provides a systematic approach to constructing a complete
orthogonal set through iterated dilations and translates of an elemental function if),
called a mother wavelet. When this function is chosen to be fairly well localized in
time, the wavelet system becomes highly efﬁcient in representing singularities and
other local features. This is the case, in particular, with wavelet functions of ﬁnite
support. However, for this class of wavelets, the scales necessarily overlap to some
extent and do not strictly constitute a partition of the frequency axis. Nevertheless,
the induced scales are always well deﬁned, if not well localized, due to the inherent
nature of wavelets. For this reason, wavelet analysis of signals offers a versatile
representation which exposes time-frequency patterns or features not discernible in

either the time nor the frequency domain alone.

13

Multi-resolution Analysis

In general, a wavelet representation of an L2(R)-function f has the form

f = Z Z dj,k¢’j,k3 (2.4)

jeZ keZ

where d“ _=. (f, $11k) and vbj,k(t) E 2‘1/2w(2‘jt — 1:). Meaning is given to this decom-
position through the concept of multi-resolution analysis, which is the foundation of
every wavelet system representation.

IV‘Iulti-resolution analysis consists of a family of closed subspace {V,-},-ez of L2(R)

having the following properties: 6

i) Vj+1§ V]- for all j E Z, and 0,6sz = {0}.

ii) There exists a function a such that the set {o(t — k)}kez is a complete

orthonormal set in V0.
iii) A function f(t) is in V0 if, and only if, f(2’jt) is in V,.

iv) Ujesz is dense in L2(R), i.e., for any f E L2(R), there exists a

sequence {fn}n€N in Ujezvj which converges to f in the L2 sense.

The function o is known as the scaling function. Notice that by ii and iii, o(2'jt—k) E
V,- for all j E Z. It is easy to see that since the set {q5(t — k)}kez is orthonormal,
the set {¢(2“j t — k)}k€z is orthogonal, and, with the scaling of 2‘j/2, orthonormal as
well. Moreover, the set {2‘37 2o(2‘j t — klh-ez turns out to be complete in the space

V,. Therefore, any element of V0 can be expressed in terms of this set corresponding

 

6Multi-resolution analysis may be deﬁned for other spaces other than L2(R), for example, Sobolev
spaces W5" 5 {f : Ifflk)(t)|”dt < 00, k = 0,1,... ,m} with 0 < m,p < 00; but we only concern
ourselves with the former.

14

to V. 1 since V0 g V_1, i.e., there exists a sequence of numbers {hﬂkez such that

at) = Z hk21/Qo(2t— k). (2.5)

1:62

In a similar manner to the deﬁnition given earlier for u’w, let 95,-},(t) E
2’j/2¢(2“jt—k). The conditions for an inﬁnite set of subspaces to be a multi-resolution
analysis were ﬁrst stipulated by Stéphane Mallat [20], who in turn gave the follow-
ing very important result. Deﬁne the sequence {gﬂkez by gk E (—1)’°"1h‘1‘_,c for all
k E Z, where h; designates the complex conjugate of the coefﬁcient hk in the scaling

equation (2.5). Deﬁne

to) s [gel/Wt — k). (2.6)

keZ

Then, {2’j/2w(2‘jt— k)},,kez is a complete orthonormal set. in L2(R), i.e., {w,,k},,kez
is a wavelet system for L2(R). This justiﬁes (2.4). Furthermore, the set

{T/Jj,k}ker{¢,-,k}kez is orthonormal, with the consequence that if

W,- ;-=_ {Z zulyt [(Zklkez E (HZ) la

keZ

then every element of W,- is orthogonal to every element of V,. The “parent” scaling
functions (25, and the mother wavelets 1/2 for the Haar and Daubechies systems are
Shown in Figure 2.1. Dilation and translations of these atoms give rise to the entire
family l¢j.k}keZU{¢j.k}kez-

Mallat also found that {W,-, V,} forms a partition for V,-__1, i.e., W,, V,- g V,_1 and
for every element 12,--1 of V,_1, there exist elements w,- and v,- of W,- and V,, such that
03—1 = w, + v,-. This is expressed concisely as the direct sums l/,_1 = W,- ® V,-, and

are depicted in Figure 2.2.

An immediate consequence of this relationship is the following. Designate the

15

 

 

 

 

 

.................................................

 

 

 

 

 

 

O—u-
d—T—
0+
—l

 

 

   

 

 

 

 

 

 

0.,
d
o-il-
.n—[.-

(C) (d)

Figure 2.1. The Haar and Daubechies 4—pt scaling and wavelet functions in the
interval. (a) Haar scaling function. (b) Haar mother wavelet. (c) Daubechies scaling
function. (d) Daubechies mother wavelet.

projection of f on V,-_1 by f,_1, that is,

fj—l E ZCj—l,k¢j—1,ka (2.7)

1:62

Where c,_1,k E (f, (151—1.1:). We may then also write

fj—l = 2 (033mm + dj,k¢j.k), (2.8)

1:62

with d“, as in (2.4). Equating these two expressions and writing $3.1: and w“, in terms

0f (151—1,1, according to the generalized versions of (2.5) and (2.6)——which are obtained

16

——>vj+2 ——-)V,-+1 9V, 9 vj-I ——>
/® /® /®/ 6-) /
Wi+1 w] wj-I

W j+2

Figure 2.2. Multiresolution analysis expressed as a sequence of direct sums. Each
space V,-._1 encompasses the functions living in V,- and W,-, and all those functions
formed by the sum of any two elements in V,- U W,. Since W,- is orthogonal to V,-
and, therefore, to every V1 with l 2 j, W, deﬁnes the scale of L2(R) functions with
features not representable in any V, with l 2 j.

by the repeated substitution of the argument t for 2t, and in each iteration, multiply-
ing the resulting expression by 21/2—the discrete wavelet reconstruction expression

resuks
Cj—1,k = Eat—21011 + gk—ude), (2.9)
(62
for all j, k in Z. The corresponding discrete wavelet decomposition relations are

Cch = Z hz—2ij—1,i (2-10)

162

and

dj,k = Zgz-mch—u (2-11)

(62

These are obtained by again expressing qt“, and 1?ch in terms of (15,.”c according to
the generalized versions of (2.5) and (2.6) in c“, E (f, any.) and d”, E (f, 1b,,k), and
Writing the resulting integrals in term of c,_1,k.

The coefﬁcients hk and 9,, in (2.9), (2.10) and (2.11) are the wavelet ﬁlter coef-

ﬁcients, often referred to as quadrature mirror ﬁlter (QMF) coefﬁcients within the

17

linear-ﬁlter-processing community. It is not difﬁcult to show that the sequences (hk)

and (9],) correspond to low-pass and band-pass ﬁlters, respectively.

Signals’ Finite Representation

The fact that we can express the coefﬁcients at one scale in terms of those at the scale
immediately “above” or “below” without the need of scaling or wavelet functions is
key to the usefulness of the wavelet transform, for otherwise, the wavelet transform
would have gone the way of the Fourier transform before the FFT (Fast Fourier
Transform) was invented. For the transform to be computationally practical, however,
it is also necessary that a function may be representable by only a ﬁnite number of
coefﬁcients. The following shows how under very mild conditions this is possible.

We ﬁrst rewrite (2.4) as

f = E Z (Ext/5.}: + Z Z d,,,,z/;,-,,, + Z Z d,,,,v,,k, (2.12)

ngVkeZ .ﬂ<jsJkeZ J<jkeZ

where J’ < J, but which are both arbitrary integers otherwise. From Figure 2.2, it is

seen that

f1 = fJ+1+ ZdJHI/JJHJc
keZ

J+2

= fJ+2 + Z Zdj,k¢j,k

j=J+1keZ
J+3

= fJ+3 + Z Zdj,kt/Ij.k

j=J+1keZ

= = zzdjﬁl/ija

J<j 1:62

18

and similarly

fJ’ = Z :dj,k¢j,k

J’<j keZ
Therefore, (2.12) can be written as
f = (f — m + Z Eat-mi + a (2.13)
J'<ng keZ

It is not hard to show that the projection fj is the best approximation in V]- to f in
the sense that N f — fj“ is minimized over all sums :keZ qujj‘k for any set {zdkez of
real numbers. Then, since a multi-resolution analysis is dense in L2(IR) (property iv),
the sequence (fj);’j, converges to f, i.e., ||f — f,“ —> O as j —> —00. This implies
that if J’ is chosen small enough, the (f — f 1:) term can be ignored maintaining an
approximation to f as closely as desired. Doing this in (2.13), and arbitrarily choosing

7

the lowest scale approximation J’ to be scale zero we obtain

f ’31 Z Zdj,k¢’j,k +fJ

0<j_<_J keZ

= Z Z (1)“!ij + Z CJ,k¢J,k

0<jsJ keZ keZ

(2.14)

This expression still involves an inﬁnite number of terms. However, if 11) is of
compact support, then the entire wavelet system {112M} 3"]; is of compact support, and
so, if f is also of compact support, there will only exist. a ﬁnite number of wavelet

and scaling coefﬁcients different than zero.

 

7There is no loss of generality here for we may expand or dilate the original function by any
amount required so that the approximation to the newly obtained function at scale J = 0 represents
the desired fit.

19

The Discrete Wavelet Transform

When a function f admits a ﬁnite wavelet representation, it is often convenient to
think of the set of coefﬁcients {dj,k}k,0<jg U{cJ,k}k in (2.14) as the wavelet represen-
tation itself of the function. In this sense, the scaling coefﬁcients {C0,k}k obtained by
the reconstruction operation (2.9) is viewed as the (discrete) signal of interest. The
operation by which {coy},c is transformed into {dj‘khn 0<js J LJ{cJ,k};c is known as the
discrete wavelet transform (DWT).

Due to the decimation operations indicated in (2.10) and (2.11), the number
of scaling (and wavelet) coefﬁcients in a scale is always half the number of scaling
coefﬁcients in the prior lower scale. So if the coefﬁcients at scale 0 run from, say, k = O
to N — 1, where N is an integer’s power of two, it takes N / 2i scaling and wavelet
coefﬁcients each at scale j to convey the same information. Therefore, whether the

)T are derived from a function of a continuous

coefﬁcients c E co E (60.0, - - . ,cQN
variable, or constitute a discrete signal in their own right, the DWT may be expressed

as

0..
III

(1J_l = We, (2.15)

 

 

where, for any j, d, E (dj,0,dj,1,--- ,dj,N/23~_1)T, and J = log2 N. ()T denotes the
transpose operation. The elements of the matrix W are linear combinations of the
ﬁlter coefﬁcients {hk} and {gk}. These are found by expressing (2.10) and (2.11) in

matrix form, and applying them iteratively up to the highest possible scale J. In

20

 

 

 

 

 

f approx- c d wavelet-based

a a }
lmatlon I I19 ——5 IDWT ——-—>’synthosls ——>

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.3. Wavelet-based Signal Processing. A ﬁnite-support function f is ﬁrst
projected to a suitable scale space, e.g., V5. The scaling coefﬁcients c = c0 corre-
sponding to the projection are DWTed. Signal processing algorithms are employed
to generate new wavelet coefﬁcients d from the original. By an IDWT, corresponding
scaling coefﬁcients 6 = (:0 are obtained from which the desired ﬁnal function f may
be synthesized.

particular, for N = 4, J = 2 and

(ho h, 0 0) (ho h1 h2 123)
90 91 0 0 h—2 h—l ho hl

O 0 10 go 9.1 92 93
\0 0 0 1} Km 9-1 90 9,)

 

 

 

 

A signiﬁcant advantage of the DWT in problems involving ﬁnite-support functions
of continuous variables is that, once the vector of scaling coefﬁcients on has been
obtained, one may operate solely on the coefﬁcients (1 resulting from the DWT until
the actual desired function is needed; at that point, the new function is synthesized
using (2.14) with the newly obtained scaling coefﬁcients following the inverse DWT
(IDWT) of the processed coefﬁcients, d. This is the basis for discrete wavelet-based

signal processing. A pictorial representation of this idea is shown in Fig. 2.3.

Signiﬁcance of the DWT coefﬁcients

Each element of the set {dj,k}k=o,...,~-1 is associated with a region of the time— frequency
j=1,...,J
plane according to the location where the energy of the corresponding wavelet is

mostly concentrated. In particular, wavelet coefﬁcient dJ-Jc corresponds to a region

21

 

%.>1Hno

 

1’?
Hi
lll

hoquoncy Hequency

(a) (b)

J

 

>/

 

K
i

A
W

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.4. Time-frequency plane. (a) Partition induced by the Haar system, and
(b) partition induced by the system (2.3). Nodes and links in (a) display the func-
tional relationships among the various wavelet coefﬁcients. The uppermost node
represents wavelet coeﬁicient dyo, and those at the bottom, represent coefﬁcients
dog, C10,], - - - , d0,N_.1. See text for an alternate interpretation.

centered at (uch, 5M) and of dimensions r,- x 0,, where u”, E f t ijy,(t)|2 dt, {M E
fwliﬁj,k(w)|2 did, and for any k, r, E [(t — Uj‘k)2le‘k(t)|2 dt, and o]- E [(w —
e.k)"’Iz/3j,k(w)|2dw.

The time-frequency partition induced by the Haar system is shown in Figure
2.4(a), and that induced by the system (2.3) is shown in Figure 2.4(b) for referen’ce.
The nodes in (a) represent the wavelet coefﬁcients, and the links represent their
functional relationships according to (2.9), (2.10) and (2.11).

Another useful interpretation of the tree—like structure superimposed onto the
time-frequency plane is that each row of nodes constitutes a projection of the original
signal onto a subspace of NOR). Speciﬁcally, the upper most node always represents
c 1,0, the second row, the (c 1-1,0, CJ_1,1) vector, and so on. The bottom row corre-
sponds to the highest resolution representation available, and as was indicated before,
is often taken to be the original signal of interest. In this sense, we speak of the j-scale
representation a, E (cm, - - - ,cj. N/21_1)T of co, since c, are the scaling coefﬁcients of

f, whenever co are of f. Notice that under this alternate interpretation, the signal

22

represented by the coefﬁcients on any given row, i.e., at any scale, of Figure 2.4(a),
may possess energy within the entire frequency band extending from 0 Hz (top of the
ﬁgure) down to the row of coefficients.

As an illustration of these two interpretations, the values taken by the coefﬁcients
in each row of the time—frequency plane are shown in Figure 2.5. At the bottom of
the ﬁgure, the scaling coefﬁcients corresponding to scale zero constitute the original
signal of interest. Right above them, the scaling coefﬁcients corresponding to scales
1, 2, and 3 are shown along (to their left) one translate of the scaling functions used
in obtaining them. Each sample on any row corresponds to a node in Figure 2.4
according to the second interpretation. On the other hand, the values taken by each
node conforming to the ﬁrst View are displayed on the right side of the ﬁgure. These
are the wavelet coefﬁcients obtained by the inner products between the signal at scale
zero and the sequence of translates of wavelets shown in the rightmost column. Both

views are valuable depending on what is being sought.

Vanishing Moments

There are many wavelet systems in existence and each is better suited for a particular
set of applications. The systems may be differentiated by a myriad of properties that
they may or may not hold. One very important distinguishing property often stated
is the degree of regularity of the family of wavelets. For important families of wavelets
this may be measured by the number of vanishing moments [21]. A wavelet function

w has v E N vanishing moments if 8

ftpw(t)dt=O forp=0,... ,v—1, (2.16)

 

8Throughout this dissertation we avoid writing the limits of integration as much as possible for
simplicity of notation, but every integral indicates a deﬁnite integration operation. When the limits
are omitted, the region of integration is the entire space where the variable of integration is deﬁned.

23

¢' k (Cj,k) (dj,k) Wj,k

,

 

 

scale j

r" (
.J “v flr

ﬂ
2 J1
n ....: :_ «will

 

fl”

Figure 2.5. Multiscale Signal Analysis. Scale and Wavelet coefﬁcients (second and
third columns, respectively) of the signal at scale 0 (bottom row) are displayed for
scales 1, 2, and 3. Columns 1 and 4 give one single translate of a scale function
and wavelet used in obtaining the coeﬁicients. These functions correspond to the
unnormalized Haar system, which we review more extensively in Chapter 3.

but not for p = v.

By a simple change of variable one can easily show that if 212 has v vanishing
moments, then each 10,-), has v vanishing moments as well. Therefore, any function
f which can be closely approximated by a polynomial of order v — 1 will have all its
wavelet coefficients d“. be zero or nearly zero. This situation represents a high degree
of compression, for only a set of coarse scale scaling coefﬁcients sufﬁces to represent
the function.

Beyond parsimonious representations, a sufﬁciently regular system may be ex—
ploited in estimating a signal from within noise. If the true signal is smooth enough,
most of the energy in the wavelet coefﬁcients will be from noise, and thus a simple
thresholding scheme may remove much of this noise without seriously altering the

true signal upon reconstruction. We review some of these wavelet-based estimation

24

procedures in Section 2.5.

Images

There are various ways to extend wavelet analysis to images or L2(IR2) functions. A
simple approach often taken consists of constructing a separable 2-d multiresolution

analysis V,2 as the tensor product of 1-d multiresolution counterparts:
V]? E V]- ® V, for allj E Z. (2.17)

The resulting family of subspaces {VJ-2} of L2(R2) possess 2-d versions of the four
properties i—iv given earlier characterizing 1-d multiresolution analyses. Speciﬁcally,

the scaling property is satisﬁed by a scaling function deﬁned as

¢j.k1,k2(t1» t2) 5 ¢j.k1(t1)¢j.k2(t2). (2.18)

In two dimensions, the detail space W,-2 is the space spanned not by one set of
wavelets corresponding to scale j, but by three sets, each associated with a given
orientation: horizontal, vertical, and diagonal. These wavelets are constructed as

follows.

w2k1,k2(tlat2) E ¢j.k1(t1)’l/'Jj,k2(t2)
w;k1.k2(t1’ t2) 5 wj.k1(tl)¢j.k2(t2) (2.19)

I d

wj.k1.k2 (tlv t2) 5 with (tlle,k2(t2).

Each set of wavelets deﬁnes a subspace of L2(R2) such that the projection of a function
f onto it gives the energy (a measure of the amount of detail content) of the function
in that orientation. As an illustration, Fig. 2.6 shows the 2-d wavelet decomposition

of the standard cameraman picture. On the left is a discretized rendering of the

25

 

Figure 2.6. Multiscale representation of images. (Left). The standard cameraman
image: highest resolution scaling coefﬁcients, i.e., co. (Right). Wavelet representation
of the cameraman image: c2 are the scale 2 scaling coefﬁcients, (11 and d2 are the
wavelet coefﬁcients at scales 1 and 2 corresponding to the horizontal, vertical, and
diagonal orientation according to the h, v, and d designations.

original scene; hence, we regard it to constitute the set of scaling coefﬁcients at the
ﬁnest of scales available, and which we arbitrarily denote as scale zero.9 The wavelet
analysis of this image up to scale 2 is shown on the right. The representation is
standard: on the upper left, the scaling coefﬁcients corresponding to j = 2 form a
coarser representation of the original image; the horizontal, vertical, and diagonal
wavelet coefﬁcients dﬁkl‘kz, diam» and dink: at the ﬁrst and second scales provide
the details of the original image c0 not in c2.

In Chapter 3, we introduce a new two-dimensional multiscale representation of

images which is derived from the 1-d Haar system but which is not separable. The

 

9This interpretation in which each pixel in the picture assumes the value of a scaling coefﬁcient (at
scale zero, in the present case) is justiﬁed if c0)“, E (f, ¢O.k1.k2> 2 f(k1. k2). It may be shown that
under proper dilation of the original function f, most wavelet systems. and certainly, all bounded
systems of compact support, satisfy this condition whenever f is Lipschitz, i. e., there exist constants
c < co and o < a s 1 such that If(t’1,t'2) — fem)! : C |<t'i — n)‘2 + (t’2 —t2)21"/2. for an :1.
t2. t’i, and t’2 [22].

26

new analysis approach is better suited for the modeling of non-negative 2-d functions
as the estimates obtained are always non-negative. In general, this does not hold for

estimates based on any other 2-d wavelet system.

2.3 Multiscale Modeling and Attributes

2.3.1 The Components of a Model and their Notation

Whether an event is a naturally occurring phenomenon or the result of a man-made
process, it ultimately exists only as a manifestation of re-distribution (or conversion)
of energy in space. The flow of energy, or its ﬁnal state, forms a discernible pattern by
which the event may be identiﬁed. Typically, we record such patterns as signals that.
we can later manipulate and study. Thus, these signals convey all the information that
we may ever have about the process, and might well be regarded as the process itself
for modeling purposes. This is reminiscent to the nature of random variables, which
encode underlying random events but are of no essence once the random variables are
deﬁned.

Because there are no truly spontaneous events, every process is simply the continu-
ation of some prior process, and so, we partition the succession of events by recording
signals at intervals of time. Any two consecutive signals constitute the cause and
effect of the process they encompass.10 Denote these signals by A E A and :c E X,
respectively, where in general, A and X are subspaces of VCR”) or l2(ZN). For
simplicity, however, we restrict subsequent discussion to the case where A and X are

subspaces of RN, and to stress this, the signals are shown in bold face to remind us

 

10We note that for some processes the cause and effect parameters or signals may seem inter-
changeable, when in fact, they correspond to two distinctly different processes. For example, in
modeling the behavior of ideal gases one choice would be to have the change of temperature of an
isolated volume of gas be the result of changing pressure. Another possibility would be to consider
the change of temperature to be the cause of the change of pressure.

27

that they are vectors.

As indicated earlier, a deterministic model for a process may sufﬁce for many
applications, but a probabilistic model is always much more general. In fact, it can
be argued that the joint distribution p(x|A)p(A) is the right way to convey all known
information about the relation between A and x and, therefore, about the process
itself.

Although it is often convenient to distinguish between a random variable and its
realizations, for simplicity, we provide no separate symbols for each. Instead, we rely
on the context to make the differentiation. Thus, for example, with E representing
the expectation operator, the expression E[x|A] necessarily implies that x is a random
variable (or a random vector) (rv) while A may be a random parameter or a realization
of it.

By p(x) we denote the probability density distribution of the ru x if it may take
on an uncountable number of possible values. If the range of x is countable, then p(x)
stands for its probability mass distribution. In general, p(x|A) 75 p(ylA), as we deﬁne
the densities solely by their arguments. Only in a few instances will it be necessary
to be more explicit and write, for example, px(y) to mean the density of x evaluated
at x = y.

Our motivation for modeling events primordially comes from the desire to estab-
lish the probable causes of the observed phenomena. That is, given the outcome x we
would like to estimate A. This is often an important task for if we know A we may
predict or estimate the outcome of some other process dependent on it as well; alter-
natively, we may be interested in whatever caused the outcome A in the ﬁrst place.
For example, in nuclear medicine, the immediate problem consists of establishing the
distribution of radioactive material within the patient given a noisy image of photon
counts. This knowledge later aids the physician in diagnosing the state of the patient.

In view of this, we regard only models in which A represents the statistical mean

28

 

 

 

A
process A. process x estimator 1.
pa) * p(xm ’ 5(x) ——*

 

 

 

 

 

 

 

 

 

 

 

Figure 2.7. Components of a Model of a Process. The overall process is modeled as
a doubly stochastic process in which A is the realization of an unknown process, but
which is modeled by a prior distribution p(A). The output x, in turn, is modeled as
a realization of the process obeying the likelihood p(x|A). The purpose of the model
is to facilitate the construction of an estimator 6 for A based on the observation x,
but the estimator is not a component of the model.

for the rv xlA. That is, x represents a “noisy” realization of A governed by the
likelihood p(x|A), which is parameterized by its mean A. Furthermore, all derivations
in this chapter are made under the assumption that p(xlA) and p(A) are absolutely
continuous; however, most remarks and conclusions are easily extended to discrete
models as well.

A pictorial view of the components of a model are shown in Figure 2.7. In this
ﬁgure, 6 represents an estimator for A, the estimate of which is denoted by A. The
estimator is not an element of the model per se; however, since the objective of the
model is to facilitate the construction of the best estimator possible, the chosen model
depends on the desired estimator. Bayesian-based estimators can be shown to be op—
timal over all other types of estimators under most reasonable criteria as long as a
suitable prior distribution p(A) is available [23]. This is often not the case, however,
and constructing one is a difﬁcult problem in general. One very important contribu-
tion of this dissertation is, in fact, showing how to formulate practical prior densities
for general Poisson processes as well as for photon-limited tomographic processes.

In Section 2.4 we aim to show that under the criteria set forth below, the multiscale
framework is the right framework for formulating models of processes suitable for

Bayesian inferences.

29

2.3.2 Anomy, Accuracy, Precision, and Resolution Power

Two important concepts widely used in estimation theory for purpose of evalu-
ating the degree of an estimator’s goodness are unbiasedness and minimum vari-
ance. An estimator 5(x) of A is said to be unbiased if E[6(x)|A] = A for all
A 6 A; and it is said to be uniform minimum variance unbiased (UMVU) if
var 6(x) 3 var 6’(x) for all A E A, where 6’(x) is any other unbiased estimator of
A, and var 6(x) E E[(6(x) — A)2|A] [18].11

Thus, a biased estimator is simply one that on average incurs an error in its
estimation, i.e., one which is expected to systematically depart from the truth. On
the other hand, a UMVU estimator is one which, in addition to estimating correctly on
average, the expected departure from the truth (in a square-error sense) is minimum
among all estimators.

Clearly, while biasedness and variance are intuitive and useful concepts, they
represent only ﬁrst and second order measures of the quality of estimators and cannot,
therefore, give a complete assessment. Other measures exist for this purpose as
well (e.g., equivariance, risk, etc.), however, they also give only a partial picture
of the merits of estimators. In order to remedy this problem, we introduce four new
information-theoretic concepts: anomy, accuracy, precision, and resolution power.

Since our interest here is to qualify the degree of goodness of models as well as
that of estimators, we take the following unifying and general approach. First, we
take the view that models exist independently of any estimator, and that estimators
are meaningful only when associated with a model. Second, we only evaluate models
per se, but an estimator may be assessed in relation to a model by augmenting the

model with the estimator and then evaluating the augmented model. And third, in

 

11The tests for unbiasedness and UMVU are most often written as E[6(x)] = A and E[(6(x) —A)2] 5
E](6’(x) — A)2] in books of statistics [18, 24, 25]; but knowledge of A is implicit in these tests since
x is assumed to be the result of a unique non-random A. Here, we prefer to be explicit as A is a
random variable itself, and write the conditional dependencies in the tests.

30

order to unify this view with traditional stands where the various measures apply to
estimators rather than models, when necessary we consider the identity estimator,
i.e., 6(x) = x, in conjunction with a model and apply the various criteria to that
estimator. The end result is that whatever is concluded about the estimator applies
to the model itself. In this manner, for example, we could talk about the bias of
a model by considering the value of E[6(x)|A] = E[x|A]. For the restricted class of
models we are considering, this equals A; therefore, this class of models are unbiased.
In a similar manner, when referring to the new concepts to be introduced below, they
will apply to the model to which we have associated the identity estimator.

The simplicity and convenience of the proposed view of the relationship between

models and estimators will become apparent later in the section.

Precision

Let I (X ; A)12 denote the mutual information between the ensembles X and A of all

possible outcomes x and A, respectively. That is13

1904*)
p(X)

 

I(X;A) E [p(x|A)p(A) log ddi. (2.20)

As is well known, I (X ; A) represents the information that, on average, the outcome
x conveys about the rv A—or that the input A conveys about x—measured in number
of bits, nats, or decimal digits depending on whether the log function is base 2, e,
or 10, respectively. When the input and output (cause and effect) rv’s are scalar, for
example, a realization :1: determines the input A to within I (X ; A) bits of precision, on

average. If this quantity were inﬁnite, we would know with all certainty the precise

 

12For convenience, we sometimes simply write I (x; A); but we must remember this always repre-
sents an averaging process over the entire sample spaces X and A.

13The mutual information can equivalently be written as fp(A|x)p(x)logB-£%i—’)dedA, or
f p(A, x) log 2" )ddi, etc.

pApx

31

value of A that caused :c, and we would say that A is known with inﬁnite precision.
Clearly, this occurs when p(rlA) = 6D(2: — A), i.e., a Dirac delta function. Thus, it
appears only natural to deﬁne the precision ’P of a model to be the average mutual

information between the cause and effect signals:

79 a I(X; A). (2.21)

The chosen terminology is compatible with the common experience. For example,
the term precision is formally used among the instrumentation community to indicate
the degree of repeatability, over all possible readings, that a measurement can be made
[26]. That is, it represents the amount of information in number of signiﬁcant digits
that the reading (the rv x) conveys about the true state of events (the realization A).
And this is exactly what I (X ; A) represents: the net information (as an average of
positive and negative information) over all possible realization pairs (x, A) [27].

An equivalent and insightful deﬁnition for precision is obtained by interpreting
it as the entropy of the input A when quantized to the number of bits determined

14 The proof

by I (X ;A), averaged over all possible uniform quantization schemes.
of the equivalence of the two deﬁnitions is given in Appendix A.3, where notation
introduced later in this section is used.

The range of ”P is clearly the non-negative real numbers. A precision of 0 indi-
cates that x and A are independent and, therefore, the outcome of one conveys no

information about the outcome of the other. For discrete ensembles X and A, it is

sometimes convenient to talk of the relative precision ’Pr of a model which can only

 

 

14Often we think of quantization of a real variable 2:, for example, as the mapping y = 13,, [10":13] ,
where [-j is the integer part function, and where n determines the number of decimal digits of
quantization: n = 00 => no quantization, n = 0 => quantization to nearest integer no greater than
1:, etc. It is possible to deﬁne, however, a more general quantization operation as follows: for any
real 3 > O and r 20, y: %[s(x-—r)] +r.

32

take values between 0 and 1. A useful deﬁnition is given by

Pr 5 I(X;A)/H(A),

where H (A), or simply H (A), is the entropy of A, i.e., - f p(A) log p(A) dA. Clearly,
Pr may be expressed as a percentage as well. For the rest of this chapter, however,
only ensembles with absolutely continuous densities will be explicitly considered.
One interesting result stemming from the adopted deﬁnition of precision is the fact
that the precision P of any given model is always equal to or greater than the precision
P5 of any augmented model. That is, if the original model is given by the pair
p(xlA) and p(A), then the augmented model is given by the pair p(6 IA) = p(6(x)[A)
and p(A). This is a consequence of the Data Processing Inequality theorem well
known in information theory, and which establishes that I (X ; A) Z I (6(X ); A), for

any transformation 5 [27]. Thus, we have

mgr. am)

As stated earlier, the model p(xlA) = 60(x—A) is associated with an inﬁnite precision.

Inequality (2.22) is reminiscent to the Cramér-Rao inequality, which states that
the mean squared error of any unbiased estimator 6(x) of the parameter A is lower
bounded by the reciprocal of the Fisher information J (A) of A. Writing this in terms

of the reciprocal of the variance, the inequality becomes

 

var-1(6) may be regarded as a second-order measure of precision, but it is clear that

the inequality (2.22) is stronger for it involves the entire model in all its moments.

33

Anomy and Accuracy

While precision tells the average amount of information between a model’s input
and output, it does not give any indication about the utility of the information that
either x conveys about A, or that A conveys about x. For instance, it is possible to
construct a model in which the output gives much information about its cause (i.e., a
very precise model), but which is not all that useful as it is not all that accurate. This
would occur, for example, in a model characterized by a multimodal density p(xlA)
with very well deﬁned (“sharp”) modes. Figure 2.8 illustrates one such density when
A and x are scalars.

Assuming a fairly uniform p(A) and that the scenario depicted in Figure 2.8 is
representative of the model in general, the integral (2.20) takes on a high value,
because the regions on which log gills-l becomes large and positive are more likely
than those regions on which it assumes negative values. Thus, the model is highly
precise.

On average, an outcome .7: places the value of the input A to “within” P bits from
its true value. However, since the real line may be partitioned in an inﬁnite number
of ways with any given ﬁnite number of bits,15 the model does not necessarily specify
a region which includes the input A, in which case the accuracy of the model is poor.
In the example, the P bits of information conveyed by the realization :1: places A
to within one of the ﬁve regions where p(rIA) is greater than p(zr), i.e., within the
neighborhoods of the modes, none of which include the input A. Therefore, the model,
while precise, is inaccurate.

This example illustrates a situation where the realizations {13,} are, on average,

far removed from their input A despite the fact that $327,123,- —+ A as n increases

 

15For example, one bit of information may partition the real line into the positive and negative
halves—the sign bit—or may discriminate between numbers with a fractional part in the [0, .5) or
[.5, 1) intervals—ﬁrst bit to the right of the binary point—and so on.

34

 

pp(XI>~)

/ p(XIX)

p(X)
l/\'
U‘ ,

A-p A A+p x

 

 

 

 

 

Figure 2.8. The various probability densities deﬁning accuracy. p(xlA) and p(x)
are pdfs associated with the given model, the precision of which is P. pp(x|A) and
pp(x) E f pp(x|A)p(A) dA (not shown) are auxiliary pdfs with the characteristic of
representing the same precision P, while pp(x|A) is the pdf with the smallest support
about A.

without bound—Khinchine’s Strong Law of Large Numbers [28].

It is important to note that the degree of accuracy of a model is not an indication of
its truthfulness. A model, i.e., the joint density p(x|A)p(A), is always assumed correct
given the available information. This is not to say that models of varying degrees
of accuracy cannot be constructed for the same process; the alternate models may
differ in precision as well. Although for particular input-output pairs of realizations
a model may be highly inaccurate, on average a (correct) model will possess some
degree of accuracy, and its precision bits of information will be useful in that same
measure.

If in the example of Figure 2.8, p(x|A) had been unimodal with a very sharp mode
at a: = A, the model would have been highly accurate, and all information given by
:r would serve to identify the location of A within its neighborhood. This goes to
illustrate that accuracy is some measure of the spread of p(xlA) with respect to the

most accurate (MA) distribution, which we deﬁne to be one that conveys the same

35

amount of information—one that has the same precision—but whose information
most effectively identiﬁes the neighborhood of each A responsible for each outcome
x. This requirement brings to mind UMVU models, but as was argued earlier, un-
biasedness and minimum variance are only ﬁrst- and second-order measures of the
desired property in the model.

A most general measure of the “spread” of a distribution is given by its en-
tropy. So, we reason that the accuracy .A of a model must be related to the
difference H (X IA) — Hp(X IA) of conditional entropies corresponding to the ac-
tual conditional density p(xlA) and to the MA distribution, which we denote by
pp(x|A).16 Here, H(XIA) E — fp(x|A)p(A)10gp(x|A) dA dx, and Hp(X|A) E
- fpp(X|A)p(A) log PplxlAl d’\ dx.

The MA distribution pp(:r|A) for the particular input A in the above example is
shown in Figure 2.8. In general, pp(x|A) is deﬁned to be a uniform density centered
at x = A and of support the N -dimensional cube C (A; p) of sides 2p. We denote it by
“p(x — A). This choice is motivated by the fact that of all the probability densities
of compact support with a given entropy, the uniform has the smallest of supports,
and thus, is most concentrated around its center [29].

To be meaningful, the MA distribution must. convey the same precision as is
conveyed by the actual model. Denoting the average mutual information between a

process’ input and output under its MA model by Ip(X; A), i.e.,

Ip(X;A) E [pp(x|A)p(A) logggf—Il—S-A dA dx, (2.23)

where pp(x) E fpp(x[A)p(A)dA, we require that the “radius” p of pp(x|A) be such
that [p(X; A) = I (X ;A). Clearly, this requirement. makes sense only if [p(X; A) can

 

16We have departed from the convention of denoting the rv as a subscript for its density function.
The subscript p in pp(x]A) denotes only a parameter.

36

be shown to be a one-to—one function of p, for otherwise, p would not be uniquely
determined for a given precision of value I (X ; A). In Appendix A.2 we show this to
indeed be the case.

Although p is not a true radius, thinking of it as such is more descriptive, and
helps to better interpret its signiﬁcance. We shall call it the equivalent precision
radius or EPR, for short.

Although appealing, measuring the “spread” or “disorder” of the actual model
with respect to that of the most accurate model by the difference of their conditional
entropies presents a fundamental ﬂaw, which we now illustrate with a simple example.
Suppose that p(xlA) also represents a uniform density of support of radius p, but
which for every x, is centered so that it does not overlap with p,,(x|A).l7 This situation
results in a difference H (X IA) — H p(X IA) = 0, leading us to conclude that the actual
and MA model have the same accuracy as well as precision. This is clearly not the
desired result, as the actual model is completely inaccurate, for its information is
misleading for every outcome x.

We now introduce a new measure of the “disorder” of a model which gauges the
average degree to which the precision bits fragment the IR” space in determining the
probable input sets for each possible outcome. We call this measure the anomy N of

a model, and is deﬁned as

N E D (pp(XI»\)p(A)l|p(XIA)p(A))a (224)

where D(p||q) is the Kullback Leibler distance or relative entropy between the densities

 

l7Strictly, this condition is not realizable if the model is to be correct, however, close approxima-
tions may be constructed.

37

p and q, i.e.,

= _, 0 an
D(P($)[[(1(17))—/P( )1 ”(1(1) dw.

More speciﬁcally,

D (WWWIIp<xIA>p<A>> = fpp<xIA)p(A) 1035713)“) «max. (2.25)

We note that the Kullback Leibler distance is not an ordinary distance in the
mathematical sense, for it fails to meet all the conditions of a metric. In particular,
it is not the case that D(p||q) = D(qllp) for all densities p and q, nor does D(p||q) = 0
imply that p = q. Nevertheless, this information-theoretic measure of the “separa-
tion” between distributions possess several desirable properties. For example, for all
densities p and q, D(p||q) Z 0 and D(pllp) = 0. Furthermore, because pp(x|A) is uni—
form conveying the same average information as p(x|A), D(pp(x|A)p(A)||p(x|A)p(A))
equals zero only if p(xlA) = pp(x|A) everywhere except, perhaps, within a set of mea-
sure zero. This implies that the anomy of a model can not be zero unless it is the
MA model. We prove this assertion in Appendix A.4.

We can establish the similarities and differences between anomy and the difference
of entropies discussed earlier by expressing the log function in (2.25) as a difference

of log functions: on one hand we have

dAdx—/pp(x|A)p(A) log 1 dAdx, (2.26)

N: fp,(x|,\)p( )log— W W

1
W IA)
while on the other

H(AIX)- H(=A|X)

1 1 (2.27)
/p(XIA)p(«\ )log— W IA) dAdx— / pp(x|A)p(A) log 2W dedx.

38

These two expressions are remarkably similar; however, the averaging process carried
out in (2.26) is obtained with respect to the same joint distribution pp(x|A)p(A),
whereas in (2.27), it is not. Thus, the ﬁrst case represents an average of relative
spreads, while the second case represents a difference of average absolute spreads.
As a consequence, anomy has the desired property of being a most general measure
of separation between actual and the MA model without the shortcomings of the
entropy-difference measure.

Accuracy18 .A of a model may now be deﬁned very naturally as

A E exp(—N), (2.28)

if the anomy of the model is measured in nats; otherwise, we take A = 2‘N or
A E 10‘”, depending on whether the log base being used is 2 or 10. When the
anomy is zero, meaning that the model’s average P bits of information conveyed
by x about A completely determine the input’s neighborhood, the accuracy is 1 or,
equivalently, 100%, as expected. In the opposite instance, for an inﬁnite anomy——
indeed, a complete disorder in the information conveyed—the accuracy is zero.

This extreme case should only occur when the model is erroneous; however, it
is easy to construct a model whose anomy is inﬁnite, and yet contains much useful
information. In a two—dimensional case, for example, if p(xlA) resembles an upper-
half toroid with inner radius greater than J2 p and centered about A so that p(xlA)
is zero within the square C (A; p), N will be inﬁnity.

To remedy this difﬁculty inherent in the original deﬁnition, we give the following

 

18In classical estimation theory, the inaccuracy of an estimator is measured by the corresponding
risk function [18].

39

more general deﬁnition:

N E D(pp(X)Hp(X)) (229)

In Appendix A.5, we demonstrate that this and the deﬁnition given in (2.24) are
consistent. That is, for “non—pathological” models, the two deﬁnitions are completely
equivalent. If (2.29) ever becomes inﬁnite, we interpret the model as one of zero

accuracy.

Resolution Power

It is natural to assess the “value” of a model by its ability to effectively convey useful
information about the process. The amount of information conveyed is measured by
its precision, and its utility is measured by the model’s accuracy. Therefore, we may
measure the quality of the model by the product of precision and accuracy. We call

this the resolution power, or simply, the resolution of the model, and denote it by R:

R 2 AP. (2.30)

Essentially, two main mechanisms exist by which one can modify the resolution
power of a model, each requiring the introduction of new information into the model.
One approach consists of replacing the prior model p(A) with one that reflects newly
found information about the behavior of A. This information may come from a
better understanding of the processes that precedes the one of interest, or from simple
observations of the signal and a “frequentist” induction of its behavior, for example.

The second mechanism to modifying a model’s resolution consists of adding an
estimator as shown in Figure 2.7, and incorporating it into the original model. In this

case, the estimator function incorporates new knowledge about the relation between

40

the model’s original input and output, knowledge which may be exploited for a better
representation of the input.

Strictly speaking, from the assumption that a process is described by its joint
density p(x|A)p(A), modifying its model results in a different process than the one
intended to be studied. For convenience then, we take the more general view that a
model may be modiﬁed by updating the prior density p(A) and/ or by incorporating
an estimator 6(x), and still represent the original model, but augmented with the
new information. The key idea here is the invariance of the input and the observable
output under consideration; y is simply a mathematical transformation of x, but not
a new observable output per se.

After the model has been modiﬁed by either of the above methods, the resulting
model may have a lower or higher resolution power than the model from which it
originated. T 0 see this, we ﬁrst consider the prior-modiﬁcation mechanism, and realize
that it is always possible to ﬁnd a monotone function g such that Fnew(g(A)) = F (A),
where F and Fnew are the cumulative distributions of the original and new priors,
respectively [30]. Consequently, the new prior simply constitutes a ru transformation
Anew = g(A) and, by the Data Processing Inequality theorem of information theory,
the new model’s mutual information can only be less than or equal to the mutual

information of the original model [31]. Thus

Pnew S 73-

Updating the prior density of a model, however, can increase, decrease, or leave
unaltered the corresponding model’s anomy by simply enhancing, reducing, or leaving
alone the likelihoods of the sets of A for which the “separation” between pp(x|A) and
p(xlA) are greatest.

As for the second mechanism for modifying a model, we saw in page 33 that be-

41

cause an estimator constitutes a transformation of the output, by the Data Processing

Inequality theorem

7353?,

where P5 and P represent the model’s precision with and without the estimator. On
the other hand, we have that since the estimator in essence gives us a new conditional
density p(dlA), this too can modify the anomy upwards or downwards.

Although these results may appear disconcerting at ﬁrst, for we may anticipate
that new information always increases the resolution power of a model, they are
indeed what we should expect. For example, when updating the prior and ﬁnding
the resolution lowered, it indicates that the leverage given by the model in learning
about the input from the output is not as signiﬁcant as it was before we gained
the new information about the input. Likewise, when ﬁnding that the inclusion of
an estimator only reduces the resolution of the model, it indicates that the newly
incorporated information is not constructive, either because it misleads (negative
information), or because its gain in accuracy does not offset the loss of precision.

In Section 2.4.2 we will ﬁnd that resolution power can be employed as a useful

ﬁgure of merit for comparing models.

2.3.3 Two Illustrative Examples

In this section we present two examples that will help to illustrate the newly intro-
duced concepts of anomy, accuracy, precision, and resolution power. These examples
are also intended to give practical results that may be used for further work. In

particular, the results of example 1 are used in the next section.

42

Example 1

In this example we calculate the accuracy, anomy and precision of an N -dimensional
Gaussian model. The antecedent or input A is modeled as a vector of jointly Gaussian
elements centered around a ﬁxed vector u. and covariance matrix T. The consequence
or output x is also Gaussian distributed. Its mean is A, and its covariance matrix is
denoted by K.

The entropy H (X IA) is found in [29] to be %10g{(27re)N det K}. Since this
result is independent of A, the conditional entropy H (X IA) 2 E}, [H (X IA)] also
takes this form. Likewise, the entropy of the input is H (A) = %log{(27re)N det T}.
Because AIx is also Gaussian distributed, in this case with covariance matrix
K(K+T)‘1T, H(AIX) = %10g{(27re)N det K(K + T)‘1T}. Therefore, from the
identity I (X ; A) = H (X ) — H (X IA) we obtain the corresponding model’s precision:

_ 1 det(K+T)
P —— 5 log{ detK }. (2.31)

The units in this case are nats because the log function is taken to be base 6.

From the deﬁnition of anomy,

1
ddi—fpp(xIA)p(A )log——— ddi.

N: /p,,() xIA)p( A)log-—— ppx( IA)

le 1IA)

In Appendix A.2 we show that the second integral evaluates to log ICI. Since

pp(xIA) = Ll,,(—x A) and p(xIA)= Wexpt—é—(X— A)TK-1(x—A)), the ﬁrst

43

integral can be also expressed as

-1
l—Cl p(A)] log p(xIA) dx dA
RN C(Azp)

 

1
= —log{(27re)NdetK} +—1— p(A)/ l{(FC—MTK—1lx—All dXdA
2 IC I RN C(A;.o) 2
1 N 1 2NpN+2 -1
= — log {(27te) det K} + —- p(A) tI‘K 61*
2 IC] RN 6

1 2
= 5 log {(27re)N det K} + 86— trK’1
(2.32)

Therefore, the anomy is given by

1 we N p2 _1

and consequently, the accuracy is
2P2 N/2 p2
A = (7}?) V dell K-l exp (_E trK-1) . (2.34)

In general, it is difﬁcult to ﬁnd an exact closed form expression for the radius p
which satisﬁes the equality 1,,(X; A) = P. When the models are not speciﬁc enough
to apply numerical techniques, we have the option of considering special cases which
allow practical simpliﬁcations that make the resolution of p manageable. The next

example is one important case.

Example 2

We consider the same Gaussian model of Example 2, but with the added restriction

that the covariance matrices K and T of p(xIA) and p(A), respectively, satisfy
max{k,-,,- I K = {ki,j}} << min{t,-,,o I T = {tiJ}}-

44

That is, p(A) is a very smooth (slowly varying) function relative to p(xIA).

This model often describes realistic events where x represents the signal A per-
turbed by noise whose power is much smaller compared to the signal’s own power.
In these cases, p(xIA) is the density of the noise, and the diagonal elements km- are
simply the power in the individual noise elements.

In Appendix A.6 we prove that for sufﬁciently smooth p(A) relative to p(xIA),
[p(X;A) z H(A) — log [C], (2.35)
which we shall write as an equality. Then, from Example 2,
Ip(X;A) = élog{(27re)N det T} — log(2p)N.

From [p(X; A) = P, where P is given in (2.31), the square of the EPR is calculated

to be

 

, 7re (detK detT)1/N

’0 = 2 det(K+T)

By applying this expression to (2.33) and (2.34) we obtain the model’s anomy and

 

 

 

accuracy:
1 det(K+T) 7re detK detT UN _1
= _ __ _ t K , 2.36
N 21°g( detT )+12 (det(K+T)) r ( l
and
detT 7re detK detT 1/ N
= __ __ t K"1 . 2.37
“4 I/det(K+T) €pr 12 (det(K+T)) ’ I ( )

To gain insight into the nature of the accuracy and precision of this model consider

the scalar case. With N = 1, let a2 = detK and '72 = det T. From the original

45

assumption we have that 02 << 72, and (2.31) and (2.37) reduce to19

 

P = 110g (02;???) z log (7)

2 E

A 72 we 02 72 _1
= ————- ex - z e .
02 + '72 p 1202 02 + 72

Thus, for the scalar doubly random Gaussian model where the “noise” power is small

and

a
n

N

 

compared to the power of the signal, the precision of the model increases logarithmi-
cally to the inverse decay of the noise power. This relationship certainly appeals to
intuition, especially when written in terms of the signal-to—noise ratio and expressed
in bits: P z .166 SNRdB (bits). As an example, we have that in order to achieve a
precision of 8 bits, an SNR of 48 dB is required.

Meanwhile, we observe the peculiar behavior of the model’s accuracy of being
approximately constant for all values of signal and noise powers (within the allowed
ranges.) Therefore, the resolution power behaves like the precision within a factor of

close to one-half, i.e.,

R = AP z .4910g (3).
For example, using this model with an SN R of 48 dB, the outcomes :1: convey, on
average, 4 bits of information about the true value of their corresponding antecedents
A. The other 4 of the 8 bits of precision also discern the inputs, but to within sets
too “dispersed” to be constructive. Only with a detailed study of the forms of p(xIA)
and p(A) could one possibly exploit the extra information for a particular outcome.

In fact, for the models where p(xIA) are not unimodal, on average the extra bits may

be outright misleading.

 

19The square of the EPR becomes p2 = 329 (%2—'—’:,) z 12302, and p z 20.

46

2.4 The Multiscale Modeling and Estimation Ad-

vantage

Early in the chapter we asserted that coarse-scale models may be constructed more
accurately than their ﬁne-scale counterparts, but that the coarse-scale models are
necessarily less precise. Our argument was in no way rigorous, and was based only
on an intuitive notion of the concepts of precision and accuracy. Now, we are in a
position to formalize this statement. using the deﬁnitions introduced in Section 2.3.2,
and give some insight into the conditions under which it holds.

We shall refer to the ability of trading accuracy for precision (and vice versa)
as we move through scales as the Accuracy/Precision (A/ P) property of multiscale
models. Likewise, we shall say that a model having this property is an A/P model
and satisﬁes the A/P conditions.

Perhaps the major consequence of the A/ P property is in estimation. For models

possessing this property, the following estimation approach is possible:
Under the multiscale framework, the estimation process may be started at a coarse
scale, where the model is typically highly accurate. Then, one may proceed with the
next ﬁner scale, leveraging the new estimate with the accurate estimate from the pre-
vious coarser scale. By proceeding in this fashion, moving in each step to the next
ﬁner scale, a sequence of increasingly more precise and highly accurate estimates can
be obtained.

In Chapters 3 and 4, we employ this estimation approach with great success in the

“recovery” of the underlying intensity signals which give rise to Poisson processes.

47

Aao

 
 

   
 
 

x21 A20

x2.0
, /
/ ,/ / \\
\ ;
"1.0 .1 x12 x13 ALO Am A1.2 A'1.3
x0.0 xo.1 x02 xo.a xo.4 x0.5 xos X01 A-o.o Am A02 A03 ADA Ans Ans Arm

(3) (b)

Figure 2.9. M ultiscale representation of a 1-d model of size N = 8. (a) Observations

x, = (se,-,0, . .. , at]. g _1) at the various scales j, and (b) their corresponding intensities
12]

A,- = (A130,... ,Ajyg_1). In this example, a two-to—one relation between elements

2
of adjacent scales is illustrated; however, more generally the relations are m-to-one,
where 2 S m g N.

2.4.1 A/P Models and their Properties

A model for the highest resolution representations A0 and x0 of the input and output of
a process is given by the joint probability density p(xo, A0) = p(xOIA0)p(Ao). Likewise,
the model corresponding to the j-scale representation of the same process is given by
p(xj, A3) = p(ijAj)p(A,-). Recall that the subindex 3' denotes the scale, with higher
numbers corresponding to coarser scales (see Figure 2.9.) We denote the anomy,
accuracy, and precision associated with this model by AG, Aj, Pj, respectively.

The estimation process of progressing from coarser to ﬁner scales so that more
precise and highly accurate estimates are obtained at each step, derives from the
Accuracy / Precision conditions assumably satisﬁed by some multiscale models. These

conditions are summarized below.

48

 

 

 

The Accuracy / Precision Conditions
j—scale Model (j + 1)-scale Model
P(ijAj)P(’\jl P(Xj+il>\j+ilp(’\j+1)
P,- P,- > 17,-“ 79,-“
Aj Aj < Aj+1 Aj+1

 

 

 

Throughout we have conﬁned all discussions to ﬁnite length signals and their
models, but we believe extensions can easily be made to models of signals with inﬁnite
lengths, which appear to represent the simpler case; we do not pursue them here,
however.

Whether the A/ P conditions are met at any given scale j depends on the form of
the model p(ijAj)p(AJ-) and on the particular transformation used to generate the
next scale model. Thus, a model p(onA0)p(A0) is an A /P model if a set of orthonor-
mal linear transformations {Wj}20 exists such that each of the models p(ijAj)p(AJ-)
satisfy the A/ P conditions.

The models of interest here all satisfy (1) EIijAJ-I = A], and (2) p(xj+1IAj+1) =
p(xj+1IAj) (i.e., Aj+1 is a sufﬁcient statistics for xj+1). If in addition to these condi-
tions, we impose the mild restriction that for all scales (3) H (AjIxJ-H) > H (AjIxj)

(i.e., that the model be non-trivial), the inequality Pj > Pj+1 of the A/ P conditions

 

20It is understood from our previous discussions that the set of transformations must be such that
A1+1
= . A-
( 9M W] J

where Aj+1 is a “coarser scale representation” and with half the length of Aj. When the transfor-
mation is a wavelet transformation, 0j+1 is the vector of wavelet coefﬁcients at scale j + l.

49

is trivially satisﬁed:

791‘ = I(Xj;1\j) = H(A)) - H(AjIXj)
> H(Aj) - H(AjIXj+1l = I(Xj+1;AJ-)

Z I(Xj+1.Aj+1) = 77341-

The second inequality in this expression is due to the Data Processing Inequality
theorem, because the mapping A, H Aj+1 is non-invertible (see Section 2.2.2).

The idea of non-triviality of a multiscale model is perhaps better understood by
considering the idea of a trivial model. We deﬁne a trivial model to be one that for
any valid scale j, the output x,- conveys no more information about the input A,- than
xj+1 does.

This concept is analog to that of ﬁnite bandwidth signals, where we regard an
oversampled signal S, to be a trivial representation of its decimated counterpart 31-“.
In this instance, the oversampled signal conveys no more information than sj+1 does,
and so, it is also the case that H(tJ-Isj) = H(thsj+1) for any rv tj.

Now we can add to the requirements that a model must meet and show that
under the augmented set of conditions the inequality Aj < A)“ is also satisﬁed. Our
effort in this direction has only produced conditions which are much less intuitive
than the required and sufficient condition JV,- > AG“. This is because, in addition
to the already intuitive nature of anomy as a “distance”, this measure is much more
encompassing than more traditional ones involving only a ﬁnite number of statistical
moments. Consequently, it becomes cumbersome trying to characterize general A / P
models using conventional statistics.

In Appendix A.7, we develop a sufﬁcient condition for the accuracy inequality.
We use it in the following example as a starting point to delineate a general approach

to ﬁnding a set of transformations {Wj} for a Gaussian model such that the A/ P

50

conditions hold. We use the notation of Example 2 of Section 2.3.3, and those of
Appendix A.7. Please refer to those sections as needed, especially Figure A.1. Also,
recall that IC(A,-; p,-)I = (2p,-)NJ‘ denotes the volume of the cube C(A,; p,) centered at

A,-, of sides twice the EPR p,-, and of dimension N,- = N0/2j.

Example

Condition (A.12) may be simpliﬁed by noting that A,-+1(a) 2 0. So, a new (and more

stringent) condition for the accuracy condition to hold is

log (— ”J ) < / p(A.) {Au-(pi — aim} dA.. (2.38)
pj-H IRNJ'
Resolution of the integral— f( p(A p)dA, was already obtained in (2. 32):

2.
—/ P(’\lej(PldAj =llog{(27re)N1detK,-}+ p_, trKT‘l.
11W 2 6 3
And the integral f p(A,-) A3(p) dA, can similarly be solved for:

, 1 P2 _
[1th p(A,)A,-(p)dA,- = —§log{( (27re) N1 detW, K ,WJT} — J——+-l tr rW,-K,- 1W;

= —-;—log{( 27re) NidetK- ,} — p—Jé-Z‘th trKj'l.

Recall that K, is the covariance matrix of the N, --dimensional Gaussian pdf p(x, IA,).

trTK
Substituting these two results in (2.38) and letting a,= _ -6—\{— the condition 18

p.
log (—J-) < a,- (p? — p3“).
Let f(p,-+1) E log (31:1) — a,- (p?+1 — p?), then we have as a condition
up...) > 0. (2.39)

51

f (p,+1) is a concave function with one maximum at p,+1 = 1/ M23}. The set of
p,+1 > 0 for which (2.39) is satisﬁed is always an interval, except when p,- = 1/ J23},
where there is no solution at all. If p,- > 1/\/2a—,-, the interval of solutions lies
immediately to the left of p,-; and if p,- < 1 / «271,-, the interval of solutions lies
immediately to the right of p,. Because p,- and a,- are already known quantities at
the time of search for the transform W,- that satisﬁes condition (2.39), we can readily
determine whether we should seek a transform that induces a new EPR that is greater
than or less than p,.

Let (1 Z Z (N, be the eigenvalues of K ,-, and m1, . .. ,mN, the corresponding
eigenvectors normalized to 1, i.e., IIm,II = 1. Since the axes of the Nj-dimensional
ellipsoid Q,(z) E zK,-'lzT = 1 are ﬂ}... ,\/Z_N_j [32], the point 1/\/27,- is the
square root of the harmonic average of the square of the ellipsoid’s axes, scaled by
V3.

Ignoring this last factor, this averaging operation heavily favors the smaller axes,
and so, if the number of large axes is not disproportionately greater than the number
of small ones, it is convenient to begin the search for the desired transform by choosing
W,- to be such that p,+1 takes the smallest value possible among the values of the
ellipsoid’s axes. By starting with the smallest possible value and incrementing it at
each step of the algorithm, we are assured not to miss the interval of solutions. So
we shall proceed in this fashion of favoring simplicity over optimality of search.

Let us ﬁrst assume that p,- > 1/\/2—d,. Then, our initial choice is W,” = M? E
[m1 , . . . , mNj], which produces the smallest EPR p?+1 possible because p(x,-+1|A,-+1) is
obtained by integrating p(x, IA,-) over the space spanned by the ﬁrst N , / 2 eigenvectors
m1, . . . , mN, )2. Again, these correspond to the direction of slowest change of p(x, IA,)
[32]. Consequently, as p(x,-IA,-) is projected onto the x,+1-(w,-+1 x x,+1) hyper-plane
it becomes the density p(x,+1IA,-+1) with the smallest possible proﬁle. See Figure 2.10

for a depiction of this at one step ahead in the iteration process.

52

 

 

 

 

 

 

 

C(Aj§ PI)
"j+|
wj+l
Cl)» 1;P~ I)
1+ 1+ C( xj+liej+l ; pj+l)
A'jM
Q,-(2) = c
W,- ,
._/ij
Figure 2.10. Method for obtaining consecutively increasing EPRs p9+1,p,1.+1, . . . for a

Gaussian model. The transform W, gets updated so that the projections of p(x,IA,~)
onto the x,+1-(w,-+1 X x,+1) hyper-plane progress from its narrow side to its broadside.
The process is depicted at an intermediate step near the beginning of the algorithm.

If p,-+1 = p?+1 does not satisfy condition (2.39), we next rotate the axes w,-+1—A,-+1
a small angle in a direction which tends to broaden the projection of the density and
make the corresponding EPR bigger. The possibilities of rotation paths increase very
rapidly with N,; in fact, when N,- > 2 the number is clearly uncountable. One simple
approach, however, is to take W,1 = M}, the ﬁrst rearrangement of M? corresponding

to a reordering of the eigenvalues obtained by interchanging the middle two:
C1 CNj/z—i CN,/2+1 CNj/2 CN,/2+2 (N,-

Thus, AI} E Imi, , mN,/2—l» mN,/2+1, mpg/2, HIM/2+2, 1 mm]-

53

If (2.39) is satisﬁed with the presently computed p,+1 = p? +1, we still move one
step forward in the algorithm just as if the condition had not been satisﬁed. The
idea is to proceed until (2.39) is satisﬁed with an EPR sufficiently close to 1/ m,
since this will insure a large spread between ./V, and MM,” and therefore, a greater
advantage when estimating within the multiscale framework.

If (2.39) is once again satisﬁed with the new EPR p,+1 = p]+1 induced by W,1 =
M}, but [pg+1 — 1/\/2—o_,-I < Ip,1-+1 — UM], we stop; otherwise, we proceed using
M}, the second rearrangement of A1,. In this case, we interchange the next two
columns of M,1 nearest to the center but which have not been exchanged previously.

This corresponds to the eigenvalue reordering

C1 . . . (NJ/2—2 (NJ/2+2 (NJ/2+1 (NJ/2 (AG/2‘1 (NJ/2+3 ° ° ' (Ni.

We proceed in this manner until (2.39) is satisﬁed and we can no longer reduce
the distance between the computed EPR and 1/ \/2—a,.

If at the outset p,- < 1/ V271,, the approach is the same as before, but in this case
we do not test (2.39) while p,+1 < p,, for we know the solution, if one exists, must lie
to the right of p,-. Once a suitable transform W,- is found, the search for W,+1 may

begin in a similar fashion.

Comments

Several points are worth noting here about the searching approach just delineated.
First is the fact that some intermediate transforms may be required between two
consecutive trial transforms M; and M,“ if p311, “overshoots” the target value. It

is easy to envision methods for constructing the sequence of transforms AI; which

 

21Since we are attempting to satisfy only a sufﬁcient condition, it is clear that the optimal EPR,
one for which the difference IN, — .N',+1I is maximum, may be found to be other than 1/ ,/2o,-.

54

represent ﬁner rotation changes at each step than those provided here. One possible
way would be, for example, to consider all possible permutations of the eigenvalues
of K ,, and construct the transforms M,’ of normalized eigenvectors with the corre-
sponding ordering which generates the monotonic rotational increasing sequence from
M]? to MJN’I—l = [mN,, mN,_1, ... , m2, m1]. We shall not dwell on this issue as our
intention is solely to illustrate the general idea as to how one may ﬁnd the set of
transforms {W,}. ‘

Another important point is that the sequence of transformations obtained in this
manner, and which guarantee that N,- > .A/,+1 at all valid scales, may be viewed as
a new multiscale transform which is statistically motivated. This is in contrast with
wavelet based and other existing multiscale transforms that operate in the time or
space domain. Clearly, the possibility exists that the two types of multiscale ﬁltering
coincide for a class of models, which we would call scale ergodic models. For this class
of models the A/ P conditions may be satisﬁed through the following mechanism.

From the discussions of Section 2.2, we know that the coarser scale representation
x,+1 of x, is obtained by a linear transformation that produces elements {x,+1,,-},- that
are more highly correlated than the elements {x,,,},- of x, are. Consequently, for the
models of interest here, for which E[x,IA,-] = A,, the density function of x,+1IA,+1
tends to be sharper than the one for x,IA,-, concentrating a greater probability mass
about its mean. This, in turn, tends to reduce the size of EPRs, leading to a better
match between the densities p(x,IA,) and pp, (x,IA,-), and so, increasing the accuracy
of the model with scale.

On a more general point, determining at the outset whether a model admits trans-
formations W, such that (A.12) is satisﬁed at each scale is, in general, a difﬁcult task,
and we believe that each case, or at least, each family of models, needs to be looked at
separately. We defer this effort for future work. However, under the assumption that a

model does admit such a sequence of transformations, we can use condition (A.12) to

55

aid us in their search in a manner very similar to what we have done here—although,
this may entail an intensive iterative process. With knowledge of the sequence of
transformations, we can generate the sequence of multiscale representations {x,} of
the data x0, which can then be used to exploit the estimation advantage referred to

earlier in the section.

Coarse-Scale-Data Limited Models

Earlier in the chapter we stated that “while modeling of phenomena based on coarser
scale information alone may be more accurate, it can only be achieved at the expense
of precision.” We shall say that a model that possesses this property satisﬁes the
coarse-scale-data limited model conditions, or CSDL model conditions, for short, and
refer to such models as CSDL models. At that time we also asserted the equiva-
lence of this phenomenon to the fact that coarse-scale models may be constructed
more accurately than their ﬁne-scale counterparts, but always at the expense of pre-
cision, which is the deﬁning property of A/ P models. We relegate the proof of this

equivalence to Appendix A.8.

2.4.2 Bayesian Multiscale Models and Estimation

In the past section we have seen that an advantage exists in estimating an underlying
intensity within a multiscale framework whenever the model is an A / P model. How-
ever, the advantage is only potential, because whether it is realized or not depends
not only on the model itself, but also on the particular estimator used at each scale.
To illustrate this we take an extreme example.

Suppose the model at hand is indeed an A/ P model, and suppose that within
the top-down multiscale leveraging estimation framework we obtain at each scale an
estimate A, of A, by randomly choosing a signal among all those that satisfy A,+1 =

A N' A
W,A,. Here, W, E Iii-21"”), and A,“ denotes the previously obtained estimate for

56

A,+1, which at the coarsest scale J considered we let A; = x J. Clearly, the potential
advantage in this case does not, on average, lead to a better estimate of A0 than one
would obtain, for example, by simply letting A0 = 1:0. It is not hard to see that under
the ﬁrst estimation scheme one has P0 = 0 and N0 = 1, and thus R0 = 0; while for
the second method, Po and N}, are the original values corresponding to the model
p(xOIAo) p(Ao), which for practical models give R0 > 0.

This example highlights the nontrivial aspect of how to exploit available infor-
mation in a multiscale estimation scheme. So, we ask what is the optimal estimator
6(x,) for A, that. most efﬁciently and systematically exploits knowledge of the esti-
mates A,+1, . .. , AJ? Clearly, the answer must satisfy two conditions: the estimator
must have the means to extract all relevant information (patterns) from coarser scales,
and it must optimally reduce the estimation error.

It is well known that Bayesian estimators are optimal in the mean-square error
sense, and in fact, under various other reasonable criteria as well, whenever a suitable
prior density of the antecedent is available [23]. Furthermore, since a Bayesian-based
estimator can make explicit use of the posterior distribution p(A,IA,+1, . . . ,AJ)—the
most comprehensive statement of A, ’s dependency on the prior information—they are
the most natural and optimal multiscale estimators of choice. A Bayesian multiscale
estimator 6 E {6(x,)}22 can be formulated as follows.

For the desired estimate A0 = 6(x0) to be optimal, each indiviual estimate23

A, = 6(x,) must also be optimal. The optimal estimate at scale j is the posterior

 

22Note that each estimator 6 is scale dependent, but we let the argument indicate this. That is,
6(Xj) # 6(Xj+() for z ¢ 0. A

23When referring to the estimate A, = 6(x,), it is clear that x, is taken to be a realization of
the corresponding random variable, which we also denote by x,—a clear abuse of notation. Here,
however, the terms estimate and estimator are used interchangeably in order to reduce some of the
repetitive vocabulary.

57

mean

A

E [Aj IX,,A,'+1,... ,XJ] =/Ajp()\jIXj,Xj+1,... ,XJ)dAj. (2.40)

X.-

A Bayesian approach facilitates the solution of (2.40) by expressing it in terms of
the known model distribution of x, given A,. Applying Bayes’ theorem to (2.40)
and using the equalities p(x,IA,,A,+1, . .. ,AJ) = p(x,IA,) and p(A,IA,-+1, . .. ,AJ) =

p(A,IA,+1)24 we obtain the desired form

 

X, = fAj p(ijAjllXAJiAj-H) dAJ (2.41)
f p(ijAjlplAjIAj+l) dA,

The prior p(A,IA,+1) can be obtained from the distribution p(A,IA,+1) allowing
for uncertainty in the deviation of A,+1 with respect to A,+1. If the data x,, and
therefore, x,+1, . . . ,XJ, are determined to be reliable (accurate), then p(A,IA,+1) can
be taken to be well approximated by PJyIA,+1(’\jIXj+1l- Since for A/ P models, the
accuracy of the data increases with scale, the approximation is increasingly better
with scale as well.

This represents an important advantage of the multiscale estimation framework:
While Bayesian estimation facilitates the realization of the multiscale leveraging es-
timation advantage by fully exploiting all available information in the scales, a mul-
tiscale model enhances the reliability of the Bayesian estimator by virtue of the more
accurate data at those higher scales—the A/ P property.

Another beneﬁt of the multiscale description of a Bayesian estimator is that the

formulation of realistic priors {p(A,IA,+1)} is often much simpler than devising the

 

24Clearly, p(A,IA,-+1,A,+1,... ,AJ) = p(A,IA,+1), and assuming the optimality of the
estimate Aj+1, it is alsci reasonable to assume p(A,+1|A,+1,...,AJ) = Ap(A,+1IA,+1).
Therefore, p(A,IA,-+1,... ,AJ) = f£(AjIAj+1,A,+1,... ,AJ)p(Aj+1IAj+1,... ,AJ)dAj+1 =

fp(Aj|/\j+1)P(Aj+1|X3-+1) dA,+1 = MMI/‘2'“)-

58

prior p(AO) alone. In the next chapter, we elaborate this point further, and show how
to construct a practical prior especially suited for general Poisson processes.

Also, within the multiscale Bayesian approach, it is sometimes possible to formu-
late the problem such that the inferences can be computed very efﬁciently. This is
particularly true if the prior is chosen to be conjugate to the conditional density of
the model [23]. The estimator developed in the next chapter is an excellent example

of this.

Resolution Power of Multiscale Bayesian Models

We now consider the precisions and accuracies of a model augmented by a Bayesian
estimator outside and within the multiscale framework. We shall refer to the resulting
models as the standard and multiscale augmented models, respectively. With these
formulations at hand, we can gain some insight into the mechanics of the multiscale
Bayesian approach that lead to its resolution power advantage. For simplicity, we
shall assume the estimators to be invertible, which is certainly the case for the Poisson
processes addressed in Chapter 3. One immediate consequence of this assumption is
that the precision of either augmented model is unchanged by the estimators, and
that these precisions corresponding to the standard and multiscale augmented models
are the same.

Since our interest ultimately lies in whatever occurs at the ﬁnest of scales, we
restrict the following derivations and discussions to scale j = 0, but all the results
apply equally to every other scale. This is important because the beneﬁts of the
multiscale approach can only be guaranteed if every intermediate scale enjoys similar
beneﬁts. We ﬁrst consider the standard augmented model.

Let (0 be a Bayesian estimator for A0 without the beneﬁt of any conditioning

59

from higher scales:

_ p(XolA )
co = Elele] = f me.) (7015-) dA.. (242)
We recognize the factor 1%]? as the ratio in the mutual information log W of

x0 and A0 at the particular realizations at which the ratio is evaluated. In general, as
we know, this ratio is a measure of the dependence between the two random variables:
when it takes values near one, it indicates highly independent entities; when it takes
large values, it indicates that the variables are highly dependent; and when the ratio
is very small, it indicates near mutual exclusiveness. In our applications we do not
encounter this third case as we assume x to be positively correlated to A.
Inspecting (2.42), we see that the estimator produces an average based almost
exclusively on the information provided by the prior when it deems that the rv x0
conveys very little information regarding A0. On the contrary, when the dependence
between the variables is high, W will de-emphasize those values of A0 which
are far from the regions that make the ratio large so that the computed average
corresponds to the average of those regions that favor the particular outcome x0.
This is the general mechanics of a Bayesian estimator, and gives insight into
its powerful approach to producing an estimate. Now, the mechanics that makes a

multiscale-based Bayesian estimator superior to its standard version can be seen by

computing the difference of the anomie corresponding to their augmented models:

Nc. — M. = /Ppo(C0|AO)P(Ao)10gM9|—Agl dcodAo

PlCoIAO)
p90 (60IA0)

— 6 A A 1 d6 dA,
fpml oI 0)P( 0) 0g p(5oIAo) o o

where 60 denotes the estimator (2.41) for j = 0, and C0 is the estimator (2.42).

Note that the EPRs associated with both of these estimators is the same. This is

60

a consequence of the invariance of the precision under the standard and multiscale

model augmentation scheme by an invertible estimator. The above difference reduces
to

_ — , 210134191
N20 Mo - /Ppo(00I)\0)P(A0)108 PCOWOIAO) €100 €130,

where do is a dummy variable. The ratio of densities may be interpreted in accordance

with the deﬁnitions of the estimators 60 and C0:

P60(0’0I)\0) PlEleIXoiAllIAo)

PCOWOIAO) p(ElAOIXOIIAO)

It is intuitive that this ratio is non-increasing as a function of the distance [00 — A0].
That is, while the probability of ﬁnding the estimate EIAono, A1] near the true value
A0 is greater than the probability of ﬁnding EIAOIxo] in the same region, the prob-
ability of E[A0Ixo,A1] decreases more rapidly than that of EIAOIxo] as their differ-
ence to A0 increases, ie, p(EIAon0,A1]IA0) is more “concentrated” about A0 than

p(EIAon0]IA0) is. Assuming this behavior yields

. 1 P6o(0'ol)\0)
NCO MO > /p(A0) 3:12) (2t)N0 ./C(Ao;t)10g pC0(00IA0) duo dAOv

which in terms of the density of XOIAO may be written as

. 1 p(XoIAoV |J60(Xo)|
o ’No / A l -——-—/ l d dA ,
NC 5 > M O)t-l-gb(2t)N° 00.0,.) 0gp(Xole)/|Jc,(3<o)| x0 0

where J50 (x0) E 3% and J¢0(x0) E g??- are the Jacobians of 50 and (0 with respect

to x0. Thus,

. 1 [JC0(X0)I
N —N > lim——/ lo ——dx, (2.43)
<° 5° ~00 No a...) glam” °

61

For simplicity, we consider the 2—d case. We let (A1, 01)T = WAO and ($1,w1)T =
on, where W is a linear orthonormal transformation. Then, due to the linearity of

the expectation operator

 

 

 

IaE[(A1.61)Tlx1.uu]

 

laAifxmvl) 691(1'1,w1) _ 511(11401) aéllxlvwﬂl

[JC0(X0)I _

 

 

 

 

 

_ , 6(1'1.w1) 7 . _ 6171 6101 aw) 82:1
IJ50(X0)I aEIIA1.01)TJI1.W1.A1I aAifl‘MUiJu) 891(11,w1,A1) 331(1‘iyw1J1) 891(11.w1,A1) .
I - I
6(1‘13111) 62:1 3101 8101 6101

Typically, the dependency of the estimate Al on the crossterm w, and the dependency
of the estimate 51 on the crossterm $1 are weak compared to the dependencies on

2:1 and wl, respectively. Therefore, the behavior of (2.43) may be inferred from the

 

 

approximation
6} , 69 .
Hmong I 12:1“) ‘22.”)!
IJ50(X0)I l3A1($1.w1.A1) 691(x1.w1.A1) I.
6:131 6w1

 

This ratio is, at least on average, greater than one, for the functional dependency of
A1 on A1, and that of 51 on 61, makes the denominator less sensitive to changes of 3:1
and w. A more rigorous argument can be formulated, but we leave this effort for
future work. Ftom (2.43), we conclude that M, < .A/Co, and so, A50 > AC0. Since
the precision associated with the standard and multiscale formulation of the Bayesian
mode is the same, we conclude that the multiscale-based Bayesian estimator achieves
a greater resolution power than the traditional Bayesian estimator: R50 > RC0-
Although some of the derivations in this section have not been carried out with
all the desired rigour, they lead to plausible results which cast light on the advantage
of the multiscale-based Bayesian estimation framework over the traditional Bayesian
approach. The main objective of this chapter up to this point has been to promote
this approach by establishing, at least in its beginning form, a foundation under
which models and estimators alike can be studied under a common all-comprehensive
set of criteria that applies equally to multiscale and traditional frameworks. These

criteria should not only become useful in comparing various estimation approaches,

62

but should also become the basis for designing guidelines for new estimators.

The new characterization of models and estimators also reflects our attempt to
put forward a clearer view of the interplay between scale (space / frequency) resolution
and information resolution (resolution power). Their relation depends on the speciﬁc
transformations used in creating the various scale models from the original high-
resolution model, and is reflected in the accuracies and precisions attained at those

scales.

2.5 Other Multiscale Modeling and Estimation
Approaches

In this section we briefly review some other important multiscale modeling and es-
timation approaches. These are the threshold smoothing methods: Hard and Soft
thresholding, Universal, and the SureShrink Method; Cross- Validation Method; and
False Discovery Rate. This list is not all exhaustive, but in conjunction with the
multiscale Bayesian approach, they represent the most important classical methods.
Their importance derives not only from their wide use in practice, but also from the
fact that they represent estimation paradigms from which many other methods have

later derived.

2.5.1 Threshold Smoothing Methods

The standard model used to represent the input-output phenomena of a very large

25'

class of processes is

x = A + 17, (2.44)

 

25Clearly, this model is not as general as a Bayesian-based model.

63

where x is the “noisy” data, A the underlying intensity, and where n represents
additive noise [33]. Within the thresholding methods, estimation of the intensity
from the data occurs in the frequency-space (wavelet) domain, where the intensity of
typical real-world signals can be projected onto a relatively small subspace compared
to that occupied by the noise’s projection. The segregation of the signals’ energies
is what makes their separation from noise possible. The general approach consists of
the following steps:

Wavelet transform the data, reduce the smaller wavelet coefﬁcients to zero ac-
cording to some thresholding rule and, inverse transform the coefﬁcients to recover
an estimate of the intensity.

The samples of signals of interest (the elements of A) are typically highly cor-
related. This induces a high structure among the wavelet coefﬁcients containing
signiﬁcant signal energy. Speciﬁcally, they exhibit the properties of clustering and
persistence across scales [34]. Clustering is the property of coefﬁcients tending to
take values in the order of those of its neighbors; and persistence across scales means
that large/ small values of wavelet coefﬁcients tend to propagate across scales. These
phenomena will be illustrated in Chapter 3. A consequence of this high structure is
the sparseness in the wavelet domain representation of signals, which is to say that
only a relatively few coefﬁcients convey the signal’s features.

Meanwhile, noise samples (the elements of 11) are often well-modeled as being in-
dependent, and therefore, the noise displays a “flat” distribution across the frequency
spectrum where the intensity lives. This causes the noise energy to become distributed
among the wavelet coefﬁcients of a much larger subspace than that occupied by the
signal. Consequently, large wavelet coefﬁcients are associated with signal energy, and
small coefﬁcients with noise contribution. Thus, reducing the smaller coefﬁcients to
zero effectively removes noise from the intensity, with only a minor smoothing effect

on the reconstructed intensity.

64

The literature on thresholding schemes often assumes the noise to be independent
of the intensity, and much of that work also assumes it to be Gaussian distributed
I12, 33]. These two points are to be found in high contrast with the methods devel-
oped later on in this dissertation. These assumptions greatly simplify the estimation
problem because the wavelet transform, being a linear orthogonal operator, sustains
the two assumptions over the transformation. These common assumptions are made

in the following descriptions.

Hard and Soft Thresholding

Let W denote the DWT. Applying it to (2.44)

w = 0 + er (2.45)

results. Here, w E Wx and 0 E WA, the wavelet coefﬁcients of the data and of
the signal (see (2.15)). For any given threshold r > 0 and an element wk of w,
there are two standard ways of modifying the coefﬁcient. These are the hard and soft
thresholding. Hard thresholding produces a new wavelet coefﬁcient according to the

rule

wk if [wk] 2 r

E’)
a-
ll

0 otherwise.

Soft thresholding uses a rule that modiﬁes every coefﬁcient, even those with strong

signal to noise ratios. The thresholding condition is

wk—r ifw;c >r
wk = 0 if [1ka S 7

wk +r if wk < -r

65

When applied to the denoising of images, soft thresholding is claimed to give more
pleasing estimates than the hard thresholding [35]; however, this can be due to over-
smoothing effects, that often has this visual appeal but which in fact may repre-
sent a degraded estimate under most conventional error measures. The smoothing
phenomenon will be illustrated in Chapter 4 within the context of tomographic im-
age reconstruction. For a more in-depth treatment of thresholding techniques, see

[10,12,35,36]

Universal Method

This method can take the form of a hard or soft thresholding rule. In the latter case,
Donoho & Johnson [35], the developers of the method, call it VisuShrink. This is
because the method usually oversmooths the noisy signal, which as we noted before,
tends to produce visually appealing estimates. A more accurate method is also given
in [35], which is a minimax thresholding method that is optimal in terms of L2 risk.

Both universal and minimax based methods are global thresholding approaches
because the chosen threshold is applied to all the wavelet coefficients. In practice,
however, it is typical to threshold only coefﬁcients at the ﬁnest resolution (i.e., scale
j = 0), since coarser-scale features of the data tend to belong to the original intensity
signal.

The distinguishing characteristic of the universal method is the basis on which the
threshold is computed. Assuming the noise or error term for each element of w is i.i.d.
normally distributed like N (0, 02), the elements of W11 are also distributed according
to N (0, 02). In this case, the simplest of the universal thresholds is calculated to be
7,. = m.

This threshold insures that as N increases, the probability that all noise coefﬁ-
cients get “rooted out” tends to one. Clearly, because the threshold TN is derived

from asymptotic arguments, the universal method does not perform well for signals

66

of short lengths [33].

SureShrink Method

In contrast to global thresholding approaches, the SureShrink method is a data-
dependant selection procedure. This very popular method, also introduced by Donoho
& Johnstone [12], ﬁnds thresholdings 1', at each scale j so that the L2 risk of the

estimator A for A will be small. Using the equality EIATA] = EIOTBI, the risk can be

expressed in the wavelet domain:

N-l
A 1 A A
120., x) a E If); 2W — 1,)2] oc E [239“ — 9,02] . (2.46)
k=0 M
The wavelet coefﬁcient estimates {6/l,’k}k are obtained from the data wavelets {w,),}jc
by a soft thresholding rule with threshold r,.
Using Stein’s Unbiased Risk Estimator (SURE) deﬁned as

N—l N—l
SURE(r,;w,) E N + (7',2 - 2) 21(ij6 S Tj) + 2 wire I(wj.k > 73'),
k=0 k=0

where I() stands for the indicator function and w, E (w,),)k, an unbiased estimate
of the risk of the wavelet coefﬁcient estimates {ﬁ,,k}k can be obtained. Then, with
the set of thresholds {1,} such that for each j, r, = arg min T>0 SURE(1'; w,), we can
expect that the risk R(A, A) be asymptotically minimized, because due to the Law of

Large Numbers, the SURE criterion is asymptotically close to the true risk [33].

Cross-Validation Method

The general method of cross-validation (CV) was initially adapted to wavelet regres-
sion by Weyrich & Warhola [37] and N ason [38]. Other important related work have

followed since then (see for example [39, 40].) Similarly to the SureShrink method,

67

the multiscale CV approach aims to minimize the risk (2.46). Here, however, a global
threshold r is chosen such that the average risk corresponding to two different es-
timates for A is minimized. The ﬁrst of these estimates, AOdd, is obtained by soft
thresholding the data wavelets corresponding to the odd samples of x; and the sec-

even
0nd estimate, A , is similarly computed from the wavelets associated with the even

elements. The threshold used is one which minimizes

1

MW) E ﬂ {2(Xpdd _ xix/emf + 2(Xfx'en _ x?dd)2} ,
i
where 135“" and (Bi-”dd are interpolated samples of x required to align the “data” samples
to the even and odd estimates.

Essentially, the CV approach seeks to obtain the threshold value that reduces the
squared error incurred when predicting half of the data samples with estimates that
are based solely on the other half of the data elements. Clearly, due to the required
interpolation of the data, the separation between the portion of the data used in
estimating and that used in testing—that is, used in computing the associated risk—

is not complete in the case of the multiscale approach.

False Discovery Rate

The False Discovery Rate (FDR) approach to computing the required threshold is
due to Abramovich & Benjamini [41]. They structure the problem in terms of a
multiple hypothesis test. There are N — 1 null hypotheses H0 : 6,), = 0, which are
regarded to be highly probable on average. The aim of the approach is to reduce the
erroneous possibility that a coefﬁcient satisfying the null hypothesis is included in the
reconstruction of the signal estimate, or that a coefﬁcient not satisfying H0 is included
with the wrong sign. Thus, if R is the number of coefﬁcients that are included in

the reconstruction (erroneously or not), and Q the number of coefﬁcients incorrectly

68

included, then Abramovich & Benjamini attempt to include as many coefﬁcients as
possible maintaining the expected value of Q / R below a user-speciﬁed value.
Comparison between this and the VisuShrink method shows that the FDR ap-
proach performs better for signals that include some abrupt changes, while the
VisuShrink method is superior when the function space only includes smooth in-

tensities [41].

69

CHAPTER 3

Multiscale Modeling and

Estimation of Poisson Processes

In this chapter, we present and analyze a new multiscale Bayesian framework for the
modeling of Poisson processes. In the previous chapter, we motivated this approach
for arbitrary processes by showing that multiscale representation of signals makes
possible the utilization of all available information in the signal. These results, how-
ever, did not speciﬁcally show the way to model any particular process in order to
gain the multiscale advantage. There are always many ways of achieving this even
within the multiscale framework, but not all lead to simple and practical models. The
approach introduced here will be shown to provide a very powerful and natural frame-
work to studying a wide variety of Poisson processes. The new framework makes full
use of the Poisson probability model and enables the incorporation of realistic prior
information into the estimation process. We will also show how it can be applied to

photon-limited imaging.

70

3. 1 Preliminaries

The problem of estimating the intensity A of a general Poisson process from a single
observation1 c of the process has been studied in great depth. For example, many
earlier approaches to Poisson intensity estimation were based on the idea of modeling
the variability of the process by Gaussian ﬂuctuations with non-stationary charac-
teristics, e. g., [42, 43]. Recently, simple wavelet-based approaches to this problem
make use of the square-root of the counts (a variance stabilizing transformation that
makes the data approximately Gaussian) and then apply standard wavelet thresh-
olding techniques for Gaussian noise removal [12]. More sophisticated wavelet-based
estimation procedures attempt to deal with the Poisson statistics directly. Kolaczyk
has developed a wavelet-based thresholding scheme for the estimation of a special
class of Poisson processes termed “burst-like” processes [14]. The burst-like Poisson
process is characterized by a homogeneous, low intensity background with spatially
isolated bursts of high intensity, and is motivated by problems in astronomical imag-
ing. Nowak and Baraniuk propose a wavelet-based method for the estimation of more
general Poisson intensities in [44] using the cross-validation estimator developed in
[40]. This method is applied to nuclear medicine image estimation in [45]. Both meth-
ods [14, 44] can provide satisfactory results in certain situations. However, neither
method adopts a Bayesian perspective, and hence they do not explicitly make use
of prior information that may be available. As we noted at the end of last chapter,
several wavelet-based Bayesian estimation procedures have been proposed for Gaus-
sian data, e. g., [46, 34, 11, 15, 47], however, such methods are not applicable to the

Poisson problem considered here.

 

1For completeness, in Section 3.5.4 we address the case of multiple observations of the same and
related processes, but our main thrust throughout will be the single-observation case.

71

3.2 Notation

To simplify the presentation we work with one-dimensional intensity functions in the
interval [0, 1]. In a later section, we extend the modeling and estimation approaches
to two-dimensional problems. Furthermore, we assume that the intensity function is

discretized so that A is represented as a vector of length N with elements (Ak)£:1

0 .
The counts ck are the elements of the vector c, also of length N.

We follow the wavelet representation and notation introduced in Section 2.2.2
throughout. Speciﬁcally, let c0 be the data sequence of counts c of length N, where
N is assumed to be a positive integer power of two, and let c0), be its kt” element. As
before, the subscript 0 denotes the ﬁnest scale (resolution) of analysis. Similarly, let

A0 be the ﬁnest resolution representation of the intensity sequence A, i.e., A0 = A,

also of length N. Then,
COIAO ~ Poisson(A0), (3.1)

and the objective is to estimate A0 from the observation co.

From Figures 2.1-(a) and -(b), and expressions (2.10) and (2.11) the wavelet ﬁlter
coefﬁcients corresponding to the Haar system are (h0,h1) = (ﬁ, 53) and (go, 91) =
(715,—71,,,-). For reasons soon to be addressed, it is convenient to work with the
unnormalized Haar system. The ﬁlter coefﬁcients corresponding to this system are
simply (ho,h1) = (1,1) and (go, gl) 2 (1, —1). Therefore, a multiscale analysis of CD

can be obtained by iterating

Cch = Cj—1,2k+Cj—l,2k+1s (3.2)

dik = Cj—l,2k"Cj—l.2k+1i (3.3)

for j =1,...,J and k = 0, . . .,N/2j — 1, and J = log2(N). As before, J denotes the

72

Figure 3.1. Multiscale scaling coefﬁcients {CM}. At the top, we have scaling co—
eﬁicients at the coarsest resolution. At the bottom, we have the ﬁnest resolution,
expressed by the data themselves. The connecting segments illustrate the functional
dependencies among the various scaling coefﬁcients c,,k according to expression (3.2).

coarsest scale of analysis, and CM. and d,,,c denote the scaling and wavelet coefﬁcients
of the data, respectively, at scale j and position (shift) k. The scaling coefﬁcients
c, = (c,,k),1c”=/(2,j-l represent a lower resolution representation of the data c,_1. The
“detail” information in c,_1, which is absent in c,, is conveyed by the sequence of
wavelet coefﬁcients (1, = (d,,k)£f=/ 31"]. Using (2.9), c,_1 can be perfectly reconstructed
from c, and d,. Figure 3.1 is an annotated version of tree-structured representation
of Figure 2.4-(a), which shows the functional dependencies among the various scaling
coefﬁcients of a sequence c of length N = 8.

Similarly, as for Co, we deﬁne the scaling coefﬁcients A,.), and the wavelet coefﬁ-

cients 0,), of the intensity function A0:

Aik = ’\j—1.2k+)\j—1,2k+la (3.4)

011k = Aj—1.2k — Aj—1,2k+1- (3.5)
When the intensity of interest is of a discrete nature, A0 is simply the sequence A

73

itself. If, on the contrary, the intensity signal is a function of a continuous variable
t 6 [0,1], say A(t), then A0 corresponds to the sequence of scaling coefﬁcients at
scale 0. That is, in accordance with the deﬁnition of the unnormalized Haar wavelet

transform 011 the interval, A01: = (A, 0501:) = if; 1)/ N A(t) dt, and more generally,

21(k+1)/N

A311- : (A, 2j/2¢j.k> = / A(t) dt (3-6)

22 k/N

3.3 Why the Unnormalized Haar Transform?

Multiscale analysis based on the unnormalized version of the Haar transform has the
unique property that every scaling coefﬁcient is the sum of two ﬁner-scale scaling
coefﬁcients, and consequently, due to the reproducing property of the Poisson dis-
tribution,2 every scaling coefﬁcient is Poisson distributed. Furthermore, it is well
known that given two Poisson variates, c1IA1 ~ Poisson(A1) and CQIAQ ~ Poisson(A2),
the conditional distribution of Cl given A1, A2, and the sum c1 + c2 is binomial [48].
This reveals a very simple “parent-child” relationship between the scaling coefﬁcients
across scales. In Section 3.5, these facts are crucial in the development of the proposed
intensity estimator. Similar attributes (reproducibility and simple parent-child rela-
tionship) do not hold for more general multiscale analyses of Poisson processes based
on other wavelet systems. This is in marked contrast to the Gaussian case, in which
such attributes hold for a wide variety of wavelet analyses (including all orthogonal
wavelet systems). In short, multiscale transforms of Poisson processes other than the
unnormalized Haar transform are much more difficult to analyze and process. The
natural match between the unnormalized Haar transform and the Poisson process is
the primary motivation for choosing it.

The use of the unnormalized Haar wavelet transform to carry out the multiscale

 

2c,|A, ~ Poisson(A,~), c,IA,~ independent => 2 c,-I Z: A,- ~ PoissonQ: A,).

74

analysis of the data has additional benefits springing from the following points. Pois-
son processes result from counting independent events occurring in disjoint regions of
time or space of equal size. In such cases, the unnormalized Haar scaling coefﬁcients
correspond exactly to these counts occurring at intervals of sizes varying according
to scale. Thus, the scaling and wavelet coefficient have a very natural interpretation
according to (3.6) and (3.3). The Haar basis functions also have the property of being
completely localized in space. By this we mean that at each scale, scaling functions,
as well as wavelets, do not overlap. Therefore, at each scale, scaling coefﬁcients are

conditionally independent, that is,

P(CjI’\j) = lecj,kl)‘j.k)- (3-7)
k

Also, Poisson processes are inherently nonnegative; therefore, a good estimator should
always produce intensity estimates that are either positive or zero. An estimator based

on the Haar system may be designed with this quality.

3.4 A New Probability Model for Intensity Images

3.4.1 Multiscale Signal Model Framework

To formulate a Bayesian estimator for this problem, we must ﬁrst propose a prior
probability model for the unknown intensity A. The observed data c is regarded
as the realization of a Poisson process spawned by the unknown realization A of
some random sequence with prior density p(A). In last chapter’s terms, A and c are
respectively the cause and effect of the process completely determined by p(cIA)p(A),
where p(cIA) is the Poisson probability mass distribution.

Just as we did in the last chapter for general processes, we can formulate the

present model within a multiscale framework, and similarly arrive at the optimal

75

 

intensity model

[a

multiscale ]

 

 

 

 

 

Haar multiscale Inverse -
estimated
transform processing Transform

 

 

 

 

 

 

 

 

 

Figure 3.2. Structure of a Haar-based intensity estimator.

estimator (see (2.41))

x, _ fAjp(c,~lA,)p(A,|A,-+1)dAj (3 8)
J " A ' '
fP(Cj|’\j)P(AjI’\j+1)dAj

 

This Bayes estimator poses two interrelated problems. First, the speciﬁcation of
a meaningful and useful prior p(A,IA,+1), which we deem to be an excellent approx-
imate for p(A,IA,+1) (see Section 2.4.2). Second, the numerical computation of the
estimator. The role the above quantities play in the estimation process is illustrated
in Figure 3.2. The remainder of this section describes a new prior probability model
for the Haar scaling and wavelet coefﬁcients of a non-negative intensity that leads
to a very simple speciﬁcation of p(A,IA,+1). In Section 3.5, we derive an efﬁcient
algorithm for computing the optimal estimator (3.8).

There are two important reasons for adopting a multiscale approach to this prob-

lem:

0 Prior models that are mathematically tractable, computationally practical, and

empirically supported can be speciﬁed very naturally.

e Poisson data is much more reliable (accurate) at coarse scales than at ﬁne reso-

76

 

 

 

 

 

 

 

 

L

—1 -oi5 o 0.5 1 —1 —o.5 o 0.5 1
coefficient value coefficient value

(a) (b)

v , 1 f

 

 

 

 

 

 

 

 

 

 

 

.4 .M

-1 -0.5 0 0.5 1 -1 —0.5 O 0.5 1
coefficient value coefficient value

(0) (d)

 

Figure 3.3. Histogram of perturbation variates (6 = 9/A) for scale (a) j = 1, (b)
j = 2, (c)j = 3, and (d) j = 4 of the cameraman image of Figure 3.10(a). The
general invariance of the distributions’ structure across scales illustrates a self-similar
property of real-world image statistics.

lutions (higher counts => higher signal-to—noise ratio). Therefore, more reliable

coarse-scale estimates can be leveraged to improve high resolution estimators.3

The ﬁrst point is due partly to the fact that multiscale decompositions of real-
world intensities are often statistically self-similar. By this we mean the property

that the various scale representations preserve the major features and characteristics

 

3The good match between the Poisson and Gaussian distributions that occurs at high counts
suggests a similar positive relation between the Poisson models’ anomies and corresponding signal-
to—noise ratios as seen to exists in the Gaussian case of Example 3 of Section 2.3.3.

77

of the original object, except for the usual gradual loss of resolution. In particu-
lar, it has been widely recognized that the distribution of the wavelet coefﬁcients of
real-world signals tend to be similar at all scales of analysis, and are usually con-
centrated around the origin and unimodal [34]. The self-similarity captured by the
Haar multiscale analysis is illustrated by considering the distributions of the wavelet
coefﬁcients at various scales. Figure 3.3 illustrates this phenomena. The histograms
in this ﬁgure correspond to the wavelet coefﬁcients 4 at scales 1, 2, 3, and 4 for the
cameraman image of Figure 2.6. The similarity between these distributions facilitates
the speciﬁcation of Bayesian prior for the intensity in a very natural way.

The second point above motivates an estimation process that evolves from coarse
to ﬁne scales (See Chapter 2). It is easily veriﬁed that the signal-to—noise ratio (SNR)
in a Poisson process increases linearly with the underlying intensity (signal). Thus,
according to (3.2), cm = 2;? Cox, and so the signal—to-noise ratio at scale J is 2"
times as large as that for the average data point 00*. For example, for a 128 by 128

pixel image, this represents a SN R improvement of 42 dB.

3.4.2 Multiscale Multiplicative Innovations Model

We now describe a new Haar-based probability model for the intensity. Let AM and
6,,c denote the random variables corresponding to the j, k-th scaling and wavelet
coefﬁcient of the intensity, respectively. At the coarsest scale j = J, the single scaling
coefﬁcient Aw has a density with support on IV. In this work, we choose the gamma
density, since it is especially easy to use in conjunction with the Poisson mass function,
and because it provides a reasonable mechanism for incorporating prior knowledge of

the intensity range. However, as noted earlier, the SNR in the count 0,1,0 is typically

 

4More precisely, here we are plotting the histogram of the ratio of the wavelet coefﬁcient relative to
the corresponding scaling coefﬁcient. That is, each wavelet coefﬁcient is divided by the corresponding
scaling coefﬁcient at the same scale and position. If the scaling coefﬁcient is zero, the operation maps
to zero. The motivation for this ratio will become apparent in the following sections.

78

very high, and therefore any reasonable prior with support on IR+ will not signiﬁcantly
influence the estimation of A $0.5
Next, introduce statistically independent perturbation variables {6,1,} and model

the wavelet coefﬁcients by

911k = AM 5j.k- (39)

Each wavelet coefﬁcient is modeled as an independent perturbation of its correspond-
ing scaling coefﬁcient. Furthermore, the perturbations at all scales and positions are
assumed to be mutually independent. Applying recursions (3.4) and (3.5) to these
coefﬁcients, we ﬁnd that A,.ljc = %(A,,W2] + (-1)’° 6,,[k/21), where I] stands for the
integer part of the argument.

To gain some insight into this model, consider the random variable y,,;c deﬁned by

(1 + 6”). (3.10)

NIH

35.1.». E

The variable y,,;c can be viewed as the canonical multiscale parameter for Poisson
processes because of the following parent-child relationship. It is well known that
given two Poisson variates c1 and c2 such that c1IA1 ~ Poisson(A1) and c2IA2 ~
Poisson(A2), the conditional distribution of c1 given the sum c1 + c2 is binomial with
parameter y = 13% [48]. In the context of our multiscale analysis, this special
property implies a very simple parent-child relationship. Speciﬁcally, the conditional
distribution of child c,_1,2)c given the parent CM = c,-1,2k + c,_1,2k+1 is binomial with
parameter y,,,. This relationship demonstrates the fundamental role of y”, in the

multiscale analysis of Poisson processes.

 

5In fact, in practice we often use the estimate Am 2 cm.

79

 

“/1, ‘~ / \
A'1.o,Q\ \AAH A12 K Ana
,/ \ , /’ . .
/ ‘ 1— \ 1- \1— ‘ 1—
Ym/ \. Y1.o Y1,1/ \\ Ym Y1,2/ \ Y1,2 Y1.3/ \ Y1,3
/ \ / \ / \ / \
O O 6 D O O O O

10.0 A'0.1 A'0.2 A'0.3 )‘OA A0.5 A'0.6 A0.7

Figure 3.4. MMI model interpreted as a probabilistic tree. The MMI model can be
viewed as a tree-structured probability model in which the intensity A,,;c at coarse scale
j is reﬁned (split) via the multiplicative innovation y,,,c to obtain two new intensities
A,.”,c and A,-”),H at the next ﬁner scale of analysis j — 1. The innovations variates
{y,,,} are mutually independent at all scales j and positions k.

Using y“, in conjunction with (3.9) we have

)‘j—lﬂk = Achijm (3.11)

/\j-1,2k+1 = )‘j,k(1"yj.kl' (3-12)

We can interpret these reﬁnements as a multiscale innovations structure, with the
innovations y,,,c and 1 --y,,;c entering in a multiplicative fashion, in contrast to the more
standard additive innovations structure encountered in Gaussian estimation problems
[49]. We call this model a multiscale multiplicative innovations (MMI) model. The
model is graphically depicted in Figure 3.4. The following are some key properties of
the MMI model.

So long as the distributions for the perturbations {613k} are chosen to be similar
across scales, the MMI model gives rise to self-similar intensity representations, which

as discussed in Section 3.4.1, are typical of real-world intensities. Also, here we

80

only consider temporally homogeneous processes; it would be undesirable if the prior
depended on the observation time interval in a complicated manner. Due to the
multiplicative innovations structure, only the coarsest scale of the prior is dependent
on the observation time interval. Thus, the model is essentially invariant to the length
of the observation time. Moreover, in Section 3.5 the MMI model is shown to provide
a mathematically tractable match to the Poisson nature of the data which leads to a
very simple estimator formulation.

The MMI model is closely related to other models studied in physics and statistics.
The MMI model belongs to the class of cascade models, which are used in statistical
physics for modeling a variety of natural phenomena including turbulence modeling
[50] and rainfall distributions [51]. In fact, because MMI model is a type of cascade
model, it can be shown that the MMI model is a random multifractal [52]. It is also
interesting to note that the MMI model is a special case of a Polya tree [53, 54]. In
the statistics community, Polya trees are used to model probability distributions, a

problem analogous to modeling a non-negative intensity function.

3.4.3 Prior Distribution for Innovations

A prior distribution is determined by the nature of the ensemble of objects to be
modeled, and so, the better deﬁned the ensemble is, the more informative the prior,
and consequently, the better the model. Towards this end, we emphasize photon—
limited images, particularly, nuclear medicine images; however, as noted earlier, a very
large class of other real-world images are well characterized by the same features of the
prior p(6) for the perturbations 6,,k: statistical self-similarity across scale, symmetry
about the origin, unimodality, concentration around zero (see Section 3.4.1), and
support on the [—1, 1] interval.

This last property is due to the fact that the range of 6,), is [—A,,,, AM]. The rest

of the properties are based on the characteristics of observed wavelet coefﬁcients’ dis-

81

tributions resulting from natural signals [34], and which have been exploited in other
areas including wavelet-based compression [55]. These properties are also illustrated
in the histograms of Figure 3.3. All the properties, however, may equally be inferred
a priori from a few observations.

The images of interest have the common characteristic of being mostly smooth
with a few discontinuities. Typically, these discontinuities form edges that are long
enough to span several scales. On average, neither the edges nor the overall intensity
variations have any preferred orientation, as dictated by nature. It is the slowly
varying intensity typical of images which is responsible for the high concentration of
wavelet coefﬁcients around the origin, and it is its non-oriented nature that causes
their symmetry. The few outlayers coefficients are due to the discontinuities, which
are found nearly in the same numbers in opposite orientations—think about the
silhouette of a phase, for example. The statistical self-similarity is due to the span
of images feature across scales. For example, large regions of smooth intensities and
discontinuities alike are observed unaltered at many scales.

One very general class of probability density functions that possesses the desired
characteristics and which reflect the above image characteristics are beta-mixture

densities of the form

 

(1 _ 62) si—l
219102281-1’ (3.13)

3(3113

for —1 S 6 S 1, where B is the Euler beta function, 0 S p, g 1 is the weight of the
i-th beta density ﬁ$—%;—ir with parameter s,- _>_ 1, and 2,”; p,- = 1. Figure 3.5
depicts a mixture of three beta densities. A similar method was also recently proposed
using a prior based on a mixture of a Dirac impulse and a single beta distribution
[56].

Other classes of density functions may also provide the desired characteristics, but

82

 

 

 

 

 

, -o.2 —o.1 o 0.2
coefficient value

Figure 3.5. Three component Beta-mixture distribution (solid line) superimposed on
the histogram of the perturbation variates (6 = B/A) of Figure 2.6 of Section 2.2.2.
The beta mixture parameters here are 31 = 1, .92 = 100, 33 = 10000, p1 = 0.001,
[)2 = 0.400, and p3 = 0.599.

the beta family has a signiﬁcant computational advantage. As pointed out above, yj‘k
parameterizes the conditional distribution of the child 9-12;, given the parent CM.
This conditional distribution is binomial. Hence, from a practical perspective, the
use of a prior that is conjugate to the binomial will greatly facilitate computations
[23]. It is well known that the beta family is conjugate to the binomial. For this
reason, the beta mixture prior described above leads to a very simple, closed-form
estimator which is discussed in the next section. However, for the sake of brevity, in
our derivation of the optimal estimator, given in Section 8.2, we directly compute
the posterior means based on a beta mixture prior, without explicitly noting the use

of conjugacy.

83

3.5 Estimation

3.5.1 Bayesian Multiscale Intensity Estimator

We shall focus on the posterior mean estimator, although other estimators (e.g.,
MAP) may also be considered within our framework. The posterior mean is the
optimal Bayes estimate under a quadratic loss function. The posterior mean estimate
of an intensity AM, given all the information available in the data c0 and the MMI
prior model pg, is the conditional mean 3:ch E E[Aj,k|c0]. In this section we derive
simple closed-form expressions for the posterior mean.

First, based on the analysis in Section B.1 of Appendix B,
KM = E[/\j.k]Cj]-

This implies that a simple coarse-to—ﬁne procedure can be employed in the estimation
process.

At the coarsest scale 3' = J, the intensity is represented by a single scal-
ing coefﬁcient A 1,0. Let us begin by considering the estimation of A 1,0. We have
X 1,0 E E [AJ,0[CO] = E [Awlcm]. As argued in Section 3.4, the corresponding count
cm is itself usually a very good estimate for A“) provided the total number of counts
is sufﬁciently large. This choice has the added advantage of insuring the preservation
of total number of counts, i.e., 2k Soy, = 2k 0%.

The posterior mean estimate for the wavelet coefﬁcient 03",, is given by

$3)

',k E E[6j,k|CO]

= E [Aj,k]C0] E [6j.k|C0] a

where we have used (3.9), and have exploited the independence between AM and 511k-

84

Now, we may simply write

933k = Xch 6M: (3.14)

with the obvious deﬁnition 3],, E E [6,,klc0].

property of the expectation operator:

>

>’

The desired form for X“, may be obtained using (3.4) and (3.5), and the linearity
1 k
1.1: = E - (bulk/21 + (-1) 6j+1.[k/2])

2 C0]

1 A .x
= 5 (Aj+1,[k/2] +(-1)A 6j+1.[k/2]) - (3-15)

 

Here is where we exploit the multiscale framework for estimating the desired intensity.
The estimate for 2:33;, is leveraged by the more robust coarser-scale scaling coefﬁcient’s
estimate Xj+1.[k/2]-

In Section B.2 we show that

 

 

Z p.3(3i+Cj-l.2ks3i+Cj—1,2k+1)
A ' I B i» t 2 i '
51.x: = d“. i (s S” “6““) (3.16)

2 DB(S;'+CJ'—l,2ko5i+Cj—1,2k+l) .
i 1 3091.30

The parameters {p,, 3,} are the deﬁning parameters for the beta mixture model in
(3.13). Note that 333;, guarantees nonnegativity of the resulting intensity estimates.

That is, X131: 2 0 for j = 0,... , J and all It. To verify this, simply rewrite (3.16) with

 

—d-’4'5—| Slfor

the factor dJ-j, inside the upper summand and consider the fact that 28', +c, k

all 2'.

The overall algorithm is described below.

85

 

Bayesian Multiscale Intensity Estimation

1. Estimate coarsest scale coefﬁcient

/\J.0 = 0.1.0

2. Forj=Jdowntolandk=0toN/2j—1

Compute:
3; 1, according to (3.16)
(71.1. = in- 311
Reﬁne:
-= % < ,... a: .)
/\j-1,2k+1=% (3:31. — 6“)

 

 

 

These simple procedural steps produce posterior mean estimates for ﬁner and finer
representations of the underlying intensity, and terminate with the desired, finest-scale
estimate X0. The complexity of the proposed estimator is 0(N), the same order as
the fast wavelet transform itself.

For large data sets it is possible that the full J = log2(N) iterations over the
scales in Step 2 above are not necessary. For instance, for long data sequences the
estimator may be initiated at a scale J * < log2(N), for which the estimate X]. = cJ.
is already very reliable. In practice, even for low-count data, we have found that
J" = 5 provides excellent results. Using J“ other than J is equivalent to dividing
the original data sequence into equal subsequences, estimating their underlying in-
tensities separately, and concatenating the individual results to obtain the overall
intensity estimate. However, in Section 3.6 we introduce a shift-invariant version of

the estimator above, for which the equivalence just described does not hold.

86

3.5.2 Selection and Analysis of the Beta-Mixture Prior

Our experiments and analysis with real-world intensity functions have led to several

conclusions.

1. The perturbation densities of many real-world intensities are very well character-
ized by a weighted combination of three beta densities with shape parameters

.91 = 1, 32 = 100, and s3 = 10000.

2. The .91 = 1 component is the uniform density, and we have found that ﬁxing the
corresponding weight to a small positive constant (e.g., p1 = 0.001), to insure
that the prior density 115 is bounded from below over the entire [—1, 1] interval,

is appropriate in most. situations we have studied.

3. The key parameter that distinguishes the characteristics of different intensity func-
tions is the trade-off between the 32 = 100 component and the lower vari-

ance .93 = 10000 component. The trade-off is parameterized by the probability

102 E [0,1 -p1]o (Note P3 =1- 101- 102-)

To gain some insight into the functioning of the MMI model estimator, consider
Figure 3.6 which plots If“, versus the ratio dj,k/cj,k for three cases: low (c = 10),
medium (0 = 30), and high counts (0 = 1000). These are, respectively, the dashed,
solid, and dot-dashed curves. The ratio dJ-JC/cjj, may be regarded as an empirical
counterpart to the perturbation variate 613k. Figure 3.6(a) and (b) correspond to two
6-priors with 132 = 0.1, and p2 = 0.9, respectively. The rest of the parameters take
the values given in points 1—3 above.

From these ﬁgures we can observe the following. At high counts, when the SN R is
high, the estimator regards djjc/cjj, to be a good estimate of 5ch for almost every value
of d,.), /cj,k; thus, the resemblance to the unit-slope linear function. In contrast, for low

counts, the SNR is much lower, and consequently the estimator severely attenuates

87

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

.-.-.. C=1000 II!
—- C=30 I!"
0.5’ --- C=10 1"" 1
g (E) "II! "A
g g ...... - ~~~~~~ 1
o q) o i’
Q Q '1 I.
-0 5 O 5 r . I".
-1 ‘ 1 1 .
-1 -0. 5 0 0.5 1
d/c

 

Figure 3.6. Perturbation estimate 3 as a function of d/c for c = 10 (dash), c = 30
(solid), and c = 1000 (dot-dash). The estimator’s deﬁning parameters are 31 =
1, 32 = 100, 33 = 10000, p1 = 0.01, and (a) 102 = 1 - p1 — p3 = 0.100, and (b)
p2 =1—p1 -—p3 = 0.900.

djk/ijc. This phenomenon is reminiscent of the behavior of wavelet-domain threshold
estimators designed for additive Gaussian white noise (AGWN) [12], except that the
threshold is adaptive to the local intensity.

The MMI model estimator minimizes the expected squared error with respect to
the MMI model prior. Of course, the error is minimized by balancing the trade-off
between ﬁdelity to the data and ﬁtting to the prior model. If the data are very ‘reli-
able,’ then the estimator favors the data. If the data are unreliable, then the estimator
favors the prior. As noted, at low counts the ratios dJ-jc/cjjc are not accurate esti-
mates for 6, and so, the Bayesian estimator attenuates them to minimize the expected
squared error in accordance with the prior. The nonlinearities in Figure 3.6(a) corre-
spond to a lower-variance prior (smaller 192) than that giving rise to the nonlinearities
of Figure 3.6(b). As a result, the nonlinearities in Figure 3.6(a) display a ‘dead—band’
characteristic, since the prior requires that a greater number of djj,/cjjc samples be

mapped towards zero, in contrast to the higher-variance prior which displays a less

88

harsh attenuation of small dJ-jc/cJ-Jc in Figure 3.6(b). This behavior is similar to that
observed in other wavelet-based Bayesian estimators designed for AGWN [46, 11, 15].

However, the functioning of the MMI model estimator is in contrast to that of es-
timators for AGWN processes. In the latter cases, all wavelet coefﬁcients are typically
attenuated independently and in disregard to the values of the corresponding scaling
coefﬁcients [46, 11, 15]. The MMI model estimator, on the other hand, adapts not
just to the statistics of a particular scale, but also naturally incorporates information
from coarser scales. This is crucial in the Poisson problem since the coarse-scale in-
tensities (scaling coefﬁcients) are indicative of the statistical reliability of the wavelet

coefﬁcient.

3.5.3 Estimation of Prior Parameters

The Haar wavelet coefﬁcient distributions of real-world intensities often ﬁt a proﬁle
which resembles that of Figure 3.5 as previously discussed. However, while many
distributions follow this general characteristic, one expects that subtle variations will
exist from application to application. Therefore, it is of interest to adapt the prior to
the problem at hand. Here we give a very simple approach to ﬁtting the prior based on
a moment-matching method. This adaptation can be viewed as an empirical Bayesian
extension of framework described above.

Recall, we assume that for each scale j, the set {633k} is independent identically
distributed (i.i.d.) with an M—component beta-mixture density distribution with
parameters {pj,,-,s,-}f‘il. We let the mixing probabilities {p13,} depend on scale to
enable variations of the density across scale. Also note that here the index i does not
refer to shift (as does It in yjj, for example), but rather it refers to the i-th component
of the mixture density. Then, by (3.10), at each scale j, {yj,k}£I=/§j_l is also i.i.d.

with an [VI-component standard-beta-mixture density distribution with parameters

89

{191.11 Silfiii

 

3,—1 1 _ 3,—1
pJ-(y) = [my ( 9) . 0 s y s 1. (3.17)
Since y”, is independent of AM, E[A}‘_1‘2k] = E[A;.“k] E[y;‘k] and

E A" 121
E [113%] = —[E[Z)\—k]_]_ (3-18)

The moments E[yj,k] need not be computed using this expression since the prior
model (3.17) gives the mean % for any choice of parameters. For n 2 2, the moments

E[A" j_ 12k] and E[A" k] are easily estimated from the data. Since E[cj klAj k]— — Aj k,
EULA-l = ElE[CJ.k|/\j.kll = Bicycl-

And in general, it can be shown that for all 71 there exists a degree n polynomial pn(-)

such that
13033.] = Ean(CJ‘.k)l'

For example,

Bldg? ,=kl EClJ k ’CJkl ~ [WW-2303', k’CJk)

Substituting these estimates for various 71 into (3.18) we can obtain empirical
estimates for the various moments of y“. Equating these to the moments of the
beta-mixture model (3.17) produces a set of equations that can be solved for the
parameters {pj,,-, 8,},311

As mentioned above, we have found that the choice of a three component prior

90

beta-mixture model sufﬁces for many real-world intensities. At all scales j, we suggest
the shape parameters 31 = 1, 32 = 100, and 33 = 10000, with weights p“ = .001,
pig = 1-pj,1-pj,2, and 13,-,2 adapted at each scale using the second moment estimate for

31M:

 

3.5.4 Optimal Estimation from Large Ensembles

Until now, we have strived to construct an estimator which optimally estimates the
intensities of Poisson processes from a single observation of the process. There are
situations, however, when several observations of the same process are available, or,
when a large set of realizations corresponding to distinct elements of the ensemble
to which the object of interest belongs is at hand or may be created.6 These two
conditions require distinct approaches in order to exploit all information available
and optimally estimate the underlying intensity.

In the ﬁrst situation, one arrives at a suitable approach by viewing the set of
observations (of the one process) as sub-intervals of a longer integration time for the
process. Then, the counts corresponding to this larger integration interval are simply
the sum of all the counts available on a sample (pixel) basis, and represent sufﬁcient
statistics for the intensity pixels [18]. This well known result is due to the reproducing
property of the Poisson distribution,7 and may be stated as A E E[A|c1, . .. ,cm] =

E[A|c1 + +cm].

 

6Under certain conditions, estimates for the object’s distribution may be obtained from simulated
large scale ensembles, for example, with the use of the bootstrap method [57, 58].
7See footnote 2 in page 74.

91

In the second scenario, an optimal estimator would be desirable if, for example,
a large data bank of nuclear medicine images of, say, knees are available from past
clinical studies. These data then could be used to leverage the estimate of the nuclear
images of a new patient’s knees. Situations like this admit two possible solutions.
First, the moment-matching method described in the last section could be employed
with the larger observation data set to obtain very robust estimates of the prior
parameters. A robust prior clearly would lead to a robust estimate of the intensity
of interest.

If the data set is large enough that the ratios

p(Cj—1.2k = C1 +1|Cch = 0 +1)

P(CJ—1.2k = CIICch = C)

 

can be computed sufﬁciently accurately, a prior-independent optimal estimator for
the innovations {yjjc} can be devised as follows. Abbreviating the indexing notation,

we have

yp(yl01 c )dy

/
=/yp(cilcy)p(( gently-

(CIIC)

 

Since p(cllc, y) = (col)ycl(l — y)‘3‘°l (see [48]), and since the innovations are indepen-

dent of any rv at their own and coarser scales,

A f y (f,)yc‘(1 - y)c‘“1p(y) dy
y =
P(01|C)
_cl +1 I (51:11) yc’“(1 -y)c“c‘p(y)dy
c + 1 p(cllc) °

 

 

92

Thus, we obtain

A _ Cl +1 p(cj-1‘2k = C1 + 1'01"}; = C +1)

yk _ 3.20
J c + 1 p(Cj—1,2k = ClleJc = C) ( )

 

Clearly, to compute the ratio of probabilities in this expression we only need,
for example, a table for p(CIIC). For instance, for a realization C“, = 0, necessarily
9.131, = 0, and the innovations estimate reduces to if”, = p(cj_1,2k = 1|cj,k = 1),
which typically takes on values close to 5 For very large counts of CM, one would
expect that p(cllc) z p(cl + 1|c + 1). Therefore, for very large counts 171.}: z 961,
reﬂecting the fact that at high counts the SNR is high and the data is, therefore,

reliable.

3.5.5 Example of Estimation from a Single Observation

For our ﬁrst illustration of the performance of the Bayes’ estimator we give the follow-
ing simple example. We offer more comparative examples in Section 3.7 once a more
sophisticated model and estimator is introduced in the next section. In Section 3.8,
two-dimensional examples are given.

Consider the test intensity function depicted in Figure 3.7 (a). Figure 3.7 (b)
depicts a realization of counts from this intensity. Note that the ﬁrst “bump” in
the low intensity region of (a) is statistically reliable—it is easily distinguished in
the count data (b). The second bump in the high intensity region is equal in size
to the ﬁrst bump, but it is not statistically reliable as is seen in the count data in
(b). Therefore, we expect that a good estimator should recover the ﬁrst bump from
the data and not attempt to recover the second. Figure 3.7 (0) depicts the estimate

resulting from the PRESS-optimal estimator8 proposed in [59]. It is an improvement

 

8This estimator is a wavelet threshold type operation that is derived using the statistical method
of cross-validation [40].

93

over the raw data, but still exhibits quite a bit of variability. Figure 3.7 (d) depicts
the intensity estimated by the Bayesian approach introduced in Section 3.5.1.

Due to the very simple—and unnatural—structure of the intensity where the cor-
responding wavelet coefﬁcients may be found in one of only two states, zero and
“large”, a simple two-component beta-mixture density model sufﬁced for the innova-
tions yJ-‘k. The parameters of the beta densities were 31 = 1 and .92 = 10000. The
mixing parameter p1 = 1 — p2 was adapted at each scale using the data-based mo—
ment matching approach proposed in Section 3.5.3. Note that the Bayes’ estimate
does an excellent job at correctly estimating the reliable features (edges and bump
in low intensity region). The experiment shown in Figure 3.7 was repeated in 1000
independent trials. The normalized MSE of the raw count estimate was 1.000, the
MSE of the PRESS-optimal estimator was 0.375, and the MSE of the new Bayes’

estimator was 0.106.

3.6 Stationary Intensity Models and Estimators

One potential limitation of the MMI model is that it is not stationary due to the
fact that the Haar wavelet transform is shift-dependent. That is, the analysis and
estimation depends on the alignment between the Haar basis functions and the data.
Moreover, the coarse scale approximation by Haar wavelets is piece-wise constant,
an unrealistic intensity model. To circumvent such problems, shift-invariant wavelet
transfonns have beenproposed in literature [60, 61, 62, 63, 36, 64]. In this section, we
provide a uniﬁed Bayesian framework for shift-invariant analysis and estimation based
on the MMI model. We formulate a fast shift-invariant estimator from this analysis,
and illustrate it with an example. Also, we characterize the autocorrelation functions
of the MMI model (non-stationary) and a shift-invariant MMI model (stationary),

and show that the shift-invariant MMI model has a more regular correlation behavior

94

 

 

['1

; 11

 

ll]

 

 

 

 

 

 

 

 

so a 80
360' g 50' 4
40- ‘0’ ’ l
20. _ , i
E ay’lkww “l" NW1.“ L...
o Sp.“ — J" 0 3pm
(a) (b)
140 14o—~
120 , l - 120
‘ ..1' "f 1., ,1
1.. 1111‘.)1:1 ‘ 1‘°°:’ : r
3°01 2 ”1
g 00] 3 00‘
. g ’
m 40’ I ’ E ‘0’ l
'. ; g ] ‘ j 1 ‘ [ l
20 _llr'»+wr«1, with-w; I 20 If i l
0211. Space “J OT Space H
(c) ((31)

Figure 3.7. Poisson intensity estimation. (a) Test intensity function. (b) Realization
of Poisson process with intensity in (a). (c) PRESS-optimal estimate using algorithm
in [59]. (d) Bayes’ estimate using algorithm described in Section 3.5.1.

which may be better suited for modeling real-world intensities.

3.6.1 A Shift-Invariant MMI Model

Shift-invariant MMI intensity models can be easily constructed within a Bayesian
framework. Speciﬁcally, the shift of the intensity function with respect to the Haar
wavelet system can be viewed as an additional degree of freedom in the model, and
a probability measure on the shift parameter can be introduced. It is assumed that
all shifts are circular. If we regard the original model as “unshifted” (shift = 0), then
the standard MMI model introduced in Section 3.4 is denoted p(Alshift = 0). Now

95

let p(shift = m) denote a probability mass function for the shift, and consider the

averaged MMI model

p(A) = Zp(Alshift = m)p(shift = m). (3.21)

If p(shift = m) is the uniform distribution, a non-informative prior expressing no
preference for any particular shift, then the averaged MMI model provides a shift-
invariant (stationary) intensity prior and we call it the shift-invariant (SI) MMI model.
It is shown in Section 3.6.3 that the SI—MMI model is more regular than the basic
MMI model.

Estimation using the SI-MMI model is easily carried out by computing the opti-
mal Bayes shift-dependent estimator given in Section 3.5.1 for each shift, and then
computing an average of the results. The complexity of the shift-invariant estimator
is 0(N2) operations. However, note that if we employ a J *-scale SI-MMI model, with
J "‘ < log2(N), then the estimator is invariant to shifts modulo 2].. This is due to the
fact that the scaling functions at the coarsest scale have support over 2". samples.
In general the J *-scale SI-MMI model only requires a uniform shift prior over a 2"
sample region of support, rather than over the entire range of circular shifts. Thus,
if J "' < log2(N), then the complexity of shift-invariant estimation is only 0(N2J).
As pointed out earlier (see Section 3.5.1), in many applications it sufﬁces to take J *
smaller than log2(N).

Note that the fast shift-invariant methods described in [61, 62] are not applicable
in this case due to the dependence between wavelet coefﬁcients across scale. This
dependence stems from the multiplicative relationship between scaling and wavelet
coefﬁcients.9 However, fast shift-invariant estimation algorithms are feasible if not

“homogeneous” (see below). We next present one such approach.

 

9The fast shift-invariant methods treat all wavelet coefﬁcients independently.

96

3.6.2 A Fast Shift-Invariant MMI Estimator

Referring to the tree representation of the intensity model of Figure 3.4 in Sec-
tion 3.4.2, we know that the mapping {A0,k}k 1—1 {A10} U {yjjb—j is invertible. There-
fore, p(A) oc p(AJ,0,yJ,0, . .. ,yl’N/2_1), and so, the shift-invariant model (3.21) may

also be expressed as

p(A) 0< 2901.0): 11115711111 = m), (322)

m j,k

where we have exploited the invariance of A JD with respect to shift, and the indepen—
dence of the set {yj,k}j,k. The notation p(yjif’k) is simply a more economical way of ex-
pressing p(yj,k|shift = m). Note that with this notation, Agfo E 11031370113110 - - (gift),
for example, is a component of A0,", and not of A09 in the averaging process.

The shift-invariant model of the previous section is characterized by the assump—
tion that any two distributions p(yﬂ‘) and p(yjmkz) are equal for all shifts m1 and m2.
If, however, we choose probabilities such that p(ygnk) = 0 for all k 75 0 the above

model becomes

p(A) oc 1901.0) 2 111903-70) p(shift = m) (323)

m .7

Clearly, this new model is also shift invariant for it is obtained by averaging over all
possible shifts, which we take to be equally likely. However, we say the model is not
homogeneous because a different model results if we choose a new set of distributions
to be the non-zero distributions in (3.22), say, for example, p(yjnk) = 0 for all j and
10, except whenever k # 1.

Despite the lack of homogenity, this intensity model has proved to be valuable as it
leads to an efﬁcient estimator of complexity 0(N). Also, note that more sophisticated

fast algorithms can be created from this general prescription by simply choosing not

97

one, but a combination of “legs” of the tree to represent the elements of the intensity
sequence as the tree shifts. We have experimented with these notions and have
formulated the corresponding estimators. Although the resulting models from the
added complexity in the construction process are theoretically more appealing for
they “close” the gap between the homogeneous and non-homogeneous models, we
have found that they add very little as to the quality of the intensity estimates. For
this reason, we have not pursued them here any further.

The procedure for the fast estimator derived from the simpler intensity

model (3.23) is the following.

 

Fast Shift-Invariant MMI Intensity Estimation

1. Estimate coarsest scale coefﬁcient
/\J,0 = CJ,0
2. Form=0toN—1andj=Jdowntol

Compute:
03% according to (3.16)
3137,10 = % (1 + 3‘70)

Reﬁne:

A

m _ Am Am
Aj—1,O - A130 310,0

 

 

 

The desired intensity estimate is A = (3130359,... ,Ago" 1). We note that these
procedural steps call for N / 2 redundant shifts in order to compute the N elements
of the intensity sequence. Although it is not hard to describe the above estimation
algorithm void of these redundancies, we feel that the description is clearer in the

form presented.

98

 

 

120’ ’— ‘ 12°). '11 thu’IIIII'I ”I

 

 

 

 

 

 

 

100’ ##T‘w ' 100. I II“. |
III I'M ‘MIIWI [I
E 80 a so
‘0' 40'
r I
20- ‘ 20> ? ‘ | ‘
"ﬁfth-1’ If ‘1 III R “PHI“ $151,“ I
or— _ —_7 Vﬂ—‘Lﬁ’ Om -_' _ m
Space Space
(a) (b)
140 no
120'- ‘20
100. 100 g
g .
8 80 E .0.
g i
3 so :3. 60
g a
13 40: 3 ‘0'
I I [ '
I
20' 1 2° “
1 _ f _'__—_ ‘ —#ilg‘—F-—
0 Space 0 Sp.“

Figure 3.8. Fast Poisson intensity estimation. (a) Test intensity function. (b) Realiza-
tion of Poisson process with intensity in (a). (c) Bayes’ estimate using the algorithm
described in Section 3.5.1. (d) Bayes’ estimate using the fast shift-invariant algorithm
described here.

Example

In order to illustrate the quality of the estimates obtained by the fast shift-invariant
estimator we repeated the example of Section 3.5.5. We created a sequence of Poisson
counts from the intensity in Figure 3.8 (a), and used them to estimate the intensity
using the fast algorithm. The result is depicted in Figure 3.8 (d). For comparison,
we reproduced Figure 3.7 (d) in Figure 3.8 (c), which shows the estimate based on
the MMI model of Section 3.5.1.

By repeating this experiment with a large number of different count realizations,

99

we found that the quality of the estimate evident in Figure 3.8 (d) was representative

of the fast shift-invariant estimator.

3.6.3 Autocorrelation Functions of MMI and SI—MMI Mod-

els

The underlying beta mixture densities capture the key heavy—tailed, non-Gaussian
nature of wavelet coefﬁcient distributions. However, the shift-dependent nature of the
Haar wavelet transform generates a non-stationary correlation structure as illustrated
next. For the sake of illustration, we focus on the 1-d case. Extensions to higher
dimensions are straightforward. Also, to keep things simple, we assume that the
maximum number of scales J = log2(N) are computed in the analysis, where N
is the length of the intensity vector. The results easily extend to other choices for
J * 7E J.

First, consider that basic MMI model (shift: 0) and let us introduce the following
notation. Let p3 E E[A‘2,’0], the second moment of the scaling coefﬁcient at the coarsest
scale, and let p? E E [yik] = E [(1 — yj,k)2], the second moment of the innovations
variates. Recall that we assume the distribution of y“, does not depend on position
k. The variables y“ and 1— y“, have a common second moment due to the symmetry
of the distribution of yJ-‘k about a

For illustration consider the correlation between the intensities A00 and A0,; in
the MMI model depicted in Figure 3.4 in Section 3.4.2. Note that the ﬁnest scale for
which A09 and A03 have a common predecessor above in the tree is j = 2 and the

predecessor is A20. Therefore we can write

/\0,0 = 311.0 312.0 A20,

/\0,2 = yi,1 (1 " 92.0) /\2.0-

100

The correlation between the two intensities is computed

E [A0,0Aog] = E [yro 311,1 92,0 (1 " 312,0) A30] -

Exploiting the independence of the innovations variates, we have

E [A0,0/\0,2I = E [311,0] ELI/1,1] E [92,0 (1 - 3/2,0)I E [A30] -

Examining the individual product terms and making use of the moments deﬁned

above,
Elyml =El311,1l = 24,
E [312,0 (1‘ 3120)] = 1/2 — pg,
E [A30] = #3103-
Hence,

Elhophoal = 2‘2 (1/2 - 03) #3393.

N ow consider a general case in which we are interested in the correlation between
two intensities, say ADJcl and /\0,k2. Let j ‘ be the ﬁnest scale for which AM, and A0,!”
have a common predecessor (above) A)”, in the MMI model tree. The scale 3'" can
be explicitly calculated using the binary representations of 101 and kg, and depends

on the exact positions of k1 and kg with respect to the alignment of the Haar basis

101

functions. Assuming that k1 < k2, we have

r-l

)‘OJci = H 31:31: yj'Jc Awe, (3.24)
i=1
j‘-1

/\o.k2 = H 312,]: (1 — yj‘Jc) Male. (3-25)
i=1

In the expressions for AM, and Auk-.2 above, we use a generic spatial index k, since
the distributions of independent innovations variates do not depend on position. We
do, however, use y“, and yg-‘k to distinguish the independent innovations variates

corresponding to AM, and ADM, respectively. The correlation is given by

j'-1 j'—1
E [Aoalhoazl = E H yank H all. yj'Jc (1 — yj‘Jc) A3231: - (3-26)
i=1 i=1

Again exploiting the independence of the innovation variates and making use of the

moments deﬁned above, we have E [Afk] = #3 Hijzj. +1 p? and

T(k1,k2) E E[A0,k1A0.k2]
J

= 2'2””) (1/2-pf-‘M3 H p? (3-27)

i=j‘+l

Note that because 3'“ depends on the alignment between the intensity function and
the Haar basis, r(k1, k2) is not a function of the difference k1 — k2 alone. This shows
that the intensity distribution represented by the MMI model is non—stationary. Two
columns (ﬁxed 101) of the autocorrelation function of a 256 length MMI model intensity
prior are shown in Figure 3.9. Note also that the autocorrelation function is highly
irregular (piecewise-constant), which is an undesirable model for real-world intensities.

Now consider the autocorrelation function of the SI-MMI model (3.21). Given
a displacement of n between two intensities, say A03 and A0,", we can compute the

probability of the ﬁnest scale j * of a common predecessor, with respect to the uniform

102

_____ _.__.*

——‘"‘ﬁ

 

 

1 - - - r(0.k)
------ r(55.k+55)
‘ -— r(k) Sl-MMI

 

 

 

Figure 3.9. Correlation functions for MMI and SI-MMI 256-point (J = 8) intensity
priors. MMI model autocorrelation r(0, k) (dash-dash) and 7‘(55, k + 55) (dash—dot)
plotted as a function of k. Stationary SI-MMI model autocorrelation r(k) (solid).
For both the MMI and SI-MMI models in this example, #3 = 1 and p12- : 0.26, j =
1,. . . , J.

distribution over the shift parameter, as follows. We need to count the number of
shifts that give rise to each possible value of j", or more precisely the probability
of the set of shifts that give rise to each possible 3". For example, suppose n = 1
and consider the tree in Fig. 3.4. In this case, four shifts (out of eight) result in
j“ = 1, another two shifts result in j* = 2, and two shifts (one wrapping around
due to circularity) result in j" = 3. Hence, we compute p(j" = 1|n = 1) = 1/2,
p(j“ = 2|n = 1) = 1/4, and p(j“ = 3|n = 1) = 1/4. Similarly, if n = 2, then
p(j’ =1|n = 2) = O, p(j“ = 2|n = 2) = 1/2, and p(j* = 3|n = 2) = 1/2, where again
we must be careful to account for the wrap-around effect of the circular shifting.
Given a displacement of n = k between two scaling coefﬁcients in a J -scale MMI
model, the probability of the scale 3'“ is determined by inspecting the associated
binary tree. We need only consider |k| S 23"1 due to the periodicity of the MMI.
First note that we have p(j“ _<_ J | n = k) = 1. Next, notice that (2“ - k)+ is the

number of shifts of a length-k sequence that ﬁt within a length-2’" sequence, where

103

(3:)+ = max(:z:, 0). In other words, (2m — k)+ is the total number of scaling coefﬁcient
pairs spaced 1: apart at the bottom of a m-level binary tree. Also note that the
number of m-level subtrees, m < J, at the bottom of a larger J-level binary tree is

precisely 2J‘m. Hence, for m < J

130'" S m In = k) = (2m - |k|)+ 2""‘2"

= (2m — |k|)+ 2"". (3.28)

It follows that

p(mlk) E 100'
0, m < [10g2(|k|+1)l
1- 2"'"lkl, m = (10g2(lkl +1)l
2""lkl, [log2(|k| +1)1 < m < J
|k|2‘J+1, m = J,

where [log2(|k|+1)] denotes the smallest integer greater than or equal to log2(|k| +1).

With these probabilities deﬁned, the autocorrelation function is given by

J
7‘(k) E E[/\0,1)\0,1+k] = Zump(m|k), (3.29)
m=1
where
J
um22*2‘m-1’(1/2—p3..)u3 H p? (3.30)
i=m+1

is simply the autocorrelation between two intensities at the bottom of an MMI binary
tree model for which 3"“ = m, as in (3.27). Note that the SI—MMI autocorrelation

is stationary. The autocorrelation function for a 256 length SI-MMI intensity prior

104

is shown in Figure 3.9. Unlike the autocorrelation of the MMI model, the SI—MMI
model’s autocorrelation is piece—wise linear (compared to piece-wise constant) and
hence is more regular and potentially better suited for the analysis of natural intensi—
ties. These results are similar in spirit to the analysis in [61], where it is observed that
the shift—invariant Haar wavelet transform is line-preserving.10 Moreover, the results
here suggest possible schemes for choosing the parameters of the SI—MMI model. For
example, the decay of the autocorrelation function may be tailored by appropriate
choices of the pf, i = 1,... ,J. Larger values for the p? cause r(k) to decay faster.
We also note that similar correlation analysis may be carried out for related wavelet-
domain signal models developed for Gaussian data [46, 34, 11, 15, 47]. We also note
that the issue of combining or mixing trees (which is essentially what is being done
in the SI-MMI model) has been studied in the more general setting of the Polya tree.
Some very interesting theoretical results concerning the continuity of densities gener-
ated by mixtures of Polya trees (SI-MMI models are a special case) are given in [54].
These results may provide further insights into the SI-MMI model and may suggest

other extensions of our framework, but we have not pursued this here.

3.6.4 SI—MMI Model and 1 / f Processes

Remarkably, the SI-MMI correlation function has a fractal 1 / f -like character. Fractal
1 / f random process models are commonly used in image modeling, and it has been
observed that natural signals and images often display a correlation structure similar
to that of the SI-MMI model [65, 66]. To see that the SI-MMI correlation is 1 / f ,

consider the simple case in which the second moments of the innovations are constant

 

10Theﬁe conclusions extend to the SI-MMI model as well.

105

independent of scale, i.e., pj = p, j = 1,... ,J. Then

um. = 2'2”“) (1/2 - p?) #319” 7’" (331)

= C (2p)-2m, (3.32)

where C is a constant independent of m. Next equate (2p)’2m with 2‘74”" and solve

for ’7. This gives
um = 020*)“, (3.33)

where '7 = —1—log2p2. Note that 1 / 4 < ,02 < 1 / 2, where the lower and upper bounds
corresponds to the two extreme limits of the beta density: a point mass at 1/2 or
two point masses at 0 and 1, respectively. This implies 0 < 7 < 1. Combining (3.33)

with (3.29) and after some algebra, we have

TU?) .__ C(2(‘r-1)Ilog2(|k|+l)] _ﬁlkl2(,_2)J

+ 5| k|2(7—2)110g2(|kl+1)1) (334)

for lkl > 0, where 6 = Egg—3%. In the case k = 0, r(0) = H3P2J- Finally, making use

of the approximation 2l10g2('k'+1)] 2 [It], for Ikl > 0 we have
r(k) z C [(1 - ﬁ)lk|<7-1)+ BlklZW'W] . (3.35)

For large J and '7 < 1, the term ﬁ|k|2(7‘2)J is negligible, and hence the correlation
function behaves like |k|("“1) and the power spectrum decays like l-f—IF (see [67] for
the relationship between autocorrelation functions and power spectrums of 1 / f pro-
cesses). That is, the SI-MMI model produces non-negative, stationary processes with

1/ f characteristics. For more details on SI wavelet models and 1 / f processes see [64].

106

3.7 Numerical Comparison of Wavelet-Based In-
tensity Estimators

Here we compare the performance of the new Bayesian estimation algorithm with
several existing methods. To assess the performance of each method, four test inten-
sity functions were used. These functions were the “Doppler,” “Blocks,” “HeaviSine,”
and “Bumps” test signals proposed in [12]. Each test function is 1024 samples long
(i.e., N = 1024, J = 10). These functions serve as benchmark tests for signal es-
timators, and they were designed to be representative of a variety of natural signal
structures. We refer the reader to [12] for more information‘about the test functions.
Since the intensity functions must be non-negative, each test function was shifted
and scaled to obtain an intensity with a desired peak value and a minimum value of
m. Realizations of counts are generated from each intensity using a standard
Poisson random number generator [23]. We compare the performance of the simple
estimator based on the raw counts (COUNT), a shift-invariant version of the cross-
validation estimator11 (CV) proposed in [44], the SI-MMI model estimator described
in Section 3.6.1 with a three component beta-mixture model for the innovations with
parameters12 31 = 1, 32 = 100, and s3 = 10000, the square-root estimation methods
using a shift-invariant version of the Haar wavelet transform (D2), and the square-root
estimation method using a shift-invariant version of the Daubechies-8 (D8) wavelet.
The method proposed in [14] is not compared since it is derived under a “burst-like”
process model which is not appropriate for these test functions with the exception of
the Bumps function. '
The square-root method ﬁrst computes the square—root of the counts, then treats

the square-root data as though it were Gaussian, takes the shift-invariant discrete

 

11See footnote 8 in page 93.
12The mixing parameter p1 = 0.001, and parameter p2 (and hence 193) is determined using the
data-adaptive moment-matching method given in Section IV-C.

107

Table 3.1. AMSE results for various test intensities and estimation algorithms. Peak

 

 

 

 

 

intensity = 8.
Intensity C O U N T CV BA YES D2 D8
Doppler 0.1786 0.0588 0.0154 0.0548 0.0443
Blocks 0.1935 0.0617 0.0178 0.0700 0.0800
H eavz'S ine 0.1833 0.0552 0.0052 0.0294 0.0300
Bumps 0.5207 0.1877 0.1475 0.4570 0.4317

 

 

 

 

 

 

 

Table 3.2. AMSE results for various test intensities and estimation algorithms. Peak

intensity = 128.

 

 

 

 

 

 

 

 

 

 

 

Intensity CO U N T C V BA YES D2 D8
Doppler 0.0111 0.0047 0.0026 0.0095 0.0059
Blocks 0.0120 0.0040 0.0027 0.0077 0.0126
HeaviSz'ne 0.0115 0.0039 0.0007 0.0036 0.0028
Bumps 0.0324 0.0171 0.0143 0.1046 0.0908

 

 

 

wavelet transform [61, 62], applies a soft-threshold nonlinearity to wavelet coeffi-
cients, and computes the inverse transform of the thresholded coefﬁcients. After this
processing, the result is squared to obtain an intensity estimate. For both square-root
methods the universal threshold proposed in [12] was used.

We consider both the D2 and D8 wavelet (which can be applied in the case of
Gaussian data) to demonstrate that the Haar-based method introduced here can out-
perform the square—root method even when more regular wavelets like the D8 are
used. All methods employ a 5-scale wavelet transform. In practice, we could use
full J-scale transforms, but their performances were roughly the same as that of
the 5—scale transforms in the experiments, and the 5-scale transforms are more com-
putationally efficient. Table 1 gives the average mean-square errors (AMSE) of the

various methods for a peak intensity of 8. Table 2 gives the AMSE of each method for

108

a peak intensity of 128. The AMSE is estimated using 25 independent trials in each
case, and each AMSE is normalized by the squared Euclidean norm of the underlying
intensity function. Inspection of the tables shows that all methods offer signiﬁcant im-
provements over the simple count estimator. Moreover, the SI—MMI based estimator
outperforms all others in every case. We also note that similar tests and compar-
isons were made with the shift-variant versions of each estimator. As expected, the
shift-variant estimators did not perform as well as their shift-invariant counterparts.
However, the MMI model estimator outperformed the other shift-variant methods in

all cases as well.

3.8 Application to Photon-Limited Imaging

In this section we apply the MMI models and estimation procedure to the problem
of photon-limited imaging. Photon-limited imaging arises in many ﬁelds including
medicine and astronomy. The fundamental problem in photon-limited imaging is the
variability due to quantum effects in the emission and detection of photons. In many
problems, the photon counts collected during image acquisition are well-modeled by
a temporally homogeneous and spatially inhomogeneous Poisson process.

Assume that we detect photon emissions in a compact region of the plane. The
photon emissions are the result of an underlying two—dimensional continuous intensity
function. We are interested in estimating the intensity function from the photon
detections. For practical reasons. (computing and display), we seek an estimate of
the intensity at a ﬁnite scale (resolution) represented by a ‘pixelized’ intensity. A
crude estimate of the pixelized intensity is obtained by simply counting the number
of photons detected in each square pixel region of the plane. This “count” image is
highly variable due to the random nature of the photon emission process. However,

lower resolution images, obtained by counting the number of photons detected in

109

larger square pixel regions of the plane, provide better (less variable) estimates of
the low-resolution intensities. This illustrates the advantage of multiscale analysis in
photon-limited imaging. Relatively reliable coarse-scale estimators of the intensity
can be leveraged to obtain ﬁner details using the multiscale Bayesian framework.

To illustrate the effectiveness of the framework in photon-limited imaging appli-
cations, we consider a simulated experiment and a real—world application to nuclear
medicine imaging. Note that the MMI models and estimator are easily generalized to
two dimensions. Speciﬁcally, we take the 2-d multiscale parameters to be the factors
corresponding to the multiplicative reﬁnement of a coarse scaling coefficient (inten-
sity) into four ﬁner scaling coefficients by ﬁrst splitting it vertically (horizontally) into
two halves, then next horizontally (vertically) splitting each half into two quarters.
That is, if A“ is a 2—d intensity function, then we deﬁne A0,)” E A“ at the ﬁnest

scale j = O and for coarser scales take

 

 

)‘j+1,k,l = Aj,2k,2l + )‘j,2k,21+1 + Aj.2k+1,21 + /\j.2k+1.2z+1,
yr M21321 + /\j.2k.21+1
j+lvkvl 9
More: + /\j,2k.21+1 + ’\j.2k+1,21 + /\j,2k+1,21+1
2 )‘j.2k.2l
yj+1,k,l

9
Amer + )‘j.2k.2l+1

,\.

3 _ 1.2k+1.2l

yj+l.k,l — A A ‘ (3.36)
jolt-+1.21 + j.2k+1.2l+1

 

Note that in the 2—d case we have three sets of multiplicative innovations, one vertical
set y1 and two horizontal sets 3;2 and y3. In the analysis of count images, each
scaling coefﬁcient is the sum of four counts, and each wavelet coefficient is simply the
difference of two counts. Hence, all the machinery developed for the one-dimensional
case, based on sums and differences of pairs of counts, is immediately applicable
to two (or even higher) dimensional data. Note that this 2-d multiscale analysis

deﬁned above differs from the standard 2-d Haar wavelet analysis [68], which involves

110

a vertical, horizontal and diagonal differences. We use the alternative 2-d analysis
because, unlike the standard 2-d Haar analysis, it allows us to decouple the Poisson
problem, just as in the 1-d case. In both experiments, the 5-scale Haar transform is
employed and a three component beta-mixture density is used for the SI-MMI model
of Section 3.6.1 with ﬁxed shape parameters .91 = 1, 32 = 100, and 33 = 10000. The
mixing probability p1 is fixed at 0.001, and p2 (and p3) is adapted to the data at each
scale using the moment-matching method described in Section 3.5.3. We have found
these choices of 31, S2, and 33, combined with the ﬂexibility of the data-adaptive p2,

to provide very good results for a wide-variety of imagery.

3.8.1 Photon-Limited Imaging Simulation

Figure 3.10 depicts a typical realization of a simulated photon-limited imaging appli-
cation and the resulting estimates provided by the MMI and SI-MMI models. The
maximum intensity in the image in Figure 3.10(a) is 60.00, and the average intensity
is 25.40. Hence, this simulation models a fairly low intensity (low SNR) imaging
problem. A realization of counts is generated from this intensity using a standard
Poisson random number generator [23]. Note the visual improvement provided by
the MMI and SI-MMI model estimates in Figure 310(0) and ((1), respectively, in
comparison to the count image Figure 3.10(b). Furthermore, the estimate based on
the SI-MMI model appears to be better than that of the MMI model. In fact, in 25
independent trials of this experiment we estimated the average mean squared pixel
error to be 25.28 for the count image, 7.36 for the MMI model estimator, and 4.01

for the SI—MMI model estimator.

111

 

Figure 3.10. Photon-limited image estimation using MMI models. (a) intensity func—
tion, (b) realization of Poisson counts (average squared pixel error = 24.80), (c) in-
tensity estimate using MMI model (average squared pixel error = 7. 36 ), (d) intensity
estimate using SI-MMI model (average squared pixel error = 3.98).

3.8.2 Application to Nuclear Medicine Imaging

Nuclear medicine imaging is a widely used commercial imaging modality [69]. Un-
like many other medical imaging techniques, nuclear medicine imaging can provide
both anatomical and functional information. However, nuclear medicine imaging has
a much lower signal-to-noise ratio relative to other imaging techniques. Hence, im-
provements in image quality via optimized signal processing represent a signiﬁcant

opportunity to advance the state-of-the-art in nuclear medicine.

112

Nuclear medicine images are acquired by the following procedure [69]. Radioac-
tive pharmaceuticals that are targeted for uptake in speciﬁc regions of the body are
injected into the patient’s bloodstream. As the radioactive pharmaceuticals decay,
gamma rays are emitted from within the patient. Imaging the gamma ray emissions
provides a mapping of the distribution of the pharmaceutical, and hence a mapping
of the anatomy or physiologic function of the patient. Gamma rays are detected and
spatially located using a gamma camera, which converts gamma rays into light. Pho-
tomultiplier tubes then detect and locate the emissions. The raw nuclear medicine
data is an image of photon detections (counts). The raw data may be viewed directly
or used for tomographic reconstruction. The major limitation of nuclear medicine
imaging is the low-count levels acquired in typical studies, due in part to the limited
level of radioactive dosage required to insure patient safety. Because of the variabil-
ity of low-count images, it is very common to employ a post-ﬁltering or estimation
procedure to obtain a “better” estimate of the underlying intensity. An excellent
discussion of the potential diagnostic beneﬁts of various frequency domain processing
methods is given in [5]. Advantages of multiscale methods over frequency domain
methods for photon-limited imaging problems are discussed in [59].

To illustrate the potential of our multiscale Bayesian framework in nuclear
medicine imaging, consider the spine and heart studies depicted in Figure 3.11. Fig-
ure 3.11(a) depicts the count image from a nuclear medicine spine study. The radio-
pharmaceutical used here is Technetium-99m labeled diphosphonate. In bone studies
such as this, brighter areas indicate increased uptake of blood in areas where bone
growth is occurring. This may reﬂect areas where bone damage has occurred. Func-
tional changes in bone can be detected using nuclear medicine images before they will

show up in x-ray images. The maximum count in this image is 178 in the “hot-spot”
at the bottom of the spine. The maximum count in the upper portion of the spine

is 75. Figure 3.11(b) shows the SI-MMI model estimate of the underlying intensity.

113

 

(C) ((0

Figure 3.11. Nuclear medicine image estimation using SI-MMI models. (a) Spine
count image. (b) SI-MMI model estimate of underlying intensity. (c) Heart count
image. (d) SI-MMI model estimate.

Figure 3.11(c) depicts an image of a heart obtained from a nuclear medicine study.
The image was obtained using the radiopharmaceutical Thallium—201, and the max-
imum count is 33. In this type of study, the radiopharmaceutical is injected into the
bloodstream of the patient and moves into the heart wall in proportion to the local
degree of blood perfusion. The purpose of this procedure is to determine if there is
decreased blood ﬂow to the heart muscle. Figure 3.11(d) shows the SI-MMI model
estimate. In both studies, we see that the SI-MMI model estimator preserves the

important image structure and has signiﬁcantly lower variance compared to the raw

114

count image. The intensity estimates provided by the SI—l\»"Il\/’II model may enable

better diagnosis in clinical nuclear medicine.

115

CHAPTER 4

Emission Computed Tomography

In this chapter, we extend the MMI models and estimation approaches to emission
tomography imaging. We introduce two extensions: one based on geometrical consid-
erations of natural pixels of images, and the other based on a new multiscale inversion
approach to the tomographic image reconstruction problem. The new models are used
to represent the intensity sinogram of the projected images, from which their optimal
estimation can be computed prior to reconstruction. We illustrate the performance

of the new estimators with examples using synthetic and clinical data.

4. 1 Preliminaries

The ﬁrst explicit formula for the reconstruction of a function on a plane given its
integrals over lines was ﬁrst derived by Radon in 1917. Although it was not until the
late sixties that practical applications of the Radon formula started to appear, ﬁrst in
radioastronomy and then in electron micrography [70], the practical computed tomog-
raphy that it gave birth to has since become one of the most important tools in many
branches of science for processing information that otherwise would be inaccessible.
For example, in astronomy and geology, imaging of the Sun’s and Earth’s internal

structures have been possible only by tomographical techniques [71]. It is perhaps in

116

medicine where we have witnessed the greatest impacts of this technology. Imaging
the internal structure of human patients by noninvasive methods is frequently the
safest and most expeditious method in determining their conditions, and thus, often
the most desirable method for diagnosis.

Although our interest is in emission computed tomography (ECT) in general,
in this chapter we focus exclusively on nuclear medicine tomography—more speciﬁ—
cally, in single—photon emission tomography (SPECT)—f0r the following reasons. The
problems encountered in nuclear medicine tomography typify many of the problems
encountered in ECT at large; and, concentrating in SPECT in particular, permits us
to be more speciﬁc in the treatment of the subject. Also, nuclear medicine tomog-
raphy represents one of the most challenging applications due to typical low SNR
regimes under which it takes place; therefore, the methods presented here are tested
for robustness. Further, nuclear medicine tomography represents one very important
application that directly affects many people’s quality of life.

There exists a number of tomographical methods for diagnostic imaging. For ex-
ample, x—ray computed tomography (x—ray CT), magnetic resonance imaging (MRI),
positron emission tomography (PET), and SPECT. The physiological basis of emis-
sion computed tomography medical imaging (e. 9., PET and SPECT), opposed to the
anatomic approach of transmission imaging (e.g., x—ray CT and MRI), are unique
in that measurements of functional abnormalities are readily possible from a single
image.1

In nuclear medicine imaging a radiopharmaceutical, which is targeted for uptake
by some speciﬁc organs, is administered to the patient. As the process of nuclear
decay takes place in the radiopharmaceutical, gamma-ray photons are emitted in all

directions. Those photons incident to an array of detectors placed in close proximity

 

1Although functional measurements in transmission-based imaging are possible, these require
more complex procedures involving sequences of images obtained at controlled time-intervals that
must be carefully registered [72, 73].

117

  
   

)

‘2
ll"
1
i

ii
\\ j)

ll;
‘ (a z
A <1:

.t€;’ \\ 5'
r a

1% we

1. . to .f

i

‘1
{,4
I

,I
/
/

Figure 4.1. Tomographic data collection geometry. The array of detectors collect and
count the number of photons induced by the arriving gamma rays emanating from
within the subject. The array is repositioned at equal angle intervals about the long
axis of the subject so as to obtain a sufﬁcient number of projections.

to the patient and in the vicinity of the organs of interest are counted. The number of
photon counts are then interpreted as a function of physiological activity or intensity.

For tomographical image reconstruction of a transversal slice of the body, mea-
surements of photon counts are made at many angles about the long axis of the body.
Other image geometries are also possible, in which case, the axes about which mea-
surements are taken are also perpendicular to the images’ planes. Figure 4.1 depicts
a typical data measurement geometry.

In modern non-diffracting tomography (e. g., all the tomographic methods re-
ferred to thus far), Radon-transform, algebraic, and iterative-based reconstruction
algorithms are the main general approaches to image formation. Radon-transform
based methods are almost exclusively used in nuclear medicine because algebraic
and iterative reconstructions generally are more computationally intensive. One very
Popular and efﬁcient algorithm for carrying out the reconstruction process is the

ﬁltered-backprojection (FBP) method, which in effect implements the inverse Radon

118

transform [74].

4.2 The Image Reconstruction Problem

A typical data gathering geometry for parallel-scanning tomography in SPECT is
depicted in Figure 4.2. In nuclear medicine, the intensity f (x) represents the con-
centration of radioactive material at a point x in space, and is typically the quantity
of interest. 17 and 9 are orthogonal unit vectors which together deﬁne the projection
plane on which the data collection process takes place. The plane contains the detec—
tor array, positioned according to the vector 0. The convergence point of a projection
line (ray) on the plane is identiﬁed by s, the number of units along 0. Thus, we speak
of the projection of f at 30. The vector x and the projection rays lie on the n-O
plane.

The geometry in Figure 4.2 depicts a data collection process in which the object,
and not the array of sensors, is rotated in order to capture the projection data. The
two schemes are equivalent as far as the processing of the data is concerned, however,
our choice simpliﬁes some of the notation. During the process, the object f is rotated
through angles —0 about C in the right coordinate frame (11,0, ()2.

The scanning process as just described is an idealization because it models a
process that originates from zero volume of active material—the material “contained”
in the n-O plane. In order for the model to be meaningful, f must be interpreted
as the average concentration of radioactive material within a neighborhood of this

plane: let fvol represent the true volumetric density of this material, and let a be the

 

2Note the slightly overloaded use of the theta symbol: in bold phase, it stands for the array of
sensors’ unit vector, and in plane phase, it denotes the relative rotation of this unit vector and the
object. The distinction is clear and should not cause any problem.

119

 

Figure 4.2. Projection geometry for the intensity f (x): The unit vector 9 deﬁnes
the direction of the detector array on a plane perpendicular to C, While 3 identifies a
sensor Within the array. f (x) is the density of radiopharmaceutical at x.

aperture gain (loss) of any one detector’s collimator, then

 

f(x) '3) = / a(:(:{‘:08‘;)tC) fvol(x + tC) dt

is the effective density of radiopharmaceutical which is responsible for the activity
sensed by the detector at 50. If, indeed, this density is a function of s as indicated,
it would prove the poor selectivity of each collimator’s aperture, resulting in blurring
effects. In general, however, the aperture for rectangular collimators may be expressed
as the product of the in-plane gain a1, and the gain or) due to the displacement in the

C-direction, i.e., a(x — 30 + tC) = a1(x — 30) a2(t(). Thus,
f(x) = / a(x) fvol(x+tC)dt (4.1)

Clearly, the narrower the support for a2, the ﬁner the detail conveyed by f in the

C-direction, however, at the expense of lower intensity and greater susceptibility to

120

noise.

As photons travel towards the array of detectors from within the subject, many
are absorbed by nuclear interactions with the surrounding matter and only a fraction
of the original number survive to be registered and counted. Let u(z)dt be the
probability that a photon reaching the point z will be absorbed within dt units from
2. Then, the number of photons 09(3) that during a time interval T reach the detector
at 30 due to the radioactivity at a point x with the object rotated an angle —0 is

Poisson distributed:
c9(s)|)\9(s) ~ Poisson(/\g(s)),
where the underlying intensity’s projection A9(s) is given by3

A9(s) = / a(x — se) f(Rg(x))e'I10 ”(Maul-“898‘ dx. (4.2)

f is as deﬁned in (4.1), u is known as the absorption loss coefficient, R9(x) = eK9(x—
17) + n, and K = (‘1’ ’01). The time units have been chosen so that T = 1.

Since ex“ is the transformation that rotates a vector by an angle 8 about C, R9(y)
transforms the vector y by rotating its y—n component by this angle. Thus, f (R4)(x))
is a measure of the radioactivity at x after the object has been rotated by the angle
—8 about C. The exponential efo1 “malfxﬂhtlaalldt gives the total absorption loss for
the trajectory through the rotated object from x to 30.

Recovering f from the set of projections /\9(8) represents a very difficult problem.
There exists, however, an exact solution to (4.2) when the attenuation function u is

constant on a convex region that includes the support of f (assumed compact), and

if the aperture gain at 39 is zero everywhere off the perpendicular to 0 at 36 [75].

 

3Since all but the unit vector C considered here lie on the (17,0) plane, we restrict the notation
to a two-coordinate format from now on.

121

Since in practice the absorption coefficient is generally unknown, it is common
to ignore it and compensate for its effect by postprocessing the reconstructed image.
This approach is computationally intensive and leads to suboptimal results as the
effect from absorption is nonlinear and, consequently, cannot be fully accounted for
independently of the reconstruction process [76]. For this reason, in applications in
which the value of ,u is small, it is often ignored altogether. For the remaining of
this chapter, we only consider examples where it is safe to do so.4 For this simpliﬁed

model the solution is

 

)=2—7r2‘/‘1r /( (—x 77) )r(81()90_3d8d6' (4.3)

The superscript T denotes transposition, and Ag(s) stands for the derivative “355—31.
This expression is one incarnation of the celebrated inverse Radon transform and is
the basis for the FBP reconstruction technique—see Section 4.3 and [75].

The assumption of having only those photons counted that strike the detector
array at right angles implies the decoupling of the counting processes taking place
among the sensors. This is an idealization that is invariably made in practice, but
which results in blurred images. Much of this blurring, however, can be removed by
high-pass ﬁltering if noiseless estimates for the projections Ag are available. Since the
actual sensors’ aperture gain a acts as a spatial low-pass ﬁlter on the projections, the
linear shift-invariant operation may be reversed by operating on the reconstructed
image given that the inverse Radon transform and convolution operators commute.
For simplicity’s sake, we ignore blurring effects in the examples to follow and concen-
trate on the most detrimental factors in nuclear medicine tomography imaging: data

variability due to its Poisson statistics.

 

4In Section 4.7, however, we show how to correct for the lack of symmetry in the effective
attenuations associated with A9(s) and A9+1800 (s) when present.

122

ECT methods aim to recover the function f from a ﬁnite set of measurements c9(s).

For example, for K detectors in the array and N projections, s = —%, . .. ,§ — 1
and 8 = 0, 2W”, . .. ,(N — n27}. Sometimes it is helpful to work with a normalized

length for the array of sensors. We choose a length of 2 length-units per aperture
(per array of sensors) so that. now, 3 = —1,—1 + 725, . .. , 1 — %. Also, in the interest
of brevity, we use the alternate indexing c2, which denotes the counts from projection
angle0=iiQW7r andsensors= %k—1. Here, 7220,... ,N—1andk=0,... ,K—l,
where it is always assumed K = 2J for some positive integer J.

The data vectors c" E [c}_1, . . . ,c3]T arranged in matrix form c E [c°, . . . ,cN‘l]

constitutes the data sinogram. The corresponding intensity sinogram is given

by A E [A°,... ,AN '1], where each column vector is similarly deﬁned as A” E
[A'I’{_1,... ,A3]T. Note that
2(k+1)/K—1
2:] Azwwn(s)ds, (4.4)
2k/K—1

the highest resolution unnormalized Haar scaling coefficient at position (sensor) k of
the projection at 9 = ~1;—(;-n.A similar relation holds for CE, which is the total photon
count in the (% - 1, @K — 1) interval.

Each set of K measurements in a projection constitute a ﬁnite sampling of (4.2),
and cannot strictly satisfy the Nyquist sampling requirement for the object f is of
ﬁnite extent, and therefore, of inﬁnite frequency support. Consequently, the quality of
the output of any method that one may devise to manipulate the data is compromised.
Since there exists no recourse to amend this common degradation, we assume the

“K-sampling” to be ﬁne enough that all high frequency features of interest may be

reconstructed without distortion.

123

4.3 The Filtered-Backprojection Reconstruction
Technique

The F BP method for inverting the Radon transform of an image may be deduced using
the Fourier Slice Theorem [77]. Using two—dimensional polar notation for the object
function, i.e., writing f (r, ()6) instead of f (x),5 and similarly for its Fourier transform,
this theorem simply states that f(w, 6) = Ap(w), that is, the Fourier Transform f may
be written in terms of the projections’ Fourier transforms.

Since the Fourier inversion formula can also be expressed as f (r, (t) =

47; f0 f0; f(w 6)e|w| “”05“ 4’) dwd6, a simple substitution leads to
f(r, 46): é]; TAO (rc-os(6 ¢))d6 (4.5)
where TA9(s _ 27 f0; /\,9(w )[w] e’“ did. This transformation represents a high-pass

ﬁlter operation on the projections A9. The frequency response of the ﬁlter is [W].
The integration (4.5) suggests that f (r, Q5) is the result of a projection operation on
TAg, thus, the name ﬁltered-backprojection method. This projection, however, is not
equivalent to the projection process by which the set of Ag’s are formed, for some
non—linear weighting is implicitly applied to 7A9, but we shall not elaborate on this.

Practical Radon inversion algorithms often begin by discretizing the expression
in (4.5). Although radically distinct in nature to algebraic approaches (ART), these
methods may become similar in form [78]; but, most often, (4.5) is implemented by
means of the FFT for computational efﬁciency reasons. When this is done, band-pass

ﬁlters are chosen to replace the high-pass ﬁlter so that excessive noise ampliﬁcation

 

5The various parameters relate as follows. Referring to expression (4.3), (x - n)TeK96 is a dot
product of the vectors x — n and eK96. Their magnitudes are Ix — n] = r and 1, respectively. Thus,
(x — n)TeK96 = rcos(u), for some angle 11, and <5 = 6 — u.

124

will not occur past the object’s main frequency band. These ﬁlters by necessity
represent a compromise between recovering high resolution details and minimizing

high-frequency noise.

4.4 Some Limiting Aspects of Conventional Re-
construction Methods

Classical reconstruction methods applied to SPECT produce undesirable and highly
variable image reconstructions due to the Poisson statistics of the projection data.
In practice, it is common to postprocess the raw reconstructions with a low-pass
smoothing ﬁlter to improve the images. This approach often produces visually ap-
pealing results, but can lead to detrimental loss in resolution and ﬁne detail structure.
More advanced methods estimate the intensity sinogram A from the noisy data. For
example, Wiener ﬁltering [79] approaches have been applied to the projections prior
to reconstruction. Furthermore, (fully) Bayesian approaches to this problem have
been proposed based on Markov random ﬁeld models, e. g., [80, 81]. However, these
methods are very computationally expensive.

One possible remedy to these methods’ complexity consists of simply estimating
the individual intensity projections A" from the set of counts, and then applying an
inversion method (e.g., FBP) to obtain an approximation to the function f. This
method is widely used in ECT [69, 75, 77], and in practice, the estimation is carried
out by simply low-pass ﬁltering the data. This approach, although straightforward,

is highly suboptimal for the following reasons:

1. Due to the Poisson distribution of the data, the noise is signal dependent,
and so, the linear ﬁltering process cannot optimally remove it everywhere—see

Chapter 3.

125

2. High frequency intensity details in the underlying signal projections are as at-

tenuated as the noise.

3. Useful information that may exist in adjacent and nearby data projections is

not exploited towards leveraging the estimation process.

4. The greater residual noise in the projection estimates due to points 1 and 3 leads
to a degraded performance of any subsequent restoring operation, including the
inverse Radon transformation. For example, a deblurring ﬁlter (essentially a

high-pass ﬁlter) will emphasize noise in high-frequency bands.

One very simple approach to improving the quality of the estimates for the un-
derlying intensity’s projections is to impose the 1-d MMI model on each individual
projection, and estimate accordingly. Doing this, however, induces the strong con-
dition that any two consecutive projections X’ and An“ be independent. Since the

n+1,and

actual projections’ estimates are driven by the corresponding data c" and c
since these data projections are correlated, the effect of this assumption is lessened.
Nevertheless, given that in nuclear medicine the data’s SNR is typically low, the un-
realistic prior model tends to exert a strong inﬂuence on the estimator’s outcome.
Our own studies indicate that this does in fact occur, with the consequence of having
highly detrimental circular artifacts appearing in the reconstructed images.

A step towards correcting for the projections’ independence condition consists of
imposing the 2-d MMI model on an image in the projection space (i.e., sinogram.)
In the following sections, we introduce two extensions to the basic 2—d MMI modeling

approach introduced in the last chapter, and which lead to estimators that produce

well—coordinated estimates of the intensity sinogram and improved reconstructed in-

tensity image.

126

   

(a) (b)

Figure 4.3. Shepp—Logan head phantom. (a) Phantom intensity. (b) Phantom inten-
sity sinogram.

4.5 First Approach to Multiscale Modeling and
Estimation of ECT Intensities

One important characteristic of a sinogram A is that point sources in the original
object f correspond to sinusoids in the sinogram (thus, its name); each point corre-
sponds to a unique amplitude-phase pair, but all have the same period. As a result,
a very strong correlation exists among consecutive projections. This phenomenon is
clearly displayed in Figure 4.3(b), which is the intensity sinogram of the Shepp-Logan
head phantom image in Figure 4.3(a).

Correlation among neighboring pixels suggests that a MMI-like model may be
applicable to the intensity sinograms as they appear to be characterized by smooth
intensity variations with a few discontinuities. To understand the mechanism by
which this characteristic emerges in sinograms corresponding to intensity images that
are themselves well—modeled by the MMI paradigm, the concept of natural pixels
(NP) is helpful. Originally developed for the incomplete data tomographic problem

[78, 82], NPs give us the means to visualize the relationship of corresponding samples

127

5?’
PT

my:
3’:
it i
by:
PF

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

‘k+1 My“; ' 111
/""'\ f \ ’ [I]
Cf \ j
\) <3
<3 V )i ’
L’L’T L) NPs/ T
(a) (b)

Figure 4.4. Natural Pixels. (a) Projection samples at scale j are obtained by integrat—
ing the object over strips 2j “sensors” wide. Thus, the highest resolution achievable
is determined by the sensors’ apertures, and correspond to samples at scale j = 0.
(b) Natural pixels are delimited by the overlap of the integration strips corresponding
to the same sample position It at two consecutive projection angles 71 and n + 1.

in consecutive projections. We use this newly gained insight to extend the basic MMI

approach of Chapter 3 to the ECT modeling and estimation problem.

4.5.1 The SI-MMI Method

In emission tomography, sinograms are 2—d positive functions, and so we may represent
them at multiple scales just as we did the intensity images of Chapter 3. As before,
the representation at scale j is denoted by A,-, where for j = 0, A0 E A. Clearly,
A,- may be obtained directly from the object f by simply projecting through wider
swaths.

The set. of projection strips belonging to two consecutive projections A? and A?“
at scale j are shown in Figure 4.4(b). The overlap of the two samples A2,, and

A}? delimit the natural pixel NPg",c shown at the center of the ﬁgure. It is clear

128

that, under the smooth-intensity—variation-with-a—few-discontinuities assumption, the
greatest contribution of the object to these samples comes from the Nng region,
and only a small fraction AA?)c is contributed by the regions outside it. That is,

3’), = NP},c + AAZk where typically Nng >> AAy‘k; except at ﬁne scales, where
the residuals AAg-fk’s may take on signiﬁcant values as the areas of integration they
encompass increase and those of the NP’s decrease.

In the object’s sinogram, A?)c and AZ? appear as two horizontally adjacent pixels
at the kth row. The perturbation 5 associated with these pixels can be written in
terms of the natural pixel they deﬁne:

AA}, — mg?

6 = n n n z 0 (4.6)
AA“, + AAJ-jtl + 2NPM

 

This shows that the “horizontal” perturbations will tend to be concentrated around
zero, and given the symmetry of the NPs, these will be symmetrically distributed
as well. Note that for the perturbation to take on an extreme value of 1 or -1, it is
necessary that either AAg-fk or AAEII, but not both, assume the -NP value; a very
unlikely situation as demonstrated by the geometry of integration.

Perturbation variates corresponding to differences between vertically adjacent
samples in the intensity sinogram also behave in the desired manner. This is be-
cause the projections themselves enhance the smoothness characteristic of the orig-
inal object through the integration process (see Fig. 4.4(a)). Speciﬁcally, since a
projection sample is simply a scaled value of an average taken over an integration
strip, a projection sample represents, within a ﬁxed factor, the “typical” intensity6
that the object assumes in the integration strip. Therefore, the perturbation 6 asso-

ciated with any two typical and adjacent samples of f behaves in similar manner to

those perturbations associated with the object itself. Histograms for perturbations

 

6Note that these “typical” values need not ever occur.

129

between vertically adjacent pixels in synthetic and clinical data have veriﬁed this
general behavior, and resemble the histograms in Figure 3.3.

The behavior of the perturbation (innovation) variates according to the desired
prior does not on itself warrant the suitability of the MMI model, for the intensity
variation in sinograms is arranged in a highly structured fashion, and so, the assumed
independence of the innovations in the model may not apply. We have taken a
practical approach and have sought to empirically justify the model’s applicability.

The steps taken are:
1. Form sinogram c from raw projection data {c2}.

2. Obtain optimal estimate A of underlying intensity sinogram A from data

sinogram c using the 2-d SI-MMI-model based estimator of Chapter 3.

3. Reconstruct intensity image estimate ffrom sinogram estimate A by the

F BP method.

We call this the sinogram-image-SI-MMI—model—based ﬁltered-backprojection recon-

struction method, or the SI-MMI approach to ECT, for short.

4.5.2 Example

To demonstrate the performance of the new approach to SPECT, we have applied it to
two sets of data: the Shepp—Logan head phantom and real data from a human pelvic
clinical study. The two data sets have been processed in three different ways so that
comparisons can be made and a relative degree of performance can be determined.
In Figure 4.5(b), an (unprocessed) reconstruction of the phantom data based on
Poisson data projections is presented. Figure 4.5(c) shows a lowpass ﬁltered7 version

of Fig. 4.5(b). The smoothing process removes high frequency noise at the expense

 

7A 2-d Butterworth ﬁlter with cut-off 7r / 3 rad. was used in both examples of this section.

130

of some loss of high deﬁnition detail in the image. Finally, Figure 4.5(d) displays the
reconstructed image using the new sinogram estimation method. In this new image,
high deﬁnition features (e. 9., edges) are clearly preserved despite the high degree of
noise reduction.

Figures 4.6(b), (c), and (d) of the pelvic bone study were obtained by the same
processes used in the corresponding phantom ﬁgures. Note that in Figure 4.6(d), the
high deﬁnition image structure becomes evident after processing according to the new
approach. This is in contrast with the result in Figure 4.6(c), where the image is seen

to be oversmoothed to achieve a similar degree of noise removal.

131

 

Figure 4.5. Shepp-Logan head phantom image reconstruction. (a) Phantom. (b)
Phantom reconstruction from noisy data. (c) Lowpass ﬁltered image from (b). (d )
Phantom reconstruction from MMI-processed sinogram.

132

 

(C) (d)

Figure 4.6. Pelvic bone study image reconstruction. (a) Sinogram of pelvic bone. (b)
Pelvic bone reconstruction from noisy data. (c) Lowpass ﬁltered image from (b). (d)
Pelvic bone reconstruction from AVIAN-processed sinogram.

133

4.6 A New Multiscale-Based Tomographic Inver-

sion Method

Despite the substantial improvements in quality and ﬁdelity of reconstruction that the
SI-MMI approach represents, we recognize that the always present sinusoidal struc-
ture of sinogram images has not been fully exploited by this method, when it could
be used to leverage the reconstruction process. The second method to ECT to be
introduced in Section 4.7 is based on the tomographic reconstruction approach devel-
oped here and which itself evolves from considerations of this sinusoidal extructure.
As a result, the new inversion formula reconstructs the desired images not from their
corresponding sinograms, but from their cumulative sinogram intensities. This will
be especially important in the ECT problem, where for low SNR data sinograms the
high variability of the data signiﬁcantly undermines this structural constraint in sino-
grams. In Section 4.6.3 a fast implementation algorithm for the new Radon-inverse
transform is also presented.

There exist other multiscale-base tomographic reconstruction techniques [83, 84,
85, 86], but none is statistically motivated and offers no advantage to the ECT re-

construction problem.

4.6.1 The Multiscale Reconstruction Formula

We begin with the inverse Radon transform (4.5), which may also be expressed in

terms of the Hilbert transform8 of the derivative 13(3) of the intensity projections:

f(r, ct) = 217;]: ’HAz,(r cos(9 - ¢)) (16. (4.7)

 

8See Section CI of Appendix C for the deﬁnition of the Hilbert transform and some alternate
representations.

134

 

This
the

mar

w l1

and z
[a
guarant
ofﬁnite 1
when 3p):
are met.
Since y

with j), _= I

This is equivalent to (4.3). For simplicity, and without loss of generality, we assume
the support of f to be contained in a disc of radius rmax < 1. Additionally, differen-
tiation of the intensity projections are assumed to exist everywhere in [—1,1], with
1;,(1) = Ag(—1) = 0. Differentiation at the boundaries are of left and right type.

In Section C.1 we prove that 7119(3) = —%fol{A’0(0’2(T)) - A;,(01(7'))}“7T with
01(r) = (—1 4 s) r + s and 02(7) = (1 — s) r + 3. Substituting this into (4.7) results

in

where

fo"{At(t(r)) — X9(t(—T))} d6, for —1_<_ T g 1

90", a); T) E (4.9)

0, otherwise
and t(r) _=_ T + (1 — |r|) rcos(6 — (b) for all r in IR.

In arriving at (4.8) we have made use of the Fubini-Tonelli Theorem [28], which
guarantees the interchangeability of the integration operations as we assume f to be
of ﬁnite energy. In the rest of this chapter, however, we generally make no special note
when appealing to this theorem, and give as understood that the required conditions
are met.

Since 9 is real and odd in 7', its Fourier transform Q is imaginary and odd. Then,

with g, E Im{§},

_1 00
90333:?) g.(r,¢;w)sin<wr)w. (4.10)

135

and
1
g,('r,¢;w) = —2/ g(r, (15:7) sin(wr) dr. (4.11)
0

Thus, (4.10) into (4.8) proves that

1 °° . 1sin(arr)
f(r~0)- 475i 9i(T.d),w)/O T drdw

 

1 f: (4.12)
= 473 g,(r. (tun) Si(w) den,
where Si(w) denotes the sine integral.
Since t(r) is differentiable everywhere except at r = 0,
, (A 0 t)'(T)
MUM) = “lg—('77—
(4.13)

: (1, 0 two
1 — sgn(r) rcos(6 — as)

for 0 < |r| _<_ 1. Note that since [r] < 1, the ratio is well deﬁned. Using (4.13) in

(4.9), and recognizing that g(r, ()5; 0) = 0, gives us

 

 

" (A9 0 t)’(T) (A9 0 t)’(-T)
' = d6
g(r, ¢’ T) [0 {1 — sgn(r) rcos(6 — o) + 1 + sgn(r) rcos(6 — o)
for 0 < [7'] S 1, and zero otherwise. Once this expression is used in (4.11), an

integration by parts over 7' results in

at, a...) = 2 f0" {[01 wjoijgggjtgﬂ d. + [0‘ “ejsffgcf:(j9°j>q§;i> 3.} d9.

 

 

This expression accounts for the fact that (A9 0 t)(1) = (A9 0 t)(—1) = 0. Here, 0

denotes the composition operation.

136

We may express Ag ot separately in the intervals [—1,0] and [0,1] in terms of
Haar wavelet expansions (see Chapter 2). With p;k(r, <15; 6) and p;k(r, a); 6) denoting
its wavelet coefﬁcients corresponding to these intervals at scale j and shift 19, the

expansions are

(Aeot)- )=Zp;..( rcb; outta)
j,k

(Aoot)( ijirrtimd)
jk

both for 7' E [0, 1]. Substituting these in the above integrals yields

 

. . _ 7' LUZ: k pjk( 7‘ ¢;6) )fo wjk(r T)cos(wr)dr
gi(r,¢,w)—2/0 {w 1—rcos(6— 4;)

+2212” p;k(r, (i); 6) fol 2,15,),(7') cos(wr) dr d6
1 + rcos(6 -— 45)

 

(4.14)
— —2w 2 2 jij( r, q)()/0 2i)(2"jr - k)cos(wr)dr
j<0
o<1~<2- 1- 1

+ 2 sin(w) Z 2’1 Qj,o(7“, (25),

321

where -w f012,l( 2,l(2‘ jr— k) cos(.or)dr = sin(23kw)— 2sin(2j(k+1/2)w)+sin(2j(k+1)w)

and

 

 

+ _
. = '/2 1r pj.k(ra 65; 6) pj,k(r1 ¢i 6)
03.1.6.8) _ 21 f0 {1_TCOS(6_¢) +1+TCOS(9_¢) d6. (4.15)

The factor 21/2 is required in this deﬁnition so that QM is independent of scale for

1'21.

137

Now, (4.14) into (4.12) becomes

f(mb) = 5’; Z mama) + 47172 c2100: 4).

 

130
0932-141
where

7/27r forj=0, k=0
3/7r-2‘23 forj<0, k=0

ajk E _21 .
l/VT'kTE—Ejmj f0fj<0,0<k<2_J—1

2421—4) -1 . _ _-

1/27r-2_3_2,+4,2J forj<0,k—23—1.

Deﬁne the unnormalized Haar wavelet and scaling coeﬂicients

~ t(21(k+1/2)) t(2~’(k+1))
gij-(T', d); 6) E / A9“) dt -/ Ag(t) (it
t t(

(211:) 22‘(k+1/2))

foerOandOSkS2‘j—1,and

1

5x030”, CD; 6) E / A9“) dt.

rcos(6—4))

(4.16)

(4.17)

(4.18)

(4.19)

These and the (normalized) wavelet coefﬁcients in (4.15) are related according to

2—j/2 échU', (l); 9)
1 — rcos(6 — 96)

 

p;k(r1¢16) =
and

p;k(r1 Cb; 6) = p;k(fr1 <15; 0 + 7T).

The last expression is due to the equalities A9(-t) = A9+,,(t) and 1+r cos(6 — ab) =

1 — r cos(6 + 7r — ()5), which hold for all intensity sinograms void of attenuation effects.

138

1~1qi1§l.}) ‘-

  
  
  
   

lf. I.) r"‘\.‘)\ ‘ .'

Emu

1‘ : 111.411 maul! ’ ‘ ma {be
theme on

‘3 $3,104“ ' ' mm 4.7.
the meld
Mano.)

.
'1
I
. .
1.! .1‘,:-“ . , .. . . e/ nixuvuirlIJG)
\
. u
1".» 111m. . i in;- -~.v. ., w- v. . 1W --u-‘xr.i:m'
(rm-11'- v imum.” - 11- 13. . .. . .1 .'-;.<' 111.. [3.1 '71" ":11' )i'ilfi'N (Q- l

. I

a

-V
I

IS.

I
.
o—v
.

In Section 4.7, we make use of this equivalence to introduce a simple strategy for
mitigating attenuation effects in real data. By assuming this condition, the following

more compact notation for the coefficients Qj,k(r, d)’s is possible:

 

 

_ 27f éj,k(r1 ¢, 6)
Qj.k(r, 45) —/0 (1_ rcos(6 _ 65))2 d6 (4.20)
foerOandOSkS2‘j—1, and
_ 27f X0.0(7'1 4516)
621,221.21) _ f0 (1_TCOS(0_ 2))? d6. (4.21)

Expression (4.16) together with (4.17), (4.20) and (4.21) give the means to recon-
struct the intensity f in a scale-by-scale basis from the unnormalized wavelet (and
the one coarse scale) coefﬁcients (4.18) and (4.19) of the “warped” projections A9 0 t.
Figure 4.7 depicts the wavelet 230,0 superimposed on the projection A9,, as well as
the wavelets 219-1,0 and it-“ superimposed on the projection A92. The coefﬁcients
60,0(r, d); 61), 6-1,0(r, cf); 62) and 6-1,1(r, a); 62) are the coefficients associated with these
wavelet functions.

Formula (4.16) may be used as a starting point for further theoretical analysis
on the tomographic reconstruction process; however, since it requires a continuum
of projections, it is of very limited practical use. As is conventionally done with
other reconstruction methods, one may discretize formula (4.16) by straightforward
sampling. Clearly, the reconstructions obtained in this manner will be a visually
appealing approximation at best, but of poor reliability for delicate studies, e. 9., di-
agnostic medicine. In general, it seems very difficult to predict a priori the sampling
resolution required to achieve a desired image quality, and to determine after recon-
struction which features are real and which are artifacts, often proves to be a bigger

task yet.

139

 

 

 

 

 

 

Figure 4.7. Wavelet functions for image reconstruction. The contribution to the
sinogram of an intensity image from a single point appears along a sinusoidal path.
The point intensity can be determined from the unnormalized Haar wavelet coefﬁ-
cients of each projection. The associated wavelet functions are deﬁned in intervals
[7‘ cos(6 — 43), 1] according to the (r, (1)) location of the intensity point, and the pro jec-
tions’ angles 6 ’5. Three such wavelet functions are depicted here over vertical axes.

Below, we use the sampling theorem to represent the new reconstruction formula
in terms of a ﬁnite number of projections. The resulting expression will also represent
only an approximation. There exists, however, an important difference between our
sampling approach and that of traditional methods used to account for the sampled
nature of the data. In traditional methods, it is not only the data that is discretized
by the sampling, but also the inverse Radon transform operator. This leads to errors
on two counts: one, from the loss of information due to aliasing effects, and two, from
the degradation of the conditions under which the inverse transformation holds true.

By using the sampling theorem to incorporate the sampled data into our otherwise
perfect inverse transformation, the errors are limited to aliasing effects. However, for
these types of errors there exist preﬁltering steps that one can apply to the data to
minimize their adverse effects.

Let A0 5 27r/N be the angular sampling interval. Then, assuming that only little

140

of the sinogram’s frequency content in the 6 direction falls above N / 47r cycles per

radian (half the sampling frequency), we have

-1

6j.k(r,¢; 6) ‘26,“ ,6; nA6)Sa(£—6-(6 —— nA6))
and
~ N—l ~ 7r
10,00, a; a) = 10,0(7, <1; nA6) Sa(-A—0(6 — mm) ,
n=0

 

where Sa(:1:) E Sinai”. Substituting these into (4.20) and (4.21) results in

2

—1

~ 2" Sa 6 A6
Qj.k(7', cl?) = j,k(7', ¢;nA6)/O (1 £— :gcfos(—6n $332 d6 (4.22)

 

3
II
o

foerOandOSkSZ‘j—1,and

 

”"1 ‘2" Sa(l(6 — mm)
Q10( 7", a): -2 A00( (,7" a, nA6) f0 (1 -Ar6cos(6 _ as»? d6. (4.23)

2,r Sa( 739(0— mm)

The integral f0 (1- —r—cos(6 ¢—))

 

d6 deﬁes a close-form solution, but an excellent ap-
proximation may be obtained if N Z 32 and rmax = .96. These conditions are easily
met. In particular, studies often cite values N Z 64, or even N Z 128. The spec-
iﬁcation for rmax can always be met by simply padding the projection vectors with
a few extra zeros. In fact, the condition for the number of projections N required
can be further relaxed by simply making rmax yet smaller by this same procedure.
For instance, if N Z 16, an excellent approximation is achieved with rmax S .8. In

appendix C.2 we show under what conditions

 

 

(16 z (4.24)

[2" Sa(&(6 — 71136)) A6
0 (1 —rcos(6----<Z3))2 (1 —’rcos(nA6-'¢)l2

141

"é: L7: 11‘ ‘
l

v I
1"
31‘ v “
\'_§‘ ...-
I u
"fl,~1"lj. ' ‘1; 'v. . :1,

111mm 21m armrm -. ,. ‘1
"1314:: ‘m'T .111 1; 'li v-v; .
rii'mv c1011)“ Judi'r‘)’

G l .
Lh‘ﬂinpﬁl 1". i,’.l'.f"|-1;t u: ,4 , 1. ,1}

.Sn'l’lﬂu'lt', «..z::- um n' -l'

“I .3. P .0191 d?! 1:12 All». r~1 .,

(59.11 ~ - _ “[-

1"va Win ' an

.1 . . ,1: .1 — '-.’.:u: 11w _ﬁ-H

\' 2, ’ [1v $,:,~- 7 L)

   
  
  
    

”'1'“ r1 I'RI‘M‘.
' 11 11:14.11 Milk!" l

1 \

. H .

‘ WA) ﬂaﬁl "1

.41 . 1')
d

7}

'i‘
«E
l
a -

'_, 1,1” ‘Ili‘u 941 Y“ . f.

.-~..-" 5' '1' Jul

. 1”,!" I.

.1 :;......‘ .1‘-:fdum‘.
..*'

u. l‘ ‘ l. ..v '11 lfJMl‘ﬂ

_... 7_ V“ _(jf{:‘.--;9jf&}”§. '§\;‘- V,

for n = O, 1,... ,N — 1. Since the approximation can always be made as good as
needed, from now on we regard it as an equality.

Using (4.24) in (4.22) and (4.23), and in turn, substituting these into the recon-
struction formula (4.16) we obtain

f(r 45) = _1_ E 2i1r X30016) + 2350 2099-14 01311: 621-0395) (4 25)
l N ":0 (1 — rcos(6 - Zgn»? ’ '

 

where 6;k(r, (15) 5 6,1036; €T’rn) and X300, (1')) E 5.0.00, ct; 317571). That is,

~ 21(k+1/2)+(1—2J(k+1/2))rcos(¢—%n)
n (73¢) _/
2

j.k A21rn(t) dt

J'k+(1—2Jk)rcos(¢—2W”n) N

21(k+1)+(1—2j(k+1))r cos(rb-zﬁn)
— / 144,,(4) dt (4.26)
2

3(k+l/2)+(l—2J(k+1/2))rcos(¢—2W"n) N

forjﬁOandOSkS2'j—1,and

1

43,064) = / A4,...(ndt (4.27)

r cos(qb- 2W’rn)

(see (4.18) and (4.19)), and a“. as given in (4.17).

This formulation shows that the intensity image at any point (13¢) may be ob—
tained by ﬁltering and adding the wavelet coefﬁcients of each warped projection
A216" 0 t, and then averaging them together over the projection angles. One sig-
nificant advantage of this method is that it bypasses altogether the need for the ramp
ﬁlter le, characteristic of backprojection approaches. As already noted, these ﬁlters
are known to enhance high-frequency noise in the projection data.

Another advantage of the proposed formulation surfaces when denoising and de-
blurring the projection data: multiscale—based ﬁlters and estimators that adapt to

speciﬁc regions of the intensity image may be applied during the reconstruction pro-

cess (4.25). This is a valuable feature not offered by conventional tomographic inver-

142

sion schemes. Since we are always free to coordinate any ﬁltering done on consecutive
projections, (4.25) also enables us to exploit the sinusoidal structure characteristic of
sinograms. This could be useful when compensating against non-uniform attenuation

effects, for example.

4.6.2 Computation of 62k and 23,0

Formulas (4.26) and (4.27) do not lead to practical evaluation of the coefﬁcients 6;}
and 23.0 because the signals A 21%,, are never available. Instead, in practical tomo—
graphic processes the set of values A}: is produced. These are the integrals deﬁned

in (4.4), and represent samples of a scaled moving averaging process which may be

described by

2(t+1)/K—l
Al‘s] A211,,(s) ds
2t/K—1 N
_ S w —i-’i(K—1)w‘ K V
-( a(§)e 4 1%,,(3449 (4.28)

for t in [0, K — 1]. ()V denotes inverse Fourier transformation.

We would like to know under what conditions the sampling interval supports the
“nomina ” bandwidth9 of the signals A} This bandwidth could be no more than gK
radians per length-unit, for the sampling interval is 2/ K length-units. In turn, this
implies that the bandwidth for each of the signals MT”, can be no more than §K2
radians per length-unit, as dictated by (4.28). For example, if K = 32, the bandwidth
of any /\_2_N_1gn can be no more than 256w radians per length-unit, or equivalently, 256
cycles per aperture (per length of array of sensors).

These estimates have not accounted for the effect of the “envelope” |Sa(§)| since its

 

9Strictly, no sampling interval could ever be small enough to satisfy the Nyquist criterion given
that the projections are of bounded support. Nominal bandwidth refers to that which if “recon-
structed” would represent an acceptable signal.

143

  
  
  
  
  
  
   
   
  
  
  

1‘. . 1~' ‘. '>'.11ll ‘ . 1 ' "llr'l'rl‘Vfﬁ 01A”.
1. . 1 -'r.'I-f'.-l(lﬂﬂ$(“
1.1 _3 1.~|1’:|.'h-.H!'ﬂ

:. *1 to uni:

‘ l. . ‘ . . . . 1 '|' '1‘. b) b‘lﬂ
;.1‘ l ‘ . 1‘ 1 l4 .4' ' 7 . _ '1" '9‘.
- “» ev'l n m 18! 111.“ 2

V
' 111 I .1 :t’ .i 11$ 4 "11"“? ”(W

; -- . -. -~..1.a) 31—3
‘11lJP“1-'!:lj'-"11";‘.3." .1. ‘. . . ~1 3 1 31121409“

. - , , , .1
Ayélugli'OI-mmml -Illnll'..1. ' ‘ . . V 1 .. 11" 1.9.1121“!

11.111.11.111” .r'»lr.151l!_';."'l .1 ' -~ 5 . 1 .1 ...: 11m:

1‘ .

'~ 1111:“ alum rm Ml M13 "V. — 11; 1 '1. 1 1111111961156“!
. '\
Waluusdmh .21. "~ )1 ii “1 - . ~ 51- 1' .“ .1. ”-3,. mm- 1

1' . . . . -
ﬁrs .‘(hmlwmpa '10 Jury—11.134114. 1. 1 m 1 ' .1 :1; mun my id In:

12",“; I. 1,. x. J, l. .13.), In") m?
Whliffﬁal "Mﬂf'NJ-v" ‘nl‘lh l'J-u‘“: 1'3" .111 :m "a'u‘. '4 .1 WIN!

I,“ 91”"4‘. d4 mm- o! “Syn-E1 4.1"» '-1.- r .- .7111.» |:;./ 1 1'; 4 .
W Ihilld III) a! 41-51 6311611315651? human? 1111-11311: 3 'J-asmd 30 Ml ‘ ‘

155‘6’ ﬁkflkiq 31 In H.

W 611-! g 1

frequency contribution extends to inﬁnity. However, beyond 6471 radians per length-

unit, its contribution, and therefore, that of /\2_«,,,

has been reduced by over 40 dB
from its peak value at d.c. Consequently, the sampling need only satisfy the Nyquist
requirement for, say, 647r radians per length-unit;10 and this occurs for K _>_ 16. Since
this is the case for practically all tomographic studies of interest that we may be

concerned with, we conclude that each A? can be recovered from the K samples A2.

Thus,

{:1}; Sa( (lg—(t — 72(1)) (4.29)

The projections A23" are ﬁnite in extent and so, we can think of them as consisting
of zeros beyond the aperture’s boundaries, if necessary. Then, any integration of A2W1r

over an arbitrary interval [51, {2], where {1 S 52, can be computed as

52
— n — n
A Ai—Tn(5) d8 _ (A%(l+£1)+l %(1+€2l+l)
1

1:0

= i A}:{Sa(w[(%)2(1+51)+%l-kl) _s.(.[(4)2(1+,,)44144)},

With L— _ [if (1 " élll Equivalently,
£2 L_l A to; E— 2 1 1'4.) 5. 2 V
/ A2an(8)ds = Z (A: rect(w/27r) (e < 2) (1+1.C ) _ e ( 2) (1+£2)))
£1 [:0

271 _, ‘1’ sin(wK/4) e

 

2 2
1 7' Iv: sin(wKL/4) _,K(z;—nw (64%) n+4.) Bid-’5) n+4»)

where 3:, denotes the discrete-time Fourier transform of the extended projection

 

”Here, we assume that A?“ has most of its energy concentrated below 64 cycles per aperture.

144

 

 

(- -- ,0, A3, A32 - - - ,/\’}\,_1,0, - - - ). This expression leads to the desired result:

52
f X
51 ‘

n(8) ds = (if, 8:1(213235) )v

 

zl':

 

k- K2(1+g)—K(L—1)
— 4

 

 

— (8:11:21?)

 

(4.30)

 

k- K2(1+52)—K(L—1)
- 4

Expression (4.30) allows us to compute the integral of A 2%,, over any chosen interval
by means of the fast Fourier transform. In general, the integration values obtained
are only approximations, but this is of no concern for they may be made as accurate
as necessary by simply extending the length of the FFT by zero padding.

The coefﬁcients 62k and X31, can, therefore, be computed accurately by applying
(4.30) to (4.26) and (4.27). Unfortunately, (4.30) cannot be computed efﬁciently since
the indicated inverse Fourier transformations must be recomputed for every change
of the integration’s lower limit 51, as the parameter L is a function of it. Next,
we introduce a new formulation that avoids the computation of these coefﬁcients

altogether, and which leads to a very efficient inversion algorithm.

4.6.3 A Fast Multiscale Radon-Inverse Transform Algorithm

Define the nth cumulative projection as

W '

53(3) 5 [1 /\2n,,(t) dt (4.31)

for —1 S s S 1, and every n. = O, . . . ,N — 1, and the associated (normalized) quantity

u" r = /-\"(2jk + (1 — 23k) rcos(cb — 21%,,» 4.32
J'k( ’05) — (1 — [21k + (1 - 21k)rcos(¢ — -21%n)])2 ( )

 

~

for j S 0 and 0 S k 3 Ti — 1. Note that :\"(rcos(q§ — 2W5?» = 3,0(7‘, (15), 01‘

e(luivalently, (1 — rcos(gb — 2ﬁn)2 V8;0(T, gb) = 5‘3’0(r, ()5), with 513.0(1‘, qb) as defined in

145

(4.27).
The normalized wavelet coefﬁcients in (4.26) corresponding to the warped projec-

tions A15" 0 t may now be expressed as
N

31,147”, 95)
(1 — rcos(d) — —.—n))2

—(1 - 231k +1/2))2.".1/2( 91>) + (1 — 2"(k- + 1))2-v31..1(r,¢).

 

=(1 -2jk)21/j,k(7',¢)

Substituting this into (4.25), the following reconstruction formula results.

23—1

N:{ m+z:a..[1—2jk>2u. was)

j<0k=0

— 2(1— 2% +1/2»? Van/2e, <1) + (1 — 2% +1»? M a] } (4.33)

In practice, we are always interested in a discretized reconstruction formulation
for computer implementation purposes, and so, the reconstructions are necessarily of
finite resolution; thus, it suffices to work with only a finite number of scales. With
J _<_ 0, we denote by f J the intensity image reconstructed from the —J coarsest scales
in (4.33). The scale-limited inversion formula may be written in simpler form if we first
gather the VJ’fk(r, gt) terms with equal value for all allowed combinations of parameters
j and It For example, it is always true that V_3 k(r qb)=|k 1 = VEQ‘k+1/2(r,q§) =

k=0
V33’k+1(7‘, (15)] k=0' Doing this yields

1 — 2J_lk)2 ﬂu: V'}_1,k(73 <15) (4-34)

m, <1): 71‘; (

where
ﬁJﬂ : 41-J/71' (4.35)

146

{mil -'.~ flu-"1. -l

315.3‘1111152m" .1‘r.Z°'.‘- . 1 i1 'iH V

it'll“ {(3.}. ,1

  
    
    
   
   

' dih'tu‘) tiff”
,_. LAWN?” M! m u. .

1 ' I
1
-
__‘ -
D

'3
L r.\,‘€~l ,1-

v.11," q“ om .

. “11"qu ‘ ,*

~77" ' 4:111

, .

.~‘1|2.-1§‘\~M71Yd,

. .

s... humid-m .‘
x' ,

, wawo.‘
.111 91mm, 3
1/ «11.11.111.00 ,

.... ..1 J

(. ..

x 7 i. . «.....mf‘“

and, for 1 S k S 2‘J+1 — 1,

—2QJ‘(k_1)/2 for (6 Odd
(3.1,k = —2aJ+m. (2_m-k_1)/2 for It even. (4-36)

J+m‘ —la
+ 2.1;] Qj 2-j+J—1k 'i” aj‘Q—j+J-1k_1

Here, m‘ E max{m E N U {0} |2”"k E N}.

The reconstruction formula (4.34) represents a fast multiscale inversion algorithm
for the Radon transform, and to our knowledge, is the most efﬁcient algorithm to-
day. Its complexity is less than that of fast-Fourier-transform based approaches by a
factor of log K. This estimate assumes that all the coefﬁcients are precomputed to
reconstruction, an easily satisfied condition.

Figure 4.8 displays the general behavior of the coefﬁcients (1 — 2J‘1)2 BM used in
(4.34). The plots correspond to the case of J = —5, but the fast decay behavior of
the coefﬁcients as a function of k is observed whenever J < 0. From Figure 4.8 (b),
it is evident that at k = 8, the factor (1 — 2"“)2 )6”, has been reduced by over 50 dB
from its value at k = 0. This and the fact that the data V3_1,k(r, ()5) is non-increasing

with k, justifies the simpler reconstruction formula

'2
H
a...
O

I
y—a

(1 — 21—116)? ,Bj‘k Vii—1&0" (b), (4.37)

fJ(T9¢) : 1%f

I!
c
a-
II
o

n

where [so is a small positive integer, say, k0 = 8. For values of J much smaller than
-5, however, k0 needs to be increased accordingly.

As mentioned before, an important advantage of the new formulation is that it
allows us to selectively reconstruct an image at any desired scale supported by the
data. Also, since the algorithm does not assume a discretized integration process in
the data collection stage, it is better suited for real-world applications. In fact, our

experiments have shown that for simulated data where the projections are the result

147

 

 

50

 

H 1onog(1-2J")"’ IB /8 I

) , o——-<> 10 log IBJJ‘I BJ‘OI

JJK J.0

 

 

 

 

 

 

 

 

 

 

70

 

Figure 4.8. Tomography reconstruction-formula coefﬁcients. (a) (1 — 2’”)2 ,8“,
versus k. (b) 1010g (ﬂu/ﬂyol and lOlog(1 — 2J‘1)2 IﬁM/ﬁyol versus 1:. In both
cases (a) and (b), J = -5. These graphs reveal the rapid decay of the coefﬁcients’
magnitude as k increases.

of a simple counting process, not a true integration operation, the new formulation
does not perform as well as conventional discretized algorithms. Here, however, we
are mostly interested in ECT applications, to which the fast tomographic inversion

formula is especially well suited as we show next.

4.7 Second Approach to Multiscale Modeling and

Estimation of ECT Intensities

The SI-MMI modeling approach to ECT applications relies on the proximity of adja-
cent pixels in the intensity sinogram. In Section 4.5, it was argued that such proximity
typically exists in the vertical direction (same projection vector) due to the proximity

often encountered among adjacent samples in the intensity image f, which when inte-

148

grated together, only reinforces this characteristic. On the other hand, the proximity
of pixels in the horizontal direction was justiﬁed by the large overlap of the integra-
tion swaths corresponding to consecutive pixels. The regions of overlap deﬁned the
natural pixels.

We also noted that the inequality N Pg“,c >> AASPJC, which validated (4.6), could
not be justiﬁed for narrow integration swaths (high-resolution scale projections). Only
as the increment angle between projections is signiﬁcantly reduced, does the difference
NPR}. — AA}; become large. Therefore, the SI-MMI method can only be justiﬁed for
low resolution projections when the angular sampling is also coarse. Certainly, in
these cases, it would be suboptimal to have to work with coarse scale representations
even though high-resolution projection data are available.

To circumvent this restriction, we introduce a new approach to modeling, esti-

mating, and reconstructing the intensity image from its Poisson data sinogram.

4.7.1 The CSI-MMI Method

We deﬁne a discretized version of the cumulative projection (4.31) as X" E
[_’,‘{_1,... ,X3]T, with elements 5.2 E X"(%k — 1) for k = 0,... ,K — 1. Then,
we let the cumulative sinogram image (CSI) corresponding to these projections be
A E [ie, . .. ,XN-l].

The cumulative projections making up the CSI represent integrations of the object
f over increasingly wider swaths, starting at the top end of the projection vectors.
Therefore, the natural pixels associated with horizontally adjacent samples (pixels)
increasingly approximate more accurately the value of these samples as we move from
top to bottom in the CSI. Consequently, the intensity proﬁles (X2, 51),, . . . if”) be-
come smoother as we choose smaller values of k. In fact, in an ideal tomographic
projection process, integration over the entire object is the same regardless of pro jec-

tion angle, and so, 3.8 = - -- = 3.6V“, i.e., a constant intensity proﬁle.

149

From this, we may model the intensity versus angle corresponding to each row
of the CSI by a 1-d MMI model, each with a different prior. For example, the
prior corresponding to (Xﬁ +1, X}, +1, . . . ,5»? +31) would have parameters that reﬂect the
greater variability of the intensity than, say, the intensity (X0, X1, . . . , 5.5-1). In terms
of the various beta density components of the mixture prior for the perturbation
variate 6 introduced in Chapter 3, those representing high variability, e.g., those
deﬁned by small values of the shape parameter 5,, will have higher contribution to
the model for high values of k than for those of low values of k.

The modeling of the CSI’s intensity variation in the vertical direction, i.e., X",
proves more problematic than for the case of the standard (non-cumulative) sino-
gram. This is because in the cumulative projections, the consecutive pixels may not
be modeled by independent innovation variates—as is characteristic of MMI-based
models—given their strong dependence due to their construction, particularly for low
values of k. We have carried out numerous experiments where we have neglected this
fact in a pragmatic pursuit of the “what if”, just to be reminded by the very poor
results that in this case the fact is not to be ignored.

The need for a suitable model for the individual cumulative projections result-
ing from ECT studies arises only from the necessity to estimate their underlying
intensities before reconstruction. In Section 4.5 we saw the great impact that a
good model can make to the tomographic reconstruction process. However, because
the multiscale—based tomographic reconstruction approach developed in the previ-
ous section only makes use of the cumulative sinogram—as opposed to the standard
Sinogram—the Poisson-distributed data is an excellent representation of the underly-

ing intensity themselves. To justify this, we make the following observations:

1. Much of the CSI Poisson data have high SNR due to the rapid accumulation of

high counts obtained by construction.

150

2. CSI data characterized by low counts enter the reconstruction process with little
weight due (1) to its relative low magnitude, and (ii) to the weight given by the

coefﬁcients of the reconstruction formula (4.34) (See also Figure 4.8.)

Modeling the CSI’s intensity by the data has the advantage of simplicity, but
more importantly, it helps to maintain a higher degree of truth in the reconstructed
image. Moreover, due to the accumulation process by which the CSI is formed, the
coordination of the estimation of the underlying intensity in the vertical direction—as
k varies—is assured despite the fact that we only impose the 1-d MMI model in the
horizontal direction—as the angle varies. With this model at hand, the underlying
intensity can be estimated using the Bayesian estimator introduced in the last Chap—
ter. We call this approach to modeling, estimation, and reconstruction the CSI-MMI

method.

4.7.2 Example

Figure 4.9 is a pelvic bone study obtained using the CSI-MMI method. Fig-
ure 4.9 (c) shows the pelvic bone reconstructed image obtained from raw data using
the multiscale-based inversion approach of Section 4.6. In Figure 4.9 (d), the under-
lying intensity of the same raw data used in (c) has been estimated and reconstructed
by the CSI-MMI method. For purposes of comparison, we have repeated the images
of Figures 4.6 (b) and (d) in Figures 4.9 (a) and (b), respectively. These images
correspond to the ﬁrst pelvic study presented in the example of Section 4.5.

The CSI-MMI processed image in Figure 4.9 (d) reveals a greater amount of detail
than does the SI-MMI processed image of Figure 4.9 (b). In particular, note that what
appears as two bright blobs in the upper left of the SI-MMI image, it becomes better
represented in the result obtained by the CSI-MMI method. In this case, the features

preserve the structure found in the unprocessed image of Figure (a). Likewise, the

151

three bright areas easily distinguishable in the lower left of Figure 4.9 (d) are almost
impossible to pick out in the SI-MMI image, although their existence is well suggested
by a narrow band of bright pixels.

The images reconstructed directly from raw data using the FBP algorithm and
the new multiscale reconstruction method display greater detail than either image in
Figures 4.9 (b) and (d); however, the variability of intensity of their pixels is large
enough that makes it very difficult to determine the true structure within the images.
Clearly, this is especially true of Figure 4.9 (a), demonstrating the great advantage
of the multiscale inversion algorithm alone.

In addition to the above differences of image quality among the set, there exist
several other features in the CSI-MMI image that are hardly discernible in the SI—
MMI and the raw ﬁltered backpropagated images of Figures 4.9 (a) and (b). One
example is the very prevalent small dark region near the center of the image which
lies on the diagonal going from the lower left to the upper right corners. While in
Figures 4.9 (a) the region is well deﬁned, the variability of pixels surrounding it makes
it an unreliable feature. In contrast to the virtual disappearance of this feature in
the SI-MMI image, the region is well deﬁned in the image constructed with the new
multiscale approach.

Although we deem the CSI-MMI methods superior to the SI-MMI approach from
the above results, we believe that in practical nuclear medicine and other emission
tomgraphy studies, there is a real advantage by working with both type of processed
images that we have introduced in this chapter. While the CSI-MMI method offers
greater ﬁdelity of images, the SI-MMI-based images displays the greater structures
of objects in a more visually appealing manner without unduly sacriﬁcing too much

small detail.

152

 

(01)

Figure 4.9. Second pelvic bone study image reconstruction. (a) Pelvic bone from
noisy data using a F BP—based reconstruction. (b) Pelvic bone from SI-MMI—processed
sinogram using a F BP-based reconstruction. (c) Pelvic bone from raw data using
the multiscale-based reconstruction of Section 4.6. (d) Pelvic bone from CSI-MMI-
processed sinogram using the multiscale-based reconstruction.

153

CHAPTER 5

Conclusions

We have introduced a view of the multiscale structure of processes that succinctly
expresses the limitations of models constructed from limited—scale information. At
the same time, this view has made apparent the potential advantages of modeling
within the multiscale framework for estimation purposes. For this, we introduced a
unifying approach to characterizing models and estimators alike. The measures of
anomy, accuracy, precision, and resolution power were introduced in order to char-
acterize models and quantify their degree of goodness as estimators. While formally
deﬁned using the information-theoretic concept of distance, the new concepts reflect
our common-day sense of the terms, and so, they are appealing to use and general in
nature. Also, the new characterization of models and estimators has given a clearer
view on interplay between scale (space / frequency) resolution and information resolu—
tion (resolution power).

We have given general guidelines for obtaining linear transformations to construct
Gaussian multiscale models that are potentially better suited for estimation purposes.
Much work needs to be done in this respect, but a wide range of possibilities has been
opened by the idea of constructing transformations that are statistically motivated by
using the criteria set forth here. We argued the various advantages of the Bayesian es-

timators within the multiscale framework, and showed how this framework facilitates

154

the formulation of practical priors.

We have developed a Bayesian approach for Poisson intensity estimation, and in-
troduced a novel MMI prior model for intensity functions based on a multiplicative
innovations structure. The MMI captures many of the key features of real-world in-
tensity functions and provides an excellent match to the Poisson distribution. The
MMI model facilitates a multiscale Bayesian estimation procedure that proceeds in
a natural fashion from coarse-to-ﬁne resolutions. The estimator has a simple closed
expression and can be implemented in 0(N) operations, where N is the dimension of
the ﬁnest resolution of the discretized intensity. The issue of choosing the parameters
of the prior model was addressed, and a simple moment-matching method was pro-
posed for ﬁtting the parameters to a given set of data. The MMI model was extended
to a shift-invariant (stationary) model called the SI—MMI model, and it was shown
that the SI-MMI model has a 1/ f correlation structure that is more regular than that
of the MMI model. We developed a fast version of the SI—MMI which leads to a very
efﬁcient algorithm of complexity 0(N).

We have illustrated the performance of the multiscale Bayesian estimator by com-
paring its performance to other wavelet-based approaches on several benchmark prob-
lems. We have also studied the application of the framework to photon-limited imag-
ing problems, and examined its potential to improve the quality of nuclear medicine
images. The framework also appears to have potential in other applications such as
network tomography [4] and network trafﬁc modeling and synthesis.

We have introduced two new approaches to the estimation and reconstruction of
emission computed tomography images. The SI—MMI method is a simple, computa-
tionally efﬁcient, extension to the 2-d MMI approach which similarly incorporates a
Bayesian estimator in a very natural way. Its MMI-based prior model is well justiﬁed
based on the sinogram formation process. The superiority of the SI-MMI over the

commonly used FFT-based-reconstruction post-ﬁltering approach was demonstrated

155

with comparative examples using real and simulated data.

The second method introduced was the CSI-MMI method. This method is based
on the new multiscale approach to tomographic reconstruction also presented here.
The reconstruction method was statistically motivated on the ECT problem and facil-
itates the robust reconstruction of ECT images. The new Radon-inversion transform
formulation allows incorporating some local ﬁltering within the reconstruction steps,
which leads to a more efﬁcient. “one-step” process. In future work, we plan to exploit
this feature to develop methods to mitigate attenuation and blurring effects. We
developed a fast algorithm that implements the reconstruction more efﬁciently than
FF T-based inversion methods. The new inversion approach has the advantage over
more traditional reconstruction methods of allowing efﬁcient reconstruction at any
desired scale. Fitting the reconstruction scale to the highest resolution supported by
the data is important to the efﬁciency of reconstruction, and to the accuracy and
precision of the ﬁnal image. In future work we will investigate further the application
of the new reconstruction method to simulated data, which from our very limited
trials, proves not to perform as well as FFT-based inversion approaches.

CSI—MMI method estimates the cumulative sinogram image by imposing 1-d MMI
priors to each of its rows, and adapting them individually using the data. The robust-
ness of the data column-wise in the cumulative sinogram justiﬁes this one dimensional
approach and helps the reconstructed image stay true. Also, the one dimensional ap-
proach in conjunction with the new inversion method makes the CSI—MMI method
a highly computationally efﬁcient alternative to the image formation process from

photon-limited projection data.

156

APPENDICES

APPENDIX A

Appendix to Chapter 2

A.1 Proof of Expression (2.2)

Recall that

h.j,k(w ): — 51—01(w——_ €0j€0>e ”WW—jg") for all j,k E Z, (A.1)

and that f is an L2(lR) function. Thus, by Plancherel formula, f is also in L2(IR).

Since

 

 

w—jéo w_%Ek—j€0 _
21( £0 ) 1( £0 ) _ 1 (A2)

.7

whenever k = 0, and zero otherwise, with 21.060 = 27r
f(w)—:—7T—Zl(w—j£0) l w-—-k- J50 f(w—2—7rk)
U050 50 “20 u0

_ w— J50 “J50 27f _,2_7r r
_ 2).—W )/f(w)1(L—€O_)uozcs(—w w' uok)dw.

 

 

158

Using the Poisson formula and the fact that every Fourier series’ component is indi-

vidually integrable [19], the expression may be rewritten as

—OJ£O w, - J60 ikuO(‘-"_W,) I
1 —— do)
)fmﬂ w) (( 50 > :8

k

{Egg-Z ::J( _;__00j(:0)) eikqu/f(w’) 1(w__ _:_0j€0) e—ikuow’dwl (A.4)
],k
=(Z (,.jfizk )jizk
j.k

 

A.2 On the Monotonicity of I p(X ; A)

In this section we show that the average mutual information 1,,(X ;A) is a strictly
decreasing function of p, and thus, bijective and invertible; consequently, that p is
uniquely determined for each value of Ip(X; A) = 1’. We assume an absolutely con-
tinuous model, one with absolutely continuous densities.

By deﬁntion

now) = [a(XIMPW loggggfgldxdx.

where pp(x|A) E “p(x — A), i.e., a uniform density with support of which is the

N -dimensional cube C(A; p) centered at A and sides 2p, and

Here, * denotes convolution. It is not difﬁcult to show that if yp is an independent
rv distributed according to up, then A + yp ~ pp [30].

A well known relation for mutual information establishes that I p(X ; A) = H p(X ) —

159

H p(X IA). In this case, Hp(X) is the entropy of pp(x), or equivalently, the entropy of
the rv A+yp. Thus, we write H(A+yp) E Hp(X), when it is clear that H(A+yp) must
be understood as a functional over the sample space A x C (A; p). The conditional

entropy Hp(X IA) is given by

 

Hp(X|/\) = [pp(X|A)p(/\) log m1”) ddi

Up( (x— A) 10 ddi
=”(/" / gr— upxl( —A)
=/ p(A)/00‘ loglClddi.

RN Am) lCl

This entropy evaluates to log(2p)N, which is the entropy of yp. Consequently,
[p(xiA) = H(A +Yp) _ H(Yp)-
Let p2 > p1 > 0, and consider the difference of entropies H (A+yp,) — H (A+yp2),

where ym, ypz, and A are all independent. Clearly,

H(A—Pym) — H(A+YP2)=(H(A+y/J1)- H(A'l'ypilA»
- (H(A+yp2) - H(»\+yml>\)) +H(A+yp,|A) - H(A+yml>\)

= I(A+yp.;>\) - 1(A+ym;A)+H(ypl) - H(yP2)'

In order to arrive at this expression we have made use of the fact that for any two
independent m’s A and y, H(A+y|A) = H(ylA) = H(y). Subtracting H(yp,)-H(yp,_,)
from every term in this expression, and writing [m (X; A) for H (A + ym) — H (ypl),
and Ip2(X; A) for H(A + ypz) — H(ypz), we obtain

1p.(X;A) - 1p2(X;A) = I(A+yp1;/\) - I(A+yp2;>\).

By hypothesis, out of the 171’s yp1 and ym, the latter induces greater uncertainty

into its corresponding sum with A, and so, the information that A conveys about

160

A+ yp2 is less than that conveyed about A +y,,,. Thus, I(A+yp,; A) > I(A+ ypz; A)

and, therefore,

I,,,(X;A) > Im(X;A).

A.3 An Alternate Interpratation for Precision
By definition

1p(X;A) = [pp(X|A)p(A)10gM

dA dx,
pp(X)

where pp(x|A) is a uniform density the support of which is the N-dimensional cube

C (A; p) centered at A and sides 2p, and

mix) = [ppIxIAIpIAI dA

=ICI f(p ,. ,. (A5)

:I—CIP(’\ e C(x; p))

Here, |C| stands for the volume of C(x;p), i.e., |C| = (2p)N, and P() denotes prob—
ability mass.

Writing I p(X ; A) in terms of these expressions yields

,,(;X A) =/ / dAdx
IRN a(x, p) ICI)10 P(A 6 C(x; 12))

P(A E C(x; p))logP(A E C(x; p))dx

 

=|_10| RN
—1

= _ / P(A e C(x;p)) log P(A 6 C(x;p)) dx
ICI 1, mm )

where the summation is over all vectors k = (h, . .. ,kN) with k1, . .. ,kN integer

161

 

Irv
U
a
I
- ‘»
.5 '0
‘.
__ n
l
r

   
   

‘ I

I, .. I I A
.. ~ ,' - / . /_ I I ‘ :Ifu ' '. "' N'WIIO‘L‘“ " I
'uf
t
0
mm...» , I - - — ..~ W
. ,.. , . .( I) . H:“I"f5ﬂ19’IA‘ i.‘
. ,'.II
A
r ‘.\ —_ l,‘ ,_"‘\
‘tlf‘l I1! II- ./ M ‘
mIvzr-uiwad'
«--
I -.",n; HJAK‘ u . .h'
/.
(«3.1. W -'

'Ilulq H‘viﬁd‘fli‘ ‘1 ir'J "“ . - . ’ H VII 10!
,- . i ,, 'l

V
.th',-,.r.z"13 1,149.4“ ;. " l '71- u".-
1'!) I‘I. v} . . ,A (km (“A .. ,.AI = )l e‘I'IJ'rvI 11'; 157') *3 -

r, I. -

rm

 

.A‘JM-‘l . ‘ -

numbers. So,

—1

[p(xiA) = a

Z] P(A 6 C(x + 2pk;p)) log P(A E C(x + 2pk;p))) dx.
1‘ C(in)

Clearly, P(A E C (x + 2pk; p)) are probability masses on k parameterized by x,
6.,PZk (A E C( (x+ 2pk;p)) = land P(A E C(x+ 2pk;p)) Z 0 for all x and

k. Thus, the entropy Hp,x of the discrete ru’s {H(A) E Pizza] with probability

distributions
as... = k) = / p(A) dx (A 6)
C(x +2pk )
are
H... E — 2 P(ém) log 1305...). (M)
e...

Rewriting the mutual information in terms of these entropies,

1
Ip(X;A) = |_C'-l-/CO Hp‘x dX. (AB)
:P

This expression says that [p(X;A) is an average of entropies corresponding to
discrete ru’s obtained by partitioning the N -dimensional space into cubes of sides
2p each, and assigning them probability distributions corresponding to the volume
of p(A) “under” each cube. The entire partition set of cubes are shifted by x E
C (0; p) to obtain the uncountable set of distinct discrete rv’s over which the averaging
process is carried out. This justiﬁes the alternate and intuitive interpretation given
in Section 2.3.2 for a model’s precision, and which we restate below. Its equivalence

with the original deﬁnition is clear for in both cases we have ’P = I p(X ; A) = “X ; A)-

162

 

a
‘l
'I n ~
I
I
o
. l '
.~ .
I I
I
.1 ‘
r ‘ '

V 4
I‘ I '
I l .l
.I‘
l I
l! I, '. ,
I J
{a}. ,
“I 81l;.t.-.-HII. .~v,.-.~'..- I .
mm s» «um-I a.“ .. . . .
“HIUIC’I XII-1‘ {.[ji‘ug'I-g. __ . I ,, I ,
5 1 WI fl‘tlildi Bil. v.0”: \ Ma Ms}: J 7.. --r

Aﬂixbn‘lb 95!, H, “Li‘fl (us. ..Ix', . II. I ‘ ._

nevi; sluuhdanI-um Him/11:: in:- 97.1371:- I». «4: I Fr I

‘J’Jmi': . ~ . . . . .
Irmrpn ,.Il .mIl-III «mm» 9» Irwin 1).: , hm mt,

H3! .8

'n 'f ' H .3 p . I
. . 1-)“ ~- ‘ an“ Watson .I'n.‘ II: In} uni?

 

4' inborn . ﬂ

\v)

  
   
  
  

"'4‘ '43in
."NJIU M
47L.” .. '.' 5)": II. ~

hear .Juo

u mitt :- I:

P is the entropy of the m A when quantized to the number of bits determined by

I (X ; A), averaged over all possible quantization schemes.

A.4 Only an MA Model has Anomy of Zero

To prove that only an MA model has anomy of zero, we let

ORA“) E — [pp(X|A) log(tp(x|A) + (1 - t)pp(XlA)) dx

for all values oft in [0, 1], and show that N = f(aPAU) — ap,A(0))p(A) dA > 0,
unless p(xIA) = pp(x|A).
By repeated differentiation of ap‘),(t) with respect to t, we obtain

.. _ p.<xIA)—p<xIA> "
OW) ‘ ‘" ‘1)’/pp(""‘) (thxIA) + <1 — tIppIxIAI) ‘1’“

 

Suppose that pp(x|A) 79 p(x|A). Then, it is clear that GSA“) > 0 for all t. Since

0;,A(O) = 0, a;’,\(t) > O for all t > 0. Thus, ap,A(1) > ap,,\(0) and, consequently,

N > 0. When p(xIA) = pp(x|A), 0%),(1) = ap,)‘(0), and the desired result is obtained.

A.5 Equivalence of Anomy Deﬁnitions

We want to show that
D(Pp(X)|lP(X)) = D(pp(X|A)P(A)NIKKI/“1900).

Let W be a linear mapping A I—+ A1, where A 6 IR”, A1 E R3 and N a positive

163

even number. We only consider those mappings W for which the N by N matrix

3:.
“I

W

is invetible. Here, I and 0 are the 1;- by g]— identity and zero matrices.
Denote by A‘ and Ab the ﬁrst (top) and last (bottom) N / 2 elements of A, respec-

tively. Then, assuming all the following ratios to be well deﬁned,

MAXIM) _ pp(X) pp(>\IIX) _ pp(X) pp(/\‘,>‘1|X)

P(XIAI) p(X) P(AIIX) _ P(X) p(A‘IAIIX)’

 

 

where the last equality is obtained by multiplying and dividing by p(AtlA1,x). Recall
that in general, meaning is given to pp(x|¢) by the integral fpp(x|A, ¢) p(AIqb) dA

(= fpp(x|A)p(A|¢)dA.) Since (i) = A (:2) = AA,

Pplxlxl) = Pplx) pp(A|x) = _e(_x|_A_) (A.9)

p(XIAI) p(X) P(AIX) p(XlM'

 

 

Since the derivation could have equally been started from this last ratio of densities
and concluded that removal of information from the conditioning of the densities is
inconsequential,1

110;) = p(XIA) (A.10)

p(X) p(XIA) '

 

1Note that by choosing W = [0,1] and Ab a constant vector independent of x, the desired result
is explicitly obtained.

164

Now, from the deﬁnition of the information-theoretic distance D(--||) and (A.10)

. x = pp(XIA)
D(p.(xIA)p<AIIIp< IAIpIAII / MX'MW) 10g 'pIxIA) M" (A.11)

—' 0 WM X: X X
—fp. pp<xI1g;—(—")d D(pp()llp())-

A.6 Proof of Expression (2.35)

We shall prove that for a sufﬁciently smooth (slowly varying) density function p(A)
relative to p(xlA) the parameterized mutual information 1,,(X ; A) can be well approx-
imated by H (A) — log |C|. We present two different proofs, starting with the simplest.
Although the proofs are essentially the same, they begin from the two different in-
terpretations given to the quantity [p(X;A). The ﬁrst of these interpretations was
introduced in Section 2.3.2; the second interpretation was introduced in Section A.3

of this appendix.

First Proof

We know that Ip(X;A) may be written as the difference Hp(X) — H,,(X|A), where
H)=p(X - -f pp )x1(0s19IIX)alx and H (X IA - -fppXIA)p(/\)10gpp(X|«\)dxdz\
In Appendix A.2 we showed that Hp(X|A) = loglCl, where [CI = (2p)N and N is

the size of the signals x and A. Now, since

MAX) = fpp(XI>\)p(>\) dA.

by the assumption of smoothness of p(A) the 2p width-per-dimension of pp(x|A) =

ZIP(x — A) is of the order of the width-per-dimension of p(x|A).Therefore, p(A) may

165

be regarded constant. within the N -dimensional cube of sides 2p, and

1
pp(X) = — p(A) 61* % INK)-
lCl C(x;p)

Therefore,
HIIX) z — ij(X)10sp>.(X) dx = H(A).
Second Proof

From Appendix A.3 we known that 1,,(X ; A) can be written as

1
I X;A =— H ,.d,
p( ) lClv/Cmm) p, X

where

H... a - Z) P(é...) log PIE...)
6....

and

P05... = k) = / p(A) dA.
C(x +2pk;p)

From the assumption of smoothness of p(A) the 2p width-per-dimension of pp(x|A) =
Up(x -— A) is of the order of the width-per-dimension of p(x|A). Therefore, p(A) may

be regarded constant within the N -dimensional cube of sides 2p, and
POE... = k) % PA(X + 20k)lC|-

Then,

Hm: % -- ZPMX + 2pk)|C|1<>s(P>.(X + 2pk)|0|)-
k

166

Applying this result to the expression for I p(X ; A) gives

IPIX;A) z — 2: / PA(X + 2pk) IogpIIx + 2.2km
k C(0;P)
— log ICI Z] [a(x + 2pk) dA
k C(OHO)
z — Z / p(A) 10gp<A)dA
k C(X‘f-kaho)

—log C j p A dA
I I; C(x+2pk;P) ( )

=- / p(A)logp(A)dA—log|0| / pom,
RN R”

which is the desired result.
We note that the approximations made here are reminiscent to those often used
within the context of high resolution quantizers in computing their mean-square errors

(see for example [21].)

A.7 A Sufﬁcient Condition for the A/ P Accuracy
Inequality

As usual, we let |C(Aj;pj)| = (2pj)NJ be the volume of the cube C(Aj;pj) centered
at A], of sides twice the EPR pj, and of dimension N,- = N0/2j. For brevity of
notation, in the next few steps below we use the notation particular to scales j = 0

and j + 1 = 1, but the development applies to any two consecutive scales in general.

From deﬁnition (2.24) and equality (2.25)

A)
Nz-lo CA; -/ _p(_1_/ lo xAddi.
1 gl ( 1p1)l IRNIIC(Al;p1)| C(A1;pl) gp( ll 1) 1 1

167

.
I
‘C U.
I
u 3.

 

 

“III”: I'erIwI‘ '-
I . . - m,p-”lb

  
   
   
  
 

) ﬁnaiamuaﬁ f
, ‘1‘) A f
‘ntlsu 'I 431

“

v u: - f f I '
Y‘JBHJ) If. i';‘. w: .'..IIIHZ'I

Mm... . > . . .
. . . - _ 'I'.| "~’D"
.l ... \

.I'IﬂI'V.‘ .. ., . -
. .I . . 1)} ~41) "91V" ~'

H A ~ ’ I
. -. A 91. . III I [: ”A I. , ‘ ‘
. . . I' ""7 ,1...” “I, .5

‘2
a

a

[I'V‘l' " 171‘1‘“ ‘ ‘I'.| _‘I, - I. ' . ' ’
. . .I . ‘ :"lft CHI, ’06 .‘l'

' .. .1 . .r::!. ‘1‘“, ‘3‘,” I...
17¢",
. .l“ ';
. l I ‘. ‘f-l
.ADI’WY-A’, 5" :,-' _ s,. A ‘ 5.1 1‘.
‘ ' ’- ’ \ " l‘tﬁll‘) . I -
I A . ‘ ’ 'l ' . M
\' ,_ t
.{c

If W is an orthonormal transformation such that2

01 W1
= WAO and = on,

A1 X1
where 91 and wl are simply the “error” vectors A0 — A1 and x0 — x1, respectively,

the above inner-integral can be expressed as

/ IogpleIAIMx.
C( 1:01)

2/ log/ IXGIIAQZ/ p(X1,W1|A1,01)dW1d01dX1,
C(Alipl) RNI k C(OH-Qpikmi)

where the summation is over all vectors k = (k0, . .. ,kN,_1) whose elements are
integers. C (91 +2p1k; p1) are the cubes centered at 01+2p1k with sides 2p1. Refer to
Figure A.1 for a representation of the various components in these expressions. Note
that in the ﬁgure the point (91, A1) is the same as the point A0. '

Now, deﬁne

 

a(xllA1,91) E

With the aid of the Jensen’s inequality we obtain

/ logp(xIIA1)dX1 Z ICQIIPIHIOEICQIIPIH
C(Alipl)

+/ / p(91|A1)1og(1+a(X1|A1I91))d91dX1
C(AIIPI) RNI

1 f f
(3' A p 6 A / 10gp(x IW IA ,9 )dw d0 dxl.
l ( lip1)l C(Alml) 1R"! ( 1| 1) C(91;P1) 1 1 1 1 1 1

 

2For later convenience, we have departed from writing in the customary order the projection
vectors A1 and x1 ﬁrst, and the “errors” 01 and wl second.

168

 

I
I
I
I
I
I
I
3, , I‘-
. I
C . '
«4' '3 ‘
.. ‘ ‘
, I.
I
‘ . .l
t. V
.

    
  

II; ‘I-IIIIm'a m

H an In AI' .‘

R -/' ~j ',. . if .7 “ MI..."
.."‘.I ‘
I: ' . oIlI nrmyﬂg‘“ I, '
‘ 1‘ h I' r‘ ( :‘h("’.
3"- .«90» 5".

"d! 7m‘.'”r‘l A.) :(x‘li'l' 2' ‘!

'r .7 . av <, ... . .
01"1‘1’1‘ ‘4 0.14)» U """HJ'" ‘ "l' 4 l - ‘II l‘ "H nII v'l 7'1" .1 . . '

1". I-- .1 ‘III; ”f. I17 ~. ' 1d: .34”

_h
I
l

X1

   
 

M1 - C(MOI ; Pl)

WI

COM ; P1)

M
C(91; P1)

 

 

Figure A.1. Simpliﬁed geometric representation of EPR cubes at consecutive scales.
For simplicity of notation only those cubes corresponding to scales 3' = 0 and j +
1 = 1 are represented (e.g., C(Ampo) and C(A1;p1)). Here, x0 = (xﬂmxgjl)? By
integrating p(xole) over adjacent copies of C (A1, 01; p1) as shown, a useful relation
between the anomies No and M can be obtained (See text for details.)

Then,
M .<. -logIC(A1;p1)|'~’
P(AI) / /
_ _ OAlo 1+axA,0 ddedA
/R~1|C(>‘1;p1)| C(A1;p1) MM 1' I) g( ( 1| 1 1)) 1 1 1

P(AI) / f f
’ — 9" lo x,WA,9 dwdedi,
[Rn/l |C(’\1;P1)|2 C(A;;pl) RN11“ 1| 1) C(91;p1) EM 1 1| 1 1) 1 1 1 1

01'

 

A
N1 _<_ —108IC()\1,91;P1)| “/ M" 10g(1 + a(xllAODdxldAO
1R~o IC('\11P1)I C(Axm)
P(Ao)
— 10 x A dx dA.
AND |C(A1,91;pl)l C(Al,91;pl) gp( 0| 0) 0 0

169

Since

N0 = - log |C(A0;p0)l _/ P(Ao)

—— logp x A dx dA ,
IRNo IC(A0;P0)|/C(Ao;po) ( 0| 0) 0 0

 

No
No - N1 2 log (E) +/ p(AO) / log(1+ a(xllA0))dx1 dAo
1R C(ANPI)

po ”0 (200M
1
+ pA {———/ lo px A dx
[R < 0) (MN, CMW g < 0| 0) o

1
_W/C:(% )lng(X0lA0) dXo} dAo.
:PO

So, for a model to satisfy the A/ P conditions, it suffices that for all scales

 

NJ
log(-"—) < / p(A.){A,-+1<a)+A’j<p)—Aj<p>} dA., (A.12)
Pj+1 IRNJ'
where
Alp) = 1 / 10gp(X-I/\')dx-
J _ (ijle C(A;;pj) J J J,
1
A"(p)E—-+/ logpx°A- dx-
J (2pj+1)NJ C(Ag'+1.91+1;Pj+1) ( Jl J) J
and
A- (a)=—1—-——/ log(1+a(x- |A-))dx-
J+l _ (2pj+l)Nj+l C(A9'+1:pj+1) J+1 J 2+1

are the averages of log p(ijAj) over C(Aj; pj) and C(Aj+1, 9j+1;pj+1), and the average

of log(1 + a(xj+1|AJ-)) over C(Aj+1;pj+1), respectively.

170

 

A.8 Coarse-Scale-Data Limited Models

We want to prove the equivalence between A/ P and CSDL models. The following

depicts the conditions associated with CSDL models.

 

 

 

The CSDL Model Conditions
j-scale Model j-scale CSDL Model
P(XjIAle/V‘) P(Xj+1|)\jlp(’\jl
7’1“ 7’1 > 7”#14 PJ'HJ
A1 ~41 < Ania Ajay

 

 

 

Although the conditions have been stated in terms of data belonging to one coarser
scale than that of the intensity’s, it equally applies to observations obtained at any
number of higher (coarser) scales above the scale of the intensity.

Recall that all the models under consideration satisfy p(xj+1|Aj) = p(xj+1|Aj+1).
Motivated by this relation, we define pp(xj+1|AJ-) E pp(xj+1|AJ-+1), which is a logical

choice. Clearly,

pj+1J E I(Xj+1;Aj) = I(Xj+1;A,-+1) = 792'.

Since similarly 1,3(Xj+1; A3) = [p(Xj+1;/\j+1),

Ip‘(Xj+1§Aj+1) = Pj+l,j = 733' = Ip(Xj+1;Aj+1)

171

and so, ,5 = ,0. On the other hand we have

_ pﬁb‘j-HIAJ')
. .2 - . ,\. A-l ___ . d .
N+13/Pp(xy+1| .)p( .) og MMI,» ax,“ A.

Pp(Xj+1lAj+1)
= ~x~ A. A. 10 dx- dA.
fpp(3+1lg+1)p(3+1) gp<xj+llxj+1>

= . . . Pp(xj+1|)\j+1) . _ ___ .
fpp(xJ+IIAJ+l)p()‘J+-l)log p(xj+1|Aj+1) dXJ+1 dAJ-H My

 

 

Thus, P34,”- = 79,- and A34,”- = A], establishing that if a model satisﬁes the A/ P
conditions, it also satisﬁes the CSDL model conditions, and vice versa. Therefore,
the two sets of conditions are equivalent.

Consequently, for models for which (1) E[xj|Aj] = Aj (2) p(xj+1|AJ-) =
p(xjHlAJ-H), and (3) H(AjlxjH) > H(Ajlxj) hold at every valid scale—the same
three conditions adopted in Section 2.4.1—ﬁner scale patterns in the data convey the
required information for a more precise model, but modeling based solely on this new

information necessarily produces a less accurate representation.

172

APPENDIX B

Appendix to Chapter 3

B.1 Posterior Distributions

In this section we show that

f(AjICO) = f(Ajlcj) (B-1)

and
f(5j,k|Co) = f(6i.lej-1»?kicj—l,2k+l)- (32)

These are proved in the following two propositionsl

Proposition 1 p(AJ-lco) = p(Ajlcj) for j = O, - -- ,J

Proof: Since the same information contained in the set {Co} is contained in {CO, cj},

 

1Throughout this dissertation numerous has been the instances where use of a table of integrals
and other mathematical expressions has been made. Our main source of information in this respect
has been the tables compounded by Gradshteyn and Ryzhik (see [87]).

173

...v— -.

O
H XII"
' '1 """7 ll) 03 XIV. .
1 1"I' '

   
 

.; ..' .,;.'s::-1.~."'! mi .-

r

7141' " Wade 9',
v

I!”

 

,. l .:1‘ 111'!" "1‘ -
’ 9’ 1 . ,uihds'- - 9. .1...‘, . >‘ ' u ' 1‘ ' ' ' ‘7
, - ,. ~~. Luv-«1117121111410!!!
, . dugout-raider: tin-.11. “mug. 11:. ....” . .'-.. ._ ,5 Q“: m
. . . v w: .1 . 3 ..v
H . Wm lid‘. nl antlmmt‘m‘ l1 «‘JLI . -. . «. ,.. , .‘ 1,,
- . 1. 4 .. .. 1 ﬂint-'1 ‘ l

. .I . L‘- “"' V
i o 3f " “w'wi 7w... '1 Winn . I" nun-mam

'mo
.
4%
g
..

‘1
.-
;‘F ‘
- v
- ‘—
.
;1
'4 “c
I
I
.

 

applying Bayes’ theorem f (Ajlco) may be written as

p(AJ-ch)

f(AjICo) = f(AjICo»le = P(Colcjdjlmcolcji

Thus, it sufﬁces to show that p(colcj,AJ-) = p(colcj) for j = 1, - ~ , J.

Define sequences of even and odd elements of cj_1 and Aj_1: €311 =

(Cj—1,O~.Cj—1,21"' aCj—1,1’V'/2J-1—2) and 0311 = (Cj—1,11Cj-1,3a ' " ,Cj—1,N/2j—1_1), and sim-

ilarly for Aj_1 and A34. Clearly,

p(C‘- .C"- IAj) .
l = #mch'IJAJ-l) 1fc3?_1= (:1- elf—1 2 Q

119—1|ch (8.3)

0 otherwise.

Then, with the aid of the total probability theorem we may write for the non-trivial

case

.) = fp(c§_1.c,- — c§_1IA,-, ;_,)p(,\;_,|A,-) dA§_1
P(CjIAj)
f Hk p(Cj—1,2k|/\j—1.2k) Hk 19(0ch - Cj—1,2kl)\j,k - Aj—1,2k)p(A"?_1|AJ-) clA‘3._1
= 2 J
Hkp(cj,kl)‘j,k)
fHk e-Aj— 1.21: (Aj— 1,c2klJ 12" I'Ike—Aj.k+Aj—12k("j,k")‘j-1,c2k)”‘ c1 12" P(A;-1|/\j)d)1§_1

(Cj- 12kl' k(cjk- Cj- 12k)‘
_,\. k.._J.1__
Hk e (A(Cchc)!

Cj’k )‘j-i2k)c’—l’2k ( )‘j-12k)c""_cj“'2"
= —' 1-—-’— A"; A- dx:_,
[I]; ( AjJt Ach p( 7 1| 3) J 1

Cj-l.2k

 

p(cj_1|CJ-, A

 

 

where all the indicated products are over k = 0 through N / 21 — 1. In these expressions
we have exploited the conditional independence of the data scaling coefﬁcients at any
speciﬁc scale given their corresponding intensity scaling coefﬁcients.

Since the innovation variates are given by y“, = 52f? (see (3.11)), the condi-

tional densities within the integral signs may also be written as pA;_1W(A;_1|Aj) =

174

A,- .
11,11,117” (( A1;2k)k| ()‘j.k)k)9 where Yj = (yj,o,yj,1,°" ,yj,N/2j_1). Then, With a
change of variables and recalling the mutual independence of yj and Aj, we obtain

the equivalent expression

CM c-12k C'k_C'-l2k
p(Cj—llcjv A3) = H (wk) ’ ' (1‘ Wk) " J ' P(YJ‘) dy)‘ (BA)
k Cj-l.2k
The absence of AJ- in this expression indicates that cj_1|cj is, at least explicitly,
independent of Aj. However, one may argue that Aj intervenes only functionally to
deﬁne the probability mass p(cj_1|cj, A,-). That is, if we denote the right side of (8.4)
by g(cj-1, cj), we must. still verify whether g(cj_1,cj) = p(cj_1|cj). We may achieve

this as follows.

p(Cj-llcj) = [P(Cj—llcjaAjlpo‘lej)dAJ‘

= [g(cj_1,cj)p()\jlcj)d/\j = 9(C1-1’CJ)'

Therefore, p(cj_1|cj, A3) = p(cj_1|cj).
Using this result, one arrives at the desired ﬁnal conclusion p(colcj, A3) = p(colcj)

using induction: Let I be any integer such that O S l g j — 1, and assume that

P(Cj—zlcj, A1") = P(Cj—zlcjl- Clearly,

P(Cj—z—ilcj,z\j)
: /Zp(Cj—l-1|Cj-11Aj—la Cj,Aj)p(Cj_l|Aj_l,Cj,Aj)p(Aj_llcj,A1.)dAj~l

Cj_(

= Z[P(Cj—I—ilcj—z,Aj—1)P(Cj—1|Aj_1,cj,Aj)p(AJ-_¢|cj,Aj)dAj_,

91.1

175

’— T' ‘

 
   
  
    

Y ...)ng .(‘(t\‘,f‘ ‘1,
.. I , .. .,; .‘miifnrn 1m .- 3

‘ in

. , L . I... ‘1']
L n\ -\l\ll
., , w . . ~ »-4:.,am:m$~g

l ' " " '9“: 1:" ram-0M v‘ .7:
' i 11*:

.. m, 1.1?- wily-11111113}

.
. i 1 W
. wzwm '1111, ”mn""

IA

uu‘? ‘HF‘HF-m ‘1, .' . '

 

‘Ajlxj' ‘./1.")l _!4,\ "'13. f» .( i g I I ‘ ‘ " “C!

I 1-“.l‘l“._'f; f("x’ .". . I I“, "

,.I
I ' VII
l V'.\
.' 7 l I
~ .. .LI ’
v I.
.- . I ’

 

Now, since p(cj_1|cj, Aj) = p(cj_1|cj) and the assumption,

p(Cj-l—l ICj, A1) = Z: P(Cj-z-i lcj—l)p(Cj—llcja Aj)

Cj_z

= Zp(ci-l-lle-lvlel)(Cj-llcjl = P(Cj—z—ilcjl-

Cj_(

Taking l = j — 1 completes the proof. Note that we made use of the fact that
given scaling coefﬁcients at a given scale, knowledge of the corresponding scaling co-

efficients at a coarser scale (greater value of j) does not alter the likelihood of an event.

Proposition 2 f (6,,klc0) = f (6j,k|cj_1,2k,cj_1,2k+1) forj = 1, - -- ,J

Proof: In order to keep the mathematical notation as clean as possible in these and
subsequent derivations, we introduce the following simplifying notations: c1 = cj_1,2k
and c2 = cj_1,2k+1, c;_1 = cj_1\(c1,c2), and y; = yj\yj,k. Now, since y“, = %(1 +
51k) (see (3.10)), it suffices to show that p(yj,k|c0) = p(yj,k|c1,c2) forj = 1,~- ,J.
This may be done as follows.

By the total probability and Bayes’ theorems,

p(ijcICO) = [p(yj,,\,|c0)dy;d,\,-

p(ConjaAj) .
————-—— .,A' d .dA-.
/ p(CO) p(yJ J) y] J

Since the information conveyed by {y}, A,-} is the same as that conveyed by {Aj_1},
P(ColyJ'aAj) = p(COIAj-l) = P(Aj—llcolmcol/PQj—ll- Using PFOPOSitiOTl 1, this be-

comes P(Colva‘j) = P(Aj-1ICj—1)P(Co)/P()\j—1) = p(Cj—lIAj—1)p(CO)/p(Cj—l)- Upon

176

substitution in the above integral

 

pom--1) .
p(yjnco) = / 13(ch p(dey, d1,-

P(Cj—1IYj,/\jl ..
= - -,,\. d .dA-
/ P(Cj-il p(yJ J) y] J

 

p(03_1ly;, *1)
P(Cj—ll

P(Cl—l lAj)
— . , . A- __J__
- P(yml [p(cl, cglym, 1k) (Ci-1)

A61 +Cze-A pc(*

 

= [P(Cleczlyjk1 AM) p(yj’ Al) dy; dAj

P(I‘v‘) dAJ’
Cj_1j|/\ )

C1! C2! m ) p(AJ- )dAj.

C
= p(yj.k)y,-,l(1 -yj.1-)C2 /

Thus, p(ijcICO) = p(yj,kl01,62)-

B.2 Optimal Estimation of the Multiplicative In-
novation

In this section we ﬁnd a closed form for the optimal estimate 6M of the innovation

coefficient 5j.k° For the sake of simplicity, in the few steps that follow we will disregard

the indices j and k, and simply write 6 for 513k and similarly for other quantities.
The minimum mean square error (mmse) optimal estimate of the innovation co-

efﬁcient 6, given all the information available in CO is given by

E[6|=c0] /:()6p 6|c0)d 6:]:(61) 6|c1,c2)d (8.5)

in accordance with (B2). Applying Bayes’ theorem to this expression we obtain

3 = filamctczlopww = fl,fo°°6p(c1,c2|6,i)p(xla)dAp<6>d§
f3, p(cl, c2|6)p(6) d6 f _1, f0°°p(c1,c2|6,1)p(1|5) dAp(6) d6

 

 

Here, p(Al6)= pA( ) due to the independence of Aj k and 6 J k. Also, notice that Cl and

177

 

 

1111391111

.- .4" '1 ”M . l
1. (11!; .l
I .4. .JM .
_, .. - ...—m. .
’/l.'.
."!"‘\\v -

, 5
1 ‘5‘ ‘Jl‘!’|\ a“
‘.

.‘ ‘ l‘.‘ L" .l. ’ V ’1

  
     
   
   
  

.1. . -~ ." 7:11:53271 15ml
' S

'1’.
1101.1 1
. 1‘1
-.;.,y.1,'..1"";_
, 1.’

.11.. ,m- «a: so"? ' 12'
‘4.

- 1113.6 .1 Mr t

' t

mfimpp. ' .-

-0.')11.21h-'n...1 1.‘ > 1 1 ‘jﬂi
,. ‘3 .,‘.' .’ 11"

l. , _ ﬂ ,... .. O _J ' .‘l ”('l'B'KMle Id, “51 u l'

(a. “.l-‘I ' l ' v; .
,8) "I -. ~ . ,‘ ~- .1. w 1 “11.315111 3:.

D',_
"V's;

”It, I} '5“ I’LL/9.111.! '2 , ”"1“111' " ‘1‘ 1 ' 1.“1.';‘;’. 13.8)(‘11’ ' r
". r .‘

.4 a.
night/1mm: 3 ~ '7 "

1am -~;-_"__I.Z'_‘-'.".‘-Ll"ﬂ .. I " mam-um? :
a “it .'.‘I v:'. ‘ _Ii-‘ ':

3.76“}... "z. 1; 1.31:- ,..; ”1 , ,_ MMI-11659.4!) ,._,.-,

.' '_

If
‘. p ' i 1 ' ~ V ’ ‘
. u. _ . .MJI l'lloun’! Juli: . “61'" 1. “A L. v‘v."'T.itN;‘1llI'L 'tlll-JJ uubf 97
. . . -

~ pa
‘ .'r‘

 

C2 only depend on A and 6 by way of their respective statistical means A1 = -;—A (1 + 6)
and A3 = %A (1 — 6) (See (3.11) and (3.12)). Given A1 and A2, A and 6 do not convey
any further information on the behavior of Cl and c2. Moreover, given A1 and A2, c1

and C; are independent. Therefore,

fl, I,” 6p (ctcz (114,411.32) p(AMpr) d6
_1 °°p c1.c2 Ah? Al-zé (A) dpr) d6
0
1:11 fooo 6pe__—_-A(cll+6)/2 (A#)Cl £32242 (A1—— (5)62 p(A)dAp(6) d6

00 e____-A(1+6)/2 c1 e—Ml 6)/2 c '
L11 lo c (”T”) 72—01345) 2 P(AldAP(5)d5

on)

 

It is now possible to factor out every A-dependent term from the integrals in 6. Doing
this, the resulting integrals in A in the numerator and denominator cancel out leading

to

 

g_[3,15(1+5)‘=1(.1—5)‘~‘2p(5)(15

L11 (1+5)c‘(1-5)c’p(5)d5° (8'7)

Substituting the beta mixture model (3.13) into this expression and carrying out the

integration we obtaine the desired result:

 

 

 

22' p B(8:+C1 81+C2)

A _ 3 B(s,-, s.) )(23.+c)

6j,k "- dj,k Z B(3i+C1 3i+C22 . (8.8)
ipi B(8;‘,8;‘)

178

w" “v

   
  
   

.1‘z'i ‘ MN" ”(’6 .
I ‘ : (Ithuﬂ ' -'
I III‘Iu-Mjno-

r.vI1-n-u.‘l’ .

- ' ugh-'11 019

 

In 112.1195 I’

. t‘ ‘

H, H

< ‘ "J'Mﬂsm _

,.~-»

v , I/ 1 ".J'l. »

16,“. 7-
| q

.‘1 -.
v" .
- n.‘

I
‘0‘ ’ﬁ 1%

APPENDIX C

Appendix to Chapter 4

C.1 The Hilbert Transform as a Continuous Av-

eraging Process

The Hilbert transform of an L2(IR) function f is deﬁned as

Hf(t) a 1 0° Mar, ((3.1)

7r _oot-T

which is the convolution of f and %. For functions with ﬁnite support, it is possible
to give a more intuitive meaning to this operation. In particular, if supp( f ) = (a, b],

where a < b, then

_ l
b a / mo.,az(r)dr, (0.2)
7r 0

 

71(f(t) = —

where

f(02(T)) — f(UIITll (C.3)

02(7) — 01(7)

"101.02 (T)

179

and 01(7) E (a — t)r +t and 02(7) E (b — t)r + t. Note that 01(0) = 02(0) = t, while
01—> a and 02 —» b linearly as r —-> 1.
Clearly, if the derivative of f exists everywhere in (a, b), ma,,02(r) is the average

derivative of f in the interval (01(7),0'2(T)) C [a,b], i.e.,

 

1 02(7)
mom 7' = ’ a do. C4
2( ) 02(Tl-01(T) [01(1) f( ) ( )
Proof: Substituting C.3 into C.2, and replacing the deﬁnitions for 01(r) and 02(7),

we obtain

Hf(t)=—%/O f((b—t)r+t)C:_—T+%/0 f((a—t)r+t)£:_:

Now, in the ﬁrst integral let u = (a — t)r + t, and in the second integral let u =

(b — t)r + t. In both cases we have that df = iii—:11, which upon substitution we get the

desired result C. 1:

 

 

u—t u—t

Hath—ﬁve) d“ +71; [m 0’“

C.2 Proof of Expression (4.24)

In this section, we investigate the nature of the approximation

 

[2” Sa(—12!0—7m) d6~ 21r/N (0.5)
0 (

1— rcos(6 — (15))2 I ~ (1— rCOS(d> — $73371»?

for n = O, 1,. . . ,N — 1, and ﬁnd the conditions under which it is adequate.

Let rect(;z:) be the function that is 1 in [—32”-, g], and zero otherwise. Then, the

180

Fourier transform of Sa(‘-3_£6) is Freedﬁ). Thus, Sa (%6) = ﬁ ﬁll? e‘wodw, and

2" Sa(-1y-6 - 7m) ”/2 21f ewe
/ _/ 8-1.41. / dﬁdw
0 (1 — rcos(6— (1)))2 :N Mg 0 (1 - rCOS(9 - (25))2

N/2 _ w 00 211’ t ’
1 eWTn/ N rec (02—7r 1r) e—zwﬁ dﬁdw,
=2” -N/2 —oo (1 " 7005(9— 99))?

 

 

 

Of

 

 

A
(2” 341219 _ 7m) do = .1— m lvﬂreC‘It—“l ...-ten .1... (c 6)
0 (1 —rcos(6—¢))'~’ 27r _N/2 (1—rcos(6—¢))2 1 -

where as “usual, ()A denotes Fourier transformation. Clearly, if the “bandwidth” of

2_7r

(1N, cos(6 93¢»? were less or equal to N / 2, the right integral of (C.6) would reverse its

           

Fourier transformation, giving (C.5) with equality. This, however, can never occur

_2__r 9— 1r
since (1”,, co:((192—— 4);), has compact support, and so, its bandwidth is never ﬁnite. Note

 

that C.6 is simply a restatement of the Nyquist Sampling Theorem and the recon-
struction series which derives from it [88].
Although we must content ourselves with an approximation, we can choose the

parameters in (OS) such that its two sides are as close as desired. To see how this

N "92;rect(
(1- -rcos(9— 1:11))?

 

is possible, ﬁrst note that the bandwidth of does not change with N,

2_ (9-1r)

and so, the right-side integral of (C.6) monotonically approaches (INTE(E)))7 as N

 

increases. Variations on <15 only affect the phase of the Fourier transform and does

not modify the bandwidth. On the other hand, as the value of r goes from O to

1, (lwrr::((0%- (:9), goes from a constant to a U shape function on its support [0, 271],

 

drastically increasing its high-frequency content. A detailed analysis would show
that this increase is monotonic because the energy increase in the signal continuously
adds towards building up the singularities at the interval’s boundaries.

In order to ﬁnd values for r and N that make (C.5) a good approximation, nu-

2h 1r9—’—rect( 2;
(1— rcos(G— ¢)))

 

merical solutions were obtained. The Fourier magnitude of 2for four

181

14
12

NO¢O

 

 

 

 

12 5 10 15 20

 

 

 

 

 

(b)
30 25
40 20
30 15
20 10
10 5
5 10 13 20 5 10 1‘5 20 25 33 35

(C) (d)

Figure C.1. Magnitude frequency content of (1Nr::((a— (my (abscissa in radians/s) for

¢=0and(a)N=16,r=.8,(b)N=32, r: .,9 (c)N=32,r=.96,and(d)
N=64,7‘=.96,

 

combinations of N and r are shown in Figure C.1.
It is seen from these graphs that the right-hand side integration operation of CG
would change very little if it were over IR rather than over [—%, g]; although Figure

C.1(c) probably corresponds to an acceptable upper limit for 7' when N = 32. The

2—" rect(9—2"— K")

(1- NTCOSU’ 21:15))
integration —7 and % in each graph 1S as follows: (a) 38 dB, (b) 53 dB, (c) 30 dB,

and (d) 67 dB.

 

decay in the frequency response of 2 at the lower and upper limits of

182

-.v 7": "

"
”NIH
v ‘ ' H
. ,, r. "’
9- "/
.«1 T r- :H
\ u u
.‘ J ('3)

   
    

 
  

f ‘ I ,.- , 1 , -H'»'~1! (IbDI‘U

' I ' ‘H "
U 1 . ’. ‘— 1.313% H

11'; Ilu'uer‘O'J'I:
11'-) :11 fDu‘Lur-u‘ . 1 , ' , 11,-31 .‘4'11 713w 311mg; M111“?

«.11 1.; ‘4 w'. 1 .-.1°'w 11 H 91?)” [INV-

' 7131.11.35 1

. ‘3‘“. ..1 ‘ - {‘J" 1.. 11“", "Jum'lm ! “f,
. I» , ' . ,- . I . .. a I. l. ‘. L
I! .U (”0111 ”llthgJ .fH: .1 1,; ‘~' ‘ '; ’I '_ 3 Mfl‘ "V VZNW'ML’

" r" ...“ ,.. ,- .., ‘ ‘ -. .
. , v ‘1} ”it . ,hj ’~“ .‘I .EUI ’ . ‘ - u“. . 1 1 v.‘ (X'LH'I AU. ;T‘ W' I

.
‘ .A
It
.
~ .
L, ,.
. L. -. 2
' 7‘ T.
. l
O ‘ r
. .
Q
I

BIBLIOGRAPHY

 

.
'11" Q
..
r
.
. -
. L

. 111).?!" “#1)

1. -- ....

l

 

BIBLIOGRAPHY

[1] D. L. Snyder and M. I. Miller, Random Point Processes in Time and Space. New
York: Springer-Verlag, 1991.

[2] D. L. Snyder, A. M. Hammoud, and R. L. White, “Image recovery from data
acquired with a charge-coupled-device camera,” J. Opt. Soc. Am. A, vol. 10,
no. 5, pp. 1014—1023, 1993.

[3] H. Heffes and D. M. Lucantoni, “A Markov modulated characterization of pack-
etized voice and data traffic and related statistical multiplexer performance,”
IEEE J. Selected Areas Comm, vol. 4, pp. 856—868, 1986.

[4] Y. Vardi, “Network tomography: Estimating source-destination trafﬁc intensities
from link data,” J. Amer. Statist. Assoc, vol. 91, pp. 365—377, 1996.

[5] M. A. King, R. B. Schwinger, P. W. Doherty, and B. C. Penney, “Two-
dimensional ﬁltering of SPECT images using the Metz and Wiener ﬁlters,” J.
Nuc. Med., vol. 25, pp. 1234—1240, 1984.

[6] G. A. O’Driscoll and et al., “Differences in cerebral activation during smooth

pursuit and saccadic eye movements using positron-emission tomography,” BI-
OLOGICAL PSYCHIATRY, vol. 44, no. 8, pp. 685—689, 1998.

[7] E. E. Smith, J. Jonides, and et al., “Components of verbal working memory: Evi-
dence from neuroimaging,” PROCEEDINGS OF THE NATIONAL ACADEMY
OF SCIENCES, vol. 95, no. 3, pp. 876-882, 1998.

[8] R. Dume and P. Jansen, “Computer tomography of barrels with radioactive
contents,” Nuclear Engineering and Design, vol. 130, pp. 89—102, Sep. 1991.

[9] J. Frank, Three-dimensional Electron Microscopy of Macromolecular Assemblies.
San Diego: Academic Press, 1996.

[10] F. Abramovich, T. Sapatinas, and B. W. Silverman, “Wavelet thresholding via
a Bayesian approach,” Tech. Rep., Math. Dept, Univ. Bristol, 1996.

184

[11] H. A. Chipman, E. D. Kolaczyk, and R. E. McCulloch, “Adaptive Bayesian
wavelet shrinkage,” J. Amer. Statist. Assoc, vol. 92, pp. 1413—1421, 1997.

[12] D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via
wavelet shrinkage,” J. Amer. Statist. Assoc, vol. 90, pp. 1200—1224, Dec. 1995.

[13] I. M. Johnstone and B. W. Silverman, “Wavelet threshold estimators for data
with correlated noise,” Journal of the Royal Stastistical Society, Series B, vol. 59,
pp. 319—351, 1997.

[14] E. D. Kolaczyk, “Wavelet shrinkage estimation of certain Poisson intensity sig-

nals using corrected thresholds,” Statistica Sinica, under revision, 1997.

[15] B. Vidakovic, “Nonlinear wavelet shrinkage with Bayes rules and Bayes factors,”
J. Amer. Statist. Assoc, vol. 93, pp. 173-179, 1998.

[16] R. D. Nowak, R. L. Gregg, T. C. Cooper, and J. E. Siebert, “Removing Rician
noise in MRI via wavelet-domain ﬁltering,” in submitted to Annual Meeting of
Intl. Soc Magn. Reson. Med., 1998.

[17] P. G. Hewitt, Conceptual Physics. Boston: Little, Brown & Company, 1985.

[18] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: 2nd
Edition, Springer-Verlag, 1998.

[19] G. H. Hardy and W. W. Rogosinski, Fourier Series. New York: Dover Publica-
tions, Inc, 1999 republication of the original 1956 edition.

[20] S. Mallat, “A theory for multiresolution signal decomposition: The wavelet rep—
resentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 11, pp. 674—693, July 1989.

[21] S. Mallat, A Wavelet Tour of Signal Processing. San Diego: Academic Press,
1998.

[22] M. Frazier, An Introduction to Wavelets Through Linear Algebra. New York:
Springer, 1999.

[23] C. P. Robert, The Bayesian Choice. New York: Springer-Verlag, 1994.

[24] M. D. Srinath, P. K. Rajasekaran, and R. Viswanathan, Introduction to Statis-
tical Signal Processing with Applications. Englewood Cliffs, N.J.: Prentice-Hall,
1996.

185

[25] H. Stark and J. W. Woods, Probability, Random Processes, and Estimation The-
ory for Engineers. Englewood Cliffs, N.J.: Prentice-Hall, 1994 Second Edition.

[26] S. Wolf, Guide to Electronic Measurements and Laboratory Practice. Englewood
Cliffs, NJ: Prentice-Hall, 1973.

[27] R. G. Gallager, Information Theory and Reliable Communication. New York:
John Wiley and Sons, Inc., 1968.

[28] G. B. Folland, Real Analysis —Modern Techniques and Their Applications. New
York: Wiley-Interscience, 1984.

[29] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication.
Urbana, Illinois: Univ. of Illinois Press, 1963.

[30] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York:
McGraw—Hill, 1984.

[31] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York:
John Wiley & Sons, Inc, 1991.

[32] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal
Processing. Upper Saddle River, NJ: Prentice Hall, 2000.

[33] R. T. Odgen, Essential Wavelets for Statistical Applications and Data Analysis.
Boston: Birkh'auser, 1997.

[34] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based statistical signal
processing using hidden Markov models,” IEEE Trans. Signal Processing, to

appear in Special Issue on Theory and Applications of Filter Banks and Wavelets,
1998.

[35] D. L. Donoho and I. M. Johnstone, “Ideal spatial adaptation via wavelet shrink—
age,” Biometrika, vol. 81, pp. 425—455, 1994.

[36] G. P. Nason and B. W. Silverman, “The stationary wavelet transform and some
statistical applications,” in Lecture Notes in Statistics: Wavelets and Statistics,
vol. New York: Springer-Verlag, pp. 281—299, 1995.

[37] N. Weyrich and G. T. Warhola, “Denoising using wavelets and cross-validation,”
Technical Report AFIT/EN/TR/94-01, Department of Mathematics and Statis-
tics, Air Force Institute of Technology, Ohio, 1994.

186

[38] G. P. Nason, “Wavelet regression by cross-validation,” Technical Report 44 7,
Department Statistics, Standford University, California, 1994.

[39] G. P. Nason, “Wavelet shrinkage using cross-validation,” Journal of the Royal
Statistical Society, Series B, vol. New York: Springer-Verlag, pp. 58:463—479,
1996.

[40] R. D. Nowak, “Optimal signal estimation using cross-validation,” IEEE Signal
Processing Letters, vol. 4, no. 1, pp. 23—25, 1997.

[41] F. Abramovich and Y. Benjamini, Thresholding of wavelet coeﬁicients as multiple
hypotheses testing procedure. New York: Springer-Verlag, 1995.

[42] H. C. Andrews and B. R. Hunt, Digital Image Restoration. Englewood Cliffs,
New Jersey: Prentice Hall, 1977.

[43] D. T. Kuan, A. A. Sawchuk, T. C. Strand, and P. Chavel, “Adaptive noise
smoothing filter for images with signal-dependent noise,” IEEE Trans. Pattern
Anal. Machine Intelligence, vol. 7, pp. 165—177, 1985.

[44] R. D. Nowak and R. G. Baraniuk, “Wavelet-domain ﬁltering for photon imaging
systems,” Proc. SPIE, Wavelet Applications in Signal and Image Processing V,
vol. 3169, pp. pp. 55—66, August 1997.

[45] R. Nowak, R. Hellman, D. Nowak, and R. Baraniuk, “Wavelet domain ﬁltering
for nuclear medicine imaging,” in Proc. IEEE Med. Imaging Conf., pp. 279—290,
1996.

[46] J.—C. Pesquet, H. Krim, and E. Hamman, “Bayesian approach to best basis
selection,” in IEEE Int. Conf. on Acoust, Speech, Signal Proc. — ICASSP ’96,
(Atlanta), pp. 2634—2637, 1996.

[47] E. P. Simoncelli and E. H. Adelson, “Noise removal via Bayesian wavelet coring,”
in IEEE Int. Conf. on Image Proc. — I CIP 1996, (Switzerland), September 1996.

[48] N. L. Johnson, S. Kotz, and A. W. Kemp, Univariate Discrete Distributions.
New York: John Wiley and Sons, 1992.

[49] L. L. Scharf, Statistical Signal Processing. Detection, Estimation, an Time Series
Analysis. Reading, MA: Addison-Wesley, 1991.

[50] B. Mandelbrot, “Intermittant turbulence in self-similar cascades: Divergence of

high moments and dimensions of the carrier,” J. Fluid Mech., 1974.

187

[51] S. Lovejoy and D. Schertzer, Multifractals in rain. Cambridge, U.K.: Cambridge
Press, 1995.

[52] C. J. G. Evertsz and B. B. Mandelbrot, Multifractal measures. New York:
Springer-Verlag, 1992.

[53] R. D. Mauldin, W. D. Sudderth, and S. C. Williams, “Polya trees and random
distributions,” Ann. Stat, vol. 20, pp. 1203—1221, 1992.

[54] M. Lavine, “Some aspects of Polya tree distributions for statistical modeling,”
Ann. Stat, vol. 20, pp. 1222—1235, 1992.

[55] S. LoPresto, K. Ramchandran, and M. T. Orchard, “Image coding based on mix-
ture modeling of wavelet coefficients and a fast estimation-quantization frame-
work,” in Data Compression Conference ’97, (Snowbird, Utah), pp. 221—230,
1997.

[56] E. D. Kolaczyk, “Bayesian multi-scale models for Poisson processes,” Technical
Report 468, Department of Statistics, University of Chicago, 1998.

[57] P. Diaconis and B. Efron, “Computer-intensive methods in statistics,” Scientiﬁc
American, pp. 116—129, May 1983.

[58] A. M. Zoubir and B. Boashash, “The bootstrap and its application in signal
processing,” Signal Processing Magazine, vol. 15, no. 1, pp. 56—76, 1998.

[59] R. D. Nowak and R. G. Baraniuk, “Wavelet-based ﬁltering for photon imaging
systems,” IEEE Trans. Image Processing, submitted April 1997.

[60] E. Simoncelli, W. Freeman, E. Adelson, and D. Heeger, “Shiftable multiscale
transforms,” IEEE Trans. Inform. Theory, vol. 38, pp. 587-607, 1992.

[61] R. Coifman and D. Donoho, “Translation invariant de—noising,” in Lecture Notes
in Statistics: Wavelets and Statistics, vol. New York: Springer-Verlag, pp. 125-
150, 1995.

[62] M. Lang, H. Guo, J. E. Odegard, C. S. Burrus, and R. O. Wells, “Noise reduc-
tion using an undecimated discrete wavelet transform,” IEEE Signal Processing
Letters, vol. 3, no. 1, pp. 10—12, 1996.

[63] J .-C. Pesquet, H. Krim, and H. Carfantan, “Time-invariant orthonormal wavelet
representations,” tSP, vol. 44, no. 8, pp. 1964—1970, 1996.

188

 

 
  
   
   
 
  
   
      
   
    
     
 
   

v]. r"'
1" .l. .V’de‘aw‘|
g.»

0
ﬂ.

1

;1 .(l ban ‘
.29”

‘H will '5’, '
“1'3. ﬁnk .,

. i l;

. “M. 366108"

.. [v .02.! ”J."

‘-.'.. 1..” H.
"w ”Pr/34W ‘0 ,' .
,, 11"31‘70') INC?" ‘

. ..
‘ .
o
D
'd,’ ‘1
‘. ,

" ‘1'“8‘ 1‘ .
-: IWn‘xﬂ “
;, ._1.8 ”If V,
s; M.) at! std“ __

I u

.1 7‘

' .-

r y . l 1:. ‘1] (Mu .I ‘ k
. , ”)5.

.1 ,1 V ,7 , .' 1 - > -"-, lnt‘b'z ﬂ .

. . . 1 '. 41 “I
;.ll!_+1-1I . 1 , v '- I‘ am >

....1 3331'.

1’; 1_’
. . '1." V
11'1": V l '1'.) ‘3'] I, J! ~‘
, .. , .‘ ,.- -l 1 ‘1‘ -: ‘9 1.1.35! ”7 '.

1‘ .1
7:111'11'! ."~

‘_)]-"'¢-| 1].:1! «I 11 ,I'I

n , “5’
:..“\I1’.

‘11,, 1'11 l "I‘ll ‘rl Iw‘ .Il,
'~= ' .1111.“ ‘ . l

“'21 .111, gul'rw’vr‘ 1 v-"'.’

7.":
m11n1arjcr'l’,“ .rvliuw «.1 ‘1 i . ~..- 1 e . ”..me 41.1. 11:10 gf

1'. , ~ r ‘ Ho, 1. wh-lumu, _
.v-t‘IJl 1.1 {)1 m1 .I .01} ,E ._v

’ l

Igﬂ‘n‘gam‘i 111mg: . 51311

I

4
labia-wInna-1111.111?!”tam'wzxwmi'?‘ ,- . ‘ ' 1 :n :14")! .ll. '
51.4}: 'ITU' ‘ "'

.0] .. . in». "'

I
.
...
. .,.
6'12 ‘

. .
- .vv
. - _‘ , .
, {_QS‘J‘A'}. In.“ _. A
_—,L - '

[64] R. Nowak, “Shift invariant wavelet-based statistical models and 1/f processes,”
in Proc. IEEE Digital Signal Processing Workshop, paper no.83, Bryce Canyon,
UT, 1998.

[65] A. P. Pentland, “Fractal-based description of natural scenes,” IEEE Trans. Pat-
tern Anal. Machine Intell., pp. 661—674, July 1984.

[66] B. J. West and M. F. Shlesinger, “On the ubiquity of 1 / f noise,” Intl. J. Modern
Physics (B), vol. 3, no. 6, pp. 795—819, 1989.

[67] G. W. Wornell, Signal processing with fmctals: a wavelet-based approach. Upper
Saddle River, NJ: Prentice-Hall, 1996.

[68] I. Daubechies, Ten Lectures on Wavelets. Philidelphia: SIAM CBMS-NSF Series
in Applied Mathematics, no. 61, 1992.

[69] J. A. Sorenson and M. E. Phelps, Physics in Nuclear Medicine. New York: Grune
& Stratton, 1987.

[70] I. M. Gelfand and S. G. Gindikin, eds., Mathematical Problems of Tomography.
Rhode Island: American Mathematical Society, 1990.

[71] B. V. Jackson, P. L. Hick, M. Kojima, and A. Yokobe, “Heliospheric tomogra-
phy using interplanetary scintillation observations,” in Proc. of the E05 XXI
meeting, paper no. 15.02, The Hague, Netherlands, 610 May, 1996.

[72] X. Descombes, F. Kruggel, and D. Y. von Cramon, “Spatio-temporal FMRI
analysis using Markov random ﬁelds,” IEEE Trans. Medical Imaging, vol. 17,
no. 6, pp. 1028-1039, 1998.

[73] X. Hu, T. H. Le, T. Parrish, and P. Erhard, “Retrospective estimation and
correlation of physiological ﬂuctuation in functional MRI,” Magn. Reason. Med.,
vol. 34, pp. 201—212, 1995.

[74] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, N .J.:
Prentice Hall, 1989.

[75] F. Natterer, The Mathematics of Computerized Tomography. Stuttgart: John
Wiley & Sons Ltd and B G Teubner, 1986.

[76] T. F. Budinger and G. T. Gullberg, “Transverse section reconstruction of gamma-
ray emitting radionuclides in patients,” Reconstruction Tomography in Diagnos-
tic Radiology and Nuclear Medicine, University Park Press, Baltimore, MD, 1976.

189

[77]

[78]

[79]

[80]

[81]

[82]

[83]

[84]

[85]

[86]

[87]

[88]

A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging.
New York: IEEE Press, 1988.

M. Bhatia, W. C. Karl, and A. S. Willsky, “Tomographic reconstruction and es-

timation based on multiscale natural-pixe base,” IEEE Trans. Image Processing,
vol. 6, no. 3, pp. 463—477, 1997.

J. M. Links, R. W. Jeremy, S. M. Dyer, T. L. Frank, and L. C. Becker, “Wiener
ﬁltering improves quantiﬁcation of regional myocardial perfusion with thallium-
201 SPECT,” J. Nuclear Medicine, vol. 31, pp. 1230—1236, 1990.

T. Hebert and R. Leahy, “A generalized EM algorithm for 3-d Bayesian recon-
struction from Poisson data using Gibbs priors,” IEEE Trans. Medical Imaging,
vol. 8, no. 2, pp. 194—202, 1989.

S. S. Saquib, C. A. Bouman, and K. Sauer, “A non-homogeneous MRF model
for multiresolution bayesian estimation,” IEEE Int’l Conf. on Image Proc, 1996.

M. H. Buonocore, W. R. Brody, and A. Macovski, “A natural pixel decompo-
sition for two-dimensional image reconstruction,” IEEE Trans. Biomed. Eng.,
vol. BME—28, no. 2, pp. 69—78, 1981.

J. DeStefano and T. Olson, “Wavelet localization of the radon transform in even
dimensions,” in IEEE-SP Int. Symp. T ime-Frequency and T ime—Scale Anal,
pp. 137—140, Oct. 1992.

B. Sahiner and A. E. Yagle, “Limited angle tomography using the wavelet trans-
form,” in IEEE Nucler Science Symp. and Medical Imaging Conf., pp. 219—222,
Oct. 1993.

Z. Wu, “MAP image reconstruction using wavelet decomposition,” Lec Notes
Comput. Sci., vol. 687, pp. 354—371, 1993.

F. Rashid-Farrokhi, K. J. R. Liu, C. A. Berenstein, and D. Walnut, “Wavelet-
based multiresolution local tomographiy,” IEEE Trans. on Image Processing,
vol. 6, pp. 1412—1430, Oct. 1997.

I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products. New
York: Academic Press, 1965 fourth edition.

A. Papoulis, The Fourier Integral and its Applications. New York: McGraw-Hill,
1962.

190