z . .5. mail" .u .__ . a l.- l .2 . Ewim a.» X r. .; 5 14!. . . .5‘ A .3 am“): {.1 IS» . (v 3:? a .Eyemx v : a w... . 4. «Sn. L . 5.3,." am E ‘ a. y. .. M5, L_. 5 ‘ . . , 4.? mm": . ‘ m mug. 5 7ki¥ fig :3“? mi . V“ ‘2 4 A ,_.m.:w. .2 «‘1 - 1.5. THESIS 2 cm This is to certify that the dissertation entitled MULTISCALE MODELING AND ESTIMATION OF POISSON PROCESSES WITH APPLICATIONS TO EMISSION COMPUTED TOMOGRAPHY presented by Klaus E. Timmermann has been accepted towards fulfillment of the requirements for Ph.D degree in Electrical Eng MajoEprofesso\ Dategl/Z 2’;/00 MS U is an Affirmative Action/Equal Opportunity Institution 0- 12771 LIBRARY Michigan State University PLACE lN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE moo mm." MULTISCALE MODELING AND ESTIMATION OF POISSON PROCESSES WITH APPLICATIONS TO EMISSION COMPUTED TOMOGRAPHY By Klaus Edmond Timmermann A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Electrical and Computer Engineering 2000 ABSTRACT MULTISCALE MODELING AND ESTIMATION OF POISSON PROCESSES WITH APPLICATIONS TO EMISSION COMPUTED TOMOGRAPHY By Klaus Edmond T immermann Many important problems in engineering and science are well-modeled by Pois- son processes, and in many instances it is of great interest to accurately estimate the intensities underlying the observed Poisson data. This dissertation addresses the problem of general Poisson process estimation, but the work is primarily motivated by the photon-limited imaging problem. First, a Bayesian approach to Poisson in- tensity estimation based on a multiscale framework is presented. It is shown that the multiscale representation of signals provides a very natural and powerful framework for this problem. Using this framework, a novel multiscale Bayesian prior to model intensity functions is devised. The behavior of the new model is characterized by a study of its correlation properties. The new prior leads to a simple, Bayesian intensity estimation procedure. Practical fast shift-invariant algorithms for the new estimation framework are presented and applied to photon-limited data. We extend the modeling and estimation approaches for general Poisson processes to the emission computed tomography (ECT) image reconstruction problem. Two multiscale approaches are introduced. The first approach is based on the geometrical r. ' (film i put)” . . . ‘ . l m. we «. r... l ETAIIIGII‘ fl [Uln‘if'iV L" ¢ II “viral-la “A '. run 'T" 7"“ \r v .1.) ..".. . 7‘"'("‘ tr“ A“ ‘,"..‘._‘ ,li Thi- .V I0 3 Vif‘ 3 ll , YA. fil yl‘ -v - ‘hA\h.‘\,l; ‘9 OH} in _ “4.3;‘i‘ l H.101;— :I PC"; a» I, A: 9' properties of the so-called natural pixels of the intensity image. Within this frame— work, we develop a practical prior model for the sinogram image, which is used to estimate the underlying sinogram intensity from the raw projection data prior re- construction. The sinogram estimate is then used in conjunction with the standard filtered-backprojection algorithm to produce an improved image reconstruction. The superiority of the new approach over the conventional filtered-backprojection based reconstruction is illustrated with clinical and simulated data. The second, more sophisticated'approach to ECT is based on a new multiscale- based Radon inverse transform. It is shown that the new formulation lends itself to a very natural discretization of the Radon inverse operator which is especially amenable to numerical computation. Within this new reconstruction framework, we develop a prior model for the cumulative sinogram image ——a new intermediate image representation between a sinogram and the intensity image. The new model is unique in that it exploits the high degree of redundancy of information present in the sinogram.We demonstrate the superiority of the proposed method using real data. As the multiscale framework to modeling and estimation represents the founda- tion of the methods presented here, we introduce a new approach to characterizing multiscale models and estimators in order to assess their qualities. This approach is shown to be more general in nature than classical statistical descriptors ( e. g., bias- ness, mean-square error, etc.) which are based on limited information. Towards this end, we introduce the information-theoretic definitions of anomy, accuracy precision, and resolution power. Based on criteria developed with these concepts, we explore the advantages of multiscale Bayesian modeling and motivate its application to the estimation problem. COpyright © by Klaus Edmond Timmermann 2000 T “Tm ( C‘ In memory of my Father, who instilled in me my love for nature, who taught me the values for order, simplicity and social responsibility, and who got me started in sciences. To my Mother, who nourished the spiritual side in my life, and who taught me the importance of family and true friendship. To my beautiful wife, who has been there for me in times of trial, and in every other moment to enjoy life together. To my children, Karl and Derek, who make it all worth it, and who remind me day to day that I am not. there yet. We dun ii others for thr- :. wish to Illahii I. rife than all W First. I w: time. help and a {hark my adv} defined to :1. t . . in n‘v artisan?" and he patience as he and Dr. Hat‘fil ire: weigh ”we: A150. my g apported and Jim x3115. .lr.. Wily and £11.: FA Es;-;.;uza. Azxi r»- ‘ T 0 dr‘ .,,' i ‘ . w .a.‘ TIL‘) last that ACKNOWLEDGMENTS We often go about our day-to-day business of life not realizing how much we owe others for the many good things that come to us. It is only in the rare occasions we wish to thank them in writing that we come to realize the many they are and that. cite them all we cannot. Below is my short list. First, I would like to thank the members of my guidance committee for their time, help and advice during the various stages of this long endeavor. In particular, I thank my advisor, Dr. Robert Nowak for the many hours of technical discussions he dedicated to me during our years of association; Dr. John Deller for his friendliness and helpfulness throughout. these years; Dr. Michael Frazier for his kindness and patience as he introduced me to the fascinating areas of real and multiscale analysis; and Dr. Hassan Khalil for his openness and warmth during an interview which would later weigh heavily in my decision to attend Michigan State. Also, my gratitude and thanks go to the following very special people who supported and encouraged me during my years of study: Jim and Lois Mills, Jim Mills, Jr., Sue Vollmar, Ken and Jan Dart, and my very dear brother and sister Willy and Emmy, and their spouses, Maria Esther Timmermann and Jesus Alonso Espinoza. And to those whom I owe the most: my wife, Judy and parents, Leonor and Guillermo Timmermann, I thank from the deepest of my heart. I also thank the National Science Foundation for its very generous financial sup- port without which I would not have been able to pursue this higher academic degree. vi l IDIIOdUCIlL 1.1 OPRW N‘IIA.£ N The Multi: '2.l Prelim 9') “Hip” "- - 1.2“! ..... o»: co dr- go x1; 0 9—4 bx. Q" b~ TABLE OF CONTENTS 1 Introduction 1.1 Organization and Summary of Contributions .............. The Multiscale Modeling Paradigm 2.1 Preliminaries ............................... 2.2 Multiresolution through Wavelets .................... 2.2.1 The Fourier Transform and Scales ................ 2.2.2 Wavelet Representation of Signals ................ 2.3 Multiscale Modeling and Attributes ................... 2.3.1 The Components of a Model and their Notation ........ 2.3.2 Anomy, Accuracy, Precision, and Resolution Power ...... 2.3.3 Two Illustrative Examples .................... 2.4 The Multiscale Modeling and Estimation Advantage .......... 2.4.1 A/ P Models and their Properties ................ 2.4.2 Bayesian Multiscale Models and Estimation .......... 2.5 Other Multiscale Modeling and Estimation Approaches ........ 2.5.1 Threshold Smoothing Methods ................. Multiscale Modeling and Estimation of Poisson Processes 3.1 Preliminaries ............................... 3.2 Notation .................................. 3.3 Why the Unnormalized Haar Transform? ................ 3.4 A New Probability Model for Intensity Images ............. 3.4.1 Multiscale Signal Model Framework ............... 3.4.2 Multiscale Multiplicative Innovations Model .......... 3.4.3 Prior Distribution for Innovations ................ 3.5 Estimation ................................. 3.5.1 Bayesian Multiscale Intensity Estimator ............ 3.5.2 Selection and Analysis of the Beta-Mixture Prior ....... 3.5.3 Estimation of Prior Parameters ................. vii 7 7 10 10 13 27 27 30 42 47 48 56 63 63 7O 71 72 74 75 75 78 81 84 84 3.5.4 Optimal Estimation from Large Ensembles ........... 91 3.5.5 Example of Estimation from a Single Observation ....... 93 3.6 Stationary Intensity Models and Estimators .............. 94 3.6.1 A Shift-Invariant MMI Model .................. 95 3.6.2 A Fast Shift-Invariant MMI Estimator ............. 97 3.6.3 Autocorrelation Functions of MMI and SI-MMI Models . . . . 100 3.6.4 SI-MMI Model and 1 / f Processes ................ 105 3.7 Numerical Comparison of Wavelet-Based Intensity Estimators . . . . 107 3.8 Application to Photon-Limited Imaging ................. 109 3.8.1 Photon—Limited Imaging Simulation ............... 111 3.8.2 Application to Nuclear Medicine Imaging ............ 112 4 Emission Computed Tomography 116 4.1 Preliminaries ............................... 116 4.2 The Image Reconstruction Problem ................... 119 4.3 The Filtered-Backprojection Reconstruction Technique ........ 124 4.4 Some Limiting Aspects of Conventional Reconstruction Methods . . . 125 4.5 First Approach to Multiscale Modeling and Estimation of ECT Intensitie8127 4.5.1 The SI-MMI Method ....................... 128 4.5.2 Example .............................. 130 4.6 A New Multiscale-Based Tomographic Inversion Method ....... 134 4.6.1 The Multiscale Reconstruction Formula ............. 134 4.6.2 Computation of 52k and A610 ................... 143 4.6.3 A Fast Multiscale Radon-Inverse Transform Algorithm . . . . 145 4.7 Second Approach to Multiscale Modeling and Estimation of ECT In- tensities .................................. 148 4.7.1 The CSI—MMI Method ...................... 149 4.7.2 Example .............................. 151 5 Conclusions 154 A Appendix to Chapter 2 158 A.1 Proof of Expression (2.2) ......................... 158 A2 On the Monotonicity of Ip(X; A) .................... 159 A3 An Alternate Interpratation for Precision ................ 161 A4 Only an MA Model has Anomy of Zero ................. 163 A5 Equivalence of Anomy Definitions .................... 163 A6 Proof of Expression (2.35) ........................ 165 viii 1.? A 8 AS C0111 B Appendix Bl Pm"... 8.? 0pm C Appendix Cl The H C2 Pltmfi BIBLIOGRA A7 A Sufficient Condition for the A/ P Accuracy Inequality ........ 167 A8 Coarse-Scale-Data Limited Models ................... 171 B Appendix to Chapter 3 173 B.1 Posterior Distributions .......................... 173 B2 Optimal Estimation of the Multiplicative Innovation .......... 177 C Appendix to Chapter 4 179 C.1 The Hilbert Transform as a Continuous Averaging Process ...... 179 C.2 Proof of Expression (4.24) ........................ 180 BIBLIOGRAPHY 184 ix 3.1 3.2 LIST OF TABLES AMSE results for various test intensities and estimation algorithms. Peak intensity = 8. ............................ 109 AMSE results for various test intensities and estimation algorithms. Peak intensity = 128 ............................ 109 gv :‘c ~r~ lg to . / ._. ._: .___ _. A 1 [Q a.) .L. 0-3 f‘ " . :x ,4 ._ CO no is '1 c F“ _ 'I‘ l r 1 I . .1 1 \' ‘ C O u 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 4.1 4.2 4.3 4.4 4.5 4.6 LIST OF FIGURES The Haar and Daubechies 4-pt scaling and wavelet functions in the interval ................................... 16 M ultiresolution analysis expressed as a sequence of direct sums . . . . 17 Wavelet-based Signal Processing ..................... 21 Time-frequency plane ........................... 22 M ultiscale Signal Analysis ........................ 24 M ultiscale representation of images ................... 26 Components of a Model of a Process .................. 29 The various probability densities defining accuracy .......... 35 Multiscale representation of a 1-d model of size N = 8 ........ 49 Method for obtaining consecutively increasing EPRs p? +1, p; +1, . . . for a Gaussian model ............................. 54 M ultiscale scaling coefficients {CM} ................... 74 Structure of a Hear-based intensity estimator ............. 77 Histogram of perturbation variates (6 = B/A) .............. 78 MMI model interpreted as a probabilistic tree ............. 81 Three component Beta-mixture distribution .............. 84 Perturbation estimate ii as a function of d/c .............. 89 Poisson intensity estimation ....................... 96 Fast Poisson intensity estimation .................... 100 Correlation functions for MMI and SI-MMI 256-point (J = 8) intensity priors .................................... 104 Photon-limited image estimation using MMI models .......... 113 Nuclear medicine image estimation using SI-MMI models ....... 115 Tomographic data collection geometry ................. 119 Projection geometry for the intensity f (x) ............... 121 Shepp—Logan head phantom ....................... 128 Natural Pixels ............................... 129 Shepp-Logan head phantom image reconstruction ........... 133 Pelvic bone study image reconstruction ................. 134 xi i4“ 44“ 141. C 4.7 4.8 4.9 A.1 C.1 Wavelet functions for image reconstruction ............... 141 Tomography reconstruction-formula coeflicients ............ 149 Second pelvic bone study image reconstruction ............. 154 Simplified geometric representation of EPR cubes at consecutive scales 170 2_1r 9-1 Magnitude frequency content of (lfrizégi’gng .............. 183 xii CHA Introd .1 treat Elwin Prism pm w gives I159 TO Il is EHCO‘JIfiPWl 3 and ne‘pp {6115‘ r: of a 1*: WE O‘DSPR‘E (in and wish to e onetime: i ' r 151‘)“ '. A. I , «Lf‘ "Elan. In . PET " n“ ' J1 €4)sd4&¢:/A ‘ “35 A PIJISVTI; i l. P: (OIIlDdf'i T‘ in. ,V 1. » .‘ v. . N' f ‘7'? v t ' ' (5(A..“‘;-.‘ m ,. , . N“); CHAPTER 1 Introduction A great number of important phenomena in science and engineering are modeled as Poisson processes. Often, it is of interest to estimate the underlying intensity which gives rise to these phenomena. The intensity estimation problem of Poisson processes is encountered in many fields including medicine [1], astronomy [2], communications [3], and networks [4]. This dissertation considers the problem of estimating the in- tensity of a general Poisson process from a single observation of the process. That is, we observe counts c which obey1 CIA ~ Poisson()\), (1.1) and wish to estimate the intensity A. The counts 0 and intensity A are typically one-dimensional (1-d) signals or 2-d images, but may be of any other higher finite dimension, in general. For example, the basic photon-limited imaging process is widely regarded as obey- ing a Poisson law. The problem is described as follows. We observe photon emissions in a compact region of the plane. The photon emissions are the result of an underly- ing two-dimensional intensity function. We are interested in estimating the intensity 13M denotes the random vector c given the vector A. function if“ pilt'ciilt iLIIIIli. I liiiiutioii ui . its. due in p. said): Baas posa-fil (ring .. infinity 3 hi this Iii» ma 'lijf .MN‘Il I of signals [39 w ill g this 113111“ is devised. \Vc expression for i and shown :0 UPI " "iii B11725 invariant alzuri {ODEIIICJIII III). .35 a seroziii MC ‘féiimci I311 El 15 an '* * Ml) ' I‘ l" h v‘ _ i \ .- I'LIIIIUIIEII ll am “ UH [(I'Ir 3‘. [I rddjr,é(o:,.. . " ~ -a\‘J I14. c, I = IH'l'v . ».d.1(;‘;e< rr vl., A W Adrift" 0‘19 if: ')i, I“ I." 7-0" Twp,“ ' uh; . , V ~. 1“ .4 (J.iili__ function from the counts of photon detections. Nuclear medicine imaging is one ap— plication that motivates our study of the photon-limited imaging problem. The major limitation of nuclear medicine imaging is the low-count levels acquired in typical stud- ies, due in part to the limited level of radioactive dosage required to insure patient safety. Because of the variability of low-count images, it is very common to employ a post-filtering or estimation procedure to obtain a “better” estimate of the underlying intensity [5]. In this dissertation we introduce a Bayesian approach to Poisson intensity esti- mation based on a multiscale framework. It is shown that multiscale representation of signals provides a very natural and powerful framework for this problem. Us- ing this framework, a novel multiscale Bayesian prior to model intensity functions is devised. We look at the nature of the proposed prior by deriving a closed-form expression for its autocorrelation. Some extensions to the basic model are developed and shown to possess very desirable properties. With these new priors, a simple, optimal Bayesian intensity estimation procedure is developed. A practical fast shift- invariant algorithm for the new estimation framework is also presented and applied to photon-limited data. As a second contribution of this dissertation, we extend the above general modeling and estimation approaches to the emission computed tomography imaging problem. ECT is an important and very active area of research in many fields. For example, in functional neuroimaging, ECT is used to map regions of activity in the brain associated with physical [6] and intellectual [7] tasks; in nuclear waste management, emission tomographic methods may be employed to determine the activity density of radioactive material within cemented barrels [8]; and in electron microscopy, ECT tecbniques make possible 3-d image reconstruction of chromosomes’ structures [9]. One important diagnosis method in nuclear medicine is single-photon emission computed tomography (SPECT). Essentially, the process aims to reconstruct. density maps iiinae daia I‘Iiiit‘t'Ir well-iiiotieieri imaging of 11' levels and. Iiu porianee to no sions of tom 171;: However. maxi other ECT op; This dissvr‘ suspicion p so called noiu the sinogram i Slhograzri inia; the raw projei confine-tion w All llllpigyed I r? . . - maps (images) of radiopharmaceutical distribution within a patient from projection data collected at many angles about the subject. Projection data in SPECT are well-modeled to be the outcome of Poisson processes, and just as in the case of static imaging of nuclear medicine discussed above, they are characterized by low-count levels and, therefore, by low signal-to—noise ratios (SNR). Motivated by its great im- portance to medicine, and by the challenge posed by its low-count nature, our discus- sions of tomography focuses throughout on the SPECT problem of nuclear medicine. However, many results obtained here are directly applicable or easily extendable to other ECT applications. This dissertation presents two new multiscale approaches to the ECT image re- construction problem. The first approach is based on geometrical properties of the so called natural pixels of intensity sinograms as well as on the high structure of the sinogram image. Within this framework, we develop a practical prior model for sinogram images which are used to estimate the underlying sinogram intensity from the raw projection data prior reconstruction. The sinogram estimate is then used in conjunction with the standard filtered-backprojection (FBP) algorithm to produce an improved image reconstruction. We illustrate the superiority of this method over the conventional FBP-based approach using clinical and simulated data. We also introduce a second and more sophisticated approach to ECT based on a new multiscale-based approach to the Radon inverse transform. This new formulation lends itself to a very natural “discretization” of the Radon inverse operator which is especially amenable to numerical computation. It will be seen that in fact, in contrast to Fourier-based numerical reconstruction methods widely used in practice, the new inversion operator requires no discretization itself, and it is only the data that needs to be sampled. Fourier-based numerical reconstruction methods require conflicting approximations to the Radon operator by simultaneously discretizing the frequency and space planes. In these methods, the intensities must be assumed to be of finite with Y y bu;nnlli in Al I \ DTN‘UHI l I ELM lei '3‘]; II M Hill ue 51:1. Ii ‘ h""? ‘ru ILLS: IllM l 'I'v . Lijfis (f .-. .. .) I) ' . I [hit {new 9}- . ‘ . ‘1? {".. \ “‘—AA 7' 'l1." dam“ "3 7 AA A We. »“A “7- ~‘ . \ in u“: i f o l “.3. 3“"? 13' D». ”1’76 ()1; ( support in both space and frequency, and such signals do not exist. The method presented here avoids such conundrum and reconstructs intensity images without the associated potential artifacts. Within the new reconstruction framework, we develop a prior model for the cumu- lative sinogram image (intensity) (C81). The new model is unique in that it exploits the high degree of redundancy of information present. in a sinogram to create a very robust tomographic reconstruction of photon-limited images. We demonstrate the advantage of the proposed method using clinical data. Multiscale representation and analysis of signals have been used in the past as the basic framework in the modeling and estimation of Poisson, Gaussian, and other types of processes [10, 11, 12, 13, 14, 15, 16]. While in general all these approaches have been successful, except for the case of Gaussian processes, a clear justification for their choice has not been offered. In an attempt to explain the advantage of multiscale modeling and estimation in a most general way, we introduce information-theoretic measures that quantify the degree of goodness of models in various aspects. For this, we introduce the concepts of anomy, accuracy, precision, and resolution power. Based on criteria developed with these concepts we gain insight into the advantage of Bayesian estimators within the framework of multiscale representation of processes. The Bayesian approach is shown to give the means for a systematic and maximal use of the available information at each scale of multiscale models. Additionally, we give general guidelines for constructing new multiscale linear transformations which are statistically motivated and applicable to the general Gaussian model. The trans- formations are potentially better suited than conventional time/ frequency multiscale analysis for estimation purposes, and may also be extended to other models as well. 1.1 Organization and Summary of Contributions The main objective of Chapter 2 is to motivate the multiscale “paradigm” as an approach to modeling for Bayesian estimation. Throughout this dissertation the multiscale modeling of processes and the multiscale representation of signals play very important roles as they provide the underlying framework for the Poisson models and estimators presented in Chapters 3 and 4. Therefore, in Section 2.2 we first present an elementary review of wavelets. Much intuition about multiscale modeling and estimation is gained from its understanding. Also, much of the notation use throughout this dissertation is introduced here and in Section 2.3. In addition, Chapter 2 offers the following three main contributions. First, in Section 2.3 we present a new, unifying approach to characterize and qualify models and estimators alike. For this, we introduce four new measures that are more general, and we believe more natural, than the conventional statistical measures often used for this purposes. Second, in Section 2.4 the advantages of modeling and estimating within the multiscale framework in general, and within the multiscale Bayesian ap— proach in particular, are established for a class of processes. Third, we give general guidelines to constructing multiscale models which are statistically motivated, and consequently, are better suited for the estimation problem. These guidelines are de- veloped for Gaussian processes, but they can be easily extended to other processes. lVIuch work needs to be done in this respect, but the criteria set forth here opens a wide range of new and exciting possibilities. In Section 2.5 we conclude the chapter with a brief review of some important existing multiscale and estimation approaches. There are four major contributions in Chapter 3. First, in Section 3.4 we describe a new, multiscale, prior probability model for non-negative intensity functions. This model employs a multiplicative innovations structure in the scale-space domain. Sec- ond, based on this new prior, in Section 3.5 we derive a simple and computationally efficient, Bayesian estimator of the intensity given an observation of counts, under squared error loss. It is shown through examples that the Bayesian estimation proce- dure significantly outperforms existing wavelet-based methods. Third, we extend in Section 3.6.1 the multiscale intensity prior to a shift-invariant one, and develop a fast shift-invariant estimation procedure. Furthermore, we obtain closed-form expressions for the correlation functions of both priors, and show that the correlation behavior of the shift-invariant prior has 1 / f spectral characteristics and is more regular than that of the shift-variant prior. Fourth, in Section 3.8 we apply the framework to photon- limited imaging and examine its potential to improve nuclear medicine imaging. The main contributions in Chapter 4 are three. First, in Section 4.5 we introduce an extension to the prior modeling approach of Chapter 3 which is especially well suited for modeling computed tomography sinograms. The new prior is based on geo- metrical consideration of the inherent structure of sinograms. The excellent match of the prior to real and synthetic data is seen in the examples. Second, in Section 4.6 we present a new, multiscale-based, Radon-inverse transform algorithm. The transform has three major qualities for ECT applications: it admits a very efficient computa- tional implementation; it allows reconstruction of images at any desired resolution supported by the data with the corresponding computational savings; and it provides a very robust reconstruction of images from photon-limited projection data. This last quality is unique to the new method and is illustrated with an example. Third, in Section 4.7 we introduce a third extension to the intensity modeling approach devel- oped for general Poisson processes and apply it to the cumulative sinogram images of emission-computed tomography. This prior is used in conjunction with the new Radon-inverse transform to reconstruct highly reliable images from photon-limited data. The superiority of this modeling, estimation, and reconstruction method is illustrated with clinical data. Finally, some comments and conclusions are given in Chapter 5. CHAPTER 2 The Multiscale Modeling Paradigm 2. 1 Preliminaries In his book “Conceptual Physics” [17], Hewitt identifies a crucial factor that made the evolution of human scientific knowledge possible when he writes: “Science had its beginnings before recorded history when people first discovered recurring relationships around them. Through careful observations of these relationships, they began to know nature and, because of nature’s dependability, found they could make predictions that gave them some control over their surroundings. ” Hewitt’s recurring relationships refer to the observed patterns of cause and effect that appeared to dictate the course of nature in many instances. It is evident that while our ancestors could not make predictions about the exact shape of a flame in a fire, they could always expect the whole of the flame and smoke to rise. Thus, looked at on a large enough scale, the phenomenon could be satisfactorily predicted. Probably one of the greatest successes in science before modern times was achieved in predicting the movement of the planets. In their observations of the skies, the Mayans did not have to contend with erratic or chaotic effects, but only had to discover the patterns manifested at very large scales. It is no coincidence that nowadays scientists still seek to discover patterns in whatever is under study in order to advance their knowledge. After all, if the behavior of artificial neural networks is a hint of how the human brain works, to understand nature means training our brains enough so as to recognize patterns in our environment, for then we may predict its future behavior. As the subjects studied by people became more and more complex, the patterns to be discovered were less evident and more difficult to perceive. Mathematics then became the primary tool in this quest. After the experimentation phase, models were postulated and tested. At first, the mathematical models were deterministic in nature, but as scientists’ interest shifted towards natural phenomena “belonging” to sufficiently small scales, probabilistic models had to be introduced. For example, before the turbulent behavior of the flames in a fire could be understood, statistical thermodynamic models had to be postulated to explain the molecular behavior of gases. Clearly, the new probabilistic models represented more accurate but less precise1 models than their deterministic counterparts. They predicted the outcome of an ex- periment more reliably while providing less detail about such outcome—gas molecules’ “typical” behavior could be predicted very successfully; however, very little could be said about any given molecule’s state, e.g., its location and momentum. The reason for this was that while small scale phenomena were being modeled, the experiments carried out to undercover their hidden patterns were of much larger scales; conse- quently, only the patterns displayed at these larger scales provided information about the underlying phenomena. In the case of gas molecules, only their aggregate behav- ior could be discerned and so, only probabilistic inferences about individual molecules could be made. 1The terms accuracy and precision are used here in a general sense, meaning, respectively, the accordance of an assertion with the truth, and the amount of information conveyed by that assertion. Thus, by forecasting rain over the Pacific Ocean this year, one makes a highly accurate assertion but of very low precision for not much information about the time or place of the event is given, for example. In Section 2.3.2 we give precise meaning to these terms. As advances in technology allowed probing at smaller and smaller scales, the models created became more and more precise. Nevertheless, to maintain accuracy, they necessarily had to be probabilistic in nature, for if we accept that every event in nature has its origins in the smallest of scales, the Heisenberg Principle prevents us from observing the full state (e. g., displacement / momentum, time/ frequency, etc.) of whatever “lives” at those scales, and so, prevents us from producing infinitely accurate deterministic models—never mind that our known deterministic “laws” of nature do not apply at these extremely small scales. Although in practice we are often content with deterministic coarse-scale averaged representations of observed phenomena, the above discussion brings to light the in- tuitive notion that while modeling of phenomena based on coarser scale information alone may be more accurate, it can only be achieved at the expense of precision. Alternatively, coarse-scale models may be constructed more accurately than their fine-scale counterparts, but the coarse-scale models are necessarily less precise. These two statements are not just reworded versions of one another; however, once the information-theoretic definitions for accuracy and precision have been introduced, we will show them to be equivalent. At that point, we will also be able to gain a greater insight as to their interpretation. Their significance is as follows. The first assertion establishes that construction of more precise models requires new information, and that such information may only be found in finer scale patterns. This implies the futility of trying to improve the precision of estimators beyond what the observations’ scale supports. The ability to trade precision for accuracy by modeling different scales of a process, as established by the second assertion, is of great significance to the point estimation problem. We will show that under certain conditions the robustness2 of an estimator 2In the literature (see, for example [18]) robustness connotes the consistency of an estimator’s performance under all possible distributions for the observations, that is, under typical observations can be enhanced by leveraging the estimation process through this trade. The multi- scale Bayesian estimation approach introduced in Chapter 3 will be shown to provide an intrinsic way to achieve this. Given the crucial roll that multiscale representation of signals plays throughout this dissertation, we next give a brief introduction to the topic. To this end, we make use of wavelet theory as it gives a natural perspective of multiscale analysis. We avoid as much as possible the formalities associated with this topic, however, and emphasize a motivational point of View, as this is the goal of the present chapter. 2.2 Multiresolution through Wavelets 2.2.1 The Fourier Transform and Scales In the quest to identify the underlying patterns of phenomena and processes, ex- pressing the signals of interest as a linear combination of more elemental functions has often proved invaluable. With this approach, features in the signals which are uniquely characteristic to the event under study have been brought to view and in this manner have helped to identify the patterns. For most problems encountered in engineering the sets of functions of greatest utility have been those which are complete in L2(R) (or more generally, in L2(lRN))3 since they can represent any finite energy signal that might arise.4 Undoubtedly, the Fourier system has had the greatest of influences in this respect since its discovery in 1807. One reason for this is the orthogonality of the set, which leads to its simplicity and wide range of use. However, it is the fact that the elementary functions are as well as those including outlayers. Here, we use the term to mean a degree of goodness. Later in the chapter we will introduce the concept of resolution power and use it for this purpose. 3For the sake of simplicity, throughout this dissertation we focus our attention on signals on the real line when this suffices to make the desired point. 4Clearly, the fidelity of such a representation is only in the mean-square error sense and so, it is only accurate to within a set of measure zero. We ignore these technicalities for the most part. 10 eigenfunctions to linear translation invariant operators that has mostly contributed to its great extent of applications. The Fourier transform decomposes a function into its frequency components; this is particularly easy to visualized in the case of periodic functions—clearly, not L2 functions. In this case, the components are denumerable and readily identifiable at finite intervals along the frequency axis; thus, admitting a series representation for the signals. For L2 functions, the situation is radically different as no one component is present in the original signal, for if any one were, it would be so only through its manifestation of energy in some form, no matter how small this is, but we know that no energy is born at any given frequency. An alternate view of the Fourier representation is that the transform decomposes a signal into a countable number of scales corresponding to an arbitrary partitioning of the frequency axis. The system in this case is still orthogonal, but the basic elements, or atoms, are now the functions that the Fourier exponentials integrate to within each of the frequency intervals. The advantage of this system from the temporal viewpoint is that L2 signals may now be represented as linear combinations of such atoms, with each atom contributing a finite amount of energy—if present at all, that is. Fiom the frequency perspective, we have that the highly frequency-localized analysis of signals is preserved, although not to the infinite frequency resolution of the original Fourier system. As an example, this scale decomposition may be constructed with a set of Gabor- like functions defined on the frequency line: . 1 _ ' . . j . hj.k(w) E —l(w——i€—O) e’kuow‘fi") for all j, k E Z, (2.1) £0 50 where l() is the indicator function over the {—1/2, 1 / 2) interval, i2 = —1, and uo and {0 are arbitrary parameters. Clearly, the set {lg-thy. partitions the frequency line 11 into adjacent non-overlapping uniform intervals of width £0, and include modulating factors. Letting uoéo = 27r, it is easy to show that the frequency spectrum f of f may be expressed as (see Appendix A.1) fiw) = Zvj+2 ——-)V,-+1 9V, 9 vj-I ——> /® /® /®/ 6-) / Wi+1 w] wj-I W j+2 Figure 2.2. Multiresolution analysis expressed as a sequence of direct sums. Each space V,-._1 encompasses the functions living in V,- and W,-, and all those functions formed by the sum of any two elements in V,- U W,. Since W,- is orthogonal to V,- and, therefore, to every V1 with l 2 j, W, defines the scale of L2(R) functions with features not representable in any V, with l 2 j. by the repeated substitution of the argument t for 2t, and in each iteration, multiply- ing the resulting expression by 21/2—the discrete wavelet reconstruction expression resuks Cj—1,k = Eat—21011 + gk—ude), (2.9) (62 for all j, k in Z. The corresponding discrete wavelet decomposition relations are Cch = Z hz—2ij—1,i (2-10) 162 and dj,k = Zgz-mch—u (2-11) (62 These are obtained by again expressing qt“, and 1?ch in terms of (15,.”c according to the generalized versions of (2.5) and (2.6) in c“, E (f, any.) and d”, E (f, 1b,,k), and Writing the resulting integrals in term of c,_1,k. The coefficients hk and 9,, in (2.9), (2.10) and (2.11) are the wavelet filter coef- ficients, often referred to as quadrature mirror filter (QMF) coefficients within the 17 linear-filter-processing community. It is not difficult to show that the sequences (hk) and (9],) correspond to low-pass and band-pass filters, respectively. Signals’ Finite Representation The fact that we can express the coefficients at one scale in terms of those at the scale immediately “above” or “below” without the need of scaling or wavelet functions is key to the usefulness of the wavelet transform, for otherwise, the wavelet transform would have gone the way of the Fourier transform before the FFT (Fast Fourier Transform) was invented. For the transform to be computationally practical, however, it is also necessary that a function may be representable by only a finite number of coefficients. The following shows how under very mild conditions this is possible. We first rewrite (2.4) as f = E Z (Ext/5.}: + Z Z d,,,,z/;,-,,, + Z Z d,,,,v,,k, (2.12) ngVkeZ .fl O as j —> —00. This implies that if J’ is chosen small enough, the (f — f 1:) term can be ignored maintaining an approximation to f as closely as desired. Doing this in (2.13), and arbitrarily choosing 7 the lowest scale approximation J’ to be scale zero we obtain f ’31 Z Zdj,k¢’j,k +fJ 0’synthosls ——> Figure 2.3. Wavelet-based Signal Processing. A finite-support function f is first projected to a suitable scale space, e.g., V5. The scaling coefficients c = c0 corre- sponding to the projection are DWTed. Signal processing algorithms are employed to generate new wavelet coefficients d from the original. By an IDWT, corresponding scaling coefficients 6 = (:0 are obtained from which the desired final function f may be synthesized. particular, for N = 4, J = 2 and (ho h, 0 0) (ho h1 h2 123) 90 91 0 0 h—2 h—l ho hl O 0 10 go 9.1 92 93 \0 0 0 1} Km 9-1 90 9,) A significant advantage of the DWT in problems involving finite-support functions of continuous variables is that, once the vector of scaling coefficients on has been obtained, one may operate solely on the coefficients (1 resulting from the DWT until the actual desired function is needed; at that point, the new function is synthesized using (2.14) with the newly obtained scaling coefficients following the inverse DWT (IDWT) of the processed coefficients, d. This is the basis for discrete wavelet-based signal processing. A pictorial representation of this idea is shown in Fig. 2.3. Significance of the DWT coefficients Each element of the set {dj,k}k=o,...,~-1 is associated with a region of the time— frequency j=1,...,J plane according to the location where the energy of the corresponding wavelet is mostly concentrated. In particular, wavelet coefficient dJ-Jc corresponds to a region 21 %.>1Hno 1’? Hi lll hoquoncy Hequency (a) (b) J >/ K i A W Figure 2.4. Time-frequency plane. (a) Partition induced by the Haar system, and (b) partition induced by the system (2.3). Nodes and links in (a) display the func- tional relationships among the various wavelet coefficients. The uppermost node represents wavelet coefiicient dyo, and those at the bottom, represent coefficients dog, C10,], - - - , d0,N_.1. See text for an alternate interpretation. centered at (uch, 5M) and of dimensions r,- x 0,, where u”, E f t ijy,(t)|2 dt, {M E fwlifij,k(w)|2 did, and for any k, r, E [(t — Uj‘k)2le‘k(t)|2 dt, and o]- E [(w — e.k)"’Iz/3j,k(w)|2dw. The time-frequency partition induced by the Haar system is shown in Figure 2.4(a), and that induced by the system (2.3) is shown in Figure 2.4(b) for referen’ce. The nodes in (a) represent the wavelet coefficients, and the links represent their functional relationships according to (2.9), (2.10) and (2.11). Another useful interpretation of the tree—like structure superimposed onto the time-frequency plane is that each row of nodes constitutes a projection of the original signal onto a subspace of NOR). Specifically, the upper most node always represents c 1,0, the second row, the (c 1-1,0, CJ_1,1) vector, and so on. The bottom row corre- sponds to the highest resolution representation available, and as was indicated before, is often taken to be the original signal of interest. In this sense, we speak of the j-scale representation a, E (cm, - - - ,cj. N/21_1)T of co, since c, are the scaling coefficients of f, whenever co are of f. Notice that under this alternate interpretation, the signal 22 represented by the coefficients on any given row, i.e., at any scale, of Figure 2.4(a), may possess energy within the entire frequency band extending from 0 Hz (top of the figure) down to the row of coefficients. As an illustration of these two interpretations, the values taken by the coefficients in each row of the time—frequency plane are shown in Figure 2.5. At the bottom of the figure, the scaling coefficients corresponding to scale zero constitute the original signal of interest. Right above them, the scaling coefficients corresponding to scales 1, 2, and 3 are shown along (to their left) one translate of the scaling functions used in obtaining them. Each sample on any row corresponds to a node in Figure 2.4 according to the second interpretation. On the other hand, the values taken by each node conforming to the first View are displayed on the right side of the figure. These are the wavelet coefficients obtained by the inner products between the signal at scale zero and the sequence of translates of wavelets shown in the rightmost column. Both views are valuable depending on what is being sought. Vanishing Moments There are many wavelet systems in existence and each is better suited for a particular set of applications. The systems may be differentiated by a myriad of properties that they may or may not hold. One very important distinguishing property often stated is the degree of regularity of the family of wavelets. For important families of wavelets this may be measured by the number of vanishing moments [21]. A wavelet function w has v E N vanishing moments if 8 ftpw(t)dt=O forp=0,... ,v—1, (2.16) 8Throughout this dissertation we avoid writing the limits of integration as much as possible for simplicity of notation, but every integral indicates a definite integration operation. When the limits are omitted, the region of integration is the entire space where the variable of integration is defined. 23 ¢' k (Cj,k) (dj,k) Wj,k , scale j r" ( .J “v flr fl 2 J1 n ....: :_ «will fl” Figure 2.5. Multiscale Signal Analysis. Scale and Wavelet coefficients (second and third columns, respectively) of the signal at scale 0 (bottom row) are displayed for scales 1, 2, and 3. Columns 1 and 4 give one single translate of a scale function and wavelet used in obtaining the coefiicients. These functions correspond to the unnormalized Haar system, which we review more extensively in Chapter 3. but not for p = v. By a simple change of variable one can easily show that if 212 has v vanishing moments, then each 10,-), has v vanishing moments as well. Therefore, any function f which can be closely approximated by a polynomial of order v — 1 will have all its wavelet coefficients d“. be zero or nearly zero. This situation represents a high degree of compression, for only a set of coarse scale scaling coefficients suffices to represent the function. Beyond parsimonious representations, a sufficiently regular system may be ex— ploited in estimating a signal from within noise. If the true signal is smooth enough, most of the energy in the wavelet coefficients will be from noise, and thus a simple thresholding scheme may remove much of this noise without seriously altering the true signal upon reconstruction. We review some of these wavelet-based estimation 24 procedures in Section 2.5. Images There are various ways to extend wavelet analysis to images or L2(IR2) functions. A simple approach often taken consists of constructing a separable 2-d multiresolution analysis V,2 as the tensor product of 1-d multiresolution counterparts: V]? E V]- ® V, for allj E Z. (2.17) The resulting family of subspaces {VJ-2} of L2(R2) possess 2-d versions of the four properties i—iv given earlier characterizing 1-d multiresolution analyses. Specifically, the scaling property is satisfied by a scaling function defined as ¢j.k1,k2(t1» t2) 5 ¢j.k1(t1)¢j.k2(t2). (2.18) In two dimensions, the detail space W,-2 is the space spanned not by one set of wavelets corresponding to scale j, but by three sets, each associated with a given orientation: horizontal, vertical, and diagonal. These wavelets are constructed as follows. w2k1,k2(tlat2) E ¢j.k1(t1)’l/'Jj,k2(t2) w;k1.k2(t1’ t2) 5 wj.k1(tl)¢j.k2(t2) (2.19) I d wj.k1.k2 (tlv t2) 5 with (tlle,k2(t2). Each set of wavelets defines a subspace of L2(R2) such that the projection of a function f onto it gives the energy (a measure of the amount of detail content) of the function in that orientation. As an illustration, Fig. 2.6 shows the 2-d wavelet decomposition of the standard cameraman picture. On the left is a discretized rendering of the 25 Figure 2.6. Multiscale representation of images. (Left). The standard cameraman image: highest resolution scaling coefficients, i.e., co. (Right). Wavelet representation of the cameraman image: c2 are the scale 2 scaling coefficients, (11 and d2 are the wavelet coefficients at scales 1 and 2 corresponding to the horizontal, vertical, and diagonal orientation according to the h, v, and d designations. original scene; hence, we regard it to constitute the set of scaling coefficients at the finest of scales available, and which we arbitrarily denote as scale zero.9 The wavelet analysis of this image up to scale 2 is shown on the right. The representation is standard: on the upper left, the scaling coefficients corresponding to j = 2 form a coarser representation of the original image; the horizontal, vertical, and diagonal wavelet coefficients dfikl‘kz, diam» and dink: at the first and second scales provide the details of the original image c0 not in c2. In Chapter 3, we introduce a new two-dimensional multiscale representation of images which is derived from the 1-d Haar system but which is not separable. The 9This interpretation in which each pixel in the picture assumes the value of a scaling coefficient (at scale zero, in the present case) is justified if c0)“, E (f, ¢O.k1.k2> 2 f(k1. k2). It may be shown that under proper dilation of the original function f, most wavelet systems. and certainly, all bounded systems of compact support, satisfy this condition whenever f is Lipschitz, i. e., there exist constants c < co and o < a s 1 such that If(t’1,t'2) — fem)! : C | no quantization, n = 0 => quantization to nearest integer no greater than 1:, etc. It is possible to define, however, a more general quantization operation as follows: for any real 3 > O and r 20, y: %[s(x-—r)] +r. 32 take values between 0 and 1. A useful definition is given by Pr 5 I(X;A)/H(A), where H (A), or simply H (A), is the entropy of A, i.e., - f p(A) log p(A) dA. Clearly, Pr may be expressed as a percentage as well. For the rest of this chapter, however, only ensembles with absolutely continuous densities will be explicitly considered. One interesting result stemming from the adopted definition of precision is the fact that the precision P of any given model is always equal to or greater than the precision P5 of any augmented model. That is, if the original model is given by the pair p(xlA) and p(A), then the augmented model is given by the pair p(6 IA) = p(6(x)[A) and p(A). This is a consequence of the Data Processing Inequality theorem well known in information theory, and which establishes that I (X ; A) Z I (6(X ); A), for any transformation 5 [27]. Thus, we have mgr. am) As stated earlier, the model p(xlA) = 60(x—A) is associated with an infinite precision. Inequality (2.22) is reminiscent to the Cramér-Rao inequality, which states that the mean squared error of any unbiased estimator 6(x) of the parameter A is lower bounded by the reciprocal of the Fisher information J (A) of A. Writing this in terms of the reciprocal of the variance, the inequality becomes var-1(6) may be regarded as a second-order measure of precision, but it is clear that the inequality (2.22) is stronger for it involves the entire model in all its moments. 33 Anomy and Accuracy While precision tells the average amount of information between a model’s input and output, it does not give any indication about the utility of the information that either x conveys about A, or that A conveys about x. For instance, it is possible to construct a model in which the output gives much information about its cause (i.e., a very precise model), but which is not all that useful as it is not all that accurate. This would occur, for example, in a model characterized by a multimodal density p(xlA) with very well defined (“sharp”) modes. Figure 2.8 illustrates one such density when A and x are scalars. Assuming a fairly uniform p(A) and that the scenario depicted in Figure 2.8 is representative of the model in general, the integral (2.20) takes on a high value, because the regions on which log gills-l becomes large and positive are more likely than those regions on which it assumes negative values. Thus, the model is highly precise. On average, an outcome .7: places the value of the input A to “within” P bits from its true value. However, since the real line may be partitioned in an infinite number of ways with any given finite number of bits,15 the model does not necessarily specify a region which includes the input A, in which case the accuracy of the model is poor. In the example, the P bits of information conveyed by the realization :1: places A to within one of the five regions where p(rIA) is greater than p(zr), i.e., within the neighborhoods of the modes, none of which include the input A. Therefore, the model, while precise, is inaccurate. This example illustrates a situation where the realizations {13,} are, on average, far removed from their input A despite the fact that $327,123,- —+ A as n increases 15For example, one bit of information may partition the real line into the positive and negative halves—the sign bit—or may discriminate between numbers with a fractional part in the [0, .5) or [.5, 1) intervals—first bit to the right of the binary point—and so on. 34 pp(XI>~) / p(XIX) p(X) l/\' U‘ , A-p A A+p x Figure 2.8. The various probability densities defining accuracy. p(xlA) and p(x) are pdfs associated with the given model, the precision of which is P. pp(x|A) and pp(x) E f pp(x|A)p(A) dA (not shown) are auxiliary pdfs with the characteristic of representing the same precision P, while pp(x|A) is the pdf with the smallest support about A. without bound—Khinchine’s Strong Law of Large Numbers [28]. It is important to note that the degree of accuracy of a model is not an indication of its truthfulness. A model, i.e., the joint density p(x|A)p(A), is always assumed correct given the available information. This is not to say that models of varying degrees of accuracy cannot be constructed for the same process; the alternate models may differ in precision as well. Although for particular input-output pairs of realizations a model may be highly inaccurate, on average a (correct) model will possess some degree of accuracy, and its precision bits of information will be useful in that same measure. If in the example of Figure 2.8, p(x|A) had been unimodal with a very sharp mode at a: = A, the model would have been highly accurate, and all information given by :r would serve to identify the location of A within its neighborhood. This goes to illustrate that accuracy is some measure of the spread of p(xlA) with respect to the most accurate (MA) distribution, which we define to be one that conveys the same 35 amount of information—one that has the same precision—but whose information most effectively identifies the neighborhood of each A responsible for each outcome x. This requirement brings to mind UMVU models, but as was argued earlier, un- biasedness and minimum variance are only first- and second-order measures of the desired property in the model. A most general measure of the “spread” of a distribution is given by its en- tropy. So, we reason that the accuracy .A of a model must be related to the difference H (X IA) — Hp(X IA) of conditional entropies corresponding to the ac- tual conditional density p(xlA) and to the MA distribution, which we denote by pp(x|A).16 Here, H(XIA) E — fp(x|A)p(A)10gp(x|A) dA dx, and Hp(X|A) E - fpp(X|A)p(A) log PplxlAl d’\ dx. The MA distribution pp(:r|A) for the particular input A in the above example is shown in Figure 2.8. In general, pp(x|A) is defined to be a uniform density centered at x = A and of support the N -dimensional cube C (A; p) of sides 2p. We denote it by “p(x — A). This choice is motivated by the fact that of all the probability densities of compact support with a given entropy, the uniform has the smallest of supports, and thus, is most concentrated around its center [29]. To be meaningful, the MA distribution must. convey the same precision as is conveyed by the actual model. Denoting the average mutual information between a process’ input and output under its MA model by Ip(X; A), i.e., Ip(X;A) E [pp(x|A)p(A) logggf—Il—S-A dA dx, (2.23) where pp(x) E fpp(x[A)p(A)dA, we require that the “radius” p of pp(x|A) be such that [p(X; A) = I (X ;A). Clearly, this requirement. makes sense only if [p(X; A) can 16We have departed from the convention of denoting the rv as a subscript for its density function. The subscript p in pp(x]A) denotes only a parameter. 36 be shown to be a one-to—one function of p, for otherwise, p would not be uniquely determined for a given precision of value I (X ; A). In Appendix A.2 we show this to indeed be the case. Although p is not a true radius, thinking of it as such is more descriptive, and helps to better interpret its significance. We shall call it the equivalent precision radius or EPR, for short. Although appealing, measuring the “spread” or “disorder” of the actual model with respect to that of the most accurate model by the difference of their conditional entropies presents a fundamental flaw, which we now illustrate with a simple example. Suppose that p(xlA) also represents a uniform density of support of radius p, but which for every x, is centered so that it does not overlap with p,,(x|A).l7 This situation results in a difference H (X IA) — H p(X IA) = 0, leading us to conclude that the actual and MA model have the same accuracy as well as precision. This is clearly not the desired result, as the actual model is completely inaccurate, for its information is misleading for every outcome x. We now introduce a new measure of the “disorder” of a model which gauges the average degree to which the precision bits fragment the IR” space in determining the probable input sets for each possible outcome. We call this measure the anomy N of a model, and is defined as N E D (pp(XI»\)p(A)l|p(XIA)p(A))a (224) where D(p||q) is the Kullback Leibler distance or relative entropy between the densities l7Strictly, this condition is not realizable if the model is to be correct, however, close approxima- tions may be constructed. 37 p and q, i.e., = _, 0 an D(P($)[[(1(17))—/P( )1 ”(1(1) dw. More specifically, D (WWWIIpp> = fpp\j+ilp(’\j+1) P,- P,- > 17,-“ 79,-“ Aj Aj < Aj+1 Aj+1 Throughout we have confined all discussions to finite length signals and their models, but we believe extensions can easily be made to models of signals with infinite lengths, which appear to represent the simpler case; we do not pursue them here, however. Whether the A/ P conditions are met at any given scale j depends on the form of the model p(ijAj)p(AJ-) and on the particular transformation used to generate the next scale model. Thus, a model p(onA0)p(A0) is an A /P model if a set of orthonor- mal linear transformations {Wj}20 exists such that each of the models p(ijAj)p(AJ-) satisfy the A/ P conditions. The models of interest here all satisfy (1) EIijAJ-I = A], and (2) p(xj+1IAj+1) = p(xj+1IAj) (i.e., Aj+1 is a sufficient statistics for xj+1). If in addition to these condi- tions, we impose the mild restriction that for all scales (3) H (AjIxJ-H) > H (AjIxj) (i.e., that the model be non-trivial), the inequality Pj > Pj+1 of the A/ P conditions 20It is understood from our previous discussions that the set of transformations must be such that A1+1 = . A- ( 9M W] J where Aj+1 is a “coarser scale representation” and with half the length of Aj. When the transfor- mation is a wavelet transformation, 0j+1 is the vector of wavelet coefficients at scale j + l. 49 is trivially satisfied: 791‘ = I(Xj;1\j) = H(A)) - H(AjIXj) > H(Aj) - H(AjIXj+1l = I(Xj+1;AJ-) Z I(Xj+1.Aj+1) = 77341- The second inequality in this expression is due to the Data Processing Inequality theorem, because the mapping A, H Aj+1 is non-invertible (see Section 2.2.2). The idea of non-triviality of a multiscale model is perhaps better understood by considering the idea of a trivial model. We define a trivial model to be one that for any valid scale j, the output x,- conveys no more information about the input A,- than xj+1 does. This concept is analog to that of finite bandwidth signals, where we regard an oversampled signal S, to be a trivial representation of its decimated counterpart 31-“. In this instance, the oversampled signal conveys no more information than sj+1 does, and so, it is also the case that H(tJ-Isj) = H(thsj+1) for any rv tj. Now we can add to the requirements that a model must meet and show that under the augmented set of conditions the inequality Aj < A)“ is also satisfied. Our effort in this direction has only produced conditions which are much less intuitive than the required and sufficient condition JV,- > AG“. This is because, in addition to the already intuitive nature of anomy as a “distance”, this measure is much more encompassing than more traditional ones involving only a finite number of statistical moments. Consequently, it becomes cumbersome trying to characterize general A / P models using conventional statistics. In Appendix A.7, we develop a sufficient condition for the accuracy inequality. We use it in the following example as a starting point to delineate a general approach to finding a set of transformations {Wj} for a Gaussian model such that the A/ P 50 conditions hold. We use the notation of Example 2 of Section 2.3.3, and those of Appendix A.7. Please refer to those sections as needed, especially Figure A.1. Also, recall that IC(A,-; p,-)I = (2p,-)NJ‘ denotes the volume of the cube C(A,; p,) centered at A,-, of sides twice the EPR p,-, and of dimension N,- = N0/2j. Example Condition (A.12) may be simplified by noting that A,-+1(a) 2 0. So, a new (and more stringent) condition for the accuracy condition to hold is log (— ”J ) < / p(A.) {Au-(pi — aim} dA.. (2.38) pj-H IRNJ' Resolution of the integral— f( p(A p)dA, was already obtained in (2. 32): 2. —/ P(’\lej(PldAj =llog{(27re)N1detK,-}+ p_, trKT‘l. 11W 2 6 3 And the integral f p(A,-) A3(p) dA, can similarly be solved for: , 1 P2 _ [1th p(A,)A,-(p)dA,- = —§log{( (27re) N1 detW, K ,WJT} — J——+-l tr rW,-K,- 1W; = —-;—log{( 27re) NidetK- ,} — p—Jé-Z‘th trKj'l. Recall that K, is the covariance matrix of the N, --dimensional Gaussian pdf p(x, IA,). trTK Substituting these two results in (2.38) and letting a,= _ -6—\{— the condition 18 p. log (—J-) < a,- (p? — p3“). Let f(p,-+1) E log (31:1) — a,- (p?+1 — p?), then we have as a condition up...) > 0. (2.39) 51 f (p,+1) is a concave function with one maximum at p,+1 = 1/ M23}. The set of p,+1 > 0 for which (2.39) is satisfied is always an interval, except when p,- = 1/ J23}, where there is no solution at all. If p,- > 1/\/2a—,-, the interval of solutions lies immediately to the left of p,-; and if p,- < 1 / «271,-, the interval of solutions lies immediately to the right of p,. Because p,- and a,- are already known quantities at the time of search for the transform W,- that satisfies condition (2.39), we can readily determine whether we should seek a transform that induces a new EPR that is greater than or less than p,. Let (1 Z Z (N, be the eigenvalues of K ,-, and m1, . .. ,mN, the corresponding eigenvectors normalized to 1, i.e., IIm,II = 1. Since the axes of the Nj-dimensional ellipsoid Q,(z) E zK,-'lzT = 1 are fl}... ,\/Z_N_j [32], the point 1/\/27,- is the square root of the harmonic average of the square of the ellipsoid’s axes, scaled by V3. Ignoring this last factor, this averaging operation heavily favors the smaller axes, and so, if the number of large axes is not disproportionately greater than the number of small ones, it is convenient to begin the search for the desired transform by choosing W,- to be such that p,+1 takes the smallest value possible among the values of the ellipsoid’s axes. By starting with the smallest possible value and incrementing it at each step of the algorithm, we are assured not to miss the interval of solutions. So we shall proceed in this fashion of favoring simplicity over optimality of search. Let us first assume that p,- > 1/\/2—d,. Then, our initial choice is W,” = M? E [m1 , . . . , mNj], which produces the smallest EPR p?+1 possible because p(x,-+1|A,-+1) is obtained by integrating p(x, IA,-) over the space spanned by the first N , / 2 eigenvectors m1, . . . , mN, )2. Again, these correspond to the direction of slowest change of p(x, IA,) [32]. Consequently, as p(x,-IA,-) is projected onto the x,+1-(w,-+1 x x,+1) hyper-plane it becomes the density p(x,+1IA,-+1) with the smallest possible profile. See Figure 2.10 for a depiction of this at one step ahead in the iteration process. 52 C(Aj§ PI) "j+| wj+l Cl)» 1;P~ I) 1+ 1+ C( xj+liej+l ; pj+l) A'jM Q,-(2) = c W,- , ._/ij Figure 2.10. Method for obtaining consecutively increasing EPRs p9+1,p,1.+1, . . . for a Gaussian model. The transform W, gets updated so that the projections of p(x,IA,~) onto the x,+1-(w,-+1 X x,+1) hyper-plane progress from its narrow side to its broadside. The process is depicted at an intermediate step near the beginning of the algorithm. If p,-+1 = p?+1 does not satisfy condition (2.39), we next rotate the axes w,-+1—A,-+1 a small angle in a direction which tends to broaden the projection of the density and make the corresponding EPR bigger. The possibilities of rotation paths increase very rapidly with N,; in fact, when N,- > 2 the number is clearly uncountable. One simple approach, however, is to take W,1 = M}, the first rearrangement of M? corresponding to a reordering of the eigenvalues obtained by interchanging the middle two: C1 CNj/z—i CN,/2+1 CNj/2 CN,/2+2 (N,- Thus, AI} E Imi, , mN,/2—l» mN,/2+1, mpg/2, HIM/2+2, 1 mm]- 53 If (2.39) is satisfied with the presently computed p,+1 = p? +1, we still move one step forward in the algorithm just as if the condition had not been satisfied. The idea is to proceed until (2.39) is satisfied with an EPR sufficiently close to 1/ m, since this will insure a large spread between ./V, and MM,” and therefore, a greater advantage when estimating within the multiscale framework. If (2.39) is once again satisfied with the new EPR p,+1 = p]+1 induced by W,1 = M}, but [pg+1 — 1/\/2—o_,-I < Ip,1-+1 — UM], we stop; otherwise, we proceed using M}, the second rearrangement of A1,. In this case, we interchange the next two columns of M,1 nearest to the center but which have not been exchanged previously. This corresponds to the eigenvalue reordering C1 . . . (NJ/2—2 (NJ/2+2 (NJ/2+1 (NJ/2 (AG/2‘1 (NJ/2+3 ° ° ' (Ni. We proceed in this manner until (2.39) is satisfied and we can no longer reduce the distance between the computed EPR and 1/ \/2—a,. If at the outset p,- < 1/ V271,, the approach is the same as before, but in this case we do not test (2.39) while p,+1 < p,, for we know the solution, if one exists, must lie to the right of p,-. Once a suitable transform W,- is found, the search for W,+1 may begin in a similar fashion. Comments Several points are worth noting here about the searching approach just delineated. First is the fact that some intermediate transforms may be required between two consecutive trial transforms M; and M,“ if p311, “overshoots” the target value. It is easy to envision methods for constructing the sequence of transforms AI; which 21Since we are attempting to satisfy only a sufficient condition, it is clear that the optimal EPR, one for which the difference IN, — .N',+1I is maximum, may be found to be other than 1/ ,/2o,-. 54 represent finer rotation changes at each step than those provided here. One possible way would be, for example, to consider all possible permutations of the eigenvalues of K ,, and construct the transforms M,’ of normalized eigenvectors with the corre- sponding ordering which generates the monotonic rotational increasing sequence from M]? to MJN’I—l = [mN,, mN,_1, ... , m2, m1]. We shall not dwell on this issue as our intention is solely to illustrate the general idea as to how one may find the set of transforms {W,}. ‘ Another important point is that the sequence of transformations obtained in this manner, and which guarantee that N,- > .A/,+1 at all valid scales, may be viewed as a new multiscale transform which is statistically motivated. This is in contrast with wavelet based and other existing multiscale transforms that operate in the time or space domain. Clearly, the possibility exists that the two types of multiscale filtering coincide for a class of models, which we would call scale ergodic models. For this class of models the A/ P conditions may be satisfied through the following mechanism. From the discussions of Section 2.2, we know that the coarser scale representation x,+1 of x, is obtained by a linear transformation that produces elements {x,+1,,-},- that are more highly correlated than the elements {x,,,},- of x, are. Consequently, for the models of interest here, for which E[x,IA,-] = A,, the density function of x,+1IA,+1 tends to be sharper than the one for x,IA,-, concentrating a greater probability mass about its mean. This, in turn, tends to reduce the size of EPRs, leading to a better match between the densities p(x,IA,) and pp, (x,IA,-), and so, increasing the accuracy of the model with scale. On a more general point, determining at the outset whether a model admits trans- formations W, such that (A.12) is satisfied at each scale is, in general, a difficult task, and we believe that each case, or at least, each family of models, needs to be looked at separately. We defer this effort for future work. However, under the assumption that a model does admit such a sequence of transformations, we can use condition (A.12) to 55 aid us in their search in a manner very similar to what we have done here—although, this may entail an intensive iterative process. With knowledge of the sequence of transformations, we can generate the sequence of multiscale representations {x,} of the data x0, which can then be used to exploit the estimation advantage referred to earlier in the section. Coarse-Scale-Data Limited Models Earlier in the chapter we stated that “while modeling of phenomena based on coarser scale information alone may be more accurate, it can only be achieved at the expense of precision.” We shall say that a model that possesses this property satisfies the coarse-scale-data limited model conditions, or CSDL model conditions, for short, and refer to such models as CSDL models. At that time we also asserted the equiva- lence of this phenomenon to the fact that coarse-scale models may be constructed more accurately than their fine-scale counterparts, but always at the expense of pre- cision, which is the defining property of A/ P models. We relegate the proof of this equivalence to Appendix A.8. 2.4.2 Bayesian Multiscale Models and Estimation In the past section we have seen that an advantage exists in estimating an underlying intensity within a multiscale framework whenever the model is an A / P model. How- ever, the advantage is only potential, because whether it is realized or not depends not only on the model itself, but also on the particular estimator used at each scale. To illustrate this we take an extreme example. Suppose the model at hand is indeed an A/ P model, and suppose that within the top-down multiscale leveraging estimation framework we obtain at each scale an estimate A, of A, by randomly choosing a signal among all those that satisfy A,+1 = A N' A W,A,. Here, W, E Iii-21"”), and A,“ denotes the previously obtained estimate for 56 A,+1, which at the coarsest scale J considered we let A; = x J. Clearly, the potential advantage in this case does not, on average, lead to a better estimate of A0 than one would obtain, for example, by simply letting A0 = 1:0. It is not hard to see that under the first estimation scheme one has P0 = 0 and N0 = 1, and thus R0 = 0; while for the second method, Po and N}, are the original values corresponding to the model p(xOIAo) p(Ao), which for practical models give R0 > 0. This example highlights the nontrivial aspect of how to exploit available infor- mation in a multiscale estimation scheme. So, we ask what is the optimal estimator 6(x,) for A, that. most efficiently and systematically exploits knowledge of the esti- mates A,+1, . .. , AJ? Clearly, the answer must satisfy two conditions: the estimator must have the means to extract all relevant information (patterns) from coarser scales, and it must optimally reduce the estimation error. It is well known that Bayesian estimators are optimal in the mean-square error sense, and in fact, under various other reasonable criteria as well, whenever a suitable prior density of the antecedent is available [23]. Furthermore, since a Bayesian-based estimator can make explicit use of the posterior distribution p(A,IA,+1, . . . ,AJ)—the most comprehensive statement of A, ’s dependency on the prior information—they are the most natural and optimal multiscale estimators of choice. A Bayesian multiscale estimator 6 E {6(x,)}22 can be formulated as follows. For the desired estimate A0 = 6(x0) to be optimal, each indiviual estimate23 A, = 6(x,) must also be optimal. The optimal estimate at scale j is the posterior 22Note that each estimator 6 is scale dependent, but we let the argument indicate this. That is, 6(Xj) # 6(Xj+() for z ¢ 0. A 23When referring to the estimate A, = 6(x,), it is clear that x, is taken to be a realization of the corresponding random variable, which we also denote by x,—a clear abuse of notation. Here, however, the terms estimate and estimator are used interchangeably in order to reduce some of the repetitive vocabulary. 57 mean A E [Aj IX,,A,'+1,... ,XJ] =/Ajp()\jIXj,Xj+1,... ,XJ)dAj. (2.40) X.- A Bayesian approach facilitates the solution of (2.40) by expressing it in terms of the known model distribution of x, given A,. Applying Bayes’ theorem to (2.40) and using the equalities p(x,IA,,A,+1, . .. ,AJ) = p(x,IA,) and p(A,IA,-+1, . .. ,AJ) = p(A,IA,+1)24 we obtain the desired form X, = fAj p(ijAjllXAJiAj-H) dAJ (2.41) f p(ijAjlplAjIAj+l) dA, The prior p(A,IA,+1) can be obtained from the distribution p(A,IA,+1) allowing for uncertainty in the deviation of A,+1 with respect to A,+1. If the data x,, and therefore, x,+1, . . . ,XJ, are determined to be reliable (accurate), then p(A,IA,+1) can be taken to be well approximated by PJyIA,+1(’\jIXj+1l- Since for A/ P models, the accuracy of the data increases with scale, the approximation is increasingly better with scale as well. This represents an important advantage of the multiscale estimation framework: While Bayesian estimation facilitates the realization of the multiscale leveraging es- timation advantage by fully exploiting all available information in the scales, a mul- tiscale model enhances the reliability of the Bayesian estimator by virtue of the more accurate data at those higher scales—the A/ P property. Another benefit of the multiscale description of a Bayesian estimator is that the formulation of realistic priors {p(A,IA,+1)} is often much simpler than devising the 24Clearly, p(A,IA,-+1,A,+1,... ,AJ) = p(A,IA,+1), and assuming the optimality of the estimate Aj+1, it is alsci reasonable to assume p(A,+1|A,+1,...,AJ) = Ap(A,+1IA,+1). Therefore, p(A,IA,-+1,... ,AJ) = f£(AjIAj+1,A,+1,... ,AJ)p(Aj+1IAj+1,... ,AJ)dAj+1 = fp(Aj|/\j+1)P(Aj+1|X3-+1) dA,+1 = MMI/‘2'“)- 58 prior p(AO) alone. In the next chapter, we elaborate this point further, and show how to construct a practical prior especially suited for general Poisson processes. Also, within the multiscale Bayesian approach, it is sometimes possible to formu- late the problem such that the inferences can be computed very efficiently. This is particularly true if the prior is chosen to be conjugate to the conditional density of the model [23]. The estimator developed in the next chapter is an excellent example of this. Resolution Power of Multiscale Bayesian Models We now consider the precisions and accuracies of a model augmented by a Bayesian estimator outside and within the multiscale framework. We shall refer to the resulting models as the standard and multiscale augmented models, respectively. With these formulations at hand, we can gain some insight into the mechanics of the multiscale Bayesian approach that lead to its resolution power advantage. For simplicity, we shall assume the estimators to be invertible, which is certainly the case for the Poisson processes addressed in Chapter 3. One immediate consequence of this assumption is that the precision of either augmented model is unchanged by the estimators, and that these precisions corresponding to the standard and multiscale augmented models are the same. Since our interest ultimately lies in whatever occurs at the finest of scales, we restrict the following derivations and discussions to scale j = 0, but all the results apply equally to every other scale. This is important because the benefits of the multiscale approach can only be guaranteed if every intermediate scale enjoys similar benefits. We first consider the standard augmented model. Let (0 be a Bayesian estimator for A0 without the benefit of any conditioning 59 from higher scales: _ p(XolA ) co = Elele] = f me.) (7015-) dA.. (242) We recognize the factor 1%]? as the ratio in the mutual information log W of x0 and A0 at the particular realizations at which the ratio is evaluated. In general, as we know, this ratio is a measure of the dependence between the two random variables: when it takes values near one, it indicates highly independent entities; when it takes large values, it indicates that the variables are highly dependent; and when the ratio is very small, it indicates near mutual exclusiveness. In our applications we do not encounter this third case as we assume x to be positively correlated to A. Inspecting (2.42), we see that the estimator produces an average based almost exclusively on the information provided by the prior when it deems that the rv x0 conveys very little information regarding A0. On the contrary, when the dependence between the variables is high, W will de-emphasize those values of A0 which are far from the regions that make the ratio large so that the computed average corresponds to the average of those regions that favor the particular outcome x0. This is the general mechanics of a Bayesian estimator, and gives insight into its powerful approach to producing an estimate. Now, the mechanics that makes a multiscale-based Bayesian estimator superior to its standard version can be seen by computing the difference of the anomie corresponding to their augmented models: Nc. — M. = /Ppo(C0|AO)P(Ao)10gM9|—Agl dcodAo PlCoIAO) p90 (60IA0) — 6 A A 1 d6 dA, fpml oI 0)P( 0) 0g p(5oIAo) o o where 60 denotes the estimator (2.41) for j = 0, and C0 is the estimator (2.42). Note that the EPRs associated with both of these estimators is the same. This is 60 a consequence of the invariance of the precision under the standard and multiscale model augmentation scheme by an invertible estimator. The above difference reduces to _ — , 210134191 N20 Mo - /Ppo(00I)\0)P(A0)108 PCOWOIAO) €100 €130, where do is a dummy variable. The ratio of densities may be interpreted in accordance with the definitions of the estimators 60 and C0: P60(0’0I)\0) PlEleIXoiAllIAo) PCOWOIAO) p(ElAOIXOIIAO) It is intuitive that this ratio is non-increasing as a function of the distance [00 — A0]. That is, while the probability of finding the estimate EIAono, A1] near the true value A0 is greater than the probability of finding EIAOIxo] in the same region, the prob- ability of E[A0Ixo,A1] decreases more rapidly than that of EIAOIxo] as their differ- ence to A0 increases, ie, p(EIAon0,A1]IA0) is more “concentrated” about A0 than p(EIAon0]IA0) is. Assuming this behavior yields . 1 P6o(0'ol)\0) NCO MO > /p(A0) 3:12) (2t)N0 ./C(Ao;t)10g pC0(00IA0) duo dAOv which in terms of the density of XOIAO may be written as . 1 p(XoIAoV |J60(Xo)| o ’No / A l -——-—/ l d dA , NC 5 > M O)t-l-gb(2t)N° 00.0,.) 0gp(Xole)/|Jc,(3 lim——/ lo ——dx, (2.43) <° 5° ~00 No a...) glam” ° 61 For simplicity, we consider the 2—d case. We let (A1, 01)T = WAO and ($1,w1)T = on, where W is a linear orthonormal transformation. Then, due to the linearity of the expectation operator IaE[(A1.61)Tlx1.uu] laAifxmvl) 691(1'1,w1) _ 511(11401) aéllxlvwfll [JC0(X0)I _ _ , 6(1'1.w1) 7 . _ 6171 6101 aw) 82:1 IJ50(X0)I aEIIA1.01)TJI1.W1.A1I aAifl‘MUiJu) 891(11,w1,A1) 331(1‘iyw1J1) 891(11.w1,A1) . I - I 6(1‘13111) 62:1 3101 8101 6101 Typically, the dependency of the estimate Al on the crossterm w, and the dependency of the estimate 51 on the crossterm $1 are weak compared to the dependencies on 2:1 and wl, respectively. Therefore, the behavior of (2.43) may be inferred from the approximation 6} , 69 . Hmong I 12:1“) ‘22.”)! IJ50(X0)I l3A1($1.w1.A1) 691(x1.w1.A1) I. 6:131 6w1 This ratio is, at least on average, greater than one, for the functional dependency of A1 on A1, and that of 51 on 61, makes the denominator less sensitive to changes of 3:1 and w. A more rigorous argument can be formulated, but we leave this effort for future work. Ftom (2.43), we conclude that M, < .A/Co, and so, A50 > AC0. Since the precision associated with the standard and multiscale formulation of the Bayesian mode is the same, we conclude that the multiscale-based Bayesian estimator achieves a greater resolution power than the traditional Bayesian estimator: R50 > RC0- Although some of the derivations in this section have not been carried out with all the desired rigour, they lead to plausible results which cast light on the advantage of the multiscale-based Bayesian estimation framework over the traditional Bayesian approach. The main objective of this chapter up to this point has been to promote this approach by establishing, at least in its beginning form, a foundation under which models and estimators alike can be studied under a common all-comprehensive set of criteria that applies equally to multiscale and traditional frameworks. These criteria should not only become useful in comparing various estimation approaches, 62 but should also become the basis for designing guidelines for new estimators. The new characterization of models and estimators also reflects our attempt to put forward a clearer view of the interplay between scale (space / frequency) resolution and information resolution (resolution power). Their relation depends on the specific transformations used in creating the various scale models from the original high- resolution model, and is reflected in the accuracies and precisions attained at those scales. 2.5 Other Multiscale Modeling and Estimation Approaches In this section we briefly review some other important multiscale modeling and es- timation approaches. These are the threshold smoothing methods: Hard and Soft thresholding, Universal, and the SureShrink Method; Cross- Validation Method; and False Discovery Rate. This list is not all exhaustive, but in conjunction with the multiscale Bayesian approach, they represent the most important classical methods. Their importance derives not only from their wide use in practice, but also from the fact that they represent estimation paradigms from which many other methods have later derived. 2.5.1 Threshold Smoothing Methods The standard model used to represent the input-output phenomena of a very large 25' class of processes is x = A + 17, (2.44) 25Clearly, this model is not as general as a Bayesian-based model. 63 where x is the “noisy” data, A the underlying intensity, and where n represents additive noise [33]. Within the thresholding methods, estimation of the intensity from the data occurs in the frequency-space (wavelet) domain, where the intensity of typical real-world signals can be projected onto a relatively small subspace compared to that occupied by the noise’s projection. The segregation of the signals’ energies is what makes their separation from noise possible. The general approach consists of the following steps: Wavelet transform the data, reduce the smaller wavelet coefficients to zero ac- cording to some thresholding rule and, inverse transform the coefficients to recover an estimate of the intensity. The samples of signals of interest (the elements of A) are typically highly cor- related. This induces a high structure among the wavelet coefficients containing significant signal energy. Specifically, they exhibit the properties of clustering and persistence across scales [34]. Clustering is the property of coefficients tending to take values in the order of those of its neighbors; and persistence across scales means that large/ small values of wavelet coefficients tend to propagate across scales. These phenomena will be illustrated in Chapter 3. A consequence of this high structure is the sparseness in the wavelet domain representation of signals, which is to say that only a relatively few coefficients convey the signal’s features. Meanwhile, noise samples (the elements of 11) are often well-modeled as being in- dependent, and therefore, the noise displays a “flat” distribution across the frequency spectrum where the intensity lives. This causes the noise energy to become distributed among the wavelet coefficients of a much larger subspace than that occupied by the signal. Consequently, large wavelet coefficients are associated with signal energy, and small coefficients with noise contribution. Thus, reducing the smaller coefficients to zero effectively removes noise from the intensity, with only a minor smoothing effect on the reconstructed intensity. 64 The literature on thresholding schemes often assumes the noise to be independent of the intensity, and much of that work also assumes it to be Gaussian distributed I12, 33]. These two points are to be found in high contrast with the methods devel- oped later on in this dissertation. These assumptions greatly simplify the estimation problem because the wavelet transform, being a linear orthogonal operator, sustains the two assumptions over the transformation. These common assumptions are made in the following descriptions. Hard and Soft Thresholding Let W denote the DWT. Applying it to (2.44) w = 0 + er (2.45) results. Here, w E Wx and 0 E WA, the wavelet coefficients of the data and of the signal (see (2.15)). For any given threshold r > 0 and an element wk of w, there are two standard ways of modifying the coefficient. These are the hard and soft thresholding. Hard thresholding produces a new wavelet coefficient according to the rule wk if [wk] 2 r E’) a- ll 0 otherwise. Soft thresholding uses a rule that modifies every coefficient, even those with strong signal to noise ratios. The thresholding condition is wk—r ifw;c >r wk = 0 if [1ka S 7 wk +r if wk < -r 65 When applied to the denoising of images, soft thresholding is claimed to give more pleasing estimates than the hard thresholding [35]; however, this can be due to over- smoothing effects, that often has this visual appeal but which in fact may repre- sent a degraded estimate under most conventional error measures. The smoothing phenomenon will be illustrated in Chapter 4 within the context of tomographic im- age reconstruction. For a more in-depth treatment of thresholding techniques, see [10,12,35,36] Universal Method This method can take the form of a hard or soft thresholding rule. In the latter case, Donoho & Johnson [35], the developers of the method, call it VisuShrink. This is because the method usually oversmooths the noisy signal, which as we noted before, tends to produce visually appealing estimates. A more accurate method is also given in [35], which is a minimax thresholding method that is optimal in terms of L2 risk. Both universal and minimax based methods are global thresholding approaches because the chosen threshold is applied to all the wavelet coefficients. In practice, however, it is typical to threshold only coefficients at the finest resolution (i.e., scale j = 0), since coarser-scale features of the data tend to belong to the original intensity signal. The distinguishing characteristic of the universal method is the basis on which the threshold is computed. Assuming the noise or error term for each element of w is i.i.d. normally distributed like N (0, 02), the elements of W11 are also distributed according to N (0, 02). In this case, the simplest of the universal thresholds is calculated to be 7,. = m. This threshold insures that as N increases, the probability that all noise coeffi- cients get “rooted out” tends to one. Clearly, because the threshold TN is derived from asymptotic arguments, the universal method does not perform well for signals 66 of short lengths [33]. SureShrink Method In contrast to global thresholding approaches, the SureShrink method is a data- dependant selection procedure. This very popular method, also introduced by Donoho & Johnstone [12], finds thresholdings 1', at each scale j so that the L2 risk of the estimator A for A will be small. Using the equality EIATA] = EIOTBI, the risk can be expressed in the wavelet domain: N-l A 1 A A 120., x) a E If); 2W — 1,)2] oc E [239“ — 9,02] . (2.46) k=0 M The wavelet coefficient estimates {6/l,’k}k are obtained from the data wavelets {w,),}jc by a soft thresholding rule with threshold r,. Using Stein’s Unbiased Risk Estimator (SURE) defined as N—l N—l SURE(r,;w,) E N + (7',2 - 2) 21(ij6 S Tj) + 2 wire I(wj.k > 73'), k=0 k=0 where I() stands for the indicator function and w, E (w,),)k, an unbiased estimate of the risk of the wavelet coefficient estimates {fi,,k}k can be obtained. Then, with the set of thresholds {1,} such that for each j, r, = arg min T>0 SURE(1'; w,), we can expect that the risk R(A, A) be asymptotically minimized, because due to the Law of Large Numbers, the SURE criterion is asymptotically close to the true risk [33]. Cross-Validation Method The general method of cross-validation (CV) was initially adapted to wavelet regres- sion by Weyrich & Warhola [37] and N ason [38]. Other important related work have followed since then (see for example [39, 40].) Similarly to the SureShrink method, 67 the multiscale CV approach aims to minimize the risk (2.46). Here, however, a global threshold r is chosen such that the average risk corresponding to two different es- timates for A is minimized. The first of these estimates, AOdd, is obtained by soft thresholding the data wavelets corresponding to the odd samples of x; and the sec- even 0nd estimate, A , is similarly computed from the wavelets associated with the even elements. The threshold used is one which minimizes 1 MW) E fl {2(Xpdd _ xix/emf + 2(Xfx'en _ x?dd)2} , i where 135“" and (Bi-”dd are interpolated samples of x required to align the “data” samples to the even and odd estimates. Essentially, the CV approach seeks to obtain the threshold value that reduces the squared error incurred when predicting half of the data samples with estimates that are based solely on the other half of the data elements. Clearly, due to the required interpolation of the data, the separation between the portion of the data used in estimating and that used in testing—that is, used in computing the associated risk— is not complete in the case of the multiscale approach. False Discovery Rate The False Discovery Rate (FDR) approach to computing the required threshold is due to Abramovich & Benjamini [41]. They structure the problem in terms of a multiple hypothesis test. There are N — 1 null hypotheses H0 : 6,), = 0, which are regarded to be highly probable on average. The aim of the approach is to reduce the erroneous possibility that a coefficient satisfying the null hypothesis is included in the reconstruction of the signal estimate, or that a coefficient not satisfying H0 is included with the wrong sign. Thus, if R is the number of coefficients that are included in the reconstruction (erroneously or not), and Q the number of coefficients incorrectly 68 included, then Abramovich & Benjamini attempt to include as many coefficients as possible maintaining the expected value of Q / R below a user-specified value. Comparison between this and the VisuShrink method shows that the FDR ap- proach performs better for signals that include some abrupt changes, while the VisuShrink method is superior when the function space only includes smooth in- tensities [41]. 69 CHAPTER 3 Multiscale Modeling and Estimation of Poisson Processes In this chapter, we present and analyze a new multiscale Bayesian framework for the modeling of Poisson processes. In the previous chapter, we motivated this approach for arbitrary processes by showing that multiscale representation of signals makes possible the utilization of all available information in the signal. These results, how- ever, did not specifically show the way to model any particular process in order to gain the multiscale advantage. There are always many ways of achieving this even within the multiscale framework, but not all lead to simple and practical models. The approach introduced here will be shown to provide a very powerful and natural frame- work to studying a wide variety of Poisson processes. The new framework makes full use of the Poisson probability model and enables the incorporation of realistic prior information into the estimation process. We will also show how it can be applied to photon-limited imaging. 70 3. 1 Preliminaries The problem of estimating the intensity A of a general Poisson process from a single observation1 c of the process has been studied in great depth. For example, many earlier approaches to Poisson intensity estimation were based on the idea of modeling the variability of the process by Gaussian fluctuations with non-stationary charac- teristics, e. g., [42, 43]. Recently, simple wavelet-based approaches to this problem make use of the square-root of the counts (a variance stabilizing transformation that makes the data approximately Gaussian) and then apply standard wavelet thresh- olding techniques for Gaussian noise removal [12]. More sophisticated wavelet-based estimation procedures attempt to deal with the Poisson statistics directly. Kolaczyk has developed a wavelet-based thresholding scheme for the estimation of a special class of Poisson processes termed “burst-like” processes [14]. The burst-like Poisson process is characterized by a homogeneous, low intensity background with spatially isolated bursts of high intensity, and is motivated by problems in astronomical imag- ing. Nowak and Baraniuk propose a wavelet-based method for the estimation of more general Poisson intensities in [44] using the cross-validation estimator developed in [40]. This method is applied to nuclear medicine image estimation in [45]. Both meth- ods [14, 44] can provide satisfactory results in certain situations. However, neither method adopts a Bayesian perspective, and hence they do not explicitly make use of prior information that may be available. As we noted at the end of last chapter, several wavelet-based Bayesian estimation procedures have been proposed for Gaus- sian data, e. g., [46, 34, 11, 15, 47], however, such methods are not applicable to the Poisson problem considered here. 1For completeness, in Section 3.5.4 we address the case of multiple observations of the same and related processes, but our main thrust throughout will be the single-observation case. 71 3.2 Notation To simplify the presentation we work with one-dimensional intensity functions in the interval [0, 1]. In a later section, we extend the modeling and estimation approaches to two-dimensional problems. Furthermore, we assume that the intensity function is discretized so that A is represented as a vector of length N with elements (Ak)£:1 0 . The counts ck are the elements of the vector c, also of length N. We follow the wavelet representation and notation introduced in Section 2.2.2 throughout. Specifically, let c0 be the data sequence of counts c of length N, where N is assumed to be a positive integer power of two, and let c0), be its kt” element. As before, the subscript 0 denotes the finest scale (resolution) of analysis. Similarly, let A0 be the finest resolution representation of the intensity sequence A, i.e., A0 = A, also of length N. Then, COIAO ~ Poisson(A0), (3.1) and the objective is to estimate A0 from the observation co. From Figures 2.1-(a) and -(b), and expressions (2.10) and (2.11) the wavelet filter coefficients corresponding to the Haar system are (h0,h1) = (fi, 53) and (go, 91) = (715,—71,,,-). For reasons soon to be addressed, it is convenient to work with the unnormalized Haar system. The filter coefficients corresponding to this system are simply (ho,h1) = (1,1) and (go, gl) 2 (1, —1). Therefore, a multiscale analysis of CD can be obtained by iterating Cch = Cj—1,2k+Cj—l,2k+1s (3.2) dik = Cj—l,2k"Cj—l.2k+1i (3.3) for j =1,...,J and k = 0, . . .,N/2j — 1, and J = log2(N). As before, J denotes the 72 Figure 3.1. Multiscale scaling coefficients {CM}. At the top, we have scaling co— efiicients at the coarsest resolution. At the bottom, we have the finest resolution, expressed by the data themselves. The connecting segments illustrate the functional dependencies among the various scaling coefficients c,,k according to expression (3.2). coarsest scale of analysis, and CM. and d,,,c denote the scaling and wavelet coefficients of the data, respectively, at scale j and position (shift) k. The scaling coefficients c, = (c,,k),1c”=/(2,j-l represent a lower resolution representation of the data c,_1. The “detail” information in c,_1, which is absent in c,, is conveyed by the sequence of wavelet coefficients (1, = (d,,k)£f=/ 31"]. Using (2.9), c,_1 can be perfectly reconstructed from c, and d,. Figure 3.1 is an annotated version of tree-structured representation of Figure 2.4-(a), which shows the functional dependencies among the various scaling coefficients of a sequence c of length N = 8. Similarly, as for Co, we define the scaling coefficients A,.), and the wavelet coeffi- cients 0,), of the intensity function A0: Aik = ’\j—1.2k+)\j—1,2k+la (3.4) 011k = Aj—1.2k — Aj—1,2k+1- (3.5) When the intensity of interest is of a discrete nature, A0 is simply the sequence A 73 itself. If, on the contrary, the intensity signal is a function of a continuous variable t 6 [0,1], say A(t), then A0 corresponds to the sequence of scaling coefficients at scale 0. That is, in accordance with the definition of the unnormalized Haar wavelet transform 011 the interval, A01: = (A, 0501:) = if; 1)/ N A(t) dt, and more generally, 21(k+1)/N A311- : (A, 2j/2¢j.k> = / A(t) dt (3-6) 22 k/N 3.3 Why the Unnormalized Haar Transform? Multiscale analysis based on the unnormalized version of the Haar transform has the unique property that every scaling coefficient is the sum of two finer-scale scaling coefficients, and consequently, due to the reproducing property of the Poisson dis- tribution,2 every scaling coefficient is Poisson distributed. Furthermore, it is well known that given two Poisson variates, c1IA1 ~ Poisson(A1) and CQIAQ ~ Poisson(A2), the conditional distribution of Cl given A1, A2, and the sum c1 + c2 is binomial [48]. This reveals a very simple “parent-child” relationship between the scaling coefficients across scales. In Section 3.5, these facts are crucial in the development of the proposed intensity estimator. Similar attributes (reproducibility and simple parent-child rela- tionship) do not hold for more general multiscale analyses of Poisson processes based on other wavelet systems. This is in marked contrast to the Gaussian case, in which such attributes hold for a wide variety of wavelet analyses (including all orthogonal wavelet systems). In short, multiscale transforms of Poisson processes other than the unnormalized Haar transform are much more difficult to analyze and process. The natural match between the unnormalized Haar transform and the Poisson process is the primary motivation for choosing it. The use of the unnormalized Haar wavelet transform to carry out the multiscale 2c,|A, ~ Poisson(A,~), c,IA,~ independent => 2 c,-I Z: A,- ~ PoissonQ: A,). 74 analysis of the data has additional benefits springing from the following points. Pois- son processes result from counting independent events occurring in disjoint regions of time or space of equal size. In such cases, the unnormalized Haar scaling coefficients correspond exactly to these counts occurring at intervals of sizes varying according to scale. Thus, the scaling and wavelet coefficient have a very natural interpretation according to (3.6) and (3.3). The Haar basis functions also have the property of being completely localized in space. By this we mean that at each scale, scaling functions, as well as wavelets, do not overlap. Therefore, at each scale, scaling coefficients are conditionally independent, that is, P(CjI’\j) = lecj,kl)‘j.k)- (3-7) k Also, Poisson processes are inherently nonnegative; therefore, a good estimator should always produce intensity estimates that are either positive or zero. An estimator based on the Haar system may be designed with this quality. 3.4 A New Probability Model for Intensity Images 3.4.1 Multiscale Signal Model Framework To formulate a Bayesian estimator for this problem, we must first propose a prior probability model for the unknown intensity A. The observed data c is regarded as the realization of a Poisson process spawned by the unknown realization A of some random sequence with prior density p(A). In last chapter’s terms, A and c are respectively the cause and effect of the process completely determined by p(cIA)p(A), where p(cIA) is the Poisson probability mass distribution. Just as we did in the last chapter for general processes, we can formulate the present model within a multiscale framework, and similarly arrive at the optimal 75 intensity model [a multiscale ] Haar multiscale Inverse - estimated transform processing Transform Figure 3.2. Structure of a Haar-based intensity estimator. estimator (see (2.41)) x, _ fAjp(c,~lA,)p(A,|A,-+1)dAj (3 8) J " A ' ' fP(Cj|’\j)P(AjI’\j+1)dAj This Bayes estimator poses two interrelated problems. First, the specification of a meaningful and useful prior p(A,IA,+1), which we deem to be an excellent approx- imate for p(A,IA,+1) (see Section 2.4.2). Second, the numerical computation of the estimator. The role the above quantities play in the estimation process is illustrated in Figure 3.2. The remainder of this section describes a new prior probability model for the Haar scaling and wavelet coefficients of a non-negative intensity that leads to a very simple specification of p(A,IA,+1). In Section 3.5, we derive an efficient algorithm for computing the optimal estimator (3.8). There are two important reasons for adopting a multiscale approach to this prob- lem: 0 Prior models that are mathematically tractable, computationally practical, and empirically supported can be specified very naturally. e Poisson data is much more reliable (accurate) at coarse scales than at fine reso- 76 L —1 -oi5 o 0.5 1 —1 —o.5 o 0.5 1 coefficient value coefficient value (a) (b) v , 1 f .4 .M -1 -0.5 0 0.5 1 -1 —0.5 O 0.5 1 coefficient value coefficient value (0) (d) Figure 3.3. Histogram of perturbation variates (6 = 9/A) for scale (a) j = 1, (b) j = 2, (c)j = 3, and (d) j = 4 of the cameraman image of Figure 3.10(a). The general invariance of the distributions’ structure across scales illustrates a self-similar property of real-world image statistics. lutions (higher counts => higher signal-to—noise ratio). Therefore, more reliable coarse-scale estimates can be leveraged to improve high resolution estimators.3 The first point is due partly to the fact that multiscale decompositions of real- world intensities are often statistically self-similar. By this we mean the property that the various scale representations preserve the major features and characteristics 3The good match between the Poisson and Gaussian distributions that occurs at high counts suggests a similar positive relation between the Poisson models’ anomies and corresponding signal- to—noise ratios as seen to exists in the Gaussian case of Example 3 of Section 2.3.3. 77 of the original object, except for the usual gradual loss of resolution. In particu- lar, it has been widely recognized that the distribution of the wavelet coefficients of real-world signals tend to be similar at all scales of analysis, and are usually con- centrated around the origin and unimodal [34]. The self-similarity captured by the Haar multiscale analysis is illustrated by considering the distributions of the wavelet coefficients at various scales. Figure 3.3 illustrates this phenomena. The histograms in this figure correspond to the wavelet coefficients 4 at scales 1, 2, 3, and 4 for the cameraman image of Figure 2.6. The similarity between these distributions facilitates the specification of Bayesian prior for the intensity in a very natural way. The second point above motivates an estimation process that evolves from coarse to fine scales (See Chapter 2). It is easily verified that the signal-to—noise ratio (SNR) in a Poisson process increases linearly with the underlying intensity (signal). Thus, according to (3.2), cm = 2;? Cox, and so the signal—to-noise ratio at scale J is 2" times as large as that for the average data point 00*. For example, for a 128 by 128 pixel image, this represents a SN R improvement of 42 dB. 3.4.2 Multiscale Multiplicative Innovations Model We now describe a new Haar-based probability model for the intensity. Let AM and 6,,c denote the random variables corresponding to the j, k-th scaling and wavelet coefficient of the intensity, respectively. At the coarsest scale j = J, the single scaling coefficient Aw has a density with support on IV. In this work, we choose the gamma density, since it is especially easy to use in conjunction with the Poisson mass function, and because it provides a reasonable mechanism for incorporating prior knowledge of the intensity range. However, as noted earlier, the SNR in the count 0,1,0 is typically 4More precisely, here we are plotting the histogram of the ratio of the wavelet coefficient relative to the corresponding scaling coefficient. That is, each wavelet coefficient is divided by the corresponding scaling coefficient at the same scale and position. If the scaling coefficient is zero, the operation maps to zero. The motivation for this ratio will become apparent in the following sections. 78 very high, and therefore any reasonable prior with support on IR+ will not significantly influence the estimation of A $0.5 Next, introduce statistically independent perturbation variables {6,1,} and model the wavelet coefficients by 911k = AM 5j.k- (39) Each wavelet coefficient is modeled as an independent perturbation of its correspond- ing scaling coefficient. Furthermore, the perturbations at all scales and positions are assumed to be mutually independent. Applying recursions (3.4) and (3.5) to these coefficients, we find that A,.ljc = %(A,,W2] + (-1)’° 6,,[k/21), where I] stands for the integer part of the argument. To gain some insight into this model, consider the random variable y,,;c defined by (1 + 6”). (3.10) NIH 35.1.». E The variable y,,;c can be viewed as the canonical multiscale parameter for Poisson processes because of the following parent-child relationship. It is well known that given two Poisson variates c1 and c2 such that c1IA1 ~ Poisson(A1) and c2IA2 ~ Poisson(A2), the conditional distribution of c1 given the sum c1 + c2 is binomial with parameter y = 13% [48]. In the context of our multiscale analysis, this special property implies a very simple parent-child relationship. Specifically, the conditional distribution of child c,_1,2)c given the parent CM = c,-1,2k + c,_1,2k+1 is binomial with parameter y,,,. This relationship demonstrates the fundamental role of y”, in the multiscale analysis of Poisson processes. 5In fact, in practice we often use the estimate Am 2 cm. 79 “/1, ‘~ / \ A'1.o,Q\ \AAH A12 K Ana ,/ \ , /’ . . / ‘ 1— \ 1- \1— ‘ 1— Ym/ \. Y1.o Y1,1/ \\ Ym Y1,2/ \ Y1,2 Y1.3/ \ Y1,3 / \ / \ / \ / \ O O 6 D O O O O 10.0 A'0.1 A'0.2 A'0.3 )‘OA A0.5 A'0.6 A0.7 Figure 3.4. MMI model interpreted as a probabilistic tree. The MMI model can be viewed as a tree-structured probability model in which the intensity A,,;c at coarse scale j is refined (split) via the multiplicative innovation y,,,c to obtain two new intensities A,.”,c and A,-”),H at the next finer scale of analysis j — 1. The innovations variates {y,,,} are mutually independent at all scales j and positions k. Using y“, in conjunction with (3.9) we have )‘j—lflk = Achijm (3.11) /\j-1,2k+1 = )‘j,k(1"yj.kl' (3-12) We can interpret these refinements as a multiscale innovations structure, with the innovations y,,,c and 1 --y,,;c entering in a multiplicative fashion, in contrast to the more standard additive innovations structure encountered in Gaussian estimation problems [49]. We call this model a multiscale multiplicative innovations (MMI) model. The model is graphically depicted in Figure 3.4. The following are some key properties of the MMI model. So long as the distributions for the perturbations {613k} are chosen to be similar across scales, the MMI model gives rise to self-similar intensity representations, which as discussed in Section 3.4.1, are typical of real-world intensities. Also, here we 80 only consider temporally homogeneous processes; it would be undesirable if the prior depended on the observation time interval in a complicated manner. Due to the multiplicative innovations structure, only the coarsest scale of the prior is dependent on the observation time interval. Thus, the model is essentially invariant to the length of the observation time. Moreover, in Section 3.5 the MMI model is shown to provide a mathematically tractable match to the Poisson nature of the data which leads to a very simple estimator formulation. The MMI model is closely related to other models studied in physics and statistics. The MMI model belongs to the class of cascade models, which are used in statistical physics for modeling a variety of natural phenomena including turbulence modeling [50] and rainfall distributions [51]. In fact, because MMI model is a type of cascade model, it can be shown that the MMI model is a random multifractal [52]. It is also interesting to note that the MMI model is a special case of a Polya tree [53, 54]. In the statistics community, Polya trees are used to model probability distributions, a problem analogous to modeling a non-negative intensity function. 3.4.3 Prior Distribution for Innovations A prior distribution is determined by the nature of the ensemble of objects to be modeled, and so, the better defined the ensemble is, the more informative the prior, and consequently, the better the model. Towards this end, we emphasize photon— limited images, particularly, nuclear medicine images; however, as noted earlier, a very large class of other real-world images are well characterized by the same features of the prior p(6) for the perturbations 6,,k: statistical self-similarity across scale, symmetry about the origin, unimodality, concentration around zero (see Section 3.4.1), and support on the [—1, 1] interval. This last property is due to the fact that the range of 6,), is [—A,,,, AM]. The rest of the properties are based on the characteristics of observed wavelet coefficients’ dis- 81 tributions resulting from natural signals [34], and which have been exploited in other areas including wavelet-based compression [55]. These properties are also illustrated in the histograms of Figure 3.3. All the properties, however, may equally be inferred a priori from a few observations. The images of interest have the common characteristic of being mostly smooth with a few discontinuities. Typically, these discontinuities form edges that are long enough to span several scales. On average, neither the edges nor the overall intensity variations have any preferred orientation, as dictated by nature. It is the slowly varying intensity typical of images which is responsible for the high concentration of wavelet coefficients around the origin, and it is its non-oriented nature that causes their symmetry. The few outlayers coefficients are due to the discontinuities, which are found nearly in the same numbers in opposite orientations—think about the silhouette of a phase, for example. The statistical self-similarity is due to the span of images feature across scales. For example, large regions of smooth intensities and discontinuities alike are observed unaltered at many scales. One very general class of probability density functions that possesses the desired characteristics and which reflect the above image characteristics are beta-mixture densities of the form (1 _ 62) si—l 219102281-1’ (3.13) 3(3113 for —1 S 6 S 1, where B is the Euler beta function, 0 S p, g 1 is the weight of the i-th beta density fi$—%;—ir with parameter s,- _>_ 1, and 2,”; p,- = 1. Figure 3.5 depicts a mixture of three beta densities. A similar method was also recently proposed using a prior based on a mixture of a Dirac impulse and a single beta distribution [56]. Other classes of density functions may also provide the desired characteristics, but 82 , -o.2 —o.1 o 0.2 coefficient value Figure 3.5. Three component Beta-mixture distribution (solid line) superimposed on the histogram of the perturbation variates (6 = B/A) of Figure 2.6 of Section 2.2.2. The beta mixture parameters here are 31 = 1, .92 = 100, 33 = 10000, p1 = 0.001, [)2 = 0.400, and p3 = 0.599. the beta family has a significant computational advantage. As pointed out above, yj‘k parameterizes the conditional distribution of the child 9-12;, given the parent CM. This conditional distribution is binomial. Hence, from a practical perspective, the use of a prior that is conjugate to the binomial will greatly facilitate computations [23]. It is well known that the beta family is conjugate to the binomial. For this reason, the beta mixture prior described above leads to a very simple, closed-form estimator which is discussed in the next section. However, for the sake of brevity, in our derivation of the optimal estimator, given in Section 8.2, we directly compute the posterior means based on a beta mixture prior, without explicitly noting the use of conjugacy. 83 3.5 Estimation 3.5.1 Bayesian Multiscale Intensity Estimator We shall focus on the posterior mean estimator, although other estimators (e.g., MAP) may also be considered within our framework. The posterior mean is the optimal Bayes estimate under a quadratic loss function. The posterior mean estimate of an intensity AM, given all the information available in the data c0 and the MMI prior model pg, is the conditional mean 3:ch E E[Aj,k|c0]. In this section we derive simple closed-form expressions for the posterior mean. First, based on the analysis in Section B.1 of Appendix B, KM = E[/\j.k]Cj]- This implies that a simple coarse-to—fine procedure can be employed in the estimation process. At the coarsest scale 3' = J, the intensity is represented by a single scal- ing coefficient A 1,0. Let us begin by considering the estimation of A 1,0. We have X 1,0 E E [AJ,0[CO] = E [Awlcm]. As argued in Section 3.4, the corresponding count cm is itself usually a very good estimate for A“) provided the total number of counts is sufficiently large. This choice has the added advantage of insuring the preservation of total number of counts, i.e., 2k Soy, = 2k 0%. The posterior mean estimate for the wavelet coefficient 03",, is given by $3) ',k E E[6j,k|CO] = E [Aj,k]C0] E [6j.k|C0] a where we have used (3.9), and have exploited the independence between AM and 511k- 84 Now, we may simply write 933k = Xch 6M: (3.14) with the obvious definition 3],, E E [6,,klc0]. property of the expectation operator: > >’ The desired form for X“, may be obtained using (3.4) and (3.5), and the linearity 1 k 1.1: = E - (bulk/21 + (-1) 6j+1.[k/2]) 2 C0] 1 A .x = 5 (Aj+1,[k/2] +(-1)A 6j+1.[k/2]) - (3-15) Here is where we exploit the multiscale framework for estimating the desired intensity. The estimate for 2:33;, is leveraged by the more robust coarser-scale scaling coefficient’s estimate Xj+1.[k/2]- In Section B.2 we show that Z p.3(3i+Cj-l.2ks3i+Cj—1,2k+1) A ' I B i» t 2 i ' 51.x: = d“. i (s S” “6““) (3.16) 2 DB(S;'+CJ'—l,2ko5i+Cj—1,2k+l) . i 1 3091.30 The parameters {p,, 3,} are the defining parameters for the beta mixture model in (3.13). Note that 333;, guarantees nonnegativity of the resulting intensity estimates. That is, X131: 2 0 for j = 0,... , J and all It. To verify this, simply rewrite (3.16) with —d-’4'5—| Slfor the factor dJ-j, inside the upper summand and consider the fact that 28', +c, k all 2'. The overall algorithm is described below. 85 Bayesian Multiscale Intensity Estimation 1. Estimate coarsest scale coefficient /\J.0 = 0.1.0 2. Forj=Jdowntolandk=0toN/2j—1 Compute: 3; 1, according to (3.16) (71.1. = in- 311 Refine: -= % < ,... a: .) /\j-1,2k+1=% (3:31. — 6“) These simple procedural steps produce posterior mean estimates for finer and finer representations of the underlying intensity, and terminate with the desired, finest-scale estimate X0. The complexity of the proposed estimator is 0(N), the same order as the fast wavelet transform itself. For large data sets it is possible that the full J = log2(N) iterations over the scales in Step 2 above are not necessary. For instance, for long data sequences the estimator may be initiated at a scale J * < log2(N), for which the estimate X]. = cJ. is already very reliable. In practice, even for low-count data, we have found that J" = 5 provides excellent results. Using J“ other than J is equivalent to dividing the original data sequence into equal subsequences, estimating their underlying in- tensities separately, and concatenating the individual results to obtain the overall intensity estimate. However, in Section 3.6 we introduce a shift-invariant version of the estimator above, for which the equivalence just described does not hold. 86 3.5.2 Selection and Analysis of the Beta-Mixture Prior Our experiments and analysis with real-world intensity functions have led to several conclusions. 1. The perturbation densities of many real-world intensities are very well character- ized by a weighted combination of three beta densities with shape parameters .91 = 1, 32 = 100, and s3 = 10000. 2. The .91 = 1 component is the uniform density, and we have found that fixing the corresponding weight to a small positive constant (e.g., p1 = 0.001), to insure that the prior density 115 is bounded from below over the entire [—1, 1] interval, is appropriate in most. situations we have studied. 3. The key parameter that distinguishes the characteristics of different intensity func- tions is the trade-off between the 32 = 100 component and the lower vari- ance .93 = 10000 component. The trade-off is parameterized by the probability 102 E [0,1 -p1]o (Note P3 =1- 101- 102-) To gain some insight into the functioning of the MMI model estimator, consider Figure 3.6 which plots If“, versus the ratio dj,k/cj,k for three cases: low (c = 10), medium (0 = 30), and high counts (0 = 1000). These are, respectively, the dashed, solid, and dot-dashed curves. The ratio dJ-JC/cjj, may be regarded as an empirical counterpart to the perturbation variate 613k. Figure 3.6(a) and (b) correspond to two 6-priors with 132 = 0.1, and p2 = 0.9, respectively. The rest of the parameters take the values given in points 1—3 above. From these figures we can observe the following. At high counts, when the SN R is high, the estimator regards djjc/cjj, to be a good estimate of 5ch for almost every value of d,.), /cj,k; thus, the resemblance to the unit-slope linear function. In contrast, for low counts, the SNR is much lower, and consequently the estimator severely attenuates 87 .-.-.. C=1000 II! —- C=30 I!" 0.5’ --- C=10 1"" 1 g (E) "II! "A g g ...... - ~~~~~~ 1 o q) o i’ Q Q '1 I. -0 5 O 5 r . I". -1 ‘ 1 1 . -1 -0. 5 0 0.5 1 d/c Figure 3.6. Perturbation estimate 3 as a function of d/c for c = 10 (dash), c = 30 (solid), and c = 1000 (dot-dash). The estimator’s defining parameters are 31 = 1, 32 = 100, 33 = 10000, p1 = 0.01, and (a) 102 = 1 - p1 — p3 = 0.100, and (b) p2 =1—p1 -—p3 = 0.900. djk/ijc. This phenomenon is reminiscent of the behavior of wavelet-domain threshold estimators designed for additive Gaussian white noise (AGWN) [12], except that the threshold is adaptive to the local intensity. The MMI model estimator minimizes the expected squared error with respect to the MMI model prior. Of course, the error is minimized by balancing the trade-off between fidelity to the data and fitting to the prior model. If the data are very ‘reli- able,’ then the estimator favors the data. If the data are unreliable, then the estimator favors the prior. As noted, at low counts the ratios dJ-jc/cjjc are not accurate esti- mates for 6, and so, the Bayesian estimator attenuates them to minimize the expected squared error in accordance with the prior. The nonlinearities in Figure 3.6(a) corre- spond to a lower-variance prior (smaller 192) than that giving rise to the nonlinearities of Figure 3.6(b). As a result, the nonlinearities in Figure 3.6(a) display a ‘dead—band’ characteristic, since the prior requires that a greater number of djj,/cjjc samples be mapped towards zero, in contrast to the higher-variance prior which displays a less 88 harsh attenuation of small dJ-jc/cJ-Jc in Figure 3.6(b). This behavior is similar to that observed in other wavelet-based Bayesian estimators designed for AGWN [46, 11, 15]. However, the functioning of the MMI model estimator is in contrast to that of es- timators for AGWN processes. In the latter cases, all wavelet coefficients are typically attenuated independently and in disregard to the values of the corresponding scaling coefficients [46, 11, 15]. The MMI model estimator, on the other hand, adapts not just to the statistics of a particular scale, but also naturally incorporates information from coarser scales. This is crucial in the Poisson problem since the coarse-scale in- tensities (scaling coefficients) are indicative of the statistical reliability of the wavelet coefficient. 3.5.3 Estimation of Prior Parameters The Haar wavelet coefficient distributions of real-world intensities often fit a profile which resembles that of Figure 3.5 as previously discussed. However, while many distributions follow this general characteristic, one expects that subtle variations will exist from application to application. Therefore, it is of interest to adapt the prior to the problem at hand. Here we give a very simple approach to fitting the prior based on a moment-matching method. This adaptation can be viewed as an empirical Bayesian extension of framework described above. Recall, we assume that for each scale j, the set {633k} is independent identically distributed (i.i.d.) with an M—component beta-mixture density distribution with parameters {pj,,-,s,-}f‘il. We let the mixing probabilities {p13,} depend on scale to enable variations of the density across scale. Also note that here the index i does not refer to shift (as does It in yjj, for example), but rather it refers to the i-th component of the mixture density. Then, by (3.10), at each scale j, {yj,k}£I=/§j_l is also i.i.d. with an [VI-component standard-beta-mixture density distribution with parameters 89 {191.11 Silfiii 3,—1 1 _ 3,—1 pJ-(y) = [my ( 9) . 0 s y s 1. (3.17) Since y”, is independent of AM, E[A}‘_1‘2k] = E[A;.“k] E[y;‘k] and E A" 121 E [113%] = —[E[Z)\—k]_]_ (3-18) The moments E[yj,k] need not be computed using this expression since the prior model (3.17) gives the mean % for any choice of parameters. For n 2 2, the moments E[A" j_ 12k] and E[A" k] are easily estimated from the data. Since E[cj klAj k]— — Aj k, EULA-l = ElE[CJ.k|/\j.kll = Bicycl- And in general, it can be shown that for all 71 there exists a degree n polynomial pn(-) such that 13033.] = Ean(CJ‘.k)l' For example, Bldg? ,=kl EClJ k ’CJkl ~ [WW-2303', k’CJk) Substituting these estimates for various 71 into (3.18) we can obtain empirical estimates for the various moments of y“. Equating these to the moments of the beta-mixture model (3.17) produces a set of equations that can be solved for the parameters {pj,,-, 8,},311 As mentioned above, we have found that the choice of a three component prior 90 beta-mixture model suffices for many real-world intensities. At all scales j, we suggest the shape parameters 31 = 1, 32 = 100, and 33 = 10000, with weights p“ = .001, pig = 1-pj,1-pj,2, and 13,-,2 adapted at each scale using the second moment estimate for 31M: 3.5.4 Optimal Estimation from Large Ensembles Until now, we have strived to construct an estimator which optimally estimates the intensities of Poisson processes from a single observation of the process. There are situations, however, when several observations of the same process are available, or, when a large set of realizations corresponding to distinct elements of the ensemble to which the object of interest belongs is at hand or may be created.6 These two conditions require distinct approaches in order to exploit all information available and optimally estimate the underlying intensity. In the first situation, one arrives at a suitable approach by viewing the set of observations (of the one process) as sub-intervals of a longer integration time for the process. Then, the counts corresponding to this larger integration interval are simply the sum of all the counts available on a sample (pixel) basis, and represent sufficient statistics for the intensity pixels [18]. This well known result is due to the reproducing property of the Poisson distribution,7 and may be stated as A E E[A|c1, . .. ,cm] = E[A|c1 + +cm]. 6Under certain conditions, estimates for the object’s distribution may be obtained from simulated large scale ensembles, for example, with the use of the bootstrap method [57, 58]. 7See footnote 2 in page 74. 91 In the second scenario, an optimal estimator would be desirable if, for example, a large data bank of nuclear medicine images of, say, knees are available from past clinical studies. These data then could be used to leverage the estimate of the nuclear images of a new patient’s knees. Situations like this admit two possible solutions. First, the moment-matching method described in the last section could be employed with the larger observation data set to obtain very robust estimates of the prior parameters. A robust prior clearly would lead to a robust estimate of the intensity of interest. If the data set is large enough that the ratios p(Cj—1.2k = C1 +1|Cch = 0 +1) P(CJ—1.2k = CIICch = C) can be computed sufficiently accurately, a prior-independent optimal estimator for the innovations {yjjc} can be devised as follows. Abbreviating the indexing notation, we have yp(yl01 c )dy / =/yp(cilcy)p(( gently- (CIIC) Since p(cllc, y) = (col)ycl(l — y)‘3‘°l (see [48]), and since the innovations are indepen- dent of any rv at their own and coarser scales, A f y (f,)yc‘(1 - y)c‘“1p(y) dy y = P(01|C) _cl +1 I (51:11) yc’“(1 -y)c“c‘p(y)dy c + 1 p(cllc) ° 92 Thus, we obtain A _ Cl +1 p(cj-1‘2k = C1 + 1'01"}; = C +1) yk _ 3.20 J c + 1 p(Cj—1,2k = ClleJc = C) ( ) Clearly, to compute the ratio of probabilities in this expression we only need, for example, a table for p(CIIC). For instance, for a realization C“, = 0, necessarily 9.131, = 0, and the innovations estimate reduces to if”, = p(cj_1,2k = 1|cj,k = 1), which typically takes on values close to 5 For very large counts of CM, one would expect that p(cllc) z p(cl + 1|c + 1). Therefore, for very large counts 171.}: z 961, reflecting the fact that at high counts the SNR is high and the data is, therefore, reliable. 3.5.5 Example of Estimation from a Single Observation For our first illustration of the performance of the Bayes’ estimator we give the follow- ing simple example. We offer more comparative examples in Section 3.7 once a more sophisticated model and estimator is introduced in the next section. In Section 3.8, two-dimensional examples are given. Consider the test intensity function depicted in Figure 3.7 (a). Figure 3.7 (b) depicts a realization of counts from this intensity. Note that the first “bump” in the low intensity region of (a) is statistically reliable—it is easily distinguished in the count data (b). The second bump in the high intensity region is equal in size to the first bump, but it is not statistically reliable as is seen in the count data in (b). Therefore, we expect that a good estimator should recover the first bump from the data and not attempt to recover the second. Figure 3.7 (0) depicts the estimate resulting from the PRESS-optimal estimator8 proposed in [59]. It is an improvement 8This estimator is a wavelet threshold type operation that is derived using the statistical method of cross-validation [40]. 93 over the raw data, but still exhibits quite a bit of variability. Figure 3.7 (d) depicts the intensity estimated by the Bayesian approach introduced in Section 3.5.1. Due to the very simple—and unnatural—structure of the intensity where the cor- responding wavelet coefficients may be found in one of only two states, zero and “large”, a simple two-component beta-mixture density model sufficed for the innova- tions yJ-‘k. The parameters of the beta densities were 31 = 1 and .92 = 10000. The mixing parameter p1 = 1 — p2 was adapted at each scale using the data-based mo— ment matching approach proposed in Section 3.5.3. Note that the Bayes’ estimate does an excellent job at correctly estimating the reliable features (edges and bump in low intensity region). The experiment shown in Figure 3.7 was repeated in 1000 independent trials. The normalized MSE of the raw count estimate was 1.000, the MSE of the PRESS-optimal estimator was 0.375, and the MSE of the new Bayes’ estimator was 0.106. 3.6 Stationary Intensity Models and Estimators One potential limitation of the MMI model is that it is not stationary due to the fact that the Haar wavelet transform is shift-dependent. That is, the analysis and estimation depends on the alignment between the Haar basis functions and the data. Moreover, the coarse scale approximation by Haar wavelets is piece-wise constant, an unrealistic intensity model. To circumvent such problems, shift-invariant wavelet transfonns have beenproposed in literature [60, 61, 62, 63, 36, 64]. In this section, we provide a unified Bayesian framework for shift-invariant analysis and estimation based on the MMI model. We formulate a fast shift-invariant estimator from this analysis, and illustrate it with an example. Also, we characterize the autocorrelation functions of the MMI model (non-stationary) and a shift-invariant MMI model (stationary), and show that the shift-invariant MMI model has a more regular correlation behavior 94 ['1 ; 11 ll] so a 80 360' g 50' 4 40- ‘0’ ’ l 20. _ , i E ay’lkww “l" NW1.“ L... o Sp.“ — J" 0 3pm (a) (b) 140 14o—~ 120 , l - 120 ‘ ..1' "f 1., ,1 1.. 1111‘.)1:1 ‘ 1‘°°:’ : r 3°01 2 ”1 g 00] 3 00‘ . g ’ m 40’ I ’ E ‘0’ l '. ; g ] ‘ j 1 ‘ [ l 20 _llr'»+wr«1, with-w; I 20 If i l 0211. Space “J OT Space H (c) ((31) Figure 3.7. Poisson intensity estimation. (a) Test intensity function. (b) Realization of Poisson process with intensity in (a). (c) PRESS-optimal estimate using algorithm in [59]. (d) Bayes’ estimate using algorithm described in Section 3.5.1. which may be better suited for modeling real-world intensities. 3.6.1 A Shift-Invariant MMI Model Shift-invariant MMI intensity models can be easily constructed within a Bayesian framework. Specifically, the shift of the intensity function with respect to the Haar wavelet system can be viewed as an additional degree of freedom in the model, and a probability measure on the shift parameter can be introduced. It is assumed that all shifts are circular. If we regard the original model as “unshifted” (shift = 0), then the standard MMI model introduced in Section 3.4 is denoted p(Alshift = 0). Now 95 let p(shift = m) denote a probability mass function for the shift, and consider the averaged MMI model p(A) = Zp(Alshift = m)p(shift = m). (3.21) If p(shift = m) is the uniform distribution, a non-informative prior expressing no preference for any particular shift, then the averaged MMI model provides a shift- invariant (stationary) intensity prior and we call it the shift-invariant (SI) MMI model. It is shown in Section 3.6.3 that the SI—MMI model is more regular than the basic MMI model. Estimation using the SI-MMI model is easily carried out by computing the opti- mal Bayes shift-dependent estimator given in Section 3.5.1 for each shift, and then computing an average of the results. The complexity of the shift-invariant estimator is 0(N2) operations. However, note that if we employ a J *-scale SI-MMI model, with J "‘ < log2(N), then the estimator is invariant to shifts modulo 2].. This is due to the fact that the scaling functions at the coarsest scale have support over 2". samples. In general the J *-scale SI-MMI model only requires a uniform shift prior over a 2" sample region of support, rather than over the entire range of circular shifts. Thus, if J "' < log2(N), then the complexity of shift-invariant estimation is only 0(N2J). As pointed out earlier (see Section 3.5.1), in many applications it suffices to take J * smaller than log2(N). Note that the fast shift-invariant methods described in [61, 62] are not applicable in this case due to the dependence between wavelet coefficients across scale. This dependence stems from the multiplicative relationship between scaling and wavelet coefficients.9 However, fast shift-invariant estimation algorithms are feasible if not “homogeneous” (see below). We next present one such approach. 9The fast shift-invariant methods treat all wavelet coefficients independently. 96 3.6.2 A Fast Shift-Invariant MMI Estimator Referring to the tree representation of the intensity model of Figure 3.4 in Sec- tion 3.4.2, we know that the mapping {A0,k}k 1—1 {A10} U {yjjb—j is invertible. There- fore, p(A) oc p(AJ,0,yJ,0, . .. ,yl’N/2_1), and so, the shift-invariant model (3.21) may also be expressed as p(A) 0< 2901.0): 11115711111 = m), (322) m j,k where we have exploited the invariance of A JD with respect to shift, and the indepen— dence of the set {yj,k}j,k. The notation p(yjif’k) is simply a more economical way of ex- pressing p(yj,k|shift = m). Note that with this notation, Agfo E 11031370113110 - - (gift), for example, is a component of A0,", and not of A09 in the averaging process. The shift-invariant model of the previous section is characterized by the assump— tion that any two distributions p(yfl‘) and p(yjmkz) are equal for all shifts m1 and m2. If, however, we choose probabilities such that p(ygnk) = 0 for all k 75 0 the above model becomes p(A) oc 1901.0) 2 111903-70) p(shift = m) (323) m .7 Clearly, this new model is also shift invariant for it is obtained by averaging over all possible shifts, which we take to be equally likely. However, we say the model is not homogeneous because a different model results if we choose a new set of distributions to be the non-zero distributions in (3.22), say, for example, p(yjnk) = 0 for all j and 10, except whenever k # 1. Despite the lack of homogenity, this intensity model has proved to be valuable as it leads to an efficient estimator of complexity 0(N). Also, note that more sophisticated fast algorithms can be created from this general prescription by simply choosing not 97 one, but a combination of “legs” of the tree to represent the elements of the intensity sequence as the tree shifts. We have experimented with these notions and have formulated the corresponding estimators. Although the resulting models from the added complexity in the construction process are theoretically more appealing for they “close” the gap between the homogeneous and non-homogeneous models, we have found that they add very little as to the quality of the intensity estimates. For this reason, we have not pursued them here any further. The procedure for the fast estimator derived from the simpler intensity model (3.23) is the following. Fast Shift-Invariant MMI Intensity Estimation 1. Estimate coarsest scale coefficient /\J,0 = CJ,0 2. Form=0toN—1andj=Jdowntol Compute: 03% according to (3.16) 3137,10 = % (1 + 3‘70) Refine: A m _ Am Am Aj—1,O - A130 310,0 The desired intensity estimate is A = (3130359,... ,Ago" 1). We note that these procedural steps call for N / 2 redundant shifts in order to compute the N elements of the intensity sequence. Although it is not hard to describe the above estimation algorithm void of these redundancies, we feel that the description is clearer in the form presented. 98 120’ ’— ‘ 12°). '11 thu’IIIII'I ”I 100’ ##T‘w ' 100. I II“. | III I'M ‘MIIWI [I E 80 a so ‘0' 40' r I 20- ‘ 20> ? ‘ | ‘ "fifth-1’ If ‘1 III R “PHI“ $151,“ I or— _ —_7 Vfl—‘Lfi’ Om -_' _ m Space Space (a) (b) 140 no 120'- ‘20 100. 100 g g . 8 80 E .0. g i 3 so :3. 60 g a 13 40: 3 ‘0' I I [ ' I 20' 1 2° “ 1 _ f _'__—_ ‘ —#ilg‘—F-— 0 Space 0 Sp.“ Figure 3.8. Fast Poisson intensity estimation. (a) Test intensity function. (b) Realiza- tion of Poisson process with intensity in (a). (c) Bayes’ estimate using the algorithm described in Section 3.5.1. (d) Bayes’ estimate using the fast shift-invariant algorithm described here. Example In order to illustrate the quality of the estimates obtained by the fast shift-invariant estimator we repeated the example of Section 3.5.5. We created a sequence of Poisson counts from the intensity in Figure 3.8 (a), and used them to estimate the intensity using the fast algorithm. The result is depicted in Figure 3.8 (d). For comparison, we reproduced Figure 3.7 (d) in Figure 3.8 (c), which shows the estimate based on the MMI model of Section 3.5.1. By repeating this experiment with a large number of different count realizations, 99 we found that the quality of the estimate evident in Figure 3.8 (d) was representative of the fast shift-invariant estimator. 3.6.3 Autocorrelation Functions of MMI and SI—MMI Mod- els The underlying beta mixture densities capture the key heavy—tailed, non-Gaussian nature of wavelet coefficient distributions. However, the shift-dependent nature of the Haar wavelet transform generates a non-stationary correlation structure as illustrated next. For the sake of illustration, we focus on the 1-d case. Extensions to higher dimensions are straightforward. Also, to keep things simple, we assume that the maximum number of scales J = log2(N) are computed in the analysis, where N is the length of the intensity vector. The results easily extend to other choices for J * 7E J. First, consider that basic MMI model (shift: 0) and let us introduce the following notation. Let p3 E E[A‘2,’0], the second moment of the scaling coefficient at the coarsest scale, and let p? E E [yik] = E [(1 — yj,k)2], the second moment of the innovations variates. Recall that we assume the distribution of y“, does not depend on position k. The variables y“ and 1— y“, have a common second moment due to the symmetry of the distribution of yJ-‘k about a For illustration consider the correlation between the intensities A00 and A0,; in the MMI model depicted in Figure 3.4 in Section 3.4.2. Note that the finest scale for which A09 and A03 have a common predecessor above in the tree is j = 2 and the predecessor is A20. Therefore we can write /\0,0 = 311.0 312.0 A20, /\0,2 = yi,1 (1 " 92.0) /\2.0- 100 The correlation between the two intensities is computed E [A0,0Aog] = E [yro 311,1 92,0 (1 " 312,0) A30] - Exploiting the independence of the innovations variates, we have E [A0,0/\0,2I = E [311,0] ELI/1,1] E [92,0 (1 - 3/2,0)I E [A30] - Examining the individual product terms and making use of the moments defined above, Elyml =El311,1l = 24, E [312,0 (1‘ 3120)] = 1/2 — pg, E [A30] = #3103- Hence, Elhophoal = 2‘2 (1/2 - 03) #3393. N ow consider a general case in which we are interested in the correlation between two intensities, say ADJcl and /\0,k2. Let j ‘ be the finest scale for which AM, and A0,!” have a common predecessor (above) A)”, in the MMI model tree. The scale 3'" can be explicitly calculated using the binary representations of 101 and kg, and depends on the exact positions of k1 and kg with respect to the alignment of the Haar basis 101 functions. Assuming that k1 < k2, we have r-l )‘OJci = H 31:31: yj'Jc Awe, (3.24) i=1 j‘-1 /\o.k2 = H 312,]: (1 — yj‘Jc) Male. (3-25) i=1 In the expressions for AM, and Auk-.2 above, we use a generic spatial index k, since the distributions of independent innovations variates do not depend on position. We do, however, use y“, and yg-‘k to distinguish the independent innovations variates corresponding to AM, and ADM, respectively. The correlation is given by j'-1 j'—1 E [Aoalhoazl = E H yank H all. yj'Jc (1 — yj‘Jc) A3231: - (3-26) i=1 i=1 Again exploiting the independence of the innovation variates and making use of the moments defined above, we have E [Afk] = #3 Hijzj. +1 p? and T(k1,k2) E E[A0,k1A0.k2] J = 2'2””) (1/2-pf-‘M3 H p? (3-27) i=j‘+l Note that because 3'“ depends on the alignment between the intensity function and the Haar basis, r(k1, k2) is not a function of the difference k1 — k2 alone. This shows that the intensity distribution represented by the MMI model is non—stationary. Two columns (fixed 101) of the autocorrelation function of a 256 length MMI model intensity prior are shown in Figure 3.9. Note also that the autocorrelation function is highly irregular (piecewise-constant), which is an undesirable model for real-world intensities. Now consider the autocorrelation function of the SI-MMI model (3.21). Given a displacement of n between two intensities, say A03 and A0,", we can compute the probability of the finest scale j * of a common predecessor, with respect to the uniform 102 _____ _.__.* ——‘"‘fi 1 - - - r(0.k) ------ r(55.k+55) ‘ -— r(k) Sl-MMI Figure 3.9. Correlation functions for MMI and SI-MMI 256-point (J = 8) intensity priors. MMI model autocorrelation r(0, k) (dash-dash) and 7‘(55, k + 55) (dash—dot) plotted as a function of k. Stationary SI-MMI model autocorrelation r(k) (solid). For both the MMI and SI-MMI models in this example, #3 = 1 and p12- : 0.26, j = 1,. . . , J. distribution over the shift parameter, as follows. We need to count the number of shifts that give rise to each possible value of j", or more precisely the probability of the set of shifts that give rise to each possible 3". For example, suppose n = 1 and consider the tree in Fig. 3.4. In this case, four shifts (out of eight) result in j“ = 1, another two shifts result in j* = 2, and two shifts (one wrapping around due to circularity) result in j" = 3. Hence, we compute p(j" = 1|n = 1) = 1/2, p(j“ = 2|n = 1) = 1/4, and p(j“ = 3|n = 1) = 1/4. Similarly, if n = 2, then p(j’ =1|n = 2) = O, p(j“ = 2|n = 2) = 1/2, and p(j* = 3|n = 2) = 1/2, where again we must be careful to account for the wrap-around effect of the circular shifting. Given a displacement of n = k between two scaling coefficients in a J -scale MMI model, the probability of the scale 3'“ is determined by inspecting the associated binary tree. We need only consider |k| S 23"1 due to the periodicity of the MMI. First note that we have p(j“ _<_ J | n = k) = 1. Next, notice that (2“ - k)+ is the number of shifts of a length-k sequence that fit within a length-2’" sequence, where 103 (3:)+ = max(:z:, 0). In other words, (2m — k)+ is the total number of scaling coefficient pairs spaced 1: apart at the bottom of a m-level binary tree. Also note that the number of m-level subtrees, m < J, at the bottom of a larger J-level binary tree is precisely 2J‘m. Hence, for m < J 130'" S m In = k) = (2m - |k|)+ 2""‘2" = (2m — |k|)+ 2"". (3.28) It follows that p(mlk) E 100' 0, m < [10g2(|k|+1)l 1- 2"'"lkl, m = (10g2(lkl +1)l 2""lkl, [log2(|k| +1)1 < m < J |k|2‘J+1, m = J, where [log2(|k|+1)] denotes the smallest integer greater than or equal to log2(|k| +1). With these probabilities defined, the autocorrelation function is given by J 7‘(k) E E[/\0,1)\0,1+k] = Zump(m|k), (3.29) m=1 where J um22*2‘m-1’(1/2—p3..)u3 H p? (3.30) i=m+1 is simply the autocorrelation between two intensities at the bottom of an MMI binary tree model for which 3"“ = m, as in (3.27). Note that the SI—MMI autocorrelation is stationary. The autocorrelation function for a 256 length SI-MMI intensity prior 104 is shown in Figure 3.9. Unlike the autocorrelation of the MMI model, the SI—MMI model’s autocorrelation is piece—wise linear (compared to piece-wise constant) and hence is more regular and potentially better suited for the analysis of natural intensi— ties. These results are similar in spirit to the analysis in [61], where it is observed that the shift—invariant Haar wavelet transform is line-preserving.10 Moreover, the results here suggest possible schemes for choosing the parameters of the SI—MMI model. For example, the decay of the autocorrelation function may be tailored by appropriate choices of the pf, i = 1,... ,J. Larger values for the p? cause r(k) to decay faster. We also note that similar correlation analysis may be carried out for related wavelet- domain signal models developed for Gaussian data [46, 34, 11, 15, 47]. We also note that the issue of combining or mixing trees (which is essentially what is being done in the SI-MMI model) has been studied in the more general setting of the Polya tree. Some very interesting theoretical results concerning the continuity of densities gener- ated by mixtures of Polya trees (SI-MMI models are a special case) are given in [54]. These results may provide further insights into the SI-MMI model and may suggest other extensions of our framework, but we have not pursued this here. 3.6.4 SI—MMI Model and 1 / f Processes Remarkably, the SI-MMI correlation function has a fractal 1 / f -like character. Fractal 1 / f random process models are commonly used in image modeling, and it has been observed that natural signals and images often display a correlation structure similar to that of the SI-MMI model [65, 66]. To see that the SI-MMI correlation is 1 / f , consider the simple case in which the second moments of the innovations are constant 10Thefie conclusions extend to the SI-MMI model as well. 105 independent of scale, i.e., pj = p, j = 1,... ,J. Then um. = 2'2”“) (1/2 - p?) #319” 7’" (331) = C (2p)-2m, (3.32) where C is a constant independent of m. Next equate (2p)’2m with 2‘74”" and solve for ’7. This gives um = 020*)“, (3.33) where '7 = —1—log2p2. Note that 1 / 4 < ,02 < 1 / 2, where the lower and upper bounds corresponds to the two extreme limits of the beta density: a point mass at 1/2 or two point masses at 0 and 1, respectively. This implies 0 < 7 < 1. Combining (3.33) with (3.29) and after some algebra, we have TU?) .__ C(2(‘r-1)Ilog2(|k|+l)] _filkl2(,_2)J + 5| k|2(7—2)110g2(|kl+1)1) (334) for lkl > 0, where 6 = Egg—3%. In the case k = 0, r(0) = H3P2J- Finally, making use of the approximation 2l10g2('k'+1)] 2 [It], for Ikl > 0 we have r(k) z C [(1 - fi)lk|<7-1)+ BlklZW'W] . (3.35) For large J and '7 < 1, the term fi|k|2(7‘2)J is negligible, and hence the correlation function behaves like |k|("“1) and the power spectrum decays like l-f—IF (see [67] for the relationship between autocorrelation functions and power spectrums of 1 / f pro- cesses). That is, the SI-MMI model produces non-negative, stationary processes with 1/ f characteristics. For more details on SI wavelet models and 1 / f processes see [64]. 106 3.7 Numerical Comparison of Wavelet-Based In- tensity Estimators Here we compare the performance of the new Bayesian estimation algorithm with several existing methods. To assess the performance of each method, four test inten- sity functions were used. These functions were the “Doppler,” “Blocks,” “HeaviSine,” and “Bumps” test signals proposed in [12]. Each test function is 1024 samples long (i.e., N = 1024, J = 10). These functions serve as benchmark tests for signal es- timators, and they were designed to be representative of a variety of natural signal structures. We refer the reader to [12] for more information‘about the test functions. Since the intensity functions must be non-negative, each test function was shifted and scaled to obtain an intensity with a desired peak value and a minimum value of m. Realizations of counts are generated from each intensity using a standard Poisson random number generator [23]. We compare the performance of the simple estimator based on the raw counts (COUNT), a shift-invariant version of the cross- validation estimator11 (CV) proposed in [44], the SI-MMI model estimator described in Section 3.6.1 with a three component beta-mixture model for the innovations with parameters12 31 = 1, 32 = 100, and s3 = 10000, the square-root estimation methods using a shift-invariant version of the Haar wavelet transform (D2), and the square-root estimation method using a shift-invariant version of the Daubechies-8 (D8) wavelet. The method proposed in [14] is not compared since it is derived under a “burst-like” process model which is not appropriate for these test functions with the exception of the Bumps function. ' The square-root method first computes the square—root of the counts, then treats the square-root data as though it were Gaussian, takes the shift-invariant discrete 11See footnote 8 in page 93. 12The mixing parameter p1 = 0.001, and parameter p2 (and hence 193) is determined using the data-adaptive moment-matching method given in Section IV-C. 107 Table 3.1. AMSE results for various test intensities and estimation algorithms. Peak intensity = 8. Intensity C O U N T CV BA YES D2 D8 Doppler 0.1786 0.0588 0.0154 0.0548 0.0443 Blocks 0.1935 0.0617 0.0178 0.0700 0.0800 H eavz'S ine 0.1833 0.0552 0.0052 0.0294 0.0300 Bumps 0.5207 0.1877 0.1475 0.4570 0.4317 Table 3.2. AMSE results for various test intensities and estimation algorithms. Peak intensity = 128. Intensity CO U N T C V BA YES D2 D8 Doppler 0.0111 0.0047 0.0026 0.0095 0.0059 Blocks 0.0120 0.0040 0.0027 0.0077 0.0126 HeaviSz'ne 0.0115 0.0039 0.0007 0.0036 0.0028 Bumps 0.0324 0.0171 0.0143 0.1046 0.0908 wavelet transform [61, 62], applies a soft-threshold nonlinearity to wavelet coeffi- cients, and computes the inverse transform of the thresholded coefficients. After this processing, the result is squared to obtain an intensity estimate. For both square-root methods the universal threshold proposed in [12] was used. We consider both the D2 and D8 wavelet (which can be applied in the case of Gaussian data) to demonstrate that the Haar-based method introduced here can out- perform the square—root method even when more regular wavelets like the D8 are used. All methods employ a 5-scale wavelet transform. In practice, we could use full J-scale transforms, but their performances were roughly the same as that of the 5—scale transforms in the experiments, and the 5-scale transforms are more com- putationally efficient. Table 1 gives the average mean-square errors (AMSE) of the various methods for a peak intensity of 8. Table 2 gives the AMSE of each method for 108 a peak intensity of 128. The AMSE is estimated using 25 independent trials in each case, and each AMSE is normalized by the squared Euclidean norm of the underlying intensity function. Inspection of the tables shows that all methods offer significant im- provements over the simple count estimator. Moreover, the SI—MMI based estimator outperforms all others in every case. We also note that similar tests and compar- isons were made with the shift-variant versions of each estimator. As expected, the shift-variant estimators did not perform as well as their shift-invariant counterparts. However, the MMI model estimator outperformed the other shift-variant methods in all cases as well. 3.8 Application to Photon-Limited Imaging In this section we apply the MMI models and estimation procedure to the problem of photon-limited imaging. Photon-limited imaging arises in many fields including medicine and astronomy. The fundamental problem in photon-limited imaging is the variability due to quantum effects in the emission and detection of photons. In many problems, the photon counts collected during image acquisition are well-modeled by a temporally homogeneous and spatially inhomogeneous Poisson process. Assume that we detect photon emissions in a compact region of the plane. The photon emissions are the result of an underlying two—dimensional continuous intensity function. We are interested in estimating the intensity function from the photon detections. For practical reasons. (computing and display), we seek an estimate of the intensity at a finite scale (resolution) represented by a ‘pixelized’ intensity. A crude estimate of the pixelized intensity is obtained by simply counting the number of photons detected in each square pixel region of the plane. This “count” image is highly variable due to the random nature of the photon emission process. However, lower resolution images, obtained by counting the number of photons detected in 109 larger square pixel regions of the plane, provide better (less variable) estimates of the low-resolution intensities. This illustrates the advantage of multiscale analysis in photon-limited imaging. Relatively reliable coarse-scale estimators of the intensity can be leveraged to obtain finer details using the multiscale Bayesian framework. To illustrate the effectiveness of the framework in photon-limited imaging appli- cations, we consider a simulated experiment and a real—world application to nuclear medicine imaging. Note that the MMI models and estimator are easily generalized to two dimensions. Specifically, we take the 2-d multiscale parameters to be the factors corresponding to the multiplicative refinement of a coarse scaling coefficient (inten- sity) into four finer scaling coefficients by first splitting it vertically (horizontally) into two halves, then next horizontally (vertically) splitting each half into two quarters. That is, if A“ is a 2—d intensity function, then we define A0,)” E A“ at the finest scale j = O and for coarser scales take )‘j+1,k,l = Aj,2k,2l + )‘j,2k,21+1 + Aj.2k+1,21 + /\j.2k+1.2z+1, yr M21321 + /\j.2k.21+1 j+lvkvl 9 More: + /\j,2k.21+1 + ’\j.2k+1,21 + /\j,2k+1,21+1 2 )‘j.2k.2l yj+1,k,l 9 Amer + )‘j.2k.2l+1 ,\. 3 _ 1.2k+1.2l yj+l.k,l — A A ‘ (3.36) jolt-+1.21 + j.2k+1.2l+1 Note that in the 2—d case we have three sets of multiplicative innovations, one vertical set y1 and two horizontal sets 3;2 and y3. In the analysis of count images, each scaling coefficient is the sum of four counts, and each wavelet coefficient is simply the difference of two counts. Hence, all the machinery developed for the one-dimensional case, based on sums and differences of pairs of counts, is immediately applicable to two (or even higher) dimensional data. Note that this 2-d multiscale analysis defined above differs from the standard 2-d Haar wavelet analysis [68], which involves 110 a vertical, horizontal and diagonal differences. We use the alternative 2-d analysis because, unlike the standard 2-d Haar analysis, it allows us to decouple the Poisson problem, just as in the 1-d case. In both experiments, the 5-scale Haar transform is employed and a three component beta-mixture density is used for the SI-MMI model of Section 3.6.1 with fixed shape parameters .91 = 1, 32 = 100, and 33 = 10000. The mixing probability p1 is fixed at 0.001, and p2 (and p3) is adapted to the data at each scale using the moment-matching method described in Section 3.5.3. We have found these choices of 31, S2, and 33, combined with the flexibility of the data-adaptive p2, to provide very good results for a wide-variety of imagery. 3.8.1 Photon-Limited Imaging Simulation Figure 3.10 depicts a typical realization of a simulated photon-limited imaging appli- cation and the resulting estimates provided by the MMI and SI-MMI models. The maximum intensity in the image in Figure 3.10(a) is 60.00, and the average intensity is 25.40. Hence, this simulation models a fairly low intensity (low SNR) imaging problem. A realization of counts is generated from this intensity using a standard Poisson random number generator [23]. Note the visual improvement provided by the MMI and SI-MMI model estimates in Figure 310(0) and ((1), respectively, in comparison to the count image Figure 3.10(b). Furthermore, the estimate based on the SI-MMI model appears to be better than that of the MMI model. In fact, in 25 independent trials of this experiment we estimated the average mean squared pixel error to be 25.28 for the count image, 7.36 for the MMI model estimator, and 4.01 for the SI—MMI model estimator. 111 Figure 3.10. Photon-limited image estimation using MMI models. (a) intensity func— tion, (b) realization of Poisson counts (average squared pixel error = 24.80), (c) in- tensity estimate using MMI model (average squared pixel error = 7. 36 ), (d) intensity estimate using SI-MMI model (average squared pixel error = 3.98). 3.8.2 Application to Nuclear Medicine Imaging Nuclear medicine imaging is a widely used commercial imaging modality [69]. Un- like many other medical imaging techniques, nuclear medicine imaging can provide both anatomical and functional information. However, nuclear medicine imaging has a much lower signal-to-noise ratio relative to other imaging techniques. Hence, im- provements in image quality via optimized signal processing represent a significant opportunity to advance the state-of-the-art in nuclear medicine. 112 Nuclear medicine images are acquired by the following procedure [69]. Radioac- tive pharmaceuticals that are targeted for uptake in specific regions of the body are injected into the patient’s bloodstream. As the radioactive pharmaceuticals decay, gamma rays are emitted from within the patient. Imaging the gamma ray emissions provides a mapping of the distribution of the pharmaceutical, and hence a mapping of the anatomy or physiologic function of the patient. Gamma rays are detected and spatially located using a gamma camera, which converts gamma rays into light. Pho- tomultiplier tubes then detect and locate the emissions. The raw nuclear medicine data is an image of photon detections (counts). The raw data may be viewed directly or used for tomographic reconstruction. The major limitation of nuclear medicine imaging is the low-count levels acquired in typical studies, due in part to the limited level of radioactive dosage required to insure patient safety. Because of the variabil- ity of low-count images, it is very common to employ a post-filtering or estimation procedure to obtain a “better” estimate of the underlying intensity. An excellent discussion of the potential diagnostic benefits of various frequency domain processing methods is given in [5]. Advantages of multiscale methods over frequency domain methods for photon-limited imaging problems are discussed in [59]. To illustrate the potential of our multiscale Bayesian framework in nuclear medicine imaging, consider the spine and heart studies depicted in Figure 3.11. Fig- ure 3.11(a) depicts the count image from a nuclear medicine spine study. The radio- pharmaceutical used here is Technetium-99m labeled diphosphonate. In bone studies such as this, brighter areas indicate increased uptake of blood in areas where bone growth is occurring. This may reflect areas where bone damage has occurred. Func- tional changes in bone can be detected using nuclear medicine images before they will show up in x-ray images. The maximum count in this image is 178 in the “hot-spot” at the bottom of the spine. The maximum count in the upper portion of the spine is 75. Figure 3.11(b) shows the SI-MMI model estimate of the underlying intensity. 113 (C) ((0 Figure 3.11. Nuclear medicine image estimation using SI-MMI models. (a) Spine count image. (b) SI-MMI model estimate of underlying intensity. (c) Heart count image. (d) SI-MMI model estimate. Figure 3.11(c) depicts an image of a heart obtained from a nuclear medicine study. The image was obtained using the radiopharmaceutical Thallium—201, and the max- imum count is 33. In this type of study, the radiopharmaceutical is injected into the bloodstream of the patient and moves into the heart wall in proportion to the local degree of blood perfusion. The purpose of this procedure is to determine if there is decreased blood flow to the heart muscle. Figure 3.11(d) shows the SI-MMI model estimate. In both studies, we see that the SI-MMI model estimator preserves the important image structure and has significantly lower variance compared to the raw 114 count image. The intensity estimates provided by the SI—l\»"Il\/’II model may enable better diagnosis in clinical nuclear medicine. 115 CHAPTER 4 Emission Computed Tomography In this chapter, we extend the MMI models and estimation approaches to emission tomography imaging. We introduce two extensions: one based on geometrical consid- erations of natural pixels of images, and the other based on a new multiscale inversion approach to the tomographic image reconstruction problem. The new models are used to represent the intensity sinogram of the projected images, from which their optimal estimation can be computed prior to reconstruction. We illustrate the performance of the new estimators with examples using synthetic and clinical data. 4. 1 Preliminaries The first explicit formula for the reconstruction of a function on a plane given its integrals over lines was first derived by Radon in 1917. Although it was not until the late sixties that practical applications of the Radon formula started to appear, first in radioastronomy and then in electron micrography [70], the practical computed tomog- raphy that it gave birth to has since become one of the most important tools in many branches of science for processing information that otherwise would be inaccessible. For example, in astronomy and geology, imaging of the Sun’s and Earth’s internal structures have been possible only by tomographical techniques [71]. It is perhaps in 116 medicine where we have witnessed the greatest impacts of this technology. Imaging the internal structure of human patients by noninvasive methods is frequently the safest and most expeditious method in determining their conditions, and thus, often the most desirable method for diagnosis. Although our interest is in emission computed tomography (ECT) in general, in this chapter we focus exclusively on nuclear medicine tomography—more specifi— cally, in single—photon emission tomography (SPECT)—f0r the following reasons. The problems encountered in nuclear medicine tomography typify many of the problems encountered in ECT at large; and, concentrating in SPECT in particular, permits us to be more specific in the treatment of the subject. Also, nuclear medicine tomog- raphy represents one of the most challenging applications due to typical low SNR regimes under which it takes place; therefore, the methods presented here are tested for robustness. Further, nuclear medicine tomography represents one very important application that directly affects many people’s quality of life. There exists a number of tomographical methods for diagnostic imaging. For ex- ample, x—ray computed tomography (x—ray CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and SPECT. The physiological basis of emis- sion computed tomography medical imaging (e. 9., PET and SPECT), opposed to the anatomic approach of transmission imaging (e.g., x—ray CT and MRI), are unique in that measurements of functional abnormalities are readily possible from a single image.1 In nuclear medicine imaging a radiopharmaceutical, which is targeted for uptake by some specific organs, is administered to the patient. As the process of nuclear decay takes place in the radiopharmaceutical, gamma-ray photons are emitted in all directions. Those photons incident to an array of detectors placed in close proximity 1Although functional measurements in transmission-based imaging are possible, these require more complex procedures involving sequences of images obtained at controlled time-intervals that must be carefully registered [72, 73]. 117 ) ‘2 ll" 1 i ii \\ j) ll; ‘ (a z A <1: .t€;’ \\ 5' r a 1% we 1. . to .f i ‘1 {,4 I ,I / / Figure 4.1. Tomographic data collection geometry. The array of detectors collect and count the number of photons induced by the arriving gamma rays emanating from within the subject. The array is repositioned at equal angle intervals about the long axis of the subject so as to obtain a sufficient number of projections. to the patient and in the vicinity of the organs of interest are counted. The number of photon counts are then interpreted as a function of physiological activity or intensity. For tomographical image reconstruction of a transversal slice of the body, mea- surements of photon counts are made at many angles about the long axis of the body. Other image geometries are also possible, in which case, the axes about which mea- surements are taken are also perpendicular to the images’ planes. Figure 4.1 depicts a typical data measurement geometry. In modern non-diffracting tomography (e. g., all the tomographic methods re- ferred to thus far), Radon-transform, algebraic, and iterative-based reconstruction algorithms are the main general approaches to image formation. Radon-transform based methods are almost exclusively used in nuclear medicine because algebraic and iterative reconstructions generally are more computationally intensive. One very Popular and efficient algorithm for carrying out the reconstruction process is the filtered-backprojection (FBP) method, which in effect implements the inverse Radon 118 transform [74]. 4.2 The Image Reconstruction Problem A typical data gathering geometry for parallel-scanning tomography in SPECT is depicted in Figure 4.2. In nuclear medicine, the intensity f (x) represents the con- centration of radioactive material at a point x in space, and is typically the quantity of interest. 17 and 9 are orthogonal unit vectors which together define the projection plane on which the data collection process takes place. The plane contains the detec— tor array, positioned according to the vector 0. The convergence point of a projection line (ray) on the plane is identified by s, the number of units along 0. Thus, we speak of the projection of f at 30. The vector x and the projection rays lie on the n-O plane. The geometry in Figure 4.2 depicts a data collection process in which the object, and not the array of sensors, is rotated in order to capture the projection data. The two schemes are equivalent as far as the processing of the data is concerned, however, our choice simplifies some of the notation. During the process, the object f is rotated through angles —0 about C in the right coordinate frame (11,0, ()2. The scanning process as just described is an idealization because it models a process that originates from zero volume of active material—the material “contained” in the n-O plane. In order for the model to be meaningful, f must be interpreted as the average concentration of radioactive material within a neighborhood of this plane: let fvol represent the true volumetric density of this material, and let a be the 2Note the slightly overloaded use of the theta symbol: in bold phase, it stands for the array of sensors’ unit vector, and in plane phase, it denotes the relative rotation of this unit vector and the object. The distinction is clear and should not cause any problem. 119 Figure 4.2. Projection geometry for the intensity f (x): The unit vector 9 defines the direction of the detector array on a plane perpendicular to C, While 3 identifies a sensor Within the array. f (x) is the density of radiopharmaceutical at x. aperture gain (loss) of any one detector’s collimator, then f(x) '3) = / a(:(:{‘:08‘;)tC) fvol(x + tC) dt is the effective density of radiopharmaceutical which is responsible for the activity sensed by the detector at 50. If, indeed, this density is a function of s as indicated, it would prove the poor selectivity of each collimator’s aperture, resulting in blurring effects. In general, however, the aperture for rectangular collimators may be expressed as the product of the in-plane gain a1, and the gain or) due to the displacement in the C-direction, i.e., a(x — 30 + tC) = a1(x — 30) a2(t(). Thus, f(x) = / a(x) fvol(x+tC)dt (4.1) Clearly, the narrower the support for a2, the finer the detail conveyed by f in the C-direction, however, at the expense of lower intensity and greater susceptibility to 120 noise. As photons travel towards the array of detectors from within the subject, many are absorbed by nuclear interactions with the surrounding matter and only a fraction of the original number survive to be registered and counted. Let u(z)dt be the probability that a photon reaching the point z will be absorbed within dt units from 2. Then, the number of photons 09(3) that during a time interval T reach the detector at 30 due to the radioactivity at a point x with the object rotated an angle —0 is Poisson distributed: c9(s)|)\9(s) ~ Poisson(/\g(s)), where the underlying intensity’s projection A9(s) is given by3 A9(s) = / a(x — se) f(Rg(x))e'I10 ”(Maul-“898‘ dx. (4.2) f is as defined in (4.1), u is known as the absorption loss coefficient, R9(x) = eK9(x— 17) + n, and K = (‘1’ ’01). The time units have been chosen so that T = 1. Since ex“ is the transformation that rotates a vector by an angle 8 about C, R9(y) transforms the vector y by rotating its y—n component by this angle. Thus, f (R4)(x)) is a measure of the radioactivity at x after the object has been rotated by the angle —8 about C. The exponential efo1 “malfxflhtlaalldt gives the total absorption loss for the trajectory through the rotated object from x to 30. Recovering f from the set of projections /\9(8) represents a very difficult problem. There exists, however, an exact solution to (4.2) when the attenuation function u is constant on a convex region that includes the support of f (assumed compact), and if the aperture gain at 39 is zero everywhere off the perpendicular to 0 at 36 [75]. 3Since all but the unit vector C considered here lie on the (17,0) plane, we restrict the notation to a two-coordinate format from now on. 121 Since in practice the absorption coefficient is generally unknown, it is common to ignore it and compensate for its effect by postprocessing the reconstructed image. This approach is computationally intensive and leads to suboptimal results as the effect from absorption is nonlinear and, consequently, cannot be fully accounted for independently of the reconstruction process [76]. For this reason, in applications in which the value of ,u is small, it is often ignored altogether. For the remaining of this chapter, we only consider examples where it is safe to do so.4 For this simplified model the solution is )=2—7r2‘/‘1r /( (—x 77) )r(81()90_3d8d6' (4.3) The superscript T denotes transposition, and Ag(s) stands for the derivative “355—31. This expression is one incarnation of the celebrated inverse Radon transform and is the basis for the FBP reconstruction technique—see Section 4.3 and [75]. The assumption of having only those photons counted that strike the detector array at right angles implies the decoupling of the counting processes taking place among the sensors. This is an idealization that is invariably made in practice, but which results in blurred images. Much of this blurring, however, can be removed by high-pass filtering if noiseless estimates for the projections Ag are available. Since the actual sensors’ aperture gain a acts as a spatial low-pass filter on the projections, the linear shift-invariant operation may be reversed by operating on the reconstructed image given that the inverse Radon transform and convolution operators commute. For simplicity’s sake, we ignore blurring effects in the examples to follow and concen- trate on the most detrimental factors in nuclear medicine tomography imaging: data variability due to its Poisson statistics. 4In Section 4.7, however, we show how to correct for the lack of symmetry in the effective attenuations associated with A9(s) and A9+1800 (s) when present. 122 ECT methods aim to recover the function f from a finite set of measurements c9(s). For example, for K detectors in the array and N projections, s = —%, . .. ,§ — 1 and 8 = 0, 2W”, . .. ,(N — n27}. Sometimes it is helpful to work with a normalized length for the array of sensors. We choose a length of 2 length-units per aperture (per array of sensors) so that. now, 3 = —1,—1 + 725, . .. , 1 — %. Also, in the interest of brevity, we use the alternate indexing c2, which denotes the counts from projection angle0=iiQW7r andsensors= %k—1. Here, 7220,... ,N—1andk=0,... ,K—l, where it is always assumed K = 2J for some positive integer J. The data vectors c" E [c}_1, . . . ,c3]T arranged in matrix form c E [c°, . . . ,cN‘l] constitutes the data sinogram. The corresponding intensity sinogram is given by A E [A°,... ,AN '1], where each column vector is similarly defined as A” E [A'I’{_1,... ,A3]T. Note that 2(k+1)/K—1 2:] Azwwn(s)ds, (4.4) 2k/K—1 the highest resolution unnormalized Haar scaling coefficient at position (sensor) k of the projection at 9 = ~1;—(;-n.A similar relation holds for CE, which is the total photon count in the (% - 1, @K — 1) interval. Each set of K measurements in a projection constitute a finite sampling of (4.2), and cannot strictly satisfy the Nyquist sampling requirement for the object f is of finite extent, and therefore, of infinite frequency support. Consequently, the quality of the output of any method that one may devise to manipulate the data is compromised. Since there exists no recourse to amend this common degradation, we assume the “K-sampling” to be fine enough that all high frequency features of interest may be reconstructed without distortion. 123 4.3 The Filtered-Backprojection Reconstruction Technique The F BP method for inverting the Radon transform of an image may be deduced using the Fourier Slice Theorem [77]. Using two—dimensional polar notation for the object function, i.e., writing f (r, ()6) instead of f (x),5 and similarly for its Fourier transform, this theorem simply states that f(w, 6) = Ap(w), that is, the Fourier Transform f may be written in terms of the projections’ Fourier transforms. Since the Fourier inversion formula can also be expressed as f (r, (t) = 47; f0 f0; f(w 6)e|w| “”05“ 4’) dwd6, a simple substitution leads to f(r, 46): é]; TAO (rc-os(6 ¢))d6 (4.5) where TA9(s _ 27 f0; /\,9(w )[w] e’“ did. This transformation represents a high-pass filter operation on the projections A9. The frequency response of the filter is [W]. The integration (4.5) suggests that f (r, Q5) is the result of a projection operation on TAg, thus, the name filtered-backprojection method. This projection, however, is not equivalent to the projection process by which the set of Ag’s are formed, for some non—linear weighting is implicitly applied to 7A9, but we shall not elaborate on this. Practical Radon inversion algorithms often begin by discretizing the expression in (4.5). Although radically distinct in nature to algebraic approaches (ART), these methods may become similar in form [78]; but, most often, (4.5) is implemented by means of the FFT for computational efficiency reasons. When this is done, band-pass filters are chosen to replace the high-pass filter so that excessive noise amplification 5The various parameters relate as follows. Referring to expression (4.3), (x - n)TeK96 is a dot product of the vectors x — n and eK96. Their magnitudes are Ix — n] = r and 1, respectively. Thus, (x — n)TeK96 = rcos(u), for some angle 11, and <5 = 6 — u. 124 will not occur past the object’s main frequency band. These filters by necessity represent a compromise between recovering high resolution details and minimizing high-frequency noise. 4.4 Some Limiting Aspects of Conventional Re- construction Methods Classical reconstruction methods applied to SPECT produce undesirable and highly variable image reconstructions due to the Poisson statistics of the projection data. In practice, it is common to postprocess the raw reconstructions with a low-pass smoothing filter to improve the images. This approach often produces visually ap- pealing results, but can lead to detrimental loss in resolution and fine detail structure. More advanced methods estimate the intensity sinogram A from the noisy data. For example, Wiener filtering [79] approaches have been applied to the projections prior to reconstruction. Furthermore, (fully) Bayesian approaches to this problem have been proposed based on Markov random field models, e. g., [80, 81]. However, these methods are very computationally expensive. One possible remedy to these methods’ complexity consists of simply estimating the individual intensity projections A" from the set of counts, and then applying an inversion method (e.g., FBP) to obtain an approximation to the function f. This method is widely used in ECT [69, 75, 77], and in practice, the estimation is carried out by simply low-pass filtering the data. This approach, although straightforward, is highly suboptimal for the following reasons: 1. Due to the Poisson distribution of the data, the noise is signal dependent, and so, the linear filtering process cannot optimally remove it everywhere—see Chapter 3. 125 2. High frequency intensity details in the underlying signal projections are as at- tenuated as the noise. 3. Useful information that may exist in adjacent and nearby data projections is not exploited towards leveraging the estimation process. 4. The greater residual noise in the projection estimates due to points 1 and 3 leads to a degraded performance of any subsequent restoring operation, including the inverse Radon transformation. For example, a deblurring filter (essentially a high-pass filter) will emphasize noise in high-frequency bands. One very simple approach to improving the quality of the estimates for the un- derlying intensity’s projections is to impose the 1-d MMI model on each individual projection, and estimate accordingly. Doing this, however, induces the strong con- dition that any two consecutive projections X’ and An“ be independent. Since the n+1,and actual projections’ estimates are driven by the corresponding data c" and c since these data projections are correlated, the effect of this assumption is lessened. Nevertheless, given that in nuclear medicine the data’s SNR is typically low, the un- realistic prior model tends to exert a strong influence on the estimator’s outcome. Our own studies indicate that this does in fact occur, with the consequence of having highly detrimental circular artifacts appearing in the reconstructed images. A step towards correcting for the projections’ independence condition consists of imposing the 2-d MMI model on an image in the projection space (i.e., sinogram.) In the following sections, we introduce two extensions to the basic 2—d MMI modeling approach introduced in the last chapter, and which lead to estimators that produce well—coordinated estimates of the intensity sinogram and improved reconstructed in- tensity image. 126 (a) (b) Figure 4.3. Shepp—Logan head phantom. (a) Phantom intensity. (b) Phantom inten- sity sinogram. 4.5 First Approach to Multiscale Modeling and Estimation of ECT Intensities One important characteristic of a sinogram A is that point sources in the original object f correspond to sinusoids in the sinogram (thus, its name); each point corre- sponds to a unique amplitude-phase pair, but all have the same period. As a result, a very strong correlation exists among consecutive projections. This phenomenon is clearly displayed in Figure 4.3(b), which is the intensity sinogram of the Shepp-Logan head phantom image in Figure 4.3(a). Correlation among neighboring pixels suggests that a MMI-like model may be applicable to the intensity sinograms as they appear to be characterized by smooth intensity variations with a few discontinuities. To understand the mechanism by which this characteristic emerges in sinograms corresponding to intensity images that are themselves well—modeled by the MMI paradigm, the concept of natural pixels (NP) is helpful. Originally developed for the incomplete data tomographic problem [78, 82], NPs give us the means to visualize the relationship of corresponding samples 127 5?’ PT my: 3’: it i by: PF ‘k+1 My“; ' 111 /""'\ f \ ’ [I] Cf \ j \) <3 <3 V )i ’ L’L’T L) NPs/ T (a) (b) Figure 4.4. Natural Pixels. (a) Projection samples at scale j are obtained by integrat— ing the object over strips 2j “sensors” wide. Thus, the highest resolution achievable is determined by the sensors’ apertures, and correspond to samples at scale j = 0. (b) Natural pixels are delimited by the overlap of the integration strips corresponding to the same sample position It at two consecutive projection angles 71 and n + 1. in consecutive projections. We use this newly gained insight to extend the basic MMI approach of Chapter 3 to the ECT modeling and estimation problem. 4.5.1 The SI-MMI Method In emission tomography, sinograms are 2—d positive functions, and so we may represent them at multiple scales just as we did the intensity images of Chapter 3. As before, the representation at scale j is denoted by A,-, where for j = 0, A0 E A. Clearly, A,- may be obtained directly from the object f by simply projecting through wider swaths. The set. of projection strips belonging to two consecutive projections A? and A?“ at scale j are shown in Figure 4.4(b). The overlap of the two samples A2,, and A}? delimit the natural pixel NPg",c shown at the center of the figure. It is clear 128 that, under the smooth-intensity—variation-with-a—few-discontinuities assumption, the greatest contribution of the object to these samples comes from the Nng region, and only a small fraction AA?)c is contributed by the regions outside it. That is, 3’), = NP},c + AAZk where typically Nng >> AAy‘k; except at fine scales, where the residuals AAg-fk’s may take on significant values as the areas of integration they encompass increase and those of the NP’s decrease. In the object’s sinogram, A?)c and AZ? appear as two horizontally adjacent pixels at the kth row. The perturbation 5 associated with these pixels can be written in terms of the natural pixel they define: AA}, — mg? 6 = n n n z 0 (4.6) AA“, + AAJ-jtl + 2NPM This shows that the “horizontal” perturbations will tend to be concentrated around zero, and given the symmetry of the NPs, these will be symmetrically distributed as well. Note that for the perturbation to take on an extreme value of 1 or -1, it is necessary that either AAg-fk or AAEII, but not both, assume the -NP value; a very unlikely situation as demonstrated by the geometry of integration. Perturbation variates corresponding to differences between vertically adjacent samples in the intensity sinogram also behave in the desired manner. This is be- cause the projections themselves enhance the smoothness characteristic of the orig- inal object through the integration process (see Fig. 4.4(a)). Specifically, since a projection sample is simply a scaled value of an average taken over an integration strip, a projection sample represents, within a fixed factor, the “typical” intensity6 that the object assumes in the integration strip. Therefore, the perturbation 6 asso- ciated with any two typical and adjacent samples of f behaves in similar manner to those perturbations associated with the object itself. Histograms for perturbations 6Note that these “typical” values need not ever occur. 129 between vertically adjacent pixels in synthetic and clinical data have verified this general behavior, and resemble the histograms in Figure 3.3. The behavior of the perturbation (innovation) variates according to the desired prior does not on itself warrant the suitability of the MMI model, for the intensity variation in sinograms is arranged in a highly structured fashion, and so, the assumed independence of the innovations in the model may not apply. We have taken a practical approach and have sought to empirically justify the model’s applicability. The steps taken are: 1. Form sinogram c from raw projection data {c2}. 2. Obtain optimal estimate A of underlying intensity sinogram A from data sinogram c using the 2-d SI-MMI-model based estimator of Chapter 3. 3. Reconstruct intensity image estimate ffrom sinogram estimate A by the F BP method. We call this the sinogram-image-SI-MMI—model—based filtered-backprojection recon- struction method, or the SI-MMI approach to ECT, for short. 4.5.2 Example To demonstrate the performance of the new approach to SPECT, we have applied it to two sets of data: the Shepp—Logan head phantom and real data from a human pelvic clinical study. The two data sets have been processed in three different ways so that comparisons can be made and a relative degree of performance can be determined. In Figure 4.5(b), an (unprocessed) reconstruction of the phantom data based on Poisson data projections is presented. Figure 4.5(c) shows a lowpass filtered7 version of Fig. 4.5(b). The smoothing process removes high frequency noise at the expense 7A 2-d Butterworth filter with cut-off 7r / 3 rad. was used in both examples of this section. 130 of some loss of high definition detail in the image. Finally, Figure 4.5(d) displays the reconstructed image using the new sinogram estimation method. In this new image, high definition features (e. 9., edges) are clearly preserved despite the high degree of noise reduction. Figures 4.6(b), (c), and (d) of the pelvic bone study were obtained by the same processes used in the corresponding phantom figures. Note that in Figure 4.6(d), the high definition image structure becomes evident after processing according to the new approach. This is in contrast with the result in Figure 4.6(c), where the image is seen to be oversmoothed to achieve a similar degree of noise removal. 131 Figure 4.5. Shepp-Logan head phantom image reconstruction. (a) Phantom. (b) Phantom reconstruction from noisy data. (c) Lowpass filtered image from (b). (d ) Phantom reconstruction from MMI-processed sinogram. 132 (C) (d) Figure 4.6. Pelvic bone study image reconstruction. (a) Sinogram of pelvic bone. (b) Pelvic bone reconstruction from noisy data. (c) Lowpass filtered image from (b). (d) Pelvic bone reconstruction from AVIAN-processed sinogram. 133 4.6 A New Multiscale-Based Tomographic Inver- sion Method Despite the substantial improvements in quality and fidelity of reconstruction that the SI-MMI approach represents, we recognize that the always present sinusoidal struc- ture of sinogram images has not been fully exploited by this method, when it could be used to leverage the reconstruction process. The second method to ECT to be introduced in Section 4.7 is based on the tomographic reconstruction approach devel- oped here and which itself evolves from considerations of this sinusoidal extructure. As a result, the new inversion formula reconstructs the desired images not from their corresponding sinograms, but from their cumulative sinogram intensities. This will be especially important in the ECT problem, where for low SNR data sinograms the high variability of the data significantly undermines this structural constraint in sino- grams. In Section 4.6.3 a fast implementation algorithm for the new Radon-inverse transform is also presented. There exist other multiscale-base tomographic reconstruction techniques [83, 84, 85, 86], but none is statistically motivated and offers no advantage to the ECT re- construction problem. 4.6.1 The Multiscale Reconstruction Formula We begin with the inverse Radon transform (4.5), which may also be expressed in terms of the Hilbert transform8 of the derivative 13(3) of the intensity projections: f(r, ct) = 217;]: ’HAz,(r cos(9 - ¢)) (16. (4.7) 8See Section CI of Appendix C for the definition of the Hilbert transform and some alternate representations. 134 This the mar w l1 and z [a guarant offinite 1 when 3p): are met. Since y with j), _= I This is equivalent to (4.3). For simplicity, and without loss of generality, we assume the support of f to be contained in a disc of radius rmax < 1. Additionally, differen- tiation of the intensity projections are assumed to exist everywhere in [—1,1], with 1;,(1) = Ag(—1) = 0. Differentiation at the boundaries are of left and right type. In Section C.1 we prove that 7119(3) = —%fol{A’0(0’2(T)) - A;,(01(7'))}“7T with 01(r) = (—1 4 s) r + s and 02(7) = (1 — s) r + 3. Substituting this into (4.7) results in where fo"{At(t(r)) — X9(t(—T))} d6, for —1_<_ T g 1 90", a); T) E (4.9) 0, otherwise and t(r) _=_ T + (1 — |r|) rcos(6 — (b) for all r in IR. In arriving at (4.8) we have made use of the Fubini-Tonelli Theorem [28], which guarantees the interchangeability of the integration operations as we assume f to be of finite energy. In the rest of this chapter, however, we generally make no special note when appealing to this theorem, and give as understood that the required conditions are met. Since 9 is real and odd in 7', its Fourier transform Q is imaginary and odd. Then, with g, E Im{§}, _1 00 90333:?) g.(r,¢;w)sinq§;i> 3.} d9. This expression accounts for the fact that (A9 0 t)(1) = (A9 0 t)(—1) = 0. Here, 0 denotes the composition operation. 136 We may express Ag ot separately in the intervals [—1,0] and [0,1] in terms of Haar wavelet expansions (see Chapter 2). With p;k(r, <15; 6) and p;k(r, a); 6) denoting its wavelet coefficients corresponding to these intervals at scale j and shift 19, the expansions are (Aeot)- )=Zp;..( rcb; outta) j,k (Aoot)( ijirrtimd) jk both for 7' E [0, 1]. Substituting these in the above integrals yields . . _ 7' LUZ: k pjk( 7‘ ¢;6) )fo wjk(r T)cos(wr)dr gi(r,¢,w)—2/0 {w 1—rcos(6— 4;) +2212” p;k(r, (i); 6) fol 2,15,),(7') cos(wr) dr d6 1 + rcos(6 -— 45) (4.14) — —2w 2 2 jij( r, q)()/0 2i)(2"jr - k)cos(wr)dr j<0 o<1~<2- 1- 1 + 2 sin(w) Z 2’1 Qj,o(7“, (25), 321 where -w f012,l( 2,l(2‘ jr— k) cos(.or)dr = sin(23kw)— 2sin(2j(k+1/2)w)+sin(2j(k+1)w) and + _ . = '/2 1r pj.k(ra 65; 6) pj,k(r1 ¢i 6) 03.1.6.8) _ 21 f0 {1_TCOS(6_¢) +1+TCOS(9_¢) d6. (4.15) The factor 21/2 is required in this definition so that QM is independent of scale for 1'21. 137 Now, (4.14) into (4.12) becomes f(mb) = 5’; Z mama) + 47172 c2100: 4). 130 0932-141 where 7/27r forj=0, k=0 3/7r-2‘23 forj<0, k=0 ajk E _21 . l/VT'kTE—Ejmj f0fj<0,0'.11ll ‘ . 1 ' "llr'l'rl‘Vffi 01A”. 1. . 1 -'r.'I-f'.-l(lflfl$(“ 1.1 _3 1.~|1’:|.'h-.H!'fl :. *1 to uni: ‘ l. . ‘ . . . . 1 '|' '1‘. b) b‘lfl ;.1‘ l ‘ . 1‘ 1 l4 .4' ' 7 . _ '1" '9‘. - “» ev'l n m 18! 111.“ 2 V ' 111 I .1 :t’ .i 11$ 4 "11"“? ”(W ; -- . -. -~..1.a) 31—3 ‘11lJP“1-'!:lj'-"11";‘.3." .1. ‘. . . ~1 3 1 31121409“ . - , , , .1 Ayélugli'OI-mmml -Illnll'..1. ' ‘ . . V 1 .. 11" 1.9.1121“! 11.111.11.111” .r'»lr.151l!_';."'l .1 ' -~ 5 . 1 .1 ...: 11m: 1‘ . '~ 1111:“ alum rm Ml M13 "V. — 11; 1 '1. 1 1111111961156“! . '\ Waluusdmh .21. "~ )1 ii “1 - . ~ 51- 1' .“ .1. ”-3,. mm- 1 1' . . . . - firs .‘(hmlwmpa '10 Jury—11.134114. 1. 1 m 1 ' .1 :1; mun my id In: 12",“; I. 1,. x. J, l. .13.), In") m? Whlifffial "Mflf'NJ-v" ‘nl‘lh l'J-u‘“: 1'3" .111 :m "a'u‘. '4 .1 WIN! I,“ 91”"4‘. d4 mm- o! “Syn-E1 4.1"» '-1.- r .- .7111.» |:;./ 1 1'; 4 . W Ihilld III) a! 41-51 6311611315651? human? 1111-11311: 3 'J-asmd 30 Ml ‘ ‘ 155‘6’ fikflkiq 31 In H. W 611-! g 1 frequency contribution extends to infinity. However, beyond 6471 radians per length- unit, its contribution, and therefore, that of /\2_«,,, has been reduced by over 40 dB from its peak value at d.c. Consequently, the sampling need only satisfy the Nyquist requirement for, say, 647r radians per length-unit;10 and this occurs for K _>_ 16. Since this is the case for practically all tomographic studies of interest that we may be concerned with, we conclude that each A? can be recovered from the K samples A2. Thus, {:1}; Sa( (lg—(t — 72(1)) (4.29) The projections A23" are finite in extent and so, we can think of them as consisting of zeros beyond the aperture’s boundaries, if necessary. Then, any integration of A2W1r over an arbitrary interval [51, {2], where {1 S 52, can be computed as 52 — n — n A Ai—Tn(5) d8 _ (A%(l+£1)+l %(1+€2l+l) 1 1:0 = i A}:{Sa(w[(%)2(1+51)+%l-kl) _s.(.[(4)2(1+,,)44144)}, With L— _ [if (1 " élll Equivalently, £2 L_l A to; E— 2 1 1'4.) 5. 2 V / A2an(8)ds = Z (A: rect(w/27r) (e < 2) (1+1.C ) _ e ( 2) (1+£2))) £1 [:0 271 _, ‘1’ sin(wK/4) e 2 2 1 7' Iv: sin(wKL/4) _,K(z;—nw (64%) n+4.) Bid-’5) n+4») where 3:, denotes the discrete-time Fourier transform of the extended projection ”Here, we assume that A?“ has most of its energy concentrated below 64 cycles per aperture. 144 (- -- ,0, A3, A32 - - - ,/\’}\,_1,0, - - - ). This expression leads to the desired result: 52 f X 51 ‘ n(8) ds = (if, 8:1(213235) )v zl': k- K2(1+g)—K(L—1) — 4 — (8:11:21?) (4.30) k- K2(1+52)—K(L—1) - 4 Expression (4.30) allows us to compute the integral of A 2%,, over any chosen interval by means of the fast Fourier transform. In general, the integration values obtained are only approximations, but this is of no concern for they may be made as accurate as necessary by simply extending the length of the FFT by zero padding. The coefficients 62k and X31, can, therefore, be computed accurately by applying (4.30) to (4.26) and (4.27). Unfortunately, (4.30) cannot be computed efficiently since the indicated inverse Fourier transformations must be recomputed for every change of the integration’s lower limit 51, as the parameter L is a function of it. Next, we introduce a new formulation that avoids the computation of these coefficients altogether, and which leads to a very efficient inversion algorithm. 4.6.3 A Fast Multiscale Radon-Inverse Transform Algorithm Define the nth cumulative projection as W ' 53(3) 5 [1 /\2n,,(t) dt (4.31) for —1 S s S 1, and every n. = O, . . . ,N — 1, and the associated (normalized) quantity u" r = /-\"(2jk + (1 — 23k) rcos(cb — 21%,,» 4.32 J'k( ’05) — (1 — [21k + (1 - 21k)rcos(¢ — -21%n)])2 ( ) ~ for j S 0 and 0 S k 3 Ti — 1. Note that :\"(rcos(q§ — 2W5?» = 3,0(7‘, (15), 01‘ e(luivalently, (1 — rcos(gb — 2fin)2 V8;0(T, gb) = 5‘3’0(r, ()5), with 513.0(1‘, qb) as defined in 145 (4.27). The normalized wavelet coefficients in (4.26) corresponding to the warped projec- tions A15" 0 t may now be expressed as N 31,147”, 95) (1 — rcos(d) — —.—n))2 —(1 - 231k +1/2))2.".1/2( 91>) + (1 — 2"(k- + 1))2-v31..1(r,¢). =(1 -2jk)21/j,k(7',¢) Substituting this into (4.25), the following reconstruction formula results. 23—1 N:{ m+z:a..[1—2jk>2u. was) j<0k=0 — 2(1— 2% +1/2»? Van/2e, <1) + (1 — 2% +1»? M a] } (4.33) In practice, we are always interested in a discretized reconstruction formulation for computer implementation purposes, and so, the reconstructions are necessarily of finite resolution; thus, it suffices to work with only a finite number of scales. With J _<_ 0, we denote by f J the intensity image reconstructed from the —J coarsest scales in (4.33). The scale-limited inversion formula may be written in simpler form if we first gather the VJ’fk(r, gt) terms with equal value for all allowed combinations of parameters j and It For example, it is always true that V_3 k(r qb)=|k 1 = VEQ‘k+1/2(r,q§) = k=0 V33’k+1(7‘, (15)] k=0' Doing this yields 1 — 2J_lk)2 flu: V'}_1,k(73 <15) (4-34) m, <1): 71‘; ( where fiJfl : 41-J/71' (4.35) 146 {mil -'.~ flu-"1. -l 315.3‘1111152m" .1‘r.Z°'.‘- . 1 i1 'iH V it'll“ {(3.}. ,1 ' dih'tu‘) tiff” ,_. LAWN?” M! m u. . 1 ' I 1 - __‘ - D '3 L r.\,‘€~l ,1- v.11," q“ om . . “11"qu ‘ ,* ~77" ' 4:111 , . .~‘1|2.-1§‘\~M71Yd, . . s... humid-m .‘ x' , , wawo.‘ .111 91mm, 3 1/ «11.11.111.00 , .... ..1 J (. .. x 7 i. . «.....mf‘“ and, for 1 S k S 2‘J+1 — 1, —2QJ‘(k_1)/2 for (6 Odd (3.1,k = —2aJ+m. (2_m-k_1)/2 for It even. (4-36) J+m‘ —la + 2.1;] Qj 2-j+J—1k 'i” aj‘Q—j+J-1k_1 Here, m‘ E max{m E N U {0} |2”"k E N}. The reconstruction formula (4.34) represents a fast multiscale inversion algorithm for the Radon transform, and to our knowledge, is the most efficient algorithm to- day. Its complexity is less than that of fast-Fourier-transform based approaches by a factor of log K. This estimate assumes that all the coefficients are precomputed to reconstruction, an easily satisfied condition. Figure 4.8 displays the general behavior of the coefficients (1 — 2J‘1)2 BM used in (4.34). The plots correspond to the case of J = —5, but the fast decay behavior of the coefficients as a function of k is observed whenever J < 0. From Figure 4.8 (b), it is evident that at k = 8, the factor (1 — 2"“)2 )6”, has been reduced by over 50 dB from its value at k = 0. This and the fact that the data V3_1,k(r, ()5) is non-increasing with k, justifies the simpler reconstruction formula '2 H a... O I y—a (1 — 21—116)? ,Bj‘k Vii—1&0" (b), (4.37) fJ(T9¢) : 1%f I! c a- II o n where [so is a small positive integer, say, k0 = 8. For values of J much smaller than -5, however, k0 needs to be increased accordingly. As mentioned before, an important advantage of the new formulation is that it allows us to selectively reconstruct an image at any desired scale supported by the data. Also, since the algorithm does not assume a discretized integration process in the data collection stage, it is better suited for real-world applications. In fact, our experiments have shown that for simulated data where the projections are the result 147 50 H 1onog(1-2J")"’ IB /8 I ) , o——-<> 10 log IBJJ‘I BJ‘OI JJK J.0 70 Figure 4.8. Tomography reconstruction-formula coefficients. (a) (1 — 2’”)2 ,8“, versus k. (b) 1010g (flu/flyol and lOlog(1 — 2J‘1)2 IfiM/fiyol versus 1:. In both cases (a) and (b), J = -5. These graphs reveal the rapid decay of the coefficients’ magnitude as k increases. of a simple counting process, not a true integration operation, the new formulation does not perform as well as conventional discretized algorithms. Here, however, we are mostly interested in ECT applications, to which the fast tomographic inversion formula is especially well suited as we show next. 4.7 Second Approach to Multiscale Modeling and Estimation of ECT Intensities The SI-MMI modeling approach to ECT applications relies on the proximity of adja- cent pixels in the intensity sinogram. In Section 4.5, it was argued that such proximity typically exists in the vertical direction (same projection vector) due to the proximity often encountered among adjacent samples in the intensity image f, which when inte- 148 grated together, only reinforces this characteristic. On the other hand, the proximity of pixels in the horizontal direction was justified by the large overlap of the integra- tion swaths corresponding to consecutive pixels. The regions of overlap defined the natural pixels. We also noted that the inequality N Pg“,c >> AASPJC, which validated (4.6), could not be justified for narrow integration swaths (high-resolution scale projections). Only as the increment angle between projections is significantly reduced, does the difference NPR}. — AA}; become large. Therefore, the SI-MMI method can only be justified for low resolution projections when the angular sampling is also coarse. Certainly, in these cases, it would be suboptimal to have to work with coarse scale representations even though high-resolution projection data are available. To circumvent this restriction, we introduce a new approach to modeling, esti- mating, and reconstructing the intensity image from its Poisson data sinogram. 4.7.1 The CSI-MMI Method We define a discretized version of the cumulative projection (4.31) as X" E [_’,‘{_1,... ,X3]T, with elements 5.2 E X"(%k — 1) for k = 0,... ,K — 1. Then, we let the cumulative sinogram image (CSI) corresponding to these projections be A E [ie, . .. ,XN-l]. The cumulative projections making up the CSI represent integrations of the object f over increasingly wider swaths, starting at the top end of the projection vectors. Therefore, the natural pixels associated with horizontally adjacent samples (pixels) increasingly approximate more accurately the value of these samples as we move from top to bottom in the CSI. Consequently, the intensity profiles (X2, 51),, . . . if”) be- come smoother as we choose smaller values of k. In fact, in an ideal tomographic projection process, integration over the entire object is the same regardless of pro jec- tion angle, and so, 3.8 = - -- = 3.6V“, i.e., a constant intensity profile. 149 From this, we may model the intensity versus angle corresponding to each row of the CSI by a 1-d MMI model, each with a different prior. For example, the prior corresponding to (Xfi +1, X}, +1, . . . ,5»? +31) would have parameters that reflect the greater variability of the intensity than, say, the intensity (X0, X1, . . . , 5.5-1). In terms of the various beta density components of the mixture prior for the perturbation variate 6 introduced in Chapter 3, those representing high variability, e.g., those defined by small values of the shape parameter 5,, will have higher contribution to the model for high values of k than for those of low values of k. The modeling of the CSI’s intensity variation in the vertical direction, i.e., X", proves more problematic than for the case of the standard (non-cumulative) sino- gram. This is because in the cumulative projections, the consecutive pixels may not be modeled by independent innovation variates—as is characteristic of MMI-based models—given their strong dependence due to their construction, particularly for low values of k. We have carried out numerous experiments where we have neglected this fact in a pragmatic pursuit of the “what if”, just to be reminded by the very poor results that in this case the fact is not to be ignored. The need for a suitable model for the individual cumulative projections result- ing from ECT studies arises only from the necessity to estimate their underlying intensities before reconstruction. In Section 4.5 we saw the great impact that a good model can make to the tomographic reconstruction process. However, because the multiscale—based tomographic reconstruction approach developed in the previ- ous section only makes use of the cumulative sinogram—as opposed to the standard Sinogram—the Poisson-distributed data is an excellent representation of the underly- ing intensity themselves. To justify this, we make the following observations: 1. Much of the CSI Poisson data have high SNR due to the rapid accumulation of high counts obtained by construction. 150 2. CSI data characterized by low counts enter the reconstruction process with little weight due (1) to its relative low magnitude, and (ii) to the weight given by the coefficients of the reconstruction formula (4.34) (See also Figure 4.8.) Modeling the CSI’s intensity by the data has the advantage of simplicity, but more importantly, it helps to maintain a higher degree of truth in the reconstructed image. Moreover, due to the accumulation process by which the CSI is formed, the coordination of the estimation of the underlying intensity in the vertical direction—as k varies—is assured despite the fact that we only impose the 1-d MMI model in the horizontal direction—as the angle varies. With this model at hand, the underlying intensity can be estimated using the Bayesian estimator introduced in the last Chap— ter. We call this approach to modeling, estimation, and reconstruction the CSI-MMI method. 4.7.2 Example Figure 4.9 is a pelvic bone study obtained using the CSI-MMI method. Fig- ure 4.9 (c) shows the pelvic bone reconstructed image obtained from raw data using the multiscale-based inversion approach of Section 4.6. In Figure 4.9 (d), the under- lying intensity of the same raw data used in (c) has been estimated and reconstructed by the CSI-MMI method. For purposes of comparison, we have repeated the images of Figures 4.6 (b) and (d) in Figures 4.9 (a) and (b), respectively. These images correspond to the first pelvic study presented in the example of Section 4.5. The CSI-MMI processed image in Figure 4.9 (d) reveals a greater amount of detail than does the SI-MMI processed image of Figure 4.9 (b). In particular, note that what appears as two bright blobs in the upper left of the SI-MMI image, it becomes better represented in the result obtained by the CSI-MMI method. In this case, the features preserve the structure found in the unprocessed image of Figure (a). Likewise, the 151 three bright areas easily distinguishable in the lower left of Figure 4.9 (d) are almost impossible to pick out in the SI-MMI image, although their existence is well suggested by a narrow band of bright pixels. The images reconstructed directly from raw data using the FBP algorithm and the new multiscale reconstruction method display greater detail than either image in Figures 4.9 (b) and (d); however, the variability of intensity of their pixels is large enough that makes it very difficult to determine the true structure within the images. Clearly, this is especially true of Figure 4.9 (a), demonstrating the great advantage of the multiscale inversion algorithm alone. In addition to the above differences of image quality among the set, there exist several other features in the CSI-MMI image that are hardly discernible in the SI— MMI and the raw filtered backpropagated images of Figures 4.9 (a) and (b). One example is the very prevalent small dark region near the center of the image which lies on the diagonal going from the lower left to the upper right corners. While in Figures 4.9 (a) the region is well defined, the variability of pixels surrounding it makes it an unreliable feature. In contrast to the virtual disappearance of this feature in the SI-MMI image, the region is well defined in the image constructed with the new multiscale approach. Although we deem the CSI-MMI methods superior to the SI-MMI approach from the above results, we believe that in practical nuclear medicine and other emission tomgraphy studies, there is a real advantage by working with both type of processed images that we have introduced in this chapter. While the CSI-MMI method offers greater fidelity of images, the SI-MMI-based images displays the greater structures of objects in a more visually appealing manner without unduly sacrificing too much small detail. 152 (01) Figure 4.9. Second pelvic bone study image reconstruction. (a) Pelvic bone from noisy data using a F BP—based reconstruction. (b) Pelvic bone from SI-MMI—processed sinogram using a F BP-based reconstruction. (c) Pelvic bone from raw data using the multiscale-based reconstruction of Section 4.6. (d) Pelvic bone from CSI-MMI- processed sinogram using the multiscale-based reconstruction. 153 CHAPTER 5 Conclusions We have introduced a view of the multiscale structure of processes that succinctly expresses the limitations of models constructed from limited—scale information. At the same time, this view has made apparent the potential advantages of modeling within the multiscale framework for estimation purposes. For this, we introduced a unifying approach to characterizing models and estimators alike. The measures of anomy, accuracy, precision, and resolution power were introduced in order to char- acterize models and quantify their degree of goodness as estimators. While formally defined using the information-theoretic concept of distance, the new concepts reflect our common-day sense of the terms, and so, they are appealing to use and general in nature. Also, the new characterization of models and estimators has given a clearer view on interplay between scale (space / frequency) resolution and information resolu— tion (resolution power). We have given general guidelines for obtaining linear transformations to construct Gaussian multiscale models that are potentially better suited for estimation purposes. Much work needs to be done in this respect, but a wide range of possibilities has been opened by the idea of constructing transformations that are statistically motivated by using the criteria set forth here. We argued the various advantages of the Bayesian es- timators within the multiscale framework, and showed how this framework facilitates 154 the formulation of practical priors. We have developed a Bayesian approach for Poisson intensity estimation, and in- troduced a novel MMI prior model for intensity functions based on a multiplicative innovations structure. The MMI captures many of the key features of real-world in- tensity functions and provides an excellent match to the Poisson distribution. The MMI model facilitates a multiscale Bayesian estimation procedure that proceeds in a natural fashion from coarse-to-fine resolutions. The estimator has a simple closed expression and can be implemented in 0(N) operations, where N is the dimension of the finest resolution of the discretized intensity. The issue of choosing the parameters of the prior model was addressed, and a simple moment-matching method was pro- posed for fitting the parameters to a given set of data. The MMI model was extended to a shift-invariant (stationary) model called the SI—MMI model, and it was shown that the SI-MMI model has a 1/ f correlation structure that is more regular than that of the MMI model. We developed a fast version of the SI—MMI which leads to a very efficient algorithm of complexity 0(N). We have illustrated the performance of the multiscale Bayesian estimator by com- paring its performance to other wavelet-based approaches on several benchmark prob- lems. We have also studied the application of the framework to photon-limited imag- ing problems, and examined its potential to improve the quality of nuclear medicine images. The framework also appears to have potential in other applications such as network tomography [4] and network traffic modeling and synthesis. We have introduced two new approaches to the estimation and reconstruction of emission computed tomography images. The SI—MMI method is a simple, computa- tionally efficient, extension to the 2-d MMI approach which similarly incorporates a Bayesian estimator in a very natural way. Its MMI-based prior model is well justified based on the sinogram formation process. The superiority of the SI-MMI over the commonly used FFT-based-reconstruction post-filtering approach was demonstrated 155 with comparative examples using real and simulated data. The second method introduced was the CSI-MMI method. This method is based on the new multiscale approach to tomographic reconstruction also presented here. The reconstruction method was statistically motivated on the ECT problem and facil- itates the robust reconstruction of ECT images. The new Radon-inversion transform formulation allows incorporating some local filtering within the reconstruction steps, which leads to a more efficient. “one-step” process. In future work, we plan to exploit this feature to develop methods to mitigate attenuation and blurring effects. We developed a fast algorithm that implements the reconstruction more efficiently than FF T-based inversion methods. The new inversion approach has the advantage over more traditional reconstruction methods of allowing efficient reconstruction at any desired scale. Fitting the reconstruction scale to the highest resolution supported by the data is important to the efficiency of reconstruction, and to the accuracy and precision of the final image. In future work we will investigate further the application of the new reconstruction method to simulated data, which from our very limited trials, proves not to perform as well as FFT-based inversion approaches. CSI—MMI method estimates the cumulative sinogram image by imposing 1-d MMI priors to each of its rows, and adapting them individually using the data. The robust- ness of the data column-wise in the cumulative sinogram justifies this one dimensional approach and helps the reconstructed image stay true. Also, the one dimensional ap- proach in conjunction with the new inversion method makes the CSI—MMI method a highly computationally efficient alternative to the image formation process from photon-limited projection data. 156 APPENDICES APPENDIX A Appendix to Chapter 2 A.1 Proof of Expression (2.2) Recall that h.j,k(w ): — 51—01(w——_ €0j€0>e ”WW—jg") for all j,k E Z, (A.1) and that f is an L2(lR) function. Thus, by Plancherel formula, f is also in L2(IR). Since w—jéo w_%Ek—j€0 _ 21( £0 ) 1( £0 ) _ 1 (A2) .7 whenever k = 0, and zero otherwise, with 21.060 = 27r f(w)—:—7T—Zl(w—j£0) l w-—-k- J50 f(w—2—7rk) U050 50 “20 u0 _ w— J50 “J50 27f _,2_7r r _ 2).—W )/f(w)1(L—€O_)uozcs(—w w' uok)dw. 158 Using the Poisson formula and the fact that every Fourier series’ component is indi- vidually integrable [19], the expression may be rewritten as —OJ£O w, - J60 ikuO(‘-"_W,) I 1 —— do) )fmfl w) (( 50 > :8 k {Egg-Z ::J( _;__00j(:0)) eikqu/f(w’) 1(w__ _:_0j€0) e—ikuow’dwl (A.4) ],k =(Z (,.jfizk )jizk j.k A.2 On the Monotonicity of I p(X ; A) In this section we show that the average mutual information 1,,(X ;A) is a strictly decreasing function of p, and thus, bijective and invertible; consequently, that p is uniquely determined for each value of Ip(X; A) = 1’. We assume an absolutely con- tinuous model, one with absolutely continuous densities. By defintion now) = [a(XIMPW loggggfgldxdx. where pp(x|A) E “p(x — A), i.e., a uniform density with support of which is the N -dimensional cube C(A; p) centered at A and sides 2p, and Here, * denotes convolution. It is not difficult to show that if yp is an independent rv distributed according to up, then A + yp ~ pp [30]. A well known relation for mutual information establishes that I p(X ; A) = H p(X ) — 159 H p(X IA). In this case, Hp(X) is the entropy of pp(x), or equivalently, the entropy of the rv A+yp. Thus, we write H(A+yp) E Hp(X), when it is clear that H(A+yp) must be understood as a functional over the sample space A x C (A; p). The conditional entropy Hp(X IA) is given by Hp(X|/\) = [pp(X|A)p(/\) log m1”) ddi Up( (x— A) 10 ddi =”(/" / gr— upxl( —A) =/ p(A)/00‘ loglClddi. RN Am) lCl This entropy evaluates to log(2p)N, which is the entropy of yp. Consequently, [p(xiA) = H(A +Yp) _ H(Yp)- Let p2 > p1 > 0, and consider the difference of entropies H (A+yp,) — H (A+yp2), where ym, ypz, and A are all independent. Clearly, H(A—Pym) — H(A+YP2)=(H(A+y/J1)- H(A'l'ypilA» - (H(A+yp2) - H(»\+yml>\)) +H(A+yp,|A) - H(A+yml>\) = I(A+yp.;>\) - 1(A+ym;A)+H(ypl) - H(yP2)' In order to arrive at this expression we have made use of the fact that for any two independent m’s A and y, H(A+y|A) = H(ylA) = H(y). Subtracting H(yp,)-H(yp,_,) from every term in this expression, and writing [m (X; A) for H (A + ym) — H (ypl), and Ip2(X; A) for H(A + ypz) — H(ypz), we obtain 1p.(X;A) - 1p2(X;A) = I(A+yp1;/\) - I(A+yp2;>\). By hypothesis, out of the 171’s yp1 and ym, the latter induces greater uncertainty into its corresponding sum with A, and so, the information that A conveys about 160 A+ yp2 is less than that conveyed about A +y,,,. Thus, I(A+yp,; A) > I(A+ ypz; A) and, therefore, I,,,(X;A) > Im(X;A). A.3 An Alternate Interpratation for Precision By definition 1p(X;A) = [pp(X|A)p(A)10gM dA dx, pp(X) where pp(x|A) is a uniform density the support of which is the N-dimensional cube C (A; p) centered at A and sides 2p, and mix) = [ppIxIAIpIAI dA =ICI f(p ,. ,. (A5) :I—CIP(’\ e C(x; p)) Here, |C| stands for the volume of C(x;p), i.e., |C| = (2p)N, and P() denotes prob— ability mass. Writing I p(X ; A) in terms of these expressions yields ,,(;X A) =/ / dAdx IRN a(x, p) ICI)10 P(A 6 C(x; 12)) P(A E C(x; p))logP(A E C(x; p))dx =|_10| RN —1 = _ / P(A e C(x;p)) log P(A 6 C(x;p)) dx ICI 1, mm ) where the summation is over all vectors k = (h, . .. ,kN) with k1, . .. ,kN integer 161 Irv U a I - ‘» .5 '0 ‘. __ n l r ‘ I I, .. I I A .. ~ ,' - / . /_ I I ‘ :Ifu ' '. "' N'WIIO‘L‘“ " I 'uf t 0 mm...» , I - - — ..~ W . ,.. , . .( I) . H:“I"f5fl19’IA‘ i.‘ . ,'.II A r ‘.\ —_ l,‘ ,_"‘\ ‘tlf‘l I1! II- ./ M ‘ mIvzr-uiwad' «-- I -.",n; HJAK‘ u . .h' /. («3.1. W -' 'Ilulq H‘vifid‘fli‘ ‘1 ir'J "“ . - . ’ H VII 10! ,- . i ,, 'l V .th',-,.r.z"13 1,149.4“ ;. " l '71- u".- 1'!) I‘I. v} . . ,A (km (“A .. ,.AI = )l e‘I'IJ'rvI 11'; 157') *3 - r, I. - rm .A‘JM-‘l . ‘ - numbers. So, —1 [p(xiA) = a Z] P(A 6 C(x + 2pk;p)) log P(A E C(x + 2pk;p))) dx. 1‘ C(in) Clearly, P(A E C (x + 2pk; p)) are probability masses on k parameterized by x, 6.,PZk (A E C( (x+ 2pk;p)) = land P(A E C(x+ 2pk;p)) Z 0 for all x and k. Thus, the entropy Hp,x of the discrete ru’s {H(A) E Pizza] with probability distributions as... = k) = / p(A) dx (A 6) C(x +2pk ) are H... E — 2 P(ém) log 1305...). (M) e... Rewriting the mutual information in terms of these entropies, 1 Ip(X;A) = |_C'-l-/CO Hp‘x dX. (AB) :P This expression says that [p(X;A) is an average of entropies corresponding to discrete ru’s obtained by partitioning the N -dimensional space into cubes of sides 2p each, and assigning them probability distributions corresponding to the volume of p(A) “under” each cube. The entire partition set of cubes are shifted by x E C (0; p) to obtain the uncountable set of distinct discrete rv’s over which the averaging process is carried out. This justifies the alternate and intuitive interpretation given in Section 2.3.2 for a model’s precision, and which we restate below. Its equivalence with the original definition is clear for in both cases we have ’P = I p(X ; A) = “X ; A)- 162 a ‘l 'I n ~ I I o . l ' .~ . I I I .1 ‘ r ‘ ' V 4 I‘ I ' I l .l .I‘ l I l! I, '. , I J {a}. , “I 81l;.t.-.-HII. .~v,.-.~'..- I . mm s» «um-I a.“ .. . . . “HIUIC’I XII-1‘ {.[ji‘ug'I-g. __ . I ,, I , 5 1 WI fl‘tlildi Bil. v.0”: \ Ma Ms}: J 7.. --r Aflixbn‘lb 95!, H, “Li‘fl (us. ..Ix', . II. I ‘ ._ nevi; sluuhdanI-um Him/11:: in:- 97.1371:- I». «4: I Fr I ‘J’Jmi': . ~ . . . . . Irmrpn ,.Il .mIl-III «mm» 9» Irwin 1).: , hm mt, H3! .8 'n 'f ' H .3 p . I . . 1-)“ ~- ‘ an“ Watson .I'n.‘ II: In} uni? 4' inborn . fl \v) "'4‘ '43in ."NJIU M 47L.” .. '.' 5)": II. ~ hear .Juo u mitt :- I: P is the entropy of the m A when quantized to the number of bits determined by I (X ; A), averaged over all possible quantization schemes. A.4 Only an MA Model has Anomy of Zero To prove that only an MA model has anomy of zero, we let ORA“) E — [pp(X|A) log(tp(x|A) + (1 - t)pp(XlA)) dx for all values oft in [0, 1], and show that N = f(aPAU) — ap,A(0))p(A) dA > 0, unless p(xIA) = pp(x|A). By repeated differentiation of ap‘),(t) with respect to t, we obtain .. _ p. " OW) ‘ ‘" ‘1)’/pp(""‘) (thxIA) + <1 — tIppIxIAI) ‘1’“ Suppose that pp(x|A) 79 p(x|A). Then, it is clear that GSA“) > 0 for all t. Since 0;,A(O) = 0, a;’,\(t) > O for all t > 0. Thus, ap,A(1) > ap,,\(0) and, consequently, N > 0. When p(xIA) = pp(x|A), 0%),(1) = ap,)‘(0), and the desired result is obtained. A.5 Equivalence of Anomy Definitions We want to show that D(Pp(X)|lP(X)) = D(pp(X|A)P(A)NIKKI/“1900). Let W be a linear mapping A I—+ A1, where A 6 IR”, A1 E R3 and N a positive 163 even number. We only consider those mappings W for which the N by N matrix 3:. “I W is invetible. Here, I and 0 are the 1;- by g]— identity and zero matrices. Denote by A‘ and Ab the first (top) and last (bottom) N / 2 elements of A, respec- tively. Then, assuming all the following ratios to be well defined, MAXIM) _ pp(X) pp(>\IIX) _ pp(X) pp(/\‘,>‘1|X) P(XIAI) p(X) P(AIIX) _ P(X) p(A‘IAIIX)’ where the last equality is obtained by multiplying and dividing by p(AtlA1,x). Recall that in general, meaning is given to pp(x|¢) by the integral fpp(x|A, ¢) p(AIqb) dA (= fpp(x|A)p(A|¢)dA.) Since (i) = A (:2) = AA, Pplxlxl) = Pplx) pp(A|x) = _e(_x|_A_) (A.9) p(XIAI) p(X) P(AIX) p(XlM' Since the derivation could have equally been started from this last ratio of densities and concluded that removal of information from the conditioning of the densities is inconsequential,1 110;) = p(XIA) (A.10) p(X) p(XIA) ' 1Note that by choosing W = [0,1] and Ab a constant vector independent of x, the desired result is explicitly obtained. 164 Now, from the definition of the information-theoretic distance D(--||) and (A.10) . x = pp(XIA) D(p.(xIA)p\)p(>\) dA. by the assumption of smoothness of p(A) the 2p width-per-dimension of pp(x|A) = ZIP(x — A) is of the order of the width-per-dimension of p(x|A).Therefore, p(A) may 165 be regarded constant. within the N -dimensional cube of sides 2p, and 1 pp(X) = — p(A) 61* % INK)- lCl C(x;p) Therefore, HIIX) z — ij(X)10sp>.(X) dx = H(A). Second Proof From Appendix A.3 we known that 1,,(X ; A) can be written as 1 I X;A =— H ,.d, p( ) lClv/Cmm) p, X where H... a - Z) P(é...) log PIE...) 6.... and P05... = k) = / p(A) dA. C(x +2pk;p) From the assumption of smoothness of p(A) the 2p width-per-dimension of pp(x|A) = Up(x -— A) is of the order of the width-per-dimension of p(x|A). Therefore, p(A) may be regarded constant within the N -dimensional cube of sides 2p, and POE... = k) % PA(X + 20k)lC|- Then, Hm: % -- ZPMX + 2pk)|C|1<>s(P>.(X + 2pk)|0|)- k 166 Applying this result to the expression for I p(X ; A) gives IPIX;A) z — 2: / PA(X + 2pk) IogpIIx + 2.2km k C(0;P) — log ICI Z] [a(x + 2pk) dA k C(OHO) z — Z / p(A) 10gp . . . . . . - _ 'I'.| "~’D" .l ... \ .I'IflI'V.‘ .. ., . - . .I . . 1)} ~41) "91V" ~' H A ~ ’ I . -. A 91. . III I [: ”A I. , ‘ ‘ . . . I' ""7 ,1...” “I, .5 ‘2 a a [I'V‘l' " 171‘1‘“ ‘ ‘I'.| _‘I, - I. ' . ' ’ . . .I . ‘ :"lft CHI, ’06 .‘l' ' .. .1 . .r::!. ‘1‘“, ‘3‘,” I... 17¢", . .l“ '; . l I ‘. ‘f-l .ADI’WY-A’, 5" :,-' _ s,. A ‘ 5.1 1‘. ‘ ' ’- ’ \ " l‘tfill‘) . I - I A . ‘ ’ 'l ' . M \' ,_ t .{c If W is an orthonormal transformation such that2 01 W1 = WAO and = on, A1 X1 where 91 and wl are simply the “error” vectors A0 — A1 and x0 — x1, respectively, the above inner-integral can be expressed as / IogpleIAIMx. C( 1:01) 2/ log/ IXGIIAQZ/ p(X1,W1|A1,01)dW1d01dX1, C(Alipl) RNI k C(OH-Qpikmi) where the summation is over all vectors k = (k0, . .. ,kN,_1) whose elements are integers. C (91 +2p1k; p1) are the cubes centered at 01+2p1k with sides 2p1. Refer to Figure A.1 for a representation of the various components in these expressions. Note that in the figure the point (91, A1) is the same as the point A0. ' Now, define a(xllA1,91) E With the aid of the Jensen’s inequality we obtain / logp(xIIA1)dX1 Z ICQIIPIHIOEICQIIPIH C(Alipl) +/ / p(91|A1)1og(1+a(X1|A1I91))d91dX1 C(AIIPI) RNI 1 f f (3' A p 6 A / 10gp(x IW IA ,9 )dw d0 dxl. l ( lip1)l C(Alml) 1R"! ( 1| 1) C(91;P1) 1 1 1 1 1 1 2For later convenience, we have departed from writing in the customary order the projection vectors A1 and x1 first, and the “errors” 01 and wl second. 168 I I I I I I I 3, , I‘- . I C . ' «4' '3 ‘ .. ‘ ‘ , I. I ‘ . .l t. V . II; ‘I-IIIIm'a m H an In AI' .‘ R -/' ~j ',. . if .7 “ MI..." .."‘.I ‘ I: ' . oIlI nrmyflg‘“ I, ' ‘ 1‘ h I' r‘ ( :‘h("’. 3"- .«90» 5". "d! 7m‘.'”r‘l A.) :(x‘li'l' 2' ‘! 'r .7 . av <, ... . . 01"1‘1’1‘ ‘4 0.14)» U """HJ'" ‘ "l' 4 l - ‘II l‘ "H nII v'l 7'1" .1 . . ' 1". I-- .1 ‘III; ”f. I17 ~. ' 1d: .34” _h I l X1 M1 - C(MOI ; Pl) WI COM ; P1) M C(91; P1) Figure A.1. Simplified geometric representation of EPR cubes at consecutive scales. For simplicity of notation only those cubes corresponding to scales 3' = 0 and j + 1 = 1 are represented (e.g., C(Ampo) and C(A1;p1)). Here, x0 = (xflmxgjl)? By integrating p(xole) over adjacent copies of C (A1, 01; p1) as shown, a useful relation between the anomies No and M can be obtained (See text for details.) Then, M .<. -logIC(A1;p1)|'~’ P(AI) / / _ _ OAlo 1+axA,0 ddedA /R~1|C(>‘1;p1)| C(A1;p1) MM 1' I) g( ( 1| 1 1)) 1 1 1 P(AI) / f f ’ — 9" lo x,WA,9 dwdedi, [Rn/l |C(’\1;P1)|2 C(A;;pl) RN11“ 1| 1) C(91;p1) EM 1 1| 1 1) 1 1 1 1 01' A N1 _<_ —108IC()\1,91;P1)| “/ M" 10g(1 + a(xllAODdxldAO 1R~o IC('\11P1)I C(Axm) P(Ao) — 10 x A dx dA. AND |C(A1,91;pl)l C(Al,91;pl) gp( 0| 0) 0 0 169 Since N0 = - log |C(A0;p0)l _/ P(Ao) —— logp x A dx dA , IRNo IC(A0;P0)|/C(Ao;po) ( 0| 0) 0 0 No No - N1 2 log (E) +/ p(AO) / log(1+ a(xllA0))dx1 dAo 1R C(ANPI) po ”0 (200M 1 + pA {———/ lo px A dx [R < 0) (MN, CMW g < 0| 0) o 1 _W/C:(% )lng(X0lA0) dXo} dAo. :PO So, for a model to satisfy the A/ P conditions, it suffices that for all scales NJ log(-"—) < / p(A.){A,-+1} dA., (A.12) Pj+1 IRNJ' where Alp) = 1 / 10gp(X-I/\')dx- J _ (ijle C(A;;pj) J J J, 1 A"(p)E—-+/ logpx°A- dx- J (2pj+1)NJ C(Ag'+1.91+1;Pj+1) ( Jl J) J and A- (a)=—1—-——/ log(1+a(x- |A-))dx- J+l _ (2pj+l)Nj+l C(A9'+1:pj+1) J+1 J 2+1 are the averages of log p(ijAj) over C(Aj; pj) and C(Aj+1, 9j+1;pj+1), and the average of log(1 + a(xj+1|AJ-)) over C(Aj+1;pj+1), respectively. 170 A.8 Coarse-Scale-Data Limited Models We want to prove the equivalence between A/ P and CSDL models. The following depicts the conditions associated with CSDL models. The CSDL Model Conditions j-scale Model j-scale CSDL Model P(XjIAle/V‘) P(Xj+1|)\jlp(’\jl 7’1“ 7’1 > 7”#14 PJ'HJ A1 ~41 < Ania Ajay Although the conditions have been stated in terms of data belonging to one coarser scale than that of the intensity’s, it equally applies to observations obtained at any number of higher (coarser) scales above the scale of the intensity. Recall that all the models under consideration satisfy p(xj+1|Aj) = p(xj+1|Aj+1). Motivated by this relation, we define pp(xj+1|AJ-) E pp(xj+1|AJ-+1), which is a logical choice. Clearly, pj+1J E I(Xj+1;Aj) = I(Xj+1;A,-+1) = 792'. Since similarly 1,3(Xj+1; A3) = [p(Xj+1;/\j+1), Ip‘(Xj+1§Aj+1) = Pj+l,j = 733' = Ip(Xj+1;Aj+1) 171 and so, ,5 = ,0. On the other hand we have _ pfib‘j-HIAJ') . .2 - . ,\. A-l ___ . d . N+13/Pp(xy+1| .)p( .) og MMI,» ax,“ A. Pp(Xj+1lAj+1) = ~x~ A. A. 10 dx- dA. fpp(3+1lg+1)p(3+1) gp = . . . Pp(xj+1|)\j+1) . _ ___ . fpp(xJ+IIAJ+l)p()‘J+-l)log p(xj+1|Aj+1) dXJ+1 dAJ-H My Thus, P34,”- = 79,- and A34,”- = A], establishing that if a model satisfies the A/ P conditions, it also satisfies the CSDL model conditions, and vice versa. Therefore, the two sets of conditions are equivalent. Consequently, for models for which (1) E[xj|Aj] = Aj (2) p(xj+1|AJ-) = p(xjHlAJ-H), and (3) H(AjlxjH) > H(Ajlxj) hold at every valid scale—the same three conditions adopted in Section 2.4.1—finer scale patterns in the data convey the required information for a more precise model, but modeling based solely on this new information necessarily produces a less accurate representation. 172 APPENDIX B Appendix to Chapter 3 B.1 Posterior Distributions In this section we show that f(AjICO) = f(Ajlcj) (B-1) and f(5j,k|Co) = f(6i.lej-1»?kicj—l,2k+l)- (32) These are proved in the following two propositionsl Proposition 1 p(AJ-lco) = p(Ajlcj) for j = O, - -- ,J Proof: Since the same information contained in the set {Co} is contained in {CO, cj}, 1Throughout this dissertation numerous has been the instances where use of a table of integrals and other mathematical expressions has been made. Our main source of information in this respect has been the tables compounded by Gradshteyn and Ryzhik (see [87]). 173 ...v— -. O H XII" ' '1 """7 ll) 03 XIV. . 1 1"I' ' .; ..' .,;.'s::-1.~."'! mi .- r 7141' " Wade 9', v I!” ,. l .:1‘ 111'!" "1‘ - ’ 9’ 1 . ,uihds'- - 9. .1...‘, . >‘ ' u ' 1‘ ' ' ' ‘7 , - ,. ~~. Luv-«1117121111410!!! , . dugout-raider: tin-.11. “mug. 11:. ....” . .'-.. ._ ,5 Q“: m . . . v w: .1 . 3 ..v H . Wm lid‘. nl antlmmt‘m‘ l1 «‘JLI . -. . «. ,.. , .‘ 1,, - . 1. 4 .. .. 1 flint-'1 ‘ l . .I . L‘- “"' V i o 3f " “w'wi 7w... '1 Winn . I" nun-mam 'mo . 4% g .. ‘1 .- ;‘F ‘ - v - ‘— . ;1 '4 “c I I . applying Bayes’ theorem f (Ajlco) may be written as p(AJ-ch) f(AjICo) = f(AjICo»le = P(Colcjdjlmcolcji Thus, it suffices to show that p(colcj,AJ-) = p(colcj) for j = 1, - ~ , J. Define sequences of even and odd elements of cj_1 and Aj_1: €311 = (Cj—1,O~.Cj—1,21"' aCj—1,1’V'/2J-1—2) and 0311 = (Cj—1,11Cj-1,3a ' " ,Cj—1,N/2j—1_1), and sim- ilarly for Aj_1 and A34. Clearly, p(C‘- .C"- IAj) . l = #mch'IJAJ-l) 1fc3?_1= (:1- elf—1 2 Q 119—1|ch (8.3) 0 otherwise. Then, with the aid of the total probability theorem we may write for the non-trivial case .) = fp(c§_1.c,- — c§_1IA,-, ;_,)p(,\;_,|A,-) dA§_1 P(CjIAj) f Hk p(Cj—1,2k|/\j—1.2k) Hk 19(0ch - Cj—1,2kl)\j,k - Aj—1,2k)p(A"?_1|AJ-) clA‘3._1 = 2 J Hkp(cj,kl)‘j,k) fHk e-Aj— 1.21: (Aj— 1,c2klJ 12" I'Ike—Aj.k+Aj—12k("j,k")‘j-1,c2k)”‘ c1 12" P(A;-1|/\j)d)1§_1 (Cj- 12kl' k(cjk- Cj- 12k)‘ _,\. k.._J.1__ Hk e (A(Cchc)! Cj’k )‘j-i2k)c’—l’2k ( )‘j-12k)c""_cj“'2" = —' 1-—-’— A"; A- dx:_, [I]; ( AjJt Ach p( 7 1| 3) J 1 Cj-l.2k p(cj_1|CJ-, A where all the indicated products are over k = 0 through N / 21 — 1. In these expressions we have exploited the conditional independence of the data scaling coefficients at any specific scale given their corresponding intensity scaling coefficients. Since the innovation variates are given by y“, = 52f? (see (3.11)), the condi- tional densities within the integral signs may also be written as pA;_1W(A;_1|Aj) = 174 A,- . 11,11,117” (( A1;2k)k| ()‘j.k)k)9 where Yj = (yj,o,yj,1,°" ,yj,N/2j_1). Then, With a change of variables and recalling the mutual independence of yj and Aj, we obtain the equivalent expression CM c-12k C'k_C'-l2k p(Cj—llcjv A3) = H (wk) ’ ' (1‘ Wk) " J ' P(YJ‘) dy)‘ (BA) k Cj-l.2k The absence of AJ- in this expression indicates that cj_1|cj is, at least explicitly, independent of Aj. However, one may argue that Aj intervenes only functionally to define the probability mass p(cj_1|cj, A,-). That is, if we denote the right side of (8.4) by g(cj-1, cj), we must. still verify whether g(cj_1,cj) = p(cj_1|cj). We may achieve this as follows. p(Cj-llcj) = [P(Cj—llcjaAjlpo‘lej)dAJ‘ = [g(cj_1,cj)p()\jlcj)d/\j = 9(C1-1’CJ)' Therefore, p(cj_1|cj, A3) = p(cj_1|cj). Using this result, one arrives at the desired final conclusion p(colcj, A3) = p(colcj) using induction: Let I be any integer such that O S l g j — 1, and assume that P(Cj—zlcj, A1") = P(Cj—zlcjl- Clearly, P(Cj—z—ilcj,z\j) : /Zp(Cj—l-1|Cj-11Aj—la Cj,Aj)p(Cj_l|Aj_l,Cj,Aj)p(Aj_llcj,A1.)dAj~l Cj_( = Z[P(Cj—I—ilcj—z,Aj—1)P(Cj—1|Aj_1,cj,Aj)p(AJ-_¢|cj,Aj)dAj_, 91.1 175 ’— T' ‘ Y ...)ng .(‘(t\‘,f‘ ‘1, .. I , .. .,; .‘miifnrn 1m .- 3 ‘ in . , L . I... ‘1'] L n\ -\l\ll ., , w . . ~ »-4:.,am:m$~g l ' " " '9“: 1:" ram-0M v‘ .7: ' i 11*: .. m, 1.1?- wily-11111113} . . i 1 W . wzwm '1111, ”mn"" IA uu‘? ‘HF‘HF-m ‘1, .' . ' ‘Ajlxj' ‘./1.")l _!4,\ "'13. f» .( i g I I ‘ ‘ " “C! I 1-“.l‘l“._'f; f("x’ .". . I I“, " ,.I I ' VII l V'.\ .' 7 l I ~ .. .LI ’ v I. .- . I ’ Now, since p(cj_1|cj, Aj) = p(cj_1|cj) and the assumption, p(Cj-l—l ICj, A1) = Z: P(Cj-z-i lcj—l)p(Cj—llcja Aj) Cj_z = Zp(ci-l-lle-lvlel)(Cj-llcjl = P(Cj—z—ilcjl- Cj_( Taking l = j — 1 completes the proof. Note that we made use of the fact that given scaling coefficients at a given scale, knowledge of the corresponding scaling co- efficients at a coarser scale (greater value of j) does not alter the likelihood of an event. Proposition 2 f (6,,klc0) = f (6j,k|cj_1,2k,cj_1,2k+1) forj = 1, - -- ,J Proof: In order to keep the mathematical notation as clean as possible in these and subsequent derivations, we introduce the following simplifying notations: c1 = cj_1,2k and c2 = cj_1,2k+1, c;_1 = cj_1\(c1,c2), and y; = yj\yj,k. Now, since y“, = %(1 + 51k) (see (3.10)), it suffices to show that p(yj,k|c0) = p(yj,k|c1,c2) forj = 1,~- ,J. This may be done as follows. By the total probability and Bayes’ theorems, p(ijcICO) = [p(yj,,\,|c0)dy;d,\,- p(ConjaAj) . ————-—— .,A' d .dA-. / p(CO) p(yJ J) y] J Since the information conveyed by {y}, A,-} is the same as that conveyed by {Aj_1}, P(ColyJ'aAj) = p(COIAj-l) = P(Aj—llcolmcol/PQj—ll- Using PFOPOSitiOTl 1, this be- comes P(Colva‘j) = P(Aj-1ICj—1)P(Co)/P()\j—1) = p(Cj—lIAj—1)p(CO)/p(Cj—l)- Upon 176 substitution in the above integral pom--1) . p(yjnco) = / 13(ch p(dey, d1,- P(Cj—1IYj,/\jl .. = - -,,\. d .dA- / P(Cj-il p(yJ J) y] J p(03_1ly;, *1) P(Cj—ll P(Cl—l lAj) — . , . A- __J__ - P(yml [p(cl, cglym, 1k) (Ci-1) A61 +Cze-A pc(* = [P(Cleczlyjk1 AM) p(yj’ Al) dy; dAj P(I‘v‘) dAJ’ Cj_1j|/\ ) C1! C2! m ) p(AJ- )dAj. C = p(yj.k)y,-,l(1 -yj.1-)C2 / Thus, p(ijcICO) = p(yj,kl01,62)- B.2 Optimal Estimation of the Multiplicative In- novation In this section we find a closed form for the optimal estimate 6M of the innovation coefficient 5j.k° For the sake of simplicity, in the few steps that follow we will disregard the indices j and k, and simply write 6 for 513k and similarly for other quantities. The minimum mean square error (mmse) optimal estimate of the innovation co- efficient 6, given all the information available in CO is given by E[6|=c0] /:()6p 6|c0)d 6:]:(61) 6|c1,c2)d (8.5) in accordance with (B2). Applying Bayes’ theorem to this expression we obtain 3 = filamctczlopww = fl,fo°°6p(c1,c2|6,i)p(xla)dAp<6>d§ f3, p(cl, c2|6)p(6) d6 f _1, f0°°p(c1,c2|6,1)p(1|5) dAp(6) d6 Here, p(Al6)= pA( ) due to the independence of Aj k and 6 J k. Also, notice that Cl and 177 1111391111 .- .4" '1 ”M . l 1. (11!; .l I .4. .JM . _, .. - ...—m. . ’/l.'. ."!"‘\\v - , 5 1 ‘5‘ ‘Jl‘!’|\ a“ ‘. .‘ ‘ l‘.‘ L" .l. ’ V ’1 .1. . -~ ." 7:11:53271 15ml ' S '1’. 1101.1 1 . 1‘1 -.;.,y.1,'..1"";_ , 1.’ .11.. ,m- «a: so"? ' 12' ‘4. - 1113.6 .1 Mr t ' t mfimpp. ' .- -0.')11.21h-'n...1 1.‘ > 1 1 ‘jfli ,. ‘3 .,‘.' .’ 11" l. , _ fl ,... .. O _J ' .‘l ”('l'B'KMle Id, “51 u l' (a. “.l-‘I ' l ' v; . ,8) "I -. ~ . ,‘ ~- .1. w 1 “11.315111 3:. D',_ "V's; ”It, I} '5“ I’LL/9.111.! '2 , ”"1“111' " ‘1‘ 1 ' 1.“1.';‘;’. 13.8)(‘11’ ' r ". r .‘ .4 a. night/1mm: 3 ~ '7 " 1am -~;-_"__I.Z'_‘-'.".‘-Ll"fl .. I " mam-um? : a “it .'.‘I v:'. ‘ _Ii-‘ ': 3.76“}... "z. 1; 1.31:- ,..; ”1 , ,_ MMI-11659.4!) ,._,.-, .' '_ If ‘. p ' i 1 ' ~ V ’ ‘ . u. _ . .MJI l'lloun’! Juli: . “61'" 1. “A L. v‘v."'T.itN;‘1llI'L 'tlll-JJ uubf 97 . . . - ~ pa ‘ .'r‘ C2 only depend on A and 6 by way of their respective statistical means A1 = -;—A (1 + 6) and A3 = %A (1 — 6) (See (3.11) and (3.12)). Given A1 and A2, A and 6 do not convey any further information on the behavior of Cl and c2. Moreover, given A1 and A2, c1 and C; are independent. Therefore, fl, I,” 6p (ctcz (114,411.32) p(AMpr) d6 _1 °°p c1.c2 Ah? Al-zé (A) dpr) d6 0 1:11 fooo 6pe__—_-A(cll+6)/2 (A#)Cl £32242 (A1—— (5)62 p(A)dAp(6) d6 00 e____-A(1+6)/2 c1 e—Ml 6)/2 c ' L11 lo c (”T”) 72—01345) 2 P(AldAP(5)d5 on) It is now possible to factor out every A-dependent term from the integrals in 6. Doing this, the resulting integrals in A in the numerator and denominator cancel out leading to g_[3,15(1+5)‘=1(.1—5)‘~‘2p(5)(15 L11 (1+5)c‘(1-5)c’p(5)d5° (8'7) Substituting the beta mixture model (3.13) into this expression and carrying out the integration we obtaine the desired result: 22' p B(8:+C1 81+C2) A _ 3 B(s,-, s.) )(23.+c) 6j,k "- dj,k Z B(3i+C1 3i+C22 . (8.8) ipi B(8;‘,8;‘) 178 w" “v .1‘z'i ‘ MN" ”(’6 . I ‘ : (Ithufl ' -' I III‘Iu-Mjno- r.vI1-n-u.‘l’ . - ' ugh-'11 019 In 112.1195 I’ . t‘ ‘ H, H < ‘ "J'Mflsm _ ,.~-» v , I/ 1 ".J'l. » 16,“. 7- | q .‘1 -. v" . - n.‘ I ‘0‘ ’fi 1% APPENDIX C Appendix to Chapter 4 C.1 The Hilbert Transform as a Continuous Av- eraging Process The Hilbert transform of an L2(IR) function f is defined as Hf(t) a 1 0° Mar, ((3.1) 7r _oot-T which is the convolution of f and %. For functions with finite support, it is possible to give a more intuitive meaning to this operation. In particular, if supp( f ) = (a, b], where a < b, then _ l b a / mo.,az(r)dr, (0.2) 7r 0 71(f(t) = — where f(02(T)) — f(UIITll (C.3) 02(7) — 01(7) "101.02 (T) 179 and 01(7) E (a — t)r +t and 02(7) E (b — t)r + t. Note that 01(0) = 02(0) = t, while 01—> a and 02 —» b linearly as r —-> 1. Clearly, if the derivative of f exists everywhere in (a, b), ma,,02(r) is the average derivative of f in the interval (01(7),0'2(T)) C [a,b], i.e., 1 02(7) mom 7' = ’ a do. C4 2( ) 02(Tl-01(T) [01(1) f( ) ( ) Proof: Substituting C.3 into C.2, and replacing the definitions for 01(r) and 02(7), we obtain Hf(t)=—%/O f((b—t)r+t)C:_—T+%/0 f((a—t)r+t)£:_: Now, in the first integral let u = (a — t)r + t, and in the second integral let u = (b — t)r + t. In both cases we have that df = iii—:11, which upon substitution we get the desired result C. 1: u—t u—t Hath—five) d“ +71; [m 0’“ C.2 Proof of Expression (4.24) In this section, we investigate the nature of the approximation [2” Sa(—12!0—7m) d6~ 21r/N (0.5) 0 ( 1— rcos(6 — (15))2 I ~ (1— rCOS(d> — $73371»? for n = O, 1,. . . ,N — 1, and find the conditions under which it is adequate. Let rect(;z:) be the function that is 1 in [—32”-, g], and zero otherwise. Then, the 180 Fourier transform of Sa(‘-3_£6) is Freedfi). Thus, Sa (%6) = fi fill? e‘wodw, and 2" Sa(-1y-6 - 7m) ”/2 21f ewe / _/ 8-1.41. / dfidw 0 (1 — rcos(6— (1)))2 :N Mg 0 (1 - rCOS(9 - (25))2 N/2 _ w 00 211’ t ’ 1 eWTn/ N rec (02—7r 1r) e—zwfi dfidw, =2” -N/2 —oo (1 " 7005(9— 99))? Of A (2” 341219 _ 7m) do = .1— m lvflreC‘It—“l ...-ten .1... (c 6) 0 (1 —rcos(6—¢))'~’ 27r _N/2 (1—rcos(6—¢))2 1 - where as “usual, ()A denotes Fourier transformation. Clearly, if the “bandwidth” of 2_7r (1N, cos(6 93¢»? were less or equal to N / 2, the right integral of (C.6) would reverse its Fourier transformation, giving (C.5) with equality. This, however, can never occur _2__r 9— 1r since (1”,, co:((192—— 4);), has compact support, and so, its bandwidth is never finite. Note that C.6 is simply a restatement of the Nyquist Sampling Theorem and the recon- struction series which derives from it [88]. Although we must content ourselves with an approximation, we can choose the parameters in (OS) such that its two sides are as close as desired. To see how this N "92;rect( (1- -rcos(9— 1:11))? is possible, first note that the bandwidth of does not change with N, 2_ (9-1r) and so, the right-side integral of (C.6) monotonically approaches (INTE(E)))7 as N increases. Variations on <15 only affect the phase of the Fourier transform and does not modify the bandwidth. On the other hand, as the value of r goes from O to 1, (lwrr::((0%- (:9), goes from a constant to a U shape function on its support [0, 271], drastically increasing its high-frequency content. A detailed analysis would show that this increase is monotonic because the energy increase in the signal continuously adds towards building up the singularities at the interval’s boundaries. In order to find values for r and N that make (C.5) a good approximation, nu- 2h 1r9—’—rect( 2; (1— rcos(G— ¢))) merical solutions were obtained. The Fourier magnitude of 2for four 181 14 12 NO¢O 12 5 10 15 20 (b) 30 25 40 20 30 15 20 10 10 5 5 10 13 20 5 10 1‘5 20 25 33 35 (C) (d) Figure C.1. Magnitude frequency content of (1Nr::((a— (my (abscissa in radians/s) for ¢=0and(a)N=16,r=.8,(b)N=32, r: .,9 (c)N=32,r=.96,and(d) N=64,7‘=.96, combinations of N and r are shown in Figure C.1. It is seen from these graphs that the right-hand side integration operation of CG would change very little if it were over IR rather than over [—%, g]; although Figure C.1(c) probably corresponds to an acceptable upper limit for 7' when N = 32. The 2—" rect(9—2"— K") (1- NTCOSU’ 21:15)) integration —7 and % in each graph 1S as follows: (a) 38 dB, (b) 53 dB, (c) 30 dB, and (d) 67 dB. decay in the frequency response of 2 at the lower and upper limits of 182 -.v 7": " " ”NIH v ‘ ' H . ,, r. "’ 9- "/ .«1 T r- :H \ u u .‘ J ('3) f ‘ I ,.- , 1 , -H'»'~1! (IbDI‘U ' I ' ‘H " U 1 . ’. ‘— 1.313% H 11'; Ilu'uer‘O'J'I: 11'-) :11 fDu‘Lur-u‘ . 1 , ' , 11,-31 .‘4'11 713w 311mg; M111“? «.11 1.; ‘4 w'. 1 .-.1°'w 11 H 91?)” [INV- ' 7131.11.35 1 . ‘3‘“. ..1 ‘ - {‘J" 1.. 11“", "Jum'lm ! “f, . I» , ' . ,- . I . .. a I. l. ‘. L I! .U (”0111 ”llthgJ .fH: .1 1,; ‘~' ‘ '; ’I '_ 3 Mfl‘ "V VZNW'ML’ " r" ...“ ,.. ,- .., ‘ ‘ -. . . , v ‘1} ”it . ,hj ’~“ .‘I .EUI ’ . ‘ - u“. . 1 1 v.‘ (X'LH'I AU. ;T‘ W' I . ‘ .A It . ~ . L, ,. . L. -. 2 ' 7‘ T. . l O ‘ r . . Q I BIBLIOGRAPHY . '11" Q .. r . . - . L . 111).?!" “#1) 1. -- .... l BIBLIOGRAPHY [1] D. L. Snyder and M. I. Miller, Random Point Processes in Time and Space. New York: Springer-Verlag, 1991. [2] D. L. Snyder, A. M. Hammoud, and R. L. White, “Image recovery from data acquired with a charge-coupled-device camera,” J. Opt. Soc. Am. A, vol. 10, no. 5, pp. 1014—1023, 1993. [3] H. Heffes and D. M. Lucantoni, “A Markov modulated characterization of pack- etized voice and data traffic and related statistical multiplexer performance,” IEEE J. Selected Areas Comm, vol. 4, pp. 856—868, 1986. [4] Y. Vardi, “Network tomography: Estimating source-destination traffic intensities from link data,” J. Amer. Statist. Assoc, vol. 91, pp. 365—377, 1996. [5] M. A. King, R. B. Schwinger, P. W. Doherty, and B. C. Penney, “Two- dimensional filtering of SPECT images using the Metz and Wiener filters,” J. Nuc. Med., vol. 25, pp. 1234—1240, 1984. [6] G. A. O’Driscoll and et al., “Differences in cerebral activation during smooth pursuit and saccadic eye movements using positron-emission tomography,” BI- OLOGICAL PSYCHIATRY, vol. 44, no. 8, pp. 685—689, 1998. [7] E. E. Smith, J. Jonides, and et al., “Components of verbal working memory: Evi- dence from neuroimaging,” PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 95, no. 3, pp. 876-882, 1998. [8] R. Dume and P. Jansen, “Computer tomography of barrels with radioactive contents,” Nuclear Engineering and Design, vol. 130, pp. 89—102, Sep. 1991. [9] J. Frank, Three-dimensional Electron Microscopy of Macromolecular Assemblies. San Diego: Academic Press, 1996. [10] F. Abramovich, T. Sapatinas, and B. W. Silverman, “Wavelet thresholding via a Bayesian approach,” Tech. Rep., Math. Dept, Univ. Bristol, 1996. 184 [11] H. A. Chipman, E. D. Kolaczyk, and R. E. McCulloch, “Adaptive Bayesian wavelet shrinkage,” J. Amer. Statist. Assoc, vol. 92, pp. 1413—1421, 1997. [12] D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” J. Amer. Statist. Assoc, vol. 90, pp. 1200—1224, Dec. 1995. [13] I. M. Johnstone and B. W. Silverman, “Wavelet threshold estimators for data with correlated noise,” Journal of the Royal Stastistical Society, Series B, vol. 59, pp. 319—351, 1997. [14] E. D. Kolaczyk, “Wavelet shrinkage estimation of certain Poisson intensity sig- nals using corrected thresholds,” Statistica Sinica, under revision, 1997. [15] B. Vidakovic, “Nonlinear wavelet shrinkage with Bayes rules and Bayes factors,” J. Amer. Statist. Assoc, vol. 93, pp. 173-179, 1998. [16] R. D. Nowak, R. L. Gregg, T. C. Cooper, and J. E. Siebert, “Removing Rician noise in MRI via wavelet-domain filtering,” in submitted to Annual Meeting of Intl. Soc Magn. Reson. Med., 1998. [17] P. G. Hewitt, Conceptual Physics. Boston: Little, Brown & Company, 1985. [18] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: 2nd Edition, Springer-Verlag, 1998. [19] G. H. Hardy and W. W. Rogosinski, Fourier Series. New York: Dover Publica- tions, Inc, 1999 republication of the original 1956 edition. [20] S. Mallat, “A theory for multiresolution signal decomposition: The wavelet rep— resentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, pp. 674—693, July 1989. [21] S. Mallat, A Wavelet Tour of Signal Processing. San Diego: Academic Press, 1998. [22] M. Frazier, An Introduction to Wavelets Through Linear Algebra. New York: Springer, 1999. [23] C. P. Robert, The Bayesian Choice. New York: Springer-Verlag, 1994. [24] M. D. Srinath, P. K. Rajasekaran, and R. Viswanathan, Introduction to Statis- tical Signal Processing with Applications. Englewood Cliffs, N.J.: Prentice-Hall, 1996. 185 [25] H. Stark and J. W. Woods, Probability, Random Processes, and Estimation The- ory for Engineers. Englewood Cliffs, N.J.: Prentice-Hall, 1994 Second Edition. [26] S. Wolf, Guide to Electronic Measurements and Laboratory Practice. Englewood Cliffs, NJ: Prentice-Hall, 1973. [27] R. G. Gallager, Information Theory and Reliable Communication. New York: John Wiley and Sons, Inc., 1968. [28] G. B. Folland, Real Analysis —Modern Techniques and Their Applications. New York: Wiley-Interscience, 1984. [29] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. Urbana, Illinois: Univ. of Illinois Press, 1963. [30] A. Papoulis, Probability, Random Variables, and Stochastic Processes. New York: McGraw—Hill, 1984. [31] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, Inc, 1991. [32] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, NJ: Prentice Hall, 2000. [33] R. T. Odgen, Essential Wavelets for Statistical Applications and Data Analysis. Boston: Birkh'auser, 1997. [34] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk, “Wavelet-based statistical signal processing using hidden Markov models,” IEEE Trans. Signal Processing, to appear in Special Issue on Theory and Applications of Filter Banks and Wavelets, 1998. [35] D. L. Donoho and I. M. Johnstone, “Ideal spatial adaptation via wavelet shrink— age,” Biometrika, vol. 81, pp. 425—455, 1994. [36] G. P. Nason and B. W. Silverman, “The stationary wavelet transform and some statistical applications,” in Lecture Notes in Statistics: Wavelets and Statistics, vol. New York: Springer-Verlag, pp. 281—299, 1995. [37] N. Weyrich and G. T. Warhola, “Denoising using wavelets and cross-validation,” Technical Report AFIT/EN/TR/94-01, Department of Mathematics and Statis- tics, Air Force Institute of Technology, Ohio, 1994. 186 [38] G. P. Nason, “Wavelet regression by cross-validation,” Technical Report 44 7, Department Statistics, Standford University, California, 1994. [39] G. P. Nason, “Wavelet shrinkage using cross-validation,” Journal of the Royal Statistical Society, Series B, vol. New York: Springer-Verlag, pp. 58:463—479, 1996. [40] R. D. Nowak, “Optimal signal estimation using cross-validation,” IEEE Signal Processing Letters, vol. 4, no. 1, pp. 23—25, 1997. [41] F. Abramovich and Y. Benjamini, Thresholding of wavelet coefiicients as multiple hypotheses testing procedure. New York: Springer-Verlag, 1995. [42] H. C. Andrews and B. R. Hunt, Digital Image Restoration. Englewood Cliffs, New Jersey: Prentice Hall, 1977. [43] D. T. Kuan, A. A. Sawchuk, T. C. Strand, and P. Chavel, “Adaptive noise smoothing filter for images with signal-dependent noise,” IEEE Trans. Pattern Anal. Machine Intelligence, vol. 7, pp. 165—177, 1985. [44] R. D. Nowak and R. G. Baraniuk, “Wavelet-domain filtering for photon imaging systems,” Proc. SPIE, Wavelet Applications in Signal and Image Processing V, vol. 3169, pp. pp. 55—66, August 1997. [45] R. Nowak, R. Hellman, D. Nowak, and R. Baraniuk, “Wavelet domain filtering for nuclear medicine imaging,” in Proc. IEEE Med. Imaging Conf., pp. 279—290, 1996. [46] J.—C. Pesquet, H. Krim, and E. Hamman, “Bayesian approach to best basis selection,” in IEEE Int. Conf. on Acoust, Speech, Signal Proc. — ICASSP ’96, (Atlanta), pp. 2634—2637, 1996. [47] E. P. Simoncelli and E. H. Adelson, “Noise removal via Bayesian wavelet coring,” in IEEE Int. Conf. on Image Proc. — I CIP 1996, (Switzerland), September 1996. [48] N. L. Johnson, S. Kotz, and A. W. Kemp, Univariate Discrete Distributions. New York: John Wiley and Sons, 1992. [49] L. L. Scharf, Statistical Signal Processing. Detection, Estimation, an Time Series Analysis. Reading, MA: Addison-Wesley, 1991. [50] B. Mandelbrot, “Intermittant turbulence in self-similar cascades: Divergence of high moments and dimensions of the carrier,” J. Fluid Mech., 1974. 187 [51] S. Lovejoy and D. Schertzer, Multifractals in rain. Cambridge, U.K.: Cambridge Press, 1995. [52] C. J. G. Evertsz and B. B. Mandelbrot, Multifractal measures. New York: Springer-Verlag, 1992. [53] R. D. Mauldin, W. D. Sudderth, and S. C. Williams, “Polya trees and random distributions,” Ann. Stat, vol. 20, pp. 1203—1221, 1992. [54] M. Lavine, “Some aspects of Polya tree distributions for statistical modeling,” Ann. Stat, vol. 20, pp. 1222—1235, 1992. [55] S. LoPresto, K. Ramchandran, and M. T. Orchard, “Image coding based on mix- ture modeling of wavelet coefficients and a fast estimation-quantization frame- work,” in Data Compression Conference ’97, (Snowbird, Utah), pp. 221—230, 1997. [56] E. D. Kolaczyk, “Bayesian multi-scale models for Poisson processes,” Technical Report 468, Department of Statistics, University of Chicago, 1998. [57] P. Diaconis and B. Efron, “Computer-intensive methods in statistics,” Scientific American, pp. 116—129, May 1983. [58] A. M. Zoubir and B. Boashash, “The bootstrap and its application in signal processing,” Signal Processing Magazine, vol. 15, no. 1, pp. 56—76, 1998. [59] R. D. Nowak and R. G. Baraniuk, “Wavelet-based filtering for photon imaging systems,” IEEE Trans. Image Processing, submitted April 1997. [60] E. Simoncelli, W. Freeman, E. Adelson, and D. Heeger, “Shiftable multiscale transforms,” IEEE Trans. Inform. Theory, vol. 38, pp. 587-607, 1992. [61] R. Coifman and D. Donoho, “Translation invariant de—noising,” in Lecture Notes in Statistics: Wavelets and Statistics, vol. New York: Springer-Verlag, pp. 125- 150, 1995. [62] M. Lang, H. Guo, J. E. Odegard, C. S. Burrus, and R. O. Wells, “Noise reduc- tion using an undecimated discrete wavelet transform,” IEEE Signal Processing Letters, vol. 3, no. 1, pp. 10—12, 1996. [63] J .-C. Pesquet, H. Krim, and H. Carfantan, “Time-invariant orthonormal wavelet representations,” tSP, vol. 44, no. 8, pp. 1964—1970, 1996. 188 v]. r"' 1" .l. .V’de‘aw‘| g.» 0 fl. 1 ;1 .(l ban ‘ .29” ‘H will '5’, ' “1'3. fink ., . i l; . “M. 366108" .. [v .02.! ”J." ‘-.'.. 1..” H. "w ”Pr/34W ‘0 ,' . ,, 11"31‘70') INC?" ‘ . .. ‘ . o D 'd,’ ‘1 ‘. , " ‘1'“8‘ 1‘ . -: IWn‘xfl “ ;, ._1.8 ”If V, s; M.) at! std“ __ I u .1 7‘ ' .- r y . l 1:. ‘1] (Mu .I ‘ k . , ”)5. .1 ,1 V ,7 , .' 1 - > -"-, lnt‘b'z fl . . . . 1 '. 41 “I ;.ll!_+1-1I . 1 , v '- I‘ am > ....1 3331'. 1’; 1_’ . . '1." V 11'1": V l '1'.) ‘3'] I, J! ~‘ , .. , .‘ ,.- -l 1 ‘1‘ -: ‘9 1.1.35! ”7 '. 1‘ .1 7:111'11'! ."~ ‘_)]-"'¢-| 1].:1! «I 11 ,I'I n , “5’ :..“\I1’. ‘11,, 1'11 l "I‘ll ‘rl Iw‘ .Il, '~= ' .1111.“ ‘ . l “'21 .111, gul'rw’vr‘ 1 v-"'.’ 7.": m11n1arjcr'l’,“ .rvliuw «.1 ‘1 i . ~..- 1 e . ”..me 41.1. 11:10 gf 1'. , ~ r ‘ Ho, 1. wh-lumu, _ .v-t‘IJl 1.1 {)1 m1 .I .01} ,E ._v ’ l Igfl‘n‘gam‘i 111mg: . 51311 I 4 labia-wInna-1111.111?!”tam'wzxwmi'?‘ ,- . ‘ ' 1 :n :14")! .ll. ' 51.4}: 'ITU' ‘ "' .0] .. . in». "' I . ... . .,. 6'12 ‘ . . - .vv . - _‘ , . , {_QS‘J‘A'}. In.“ _. A _—,L - ' [64] R. Nowak, “Shift invariant wavelet-based statistical models and 1/f processes,” in Proc. IEEE Digital Signal Processing Workshop, paper no.83, Bryce Canyon, UT, 1998. [65] A. P. Pentland, “Fractal-based description of natural scenes,” IEEE Trans. Pat- tern Anal. Machine Intell., pp. 661—674, July 1984. [66] B. J. West and M. F. Shlesinger, “On the ubiquity of 1 / f noise,” Intl. J. Modern Physics (B), vol. 3, no. 6, pp. 795—819, 1989. [67] G. W. Wornell, Signal processing with fmctals: a wavelet-based approach. Upper Saddle River, NJ: Prentice-Hall, 1996. [68] I. Daubechies, Ten Lectures on Wavelets. Philidelphia: SIAM CBMS-NSF Series in Applied Mathematics, no. 61, 1992. [69] J. A. Sorenson and M. E. Phelps, Physics in Nuclear Medicine. New York: Grune & Stratton, 1987. [70] I. M. Gelfand and S. G. Gindikin, eds., Mathematical Problems of Tomography. Rhode Island: American Mathematical Society, 1990. [71] B. V. Jackson, P. L. Hick, M. Kojima, and A. Yokobe, “Heliospheric tomogra- phy using interplanetary scintillation observations,” in Proc. of the E05 XXI meeting, paper no. 15.02, The Hague, Netherlands, 610 May, 1996. [72] X. Descombes, F. Kruggel, and D. Y. von Cramon, “Spatio-temporal FMRI analysis using Markov random fields,” IEEE Trans. Medical Imaging, vol. 17, no. 6, pp. 1028-1039, 1998. [73] X. Hu, T. H. Le, T. Parrish, and P. Erhard, “Retrospective estimation and correlation of physiological fluctuation in functional MRI,” Magn. Reason. Med., vol. 34, pp. 201—212, 1995. [74] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, N .J.: Prentice Hall, 1989. [75] F. Natterer, The Mathematics of Computerized Tomography. Stuttgart: John Wiley & Sons Ltd and B G Teubner, 1986. [76] T. F. Budinger and G. T. Gullberg, “Transverse section reconstruction of gamma- ray emitting radionuclides in patients,” Reconstruction Tomography in Diagnos- tic Radiology and Nuclear Medicine, University Park Press, Baltimore, MD, 1976. 189 [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. New York: IEEE Press, 1988. M. Bhatia, W. C. Karl, and A. S. Willsky, “Tomographic reconstruction and es- timation based on multiscale natural-pixe base,” IEEE Trans. Image Processing, vol. 6, no. 3, pp. 463—477, 1997. J. M. Links, R. W. Jeremy, S. M. Dyer, T. L. Frank, and L. C. Becker, “Wiener filtering improves quantification of regional myocardial perfusion with thallium- 201 SPECT,” J. Nuclear Medicine, vol. 31, pp. 1230—1236, 1990. T. Hebert and R. Leahy, “A generalized EM algorithm for 3-d Bayesian recon- struction from Poisson data using Gibbs priors,” IEEE Trans. Medical Imaging, vol. 8, no. 2, pp. 194—202, 1989. S. S. Saquib, C. A. Bouman, and K. Sauer, “A non-homogeneous MRF model for multiresolution bayesian estimation,” IEEE Int’l Conf. on Image Proc, 1996. M. H. Buonocore, W. R. Brody, and A. Macovski, “A natural pixel decompo- sition for two-dimensional image reconstruction,” IEEE Trans. Biomed. Eng., vol. BME—28, no. 2, pp. 69—78, 1981. J. DeStefano and T. Olson, “Wavelet localization of the radon transform in even dimensions,” in IEEE-SP Int. Symp. T ime-Frequency and T ime—Scale Anal, pp. 137—140, Oct. 1992. B. Sahiner and A. E. Yagle, “Limited angle tomography using the wavelet trans- form,” in IEEE Nucler Science Symp. and Medical Imaging Conf., pp. 219—222, Oct. 1993. Z. Wu, “MAP image reconstruction using wavelet decomposition,” Lec Notes Comput. Sci., vol. 687, pp. 354—371, 1993. F. Rashid-Farrokhi, K. J. R. Liu, C. A. Berenstein, and D. Walnut, “Wavelet- based multiresolution local tomographiy,” IEEE Trans. on Image Processing, vol. 6, pp. 1412—1430, Oct. 1997. I. S. Gradshteyn and I. M. Ryzhik, Table of Integrals, Series, and Products. New York: Academic Press, 1965 fourth edition. A. Papoulis, The Fourier Integral and its Applications. New York: McGraw-Hill, 1962. 190