LIBRARY Michigan State Unlverslty PLACE IN RETURN BOX to "move this checkout from your ncotd. TO AVOID FINES return on or baton data duo. DATE DUE DATE DUE DATE DUE MSU IIAn Nflrmuttvo maul/Equal Opponunlty Infiltwon Wanna-9.1 _ . _ 7 ‘__.i —___..__.———-_,..._. OBJECT MATCHING USING DEFORMABLE TEMPLATES By Yu Zhong A DISSERTATION Submitted tO Michigan State University in partial fulfillment Of the requirements for the degree Of DOCTOR OF PHILOSOPHY Department Of Computer Science June 24, 1997 ABSTRACT OBJECT MATCHING USING DEFORMABLE TEMPLATES By Yu Zhong We have proposed and implemented a general object localization and retrieval scheme based on object Shape using deformable templates. Prior knowledge of an object Shape is described by a hand-drawn prototype template which consists of the representative contour / edges. The shape variations in an object class are achieved us- ing a set of probabilistic deformation transformations on the template. The deformed shape template then interacts with the input image via a directional edge potential field calculated from the salient edge features. A Bayesian scheme, which is based on the prior knowledge and the edge information in the input image, is employed to find a match between the deformed template and objects in the image. The scheme is invariant to location, rotation, and moderate scale changes of the objects. We have investigated three different deformation transform basis functions, namely, the trigonometric basis in the 2D domain, the spline representation, and the wavelet basis. We address the advantages and disadvantages of these deformation basis functions. A coarse-to-fine algorithm is implemented for efficient and automatic object local- ization. We have successfully applied the deformable template matching algorithm to digital library retrieval tasks, including a hierarchical Shape—based retrieval system for a trademark image database and a two-stage retrieval system using object color, tex- ture, and shape. We have also applied the deformable template matching algorithm to track objects in image sequences, where Shape and gradient information, combined with the consistency between corresponding object regions throughout the sequence, and the inter-frame motion are used to track the boundary of moving objects. Future work on this tOpic could be directed on the following topics: (i) learning the template and its inherent variations from representative training samples; and (ii) annotating and indexing image databases using deformable templates. To My Parents Xinmin and Qiyu iv ACKNOWLEDGMENTS I would like to acknowledge all the people who have assisted me during the years of my graduate study at Michigan State University. I am the most grateful to my advisor, Dr. Anil K. Jain, for both his professional and personal advice and guidance. He has provided me with numerous valuable ideas, insights, and comments. He has also been very understanding and supportive. I am very fortunate to have him as my advisor. I would also like to thank the other members of my dissertation committee. I would like to thank Dr. Sridhar Lakshmanan from the University of Michigan-Dearborn for his many insightful discussions. He has also exposed me to a rich literature of deformable template—related work. Dr. John J. Weng has always been available to me to discuss ideas and concepts in computer vision and learning. Dr. V. Mandrekar gave me many suggestions on stochastic processes, and Markov Fields. Dr. Eric Torng has helped me to appreciate the beauty underlying the complexity of computing theory. Besides the committee members, I would like to thank our lab director, Dr. George C. Stockman, who has been very supportive over the years. My sincere thanks go to all the members of my family. I am very grateful to my parents, Xinmin and Qiyu, for their never-fading love, care, understanding, and V encouragement. I could have accomplished nothing without their love and support. I own my life and every achievement to them. I would like to dedicate this dissertation to them. I would also like to thank my husband Yuntao and my baby boy Richard, who make my life even more joyful. I have learned to appreciate and been fascinated by the young life through Richard’s growth. My fellow students and colleagues at the MSU PRIP lab have provided help and moral support throughout my stay there. I would like to thank them for their interests and concerns, especially Jinlong Chen, Shaoyun Chen, Yao Chen, Yuntao Cui, Chitra Dorai, Hansye Dulimarta, Ling Hong, Sally Howden, Qian Huang, Wey Shiuan Hwang, Kelle Karu, Gongjun Li, Yonghong Li, Jianchang Mao, Aniati Murni, Sharathcha Pankanti, Nalini Ratha, Jarle Strand, Dan Swets, Oivind Due Trier, Aditya Vailaya, Marilyn Wulfekuhler, Bin Yu, and especially the lab managers Lisa Lees and Karissa Miller. I would also like to thank Dr. Alok Gupta and Dr. Marie- Pierre Jolly of Siemens Corporate Research Lab, Princeton for their support of my research. My special thanks go to Scott Connell for proofreading the draft of this disserta- tion. I would also like to thank Cathy Davison, Lora Mae Higbee, and Linda Moore for their administrative assistance. vi TABLE OF CONTENTS LIST OF FIGURES ix LIST OF TABLES xv 1 Introduction 1 1.1 Problem Statement .............................. 3 1.2 Motivations and Challenges ......................... 3 1.3 Our Approach ................................. 7 1.4 Relationship to Existing Approaches .................... 9 1.5 Contributions ................................. 11 1.6 Dissertation Overview ............................ 13 2 Literature Review 15 2.1 Rigid Template Matching .......................... 17 2.1.1 Correlation-based Matching ........................ 17 2.1.2 Hough Transform .............................. 21 2.2 Deformable Models .............................. 27 2.2.1 Free-form Deformation Models ...................... 29 2.2.2 Parametric Deformation Models ...................... 39 2.3 Discussion ................................... 53 3 Unifying Deformable Models in a Bayesian Framework 57 3.1 Bayes’ Theorem ................................ 58 3.2 Bayesian Formulation for Deformable Models ............... 59 3.2.1 Free-form Deformable Models ....................... 61 3.2.2 Analytical Form-based Parametric Deformation Models ......... 62 3.2.3 Prototype-based Parametric Deformation Models ............ 64 3.3 Discussion ................................... 65 4 A Shape-based Deformation Model 67 4.1 Representation of the Prototype Template ................. 69 4.2 Deformation Transformations ........................ 69 4.2.1 Two-Dimensional Trigonometric Basis .................. 70 4.2.2 Deformation Transform Using Spline Representation .......... 74 4.2.3 Deformation Using Wavelet Transforms ................. 79 4.2.4 Comparison of the Three Deformation Transforms ............ 87 4.2.5 A Probabilistic Model of Deformation .................. 87 4.3 Bayesian Formulation and Objective Function ............... 90 vii 4.3.1 Prior Distribution ............................. 91 4.3.2 Likelihood .................................. 92 4.3.3 Posterior Probability Density ....................... 96 4.3.4 Objective Function ............................. 97 4.4 Discussion ................................... 98 5 Image Segmentation and Object 'ITacking 108 5.1 Preprocessing ................................. 109 5.2 Image Segmentation ............................. 111 5.3 Object Tracking ................................ 114 5.3.1 Tracking Criteria .............................. 116 5.3.2 Objective Function ............................. 122 5.3.3 Tracking Algorithm ............................. 124 5.3.4 Experimental Results ............................ 126 5.3.5 Summary .................................. 130 6 Multi-resolution Algorithm for Localization 132 6.1 Multiresolution Algorithm .......................... 133 6.2 Experimental Results ............................. 136 7 Retrieval From Image Databases 148 7.1 A Shape-based Retrieval System for Trademark Image Databases . . . . 151 7.1.1 Browser Using Simple Shape Features .................. 152 7.1.2 Refinement Using Deformable Template Matching ............ 157 7.1.3 Experimental Results ............................ 158 7.1.4 Machine Perception versus Human Perception .............. 163 7.1.5 Summary .................................. 166 7.2 Image Database Retrieval Using Color, Texture and Shape ........ 168 7.2.1 Matching Using Color and Texture .................... 170 7.2.2 Integrating Texture, Color and Shape ................... 178 7.2.3 Experimental Results ............................ 179 7.2.4 Discussion .................................. 183 7.3 Summary ................................... 187 8 Summary and Future Work 190 8.1 Learning in Deformable Template Matching ................ 191 8.2 Image Database Annotation and Indexing ................. 193 8.3 Incorporating Region Information ...................... 194 8.4 Shape Matching in the Compressed Domain ................ 195 viii 1.1 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 LIST OF FIGURES Deformable template matching. (a) a prototype template (bitmap of con- tour) of a saxophone, (b) an image (of a CD cover) containing a sax- ophone to be matched with the template in (a). ............ An overview of the template matching techniques .............. An example of rigid template matching. Figs. (a), (b) and (c) can be successfully matched to the rigid template using translation, scaling, and rotation. ............................... An example of correlation-based matching. ................ Hough Transform: From image Space to parameter space. (a) A line in the image Space (a: — y) described by parameters mo, co: y = co + moat; (b) transform two points (x1,y1) and ($2,312) on the line in (a) in the (m — c) parameter Space. ........................ Hough Transform: detection of a straight line. (a) A line in the image Space described by the equation (y = 1.322: — 0.924); (b) votes in the parameter space. The bin of the maximum count corresponds to parameter values (1.32, 0.93) with a precision of 0.01. ......... Constructing the R-table in the Generalized Hough transform ....... Hough Transform: detection of an arbitrary shape. (a) an irregular shape; (b) an input image; (c) votes in the (cc—y) translation parameter space. Scale and orientation are fixed for display purpose. ((1) the detection corresponding to the maximum vote. .................. An example of deformable template matching. .............. Overview of deformable template models ................... Image segmentation using the active contour approach. The potential is defined inversely proportional to the image gradient. (a) Input image: the snake is initialized around the object of interest. (b) Intermediate image: the snake is attracted to the salient edges by image gradient. (c) Final configuration: the snake converges to the object boundary of high image gradient. ........................... The planar graph constructed from the input image. ........... Synthetic images showing how the performance of the “ratio region” algo- rithm can degrade if objects with small area are desired. ....... Image segmentation using the “Ratio Region” approach [34] ........ Examples of elastic matching. The reference shapes (left) are stretched onto some unknown shapes (right) iteratively by “forces” derived from local pattern matches [21]. ....................... ix 16 18 19 22 23 25 26 28 30 32 35 35 38 2.15 Runway detection using parametric deformable templates [87]. (a) ini- tialization of the template; (b) the detected runway boundary using the region homogeneity assumption after the initialization of (a). (c) initialization of the template; ((1) the detected runway boundary using the gradient information after the initialization of (c) .......... 2.16 Vehicle segmentation using a polygonal template [79]. (a) the polygonal vehicle template and its parameterization; (b) one segmentation result using the template and motion and gradient informations. ...... 2.17 A polygonal hand template [59]. ...................... 2.18 Example of the eigenmodes [117]. (a) an upright tree shape; (b)-(e) The lower 18 modes (black outline) for the shape in (a). (b) and (c) are the translation modes, (d) is the rotation mode. The rest are nonrigid variations. ................................ 3.1 Parameterization of the runway boundary template. The runway bound- ary template consists of two parallel straight lines y = k(a: — c1) and y = k(m — c2), with parameters It for the slope and cl and oz for the intercepts. ................................ 4.1 Basis functions for the deformation displacement field. .......... 4.2 Deformation of a bird template using the 2D trigonometric basis. (a) the bird template with no deformation; (b) deformed bird template using randomly generated deformation parameter values. From left to right, the interpolation level (M, N) equals 1, 2, and 3, respectively. 4.3 Deformations using the B-spline representation. (a) The spline represen- tation of the prototype. The red dots are the control points. (b) De- formed templates obtained by randomly displacing the control points in (a) according to an 222'. d. Gaussian distribution. .......... 4.4 B-spline wavelet basis. They are generated by shifting and dilating a mother wavelet function. ......................... 4.5 Deformations using the B-spline wavelet basis; (a) prototype template; (b) deforming the template in (a) using the B-spline wavelet basis. Two resolution levels are used which account for a total of 24 defor- mation parameters. The deformation parameter values are randomly generated using a Gaussian distribution. ................ 4.6 Edge potential field of the input image in Fig. 1.1(b). The potential at a pixel is displayed as the greyscale value. The larger the gray scale value, the larger the potential value. The yellow pixels denote the edge map of the input image. The potential at locations far away from the edge pixels is high, and the potentials at locations near the edge pixels is low. ................................... 43 73 85 4.7 Directional edge potential field. The deformed template (in solid red) is put in the edge potential field created by the edgemap (in solid yellow). Let (x, y) be a point on the template. Let (278,313) be the nearest edge pixel to (x, y) in the image field. Then the directional potential at template pixel (1', y) is defined as the edge potential at (2:, y) (caused by (we, ye) according to Eq. (4.35) multiplied by the cosine of 5(17, y), the tangent angle between (:23, y) and (mmye). The directional edge potential of the template is the average directional edge potential of all the template pixels. ......................... 4.8 Comparison of the affine transform and nonrigid deformation transform. (a) deforming a rectangle using the affine transform; (b) deform- ing a rectangle using the nonrigid deformation transform defined in Eq. (4.1). ................................. 4.9 Deformed templates derived from a prototype. (a) the prototype; (b) randomly generated deformed templates. ............... 4.10 Ranking of the deformed templates derived from the prototype. They are arranged with decreasing similarity from top to bottom, left to right, using deformable template matching method. ............. 4.11 Ranking of the deformed templates using invariant moments. They are arranged with decreasing similarity from top to bottom, left to right, using invariant moments. ........................ 4.12 Locating a fish using the deformable templates with different deformation basis. (a) Using 2D trigonometric basis. M is the number of approxi- mation levels, ndf is the number of basis used (number of degree’s of freedom). (b) B—spline basis. N is the number of control points used. (c) B—spline wavelet basis. M is the number of resolution levels. 5.1 Preprocessing. (a) input image (256 x 256); (b) edge map using the Canny edge detector; (c) magnitude of the edge potential field; ((1) direction field; .................................... 5.2 Localization of a saxophone using manually chosen initial template posi- tion. The input image size is 285 x 286. (a) The prototype template; (b) input image; (0) initial position, L = 0.603; (d) 10 iterations, L = 0.327; (e) 16 iterations, L = 0.186; (f) 30 iterations, L = 0.123. 5.3 Localization of a fish using manually chosen initial template position. Im- age size is 256 x 256. (a) Prototype template; (b) input image; (0) initial position, L = 0.432; (d) 4 iterations, L = 0.308; (e) 7 iterations, L = 0.221; (f) 40 iterations, L = 0.157. ................ 5.4 Sensitivity of the matching algorithm to the initial position. ....... 5.5 Computing color/greyscale distance. The detected object in the first frame is used as the reference object. For frame i+1, color/greyscale distance is computed for the band (shaded region) around the detected object boundary in frame 2'. ........................... xi 95 100 102 103 104 105 110 112 113 114 120 5.6 Computing color distance for an input frame. (a) The detected object in the first frame is used as the reference object (image size: 288 x 352). (b) The tracking result for the third frame; (c) The fourth input frame; ((1) The computed color distance map (negated) for the fourth frame. The greyscale value is inversely related to the color distance. The bright region in the map indicates a potential object with a matching color composition. ............................ 5.7 Integrating image gradient, color consistency, and inter-frame motion. (a) The image gradient for the input frame in Fig. 5.6(c); (b) The color distant map (negated) for this frame; (c) The inter-frame motion for this frame; ((1) The integrated image potential field using Eq. 5. 3. 5.8 Tracking a heart in a medical image (MR1) sequence (each frame size is 64 x 64) using the spline representation. (a) Input image sequence; (b) the image gradient of the input sequence; (c) template initialization in the first frame; ((1) tracking results for the sequence: the deformable template is able to capture the contractions and the expansions as the heart beats. ................................ 5.9 Tracking a human face in an image sequence (each frame size is 120 x 160). (a) template initialization in the first frame; (b) tracking results for the sequence. ................................. 5.10 Tracking a human hand in a weather forecast TV program (each frame size is 288 x 352). ............................ 6.1 Locating a fish using the coarse-to-fine multiresolution algorithm. From left to right: edge potential field, output. (a) coarse level: the template and potential field are subsampled 1 : 4 in each dimension, a few configurations are obtained and passed to the next level (finer); (b) finer level: the template and potential field are subsampled 1 : 2 in a: and y direction, 6 configurations are obtained and passed to the next level (finest); (c) finest level, 1 configuration is obtained L = 0.22. 6.2 Automatic localization of desired objects using the coarse-to-fine multires- olution matching. (a) retrieval of a guitar using multiresolution de- formable template matching (320 x 304), L = 0.186; (b) retrieval of a star using multiresolution deformable template matching (256 x 256), L = 0.157; (From top to bottom: hand-drawn template, input image, retrieved deformed template at the coarsest level, retrieved deformed template at the finest level.) ...................... 6.3 Automatic localization of “seeds” using the coarse-to—fine multiresolution matching. (a) a “seed” template; (b) the input image of the cross- section of an orange (453 x 436); (c) retrieved objects when the objec- tive function is thresholded at 0.160. .................. 6.4 Automatic localization of microscopic forms using the coarse-to-fine mul- tiresolution algorithm. (a) a morphotype template; (b) two localized forms (L = 0.18, 0.21). .......................... xii 125 138 143 6.5 6.6 6.7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 7.10 7.11 7.12 7.13 7.14 Automatic localization of human hand using coarse-to-fine algorithm. (a) the hand template; (b) input images which contain a hand (121 x 160); (c) retrieved hands (L E [0.191,0.267]). ................ Applying a tower template. (a) the template; (b) retrieval of tower 1 (280 x 280), L = 0.227; (c) retrieval of tower 2 (280 x 280), L = 0.243. What if the template is not present in the image? (a) applying a fish template to a saxophone image, L = 0.587 (L = 0.142 for the sax- ophone template); (b) applying the fish template to a guitar image, L = 0.230 (L = 0.142 for the guitar template); (c) applying a sax- ophone template to a fish image, L = 0.430 (L = 0.170 for the fish template). ................................ The hierarchical trademark database retrieval model. ........... Some images from the trademark image database .............. Examples of hand drawn query trademarks. ................ Examples of hand drawn query trademark templates. ........... Database pruning results for the hand-drawn bull sketch as shown in Fig. 7.3: the top 10 retrievals given in the increasing order of dis- similarity. ................................. Database pruning results for the hand drawn kangaroo Shown in Fig. 7.3. The top 10 retrievals are given in the increasing order of dissimilarity. Deformable template matching; (a) initial position of the bull template overlaid on the edge map of a bull logo, (b) final match for the bull logo. .................................... Deformable template matching; (a) initial position of the boomerang tem- plate overlaid on the edge map of a boomerang logo using the gener- alized Hough transform, (b) final match for the boomerang logo. Deformable template matching; (a) initial position of the bear template overlaid on the edge map of a bear logo using the generalized Hough transform, (b) final match for the bear logo. .............. Deformable template matching; (a) initial position of the deer template overlaid on the edge map of a deer logo using the generalized Hough transform, (b) final match for the deer logo. ............. Deformable template matching; (a) initial position of the kangaroo tem- plate overlaid on the edge map of a kangaroo logo using the generalized Hough transform, (b) final match for the kangaroo logo. ....... Deformable template matching result of the boomerang image using the bull template. ............................... Human perception versus the deformable template matching algorithm. (a) a hand-drawn trademark; (b) the tOp retrievals from a logo image database for the query in (a) by human subjects. The first number under each retrieved image is the number of ballots from the five human respondents; the second number is the dissimilarity measure given by the deformable matching algorithm .................... Diagram of the image retrieval system using color, texture, and shape. xiii 145 161 162 163 165 166 167 169 7.15 Some sample images from the database. They have been “scaled” for display purposes. ............................. 7.16 Interface for specifying reference texture/color ................ 7.17 Features extracted from the block DCT coefficients. (a) 250 x 384 input color image; (b) DCT features for the Y frame (intensity); (c) DCT features for the Cr frame (chrominance); (d) DCT features for the Cb frame (chrominance); .......................... 7.18 Retrieval based on color. (a) query example is given by the rectangular region; (b) top-4 retrieved images from the database which contain blocks of similar color. .......................... 7.19 Retrieval based on texture. (a) query example is specified by the rectan- gular region; (b) matching macroblocks are marked with crosses in the query image; (c) other nine retrieved images from the database which contain regions of similar texture. ................... 7.20 Retrieval based on color and shape. (a) query color example is speci- fied by the rectangular region; (b) sketch for the Shape; (c) matching macroblocks are marked with crosses in the query image; ((1) retrieved shapes. .................................. 7.21 Retrieval based on color, texture, and shape. (a) query region example is given by the rectangular region; (b) sketch for the shape; (0) retrieved shape. ................................... xiv 180 182 189 2.1 4.1 7.1 7.2 LIST OF TABLES A taxonomy of the deformable template matching approaches. ..... Comparison of the three deformation schemes ................ Dissimilarity values for the five query images when the deformable tem- plate matching is applied to the top 10 retrieved images from the fast pruning stage. .............................. Performance of the two-stage algorithm; the database contains 592 color Images .................................... XV 183 Chapter 1 Introduction The ultimate goal of computer vision is to Simulate the human perception and in- terpretation of the world around us. Given an image of a scene, in terms of a pixel array of different greyscales or colors, it is extremely difficult to locate and recognize different objects present in it. In spite of numerous efforts in the computer vision field, very little progress has been made in recent years [77, 72]. A general solution to computer vision problems is not envisioned, at least, in the foreseeable future. Suc- cessful computer vision (machine vision) applications are limited to specific domains and for specific applications [101]. One major difficulty in image processing and object recognition tasks is how to integrate and interpret the diverse local image cues (gradient, texture, intensity, etc.) [18, 49]. The bottom-up methods often fail due to poor-contrast, occlusion, adverse viewing conditions, and noise. A model or structure-free interpretation is doomed by the underconstrained nature of the problem. Imperfect image data can be augmented with extrinsic information such as geometrical models of the objects likely to be 1 2 present in the scene. The object shape is one of the most important characteristics which distinguishes an object from the background in an image [122]. The object Shape model can be used to complete the information provided by local image features, such as gray level, texture, or color. Furthermore, the global shape information is more robust than local features in the presence of image noise and poor imaging conditions. However, it Should be pointed out that it is more difficult to define a shape mathematically than other visual cues such as color and texture. The interpretation of an object shape is often subjective, which depends not only on the geometrical pixel patterns, but also on the context and the previous experience and knowledge of the observer. The geometrical Shape constraints can vary from local and generic to global and specific, taking various forms. For example, they can incorporate the smoothness or stretchness constraints, or, they can be specified using a hand-crafted parametric or tabular form. Such model information is determined based on a Specific application of interest, and should be incorporated explicitly in an integrated and robust computer vision system. We take full advantage of the global shape of an object, and address the problem of locating and retrieving an object from a complex image using its 2D shape/ boundary information [76]. 1.1 Problem Statement The problem under investigation is stated as follows: given a hand-drawn template, which gives an inexact description of the salient or characteristic edge/boundary information of a 2D object of interest, how to use this template to locate and identify all the objects in the image which resemble this template? By object location and identification we mean that a description of the object boundary is given in terms of a deformed template. This match is also quantified by a numerical value which indicates the goodness of the match between the template and located object. Figure 1.1 illustrates the problem of interest, where Fig. 1.1(a) shows the hand drawn sketch of a saxophone and Fig. 1.1(b) shows an image which contains a saxophone which resembles the given template in shape. Note that while the Sketch resembles the saxophone in the CD cover image, there is also an observable discrepancy in the two shapes. 1.2 Motivations and Challenges The deformable object matching problem has wide applications in image processing and computer vision, including image/video database retrieval, object recognition and identification, image segmentation, and object tracking. In many of these appli- cations, an a priori shape information is available in the form of an inexact model of the object which needs to be matched to the objects present in the input image. We give a few examples in the following: (a) Figure 1.1: Deformable template matching. (a) a prototype template (bitmap of contour) of a saxophone, (b) an image (of a CD cover) containing a saxophone to be matched with the template in (a). Image segmentation When an approximate shape of the object to be segmented is available, we can apply the deformable template matching model to segment the object from the background. We place an approximate template in the vicinity of the object in the image. This coarse template is then attracted to the salient edges in the image in a similar manner as a “snake” [81] until the deformed template agrees with the object boundary. For example, in medical image segmentation, the general shape of an organ or a tissue of interest is often available, and we need to trace or segment the organ or tissue from a given image for diagnostic purposes. Content-based retrieval from image databases In an image database retrieval system, the user may provide a set of curves and ask to retrieve all database images which contain such a set of curves. An automatic content-based image retrieval 5 system [41, 55, 63, 64, 65, 127] should be able to search the database for the images which contain objects with similar characteristics as specified by the user. Shape features have already been incorporated in content-based database retrieval systems [41,74,138] Object tracking In object tracking, the shape of the object varies from frame to frame, but the inter-frame differences are small. The deformable template approach can be used to track the object due to its capability to incorporate both global structure information and variations in the shape class. Digital video encoding In very low bit rate digital video encoding, it is desirable to track an object in the sequence and encode it using its shape, texture, and motion. Such a coding can give enormous savings in storage and transmission. It also provides a convenient representation for searching, indexing and retrieval. The deformable template model provides a good solution to the representation and tracking problems. Object recognition and classification Such an approach can also be used to recognize and identify objects. A prototype template can be created for each object class. The identification and recognition can be performed by applying each prototype to the input image and identifying the object class whose prototype interprets the given object the best. The deformable template matching problem is challenging because of the following factors: 0 There are intrinsic variations in an object’s Shape. As has been said, “there 6 are no two leaves of the same shape”, so an object shape will have intrinsic variations. Object deformation is expected in most imaging applications be- cause of the varying imaging conditions, sensor noise, occlusion and imperfect segmentations. A solution to this problem should handle the shape variations in a meaningful way. The approach Should be general enough to handle shapes with different ap- pearances. Furthermore, the given template which describes the characteristic edge/boundary of the object of interest can be either open or closed, singly connected or multiply connected. For example, we want the same scheme to be able to locate hands using a hand template and a face sketch to locate a face. It should sensibly combine both the prior information about the shape and the input image information (likelihood) to draw an inference. The presence of the object of interest in the image is not known. If the object is indeed present, we do not know the number of its occurrences, its position, scale, and orientation. The objects of interest are not presegmented from the background. Comparing two segmented shapes is an easier problem: One can first align the two shapes in scale and orientation, and extract some shape features [56] such as moments [67], area, circularity, major axis orientation [111] eccentricity, etc., to see if the two are topologically close in the feature space. In our problem, object localization involves both a segmentation and a matching problem. As both 7 the segmentation and matching problems are well known to be difficult, this makes the template matching problem even more challenging. This fact along with the dimensionality of the pose parameter vector (position, orientation, scale) makes it a computationally difficult problem. 1.3 Our Approach We approach the problem of general object localization and identification as a pro- cess of matching a deformable template to the object boundary in an input image. The prior Shape information of the object of interest is specified as a sketch or bi- nary template. This prototype template is not parameterized, but it contains the edge/boundary information in the form of a bitmap. Deformed templates are ob— tained by applying parametric transforms to the prototype, and the variability in the template shape is achieved by imposing a probability distribution on the admissi- ble mappings. Among all such admissible transformations, the one that minimizes a Bayesian objective function is selected. The deformation transformation determines the set of deformed templates which can be obtained using the prototype, i.e., the variety of shapes that the deformable template can take. We would preferably like to use a small set of parameters to rep- resent a large class of deformations. It is desirable to use that deformation transform which approximates the shape classes well. We have investigated three different kinds of deformation transforms, namely, 0 a set of two-dimensional trigonometric basis functions with different frequency components, 0 a set of spline basis functions with local compact support, and o a set of wavelet basis functions with local compact support. These transformations can model a large set of Shape deformations, although each of the deformation basis functions has its own advantages and deficiencies. The objective function we try to minimize consists of two terms. The first term plays the role of a Bayesian data likelihood. This likelihood term is a potential energy linking the edge positions and gradient directions in the input image to the object boundary Specified by the deformed template. The second term corresponds to a Bayesian prior. This prior term penalizes the various deformations of the template —— large deviations from the prototype result in a large penalty. A match between the deformable template and the object in the input image requires minimization of the objective function with respect to the set of deformation and pose parameters. We minimize the objective function by iteratively updating the transformation parameters to alter the shape of the template so that the best match between the deformed template and the edges in the image is obtained. We use the gradient descent method because of its simplicity and efficiency compared to the stochastic global energy optimizers. A problem with the gradient descent method is that it needs a good initialization to avoid locking to local extremas. In most applications, the objective function is non-convex. Therefore, we have used a number of techniques to find good initializations. c We have used a multiresolution coarse-to—fine search algorithm. The coarse level 9 is targeted to find good candidates for the finer level match at a relatively low computational cost. At the coarse level, the inputs are subsampled for savings in computations and we use coarse and smooth energy fields to attract the deformable template to regions of interest and to avoid local extremas. At finer levels, the matching is performed only at the outputs from the previous coarser level. Finer energy fields are used for more accurate localization. 0 We have also used region information (inside the template) to find locations in an input image where the desired objects are likely to occur. The deformable templates are only initialized at locations with similar texture and/or color. Furthermore, we can compute the color and texture features directly from Dis- crete Cosine Compressed images. So our approach can be directly applied to JPEG images or the I-frames in MPEG videos. The low-cost localization processes help to reduce the computational cost of the au— tomatic localization/searching process using deformable template models. 1.4 Relationship to Existing Approaches We note that certain elements of this approach bear some resemblance to existing studies. These similarities are described below. Deformation model The idea of representing the deformation as probabilistic transformations on the prototype template is akin to the work of Grenander and his colleagues [4, 28, 97, 98], where such transformations are used to derive a set 10 of Object images from the “ideal” one. However, the use of a bitmap to represent the characteristic/salient curves of the object makes the modeling very general and flexible. Potential energy The use of potential functions to influence template deforma- tions towards salient image features is akin to the work in [81]. However, our work is different in that we relate the potential to the nearest input edge pixels, and this potential is further modified based on the angle between the template pixels and the closest point on the image edge. We also use additional edge direction information to suppress the adversary effects of noisy/ spurious edges. Multi-resolution coarse-to-fine localization We have used a multi-resolution coarse-to-fine algorithm to automatically locate objects of interest in a given image. At a coarse level, we use smoother potential fields, coarse step sizes, and subsampled templates/images which help to escape from local extremas and be attracted to the desired features. At finer levels we use finer potential fields and step sizes for accurate matching. This approach is akin to the classical multi-resolution algorithms and the more recent scale-space approaches. Combining region cues Texture and color are the most commonly used region information for matching, segmentation, and retrieval tasks. Numerous texture and color features are proposed in the literature. We have extracted texture / color features from Discrete Cosine Transform compressed images (J PEG and MPEG). 11 Object tracking Deformable contours such as snakes have been used to track fea- tures in image sequences. We have also applied our deformable shape model to image tracking problems. Because our deformable shape model is different from a snake in that it incorporates prior shape information, this model shows some advantages in tracking when weak gradient or partial occlusion is present. Instead of being at- tracted to spurious features, it tries to maintain its prior shape when the gradient is not strong enough or when object features are missing due to occlusion. 1 .5 Contributions As an effective way to integrate both the prior, structural shape knowledge and the local, pixel information, this dissertation exhibits potential applications of deformable template matching in image segmentation, localization, matching, retrieval from im- age/video databases, and object tracking. We have used theparadigm to successfully solve the following problems. 0 Given a rough contour of the object of interest, segment the object embedded in a complex background. 0 Automatically localize objects in images using a coarse-to—fine multiresolution matching scheme; 0 Retrieve images from image databases using hierarchical architectures which achieve both efficiency and accuracy; 0 Track objects in image sequences using the boundary information. 12 Our experimental results show that under this new paradigm, we can 0 match objects that are curved or polygonal, closed or open, simply-connected or multiply-connected; retrieve objects based on boundary information alone, even in complex images; localize objects independent of their location, orientation, Size, and number in the image; and Select an appropriate initialization of the template for computational efficiency. The primary contribution of this research is that it sensibly combines existing ideas along with new ones to provide a systematic paradigm for general object matching. This scheme can be applied to objects with different appearances. We do not need a new algorithm to apply to a new shape class; the shape template can be very flexible, open or closed, have a single component or multiple components. The other contributions of the dissertation include: o A probabilistic transform-based deformation model is used in a novel way; sev- eral deformation transforms are utilized. A new likelihood function based on the edge map is proposed which utilizes both the position and direction infor- mation for robust matching results; these two components are integrated in a Bayesian formulation. This model is very flexible and general in that we can easily design different likelihood or deformation transforms to serve different application requirements. 13 c We have addressed the problem of selecting an appropriate initialization of the template in two ways. A coarse-to-fine multiresolution algorithm is proposed to obtain a small set of plausible candidate poses at low resolutions which are then screened at a higher accuracy. When object region information (e.g., color and texture) is available, we use it to find locations with similar region characteristics and then perform the shape matching only at those locations. One attractive feature of this method is that the color and texture features can be computed directly from Discrete Cosine Transform (DCT) compressed images (JPEG or MPEG). A proper initialization of the deformable template avoids many unnecessary computations. 1.6 Dissertation Overview The rest of the dissertation is organized as follows. Chapter 2 gives a review and analysis of the previous work on template matching, and, in particular, the deformable template matching approach. The advantages and disadvantages of each of the approaches are addressed. Chapter 3 formulates the existing energy-based deformation methods in a Bayesian framework. The prior and likelihood for each of the methods (free-form active contour model, deformation model using parametric forms, and deformation model via parametric mappings) are discussed. It is shown that the correspond- ing MAP estimate is consistent with the solution to the energy minimization problem. 14 Chapter 4 proposes the shape-based general object matching method based on de- formable template matching. Chapter 5 addresses the applications to image segmentation and object tracking. Chapter 6 discusses the implementation issues for efficient localization. A multires- olution algorithm is proposed to automatically localize the desired objects in an image. Chapter 7 presents the applications of the deformable template matching method to digital library applications. Two image database retrieval systems are described. Chapter 8 summarizes the proposed approach, experimental results and limitations. It also gives an outline of the future work. Chapter 2 Literature Review Model-based shape matching is a well-known problem in the computer vision and im- age processing domain. Its applications include image segmentation, object matching and tracking, object recognition and interpretation, and image database retrieval. Early research in this area concentrated mainly on rigid shape matching, where the matched shapes were obtained by applying simple transformations such as transla- tion, rotation, scaling, and the affine transformation to the model template [26, 62]. The transformations are characterized by a set of global parameters, which cannot ef- fectively incorporate complex variations such as local deformation. Examples include correlation-based matching and the Hough transform [11, 39, 66, 104, 123]. Because of the rigidness of the above approaches, their utility is limited. In most applications, an exact model of the object is not available because of the variability in the imaging process and inherent within-class variabilities. Deformable template matching, which is receiving increasing attention, is more versatile and flexible in dealing with the deficiencies of rigid shape matching. It is a more powerful technique because of its 15 16 capability to deal with a variety of shape deformations and variations. In the following, we give a survey of the existing work in this field. We first briefly review the classical template matching methods, which are rigid in the sense that the appearance of an object of interest in the image is known accurately, that is, it is described exactly by a template. Two well known template matching methods, namely, the correlation-based method [2, 9] and the generalized Hough transform, are discussed. Then, we survey the more powerful deformable template matching meth- ods. We partition these methods into two classes: (i) free-form, and (ii) parametric, based on whether the deformed templates are parameterized or not. The parameter- ized deformable template matching methods are further classified as either analytical form-based or prototype-based, where the first class is identified by specification of an analytic form for the templates, and in the second class, a template instance is obtained by transforming (parametrically) a prototype template. An overview of the various template matching techniques is given in Fig. 2.1. [ Template Based Object Matching [ l Rigid Template Matching l [ Deformable Template Matching ] [Correlation BasedJ [ Hough Transform ] [ Free-form Deformable ] [ Parametric Deformable [ [ Analytic Form Based] [ Prototype Based [ Figure 2.1: An overview of the template matching techniques. 17 2.1 Rigid Template Matching When a template is compared to an object in a rigid way, a match is attempted either directly or indirectly, accounting for possible pose and scale changes. By rigid template matching we mean that the object shape can be obtained from the model template via a sequence of operations of translation, rotation, and scaling. An in- stance of such an example is given in Fig. 2.2 where, for the template fish on the left, successful matches are obtained between this template and the three fish on the right as indicated. When the pose and scale are known, the crosscorrelation between the aligned template and object gives a high matching score. Otherwise, when the pose parameters are unknown, the Hough transform provides a clever way to detect the presence of an object as well as an estimate of the pose. Both methods are examples of rigid template matching, which are typically used to detect image features (e.g., straight lines, corners, or objects) [46, 57, 114, 113]. 2.1.1 Correlation-based Matching Correlation-based matching [2, 9] is a simple filtering method [134] to detect a par- ticular Shape or object in an image. An object can be detected if its appearance is known accurately in terms of the template. Given a template t(x) and an image i(x), the crosscorrelation between i(x) and t(a:) at position y is defined as: R;t(y) = Z z'(X)t(x - x)- (2-1) 18 Template Figure 2.2: An example of rigid template matching. Figs. (a), (b) and (c) can be successfully matched to the rigid template using translation, scaling, and rotation. 19 The crosscorrelation is maximized when the portion of the image i overlapped with the template is identical to t, i.e., when there is a perfect match (Fig. 2.3). The idea of correlation-based matching is straightforward. The template is Shifted in the image field, and the crosscorrelation at each position is calculated. A match is reported at positions where the correlation exceeds a threshold value. In the example in Fig. 2.3, the given template is applied to an image with noise in the lower-right corner. The correlation at each position is computed. Ideally a peak in the correlation array indicates a position of good match, as the one in the upper—left corner. However, a “false” match is caused by the bright noisy pixel with a value 7. The “X”S in the correlation array indicate “not available”. Template Image Correlation 11000 742 11100 532 10100 218 00000 xxx 00007 xxx Hp—ly—d p—Ip—Lp—A pip—Lye; >4><><><>< ><><><><>< Figure 2.3: An example of correlation-based matching. The crosscorrelation array can be efficiently calculated using the correlation the- orem for Fourier transforms: Fan).- = (IL-(rt. I = Fa). T = Fm. (2.2) 20 where F denotes the Fourier transform operation, and T“ is the complex conjugate of T. This theorem says that point-by-point multiplication in the Fourier transform domain is equivalent to convolution in the spatial domain [90]. Because efficient algorithms for Fourier transforms are available [31], the correlation theorem enables costly correlations in the spatial domain to be computed efficiently in the transformed domain. The correlation array can be obtained by 1. taking the Fourier transforms of both the feature image and the template image, 2. performing multiplication in the frequency domain, 3. taking the inverse Fourier transform. A shortcoming of the correlation-based matching is that it requires that the tem- plate be very accurate, and it is sensitive to shape deformations including scale, orientation change, and partially offset features, etc. Nevertheless, it is still a very powerful tool for image matching because of its simplicity and computational effi- ciency, and has been used directly or indirectly in many applications. One successful instance is offered in the QBIC [102] system for content-based image database re- trieval, where correlation is performed on images of reduced resolutions to quickly locate shapes similar to the query sketch. A moderate amount of robustness to shape deformations is achieved by calculating the correlations at a low resolution. It is noted that the correlation-based matching is a special example of filtering operations in image processing. Filtering is a very general notion of transforming the image intensities in some way so as to enhance or suppress certain image features [5, 54, 111]. In a general sense, many difficult low level image matching and segmentation 21 tasks can be cast in a Similar scenario as the correlation-based feature or Object detection, where we need to find: 0 a set of filters which best captures the desired image features, and o appropriate orthogonal basis and transforms which guarantee the computational efficiency. However, the filtering technique is application-dependent and in most cases, it is very difficult to define good filters and to find the appropriate transform space. 2. 1 .2 Hough Transform An elegant and versatile technique to detect parameterized shapes (of object bound- aries) was first proposed by Hough [66]. It was later generalized by Ballard [11] to detect any arbitrary shape which can be represented in a tabular form [39, 104, 123]. Basically, the Hough method transforms points in the image (spatial) space into a parameter space. Note that each point in the parameter space determines a curve in the image plane, and each spatial feature corresponds to a curve in the parameter space by reversing the roles of the parameters and the spatial coordinates. So, given a parametric form for a family of curves or shapes which we expect to detect, and a collection of “interesting” image points, we increment, for each such point, the corre- sponding entries in the quantized parameter space. The count of each quantized bin in the parameter space is the number of image points that lie on the specific curve defined by the corresponding parameter values. The specified shape is detected by finding the peak(s) in the parameter space, or in other words, the set of parameter 22 values that best describe the given image points. In this way, a global evidence accu- mulation process of shape detection is transformed into a search for local peaks. Such an example is illustrated in Fig. 2.4, where a straight line y 2 most + co in the (a: — y) space is determined by parameters m0 and c0. Given a point (1:1, yl) on this line in the a: — y space, all the straight lines which pass through this point are described by y1 = mazl + c, where 1:1,y1 are fixed and m, c are variables. They correspond to a straight line c = y1 — rum, in the transformed parameter space (m — c) in (b), where y1,:1:1 are the parameters, and m, c are free variables. In particular, this line should pass through the point (m0,c0). With the same reasoning, another point (3:2,y2) on the line y = mx + c is also associated with a line c = y2 — 332m in the (m — 0) space, which also passes the point (mo, Co). In a similar manner, all points on the line y = mox+co in the a: —y space have an associated line in the parameter space (m—c). All these lines intersect at (m0, co) in the (m — c) plane. So the line y = more + Co can be detected by finding the peak at (m0, co) in the transformed space (m — c). y ‘ y=mox+Co CA c=y1 -x1m (x 21Y2) (x 11Y1) 1 p x (a) (b) Figure 2.4: Hough Transform: From image space to parameter space. (a) A line in the image space (:1: — y) described by parameters m0, c0: y = co + moat; (b) transform two points (2:1,y1) and (3:2, y2) on the line in (a) in the (m — c) parameter space. 23 Figure 2.5 illustrates an example of straight line detection using the Hough trans- form. Fig. 2.5 (a) gives an image which contains a line segment which is generated using the equation y = 1.32:1: — 0.924. We transform this image space to the param- eter space and obtain the vote array in (m — c) space (Fig. 2.5 (b)). The left—lower corner of the parameter array corresponds to (1.1, -1.35), the bins the discretized with a width of 0.01. The votes are shown inversely related to the greyscale value so that dark bins correspond to large votes. The bin of maximum vote is located at [22,42] (with respect to the lower-left corner), which corresponds to the parameter value (1.32, --0.93). This value predicts the equation of the line segment well. (a) (b) Figure 2.5: Hough 'IIansform: detection of a straight line. (a) A line in the image space described by the equation (y = 1.32:1: — 0.924); (b) votes in the parameter space. The bin of the maximum count corresponds to parameter values (1.32,0.93) with a precision of 0.01. The algorithm for detecting a general shape which cannot be represented by an analytical form using the generalized Hough transform (GHT) is given below. The general shape is represented by a tabular form (a table of the coordinates): 24 0 Calculate the R—table based on the template: each entry is indexed by the tangent angle 45 for points on the template; each entry contains a list of (rij, aij) values corresponding to ¢,, where each element in the list corresponds to a point on the template whose tangent angle has a value (25,-, and rij is the distance of this point to the centroid of the template (arc, ye), aij is the slope angle of the line connecting this point and the centroid, as described in Fig. 2.6. o Initialize the accumulator array to zero: A($cmin : xcmaxa ycrnin : ycmaxa 3min : 81710.12) 6min : 0max)- The parameters are center position (2:, y), scale s, and rotation angle 9. o For each edge point x = (3:, y) in the image, do the following: - Compute tangent angle ¢(x). - Look at the ¢(x) entry in the R-table and obtain a list of (r, 0) pairs. - Calculate possible center, scale and rotation angle for every pair (1', a) in the list and for every combination of (s, 0): (330, y) = (2:, y) + r(¢)s(cos(a + 9), sin(a + 6)) - Increment the accumulator array ({L‘C, ya, 3, 6). 0 Possible shape candidates are given by the maxima in array A. Figure 2.7 shows an example of object detection using the GHT. Fig. 2.7 (a) Shows the template, (b) in an input image, and (c) Shows the votes in the translation parameter space (a: — y). The vote at each pixel in (c) is the vote when the centroid 25 of the template is placed at the pixel position, and is proportional to the darkness. Fig. 2.7 (d) shows the result where the detection (in green) corresponding to the maximum vote is overlapped on the input edgemap (in red). (x.y) Figure 2.6: Constructing the R-table in the Generalized Hough transform. One of the advantages of the Hough Transform (HT) method is that it is relatively insensitive to noise, minor occlusions, or gaps. But the computational requirement of HT is rather high. The storage and computation time increase exponentially with the number of parameters, making it practical only for curves with a small number of parameters. Complete surveys on different variants of the HT technique and its applications can be found in [70] and [88]. The HT method can be viewed as template matching. However, it is a rigid scheme in that it is not capable of detecting a shape which is difierent from the template by transformations other than translation, rotation or scaling. A deformable template, on the other hand, is able to “deform” itself to a certain degree to fit the data, 26 (C) ((0 Figure 2.7: Hough Transform: detection of an arbitrary shape. (a) an irregular shape; (b) an input image; (c) votes in the (a: — y) translation parameter space. Scale and orientation are fixed for display purpose. (d) the detection corresponding to the maximum vote. 27 by transformations that are possibly more complex than translation, rotation, and scaling. 2.2 Deformable Models The rigid template matching is effective in some domains, but it has a number of disadvantages. It would fail if the objects in the image are slightly distorted due to the imaging process, viewpoint change, or the diversity of objects shape (inter- object variance). On the contrary, a deformable model, which is “active” in the sense that it can change its shape to fit the data, is more invariant to the shape distortions. It is a promising shape model because of its flexibility, and its ability to both impose geometrical constraints on the shape and to integrate local image evidences. Deformable matching utilizes the flexibility in deformable models to match objects of similar shape (or appearance) when the deformation cannot be explained by affine transforms. In Fig. 2.8 the fish template is matched to the two fishes labeled as (a) and (b). Note that despite the global similarity, one of the matched fishes (a) is different from the template fish by local abnormality, but the similarity of the template to (a) and (b) is significantly higher than to the other objects. There has been a substantial amount of research on deformable models in recent years. These activities can be partitioned into two classes: 0 free-form models, and o parametric models. 28 (q) Template Figure 2.8: An example of deformable template matching. 29 By free-form deformable models, we mean that there is no global structure of the tem- plate except for some general regularization constraints; the template is constrained only by continuity and/or smoothness constraints. Such a free-form model can be deformed to match salient image features like lines and edges using potential fields (energy functions) produced by those features. Since there is no global structure for the template, it can represent an arbitrary shape as long as the continuity and smooth- ness constraints are satisfied. On the other hand, parametric deformable models can control the deformations using a set of parameters. The parameters are capable of encoding a specific characteristic shape and its variations. This type of model is used when more specific shape information is available, which can be described by a set of parameters. There are two ways to parameterize the shape variation. One is to handcraft a parametric formula for the curves in the shape template. Then differ- ent shapes can be obtained using different parameter values. Another method is to design a prototype for a shape class, and then apply a parametric transformation on the prototype to obtain different deformed templates. These various deformable template models are illustrated in Fig. 2.9. 2.2.1 Free-form Deformation Models Free—form deformable models assume very little structure about the object shape. Active Contours Dynamic contour models have become popular after Terzopoulos, Kass and others introduced the snake model [81, 129, 131]. In their approach, an energy-minimizing 30 Deformable Template Models Analyti *foxm 'l‘i. 7. . 1; .. M u," > antithirui. C31: we. Figure 2.9: Overview of deformable template models. 31 spline, called a “snake”, is controlled by a combination of o the internal spline force which enforces the smoothness, o the image force which attracts the Spline to the desired features, and o the external constraint force. Each force creates its own potential field and the spline actively adjusts its position and shape until it reaches a local minimum of the potential energy: 5...... = [0 1{g..,.(v(s)) + amageoan + Econ(v(s))}ds, (2.3) where s is the parameterization of the contour, 11(3) is a point on the contour, and v, and 2),, are the first and second derivatives of the contour, respectively. The inter- nal energy of the spline, finds) = (a(s)|v,,(s)]2 + fl(s)|vs,(s)|2)/2, characterizes the stretchness and smoothness of the snake. The image energy, Limage(v(s)), represents the potential due to image forces, and Ecm(v(s)) represents the potential created by external constraint forces. The potentials are defined so that they decrease along the direction of the forces and there are low potentials near salient image features. Once an appropriate initialization of the contour is specified, the snake can quickly converge to the nearby energy minimum, using a variational method. The converged config- uration is expected to give a sensible description of the object of interest. Fig. 2.10 presents an example of image segmentation using the active contour. In Fig. 2.10(a) a snake is initialized (manually) near the human hand, which is the object of interest. This snake is attracted to the salient edges of high gradient (Fig. 2.10(b)) and is 32 finally locked at the object boundary (Fig. 2.10(c)). (a) Figure 2.10: Image segmentation using the active contour approach. The potential is defined inversely proportional to the image gradient. (a) Input image: the snake is initialized around the object of interest. (b) Intermediate image: the snake is attracted to the salient edges by image gradient. (c) Final configuration: the snake converges to the object boundary of high image gradient. This snake model provides a powerful interactive tool for image segmentation. However, the implementation of the original snake is vulnerable to image noise and the initial position. Numerous provisions have been made in the literature to improve the robustness and stability of the snakes [30, 89]. For example, Cohen [29] introduced a “balloon force” which can either inflate or deflate the contour. This force helps the snake to trespass spurious isolated weak image edges, and counters its tendency to shrink. The resulting snake is more robust to the initial position and image noise, but human intervention is needed to decide whether an inflationary or deflationary force is needed. Amini [3] and later Geiger et al. [48] suggested using dynamic programming to minimize the energy function. Their methods exhaustively search all the admissible solutions, and each iteration results in a locally optimum contour. As a result, this method is guaranteed to converge in a finite number of iterations. This idea of active contour has been successfully extended to perform tasks including edge and subjective contour detection, motion tracking, stereo matching and image 33 segmentation [149, 91, 112, 150, 155]. Optimal Active Region While most active contour or snake approaches optimize a cost function based on an exterior boundary cost with no regard to the enclosed interior region, Cox et a1. [34] examined how to apply the smoothness constraint globally and how to utilize both the boundary and interior information in image segmentation and interpretation. The implementation of their algorithm is based on a computationally efficient graph parti- tioning algorithm that minimizes the ratio between the exterior “boundary cost” and the interior “region benefit”. For each greyscale image, they constructed a planar graph where the edges of the graph correspond to the between-pixel line processes in the image and the single faces (with 4 surrounding edges) correspond to the pixels (Fig. 2.11). In the figure, “X”s denote the pixels of the image, the blue nodes cor- respond to nodes of the graph, and the arcs connecting the nodes are the edges of the planar graph (they correspond to the line processes of the input image). Each arc of the graph is assigned a positive cost according to the intensity gradient. Each simple face of the graph is assigned a nonnegative “benefit” based on the greyscale value of the corresponding pixel. A closed path of the graph corresponds to a closed contour in the image. To find a good contour, i.e., a segmentation, in the image, they minimized the following objective function to obtain a partition in the graph: L _ Z, cost(e,) _ 23- benefit(fJ-) ’ (2'4) 34 where cost(e,:) denotes the cost of edge 6,, benefit(f,-) denotes the benefit of face fj, the summation in the numerator is over all the edges on the contour, and the summation in the denominator is over all the pixels (faces) enclosed by the contour. The edge cost on the contour is nonnegative, and inversely related to the intensity gradient, and the weight at a pixel is also nonnegative, and is assigned based on the intensity value according to different segmentation goals. For example, the “ratio re- gions” can be made to have an a priori preference for large, round objects bounded by high intensity gradients. Nevertheless, the algorithm may not find a small high con- trast object in a large image if the edge costs and face benefits are incorrectly chosen. Figure 2.12 illustrates this point. Figure 2.12 (a) shows a large light rectangle on a dark background with white Gaussian noise added to the image. The black boundary shows that the white rectangle is easily segmented from the image. Figure 2.12 (b) is similar, but the white rectangle is now narrower with a correspondingly smaller area. In this circumstance, the algorithm fails to find the rectangle, but instead, finds a region with higher boundary cost but significantly larger area and correspondingly smaller ratio cost. However, we can alter the edge costs in a non-linear fashion in order to improve the segmentation. An example of this is shown in Figure 2.12 (c) where the edge costs have been squared. In this case, the correct segmentation has been found for the narrow rectangle. Following are some of the characteristics of the “ratio region” approach. 0 The algorithm is not iterative and finds the globally optimum closed contour. Though it is based on dynamic programming, the complexity of the algorithm 35 0 0 I, 0 - 0 0 0 0 X X X X X X X X X X X X X X X X = c A X X X X X X X X 3 3 X X X X X X X X 3 . . . 3 X X X X X X X X x]x]xx xxxx DDIDGI-"1I1I1|1I Figure 2.11: The planar graph constructed from the input image. (a) (b) (C) Figure 2.12: Synthetic images showing how the performance of the “ratio region” algorithm can degrade if objects with small area are desired. 36 is O(n log n), where n is the number of pixels in the search window. 0 The smoothness of the contour is maintained in a novel way based on the global properties of the region’s area and perimeter. 0 Very little prior information is required. Only one node which passes the optimal contour is needed for the success of the approach. This condition can be relaxed by providing a node in the neighborhood of the object contour, and then use the node with the maximum gradient magnitude in its neighborhood as the root node. The process can also be fully automated by initiating the core algorithm at all salient edge pixels in the image. 0 Region information can be incorporated into the objective function in a natural way. Not only does the border information contribute to the segmentation, but the internal intensity or homogeneity information is also useful. A significant application of active contour models is an online, interactive segmen- tation of regions of interest in a given image. There is a clear advantage to minimize user interaction in a number of applications for various reasons including ease of use and robustness. Ratio snakes are empirically shown to allow very coarse initialization, e. g., only a single point on the desired contour or a bounding box enclosing the region of interest is needed. Figures 2.13(a)—(c) illustrate several examples of segmentation using the “ratio regions”. In each of the examples, an image subwindow which con- tains the object of interest is specified. A point is specified (denoted by a cross) on the desired object boundary and the optimal closed contour in the subwindow which passes through the given point is then reported. Figure 2.13(a) illustrates that the 37 final optimum contour may be far away from the boundary of the subregion. Sim- ilarly, in Figure 2.13(c), it is unlikely that a traditional snake would find the hand based on the rectangular initialization; the presence of high contrast dark vertical lines between the boundary and the hand would almost certainly result in a strong local minimum near these structures. Elastic Matching One of the early techniques for matching two deformed objects is the elastic match— ing method [10, 21, 44, 100]. In such approaches, the two objects are presegmented from the background. Correspondences are established for the two objects, and a match is attempted. For example, the elastic deformable model [10, 21, 100] estab- lishes an elastic model for one of the two images to be matched. Then this image is “warped” iteratively towards the other one by some local forces (Fig. 2.14). Fischler and Eschlager [44] described a system that built up subtemplates that correspond to significant object features, and then searched for a match using a two-step process: 1. find subpart matches, and then 2. find matching configurations that satisfy relational constraints. The applications of this model include handwritten numeral recognition, cartoon frame filling, alignment of deformed images and line drawings, motion detection, image registration [19, 20], stereo matching [99] and volume matching. Unfortu- nately, elastic matching has traditionally been computationally slow, has problems with correspondence, and is not robust in the presence of noise. Further, most of the 38 Figure 2.13: Image segmentation using the “Ratio Region” approach [34]. 39 algorithms have not been automated. may 2221 [We 5335'? 9%”? CW7 Figure 2.14: Examples of elastic matching. The reference shapes (left) are stretched onto some unknown shapes (right) iteratively by “forces” derived from local pattern matches [21]. 2.2.2 Parametric Deformation Models A parametric deformable template refers to the parametric Shape model representing the a priori knowledge about the structural properties of a class of objects. By design- ing a global shape model, boundary gaps are easily bridged, and overall consistency is more likely to be achieved. By parameterizing the model, a compact description of the shape can be achieved. Furthermore, it is capable of representing a variety of shapes, and is relatively robust to image noise and distortion. Parametric deforma- tion models are commonly used when some prior information of the geometrical shape is available, which can be encoded using preferably, a small number of parameters. There are two general ways to parameterize the shape class and its variations: 1. One can represent the shape as a collection of parameterized curves, i.e., pa- rameterize the geometric shape directly. The template is represented by a set of 40 curves which is uniquely described by some parameters. The specific analytical form incorporates the prior knowledge of the shape of the objects of interest. The geometrical shape of the template can be changed by using different values of the parameters. The use of different parameter values gives rise to differ- ent shapes. Variations in the shape are determined by the distribution of the admissible parameter values. This representation requires that the geometrical shapes be well structured. 2. A so called “standard”, or “prototype”, or “generic” template is specified to describe the “most likely”, or “average”, or “characteristic” shape of a class of objects which has a global conforming structure and possibly, individual deviations. Each instance of the shape class is derived from the “prototype” via a parametric mapping. The use of different parameter values again gives rise to different Shapes. Variations in the shape are also determined by the distribution of the admissible parameter values of the mapping. In both the cases, the deformable templates interact with the image features dynamically by adjusting the parameters according to the image forces. Similar to the active contour approach, an objective function which is a weighted sum of an internal energy term and an external energy term is used to quantify how well a deformed template matches the objects in the given image. Recall that in the active contour approach, the internal energy, in terms of the stretchness and the elasticity of the spline, actually imposes a rather general and weak a priori distribution on the contour model, i.e., the contour should be smooth and compact. In the parametric 41 deformable template approaches, where the a priori shape preferences are explicitly encoded by the parameters, a similar internal energy term is defined based on the constraints and interactions on the geometrical structures. For example, it can be defined to penalize the deviation of the deformed template from the “expected” shape. The external energy term, which pertains to the fidelity of the deformed template to the input image, is introduced so that the template deforms according to the desired goal. It is perceived that the internal energy corresponds to a geometric measure of the fitness, and the external energy corresponds to an image fidelity measure of fitness. The two fitness measures are combined to give an overall measure of fitness, appropriately weighting both the prior knowledge and the image data. The set of parameters which optimizes the objective function gives a description of the detected or matched shape. The value of the objective function quantifies the plausibility of the detection. Analytical Form-based Parametric Deformable Models In some applications the geometric shapes of objects of interest can be approximated by a set of analytic curves of the same form, with different parameter values. We can handcraft the analytic model for such a shape class, and then find the description of a shape instance by determining the parameter values that best describe the shape. One of the first instances of such a shape model is that of Widrow [148], where parameterized templates called “rubber masks” were used to describe 2D irregular shapes. The parameters were sizes and relationships between subparts of a shape. Lakshmanan et al. [87] have used a parametric template model to locate the 42 airport runway boundary in radar images. In their work the a priori knowledge that the runway boundary consists of two straight, parallel edges are used to derive a global shape model for the runway. The runway boundary is modeled as two parallel straight line segments, parameterized by the slope and the two intercepts of the lines. The runway edge detection problem is formulated as a Bayesian estimation using a physics-based model of the radar imaging process based on the assumption that the runway boundary divides the image into three relatively homogeneous regions. The set of parameter values which maximizes the Bayesian a posteriori density determines the detected runway boundary. An alternative to the likelihood is based solely on the local image gradient on the template. It is observed that the latter typically gives a more accurate estimate of the road boundaries, but is more demanding on the initial position of the template (Fig. 2.15). The global deformable model helps in the runway detection because the model is able to integrate the local intensity homogeneity and gradient information and adjusts itself to the desired position. The use of the prior structural information contributes to its robustness to image noise. A typical edge detector does not work well here because the image is textured due to the noisy nature of the millimeter wave images. A polygonal template is designed by Dubuisson et al. [79] to characterize a general model of a vehicle (Fig. 2.16 (a)). They derived an a priori probability distribution to constrain the template to be deformed within a set of allowed shapes of a typical vehicle. The likelihood function is constructed based on motion information and edge directionality so that the deformed template is contained in the motion area and its boundary coincides with salient edges in the input (highway traffic) image 43 Figure 2.15: Runway detection using parametric deformable templates [87]. (a) initialization of the template; (b) the detected runway boundary using the region homogeneity assumption after the initialization of (a). (c) initialization of the tem- plate; (d) the detected runway boundary using the gradient information after the initialization of (c). 44 sequence. This approach was used to segment vehicles in images of traffic scenes. One segmentation result of this method is illustrated in Fig. 2.16 (b). 64' “4‘” I), 85' (Xij) 82' ()S'YI) 0i 63' ()S'Y!) eI' (XO'YI) 91' (er‘) . (a) (b) Figure 2.16: Vehicle segmentation using a polygonal template [79]. (a) the polygonal vehicle template and its parameterization; (b) one segmentation result using the template and motion and gradient informations. Yuille et al. [151] have used deformable templates to extract facial features such as the eyes and the mouth. They designed parametric models for eye and mouth tem- plates using circles and parabolic curves. The parameters which control the shape of a template are the center and the radius of the circle, and the characteristic pa- rameters of the parabola. Regularization constraints are imposed on the parameters in terms of the size of facial features such as the mouth and the eyes, as well as the interactions between them, e.g., the center of the mouth template should be close to the line which is at equal distances to the centers of the two eye templates. The choice of the parametric forms for the feature templates and the interactions between the parameters should reflect the known structural information about the facial features. The image (external) energy term is defined in terms of edges, peaks, and valleys in 45 the input intensity image based on the features of the eyes and mouth so that different parts of the template interact with different image features such as intensity peaks and valleys. By using the deformable template model, the global geometry of the shape and different local image cues are integrated to give a comprehensive goodness- of-fit of the detection. This method gives reasonable detection and tracking results of the eyes and mouths in real images when the initial positions of the templates are sufficiently close to the desired objects. Deformable boundary templates with more degrees of freedom were proposed by Staib et al. [121] to detect objects in medical images. As mentioned earlier, the prior model information can range from very general such as smoothness constraints to very specific such as an exact template. Staib et al.’s prior model is between the two extremes such that some prior information about the global shape is available but it is not exact. Elliptic Fourier descriptors are used to represent open or closed bound- ary templates which are smooth and are continuously deformable with no obvious decomposition. The parameters of the deformable templates are the Fourier coeffi- Cients. A distribution on the Fourier coeflicients is specified so that there is a flexible lDias towards some particular shapes. The spread of the distribution is governed by the variability among instances of the object class. A Bayesian decision rule is then uSed to obtain the optimal estimate of the boundary, where the likelihood function iS based on the correlation between the template and the boundary strength in the input image. For all the techniques discussed above, a good initialization of the contour is re- quired for meaningful solutions. The approximate translation, orientation and scale 46 of the object to be segmented are supposed to be known. Furthermore, the initial contour implicitly biases the converged configuration. The applicability of the para- metric deformable model is limited because the shapes under investigation have to be well-defined so that they can be represented by a set of curves with a preferably small number of parameters. The parametric forms can be ad hoc. In fact, the para- metric curves can hardly fit some actual boundaries in real images no matter how the parameters are adjusted. Prototype-based Parametric Deformable Models The pattern theory proposed by Grenander [58] described a systematic framework to represent shape classes that exhibit a lot of variability but also possess a characteristic structure. Grenander and Keenan [60] formulate a global, pattern-theoretic model of shape which consists of: (1) space of generators (G); (2) connector graph (0); (3) bonding relations (R = (p, A)); and (4) transformation group (S, S : G .——> G). The generators G are the basic building blocks of the structure. For example, they can be edges or nodes for polygonal shapes, or pixels for greyscale image patterns. The connector graph 0, where the nodes correspond to each of the generators, de- scribes the interactions between the generators. The bonding relations R apply the 47 geometric constraints so that the resulting configurations are physically meaningful. The transformation group S maps one generator to another, and is a mechanism to produce new structures from the given ones. This framework provides a structured method to systematically generate very general shape classes. The appropriate choice of the generators, transformation group, connector graph, and regularity conditions depend upon the specific application. In general, the above shape model can be represented by: 0 a model template which describes the overall architecture of the shape, and o a parametric statistical mapping which governs the random variations in the building blocks of the shape [4, 28, 61, 60, 97]. These factors together should control the desired global and local geometry of the shape class. Usually, the prototype template is based on the prior knowledge of the objects, which can be either specified by the high~level knowledge, or obtained from training samples. The parametric statistical mapping is chosen to reflect the particular deformations allowed in the application domain. The shape classes described by Grenander’s pattern theory can be very versatile because of different choices of the prototype template and the deformation process [60]. They can be tailored for very different applications. For example, in their work on human hands, Chow et al. [28] used polygons to approximate the contour of human hands (Fig. 2.17). The building blocks in the shape model are the polygonal edges which meet the regularity condition because polygons are simple and connected. Variations in different hands are described by Markov processes on the edges. Chow 48 et al. applied this shape model for hand synthesis and restoration. Similar approaches have also been used to model leaves, 3—D chairs, and 3-D human organs. In another paper on restoration of human hands from noisy greyscale images, Amit et al. [4] used an intensity image to represent a typical human hand. All instances of the class of hands are obtained by applying a number of admissible continuous transformations to the “ideal” hand image. Furthermore, this set of continuous mappings is governed by a Gaussian distribution. The observed image is assumed to be corrupted by an additive noise process. The reconstruction is obtained by maximizing the posterior distribution. Figure 2.17: A polygonal hand template [59]. The limitations of this approach, as well as most other structured template match- ing methods, is that a good approximation of the location, orientation and scale of the object in the data should be provided. How to deal with the pose and the scale of the initial template is still an open problem. 49 Shape Modeling and Learning The success of the structured deformable template matching approaches depends, ob- viously, on an accurate description of the shape class —— the expected shape instances and their variations. This information, similar to the prior distribution in a Bayesian framework, can be subjective. Usually, it can be obtained from past experiences. It can also be computed from a representative set of shape instances. Some recent work on shape modeling has focused on the active learning of the shape models from training samples, influenced by the goals of “active vision”. To describe a shape class, one has to learn both the “representative” shape and the “variability” in the shape class [33, 32, 32, 35, 107]. Cootes et al. [33, 32] proposed the “active shape models” for templates repre- sented as line-drawings. By an “active shape model”, they mean that instead of handcrafting the parametric form for the shape class, the prototype shape and its deformations are learned from a collection of correctly annotated example shapes. Basically, polygonal representations are used for shape modeling. By manually align- ing the training set, i.e., establishing the correspondences between the “landmark points” (nodes) of training samples of the same class, they calculated the mean posi- tion and variation of each node from the training shapes. This mean shape is used as the generic template of the class of shapes. A number of modes of variation, i.e., the eigenvectors of the covariance matrix, are determined for describing the main factors by which the instance shapes tend to deform from the generic shape. A small set of linearly independent parameters are used to describe the deformation. In this way, 50 their shape model allows for considerable meaningful variability, but is still specific to the class of structures it represents. The major contribution of their work is that the active shape model is able to learn the characteristic pattern of a shape class and can deform in a way which reflects the variations in the training set. The limitations of the approach are its sensitivity to partial occlusion, and its inability to handle large scale and orientation change. Kervrann and Heitz [82] proposed a deformable model which is very similar to Cootes et al.’s model. They presented an unsupervised approach to learn the struc- ture and deformation modes of 2D polygonal objects, given long image sequences. They used a combination of both the global and local deformation modes to model a deformable shape. The global mode is the same as that in Cootes et al.’s work, i.e., is modeled by a generic shape plus the global deformation which is a linear combi— nation of the variation modes obtained from principle component analysis. The local deformation, which is considered to contain additional information from the new im- age frame, is modeled by a Markov random process for the consecutive nodes, which takes into account interactions between the neighboring points. In the training stage, upon the processing of every new image frame, the computed local deformations are used to update the global average template and the global deformation modes. They applied this approach to object tracking. However, a good initial template is still required, and the convergence of the sequence is not guaranteed. Given a set of representative shape examples, Pentland [105] and Pentland and Scaroff [107, 117] have prOposed a novel shape modeling method using the finite ele- ment models (FEM). Although FEM is a well developed and powerful approximation 51 method for solutions to problems in material science and mechanical engineering, this is the first time it was used to solve a computer vision problem. They used 3-D fi— nite element models which act like lumps of elastic clays to model 3—D shapes. They derive modes of vibration of a suitable base shape, such as an ellipsoid, and build up shapes using different modes of variation. The first few modes are the large-scale variations of the shape; the higher order modes are more localized (Fig. 2.18). A total of 30 modes were used to model human heads. They fitted models to range data by an interactive process, and compared modeled objects using fitted parameter values. The advantages of these models are that they are easy to construct and allow for a compact parametric representation of a family of shapes. Additionally, a close-form solution can be obtained for the complex 3-D shape modeling problem. However, this does not always lead to a compact description of the variability within a particular class of objects. A 3D model which combined both local and global deformation is the deformable superquadrics proposed by Terzopoulos and Metaxas [128]. The global shape is mod- eled by a conventional superellipsoid with 6 parameters, which provides salient part descriptors for object recognition and database indexing purposes. The local defor- mation is modeled by a free-form spline, which reconstructs complex shapes that the global abstraction misses. The problem solving is physics-based, via translating vi- sual data into external forces, and simulating the equations of motion through time to adjust the translational, rotational and deformational degrees of freedom of the model. However, a user has to handcraft parts for complex shapes. The number of degrees of freedom of the model can be large enough to make the problem in- 52 (a) (8) Figure 2.18: Example of the eigenmodes [117]. (a) an upright tree shape; (b)-(e) The lower 18 modes (black outline) for the shape in (a). (b) and (c) are the translation modes, ((1) is the rotation mode. The rest are nonrigid variations. 53 feasible. This technique also suffers from the constraints present in all the existing approaches: a good initialization of the pose, scale, and deformation parameters is required to achieve reasonable results. A general and comprehensive visual learning scheme is proposed by Weng et al. [35, 36, 126, 142, 143, 146, 144, 145]. They have proposed a system called “SHOSLIF” [143, 144] for comprehensive sensory learning which involves visual, auditory and other sensory information, where the term “comprehensive” here refers to both the compre- hensive coverage of the sensory world and comprehensive coverage of the recognition algorithm. This system achieves a unified theory and methodology for comprehensive sensor-actuator learning with a logarithmic scalability. The SAIL project by Weng [145] is aimed at the deveIOpment of the first “living machines” whose objectives in- clude “development of a systematic theory and a practical methodology for machines to learn autonomously while interacting with its environment, on a daily basis, via its sensors and effectors, on-line in real time, under interactive guidance from human teachers.” 2.3 Discussion In the previous sections we have briefly surveyed recent work on deformable template modeling for 2D shapes. A summary of the discussed work is listed in Table 2.1. The common difficulties that have been experienced in the application of existing approaches to deformable template matching are as follows: 0 The algorithms need a good initialization to give meaningful results, otherwise, 54 Authors Category Prior Model Imaging Model Application Domain Kass, Witkin free-form Curves that satisfy Potential Image segmentation, and regularity constraints field defined by subjective contours Terzopoulos [81] salient features Cox, Satish and free-form Closed Image gradient Image segmentation Zhong [34] contours that satisfy global regularities Lakshmanan, Analytical Two parallel straight Homogeneous Runway detection Jain and form- edges regions following Zhong [87] based the lognormal parametric distribution Dubuisson, Analytical A polygon that satis- Motion energy + Vehicle segmentation Lakshmanan, form— fies some constraints edge potential and registration and Jain [79] based parametric Yuille, Hallinan Analytical Circles, lines, Potential defined Facial feature and Cohen [151] form- parabolic curves by salient detection based features parametric Staib and Analytical Elliptic Fourier Image boundary Medical image Duncan [121] form- descriptors strength segmentation based parametric Chow, ,Prototype— Polygon whose nodes Two Gaussian Hand synthesis and Grenander based form a Markov distr. for pixels restoration and Keenan [28] parametric process inside and out- side the template Cootes, Taylor Prototype- Polygon Learning shape and Lanitis based models parametric Jain, Zhong and Prototype- Bitmap Potential field Object localization Laksh- based defined by edges and matching manan [76] parametric Table 2.1: A taxonomy of the deformable template matching approaches. 55 they get stuck at local minima and thus lead to incorrect results. 0 The convergence of the algorithms to true solutions is slow due to the large number of parameters associated with the deformable template. In the high-dimensional parameter space, the landscape of the objective functions is usually very complicated with many local minima. Also, because most algorithms utilize the gradient descent method in the entire space of high dimensionality, they are inevitably slow. Few reported works have dealt with the scale and orientation problems successfully. A fast, scale/orientation invariant solution is yet to be found. Our deformation model falls in the category of prototype—based parametric defor- mation models. It shares some of the characteristics of the work by Amit et al. [4] and of Cootes et al. [33, 32], but has its unique properties which are appropriate for the specific application domain of interest to us. We represent the prototype tem- plate in the form of a bitmap which describes the characteristic contour/edges of an object shape. It is then deformed to fit salient edges in the input image by applying a probabilistic transformation on the prototype contour which maintains smoothness and connectedness. The matching is carried out by maximizing the a posteriori prob- ability which combines both the prior shape information and the image information. A Bayesian framework was previously adopted for contour estimation [42, 124] where the prior is used to impose local smoothness and the likelihood is calculated based on edge positions. In our case, it is natural to choose a prior which reflects the global shape of the object of interest. The likelihood model that we use to fit the deformed template to the salient edges in the input image is similar to the ones used 56 in [42, 81, 124], but the exact functional form of our likelihood is different because it incorporates both the edge position and directional information to give a better edge localization. Details of the deformation and likelihood models are given in chapter 4. Chapter 3 Unifying Deformable Models in a Bayesian Framework Statistical approaches to image analysis using the Bayesian paradigm have been very popular in recent years. This paradigm has been primarily developed for situations when prior knowledge of a process is available and that knowledge needs to be com- bined with the sensed data to make statistical inference about the parameters of the process. This methodology has been very successful in integrating low-level image analysis and high-level tasks. Its application domain in computer vision and image processing includes image restoration [51], segmentation [27], shape modeling [108], and inference [16]. A number of researchers have noticed the links between the Bayesian framework and deformable models and tried to obtain a general solution using the Bayesian framework [95, 130, 124]. In this chapter, we investigate the relationship between deformable matching methods and the Bayesian models. We further conclude that 57 58 we can tailor the prior distribution and the likelihood in a Bayesian scenario so that the corresponding Maximum A Posteriori (MAP) solution is equivalent to the solution of the original deformable matching problem, irrespective of whether the problem is an active contour problem, or a parameterized deformable matching problem. 3. 1 Bayes’ Theorem The Bayes’ rule is “a system for modifying historical information on a process in the light of current data” [45]. Let the initial (prior) knowledge about a process be characterized by a density function on the parameters 11 of the process, f (11) Let the current conditional density of observed data (1 given 11 be f (dlu). According to Bayes’ rule, the posterior density function of parameters u, given the observed data d, is _ f (n)f (dlu) ““‘d’ “ f..f(U)f(dIu)du (3'1) or, equivalently, _ f(0)f(dIU) where f(d) = fu f(u)f(d|u)du. In the Bayesian model for computer vision applications, the prior model typically represents the initial knowledge about the objects in a particular scene and the likeli- hood function represents the joint probability distribution of the sensed data (image), conditioned on the objects in the scene. By applying the Bayes theorem, the posterior distribution of the object in the scene is obtained for inferencing purposes such as 59 segmentation and classification. 3.2 Bayesian Formulation for Deformable Models We note that in almost all the deformable models, the objective function consists of two parts: 0 The first term, which is referred to as the internal energy, prior, or geometrical information, is related only to the geometric shape of the deformed template (contour) which is an intrinsic property of the template, independent of the input image: 1. In free-form deformation models such as active contour, this term is Spec- ified in terms of the “elasticity” and “stretchness” due to the weak reg- ularization constraints on the contour. It is equivalent to a “prior” or “preference” for smooth and compact contours. 2. In the first category of the parametric deformation models (Sec. 2.2.2), this term is specified in terms of the model parameters, which reflects either the choice of the parameter values, or the interactions between different parts of the template; this again determines the prior knowledge of the shape, in terms of the geometrical or contextual constraints, independent of the input image. 3. In the prototype-based deformation model, this term is also a function of the deformation parameters. It penalizes the deviation from the expected 60 shape, and biases the choice of the geometric shape. 0 The second term, in all the cases, pertains to the input image data. Via this term, the deformable model interacts with the image, being attracted to the desired salient image features. This term measures the fidelity, or goodness of fit, to the input image. The deformed template which optimizes the objective function leads to an inter- pretation of the image. Although in some of the studies, the deformable process is performed in a Bayesian framework, we note that it can be generalized to almost all the deformable modeling problems. In fact, all the energy minimization prob- lems which consist of an internal and an external energy term can be cast as an inference task in a probabilistic framework. To be more specific, almost all the de- formation models discussed above can be interpreted in terms of Bayesian estimation in Eq. (3.2), using a prior f (u) which characterizes the intrinsic properties of the deformable template u, and a likelihood f (dlu) which relates the template to the sensed data (1. When the deformable template modeling is cast in the Bayesian framework using Eq. ( 3. 2), the prior model f (u) is a probabilistic description of the quantity we are trying to estimate before any image data becomes available. It typically imposes the geometrical preferences of the shape model. The imaging model f (dlu) is a description of the noisy or stochastic process that relates the deformed template to the input image or sensor values d. This likelihood captures the desired image cues. Bayes’ rule combines these two probabilistic models to form a posterior model f (uld) 61 which describes probabilistically the best estimate of u given the data d and prior knowledge of u. Note that f (d) is a constant, given (1. Therefore, maximizing the posteriori density in Eq. (3.2) is equivalent to maximizing the product {f (dlu) f (u)} Intuitively, the internal energy term, which is a measure of the geometrical struc- ture on a deformed template or contour, is related to the prior model in the Bayesian formulation; the external energy term, which describes the interaction between the template and the image, corresponds to the likelihood model. The prior and sensor models are determined according to the applications and goals. In the following sub- sections we address the details of the Bayesian formulation for each of the deformable template models. 3.2.1 Free-form Deformable Models The unknown quantity u in Eq. (3.2), in the free-form deformation model, controls the relative positions of the nodes on the contour to enforce the “stretchness” and “compactness”. The prior can be specified in terms of the regularity constraints of the spline, and the likelihood can be derived from the image potential energy. It has been pointed out [124, 130] that by using a Gibbs distribution for the prior model 1 1301) = Z) exp [_£int(u)]a (3.3) 62 where 5m: is the internal energy of the snake as defined in Eq. (2.3), and a Gibbs distribution for the sensor model P(dlu) = '21—, eXP [-£.-mage(u, d)], (3.4) where Limage(u,d) is the external energy as defined in Eq. (2.3), the maximum a posteriori (MAP) estimate, i.e. the estimate of u which maximizes the conditional probability P(UId) 0C P(dIU)P(U) (3-5) oc 21; exp [—£',:,,t(u)] 21; exp [—£,-mage(u, d)] OC EXP [_(£int(u) + SimageO-li d))], is the same as the minimum energy configuration in Eq. (2.3) of the snake. 3.2.2 Analytical Form-based Parametric Deformation Models The unknown quantity u in Eq. (3.2) is the set of parameters of the analytical defor- mation models. The Bayesian formulation for such deformable template models can be derived in a similar manner. The prior is an appropriate probabilistic distribu- tion on the model parameters, which encodes the knowledge of the shape variations and the contextual constraints, independent of the imaging process. The most often used prior densities include the uniform distribution and the Gaussian distribution. 63 The sensor model depends on the imaging process, and a large diversity of stochastic models can be used according to the Specific goal. For example, in detecting straight edges in millimeter-wave images [87], the analytical template edge image is formed by assuming that the observed millimeter-wave image consists of two straight and parallel edges which are determined by three parameters: the slope of the two edges, k, and the intercept of each edge, c1 and c2 (Fig. 3.1). These parameters are assumed to be equally likely in their domain (uniform prior density). y y=k(x-c1) y=k(x.c2) c1 c2 Figure 3.1: Parameterization of the runway boundary template. The runway bound- ary template consists of two parallel straight lines y = k(:r — c1) and y = k(:c — c2), with parameters It for the slope and cl and c2 for the intercepts. We assume that all deformations of the template that will keep the two edges within the confines of L are equally probable. In other words, the prior probability density is a uniform distribution on the set of values of k, c1, and c2, which satisfy the constraint that the two deformed edges must fall into the image region. The likelihood function, is determined based on the assumption that the greyscale values in three image regions (I,II and III) separated by the two straight edges follow a log-normal distribution in each region, which incorporates the essential degradations associated 64 with the millimeter-wave imaging process [86]. More Specifically, given a hypothetical pair of underlying edges ({k, cl, c2} values) we assume that the likelihood of observing the millimeter-wave image Y is given by: P(Y | {k,cl,c2}) : (re—E(Y’lk’cl’cm, (3.6) where a is a normalizing constant and E (Y, {k, c1, c2}) denotes the energy function related to the log-normal distribution: E,(Y {k, c1,c2})= Z log[oro]2+ 2[0 ——]-12(logY,g— #7252] (3.7) (r,0)€L Ora The quantities fire and 0,9 are assumed to be uniform over the three regions separated by the two straight edges. Their values are estimated adaptively from the data by using sample statistics over the respective regions. The r3 divisor incorporates the deterministic range-dependent degradation in the data, and the (firmarg) pair incorporates the texture-like random variability in it. 3.2.3 Prototype-based Parametric Deformation Models The Bayesian formulation for the prototype-based deformable template models can be derived similarly as in Section 3.2.2. The prior model is an appropriate probabilistic distribution on the model parameters, which encodes the knowledge of the shape variations and the contextual constraints, independent of the imaging process. The sensor model depends on the imaging process and again, a large diversity of stochastic 65 models can be used according to the specific goal. Let the prior knowledge about the object of interest be represented by an ideal prototype 75. Let the possible deformations on 73 be described by a set of parameters ’P. A probability distribution is assigned to the parameters with density 7r(’P), which models the allowed variations in deformable template T, or equivalently, ’P. This is the idea of probabilistic deformation model. It is through the prior model 7r(’P) that the intrinsic constraints and variations are expressed. The imaging model describes the dependence of the observed image on the tem- plate. The appropriate model depends on the specific matching problem. Let L(I|’P) be the likelihood of Observing image data I given a deformed template determined by deformation transform ’P, the posteriori density 7r(’P|I) of the deformed template given observed image I is proportional to L(I|P) 7r(’P). (3.8) The solution is obtained by the maximum a posteriori (MAP) estimate of the true object scene. In chapter 4 we will formulate our approach using this methodology. 3.3 Discussion By using the probabilistic model for the deformable template problem, we can for- mulate the problem in a structured fashion. The physical sensor model can be easily integrated with the prior knowledge of the configuration. Other advantage of using 66 the Bayesian framework is that confidence levels can be attached to the results for image interpretation and inference. Chapter 4 A Shape-based Deformation Model We propose a deformable model which is appropriate in situations where an inexact knowledge about the shape of the object of interest is available, and this shape in- formation can be represented in the form of a hand-drawn sketch. Such an approach can be used in image segmentation, and object localization and detection, when the specific global shape model is given and need to be combined with the local image features. It can also be applied to content-based image database retrieval systems, where queries often include the Shape of the object of interest. The user may de- scribe the shape of an object using a sketch and ask to retrieve all the images in the database that contain such a shape. The sketch used to describe the shape does not need to match the object boundaries in the image exactly (Fig. 1.1). It is important that a retrieval system be robust to position, orientation, scale differences, and most importantly, moderate deformations of the object shape. This problem presents several difficulties. First of all, the object to be located in the image may be different from the prespecified shape by possible deformations 67 68 which cannot be explained by a combination of translation, rotation, and scaling transforms. Secondly, the shape to be matched is not pre—segmented from the image. So, the matching process has to be integrated with segmentation. This situation is not as easy as the problem of matching two shapes based on the Similarity of feature vectors. Thirdly, the approximate pose and scale of the object are unknown, as well as the number of instances of the objects in the input image. Fourthly, the shape model should be general enough to handle shapes of different appearances and connectivity. Lastly, since an inexact global shape is provided, we need to sensibly integrate this knowledge with the available image information. To deal with the above problems, we propose a deformation model which consists of o a prototype template which describes a representative shape of a class of objects in terms of a bitmap sketch, 0 a set of parametric transformations which deform the template, and o a probability distribution defined on the set of deformation mappings which biases the choice of possible deformed templates. This deformation model can represent object classes of similar characteristics, and incorporate variations in the class. We will discuss all the three components in more detail in the following sections. 69 4.1 Representation of the Prototype Template The prototype template consists of a set of points on the object contour, which is not necessarily closed, and can consist of several connected components. We represent such a template as a bitmap (binary image), with bright pixels (grey level of 255) lying on the contour and dark pixels (grey level of 0) elsewhere (Fig. 1.1 (a)). Such a scheme captures the global structure of a shape without specifying a parametric form for each class of shapes. This model is appropriate for general shape matching since the same approach can be applied to objects of different shapes by drawing different prototype templates. We acknowledge that this deformation model falls in the systematic framework of shape modeling described in Grenander’s pattern theory [58]. 4.2 Deformation Transformations The prototype template describes only one of the possible (though most likely) in- stances of the object shape. Therefore, it has to be deformed to match objects in im- ages. Deformation transform is an important component of the deformable template matching algorithm. It determines the admissible deformation space, or equivalently, the possible shapes that a deformable template can take. In theory, the deformation transform can be any function which matches a 2D point to another 2D point, as it is used to approximate the displacement in a 2D plane. But, a good deformation transform should be capable of representing a variety of shape variations, providing a concise description, preferably with a computational advantage, and preserve the 70 smoothness and connectivity of the template. We have used three different deforma- tion transforms to span the displacement field, namely, the trigonometric basis in the 2D continuum, and the spline basis and wavelet basis in the curve space. Each of the deformation basis has its advantages and disadvantages. We will discuss each of them in details in the rest of this section. 4.2.1 Two-Dimensional Trigonometric Basis This deformation basis is defined on the 2D continuum. Imagine that the template is drawn on a 2D planar rubber sheet which has a fixed boundary, but it can be deformed by stretching, squeezing, and twisting locally in the interior. As the rubber sheet deforms, the template drawn on it also changes its shape. The deformed rubber sheet can be obtained by applying a continuous mapping which maps the domain of the 2D image onto itself. The resulting 2D displacement field is represented as a continuous 2D vector function with certain boundary conditions. Without a loss of generality, we assume that the template is drawn on a unit square 8 = [0,1]2. The points in the square are mapped by the function (2:, y) 1—> (at, y)+(’D“(:c, y), Dy(:r, y)), where the displacement functions D‘”(a:, y) and Dy(:r, y) are continuous and satisfy the following boundary conditions: 133(0, y) E ’D“(1, y) E Dy(x, 0) E Dy(x, 1) E 0. The Space of such displacement functions is spanned by the following orthogonal bases [4]: efnn(a:, y) = (2 sin(7rn:r) cos(7rmy), 0) ei’m,(a:, y) = (0, 2cos(7rm:c) sin(7rny)), (4.1) 71 where m, n = 1, 2, . . .. These basis functions, which consist of trigonometric functions of different frequencies, vary from global and smooth to local and “coarse” as m and n increase, as shown in Fig. 4.1. Specifically, the displacement function is specified as follows: many): (73“(17 y) 79%? 31)): Z 2 ”"0 em" m" ", (4-2) m=l n=1 Am" where An", = onr2(n2 + m2), m,n = 1,2,... are the normalizing constants. The parameters g = {(f;,,,§$’n,,),m,n = 1,2, . . .}, which are the projections of the dis- placement function on the orthogonal basis, uniquely define the displacement field, and hence the deformation. We use a finite number of terms in the infinite series in Eq. (4.2) as the displacement function for the deformation mapping: D£(III, _Z X; mn ' emn A'e'l'gf’nn . (4.3) — m=l n=l Am" Note that the dependence of the displacement D on the deformation parameter vector g is made explicit in Eq. (4 .3). This continuous function preserves the connectedness of the prototype template. It also preserves the smoothness of the template when M and N are not too large (only low frequency components are used). It should be noted that the length of the deformable template varies depending on the deformation. Note also that we are only concerned with the displacements of the points on the prototype template. Figure 4.2 illustrates the deformations of a bird template sketched on a grid using the displacement functions defined in Eq. (4.3). One can see that the 72 [lllllllrllllll —_~_——a_-____h ___._4#1._4____ III Jllllll llll__L. .... ..,. Hull .3 LL _ _ .anmnnnmtmm .. 44 . I II I , xx r_r VI 5.}1 .[ ’l r p: u (3% .- tray 1 / M 11:1 Figure 4.1: Basis functions for the deformation displacement field. 73 deformation becomes more complex as higher frequency components are added to the displacement function. llllI Figure 4.2: Deformation of a bird template using the 2D trigonometric basis. (a) the bird template with no deformation; (b) deformed bird template using randomly generated deformation parameter values. From left to right, the interpolation level (M, N) equals 1, 2, and 3, respectively. This choice of the deformation transform basis has the following properties: 0 The basis set in Eq. (4.1) is defined on the 2D continuum. So, it imposes very few constraints on the prototype template. The template can be open or closed, simply-connected or multiply-connected. Only the deformation on the template pixels are computed; we do not need to compute the displacement on the rest of the 2D domain. This deformation transform gives the most flexibility about 74 the template shape; 0 All the basis functions, whether containing low or high frequencies, are global. As a result, they may not approximate local Shape features well. The global property of the deformation basis in Eq. (4.1) is not desirable in some applications because of their incapability to model local, uncorrelated changes. The other two deformation basis we define in Sections 4.2.2 and 4.2.3 have local compact support. They are expected to outperform the 2D trigonometric basis in modeling local features. 4.2.2 Deformation Transform Using Spline Representation Splines are piecewise polynomials for efficient interpolation and approximation of curves and surfaces. They are usually characterized by a set of control points, with certain continuity requirements at the boundary. Splines have the desirable properties of continuity, bounded support, spatial uniqueness, and local controllability. Splines have been commonly used for shape modeling [43, 81, 116]. We find the spline ap- proximation of the prototype template, and then deform it by displacing the control points. We can achieve local deformation by using a Spline basis for compact local support. B-splines The B-spline is a well-known spline function that provides local approximations to contours/surfaces using a small set of control points. A kth order B—spline is C"—1 75 continuous, i.e., it is continuous up to (k — 1) derivatives. The degree-m B-spline representation of a curve is given as follows: m): 2 care, mama (4.4) k-m—l i=0 where Bf"(t) are the degree-m B-spline basis, and c,’s are parameters of the represen- tation (control points). For a closed contour, we need to guarantee that the spline representation is periodic. This can be done by periodically extending the knot sequence {tj, j 6 (0,1,. . .,k — 1}} such that: A- tj = tj mod k,j E Z, (4.5) and accordingly, periodically expanding the aperiodic B—spline basis {8}"(t), j = 0, 1, - - - , k - 1} With a period (t,c —to) to obtain the periodic B—Spline basis {3}”(t), j = 0, 1,. . . , k — 1}. The closed (periodic) function can then be represented as: k—m—l ~ W) = Z amt). te 72. (4.6) i=0 B-spline Representation for Deformable Templates The B-spline gives a parameterized representation of a contour, where the control points are the parameters. The B-spline representation of a contour is somewhere between a bitmap and a parametric form. A free-form bitmap has the most degrees of 76 freedom (each pixel location can be considered as a parameter), and the least structure (it can assume any shape). A parametric form, such as an ellipse or a parabola, can be determined by a small number of parameters, and has the most rigid structure with very few degrees of freedom. With a spline representation, we can change the parameter values to deform the contour locally. It is also more structured than a bitmap, with the Shape controlled by the set of control points. To represent a 2D contour, its a: and y coordinates are fitted with a separate B-Spline basis: k—-ml [f,,(t) =2 c B'"(t t e [tm, tk_m], (4.7) i=0 where c, = [0,”, cf] are 2D vectors. Accordingly, a closed contour can be expressed as: k—lm— [fx(t)f =2 mere), teR. (4.8) i=0 To represent a deformable template using the splines, we should have (i) a default template (prototype) and (ii) a way to code the variation in Shape. This can be achieved by imposing a probabilistic distribution on the parameters (control points). We have used an ii (1. Gaussian distribution for this purpose. The set of average control points determines the prototype. If a bitmap is given to describe the default shape, we can fit a B-spline to the contour and use the estimated control points as the mean. If training samples (shapes) are available, we can learn the prototype from the samples. Let the set of default control points be {co,-, i = 0, . . . , k — m -— 1}, and the B-spline basis be {8?(t), i = 0,. . . , k — m — 1}, then the prototype is determined 77 k-m—l [$o(t)yo(t)l= Z Col-3.7"“), télthk-ml- (4-9) i=0 The deformed template can then be derived by perturbing the control points around the default ones. The deviations are penalized so that small deformations are preferred over large ones. Let the deviations from the default control points be Ac, = [5? 5?]. The deformed template is computed as: k—m—l k—m—l [$(t)y(t)l= Z: €13.30): 2 (005+Aci)3l"(t), teltmatk—ml- (4-10) We note that the B-spline representation of a deformable template is equivalent to approximating the deformation displacement using the B—spline basis functions. The displacements of the template pixels due to the disturbance g in the control points are computed as follows: k-m—l 12,3): 2 ([5: 31mm. team]. (4.11) "‘ i=0 where the deviations (63’, g?) from the default control points are the deformation parameters. To illustrate the deformations allowed by the spline representation, we Show in Fig. 4.3 some deformed versions of a template using randomly generated control points which follow an ii d. Gaussian distribution with means equal to the default control points of the prototype. Fig. 4.3(a) is the B-spline approximation of the bird template using 30 control points. Its deformed versions are demonstrated in Fig. 4.3(b). GER/ill“ fil‘. fi‘:“s CY)“ ‘CTi—II“ (b) Figure 4.3: Deformations using the B-spline representation. (a) The spline represen- tation of the prototype. The red dots are the control points. (b) Deformed templates obtained by randomly displacing the control points in (a) according to an i.i.d. Gaus- sian distribution. 79 We note that the Spline-based deformation is able to model local deformations because the deformation basis functions are local. However, it requires a 1D pa- rameterization of the template. So, it constraints the template to consist of a single component. 4.2.3 Deformation Using Wavelet Transforms Interest in wavelets has been steadily growing over the past 15 years. It has turned out to be extremely successful in many computer vision and image processing appli- cations involving compression, texture analysis, multiscale processing, image coding and restoration [7, 92, 115, 135, 139]. In the wavelet decomposition, fine scale features are captured by “narrow” func- tions with a small support, and coarse scale features are represented by “wide” func- tions with a large support. All these functions are either dilated or shifted versions of the so called mother wavelet. Each level in the wavelet decomposition corresponds to the difference (or detail) between two successive approximations. With such a lay- ered structure, details at different levels can be added to obtain image representations at different resolutions. The wavelet approach fits naturally into the framework of multiscale image processing. There are several advantages of using the wavelet transforms [37, 38] to model the shape deformations: 1. The wavelet basis is constructed in such a way that the elements with low indices have a large support and, therefore, allow for large-scale global adjustments in 80 the displacement field; on the other hand, the elements with higher indices have a smaller support and hence allow for local adjustments for small range abnormalities. With the hierarchy of coarse to fine wavelets, we can control the level of smoothness and locality in a coordinated way. 2. The wavelet transforms in the compact case can be computed very fast (in linear time complexity) using quadrature mirror filter (QMF) banks. This fast transform is reversible. 3. There is a relationship between the rate of decay of the coefficients in the ex- pansion of a function and the degree of smoothness of that function. We use the wavelet basis to span the deformation displacement field on the object contour. Wavelet Transform Let Am be a linear, orthonormal projection on the vector space Vm. Let L2(’R) be the vector space of measurable, square-integrable 1D functions f (t) A multiresolution approximation of L2(’R.) is defined as any set of vector spaces ({Vm}, m E Z) which satisfies the following properties [93]: 1. The approximation of a function f (t) 6 L202) at a resolution (m + 1) contains all the information to compute the same function at a lower resolution m, i.e., the subspaces of the multiresolution approximation are nested: 81 2. The approximation Amf (t) of f (t) is determined by 2'” samples per unit length; 3. Information is lost when a coarser approximation is used. However, the ap- proximation converges to the original function as the number of levels goes to infinity: "[irnoo Vm 2 UV", is dense in L2(’R), (4.13) and ml_i)rn°o Vm = flVm = {}. (4.14) For every multiresolution approximation of L202), there exists a unique function ¢(t), which is called a scale function, such that its dilation and translations (Amt) = 2'm/2¢(2’mt — n), m, n E Z, (4_15) form an orthonormal basis for the subspaces Vm. The approximation of a function f (t) at a resolution level m can be computed by decomposing f (t) on the set of basis ¢;,"(t), n E Z: Amf(sc )=2"" 2 ¢"‘(u ) (4-16) 712—00 where < -, - > denotes the inner product operation. The wavelet representation is a multiresolution representation based on the differ- ences of information between successive resolutions Vm and Vm+1. As Vm is a subset of Vm+1, the difference between the two sets, which is called the detail at level m, is 82 the orthogonal complement Om of Vm in Vm+1, i.e., 0m .1. Vm, and (4.17) (9m 1M." (t) (4-20) n=—oo AS a result, the dilation and translations of the wavelet Mt), $.76) = 2‘m/ 210(2‘mt - n), m, n e Z (4.21) form an orthonormal basis of L2(R). The decomposition of a function f (t) using the set of wavelet basis is called a wavelet decomposition: f(t) = Z Z < f(U),¢Z’(U) > MW)- (422) mEZnEZ 83 The set of coefficients on these basis < f (u),i/J,',”(u) >, m,n E Z is the wavelet representation of f (t). The set of wavelets i/J,’,”(t) = 2'm/ 21,!)(2‘mt — n), m,n E Z is not necessarily or- thogonal. When the wavelets are not orthogonal, we can always find a function i(t), which is called a dual wavelet of i/J(t), such that: < 1,03, 1%: >= 6,,j6kj. (4.23) In this case, #2:,” and 15:,” form a biorthogonal pair. With the biorthogonal wavelets, the wavelet decomposition of a function f (t) can be written as: f(t) = i arise), (4.24) m,nz—oo and the wavelet coefficients all," can be computed via 4:." = /°° mime. (4.25) Deformation Using Wavelet Transform The advantages of the wavelet basis in flexible shape modeling which allows both global and local degrees of freedom, come at the cost of a larger number of parameters. These are the coefficients of the local and global basis. We note that in our problem, only the displacements in the proximity of the deformed template are of interest. So, to model the displacement of the template, we only need to use the subset of basis 84 which have a nonzero support near the template boundary. In the particular case where the template consists of a Single contour, we param- eterize the contour: {2 [0,1] e732 t 1—> y(t) = (x(t), y(t)), (4.26) and then use the 1-D wavelet basis to model the displacements A1: and Ay in the two dimensions $(t) and y(t) separately: ( A33“) ' 31:1 will“) = , (4.27) ‘( mm 3? y5"(t) where x2"(t) = 86324133), yrs) = Demise). (4.28) are the details at scale m with 1 S m _<_ M. The deformation parameters are then the wavelet coefficients (8”)? and ((3)2. In this way, we reduce the problem of modeling the displacement in a 2-D continuum to the problem of modeling two l-D displacements, and as a result, reduce the number of parameters needed. We have used the B-spline wavelets [137] to span the displacement. B-spline wavelets have the desirable properties that they have compact supports, and they asymptotically converge to the Gabor functions. More details about the B-Spline wavelets can be found in [136, 137]. Figure 4.4 shows the B-spline wavelets at the .- fi. : ’ .\ As A... »v “a A» 0.1 0.2 0.0 -0.2 -0.1 0.2 0.0 0.2 0.1 0.0 -0.2 -0.1 85 first and second resolution levels. -0.2 0.1 basis 1.1 basis 1.2 basis 1.3 basis 1.4 3 2 ‘3 3 3 c2 ca 0 O O 9' 5 3 N N N 050100150200250 9050100150200250 q0501m150200250 905016015070080 basis 2.1 basis 2.2 basis 2.3 basis 2.4 a ‘o" [[ 3 3 3* q q o. O O O «:5 S S N N N. 0 50100150200250 90 50100150200250 ?0 50100150200250 90 50100150200250 basis 2.5 basis 2.6 basis 2.7 basis 2,8 N j N o d g o 3 i o. O O O «'5 5 3‘ N. N N 0501001502002509050100150200250?050100150200250 9050100150200250 Figure 4.4: B-spline wavelet basis. They are generated by shifting and dilating a mother wavelet function. In Fig. 4.5 we Show some templates obtained by deforming a prototype using the B-Spline wavelet transform. These templates are generated using deformation parameter values which come from an i.i.d. zero mean Gaussian distribution. Note that the deformations can be quite local. ‘R’DC‘R’E‘. ~0~CC<§ ‘0 Q , (4.3,) The value of o2 in Eqs. (4.29)-(4.31) reflects the confidence about the prototype template, with a large value of 02 allowing more deformation. Intuitively, larger values of a give rise to more elastic templates and smaller values of a give rise to more rigid templates. 4.3 Bayesian Formulation and Objective Function A Bayesian inference scheme is employed to integrate the prior shape knowledge of the template and the observed object(s) in the input image. The prior information of the object shape can be presented by a combination of o a prototype template, 0 a set of deformation transformations of the template, and o a probability density on the set of deformation transformations. We propose an energy function based on the image boundary strength and the de- formed template in order to arrive at the likelihood. This likelihood is then combined with the prior using Bayes’ rule to obtain the a posteriori probability density of the deformations of the template given the input image. The object is located by deform- ing the template so that the a posteriori probability density is maximized. The final shape and position of the deformed template gives a description of the object in the image. 91 4.3.1 Prior Distribution We use the prior distribution to bias the global transformations (rotation, transla- tion, and scale change) and local deformations that can be applied to a prototype template. Let 73 denote the prototype template, and let 7;,e,§,_d_ be a deformation of the prototype. This deformation is realized by rotating the prototype template by an angle 8, locally deforming the rotation by a set of parameters g, scaling the local deformation by a factor of s, and translating the scaled version along the x and y directions by an amount _cl = (dx, dy): 7;,9,§_,c_1(:v, y) = 70(5 - [(93, y) + D§(’Re(x, y))] + (d’, an), (4.32) where D§ denotes the displacement functions given in Eqs. (4.3), (4.11), and (4.27), and 729(zr, y) is the rotation of a point (:12, y) by an angle 9. Assuming that s, O,§_, and d are all independent of each other, then we can write the joint density of s, G, g, and d as the products of their marginals: ’Pr(s, O,§,d) = ’Pr(s) - Pr(O) -Pr(§) -’Pr(d). (4.33) Suppose all translations, rotations, and scale sizes are equally likely as long as the transformed template falls in the input image, then Eq. (4.33) reduces to: 73718, 9.6.4) = 1979719, (4.34) 92 where h: is a normalizing constant and ’Pr(§) is defined in Eqs. (4.29)-(4.31 ) Eq. (4.34) constitutes our prior density. Intuitively, a deformed template with a geometric shape similar to the prototype template is favored, regardless of its size, orientation, or location in the image. 4.3.2 Likelihood The likelihood specifies the probability of observing the input image, given a deformed template at a Specific position, orientation and scale. It is a measurement of the similarity between the deformed template and the object(s) present in the image. The likelihood which we currently use only incorporates the edge information in the input image, where both the edge positions and directions are considered. Other alternatives of the likelihood can incorporate texture, grey-scale homogeneity, or color information in more complex situations. How to effectively select a likelihood which reflects the requirement of a particular application will be an interesting topic for future work (see Chapter 8). The deformable template is attracted and aligned to the salient edges in the input image via a directional edge potential field. This field is determined by the positions and directions of the edges in the input image. For a pixel (2:, y) in the input image, its edge potential is defined as: 9(4, y) = — exp{—p(6§ + 5;)1/2}, (4.35) where (62,, 6,,) is the displacement to the nearest edge point in the image, and p is a 93 smoothing factor which controls the degree of smoothness of the potential field. The edge potential field can be obtained by convolving the edge map using a Gaussian mask. For pixels which are far away from the edgemap pixels, the edge potential is high. For pixels near the edge pixels, the potential is low. The potential of a pixel is zero if it happens to lie on the edgemap. The goal of the edge potential field is to attract the template to salient image features (the input edges). When we minimize the potential of a template, it is directed to the nearby edges. The edge potential of the input image in Fig. 1.1(b) computed using Eq. (4.35) is shown in Fig. 4.6. Figure 4.6: Edge potential field of the input image in Fig. 1.1(b). The potential at a pixel is displayed as the greyscale value. The larger the gray scale value, the larger the potential value. The yellow pixels denote the edge map of the input image. The potential at locations far away from the edge pixels is high, and the potentials at locations near the edge pixels is low. The edge potential field, computed from the distance to the nearest edge pixel alone, drags the template near the edge pixels. However, it is vulnerable to noisy edges, so the template may get trapped to spurious edges. We have used the gradient 94 direction to improve the performance so that the template agrees with the edgemap in both its position and direction. We modify this edge potential by adding to it a directional component. This new edge potential induces an energy function that relates a deformed template 7; e 6 d to the edges in the input image Y: anew, Y) = 5’; :0 + 90c, y)| cos(fi(:v, 20)!) (4.36) where he; is the number of pixels on the template, 5 (:13, y) is the angle between the tan- gent of the nearest edge and the tangent direction of the template at (:r, y) (Fig. 4.7), and the constant 1 is added so that the potentials are positive and take values be- tween 0 and 1. The summation is over all the pixels on the deformed template. The edge magnitudes and directions are computed using the Canny edge detector [22]. This definition requires that the template boundary agrees with the image edges not only in position, but also in the tangent direction. This feature is particularly useful in the presence of noisy edges. The lower this energy function, the better the deformed template matches the edges in the input image. Using this energy function, we now define a probability density that specifies the likelihood of observing the input image, given the deformations of the template: Pr(Y I s,O,§,gl_) = aexp{—£(7;e,£’d, Y)}, (4.37) where a is a normalizing constant to ensure that the above function integrates to 1. The maximum likelihood is achieved when E (79.9.6.4’ Y) = 0, i.e., when the deformed 95 Figure 4.7: Directional edge potential field. The deformed template (in solid red) is put in the edge potential field created by the edgemap (in solid yellow). Let (:r,y) be a point on the template. Let ($8,316) be the nearest edge pixel to (.7:,y) in the image field. Then the directional potential at template pixel (as, y) is defined as the edge potential at (:17, y) (caused by (278,113) according to Eq. (4.35} multiplied by the cosine of ,B(a:, y), the tangent angle between (2:, y) and (stage). The directional edge potential of the template is the average directional edge potential of all the template pixels. 96 template 7; e 5 d exactly matches the edges in the input image Y. We note that the normalizing constant a is not necessarily independent of the deformed template. Exact computation of this constant involves an intractable sum- mation. If this constant can be computed, then it removes the inherent bias of the energy function (and hence the likelihood) towards deformations of the template that increases its size. Since we cannot compute it, we have chosen to incorporate its effects by normalizing the energy function directly with respect to the template size - see Eq. (4.36). We acknowledge that this represents a deviation from proper Bayesian inference, but many of the existing studies also suffer from the same drawback. 4.3.3 Posterior Probability Density Using Bayes rule, the prior probability of the deformed templates in Eq. (4.34) and the likelihood of the input image given the deformed template in Eq. (4.37) can be combined to obtain the a posteriori probability density of the deformed template given the input image: ’Pr(s, 8, §_, _(1 | Y) = Pr(Y | s,@,§,c_l) ’Pr(s,@,§.d)/7’T(Y) ), (4.38) = Cl eXP{_£(7;e,£,_d_i Y)} H fi; exp{—%5 £32 + 32)} - rep = 62 exp{—€(7;,e,§,c_j, Y) '- Zfifiéfz + 53.12)} iE'P J where C1 and C2 are normalizing constants, and ’P denotes the set of deformation parameters. 97 From the Bayesian point of view, this posterior probability density reflects the distribution of the deformable template, based on the prior Gaussian distribution, and then corrected using the image edge information. The best match can be obtained by finding the Maximum A Posteriori estimate, i.e., finding the maximum of the probability function in Eq. (4.38). 4.3.4 Objective Function Our goal is to maximize the a posteriori density in Eq. (4.38) with respect to pa- rameters s, G, g, 4. Taking the natural logarithms on both Sides of Eq. (4.38) results in: In ’Pr(s, O,§,d | Y) = ln C2 — “79.9.6.6! Y) — 2(62’2 + £32)/2o2. (4.39) iE'P Equivalently, we seek to minimize the following objective function with respect to s, @,§,d: £(t,e,§,c_), Y) = 502,9, 4, Y) + 7 2(52 + 5.92), (4-40) — iE'P where 'y 2 1/202. This objective function consists of two parts: 0 a model-based term which measures the deviation of the deformed template from the prototype template, and o a data-driven term which describes the fitness of the deformed template contour to the boundary in the image. 98 The parameter 7 can be interpreted as providing a relative weighting of the two penalty measures; a larger value of 7 implies a lower variance of the deformation parameters, and as a result, a more rigid template. The objective function L in Eq. (4.40) is a function of the deformation parameters and the object pose parameters. We can find the set of parameters which gives the best objective function value. The optimal solution that maximizes Eq. (4.40) can be obtained, in principle, by an exhaustive search of the parameter Space of s, G, g, and 4. However, this is not feasible due to the high dimensionality of the search space. We find the minimum by first roughly estimating the pose of the object, and then using the gradient descent method to find the nearby local minimum of the objective function surface. 4.4 Discussion The proposed deformation model (Eq. (4.32)) is a more powerful tool than the affine transformations for object matching and localization. An affine transformation can be expressed as the product of an arbitrary sequence of rotation, translation, and scale matrices and has the property of preserving parallelism of lines, but not the magnitude of angles and lengths. Our deformation model is more general than the affine transformations, in that it (i) includes linear scaling, rotation, and translation as a special case when we set the deformation parameters to zeros, and (ii) can allow nonlinear deformation of the shape (the deformation basis functions are nonlinear, see Fig. 4.1.) which cannot be handled by the affine transforms. Figure 4.8 gives some 99 examples of the deformation using the afline transform and the nonrigid deformation transform. In Fig. 4.8(a), the parallelism of the opposite edges of the rectangle is maintained after the affine transform, while the linearity is lost after applying the nonrigid deformation transform (Fig. 4.8(b)). The proposed deformation model also has advantages over the feature-based matching methods such as invariant moments. First, the objects do not need to be segmented or grouped to apply the deformable matching process. The proposed model accumulates local, fragmental evidence and combines it with the global model to establish a match. Secondly, the invariant moments (typically only low-order mo- ments are used), are necessary but not sufficient for shape matching. In other words, two shapes can have very Similar lower order invariant moments, but with dramat- ically different appearances. For the deformation matching process, the objective function is minimized only if the template matches a subpart of the input image exactly; the objective function value provides a measure of dissimilarity. We now Show that the invariant moment features and deformable template match- ing method do give different Similarity measures. We generate in Fig. 4.9 some random samples of deformed templates using the deformation transform defined in Eq. (4 .1 ) We compute the deformation between the deformed templates and the prototype tem- plate. In Fig. 4.10 we arrange the deformed templates in Fig. 4.9 so that the amount of deformation from the prototype increases from top to bottom, left to right. The template in the upper-left corner is the prototype template with no deformation, and the template in the lower-right corner has the most deformation. We give in Fig. 4.11 shows the same set of deformed templates but now they are ranked using 100 Affine Transform Nonrigid Deformation Transform 3:297:25 Figure 4.8: Comparison of the affine transform and nonrigid deformation transform. (a) deforming a rectangle using the affine transform; (b) deforming a rectangle using the nonrigid deformation transform defined in Eq. (4.1). 101 the Euclidean distance between the invariant moment features (defined in Eqs. ( 7. 2, 7. 3)) of the template and deformed shapes. In Sec. 7.1 we will further Show how the deformable template matching method can be used to improve the matching results obtained by using simple Shape features including the invariant moments. In Fig. 4.12 we Show the results of applying the deformable template model to a fish image using different deformation basis. It is observed that with comparable numbers of degree’s of freedom, both the spline basis and the wavelet basis can model local deformations better than the global trigonometric basis. However, a template with an excessive number of degrees of freedom may be attracted to spurious image features which are not desired. We have proposed to use three different deformation basis to model the deforma- tion field. The selected set of basis determines the allowed deformation space and the template space. It is desirable to use that deformation basis which facilitates both the representation and computation in the following ways: 0 From the representation perspective, the deformation basis, as a means for ap- proximation, should be both accurate and efficient. Accuracy means that the set of basis should approximate the natural deformation with precision. Effi- ciency means that the set of basis should approximate the natural deformation well using, preferably, a small set of parameters. 0 From the computational perspective, the selected deformation basis should help to facilitate the computation for the optimal solution. It is favorable to have a smooth objective function in the deformation parameter space, so that the MVQUQQPQOVRVU Mm fm. KVUQVPVDUQOU mm QQWVUPQWQE mm m... V/yflvvflv??? WW mam. @Vyfiflifgg mmm mmm Mam POUQUQUPVQQE Mmm 105 M M24, ndf=120 Figure 4.12: Locating a fish using the deformable templates with different deformation basis. (a) Using 2D trigonometric basis. M is the number of approximation levels, ndf is the number of basis used (number of degree’s of freedom). (b) B—spline basis. N is the number of control points used. (c) B-spline wavelet basis. M is the number of resolution levels. 106 deformable matching process can quickly converge to the desired optimum; All the three deformation basis are commonly used in approximation and model- ing. We would, ideally, select the set of deformation basis which satisfies the above criteria. But for a specific applications, it is difficult to determine this. For the representation, we may adopt a measure which combines the description efficiency (say, coding length) and the deformation matching score (say, the directional edge energy of the template) to compare the various deformation transforms for a par- ticular matching problem. It is more difficult to evaluate, from the computational perspective, which set of deformation basis renders a smooth energy surface. Further, it is not possible to express analytically the energy function using the deformation parameters because the edge potential field itself is empirical. A straight forward measure of the smoothness of a function is high order derivatives. If we express the objective function explicitly in terms of the deformation parameters, we can evaluate its high order derivatives. However, this process is complicated because of the high dimensionality of the deformation parameter space. Condition number is one way to measure the ill-conditionness of a function. Let the single value decomposition (SVD) of a matrix A be A = U WV‘. The condition number of A is defined as the ratio of the largest (in magnitude) of the w,’s to the smallest of the wi’s, where w,’s are the diagonal elements of matrix W. We can compute the condition number for the Jacobian matrix of the objective function to see if it has good computational attributes. But, they also have to be computed empirically to evaluate the objective function surface. Further, the condition number is also a local measure of smoothness. 107 An alternative might be the entropy of the posterior distribution. The entropy of a distribution measures its randomness, or predictivity. A smooth probability density function is difficult to predict and possesses a large amount of randomness. Assuming that a set of pdf’s all come from the same family, we can compare the smoothness by examining the entrOpy of each of the pdf. The larger the entropy, the smoother the pdf. The entrOpy provides a global and overall smoothness measure of a function, given apprOpriate assumptions. However, the difficulty of applying the entropy to measure the smoothness of the posterior distributions in our case is that the exact posterior pdf is difficult to compute. We have to empirically evaluate the likelihood for each point in the parameter space and we have to normalize the generalized pdf over the whole parameter space. Theoretically, it is difficult to contrast the smoothness of the objective functions using various deformation basis. However, we can make some intuitive observations. The 2D-trigonometric basis provides a smoother objective function surface because the basis is global. This agrees with the empirical observations where the matching process converges more consistently and faster. Chapter 5 Image Segmentation and Object Tracking An important application of the deformable template matching method is image segmentation. As indicated earlier, the deformable template is attracted to the salient image features of the input image. When the objective function value is minimized, the template attempts to align with the edge map of the input image. Suppose we manually initialize the template near the object of interest, then the converged configuration gives a description of the object boundary. The deformable template can also track objects in image sequences. The same object present in successive frames retains its global shape, though it varies from frame to frame. The prototype can specify the common shape characteristics, whereas the deformation can specify the changes between frames. Since the object shape in two successive frames does not change much, the converged configuration in the current frame provides a good initialization for the deformable template in the next frame. 108 109 In the rest of this chapter, we present some experimental results on image seg- mentation and object tracking using the deformable template matching algorithm. 5. 1 Preprocessing Given an input image, we start out by sketching a prototype template which resembles an object of interest. Then the input image is processed to obtain the corresponding edge potential image(s). First, we use the Canny edge detector (0 = 1, mask size is 9 x 9, and the lower and upper thresholds are 0.5 and 0.9, respectively) to calculate the edge map of the input image as well as the gradient direction at each edge point. Then, the edge potential field is calculated from the output of the Canny edge detec- tor. Fig. 5.1 illustrates the intermediate results (images) of the preprocessing stage. Figure 5.1(a) is the input CD cover image which contains a saxophone. Its corre- sponding edge map is shown is Fig. 5.1(b), obtained using the canny edge detector with a a 1 of 3 pixels. Note that the boundary of the saxophone is well preserved, but there are many noisy edges near the outer contour. The edge potential magnitude (Eq. (4.35)) is illustrated in (c), where a dark grey value denotes low potential en- ergy. Fig. 5.1(d) shows the edge directional field (Eq. (4.36)), where the direction at a point is defined as the tangent direction of the closest edge pixel. The direction is not calculated for pixels in the “white” regions because they are so far away from the edge pixels, and the potential magnitude is so small that the potential fields (both the magnitude and the direction) are negligible. 10 is a real number that determines the width of the detected edges. Larger values of a will tend to find edges that are more widely spaced. 110 @(WZZZ? 08.0 5 * (C) Figure 5.1: Preprocessing. (a) input image (256 x 256); (b) edge map using the Canny edge detector; (c) magnitude of the edge potential field; ((1) direction field; 111 5.2 Image Segmentation The proposed deformable model can be used to segment desired objects from their background in an input image. When it is used in this way, the given prototype tem— plate is first placed in the neighborhood of the desired objects. The template is then deformed to match the object contours in the input images. The gradient descent algorithm is used to find the local minimum of the objective function in Eq. (4.40). The minimum energy configuration gives the segmentation result. The final deformed template as well as the intermediate sequences of the deformable template are pre- sented to show how the deformable template evolves to match the salient edges in the input image. Figure 5.2 shows the segmentation process when we (manually) place a hand- drawn template (Fig. 5.2(a)) in the neighborhood of a sax0phone in a CD cover image (Fig. 5.2(b)). Figure 5.2(c) shows the manually chosen initial position of the template, and Figs. 5.2(d) and 5.2(6) are the intermediate snapshots of the deformation process. The final converged shape of the template is shown in Fig. 5.2( f), which gives a segmentation of the desired object. In Fig. 5.3 we Show the same evolution sequence when we apply a fish template (Fig. 5.3 (a)) to an image containing a fish (Fig. 5.3 (b)). In both the cases, the de- formable templates converge to the correct object contour by decreasing the objective function value .6. Figure 5.4 shows an example of the sensitivity of the matching process in Fig. 5.3 to the initial placement of the template. As long as we initialize the template Figure 5.2: Localization of a saxophone using manually chosen initial template po- sition. The input image size is 285 x 286. (a) The prototype template; (b) input image; (c) initial position, I. = 0.603; (d) 10 iterations, L = 0.327; (e) 16 iterations, fl = 0.186; (f) 30 iterations, [I = 0.123. Figure 5.3: Localization of a fish using manually chosen initial template position. Image size is 256 x 256. (a) Prototype template; (b) input image; (c) initial position, [3 = 0.432; (d) 4 iterations, .C = 0.308; (e) 7 iterations, .C = 0.221; (f) 40 iterations, .C = 0.157. 114 (Fig. 5.3(a)) so that its centroid falls in the black region in Fig. 5.4, the template converges to the correct configuration. The value of a which we used to obtain the potential energy field is 11 pixels. The extent of the convergence region is about 40 pixels. Figure 5.4: Sensitivity of the matching algorithm to the initial position. 5.3 Object Tracking Object tracking is a challenging and important problem in computer vision. Tracking can be of interest both over time (video sequences) or through space (2D slices of a 3D structure). Tracking over time provides useful information about scene changes and motion parameters, while tracking in space helps in recovering the 3D structure from 2D projections. Deformable contours such as snakes [81, 131] have been applied to tracking tasks, including image—based tracking of rigid and nonrigid objects [8, 14, 17, 71, 81]. The force-driven snake model can easily incorporate the dynamics derived from time- 115 varying images. Kass et a1. [81] have used snakes to track facial features such as lips in an image sequence. The estimated motion of these features are then used to explain facial expressions, etc. Multiple snakes were later used by Terzopoulos and Waters [132] to track more articulate facial features. Leymarie and Levine [89] have used the snake model to track cells in biological image sequences. Bascle and Deriche [15] have combined deformable region models and deformable contours in a sequential way to track moving objects, where a correlation-based region matching method was used in the first stage to roughly locate the objects, and a gradient-based contour models was then used to refine the tracking result. A deformable stochastic model was prOposed by Kervrann and Heitz [83] to track objects in long image sequences, where a point distribution is used to characterize the structure and variations in the object shape. The snake model interacts with and is attracted to salient image features under local smoothness and stretchness constraints. It is a flexible shape model for image segmentation and feature extraction, and there have been many successful applica- tions. However, it does not inherently incorporate any global structure information except for the local regularization constraints. The configuration of a converged snake depends largely on its initialization. When a snake is used to track features in an image sequence, the converged snake in the current frame is often used to initialize the snake in the next frame. The global structure of the object shape is indirectly carried out via the initialization. However, when the image features are very weak or absent because of occlusion, the snake, due to its lack of global structure, may fail to track the shape and get distracted by spurious image features. 116 Unlike the “free—form” active contour model, our prototype—based deformable tem- plate explicitly encodes the global structure in shape modeling. It has potential in object tracking in image sequences due to the following reasons: 0 The object of interest in the image sequence can vary from frame to frame due to a change in the view point, the motion of the object, or the non-rigid nature of the object. These shape variations can be captured by the deformable shape model; 0 Although the object shape varies from frame to frame, the overall structure of the object is generally maintained. The deformable shape model can capture this overall structure by using an appropriate prototype; o The motion or deformation between two successive frames is not significantly large so the converged configuration in the current frame can be used to provide a reasonable initialization for the next frame. In the rest of this section, we describe how to apply the prototype-based template model to track objects in an image sequence. We investigate all the possible cues that can be used to improve the tracking results. In particular, we use image gradient, inter—frame motion and region correspondences to track the objects. 5.3.1 Tracking Criteria Many object tracking applications share the following properties: 117 o The inter-frame motion of the object is small so that the object in the next frame is in the neighborhood of the object in the current frame; 0 The same point on the object has a consistent color/greyscale in all the frames of the sequence; a The boundary of the moving object has a relatively large motion field. The boundary of the object has a large image gradient value. Based on the above observations, we track the boundary of an object using the following criteria: 0 Shape similarity: object shapes are similar in two successive frames; 0 Region similarity: the properties (color, texture) of a region in the object remain constant throughout the sequence; 0 Motion cue: the object boundary should be attracted to pixels with a large amount of motion. 0 Gradient cue: the object boundary should be attracted to pixels with large image gradient. These criteria are explained in more detail in the following subsections: Matching Regions Suppose the deformable template delineates the object boundary accurately in the first frame. The segmented region (object) is used as a reference object Ore; for 118 color/ texture matching for the rest of the sequence, i.e., a point on the object in each frame exhibits similar region statistics as the corresponding point on the object in the first frame. Suppose we have successfully tracked the object up to the ith frame. Since the inter-frame motion of the object is assumed to be small, we can assume that the object boundary in the (i + 1)th frame is enclosed in a band (shaded region in frame (i+ 1) in Fig. 5.5) centered at the object boundary in the ith frame. An alternate way to state this is that when the object evolve from the current frame to the next frame, the corresponding point of a current boundary point in the next frame is enclosed in a disc centered at its current position. The radius of the disc depends on the inter- frame object motion. The larger the inter-frame motion, the larger the disc. This radius determines the width of the band. As the deformable template model provides the correspondence of the boundary points in different frames, we can employ the region correspondence to help tracking the boundary. When we track the object in the (z' + 1)th frame, we can first predict a radius for each boundary point based on the tracking result in the ith frame. Each point in this disc is compared to the object region around the corresponding boundary point in the reference object in terms of color/greyscale. We compute for each point in the band a color/greyscale distance to the likely corresponding points on the reference object. The mathematical description comes as follows. Assume that the boundary of the object in the first frame consists of a linked list of no points p8,p‘1’,pg, . . . ,p20_1, where the superscript indicates the frame number, and the subscript indicates which point it is on the object boundary. We define the 119 neighborhood N 0(k) of the kth boundary point in the 0th frame to be the intersection of a disc centered at p2 and the object region. We define the neighborhood N (l) of an image pixel l to be the disc centered at Z (See Fig. 5.5). For each point I in the band in the (z+1)th frame, we compute a matching score which measures the color/greyscale similarity to possible corresponding points on the reference object. distanceU) = kprgieilna) Dist(l,N°(k)), (5.1) v k where Dist(l,N°(k)), the distance of pixel l to region N°(k), is defined using the order statistics as follows. Let the Euclidean distance between the greyscale/color of pixel l to the greyscale/color of pixels in N°(k) be d¢k0,d¢k,,dlk2, . . .,d,ka_,, in the order of increasing distance. Di3t(l,.A/°(k)) is defined as the average of the distance between the 10th percentile and the 40th percentile, that is, I: 22:21:", due.- D'tl,N°k = , 28( ()) 2].;ka (5.2) where km and 19,, are the 10th percentile and 40th percentile points. This statistics is used because it is robust to noise. Given the converged deformable template in the ith frame, we can compute a distance map in the (i + 1)th frame, where for each point in the band around the template position in the ith frame, a distance is assigned to reflect its degree of resemblance to the potential object boundary. If this pixel happens to lie on the object boundary, the distance would be very small. If it is a background pixel which 120 Ira-e 0 Frame 1 Frame 1+1 Figure 5.5: Computing color/greyscale distance. The detected object in the first frame is used as the reference object. For frame i+1, color/greyscale distance is computed for the band (shaded region) around the detected object boundary in frame 2'. is quite different from the object pixels in color, the distance would be large. As a result, when we search for the possible object boundary in the new frame, it is not likely to be located in regions with large distance values. Such an example is illustrated in Fig. 5.6, where Fig. 5.6(a) shows the reference object (a human hand) segmented from the first frame, Fig. 5.6(b) shows the localized hand in the 3rd frame, Fig. 5.6(c) is the 4th input frame, and Fig. 5.6(d) shows the negated color distance map computed for the 4th frame based on the template position in the previous frame (frame 3) and the reference object in the first frame (frame 0). We can tell the bright region (small color distance) which corresponds to the hand region in the 4th frame. Motion Cues A moving object creates a large amount of motion at its boundary. When we subtract two successive frames and compute the absolute values of the frame difference, the boundary of the moving object is highlighted. Image subtraction has been commonly used to segment objects from their background. We use it to provide a supplemental cue for object tracking. 121 Figure 5.6: Computing color distance for an input frame. (a) The detected object in the first frame is used as the reference object (image size: 288 x 352). (b) The tracking result for the third frame; (0) The fourth input frame; (d) The computed color distance map (negated) for the fourth frame. The greyscale value is inversely related to the color distance. The bright region in the map indicates a potential object with a matching color composition. 122 The motion field is obtained by computing the absolute values of the inter-frame differences, and then smoothed using a 2D Gaussian mask. Image Gradient Object boundary (either static or moving) is often characterized by discontinuities in color/greyscale, which is indicated by a large image gradient. This criterion is most commonly used in image segmentation. We compute the image gradient for each image frame as the sum of squares of the color/greyscale differences along the a:— and y— axes, and then smooth it for each frame using a 2D Gaussian mask. 5.3.2 Objective Function To track an object in an image sequence, we deform the template so that: 1. Small deformations are preferred. 2. The template is attracted to image pixels with large gradient; 3. The template is attracted to image pixels with large motion (inter-frame dis- tances) . 4. The template deforms itself to minimize the average color/greyscale distance of the enclosed pixels; The first goal is achieved by penalizing large deformation parameters. The re- maining three goals are achieved using the following definition of image potential field. 123 Image Potential Field We compute an image potential field to find a match between the deformable template and the object boundary in the image frame. The potential field for the tracking problem incorporates image gradient, color/greyscale cues, and motion cues. Let the image gradient plane for the ith frame be 9,, the absolute difference between two frames (2' — 1) and i be Mg, and the color distance map for the ith frame be 6,, then the potential 79,- is computed as: 79:? = —(gi 9 Mi) 9 (Oman: — Ci): (53) where (9 denotes pixel-wise multiplication, and Cm” is the maximum color distance in Ci. The potential 79,-(7') of a template T, when placed in the tracking potential field, is the average of potentials of all the template pixels. When the template minimizes its potential, it is doing at least one of the following: (i) increasing its gradient (attracted to image edges), (ii) increasing its motion (attracted to moving boundaries), and (iii) increasing its color/greyscale similarity to the reference object. The deviation of the template from the prototype is measured by the sum of squares of the deformation parameters. We usually prefer small deformations to large deformations. The objective function, which incorporates both the amount of 124 deformation and the goodness of matching, is defined as: “$17, 31, 8) = «42663) +7’(7')- (5-4) €i€_ This objective function is minimized to match a template model to the object of interest in a sequence. We use the gradient descent method to minimize the objective function (Eq. (5.4)) w.r.t. the deformation parameters g, the translation parameters :1:,y and the scale parameter s. Figure 5.7 illustrate an example of the fusion of image gradient, color consistency, and motion. After the integration of the multiple image cues, an image potential field is obtained which highlights the desired salient image features. 5.3.3 Tracking Algorithm The tracking proceeds as follows: 1. Initialize a template in the proximity of the object in the first frame. Cur- rently, this initialization is done manually. Apply the deformable template matching process until it converges. Record the converged template and the color/greyscale information on the segmented object boundary and its neigh- boring pixels on the object. As motion and region similarity information is not available for the first frame, only the gradient is used for the potential field. 2. Update the prototype to the converged deformable template in the current frame; Compute the color/greyscale distance map using the next frame and the 125 Figure 5.7: Integrating image gradient, color consistency, and inter-frame motion. (a) The image gradient for the input frame in Fig. 5.6(c); (b) The color distant map (negated) for this frame; (c) The inter-frame motion for this frame; (d) The integrated image potential field using Eq. 5. 3. 126 reference object (segmented object in the first frame); initialize the template in the next frame using the pose (translation, scale) of the converged configuration in the current frame; perform the deformable template matching using the po- tential which combines image gradient, inter-frame motion, and color/greyscale distance map, until it converges; 3. Repeat step 2 for all the subsequent frames. 5.3.4 Experimental Results We have applied the proposed tracking algorithm to track objects in a number of image sequences including an MRI image sequence, a home video, and a TV program. Figure 5.8 shows the tracking result of applying the deformable template match- ing method to track the left ventrical in a cardiac image (MRI) sequence using image gradient information alone. The time evolution of a particular region can be an im- portant evidence for diagnostic purposes. The difficulty for this sequence is that for frames 3, 4, and 5, the inner wall of the ventrical is hardly detectable in the bottom- right corner (Fig. 5.8(a)). The true edges of the heart wall are barely visible. The edges shown in the image do not correspond to true physiological boundaries, they are due to inhomogeneity in the acquisition and floating papillary muscles. This is also indicated by the image gradient image, where there is a very weak gradient at the locations of the inner wall, and there are spurious regions with strong gradients above the inner walls. For these several frames, the structure contained in deformable template overcomes the missing and spurious image features, and the template main- 127 tains its structure rather than being attracted to the spurious features. As shown, the deformable template is able to correctly follow the contractions and expansions of the left ventricle through time. In each frame, the inner wall of the left ventricle is detected. For this sequence, it takes 0.46 sec. to compute the potential field for the whole image sequence, and another 0.36 sec. to track the entire sequence on a SGI Indigo 2. Human face tracking can be of significant importance in a number of applications, including video conferencing where the face region is located and coded at a higher bit rate than the background for transmission and storage. Figure 5.9 shows the tracking results of a human face using only image gradient information. The sequence consists of 35 frames from a video which lasts 12 secs, which was sampled at 3 frames per second. Note that the template is capable of handling partial occlusions in the frames in the bottom row. Despite the shift and rotation of the face, the deformable template can reasonably delineate the contour of the head through the whole sequence. It takes 5.058 sec to compute the potential image for the 35 frames, and another 2.532 sec to track the face for the 12 sec video clip. Figure 5.10 shows the tracking of a human hand in a weather forecast video clip using image gradient, color region constancy, and inter-frame motion. The map in the background contains curves with very strong image gradient. The edge force in the background curves is stronger than that at the boundary of the hand. However, with the help of color and motion information, we can reasonably track the moving hand through the 15 frames. Figure 5.8: Tracking a heart in a medical image (MRI) sequence (each frame size is 64 X 64) using the spline representation. (a) Input image sequence; (b) the image gradient of the input sequence; (0) template initialization in the first frame; ((1) tracking results for the sequence: the deformable template is able to capture the contractions and the expansions as the heart beats. 129 Figure 5.9: Tracking a human face in an image sequence (each frame size is 120 x160). (a) template initialization in the first frame; (b) tracking results for the sequence. 130 5.3.5 Summary We have presented an application of the prototype-based deformable template to the tracking of objects in image sequences from different sources. The major advantage of the prototype-based deformable model has advantage over the widely used “snake model” in tracking applications is that it inherently contains global structural infor- mation about the object shape. The inherent structure makes it less sensitive to weak or missing image features. We have combined image gradient, region color/greyscale, and motion cues to fa- cilitate the tracking process. In particular, we have introduced a region-based match- ing criterion which takes advantage of the color/greyscale constancy of corresponding object pixels throughout the sequence. We have applied the algorithm to a number of image sequences from different sources. The experimental results are promising. The inherent structure in the deformable template, together with the region, motion, and image gradient cues, make the algorithm relatively insensitive to the adverse effects of weak image features and partial occlusion. The proposed framework is quite general and can be applied to a number of tracking tasks. Future work will incorporate temporal prediction to improve the tracking results. 131 Figure 5.10: ”Hacking a human hand in a weather forecast TV program (each frame size is 288 x 352). Chapter 6 Multi—resolution Algorithm for Localization The localization and identification of objects in the input image involves optimizing the objective function in Eq. (4.40) in a high-dimensional parameter space of s, 9, g, and c_l. However, this objective function to be minimized is not unimodal. In fact, it is a rather complex function over the parameter space. The minima for this function can, in principle, be obtained by using Monte Carlo relaxation algorithms such as the Gibbs sampler [51], the Metropolis algorithm [50, 96], or the stochastic diffusion algorithms [53, 52, 61]. In all such algorithms, the minima are obtained by constructing an ergodic Markov chain whose limiting (stationary) distribution has support over only the modes of the a posteriori probability density function. However, because of the characteristics of stochastic sampling, these approaches achieve the optimal solution at the cost of excessive computing time. We have implemented a multiresolution algorithm [119] to locate an object more 132 133 efficiently. The search is initiated from a smoother and subsampled energy surface to a progressively finer one, in a manner similar to searching in a scale space [13, 147, 149]. The basic idea is that by using a smoother and subsampled energy surface, the algorithm can quickly converge to the neighborhood of the optimal solution, though there is no guarantee of convergence to the correct location; by using a more accurate energy function at the finer stage, a better localization can be achieved. Note that the smoothness of the energy surface is controlled by the parameter p in Eq. (4.35). The larger the value of p, the smoother the energy function. 6.1 Multiresolution Algorithm We have employed a multiresolution approach to quickly find a good match. A global search for plausible candidate positions is performed at the coarsest stage: a smooth potential field is used with a large value of p in Eq. {4.35). This smooth potential field has fewer spurious local minima, which helps the deformable template to find the global valleys. This stage attempts to roughly locate the global optima efficiently without regard to localization accuracy. Therefore, it can be performed at a reduced resolution. A subsampled deformable template is placed at a set of regular positions, and a set of discretized orientations, in the input image using the smooth edge potential field. The spacings between the template positions are chosen to be one fourth the size of the template so that all the significant local minima of the energy surface are covered. This computation can be done efficiently because 1. We work only with a subsampled template, 134 2. We use coarse step—sizes for the deformation parameters, 3. Initial positions with considerably high energy can be discarded immediately, 4. Fewer numbers of deformation parameters and iteration steps are needed since we are only interested in an approximate match, and 5. A larger step size can be used at the coarse matching level to speed up the convergence. Finer level matchings are initialized using the promising candidates screened from the previous stage. The deformed templates obtained at stage (I — 1) with low energy (below a threshold) are used as initial templates for the matching at stage I. A larger number of deformation parameters and finer step sizes are used to obtain more detailed matches. We have used three stages altogether. In all the stages, the local minimization is performed by using a deterministic gradient descent algorithm. The hierarchy of step sizes from coarse to fine allows the multi—stage process to escape local minima in the deformation parameter space. This coarse-to-fine matching can automatically find acceptable solutions to the minimization problem at an affordable computational cost. The algorithm for object localization is summarized as follows: Begin 135 o Preprocessing — Calculate the edge map and gradient direction of the input image using the Canny edge detector; — Compute the directional edge potential images at three difierent resolutions ( coarse to fine) according to the edge map; 0 Automatic Localization End 1. Perform the coarsest-level matching, initialized at evenly spaced positions, and over a discretized set of orientations, using a coarse step size and a smooth edge potential field. For each of these initial positions, the matching process is as follows: loop: Calculate the objective function and the partial derivatives w.r.t. pa- rameters s, O,§, d - If the objective function is less than a threshold, goto step 2 (finer- level matching). - Else if the number of iterations exceeds a threshold, report no object of interest and stop; - Else use the gradient descent method to update the deformation pa— rameters. Go to 100p. . Perform finer-level matching initialized by the candidate templates gener- ated by the coarsest-level search of step I. If the objective function is less than a threshold, goto step 3 (finest-level matching). The deformation pro- cedure is the same as described in step 1, but a finer step size and a coarser edge potential field are used. . Perform finest-level matching initialized by the candidate templates gen- erated by the finer-level search. If the objective function is less than a threshold, output the retrieved object. Here, an even finer step-size and a less smooth edge potential field than step 2 are used. In the above algorithm, the partial derivatives with respect to the displacements d“’ and d” are approximated by finite differences. The partial derivatives with respect to the remaining parameters are obtained by the chain rule, and by using the partial derivatives of d‘” and d” with respect to those other parameters. 136 6.2 Experimental Results We present the experimental results on automatically locating objects using (i) shape cues, and (ii) both the shape and texture cues. In these experiments, both the presence and the number of desired objects in the image, their position, pose and scale 1 are unknown. Localized objects, along with their poses and scales are reported. A quantitative value is also associated which each located object which describes the confidence in its matching. The multiresolution algorithm described in Sec. 6.1 is applied to automatically search an input image for desired objects given the shape description, regardless of the size, orientation and location of the presented objects. Ideally, a global search in the parameter space is needed. However, in order to achieve a reasonable level of efficiency, we use the multiresolution algorithm to find a suboptimum solution. In the multiresolution implementation, we diScretized the template orientation into a number of different orientations (currently, we use 12 different directions) which uniformly cover the interval [0°, 360°]. The deformed template is initialized at each orientation. Templates with orientations within each of these direction intervals are expected to be recovered by the deformation process itself. Unlike other parameters, we do not perform a global minimization of the objective function over all the values of the scale parameter 3. Instead, we settle for a minimum of the objective function when the value of s is in the local neighborhood of 1.0 (0.7 to 1.3). We initialize the 1For moderate scale range from 0.7 to 1.3 of the size of the template. In principle, objects of all scales can be taken care of by an exhaustive search for all scales, but this is computationally prohibitive. 137 coarse template at different positions. The spacing between these positions is about one fourth the size of the template. We have used three different values (12, 8 and 3) of the smoothness parameter p in Eq. (4.35) when calculating the directional edge potential fields at different resolutions. The parameter 7 in Eq. (4.40) which reflects the relative weighting between the prior (template) and the likelihood (data) was set to 0.1. In our experiments, we found that the results were not very sensitive to the the choice of 7 as long as it is in the range 0.1 to 1.0. Finally, in order to increase the speed of the likelihood-energy calculations, we pre—compute some images, namely, the Canny edge image and three potential field images corresponding to the edge potential fields at the three resolutions. In the following we show the experimental results using the three—stage coarse-to- fine deformable template matching scheme to automatically search the input image for the desired shapes. Fig. 6.1 illustrates the three stages for locating a fish. The parameter space is searched at the coarse level, with the most subsampled potential field and template, fewer approximation levels and smoother potential surface for low computational cost. Only a few locations are passed to the next level to initialize the matching at a higher accuracy. At the next (finer) level, a smaller sampling rate is used, and only 6 locations are obtained and passed to the finest level, where one localization is reported when the objective function value is thresholded. We also illustrate that our matching scheme can localize objects independent of their location, and orientation in the image. Objects of different shapes are retrieved using different prototype templates. In Fig. 6.2 we show the localizations of a guitar and a star separately. Each of the localizations is illustrated by the initial hand- 138 Figure 6.1: Locating a fish using the coarse-to—fine multiresolution algorithm. Horn left to right: edge potential field, output. (a) coarse level: the template and potential field are subsampled 1 : 4 in each dimension, a few configurations are obtained and passed to the next level (finer); (b) finer level: the template and potential field are subsampled 1 : 2 in a: and y direction, 6 configurations are obtained and passed to the next level (finest); (c) finest level, 1 configuration is obtained .6 = 0.22. 139 drawn template, the input image, the resulting template at the coarsest level (initial template for the finer level), and the retrieved object image. In the following experiments, we demonstrate that our localization scheme is able to retrieve all the objects in an input image that resemble the prototype template. Note that these objects may be of different sizes, different orientations, and may have local variations. In [61], multiple objects resembling the same prototype template were localized by using jump-diffusion algorithms. The addition of jumps further slows down the convergence of computationally demanding diffusion algorithms. In contrast, the multi-resolution minimization method adopted in this dissertation local- izes multiple objects without any additional overhead. (Each object resembling the prototype template corresponds to a different minima of the objective function. The objective function value is below the threshold at each of those minima, and hence all these objects are retrieved.) In Fig. 6.3, we have applied a “seed” template to the image of the cross-section of an orange using our multi-stage deformation algorithm. Fig. 6.3(a) shows the “seed” template and Fig. 6.3(b) shows the input image. All the resulting templates with objective function values below a threshold are presented in Fig. 6.3(c). All the six “seeds” with different orientations are retrieved correctly, and there is no false retrieval. Classification of microbial cells in terms of morphotype has been of substantial in- terest to microbiologists. Microbiologists have discovered a few hundred morphotypes which cover a small portion of the bacteria in the world. Our deformable template matching algorithm may be used for locating and identifying specific microbial cells without first segmenting the image. Figure 6.4 shows the localization result for a 140 (a) (b) Figure 6.2: Automatic localization of desired objects using the coarse-to—fine mul- tiresolution matching. (a) retrieval of a guitar using multiresolution deformable template matching (320 x 304), .C = 0.186; (b) retrieval of a star using multires- olution deformable template matching (256 x 256), [I = 0.157; (From top to bottom: hand-drawn template, input image, retrieved deformed template at the coarsest level, retrieved deformed template at the finest level.) 141 (b) (C) Figure 6.3: Automatic localization of “seeds” using the coarse—to-fine multiresolution matching. (a) a “seed” template; (b) the input image of the cross-section of an orange (453 x 436); (c) retrieved objects when the objective function is thresholded at 0.160. 142 bacteria image. Our deformable template algorithm can also handle prototype templates that are not closed. We have applied a prototype hand template which consists of an open curve to five different hand images. Note that the hand in each of the five input images is slightly different from the others both in shape and orientation. The localization result is shown in Fig. 6.5, where 6.5(a) shows the hand-drawn template, 6.5(b) shows the five different input images, and 6.5(c) shows the corresponding retrieval and localization. Note that although the hand instances have different shapes, the deformable template is able to accommodate these variations. In Fig. 6.6 we show the retrieval of two tower images using the same prototype tower template. The input images are pictures of the Washington monument taken at different times, and from different view points (Figs. 6.6(b) and (c)). The difference in the imaging process produces two different appearances of the tower. We have used the tower template as shown in Fig. 6.6(a) to localize the towers in the two pictures. Both the towers are correctly retrieved even though they are of different scales, orientations, and appearances. The energy function cannot only be used to retrieve objects in an input image that resemble a prototype template, but the same function can also be used to reject the hypothesis that a certain object is present in an image. Figures 6.2-6.6 showed that when the image contains an object resembling the prototype template, then it can be retrieved and localized accurately by our approach. In the following experiments, we demonstrate that if we apply a deformable template to an image which does not contain an object of similar shape, then the resulting objective function value will 143 Figure 6.4: Automatic localization of microscopic forms using the coarse-to-fine multiresolution algorithm. (a) a morphotype template; (b) two localized forms (c = 0.18, 0.21). 144 Figure 6.5: Automatic localization of human hand using coarse-to-fine algorithm. (a) the hand template; (b) input images which contain a hand (121 x 160); (c) retrieved hands (.6 E [0.191,0.267]). 145 Figure 6.6: Applying a tower template. (a) the template; (b) retrieval of tower 1 (280 x 280), c = 0.227; (c) retrieval of tower 2 (280 x 280), c = 0.243. 146 be sufficiently large so that we can reject the hypothesis that the specified objects are present in the image. In Figure 6.7(a) a fish template is applied to the image containing a saxophone, in Fig. 6.7(b) we apply the same template to a CD cover image, and in Fig. 6.7(c) we apply a saxophone template to the fish image. In these cases, the final objective value I. is significantly higher than the matching scores when there is a correct match. (b) Figure 6.7: What if the template is not present in the image? (a) applying a fish template to a saxophone image, L‘, = 0.587 (E = 0.142 for the saxophone template); (b) applying the fish template to a guitar image, 5 = 0.230 (.6 = 0.142 for the guitar template); (c) applying a saxophone template to a fish image, L‘. = 0.430 (f. = 0.170 for the fish template). The computation time for the multiresolution matching algorithm depends both on the size and complexity of the input image and the template. Usually more search time is required if (i) the image size is large, or (ii) the template size is large, or (iii) the image is complex with cluttered or textured content. Excluding the computation of the edge map and the directional edge potential field, it takes about 8.8 seconds to retrieve the hand template from an image of size 121 x 160, 7.8 seconds to retrieve the saxophone (image size: 285 x 286), and 9.2 seconds to retrieve the fish (image 147 size 256 x 256) on a Sun Sparc 20 workstation. Chapter 7 Retrieval From Image Databases We are now living in the age of multimedia, where digital libraries are beginning to play a more and more important role. In contrast to traditional databases which are mainly accessed by textual queries, digital libraries, including image and video databases, require representation and management using visual or pictorial cues. The current trend in image and video database research reflects this need. A num- ber of content-based image database retrieval systems have been designed and built. Among them, QBIC (Querying by Image Content) [102] can query large on-line image databases using image content (color, texture, shape, geometric composition). It uses both semantic and statistical features to describe the image content. Photobook [106] is a set of interactive tools for browsing and searching image databases. It uses both semantic-preserving content-based features and text annotations for querying. Vinod and Murase [140] proposed to locate an object by matching the corresponding DCT coefficients in the transformed domain [140]. A similar approach has been applied to computational videos [152]. Color, texture and shape features have also been applied 148 149 to index and browse digital video databases [152]. The compression standard for videos, MPEG-4, is quite different from the former standards MPEG-1 and MPEG—2, in that it is object-oriented and allows object-based interactivity and scalability. In MPEG-4, each video frame is a composition of video objects (V0), and each V0 is associated with shape, texture and motion attributes. The most recent standard under development, MPEG-7, is aimed to facilitate representation and retrieval of multimedia databases including image, video and audio. For all these applications, shape, as an important visual cue for human perception, plays a significant role. Queries typically involve a set of curves which need to be located in the images or video frames of the database. Speed and accuracy are two important issues in the design of a database retrieval system. The retrieval accuracy can be defined in terms of precision and recall rates. The precision rate is the percentage of retrieved items which are similar to the query among the total number of retrieved items. It measures the amount of correctly retrieved items. The recall rate is the percent of retrieved items which are similar to the query among the total number of database items which are similar to the query. A good retrieval system should have high precision and recall rates, although, in reality, we have to compromise between the two. In most of the existing retrieval systems, the challenge is to extract appr0priate features such that they are representative of a specific image/object attribute and at the same time, are able to discriminate images/ objects with different attributes. Once features have been extracted to characterize the image property of interest, the matching and retrieval problem is reduced to computing the similarity in the feature 150 space and finding database images which are most similar to the query image. However, it is not always clear whether a given set of features is appropriate for a specific application. Feature-based methods cannot be applied when the objects of interest have not been segmented from the background. The proposed deformable template matching algorithm does not compute any specific shape features. It pro- vides an appealing solution for the retrieval tasks because of its capability to (i) model an overall shape, (ii) accommodate shape variations, and (iii) easily handle different shapes. However, the generality of the approach and avoidance of segmentation are achieved at the cost of expensive computation. As a result, the DTM method is more suited for ofi-line retrieval tasks rather than online retrievals. In order to make the DT M method feasible for online retrievals, we have adopted a hierarchical retrieval scheme. In the first (screening) stage, the database is browsed using some simple and efficient matching criteria; in the second stage, the DTM is applied to the small set of choices obtained in the first stage. This hierarchical mechanism can improve both efficiency and accuracy. We have designed two image database retrieval systems using the hierarchical architecture. One is a shape-based retrieval system for binary trademark image databases. Another is a two—stage retrieval system for general image retrieval us- ing color, texture and shape. We will introduce each of the systems in the rest of the chapter. 151 7 .1 A Shape-based Retrieval System for Trade- mark Image Databases A trademark identifies and distinguishes the source of goods or services of one party from those of others. There are over a million registered trademarks in the US. alone, and they represent a number of goods and products which are sold by different manufacturers and service organizations. Most of the trademarks take the form of a symbol or a design, like an abstract drawing of an animal, or a natural object (Sun, Moon, etc.). When a trademark applicant registers for a new trademark, there must be an assurance that the new design does not coincide or create confusion with any of the already registered logos. Resembling trademarks can result in disputes over copyright, etc., and determining the possibility of such a conflict is based on the similarity in the binary shapes according to the USPTO (US. Patent and Trademark Oflice). It is tedious and expensive to compare the new design to the thousands of registered trademarks manually. A shape-based retrieval system can facilitate the searching process by browsing for a small set of matching candidates. We have incorporated the deformable template matching algorithm with multiple simple shape cues into a retrieval system for a binary trademark image database, where the user provides a hand—drawn trademark, and similar trademarks in the database are to be retrieved [138]. The trademark image database retrieval system achieves both the desired efficiency and accuracy using a two-stage hierarchy: 152 o in the first stage, simple and easily computable statistical shape features are used to quickly browse through the database to generate a moderate number of plausible retrievals; o in the second stage, the outputs from the first stage are screened using the proposed deformable template matching process to discard spurious matches. This system is outlined in Fig 7.1. The first stage of the hierarchy acts as a quick browser where the database images are pruned using easily calculated shape indices, including the invariant moments and the edge direction histogram [138]. The output of this stage is a relatively small set of candidate logos which have similar shape indices as the query trademark. Those candidates can, however, be quite different from the query shape visually because the simple shape features used in the pruning stage are not “information preserving”. The deformable matching model described in Sec. 4.3, then acts as a screener in the second stage to discard the spurious candidates, where the objective function values are used as a dissimilarity measure between images. In principle, we would like to apply the deformable matching process to every database image. This is prohibited by the relatively high computational cost of the deformable matching method. That is why the hierarchical architecture is used to give both a high precision and a high recall rate. 7 .1.1 Browser Using Simple Shape Features In the trademark image database, each trademark image is an object which needs to be matched against the others. In other words, there is no segmentation problem 153 Feature 1 Feature 2 Query Shape Top I (l