SEGMENTATION AND MATCHING OF MOVING VEHICLES FROM COMPLEX OUTDOOR SCENES

By Marie-Pierre Dubuisson

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Department of Computer Science
1995

ABSTRACT

SEGMENTATION AND MATCHING OF MOVING VEHICLES FROM COMPLEX OUTDOOR SCENES

By Marie-Pierre Dubuisson

This thesis describes a machine vision system which is able to extract moving vehicles from complex stationary backgrounds and match them based on color and shape information. This vision system was developed in support of the Intelligent Vehicle/Highway Systems (IVHS) program, whose main goals are to reduce the number of accidents on the road, increase the traffic density, and provide route guidance to the travelers. The specific problem of interest in this thesis is to estimate the average travel time between two points in a road network. We propose an object matching system which includes a number of modules that can be utilized in other application domains. In particular, we have developed four image segmentation algorithms. The motion segmentation algorithm can identify the moving areas in the image using a three-frame sequence. The color segmentation algorithm combines edge information and a split-and-merge technique to identify homogeneous regions of different colors in a color image. These two algorithms are general-purpose image segmentation algorithms. The other two segmentation algorithms utilize object model information. We propose an algorithm to extract an accurate contour of the moving object in the image. This technique fuses color and motion information using the theory of active contours to constrain the contour of the moving object to be smooth. The final image segmentation algorithm constrains the object to be a vehicle. We define a generic model of a vehicle and use the theory of deformable templates to fuse motion and edge information and obtain a segmentation and classification of the vehicle of interest. Once the vehicles of interest have been separated from the stationary background and the other moving vehicles, the matching strategy is based on color and shape matching. For color matching, we compare the 3D color histograms of the two objects. For shape matching, we propose a modified Hausdorff distance between point sets to compare the edge images of the vehicles. We evaluated the performance of the vision-based matching system on a database of 287 vehicles. We were able to segment and correctly classify 91.6% of the vehicles. We also demonstrated the efficacy of time-based, vehicle class-based, and color-based indexing schemes. We compared the machine vision results to the results that experts obtained manually by looking at the video tapes and concluded that the machine vision system is a feasible tool for travel time estimation.

To my husband Ray, my parents Bernard and Francine, and my sisters Sophie and Séverine

Acknowledgments

I would like to acknowledge all the people who have assisted me throughout my studies at Michigan State University. I am particularly grateful to my advisor, Professor Anil Jain, for his advice and encouragement. He provided me with numerous ideas, suggestions and comments. I owe my great interest in research activities to him.
I would like to thank the other members of my dissertation committee. Professor George Stockman always challenged me with new approaches and interesting ideas. Professor William Taylor provided valuable knowledge about real-world problems related to Intelligent Vehicle/Highway Systems. Dr. John Weng has always had an open door to discuss ideas and general concepts in computer vision. Professor Raoul LePage taught me the basics of stochastic processes and helped me understand statistical approaches to image processing. I also wish to acknowledge Dr. Shridhar Lakshmanan from the University of Michigan-Dearborn for his insight in the theory of deformable templates. I would like to express my special thanks to late Professor Richard Dubes who originally served on my dissertation committee. His wonderful teaching ability, incredible patience, and extensive knowledge greatly inspired me in pursuing research in pattern recognition. My sincere thanks go to all the members of my family. I could not have finished this dissertation without the love, patience. encouragement, and understanding of my husband Ray. I would like to share all my academic accomplishments with him. I also dedicate this thesis to my parents Bernard and Francine Dubuisson and my two sisters Sophie and Séverine Dubuisson who sent me love and support from across the Atlantic ocean through numerous phone calls and e-mail messages. I am also grateful to my in-laws Marilyn and Ray Jolly, and Ann and Curt Blaszak for their love and vi encouragement. throughout the writing of this dissertation. I would like to thank all my friends for their moral support. especially Barb Birch- ler. Dave (.‘hesney, Barb Czerny. Maureen Galsterer. Edgar Kalns, Ron Sass, and Steve Turner. also known as the “the lunch crowd”; Catherine Tamareille and Gary \-"e1'cruysse. Carine Bourget and Marwan Kruntz, and Isabelle C'assagne and Mike Demarte. Thanks also go to all the members of the Pattern Recognition and Image Processing Laboratory for their interest and concern, especially Jinlong Chen, Yun- tao Cui, Chitra Dorai, Hansye Dulimarta, Sally Howden, Qian Huang, Kalle Karu, Jianchang Mao. Aniati Murni, Tim Newman, Sharathcha Pankanti, Nalini Ratha, Dan Swets, Oivind Due Trier, Marilyn Wulfekuhler, Yu Zhong, and especially our lab manager John Lees. I would also like to acknowledge the graduate students from the Transportation Group in the Department of Civil and Environmental Engineer- ing, especially Paul Dorothy, Sumeet Johar. Martin Kane, Gill Mosseri, and Sorowit Naraupti for assisting me in collecting the data and analyzing the video tapes. Financial support in the form of graduate research assistantships from the US. Department of Transportation through the Great Lakes Center for Truck Transporta- tion Research, the State of Michigan Department of Transportation, and the National Science Foundation is gratefully acknowledged. Contents List of Figures List of Tables 1 Introduction 1.1 Model-Based Object Recognition .................... 1.2 Object Matching ............................. 1.3 Contributions of the Thesis ....................... 1.4 Outline of the Thesis ........................... 2 Intelligent Vehicle/ Highway Systems and Applications 2.1 IVHS in the United States ........................ 2.1.1 Technical committees ....................... 2.1.2 Current operational tests ..................... 2.2 IVHS in Europe .............................. 2.2.1 PROMETHEUS ......................... 2.2.2 DRIVE .............................. 
2.3 IVHS in Japan .............................. 2.4 Machine Vision for IVHS ......................... 2.5 Travel time and Origin/Destination Matrices Estimation ....... 2.6 Limitations of Machine Vision Systems ................. 3 Image Segmentation 3.1 Motion Segmentation ........................... XV 1 2 11 15 19 21 22 23 24 26 26 27 28 29 32 35 39 43 viii 3.1.1 Image subtraction techniques .................. 44 3.1.2 Gradient techniques ........................ 46 3.1.3 Feature-based techniques ..................... 49 3.1.4 Proposed method ......................... 50 3.1.5 Experimental results ....................... 55 3.2 Color Segmentation ............................ 60 3.2.1 Color spaces ............................ 61 3.2.2 Color Image Segmentation .................... 66 3.2.3 Color constancy .......................... 69 3.2.4 Proposed method ......................... 73 3.2.5 Experimental results ....................... 80 3.3 Summary ................................. 85 Model-Based Image Segmentation 86 4.1 Segmentation of the Moving Object ................... 89 4.1.1 Active contours .......................... 90 4.1.2 Proposed method ......................... 94 4.1.3 Experimental results ....................... 96 4.2 Segmentation using Deformable Templates ............... 99 4.2.1 Deformable templates ...................... 104 4.2.2 A Deformable Vehicle Template ................. 106 4.2.3 Image Model ........................... 109 4.2.4 MAP Segmentation ........................ 117 4.2.5 Experimental Results ....................... 126 4.3 Summary ................................. 131 Color and Shape Matching 133 5.1 Color Histogram Matching ........................ 133 5.1.1 Color indexing .......................... 134 5.1.2 Color matching .......................... 135 5.1.3 Baseline distribution of the color matching score ........ 139 5.2 Edge Point Set l\ e (3.1) where e is a given threshold value. The criterion for merging the 4 subimages Xkl, max max max max ' min min min min Edge-based segmentation techniques identify the boundaries between homoge- neous regions in the image. Once the edge points have been extracted, the regions consist of the pixels inside the contours. See [104, 158] for surveys on edge detec- tion. A commonly encountered difficulty with this method is that the contours are -13 often not closed and some additional processing has to be performed to identify the homogeneous regions. This chapter is divided into two sections. each describing a specific image seg- mentation algorithm. The first section describes motion segmentation techniques including a new approach based on frame subtraction using a 3-frame sequence. The second section describes color spaces. color segmentation. color constancy and pro— poses a color segmentation algorithm which combines the regions extracted by a split-and-merge color segmentation and edge information obtained by the Canny edge detector. 3. 1 Motion Segmentation Motion detection and estimation have received increasing attention from computer vision scientists during the past fifteen years [139, 1410]. Numerous vision applications can benefit from the use of motion cue. Mobile robots or autonomous vehicles need to locate the sides of the road and avoid obstacles during the course of navigation. In military applications, one might want to automatically track and compute the trajectory of a given target. 
Medical doctors would like to automatically detect the moving organs, estimate their volumes, and possibly identify some anomalies. Motion analysis can be a powerful tool for image segmentation, particularly in complex scenes. Motion segmentation consists of extracting the moving object of interest from its background (stationary or moving). Most commonly used techniques are based on image subtraction and optical flow computation. We will describe these techniques before presenting our approach and experimental results. ~14 3.1.1 Image subtraction techniques Image subtraction is a simple technique that can extract a moving object. in a sta— tionary background, but it cannot handle a moving background nor can it. estimate the motion parameters (translation and rotation or velocity). The basic philosophy of image subtraction is that the only gray level changes in an image occur at the location of the moving object. Several variants of image subtraction are given below. Subtraction from a reference frame The current frame containing moving objects can be subtracted from a reference frame of the same scene known to be static (without any moving objects). The problem with this method is that outdoor scenes are never quite static (the wind blows the tree branches and leaves, the position of the sun produces varying amount of shadows), implying that frequent updates of the reference frame should be made. This technique was used by Sethi and Brillhart [161] to detect moving vehicles on a street and identify non-conforming behavior such as a vehicle parked on the side of the road. But, they do not mention the method that was used to update the reference frame. Shio and Sklansky [163] also used this technique to segment people crossing a street. They obtain a background image at time t by mode filtering in the time domain, i.e., the most frequent value, over a time period, at a given pixel is used in the background image. Kilger [108] updates the background image continuously by masking the regions of motion using a Kalman filtering approach in the following manner: B(+1= Bf + [01(1 — III!) + OQII’LHIt — By), (3.3) where I is the input frame, B is the background image, M is the binary mask cor- responding to the moving objects, and a1 and a; are updated based on an estimate of the rate of change of the background. A similar approach is used by Sorensen and Christensen [I65]. Subtraction using two frames A difference image can also be obtained by subtracting two frames in the image sequence. The background is not expected to change drastically in the short period of time between two successive frames. Moving areas in the scene are highlighted when the difference image is thresholded. Jain and Nagel [102] accumulated difference images to compute masks of the moving objects. Their technique is quite restrictive since it assumes that the object has moved at least its length in the two frames. This implies that the object is small enough to be contained in the first frame and in the last frame after it has moved its length. This technique was then enhanced by Jain et al. [101] to handle small amounts of object motion by using only two consecutive frames. A different method based on the same idea was introduced by Yalamanchili et al. [187]. After thresholding the difference image, pixels that belong to the background are “off”, so are pixels that belong to part of the object that overlaps between the two frames. Pixels will be “on” where the object is subtracted from the background and vice—versa. 
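Returning for a moment to the reference-frame variant, the background update of Eq. (3.3) is easy to state in code. The sketch below is only an illustration under stated assumptions: the blending weights a1 and a2 are fixed constants here, whereas Kilger [108] updates them from an estimate of how fast the background changes, and the NumPy formulation and threshold value are not taken from any of the cited systems.

import numpy as np

def update_background(B, I, M, a1=0.05, a2=0.005):
    # One step of a Kilger-style update (cf. Eq. (3.3)):
    # B_{t+1} = B_t + [a1 (1 - M_t) + a2 M_t] (I_t - B_t).
    # B: background image, I: new frame, M: binary mask of moving pixels.
    gain = a1 * (1.0 - M) + a2 * M
    return B + gain * (I - B)

def reference_frame_mask(I, B, threshold=0.1):
    # Subtraction from the reference frame: flag pixels that differ from
    # the background by more than a threshold (gray levels in [0, 1]).
    return np.abs(I - B) > threshold

Pixels flagged as moving are excluded, through the mask M, from the fast update, which is what keeps a slowly varying outdoor background current without absorbing the moving vehicles into it.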
In [187], these four types of regions are used to decide which pixel belongs to the moving object. This technique assumes that the objects are uniform in gray level and that strong edges can be extracted between the object and the background.

Subtraction using three frames

Hsu et al. [88] proposed a method using three frames instead of two. Using three consecutive frames, two interframe differences are computed. They are combined using a logical OR operator. Interframe differences are not computed using image subtraction and thresholding, but using the likelihood ratio introduced by Yakimovsky [186]:

L = \frac{\left[ \frac{\sigma_P^2 + \sigma_C^2}{2} + \left( \frac{m_P - m_C}{2} \right)^2 \right]^2}{\sigma_C^2 \, \sigma_P^2},   (3.4)

where m and \sigma^2 denote the mean and variance of the gray values in a sample area in the previous frame (P) and the current frame (C). If the value of L in a given area is greater than a given threshold, then this area is assumed to be part of a moving object.

3.1.2 Gradient techniques

Horn and Schunck [86] established a motion constraint equation based on the assumption that the gray level of any pixel in the image is constant over time. Therefore, any change in gray level at a given pixel must be due to motion. Suppose a pixel located at (x, y) at time t moves to location (x + \delta x, y + \delta y) at time t + \delta t. Since the intensity of the pixel should be constant,

I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t).   (3.5)

Expanding the intensity function by Taylor series, we obtain:

I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\delta x + \frac{\partial I}{\partial y}\delta y + \frac{\partial I}{\partial t}\delta t + \text{higher order terms}.   (3.6)

Neglecting the higher-order terms leads to the following motion constraint equation

\frac{\partial I}{\partial x} dx + \frac{\partial I}{\partial y} dy + \frac{\partial I}{\partial t} dt = 0,   (3.7)

which is conveniently written as

I_x u + I_y v + I_t = 0,   (3.8)

where (u, v) = (\frac{dx}{dt}, \frac{dy}{dt}) is the velocity at a given pixel (x, y) in the image. This equation is, of course, not sufficient to compute the velocity at a point since it only constrains the values of u and v to lie on a line in the velocity plane. Different techniques have been used to combine a number of constraint equations into a system with a unique solution. The advantage of techniques based on the motion constraint equation is that they produce a dense flow field. This flow field can then be segmented to extract the moving objects from the background. The main disadvantages are that they assume a small displacement of the object between frames and have a tendency to smooth the moving edges.

Local optimization approach

The local optimization technique assumes that neighboring pixels have the same velocity. A total of N constraint equations from pixels in a local neighborhood are grouped into a system with N equations and 2 unknowns (u and v). Theoretically, N = 2 should be enough to solve the system, but, typically, one chooses N >> 2 to minimize the effect of measurement errors. The system of equations can then be solved using a least squares method. This technique was used by Thompson and Barnard [175] and Kearney et al. [105]. Thompson [174] used this technique in combination with contrast information to perform segmentation of the image into background and moving objects. Peleg and Rom [148] used this technique to directly estimate the motion parameters (translation and rotation coefficients) by assuming that the depth of the scene was known.

Global optimization approach

This approach assumes that the flow field should vary smoothly over the entire image. Horn and Schunck [86] suggested minimizing the following energy function with respect to the velocity field:

E = \iint_D \left[ (I_x u + I_y v + I_t)^2 + \lambda^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx \, dy,
where \lambda reflects the relative importance of the smoothness term in the solution. The problem with this technique is that the moving edges are smoothed in the process. Nagel and Enkelmann [141] introduced an oriented-smoothness assumption by slightly modifying the above equation and were able to keep the moving edges sharp.

Clustering in the velocity space

The clustering approach consists of accumulating the constraint lines in a Hough transform manner. Each constraint line is projected into the discretized velocity plane, and contributes as one entry in each cell into which it falls. The cells with the highest number of entries correspond to the velocities of the moving objects. More details on this technique are given in the papers by Fennema and Thompson [61], and Cafforio and Rocca [23].

Other approaches

Duncan and Chou [50] extracted moving contours (or temporal edges) by finding the zero crossings of the input image sequence convolved with the second derivative with respect to time of a temporal Gaussian smoothing function:

S_t(x, y, t) = \frac{\partial^2 G(t)}{\partial t^2} * I(x, y, t),   (3.9)

where

G(t) \propto \exp(-s^2 t^2).   (3.10)

The moving contours are then combined with the motion constraint equation to estimate the component of the optical flow that is normal to the moving contour. A number of papers describe techniques which, instead of using the intensity field in the motion constraint equation, use other dense matching primitives. Heeger [83] uses the energies in the outputs of Gabor filters. Fleet and Jepson [62] and Weng [182] use phase information.

3.1.3 Feature-based techniques

The feature-based approach (or discrete approach) to estimating the optical flow and segmenting the moving object from the background consists of matching features extracted from two frames. Feature extraction and matching have been widely used for computing disparity in a pair of stereo images as well as in sequences of images presenting motion. The correspondence problem is solved using features extracted from the first frame and matching them to similar features extracted from the subsequent frames. If the objects in the scene are subject to rigid motion, then features that occur close together with the same motion parameters can be grouped to form objects, resulting in a segmentation of the image into moving objects and background. Weng et al. [183] use edges and corners as image features to be matched from one frame to the next. Koller et al. [114] use local minima and maxima of the gray level images. Their main contribution was the ability to compute the trajectory of the moving object and characterize it using motion verbs. In a later paper, Koller et al. [113] use edge segments to solve the correspondence problem. Zimmermann and Kories [193] also use local minima and maxima in the image convolved with a band-pass filter. Cumani et al. [38] compute three features at each edge point and use them to solve the correspondence problem. The features are the luminance, representing the value of the image at a contour point; the amplitude, representing the total change in luminance when crossing a contour; and the sharpness, representing the deviation of the luminance transition from the ideal step edge. They showed clearly how clustering in the displacement space can be used to separate different moving objects with different motion parameters. Thompson and Pong [176] presented techniques to handle different types of features such as feature points, edges, or regions.
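As an aside before turning to the trade-offs of these methods, the local optimization approach of Section 3.1.2 reduces to a small least-squares problem at every pixel. The sketch below gathers the motion constraint equations I_x u + I_y v + I_t = 0 from a square neighborhood and solves them for (u, v); the window size, the finite-difference gradients, and the NumPy formulation are illustrative assumptions rather than details of any of the systems cited above.

import numpy as np

def local_flow(I1, I2, x, y, half=7):
    # Least-squares estimate of the velocity (u, v) at pixel (x, y) from
    # the motion constraint equation over a (2*half+1)^2 neighborhood.
    # I1, I2: two consecutive gray-level frames as float arrays.
    Ix = (np.roll(I1, -1, axis=1) - np.roll(I1, 1, axis=1)) / 2.0
    Iy = (np.roll(I1, -1, axis=0) - np.roll(I1, 1, axis=0)) / 2.0
    It = I2 - I1

    # One constraint equation per pixel in the neighborhood.
    win = np.s_[y - half:y + half + 1, x - half:x + half + 1]
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)   # N x 2
    b = -It[win].ravel()

    # Overdetermined system A [u, v]^T = b solved by least squares.
    (u, v), *rest = np.linalg.lstsq(A, b, rcond=None)
    return u, v

Choosing N >> 2 equations per neighborhood, as noted above, limits the effect of noisy gradient measurements on the estimate.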
In [176], the motion parameters were checked using the rigid motion assumption. The advantages of the feature-based techniques are that they can handle large displacements between frames, the motion edges are not blurred, and they are particularly useful in the case of multiple moving objects. The main disadvantage is that they produce a sparse flow field, which makes it difficult to accurately segment the moving object. In addition, the image must contain salient features, usually implying that it must be highly textured, and, more importantly, the correspondence problem between features needs to be solved.

3.1.4 Proposed method

This brief review of motion segmentation techniques shows that all the published techniques have some disadvantages and cannot be directly applied to solve our moving object segmentation problem. In our application, the moving object of interest is quite large in the image, so that only a few frames containing the entire moving object are available in the image sequence. Also, since the objects are moving at a relatively high speed, the displacement between frames is quite large. Gradient-based approaches for optical flow computation can only handle small displacements. Feature-based approaches to optical flow computation require solving the correspondence problem between extracted features and produce a sparse flow field which cannot be used to extract an accurate mask of the moving object. Image subtraction techniques using a reference frame (or background frame) use a large number of frames that are averaged to produce the reference frame. Image subtraction techniques using two frames from an image sequence highlight the location of the moving object in both frames. An approximate location of the moving object can be obtained with better accuracy if three frames are used. It is quite obvious that motion segmentation alone will not produce an accurate mask for the moving object, but a good motion segmentation technique is capable of providing the approximate location of the moving object.

The motion segmentation method proposed here is based on image subtraction. Three frames are required for the computation; as long as the moving object is entirely contained in each frame, they need not be consecutive frames in the image sequence. The method locates the moving edges accurately as shown in Figure 3.1. Suppose a ramp edge successively moves from a position in frame 1 (f1) to another position in frame 3 (f3); then the difference between frames 1 and 2 will look like the solid curve in Figure 3.1(b), and the difference between frames 2 and 3 will look like the dashed curve. The only interval where both the frame differences are meaningful (bold curve) is at the location of the ramp edge in the middle frame. If there is no motion or if there is no strong edge at a given pixel, the difference value at that pixel is not relevant. This observation can be formulated mathematically as follows. Let us consider the one-dimensional problem first: let e_1(x), e_2(x), and e_3(x) correspond to the intensity value function around the ramp edge in the first, second, and third frames, centered at x_1, x_2, and x_3, respectively; the thickness of the edge is 2\Delta x.

Figure 3.1: A moving ramp edge: (a) positions in frames 1, 2, and 3; (b) frame differences and extraction of the moving edge.
Let a be the value of the function before the ramp edge, b be the value of the function after the ramp edge, and m = (b — (1)/231‘ be the slope of the ramp edge, then. 61(1?) 2] f 62(1) 2‘] 83(2)) 2 a nggrl—Ax Il—AJCSIgml‘I‘A-r b $1+Ar§r/]{°W and(Lj)Ethere3(k.l)€Cs.t. P(k,[)>t2ish thlvjl = 0 otherwise. (3.15) The thresholded image D, is used to produce a motion mask AA, such that Jl/ls(i,j) = 1 if pixel (i,j) belongs to the moving object and 4M3(i.j) = 0, other- wise. This mask is obtained by deleting isolated “on” pixels in ”D, and grouping the remaining “on” pixels into a single blob through a Chamfering technique which is equivalent to a morphological “close" operation. Figure 3.3 outlines each step of our motion segmentation algorithm using 3 consec- utive frames from an image sequence containing a moving vehicle. It can be noticed that the background in this image contains several objects that have common bound- aries with the moving vehicle. The mask boundary has been overlaid on the middle frame in Figure 3.3(e) to show that even though the mask shape is not exact, mo- tion segmentation performs reasonably well in providing an approximate mask at the location of the moving object. The algorithm is summarized in Figure 3.4. The CPU time requirements on a Sparcstation 20 are the following: ‘2 s to blur the 3 frames. 0.5 s to compute and threshold the difference image, and 4.5 s to derive the motion segmentation mask. Thus. the total computation time is T 3. 3.1.5 Experimental results The motion segmentation algorithm was tested on more than 400 image sequences of moving vehicles and other moving objects in indoor and outdoor scenes. Figure 3.5 (8) (h) Figure 3.3: Steps in the motion segmentation algorithm: (a) frame 1; (b) frame 2; (c) frame 3; (d) difference image; (e) thresholded difference image; (f) significant connected components in the difference image; (g) motion segmentation mask; (h) outline of the mask overlaid on frame 2. Q! Input: 3—frame sequence f1, f2. f3 begin Blur the 3 frames to obtain f1, f2, f3 using a 3X3 averaging mask /* Compute the difference image */ . DUJ) = Ifim) — M231)! x 1mm — mum /* Threshold the difference image */ Lfvtm) > 0.001 mg; was = 1 gls_e mm = 0 Find the connected components fo_r each connected component C i_f ,B (k,l) E C such that D(i,j) > 0.1 gm Erase C from D, endfor Grow the connected components using a Chamfering technique until they form a single connected component Fill in the gaps by scanning the image left to right and top to bottom Shrink the connected components to its original size gig Output: Motion segmentation mask A4, Figure 3.4: Motion segmentation algorithm. shows some results of the motion segmentation algorithm on moving vehicles. The figure shows the three consecutive frames and the motion segmentation mask overlaid on the middle frame for all the examples. It can be seen from these images that the stationary background is quite complicated, but the motion segmentation algorithm can extract the moving objects reasonably well. Figure 3.6 shows some segmentation results on other moving objects, namely a walking person, a mobile robot, an arm, and a bicyclist. Again, the background is relatively complicated, especially in the case of the bicyclist image. but the motion segmentation algorithm performs very well. Figure 3.5: Results of the motion segmentation algorithm on some moving vehicles. Each row in this figure shows the three frames of an image sequence, followed by the segmentation result. 
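For readers who want to reproduce the behavior illustrated in Figures 3.3 through 3.6, the sketch below condenses the algorithm of Figure 3.4 into a few NumPy/SciPy calls. It is a simplified rendition rather than the original implementation: the thresholds 0.001 and 0.1 are the ones quoted in Figure 3.4 (assuming gray levels scaled to [0, 1]), a binary closing followed by hole filling stands in for the chamfering step, and the structuring-element size is an arbitrary choice.

import numpy as np
from scipy import ndimage

def motion_mask(f1, f2, f3, t1=0.001, t2=0.1, close_radius=15):
    # Approximate motion segmentation mask from a three-frame sequence
    # that contains the entire moving object in each frame.
    blur = lambda f: ndimage.uniform_filter(f.astype(float), size=3)
    f1, f2, f3 = blur(f1), blur(f2), blur(f3)

    # Product of the two frame differences: large only where the middle
    # frame holds a strong edge that has moved between the frames.
    D = np.abs(f1 - f2) * np.abs(f2 - f3)
    Dt = D > t1

    # Keep only connected components containing at least one pixel with
    # a high difference value (removes isolated noise responses).
    labels, n = ndimage.label(Dt)
    if n > 0:
        peaks = ndimage.labeled_comprehension(D, labels, np.arange(1, n + 1),
                                              np.max, float, 0.0)
        Dt = np.isin(labels, np.arange(1, n + 1)[peaks > t2])

    # Group the surviving components into a single blob; closing plus
    # hole filling plays the role of the chamfering step of Figure 3.4.
    selem = np.ones((2 * close_radius + 1, 2 * close_radius + 1), bool)
    mask = ndimage.binary_closing(Dt, structure=selem)
    return ndimage.binary_fill_holes(mask)

The grouping step is the delicate part; in the CPU times reported above it is also the step that dominates the cost of the original implementation.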
Figure 3.6: Results of the motion segmentation algorithm on some indoor and outdoor moving objects. Each row in this figure shows the three frames of an image sequence. followed by the segmentation result. 60 3.2 Color Segmentation The perception of color by humans is a phenomenon that is both physical and psy— chological as it includes the interaction of the light with the surfaces of the objects and the response of the brain to the complex stimulus perceived by the eye. Color information seems to be crucial to the understanding of the outside world for the purpose of detection and discrimination. The detection of an object is easier if the color of the object is known and it is distinct from the background. This is the reason why many sporting equipment manufacturers produce orange golf balls that are easy to see on the green fairway. The importance of color is also found in nature as many animals use color to attract a mate, while others adapt their color to appear invisible in the foliage. Color also helps in distinguishing various objects in the environment. For example, we can easily identify a taxi cab in the busy streets of New York by its dominant yellow color. The use of color in computer image understanding can improve the process of detection and discrimination. As a result, the analysis of color images has received considerable attention for the past 20 years. This section is divided into five parts. The first part presents the color spaces that can be used to represent color for the purpose of image processing. The second part surveys a number of well-known color image segmentation algorithms for the purpose of image understanding. The third part describes the process of color constancy and briefly examines some of the algorithms presented in the literature. The fourth part describes our approach to color segmentation, a fusion of split-and-merge region segmentation and edge detection. Finally, we will present some experimental results. Off: . .‘DPfl Vin . > n _A .F‘Spgm I - I ‘w .. 'HIL :m I, t IQ ‘ Uri {_fl Enter 1, s. ‘ U ',I3 u-Jl'I‘id: fl [“l.’ . ‘J i! 61 > >3 IE M‘ M '2 8 2 ‘5 E m l 400 500 600 700 Wavelength (nm) Figure 3.7: Spectral responses of the human color sensors. 3.2.1 Color spaces The perception of color in human vision is related to the response of the color sensors (cones) at the back of the retina to different wavelengths in the visible electromagnetic spectrum (400 to 700 nanometers). The Young-Hemholtz theory states that normal human vision is trichromatic, which means that the combination of three types of sensors produces the sensation of color. The relative responses of these three sensors as a function of the wavelength A are shown in Figure 3.7. Note that as each sensor responds to a range of wavelengths, they are referred to as the long (L), medium (M), and short (S) wave sensors. The measurement at a point on an object surface for a given sensor depends on the spectral composition of the illumination source f()\), the reflectance function of the surface r()\), and the relative sensitivity of the sensor h(A). R = /f(/\)r(/\)h(/\)d/\ (3.16) The need for representing and displaying colors has yielded a relatively large number of different three-dimensional color spaces which are not necessarily related to human color perception. These systems, described below, are also presented in [6, 125, 36]. 62 The [26' B space The RGB (Red, Green. and Blue) system is based on three colors well-defined by the three monochromatic lights of wavelengths 700 nm (red). 
546.1 nm (green), and 435.8 nm (blue). An intuitive way of visualizing the colors from the RGB space is to view the measurements as Cartesian coordinates. A chromaticity diagram can be obtained by normalizing the three sensor measurements R G B _, =__—._—. :-——‘ b=__—_ .- f R+G+B g R+G+B R+G+B Oh) and plotting the perceived color as a function of any two. The XYZ space In 1931 the C.I.E. (Commission Internationale de I’Eclairage) adopted a standard for color coding known as the XYZ color space. The XYZ coordinates can be obtained from the RGB values using the linear transformation [145]: X 0.618 0.177 0.205 R Y = 0.299 0.587 0.114 G (3.18) Z 0.000 0.056 0.944 B The H S I space The HS] (Hue, Saturation, and Intensity) space depicted in Figure 3.8 is defined as follows. The hue H is explained in every day language by color adjectives such as red, yellow, or blue-green and corresponds to the notion of color. The saturation 5 measures the percentage of white in a color. It describes the difference between red which is saturated and pink which is desaturated. The intensity is equivalent to the gray level in a black and white image. It is the amount of light reflected by a given surface. The H S I coordinates can be obtained from the RGB color components as Figure 3.8: The HS] color space. follows. mG—B) q Immaaa) __R+G+B QR—G—B’ ‘ R+G+B’ _ 3 H = arctan The H51 space is close to the human description of colors. For example, light yellow corresponds to (R = 140.6 2 140. B = 30,11 2 180.5 2 180, I —_— 205) while dark yellow corresponds to (R = 100.6 2 100. B = 5.H = 180.5 = 180,] = 105). Note that all the three coordinates R. G, and B are modified in the RGB space whereas only I is modified in the [[5] space to differentiate between light yellow and dark yellow. Thus, as the adjective “yellow” describes both the colors, the hue value is the same for both the colors. Only the intensity value is changed to account for the two adjectives “light” and “dark". The opponent processes space This space is also close to the human perception of colors, but it uses Cartesian rather than cylindrical coordinates. Hurvich and Jameson [91] argue that this system 64 has neurological support and is in fact the way human beings represent colors. The three coordinates Red-(Ireen (Ii—(i), \" ,.._.. (—.> -<—> - a 5 [x0 Y0 ] 00] r0 20 , (322) Where X0, Y0, and Z0 are the values for the reference white. 3; 1/3 L = 25 (100—) - 16, The L" V'l'l" space The I..-"'V'IJIJ" space was designed to obtain a color solid for which unit shifts in luminance and chrominance are uniformly perceptible. The f")"ll" coordinates can be obtained from the KGB coordinates by: . Y ”3 I = 13W'(u — (to), v- : 1:31;»"(6 — 60), w- : 25 (100?) — 16. (8.23) 0 _ ‘ — I — .—4‘.\—'_—_ >9 — _—6Y__ where no — 0.199, to — 0.308, a — .\'+15Y+32‘ and L — X+15Y+3Z' The [11213 space The [11213 space was defined by Ohta et al. [145] using a set of eight images. At each step of the recursive region splitting algorithm for color segmentation (see section 3.2.2), they applied a I{arhune11-Loe\Je transform to find the best color features for the current segmentation task. The features, obtained from the RGB tristimuli by: Ilzfligfl. RTE. 13:20—34? (3.24) appeared most often in that order. This means that, statistically, this coordinate system is the best color space to account for color differences and suitable for seg- mentation. All these different color spaces have been used for image segmentation. 
The most commonly used spaces are the RGB space because it is conveniently available at the outputs of color cameras and digitizing systems. and the HS] space because of its relationship to the human description of colors. The other color representation SChemes are not used as often. There is no optimal color space and it seems that the Choice of the best color space should be tied to the choice of the image segmentation 6 6 (or other processing) algorithm. 3.2.2 Color Image Segmentation Here, we are concerned with color segmentation which divides the image into regions based on the color attributes of the pixels. Thus. in the following sections, the most commonly used techniques for color image segmentation will be described. Clustering techniques Any combination of the color coordinates defined in Section 3.2.1 can be input to a clustering program as feature values for every pixel in the image. Usually, the set of features that will produce the best segmentation results is determined empirically. Region-based segmentation The split-and-merge algorithm developed by Horowitz and Pavlidis [87] for intensity images was modified by Coutance [36] to perform color segmentation using the original tristimuli values RGB. The decisions for splitting and merging are based on the averages, m, and variances, v, of the pixel values for each of the color components. For a subimage X). containing 72;, pixels, they are defined as: 1 ——Z R(,$y) m6,_g— Z C(a: y), (325) Tl): (1' y) IEXk nk (r.y)€Xk 1 kl$~y16Xk 1 1 2 UR). = 71—- 2 (3(339) mRk) a 00k _ _ 2 (0(1) yl _ 77101:)» (3 26) k (Ivy)exk 71k (1‘ tile)”: 1 ka — _ 2: (8(1), y) mBk)2 where R(.r.y), (_i(.r.y). and [3(.r.y) are the pixel values in the red. green. and blue image, respectively. The “distance" between two subimages X,- and X) is defined as (I(.\','. X1) : [mp' — 771le + [IIIG‘ — mg) + [7723, — mBj . (3.27) During the splitting step. a subimage X). will be divided into four subimages X)“, Ar)”, Ar)”, and _Yk‘ if (1) {123k + 1er + 193,} > s,.. and (3.28) (2) 3746 [1.4151 d(.\',.,,.\',.,) > 3m, where 3,, and sm are predetermined threshold values. Equivalently, during the merging step, four subimages X)“, X)”, Xk3. and X),4 will be merged into a single subimage Xk if (1) {ka + va + lin} S 'SU (329) (2) \7, l,j E [1.4] (“er ,.\’k]) S 8m. Recursive region splitting using histograms The recursive region splitting algorithm was developed by Ohlander [144] specifically for color segmentation. The basic idea of the algorithm (see Figure 3.9) is the follow- ing. The image is divided into connected regions using threshold values determined by finding the best peaks in a set of feature histograms. The subregions are then further segmented if necessary. The main steps of the algorithm are given below: Step 0: A mask corresponding to the entire image is placed at the bottom of a stack. 68 Histograms Region mask Giginnl mmk Original image Picture slack Thresholdmg Figure 3.9: Ohlanderis color segmentation algorithm. Step 1: One mask is taken from the top of the stack. Let 8 denote the region repre- sented by the mask (the region without hatching in Figure 3.9). Step 2: Histograms of color features are computed for the region S. Ohlander used 9 different features: R, G, and B; H, 3, and I; and Y, I, and Q. Step 3: If any histogram shows a dominant peak, a pair of threshold values is de- termined at the positions of the associated valleys. The image of the color feature corresponding to that histogram is thresholded. The region S is thus partitioned. 
Step 4: Connected regions are extracted. For each connected region, a region mask is generated and pushed down on the stack. The process is repeated until the stack becomes empty. This algorithm has some shortcomings. The first problem is related to the non- linearity of the transformations used to compute the hue, H, and the saturation, S. The saturation is unstable if the intensity value is small, and the hue is unstable if the saturation is small. Thus. one has to be very careful in using these color features. 69 The second problem is that since histograms are used to determine threshold values, small homogeneous color regions cannot be detected. Thus textured regions in the image are apt to be divided into a large number of meaningless segments. Ohta cf (11. [1-15] enhanced the algorithm proposed by Ohlander to try to solve these two problems. lising the Karhunen-Loeve transformation. they determined that the best color features were I), 12, and [3 as described in Section 3.2.1. Thus, only these three features were used by the segmentation algorithm. To solve the second problem, they first segmented the image into textured and nontextured regions and applied the Ohlander’s algorithm only to regions without texture. Textured regions were determined by the number of edge points (obtained by the Laplacian operator) in a local neighborhood. 3.2.3 Color constancy We know from our everyday experience that grass is green and tomatoes are red, inde- pendent of the surrounding illumination. That is, whether an object is seen outside in sunlight or inside under tungsten—filament lamp illumination, it will appear the same to us. This phenomena is known as color constancy. The signal reaching the photore- ceptors on the surface of the retina is the product of the spectral distribution of the illuminant and the spectral reflectance of the object. Even though these two quanti— ties are unknown, it seems that the human brain can extract the spectral reflectance of the objects from the incoming signal. A number of psychologists and computer vision researchers have tried to understand human color constancy and develop com- puter vision algorithms in an attempt to extract the spectral reflectance of objects from the intensity signal. This section briefly presents some of these algorithms. The Retinex theory The Retinex (from retina and cortex) theory was developed by Edwin Land and ex— perimented by Land and .\lc('_'ann [119. I20. I21. 122. 131] as a possible interpretation and understanding of some experiments they performed with color patches. The im- portant statements of the Retinex theory are: (i) “The composition of the light from an area in an image does not specify the color of that. area". (ii) “the color depends only on the lightness in each waveband and lightness is independent of flux". and (iii) “the color of a unit area is determined by a trio of numbers each computed on a single waveband to give the relationship for that waveband between the unit and the rest of the unit areas in the scene". This means that the reflectance of a surface is computed using the intensity at that surface and the intensities at all the other surfaces in the scene. The Retinex algorithm developed by Land is the following. The relationship between an area i and an areaj in an image for a waveband A is: RA(I',j) = Z t(log [H1 ). (3.30) A— [k where Ik+l and [1,. are the intensities of neighboring regions on a path from areaj to area i and x if]:r| 2 threshold 0 otherwise. 
According to this formula, if two neighboring regions have similar intensities, they do not contribute to RA(i.j). The descriptor for area i in waveband A is _ N_ H“ T.’ RA('I) : ZJ—1f\r(1]) . (3.32) 71 where N is the number of paths from any area j to area i. Three descriptors corre- sponding to the 3 wavebands are then combined to determine the hue of area i. Algorithms based on the retinex theory The main characteristic of the Retinex algorithm is that it automatically finds the edges in the image. If two neighboring regions belong to the same surface, they have similar intensities. thus the quotient of their intensities will be close to 1 and its logarithm will be close to 0. However. if the two regions belong to two different surfaces and are thus separated by an edge, the absolute value of the logarithm of the quotient of their intensities will be large enough so that a component will be added to RA(i,j). Therefore. all the subsequent algorithms based on the Retinex theory use the output of an edge detector. In a typical image, we expect the incident light to vary smoothly, with no disconti- nuities, while the reflectance should have sharp discontinuities at edges where objects adjoin. The reflectance would then be relatively constant between such edges. The color constancy algorithms based on the Retinex theory are divided into the following steps described in Figure 3.10 for the one-dimensional case. 1. Differentiate (spatially) the image. 5‘” Threshold to eliminate small values (corresponding to the small variations in the illuminant) and keep the large values (corresponding to the large changes in reflectance). 3. Integrate to obtain the reflectance. Horn [85] uses the Laplacian operator for the differentiation step. The use of a hexagonal tesselation of the image makes his algorithm quite unusual. Blake [14] used a gradient operator to find the edges in the image. His algorithm was then modified 72 Original signal ‘I/I\\¥ Differentiation Thresholding ] Integration _T—L—L Figure 3.10: Computation of the reflectances by the algorithms based on the Retinex theory. by Bretstaff and Blake [20] to work on square tesselations (typical CCD images). The Canny edge detector [24] was used to find the edges. Mathematical models Several models have been proposed to decompose reflectance and illuminant into a linear combination of some basis vectors or functions. Different color areas are com- bined into a system of equations (linear or non-linear) and solved for the surface descriptors. Buchsbaum [22] defines the illuminant as a linear combination of three illuminant basis functions and the reflectance as a linear combination of three re- flectance basis functions. The major assumption in his algorithm is that the estimate of the illuminant can be obtained by averaging the radiances over the entire field of view. Maloney and \Vandell [128. 181] combine information from p sensors, m illumination coefficients. and n reflectance coefficients and do not assume them to be equal to 3, but instead determine what these numbers should be. If the ambient light is known, then the system can easily be solved. Otherwise. they prove that the System will have a solution for some values of p. m, and n. Forsyth [66] presented 73 an algorithm which uses the underlying mathematical structure of the problem. The basic principle can be explained in a simple way: If we observe an image containing only green colors, then the illuminant cannot be bright red. 
This yields a rejection of all the illuminants which cannot produce the colors present in the given image. Physical approaches These approaches use the physical model of the image formation process. Healey and Binford [82] and Klinker et al. [111] assume that in specular areas (reflection of the light onto the object), the reflectance of the object is lost and only the illuminant is observed. Their algorithms consist of segmenting the image into highlight areas and matte areas. The specular areas are then used to estimate the illuminant. The illuminant can then be subtracted from the original image to estimate the reflectance. Lee [124] uses the fact that in the CIE chromaticity diagram, if two lights are added in various proportions, then the chromaticities of the mixtures lie on the straight line connecting the two points. In the CIE diagram, the coordinates of the colors from different points of the same surface will fall on the straight line connecting the illu- minant color and the object color. Straight lines obtained from different objects will all intersect at the point corresponding to the illuminant color. Once the illuminant color is determined, the object colors can also be computed. 3.2.4 Proposed method This section describes our approach to color segmentation. Several of the well-known techniques for color segmentation described in Section 3.2.2 were tried on a number of test images, and the best results were produced by the split-and-merge approach. Therefore, it was chosen for the color segmentation algorithm. The reason why the original tristimuli values RGB were used as features is mostly 74 hardware dependent since the digitizer available in our laboratory produces RG8 im- ages. The color images were acquired by recording the scene containing the moving object using a camcorder. The tape was then played on a VCR connected to the digitizing boards to obtain the HOB images. The HS] color system was also tried because it is supposed to describe perceived colors in a more natural way. Unfor- tunately, the signals stored on a videotape and delivered by the \"CR are severely distorted; the saturation and intensity signals are substantially attenuated and the hue signal is very blurry. It can be recalled from the definition of the HS] color system (Section 3.2.1) that the hue is not defined for small saturations (center axis of the cones). Also, in outdoor scenes. very few objects have saturated colors. making it more difficult to evaluate the real hue of each object. These reasons motivated our choice for the use of the RGB color space in the color segmentation algorithm. The color segmentation algorithm is divided into five steps. The first four steps come from the split-and-merge algorithm for color images, and the last step consists of using edge information to further merge uniform color regions and produce a better segmentation. The split-and-merge algorithm The split—and-merge algorithm which we have used was adapted from Coutance’s color split-and-merge technique [36] and comprises of the following four steps as shown in Figure 3.11. l. Initialization: The N x N image (where N = 2k) is divided into a number of z x 2: square subimages. In our case, a 256 x 256 image is divided into 256 16 x 16 subimages. 
In each square subimage Xk, the mean and variance of the intensity values Input image Initialization Splitting Merging Grouping Figure 3 11 The split and—merge algorithm for segmenting a color image for each of the three color components are computed as follows .9 “M“Hl H X 51037). I=R.G,B (3.33) N _II-inl H +z—l Z Z (1(2 Jl—mzf. I= R.G.B (3.34) ':_r j:y where (1: y) are the coordinates of the upper left corner of the square subimage and z is the dimension of the subimage (initially, z - 16) 2 Splitting: A subimage X0 is not “homogeneous if (113, + va + vyk) > t.) (3.35) A nonhomogeneous subimage is then further split into four square subimages X1, X2, X3, and X4 of Size 2/2 x 3/2 This process is repeated until the 1nd1v1dual subimages are homogeneous or have reached a minimum area of 4 pixels item :3 4. 1n) » . IAT‘DLI.“ 76 3. Merging: If four adjacent subimages X1. X2, X3, and X4, each of size 2: x z, with the same parent X0 have not been split in step 2 and they satisfy the homogeneity condition {lmR' — 77112,] + [7710. — mGJ] + [mg — m31|} 2 if UR. + va + Us. > 25 flier; Create cells 161, [92, 1:3, and 1:4, 2),. = zk/2 ends/11$ endfor /* Merging */ f_og all quadruples (191, kg, k3, k4) neighbors of same size i—fflmR. _ mRJI 'l' lmG.‘ - mGJI + lmB.‘ _' mBJl} < 25 V iaj E [1741 £12.11 Merge cells 101, kg, k3, and 1.2,, 2), = zk, x 2 endfor /* Grouping */ fig k = 1,. . . ,C' i_f§l I such that {ImRk — 313,] + lmgk — MQI + |m3k —MB,]} <16 then Merge cell It with region R) endfor Produce R = {R1,R2, . . . ,Rm} Canny edge detection on I /* Fusion of regions and edge information */ fig; i = 1,. . . ,m f_or_j = 1,... ,m Compute ng, Eij, F3], endfor F? tj’ and F?)- fg i = , . . . , m Rank the regions in increasing order of size {R(1), Rm, . . .} Find the candidates R;c to be merged with Rm such that: R;c is a neighbor of Rm, 72(1) < 72),, FE”), < 0.4 Choose the candidate R,- such that: Pflurinj : minkfrfnkrim} —and Filijrilu < 4-5 Merge RH) and R,- Delete Rm Update B(1)j, E(1)J‘, [‘(ll)j" F31”, and Pig”)- endfor end Output: Homogeneous regions R = {R1, R2, . . . , Rn} Figure 3.13: Color segmentation algorithm. 83 Figure 3.14: Results of the color segmentation algorithm: the first column shows the input images and the second column shows the segmented images. 84 Figure 3.15: Results of the color segmentation algorithm (cont’d): the first column shows the input images and the second column shows the segmented images. 3.3 fl. fr vli «CF . _.. all; 1106 Y“, "O ilUt I: {jf ‘f‘L . . h .l—I .4 .41» .. .fim - il 2A 85 3.3 Summary In this chapter, we have described two image segmentation algorithms for motion segmentation and color segmentation. The motion segmentation algorithm does not make any assumption about the shape of the moving object and the number of stationary objects in the background. As long as the moving object is entirely contained in the three frames, the displace- ment of the object between two frames can be arbitrary. This algorithm has the advantage over other motion segmentation algorithm in that it is very simple. It does not involve the computation of the optical flow in the image sequence. It does not require a frequent update of a background image. It only needs three frames containing the moving object in an image sequence. We have seen that the motion segmentation results are very reasonable given the simplicity of the algorithm. The color segmentation algorithm is based on a split-and-merge color segmentation algorithm. 
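As a compact recap of the split-and-merge decisions used by this algorithm, the sketch below isolates the two tests that drive them: the channel-variance homogeneity test of Eq. (3.35) and the city-block distance between mean colors of Eq. (3.27), combined in the merge condition of Eq. (3.29). The threshold value 25 mirrors the one appearing in Figure 3.13; the array layout and helper names are assumptions made for illustration.

import numpy as np

def channel_stats(pixels):
    # pixels: any array whose last axis holds (R, G, B) values.
    flat = pixels.reshape(-1, 3)
    return flat.mean(axis=0), flat.var(axis=0)   # (m_R, m_G, m_B), (v_R, v_G, v_B)

def is_homogeneous(pixels, t_v=25.0):
    # Eq. (3.35): homogeneous when the summed channel variances stay below t_v.
    means, variances = channel_stats(pixels)
    return variances.sum() <= t_v

def color_distance(m_i, m_j):
    # Eq. (3.27): city-block distance between two mean colors.
    return float(np.abs(np.asarray(m_i) - np.asarray(m_j)).sum())

def can_merge(quadrants, t_v=25.0, t_m=25.0):
    # Eq. (3.29): four sibling subimages are merged back when their union
    # is homogeneous and every pair of mean colors is within t_m.
    union = np.concatenate([q.reshape(-1, 3) for q in quadrants])
    means = [channel_stats(q)[0] for q in quadrants]
    close = all(color_distance(means[i], means[j]) <= t_m
                for i in range(4) for j in range(i + 1, 4))
    return is_homogeneous(union, t_v) and close

The grouping step of Figure 3.13 reuses the same distance with the tighter threshold of 16 to attach leftover cells to neighboring regions.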
It also integrates edge information to produce more accurate contours around the homogeneous regions in the image. Again, we have seen that the results are very satisfactory. Chapter 4 Model-Based Image Segmentation The two algorithms presented in Chapter 3 neither require nor incorporate any spe— cific knowledge about the scene and are meant to solve generic image segmentation problems. The two segmentation algorithms presented in this chapter utilize knowl- edge about the scene and the objects expected to be present in the scene. In the first algorithm, we integrate motion and color information using the notion of active contour models which constrain the object boundaries to be continuous and smooth. In the second algorithm. we integrate motion and edge information using deformable template models which describe the geometric properties of the objects expected to be present in the scene. Both these model-based segmentation algorithms are designed to integrate multiple cues. Integration techniques combine information provided by different cues to produce a more accurate representation of the desired quantity. A number of computer vision tasks have benefited from integration techniques. Some of these tasks along with the integration strategies proposed in the literature are presented here. Integration techniques have been useful for registering multiple images of the same scene observed at different times and with slightly different view points by the same sensor. This is typically the problem of registering Landsat satellite images to see the evolution of water areas. agricultural fields. or urban areas. Ranade and Rosenfeld [157] proposed a relaxation technique to match entities from multiple images. Ton and Jain [179] used this technique to register Landsat images using the centroids of homogeneous image areas as control points and the size of the areas as features associated with the control points. In a different context, Chen and Medioni [30] proposed an iterative method for registering multiple range images in order to build a complete 3-dimensional model of an object. Sensor fusion is another area of computer vision where integration techniques have been employed. A typical example of sensor fusion is the integration of range and intensity images for object modeling or object recognition. Limb boundaries and crease edges are more accurately detected using intensity images, but range data provide better shape information especially if the image is textured. Lee [123] fused intensity and range images to automatically build line drawings of polyhedral objects. Chou and Brown [32] proposed a method to fuse intensity and range images using Markov random fields. Matthies and Elfes [130] and Moravec [137] proposed a grid- based representation to fuse the data provided by an array of ultrasonic sensors on a mobile robot to produce a single map of the environment. Moravec uses certainty grids whereas Matthies and Elfes use occupancy grids which account for sensor uncertainty using a Gaussian probability model. In this thesis, we have integrated multiple cues or multiple image processing tech- niques for a better segmentation of the observed scene. Integration of region seg- mentation and edge detection has been reported in the literature. Pavlidis and Liow [147] and Milgram [134] used it for the segmentation of intensity images, Nadabar [100] segmented range images, and Farrokhnia and Jain [59] used it for textured im- ages. 
In these situations, the integration techniques have been shown to produce 88 better segmentation results from two or more imperfect segmentations. Gamble et al. [71] integrated stereo. motion. color. and intensity edges using Markov random fields. Cochran and .\ledioni [34] fused a. disparity map obtained from stereo images with intensity edges to obtain surface descriptions. Chu and Aggarwal [33] proposed an algorithm using the maximum likelihood estimator and an iterative procedure to satisfy contour curvature and compactness constraints to integrate multiple region segmentations and edge maps, independent of the image sources. Pankanti et al. [146] integrated stereo. region based segmentation. perceptual grouping, shape from shading, and line labeling modules to produce an accurate depth map of the observed scene. Etoh and Shirai [56] have described a technique for finding moving areas in the image by clustering in the 8-dimensional space composed of the three color compo— nents (R, G, B), the pixel position (I. y), and the spatio—temporal intensity gradients (al/axflI/ayfiI/Bt), thus fusing motion, color, and edge information. This chapter is divided into two sections, each describing a specific model-based segmentation algorithm. The first algorithm combines color and motion information through an active contour paradigm to extract an accurate contour of the moving ob- ject in the image. The second segmentation algorithm uses the theory of deformable templates to segment the vehicle of interest (the vehicle in the lane closest to the camera) from the stationary background and the other moving vehicles. This algo— rithm integrates a motion cue and edge information. The database of images used in this chapter is shown in Appendix A along with a description of the image acquisition process. .l 89 4.1 Segmentation of the Moving Object The motion segmentation and color segmentation algorithms described in Chapter 3 are general-purpose segi’nentation algorithms which do not assume any specific knowl- edge about the objects present in the scene. However. we are concerned with the spe- cific problem of obtaining a good segn'ientation of the moving vehicles. Unfortunately, both of these segmentation algorithms have some limitations. Motion segmentation provides us with a mask which is generally larger in size than the true size of the moving object. Color segmentation produces a number of closed “uniform” regions, but the object of interest (vehicle) itself gets split into several regions. Both these algorithms do not take into account the fact that the boundary pixels on the object constitute a closed geometric contour with strong continuity and smoothness prop- erties. These constraints should be incorporated in a top-down mechanism to guide the boundary detection or segmentation process. This section describes an algorithm which is able to segment the moving object from the stationary background and produce a more accurate contour of the moving object. Two paradigms were used in designing this algorithm. 0 The results of motion and color segmentation are integrated to overcome the limitations of the individual algorithms. 0 To ensure smoothness of the contour of the detected object we have used an active contours (or snakes) paradigm. Active contours are pushed or pulled by constraint forces towards image features. These constraining forces are used to define an energy function which is to be minimized. 
We will first describe the theory of active contours before presenting the segmentation algorithm based on active contours.

4.1.1 Active contours

Active contours were proposed by Kass et al. [103] as a means of fitting contours to image data given user-specified constraints. Kass et al. define active contours or snakes as "an energy-minimizing spline guided by external constraint forces and influenced by image forces that pull it towards image features such as lines and edges". The energy of a contour u(s) in a parametric representation is defined as:

E_total = \int { E_int(u(s)) + E_image(u(s)) + E_con(u(s)) } ds    (4.1)

Let us define each of the terms in the expression of the energy function. The internal energy E_int is defined as a function of the first- and second-order derivatives of the contour and forces the contour to be smooth:

E_int(u(s)) = alpha(s) |u_s(s)|^2 + beta(s) |u_ss(s)|^2    (4.2)

The image energy E_image pushes the contour towards image features. It is a combination of three different image forces:

E_image(u(s)) = w_line E_line(u(s)) + w_edge E_edge(u(s)) + w_term E_term(u(s))    (4.3)

The term E_line = I(x, y) attracts the contour to high or low intensity values in the image depending on the sign of w_line, as shown in Figure 4.1. The term E_edge = -|\nabla I(x, y)|^2 attracts the contour to image points with high gradient values, as shown in Figure 4.2. The term E_term attracts the contour to line terminations, as shown in Figure 4.3, where the snake is pushed towards the subjective contour in the image. The term E_con is the energy of a spring connecting a point P_1 on the contour and a point P_2 in the image plane. It is defined as:

E_con(u(s)) = -k (P_1 - P_2)^2.    (4.4)

Figure 4.1: Effect of the line image force E_line on a snake (image, initial active contour, final active contour).
Figure 4.2: Effect of the edge image force E_edge on a snake (image, initial active contour, final active contour).
Figure 4.3: Effect of the termination image force E_term on a snake (image, initial active contour, final active contour).

The goal of the snake algorithm is to minimize the total energy function. When this energy is minimum, the snake is smooth and is "locked" to the image features. Kass et al. proposed a variational approach for minimizing E_total (Eq. (4.1)) and solving for u(s). This requires that the contour and the associated energy function be represented in discrete form. The discrete formulation of the internal energy function is obtained by approximating the derivatives by finite differences and using the vector notation u_i = (x_i, y_i) = (x(ih), y(ih)). Then,

E_int(i) = alpha_i ||u_i - u_{i-1}||^2 + beta_i ||u_{i-1} - 2u_i + u_{i+1}||^2.    (4.5)

Let E_ext = E_image + E_con be the external energy function and let f_x(i) = dE_ext/dx_i and f_y(i) = dE_ext/dy_i, where the derivatives are approximated by finite differences if they cannot be computed analytically. The Euler equations for minimizing the total energy are:

alpha_i (u_i - u_{i-1}) - alpha_{i+1} (u_{i+1} - u_i) + beta_{i-1} (u_{i-2} - 2u_{i-1} + u_i)
  - 2 beta_i (u_{i-1} - 2u_i + u_{i+1}) + beta_{i+1} (u_i - 2u_{i+1} + u_{i+2}) + (f_x(i), f_y(i)) = 0.    (4.6)

These equations can be written in matrix form as

A x + f_x(x, y) = 0
A y + f_y(x, y) = 0,    (4.7)

where A is a pentadiagonal banded matrix. They can be solved using an iterative Euler method as follows:

x_t = (A + gamma I)^{-1} ( gamma x_{t-1} - f_x(x_{t-1}, y_{t-1}) )
y_t = (A + gamma I)^{-1} ( gamma y_{t-1} - f_y(x_{t-1}, y_{t-1}) ),    (4.8)

where gamma is the Euler step at each iteration.
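To make the variational solution concrete, the following minimal sketch performs one semi-implicit update in the style of Eq. (4.8), assuming a closed contour, constant alpha and beta, and an externally supplied energy image; the function and parameter names are illustrative and not taken from the thesis code.

import numpy as np

def snake_step(x, y, ext_energy, alpha=0.1, beta=0.1, gamma=1.0):
    """One semi-implicit Euler update of a closed snake (Eq. (4.8) style).

    x, y       : 1D arrays of contour coordinates (closed contour assumed)
    ext_energy : 2D array E_ext(x, y); its gradient supplies f_x and f_y
    """
    n = len(x)
    # Pentadiagonal (circulant, since the contour is closed) matrix A encoding
    # the internal-energy terms alpha*|u_s|^2 + beta*|u_ss|^2.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 2 * alpha + 6 * beta
        A[i, (i - 1) % n] = A[i, (i + 1) % n] = -alpha - 4 * beta
        A[i, (i - 2) % n] = A[i, (i + 2) % n] = beta
    inv = np.linalg.inv(A + gamma * np.eye(n))

    # External forces: finite-difference gradient of E_ext sampled at the
    # rounded contour positions.
    fy_img, fx_img = np.gradient(ext_energy)          # axis 0 ~ y, axis 1 ~ x
    xi = np.clip(np.round(x).astype(int), 0, ext_energy.shape[1] - 1)
    yi = np.clip(np.round(y).astype(int), 0, ext_energy.shape[0] - 1)
    fx, fy = fx_img[yi, xi], fy_img[yi, xi]

    x_new = inv @ (gamma * x - fx)
    y_new = inv @ (gamma * y - fy)
    return x_new, y_new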
Amini et al. [3] proposed a dynamic programming approach for the minimization of the total energy function. This approach does not require the computation of high-order derivatives and it takes advantage of the discretization of the image. The total energy can be expressed as a sum of subenergies as follows:

E_total(u_1, u_2, ..., u_n) = E_1(u_1, u_2, u_3) + E_2(u_2, u_3, u_4) + ... + E_{n-2}(u_{n-2}, u_{n-1}, u_n).    (4.9)

The dynamic programming solution involves generating a sequence of functions of two variables {S_i(u_{i+1}, u_i)}, i = 1, ..., n-1, where S_i is obtained as follows:

S_i(u_{i+1}, u_i) = min_{u_{i-1}} { S_{i-1}(u_i, u_{i-1}) + alpha ||u_i - u_{i-1}||^2 + beta ||u_{i+1} - 2u_i + u_{i-1}||^2 + E_ext(u_i) }.    (4.10)

At each stage, u_{i-1} can move to m possible positions (usually its 8 neighbors in the image grid) and S_i(u_{i+1}, u_i) is obtained by finding the position of u_{i-1} that minimizes it. Note that S_n is the total energy. The minimization sequence is iterated until E_total remains constant.

The external energy does not have to be limited to the forces mentioned by Kass et al. For example, Etoh et al. [57] defined two energies, E_region and E_edge, to be used for contour extraction by region clustering. The region force guides the contour to a true boundary, and the edge force results in local adaptations.

4.1.2 Proposed method

The goal is to fuse the set of regions R = {R_1, R_2, ..., R_n} produced by the color segmentation algorithm with the motion segmentation mask M_s to obtain a more accurate segmentation of the moving object M_o. Each region R_i is examined to decide whether it belongs to the moving object. If a large percentage of a region is inside the mask M_s, then it should be kept as part of the moving object. If it is mostly outside, then it should be discarded. The outcome of the regions that are partly inside M_s and partly outside is decided using the smoothness constraint enforced by the active contour. We have used only the internal energy, to ensure smoothness of the contour, and the edge image force, to push the contour towards the edges in the image. The discrete formulation of our energy function for a contour containing N points {(x_i, y_i)}, i = 1, ..., N, is given by

E_total = \sum_{i=1}^{N} { alpha [ (x_i - x_{i-1})^2 + (y_i - y_{i-1})^2 ]
        + beta [ (x_{i+1} - 2x_i + x_{i-1})^2 + (y_{i+1} - 2y_i + y_{i-1})^2 ]
        - w_edge |\nabla I(x_i, y_i)|^2 }.    (4.11)

The dynamic programming approach [3] is more appropriate for this minimization because it does not require the computation of high-order derivatives. This method requires an input image and an initial estimate of the contour. The initial estimate of the contour required by this algorithm is specified by the contour of the motion mask M_s. This contour, as seen in Figure 3.3(f), is a good initial approximation of the desired contour of the moving object. The input image is computed using the confidence measure n_i^s, which counts the number of pixels of region R_i that are inside the motion mask M_s. Each pixel in the image is assigned a gray level value as follows:

I(x, y) = n_i^s / n_i   if (x, y) is in R_i,    (4.12)

where n_i is the total number of pixels in region R_i. The parameter values used in the algorithm are alpha = 100, beta = 100, and w_edge = 1. We are now able to compute another confidence measure for each region, this time using the new active contour instead of the motion mask contour. Let n_i^a denote the number of pixels belonging to region R_i that are inside the mask M_a defined by the smooth active contour. Each region R_i is examined and the confidence measures n_i^s and n_i^a are used to decide whether the region R_i should be kept as part of the moving object mask M_o. The fusion algorithm can be stated as follows.
1. If n_i^s / n_i >= 0.9, then R_i is in M_o (if at least 90% of the region is inside the motion mask, then it is kept as part of the moving object);

2. If n_i^s / n_i <= 0.2, then R_i is not in M_o (if less than 20% of the region is inside the motion mask, then it is discarded);

3. If 0.2 < n_i^s / n_i < 0.9 (undecided regions):

3.1. If n_i^a / n_i > n_i^s / n_i, then R_i is in M_o (if the confidence measure of the region has increased after applying the active contour algorithm, then it is kept as part of the moving object);

3.2. If n_i^a / n_i < n_i^s / n_i, then R_i is not in M_o (if the confidence measure of the region has decreased after applying the active contour algorithm, then it is discarded).

All the regions and pixels that have been kept are grouped together to form a mask. This mask is then smoothed using a 7 x 7 averaging filter to obtain the mask M_o corresponding to the moving object.

This fusion algorithm was applied to the 3-frame sequence shown in Figures 3.3(a)-(c), and the intermediate outputs of the algorithm are shown in Figure 4.4. Figure 4.4(a) shows the outline of the motion segmentation mask superimposed on the middle frame of the 3-frame sequence. Figure 4.4(b) shows the results of the color segmentation. Figures 4.4(c)-(d) show the application of the active contour algorithm. Note that the highest gray values are assigned to the regions belonging to the moving object, but some background pixels also receive a high gray value. The initial contour is superimposed on the original image in Figure 4.4(c). The final contour obtained after a number of iterations (usually around 25) of the active contour algorithm is shown in Figure 4.4(d). Note that the new contour includes fewer background pixels than the motion mask M_s, and the object contour is also smoother now. The final moving object mask M_o is shown in Figure 4.4(e) and its outline is superimposed on the middle frame in Figure 4.4(f). This contour is more accurate than the contour of the motion segmentation mask. The use of the active contour module is important for achieving a good moving object segmentation since it helps in deciding the assignment of about 5% of the regions in R = {R_1, R_2, ..., R_n}.

Figure 4.4: Steps in the moving object segmentation algorithm: (a) motion segmentation mask M_s; (b) color segmentation; (c) initial active contour; (d) final active contour; (e) fused mask M_o; (f) contour of mask M_o overlaid on the original image.

Figure 4.5 shows a high-level description of the algorithm. The CPU time requirements of this algorithm on a Sparcstation 20 are the following. The energy minimization step takes 15 s, the decision for each region takes 0.4 s, and computing the final moving object mask takes 10 s. Thus, the total computation time is about 25 s.

Input: Motion segmentation mask M_s and homogeneous regions R = {R_1, R_2, ..., R_n}
begin
  for i = 1, ..., n
    n_i^s = |R_i intersect M_s|
  endfor
  Determine the intensity image for the active contour algorithm:
    I(x, y) = n_i^s / n_i  if (x, y) is in R_i
  The initial contour is the boundary of M_s, containing N points v_i
  /* Run the active contour algorithm using */
  /* the dynamic programming approach */
  while k < 100 and |E_diff| > 0
    for i = 1, ..., N
      v_{i-1} is allowed to move to any of its eight neighbors
      S_i(v_{i+1}, v_i) = min_{v_{i-1}} { S_{i-1}(v_i, v_{i-1}) + 100 ||v_i - v_{i-1}||^2
                          + 100 ||v_{i+1} - 2v_i + v_{i-1}||^2 - ||\nabla I(x_i, y_i)||^2 }
    endfor
    E_k = S_N(v_{N+1}, v_N)
    E_diff = E_k - E_{k-1}
    k = k + 1
  endwhile
  The final active contour produces a mask M_a
  /* Decision for each region */
  for i = 1, ..., n
    n_i^a = |R_i intersect M_a|
    if n_i^s / n_i >= 0.9 then R_i is in M_o
    if n_i^s / n_i <= 0.2 then R_i is not in M_o
    if 0.2 < n_i^s / n_i < 0.9 then
      if n_i^a >= n_i^s then R_i is in M_o
      else R_i is not in M_o
  endfor
  Smooth M_o using a 7 x 7 averaging mask
end
Output: Moving object segmentation mask M_o

Figure 4.5: Moving object segmentation algorithm.
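As a concrete illustration of the decision step, the following small Python sketch applies the three rules above to a label image of color regions; the function name, the use of SciPy's uniform_filter, and the 0.5 threshold after smoothing are my own choices and are not taken from the thesis.

import numpy as np
from scipy.ndimage import uniform_filter

def fuse_regions(labels, motion_mask, active_mask):
    """Sketch of the region-decision rules above (illustrative only).

    labels      : 2D int array, pixel -> color-segmentation region id
    motion_mask : 2D bool array, M_s from motion segmentation
    active_mask : 2D bool array, M_a from the smooth active contour
    Returns the fused moving-object mask M_o.
    """
    fused = np.zeros(labels.shape, dtype=float)
    for r in np.unique(labels):
        region = labels == r
        n_i = region.sum()                      # |R_i|
        n_s = (region & motion_mask).sum()      # pixels of R_i inside M_s
        n_a = (region & active_mask).sum()      # pixels of R_i inside M_a
        ratio_s = n_s / n_i
        if ratio_s >= 0.9:                      # rule 1: keep
            keep = True
        elif ratio_s <= 0.2:                    # rule 2: discard
            keep = False
        else:                                   # rule 3: let the snake decide
            keep = n_a >= n_s
        if keep:
            fused[region] = 1.0
    # 7x7 averaging filter, then a threshold, to obtain a smooth mask M_o
    return uniform_filter(fused, size=7) > 0.5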
4.1.3 Experimental results

This algorithm was tested on 136 real image sequences of moving objects, both indoors and outdoors. Figures 4.6 and 4.7 show the contour of the moving object mask M_o overlaid on the middle frame of the three-frame sequence. As can be seen from Figures 4.6 and 4.7, these contours are reasonably accurate. Small details such as the spare tire on the pickup truck in Figure 4.6(a) and the emergency light on the top of the vehicle in Figure 4.7(d) were also extracted. The contour is correctly extracted even if the object is made of different colors, such as the custom van in Figure 4.7(f) or the jeep in Figure 4.6(g).

Figures 4.6 and 4.7 also show images where the segmentation results are not perfect. For example, the roof of the white vehicle shown in Figure 4.6(c) was not included in the moving object mask because it was merged with the snow in the center area of the road during the color segmentation process; the two regions (the roof of the white vehicle and the background snow) have the same color, the Canny edge detector did not find any significant edges between them, and the fusion module could not correct this mistake. For the same reason, the bumper of a vehicle was merged with the jacket of the walking subject in Figure 4.6(b). The contour of the moving arm in Figure 4.7(g) is not very accurate because the edges are poorly defined due to motion blur. This blur can be removed if the high-speed shutter option of the camcorder is used. However, if the high-speed shutter is used inside a building, the resulting image might be underexposed.

Figure 4.6: Results of the segmentation of the moving object. The contour of the moving object mask M_o is superimposed on the middle frame of the 3-frame input sequence.
Figure 4.7: Results of the segmentation of the moving object (cont'd). The contour of the moving object mask M_o is superimposed on the middle frame of the 3-frame input sequence.

4.2 Segmentation using Deformable Templates

The active contour algorithm described in Section 4.1.2 correctly segments all the moving objects from the stationary background. Figure 4.8 shows an example of a three-frame sequence containing multiple moving vehicles. It can be seen in Figure 4.8(d) that the fusion algorithm based on active contours groups all the moving vehicles into a single moving object. In this case, a classical region-based or edge-based segmentation approach would not be able to identify the front vehicle. Motion-based segmentation would not be able to separate the front vehicle from the other moving vehicles in the image. Since our domain is restricted to vehicles and vehicles have similar shapes, we have chosen to solve this segmentation problem using a deformable template approach which utilizes specific knowledge about the geometrical properties of the objects in the scene. Figure 4.8(e) shows that the proposed deformable template approach can be used to successfully segment the front vehicle from the background and the other moving vehicles.

The theory of deformable templates was first developed by Grenander and his colleagues [4, 77].
Specific knowledge about the scene is incorporated through de- formable template models which describe the geometric properties of the objects present in the image. The image segmentation problem is posed as an equivalent (Bayesian) energy minimization problem. A template 7(8) is defined by a set of parameters 9 = (01, . . . ,ON). The prior probability density function p(('9) incorpo— rates specific knowledge about the objects of interest in the input image and specifies how the template is allowed to be deformed. The likelihood function p(XI9) (or data fidelity term) for an image X discriminates between different hypothesized seg- mentations in terms of how “close” they are to the observed image. The deformable template problem consists of finding the template 7'((:3) defined by the parameters (:9 which best characterizes the objects in the image. More specifically, we are seeking the template parameters (:3 which maximize the a posteriori probability p((:)|X). If the a posteriori probability is expressed as a Gibbs distribution with energy function E(€-)), then maximizing p(G|X) is equivalent to minimizing E (G) 103 (d) (6) Figure 4.8: Segmentation of the moving objects vs. segmentation of the moving vehicle of interest: (a) frame 1; (b) frame 2; (c) frame 3; (d) segmentation of the moving objects; (e) segmentation of the moving vehicle of interest. 104 4.2.1 Deformable templates Before describing our vehicle segmentation algorithm to extract. the vehicle of interest (front vehicle) from the stationary background and the other moving vehicles, we will present the general theory of deformable templates and major approaches described in the literature. We will describe the three main components of a deformable template algorithm, namely prior probability, likelihood function, and minimization technique. Prior probability The prior probability of the template is constructed in such a way that all the tem- plates which fall in the set S of allowed shapes are assigned a high probability whereas the templates which fall outside 5 are given a low probability. It is common to de- fine a continuous prior probability density function by lowering the probability of a template as it moves farther away from S. The types of prior probability density functions published in the literature can be classified into four classes: (i) a continu- ity measure through a Markov random field [103, 170]; (ii) a parameterized template and probability density functions on the parameters of the template [117, 151, 190]; (iii) a template governed by a set of stochastic differential equations [77], and (iv) an ideal template that can be deformed through a set of allowable transformations (deformations) [4]. The goal of the prior probability density function in all these models is to constrain the shape of the deformed template. Likelihood The purpose of the likelihood function is to evaluate the probability of an observed image, given a set of template parameters. Although the definition of the likelihood function is application dependent, three models have been commonly used in the literature. Yuille et al. [190] and Kass et al. [103] define likelihood functions based 105 on edge or region potentials. Another alternative is to base the likelihood function on a random field description of the image as described in Geman et al. [73], Mardia (gt al. [129]. and Lakshmanan and Grimmer [117]. The inside and the outside of the template are modeled by two independent processes. 
Storvik [170] defines an energy function that comprises of both of these models. Finally. Grenander and his colleagues [4, 77] assume that the true image is defined by the prior distribution on the deformable gray level template and the observed image is generated by adding an independent noise process to the true image. It, is important to note that if it is difficult to directly specify the likelihood, then it is better to define an energy function and model the likelihood as a Gibbs distribution. Minimization algorithm The final issue to be considered when solving an image segmentation problem using deformable templates is the choice of the minimization algorithm. Gradient descent is the most common optimization technique and was used by Kass et al. [103] and Besag [9]. This method has the disadvantage of requiring very good initial parameter estimates, otherwise it will converge to a local minimum. Stochastic methods based on simulated annealing have the capability of avoiding local minima. Geman and Geman [72] have used the Gibbs sampler and Lakshmanan and Grimmer [117] have used the Metropolis algorithm to solve the segmentation problem. Another class of optimization algorithms involves diffusion processes based on the Langevin equation [74, 75]. Grenander and Miller [78] and Miller et al. [136] have also introduced the concept of jump diffusion processes to explore connected and non-connected areas of the hypothesis space. The a priori knowledge about the images and objects of interest in our application domain consists of the following information: (i) the object of interest is a vehicle, 97: new ‘97 91=(XI‘Y1)' fies: (X8.Y8) Figure 4.9: Deformable template defining the generic model of a vehicle. (ii) it is located in the lane closest to the camera, (iii) it is moving, and (iv) its edges in the image are well defined. Items (i) and (ii) are directly related to the definition of the deformable template (prior) whereas items (iii) and (iv) characterize the image (likelihood). The following sections describe the three main components of our segmentation algorithm based on deformable templates 4.2.2 A Deformable Vehicle Template The first step in designing a segmentation algorithm using deformable templates consists of defining the model and prior probability density function for the template. Our definition of the deformable template of a vehicle is similar to the approach adopted by Phillips and Smith [151]. We have defined the template of a generic vehicle as a polygon characterized by N vertices (P1, P2, . . . , PN) in the 2D plane as shown in Figure 4.9. Let (9 = (01, 62, . . . . 0N), where 0, = (X,, Y,), define the parameters of the deformable template. Each pair of successive points (Pi, Pi+1),i = 1,... , N defines a line feature 59,- on the boundary of the template. We assume that the boundary is closed, so for notation purposes, PN+1 : P1. The shape of the deformable vehicle template was established by studying a database of images containing typical vehicles which can be considered as our train- ing set. Since there are large variations in vehicle shapes, it is unlikely that a single 107 VEV‘H Sedan Pickup Truck KN m (”a Hatchback Station wagon and 4x4 Van Figure 4.10: Five classes of deformable vehicle templates. template can capture all the shapes. So, we have categorized all the vehicles into the following five classes: sedans. pickup trucks. hatchbacks. station wagons and 4x45, and vans. Each vehicle class is defined by its own deformable template. 
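The template representation just described (a closed polygon whose parameters Theta stack the vertex coordinates) can be expressed in a few lines of Python; the class name, the example coordinates, and the "sedan" label below are purely illustrative.

import numpy as np

class VehicleTemplate:
    """Illustrative deformable vehicle template: a closed polygon with N vertices."""
    def __init__(self, vertices, vehicle_class):
        self.vertices = np.asarray(vertices, dtype=float)   # shape (N, 2)
        self.vehicle_class = vehicle_class                   # e.g. "sedan"

    @property
    def theta(self):
        # Theta = (X_1, Y_1, ..., X_N, Y_N)
        return self.vertices.ravel()

    def line_features(self):
        # phi_k joins P_k to P_{k+1}; the boundary is closed (P_{N+1} = P_1)
        n = len(self.vertices)
        return [(self.vertices[k], self.vertices[(k + 1) % n]) for k in range(n)]

# Example: an 8-vertex sedan-like polygon (the coordinates are made up).
sedan = VehicleTemplate([(0, 0), (0, 30), (12, 38), (30, 40),
                         (52, 38), (60, 30), (80, 28), (80, 0)], "sedan")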
Typical templates of these five classes are shown in Figure 4.10. Note that the sedans and pickup trucks are modeled by a polygon with 8 vertices, whereas the polygonal models of the hatchbacks, station wagons and 4x4s, and vans have only 7 vertices. Other vehicle categories, such as large trucks or motorcycles, are not of interest in this thesis.

In order for a polygon to resemble a vehicle, some constraints on the relationships between the different vertices must be satisfied. For example, phi_2 should always be above phi_3, phi_1 should always be located to the left of phi_7, etc. A set of rules on the template parameters constrains the shape of the vehicle template. These rules were empirically determined by studying the intraclass variability of the vehicles in our training set of 80 vehicles of different categories. A generic rule, say r_k, specifies the range of allowable values that a parameter (the X or Y coordinate of one of the vertices) is allowed to take. Without loss of generality, we denote this parameter as X_k. The rule r_k states: X_k^min <= X_k <= X_k^max. The thresholds X_k^min and X_k^max are functions of the other parameters (X_1, Y_1, ..., X_N, Y_N) of the template. Within the interval [X_k^min, X_k^max], the prior probability for the parameter X_k is maximum. As X_k deviates from the bounds of the interval, its prior probability decreases. More specifically, we define a function g(r_k) for rule r_k as follows:

g(r_k) = -X_k + X_k^min   if X_k < X_k^min
       = 0                if X_k^min <= X_k <= X_k^max    (4.13)
       = X_k - X_k^max    if X_k > X_k^max

The role of g(.) is to penalize a template whose parameters fall outside their allowed range. The constraint function (prior) C(Theta) for the template T(Theta) is then defined as:

C(Theta) = \sum_{k=1}^{R} g(r_k),    (4.14)

where R is the total number of rules. The larger C(Theta), the more the template T(Theta) deviates from the set of allowed shapes.

The choice of the function g(.) is very important, since its shape greatly influences the shape of the energy function and the convergence rate of the algorithm. For the choice of g(.), we experimented with the following binary function:

g'(r_k) = 0   if X_k^min <= X_k <= X_k^max    (4.15)
        = 1   elsewhere

This type of function, however, is not suitable for the following simple reason. If the value of a parameter is outside its allowed range, then neighboring values of that parameter are also outside the allowed range, so all such templates have the same constraint function value. On the other hand, a linearly decreasing slope towards the allowed range will push the template to reach states where all the parameters satisfy the rules. This is the reason why we have chosen g(.) as defined in Eq. (4.13).

We have defined a set of 17 rules for the deformable template, as shown in Figure 4.11. There is one rule for each of the 16 vertex coordinates and an additional rule to constrain the vehicle template to be a closed polygon. To simplify the notation, we denote by L the length of the template (the extent of its baseline). It can be seen that most of the rules involve the parameter L, allowing the prior probability of the template to be independent of scale. The parameters B and B1 represent the row position and starting column position of the initial baseline, and L_0 is the initial estimate of the length of the vehicle. These values are estimated using motion segmentation, as described in Section 4.2.3. The role of these three parameters is to constrain the position and scale of the template.
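Before turning to the class-specific rules, here is a minimal Python sketch of the penalty of Eq. (4.13) and the constraint energy of Eq. (4.14); the encoding of a rule as a callable returning (value, lower bound, upper bound), and the example rule at the end, are hypothetical conveniences rather than the thesis's own representation.

def rule_penalty(x, x_min, x_max):
    """g(r_k) from Eq. (4.13): zero inside [x_min, x_max], growing linearly
    with the distance to the violated bound outside it."""
    if x < x_min:
        return x_min - x
    if x > x_max:
        return x - x_max
    return 0.0

def constraint_energy(theta, rules):
    """C(Theta) from Eq. (4.14): sum of the rule penalties.

    `rules` is assumed to be a list of callables, each mapping the template
    parameters to a (value, lower bound, upper bound) triple."""
    return sum(rule_penalty(*rule(theta)) for rule in rules)

# Example rule (hypothetical): X_1 must stay within 2 pixels of the baseline B.
def rule_x1(theta, B=100):
    return theta[0], B - 2, B + 2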
When studying our training set of images, we noticed that four of the parameters, namely Y_3, Y_4, Y_5, and Y_7, strongly influenced the shape of the vehicle and its classification as a sedan, pickup truck, hatchback, station wagon or 4x4, or van. Thus, the rules for these parameters were designed separately for the five classes. The other parameters obey rules that are common to all categories of vehicles. We will see in Section 4.2.4 how we have incorporated the class-specific rules to segment the front vehicle without knowing a priori the class of the vehicle in the input image.

Figure 4.11: Set of rules used to define the prior probability on the deformed vehicle (rules common to all categories, plus class-specific rules on Y_3, Y_4, Y_5, and Y_7 for sedans, pickup trucks, hatchbacks, station wagons and 4x4s, and vans).

4.2.3 Image Model

Once the template shape and the prior probability have been defined, we need to define the likelihood, which characterizes the "closeness" of the template to the image data. Our likelihood function is defined by an energy function which comprises two terms: a motion information term and an edge directionality term. This combination of two image cues contributes to the novelty of our energy function. The energy term based on motion ensures that the template is located within the moving areas in the image, thus acknowledging the fact that the vehicle of interest is moving. The edge directionality term ensures that the boundary of the template coincides with strong edges having the same orientation in the image. For every point on a template boundary segment (line feature) phi_k, we evaluate the angle between the line feature and the gradient direction in the image at that point. For a good match, this angle should be 90 degrees. We also include the gradient magnitude in our computation of the energy function to ensure that the template is located at strong edge sites. Note that the likelihood function does not assume that the inside, outside, and boundary of the template are independent processes. Before describing the energy function in more detail, we present the motion segmentation and edge detection processes.

Motion segmentation

Motion information is an important cue in our application, since we know that the vehicle of interest is moving. Motion segmentation based on 3-frame image subtraction, as described in Section 3.1.4, is used to extract the moving areas in the image. Note that this motion segmentation algorithm does not use any domain-specific knowledge about the type of objects expected to be present in the image. The resulting motion mask highlights the moving areas in the image, which include both the vehicle of interest and the other moving vehicles.
The difference image is also used to compute the initial baseline position, given by B and B1, and an initial estimate of the length L_0 of the vehicle (see Figure 4.12). These values are needed to evaluate the constraint function C(Theta) in Eq. (4.14). The vehicle of interest is located in the lane closest to the camera, whose extent in the x-direction (xi_1 and xi_2) can be determined if the geometry of the camera is known. The baseline of the vehicle is found by summing the rows of the difference image and locating the maximum sum between xi_1 and xi_2:

B = argmax_{xi_1 <= i <= xi_2} \sum_{j} d(i, j).    (4.16)

The extent of the vehicle in the y-direction is determined by locating the column positions B_1 and B_2 where sharp changes in the column sums \sum_{i} d(i, j) occur. The initial estimate of the vehicle length is defined as L_0 = B_2 - B_1. Note in Figure 4.12(g) that, as a first approximation of the vehicle baseline, this algorithm mostly identifies the vehicle shadow. The baseline and vehicle length are refined by the deformable template algorithm.

Figure 4.12: Estimation of the location of the baseline and length of the vehicle: (a) frame 1; (b) frame 2; (c) frame 3; (d) difference image; (e) row summation and location of B; (f) column summation and location of B_1 and B_2; (g) the white line shows the estimated baseline and length of the vehicle.

Edge information

The edges are detected in the middle frame of the 3-frame sequence, thus I = f_2 in the following. The gradients ∇I_x and ∇I_y in the x and y directions, respectively, are computed using the Canny edge detector [24]. The gradient magnitude

||∇I||(i, j) = sqrt( ∇I_x^2(i, j) + ∇I_y^2(i, j) )    (4.17)

and the gradient unit length direction vector

∇I(i, j) = (1 / ||∇I||(i, j)) ( ∇I_x(i, j), ∇I_y(i, j) )    (4.18)

are computed at every point in the image. Figure 4.13 shows the resulting edge images.

Figure 4.13: Edge detection: (a) input image I; (b) edge magnitude image ||∇I||; (c) and (d) the two components of the edge direction image vector ∇I.
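The initialization just described can be sketched in a few lines of Python. The thesis does not give a numeric criterion for the "sharp change" in the column sums, so the jump_ratio threshold below is an assumption, as are the function and parameter names.

import numpy as np

def estimate_baseline(diff_image, xi1, xi2, jump_ratio=0.5):
    """Sketch of the baseline/length initialization.

    diff_image : 2D array d(i, j) from 3-frame differencing
    xi1, xi2   : row extent of the lane closest to the camera
    Returns (B, B1, B2, L0)."""
    row_sums = diff_image[xi1:xi2 + 1].sum(axis=1)
    B = xi1 + int(np.argmax(row_sums))            # Eq. (4.16)

    col_sums = diff_image[xi1:xi2 + 1].sum(axis=0)
    active = np.where(col_sums > jump_ratio * col_sums.max())[0]
    B1, B2 = int(active[0]), int(active[-1])      # columns with sharp changes
    return B, B1, B2, B2 - B1                     # L0 = B2 - B1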
Likelihood function

The likelihood function is expressed by an energy function which comprises a motion information term and an edge directionality term. Motion information is included in the likelihood function by defining, for each line feature phi_k on the boundary of the template, an energy U_1(phi_k) that is maximum when all the points of phi_k are inside the motion mask M_s:

U_1(phi_k) = |phi_k intersect M_s| / N_k,    (4.19)

where N_k is the number of pixels in line feature phi_k. Note that 0 <= U_1(phi_k) <= 1.

The edge directionality term is defined as follows. We restrict a line feature phi_k to be located near pixels in the image that have a high gradient magnitude and whose orientation is perpendicular to the gradient direction. The length of the line feature phi_k, defined between the points (X_k, Y_k) and (X_{k+1}, Y_{k+1}) on the boundary of the template, is

||phi_k|| = sqrt( (X_{k+1} - X_k)^2 + (Y_{k+1} - Y_k)^2 )    (4.20)

and its unit length orientation vector is

phi_k = (1 / ||phi_k||) ( X_{k+1} - X_k, Y_{k+1} - Y_k ).    (4.21)

Note that this orientation vector is the same for all points on the template boundary between (X_k, Y_k) and (X_{k+1}, Y_{k+1}).

The energy term involving the directionality of the line feature phi_k is defined as

U_2(phi_k) = \sum_{(i,j) in phi_k} [ ||∇I||(i, j) x h( phi_k . ∇I(i, j) ) ]
             / [ sqrt( \sum_{(i,j) in phi_k} ||∇I||^2(i, j) ) sqrt( \sum_{(i,j) in phi_k} h^2( phi_k . ∇I(i, j) ) ) ],    (4.22)

where h(.) (Eq. (4.23)) is a weighting function that attains its maximum at 0, decreases as the magnitude of its argument grows, and is 0 elsewhere, and "." denotes the dot product of two vectors. The first term in the numerator of Eq. (4.22) is maximum when the pixels in the image along phi_k have a high gradient magnitude. The second term is maximum when the gradient direction is perpendicular to the orientation of phi_k. The term in the denominator has three major roles. As a scaling factor, it ensures that U_2(phi_k) is not biased towards longer line features. It also balances the importance of the gradient magnitude term and the gradient direction term. Finally, this type of normalization makes the energy function smoother while retaining its valleys. Using the Schwartz inequality, it can be shown that 0 <= U_2(phi_k) <= 1.

The total energy function for template parameters Theta is then

U(Theta) = (1/N) \sum_{k=1}^{N} [ U_1(phi_k) + U_2(phi_k) ],    (4.24)

where N is the number of line features in the polygonal model. The factor 1/N normalizes the total energy function so that templates with different dimensionalities (numbers of line segments) can be compared. Note that U(Theta) is maximum when the template perfectly coincides with the edges of the vehicle.

For the choice of h(.), we also considered the function

h'(x) = A / (1 + sigma |x|),    (4.25)

which was suggested by Blake and Zisserman [15]. In this function, A controls the height of the curve at the origin and sigma controls the spread of the curve. This function is more appropriate for infinite domains. However, in our expression of the edge directionality term, h(.) is applied to a variable whose value varies between -1 and +1. For different parameters A and sigma, we noticed that Eq. (4.25) resulted in very narrow valleys of the energy function. The function h(.) that we propose in Eq. (4.23) does not suffer from this problem and results in a smoother energy function.

4.2.4 MAP Segmentation

Given the constraint function and the likelihood function, the image segmentation problem can be formulated in the Bayesian framework. In particular, we seek the maximum a posteriori (MAP) estimate defined as

Theta_hat = argmax_{Theta} p(Theta | X).    (4.26)

Let us define the a posteriori probability distribution p(Theta | X) as a Gibbs distribution with energy E(Theta) = -U(Theta) + C(Theta), which combines the likelihood and constraint energy functions:

p(Theta | X) = (1/Z) exp{ -E(Theta) },    (4.27)

where Z is a normalizing constant. Then, Eq. (4.26) reduces to

Theta_hat = argmin_{Theta} E(Theta) = argmin_{Theta} [ -U(Theta) + C(Theta) ].    (4.28)

Figure 4.14: Cross-sections of the energy function along the two parameters Y_3 and Y_5.

So, the segmentation problem is reduced to an energy minimization problem. The optimal template is defined as

Theta_hat = argmin_{Theta} [ -(1/N) \sum_{k=1}^{N} ( U_1(phi_k) + U_2(phi_k) ) + \sum_{k=1}^{R} g(r_k) ].

The energy function is minimized using the Metropolis algorithm: at each iteration, a new set of template parameters is proposed according to a transition matrix R and is accepted with probability min{1, exp(-[E(Theta_new) - E(Theta_k)] / T_k)}, where T_k is the temperature at iteration k; otherwise the current parameters are retained. The resulting sequence of template parameters forms a Markov chain. It has been proved that the temperature schedule T_k = C / (1 + log k) ensures that this Markov chain converges to the global minimum of the energy function with probability 1 [110].

The transition matrix R determines how the next template parameters are chosen, given the current template parameters. Recall that our template T(Theta) is defined as a polygon consisting of N points in the 2D plane. The new location (X_n^{k+1}, Y_n^{k+1}) of the nth vertex is determined according to the following transition probabilities:

P[ X_n^{k+1} = i - 1, Y_n^{k+1} = j     | X_n^k = i, Y_n^k = j ] = 1/4
P[ X_n^{k+1} = i + 1, Y_n^{k+1} = j     | X_n^k = i, Y_n^k = j ] = 1/4
P[ X_n^{k+1} = i,     Y_n^{k+1} = j - 1 | X_n^k = i, Y_n^k = j ] = 1/4    (4.34)
P[ X_n^{k+1} = i,     Y_n^{k+1} = j + 1 | X_n^k = i, Y_n^k = j ] = 1/4.

In other words, each vertex P_n of the template is allowed to move to any of its 4 neighbors with equal probability.
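The following minimal sketch shows one Metropolis update of the energy E = -U + C, assuming the energy is supplied as a callable; it perturbs a randomly chosen vertex, whereas the thesis scans the vertices clockwise, and all names are illustrative.

import numpy as np

def metropolis_step(theta, energy, T_k, rng=np.random.default_rng()):
    """One vertex update of the Metropolis minimization of E(Theta) = -U + C.

    theta  : (N, 2) array of template vertices
    energy : callable returning E(theta)
    T_k    : current temperature
    A vertex is moved to one of its 4 neighbors (Eq. (4.34)) and the move is
    accepted with probability min{1, exp(-dE / T_k)}."""
    proposal = theta.copy()
    n = rng.integers(len(theta))                       # vertex to perturb
    dx, dy = [(1, 0), (-1, 0), (0, 1), (0, -1)][rng.integers(4)]
    proposal[n] += (dx, dy)

    dE = energy(proposal) - energy(theta)
    if dE <= 0 or rng.random() < np.exp(-dE / T_k):
        return proposal                                # accept the move
    return theta                                       # reject the move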
Segmentation algorithm The first requirement of the segmentation algorithm relates to the five different types of templates (vehicles) which we have designed. As mentioned in Section 4.2.2, dif- ferent classes of vehicles are constrained by different sets of rules. How can we use class information in the segmentation algorithm if the class of the vehicle in the in- put image is not known in advance? We chose to run the Metropolis algorithm as many times as the number of classes. Our segmentation algorithm is divided into five epochs, each epoch corresponding to a particular vehicle class (sedan, pickup truck, hatchback, station wagon and 4x4, and van). In each epoch, the Metropolis algorithm is started with a template belonging to the vehicle class being considered and iterates while constraining the template to remain in the same class. The algo- rithm then outputs the template for the vehicle class that has the minimum energy. This approach has two advantages. First, the five epochs can be executed in parallel. Second, in addition to segmenting the vehicle in the lane closest to the camera, the algorithm is also able to classify the vehicle into one of five classes. To apply the Metropolis algorithm, we need to define an appropriate cooling 121 _— t\ v logarithmic g geometric ------- .1 Temperature PJ'JJt-UIO‘QmC T vaY_. .._‘ M t) 250 500 750 1000 1250 1500 1750 2000 Number of iterations :— V Figure 4.15: Comparison of the logarithmic and geometric cooling schedules. (temperature) schedule. The cooling schedule is typically defined as C T :——- . k l+1ogk’ (435) where T), is the temperature at the kth iteration and C is a constant. Our experience indicates that it is very difficult to choose a “correct” value for this constant C; if C is too small, the algorithm converges to a local minimum relatively close to the starting position, and if C is too large, the number of iterations required for convergence is extremely large. Strenski and Kirkpatrick [171] have studied finite length annealing schedules and have suggested the use of a geometric cooling schedule. We adopt their strategy and choose the following cooling schedule k T -—. T1. = To (.1) A , (4.36) where To is the starting temperature and Tf is the final temperature to be reached after K iterations. Figure 4.15 shows the behavior of the geometric cooling schedule compared to the logarithmic schedule. It can be seen that in the geometric schedule, the temperature does not initially drop too rapidly, but it approaches the zero value much faster. 122 Figure 4.16 illustrates the evolution of the template as the number of iterations increases. The initial placement of the template is determined by the parameters B, Bl, and L0, which estimate the initial position of the baseline and the initial length of the vehicle. At the beginning of the algorithm the temperature is relatively high, so the algorithm explores a large space of template parameters even if the corresponding templates do not satisfy the constraints and have a relatively high energy. As the temperature decreases, the constraints are satisfied, the outline of the template gets closer to the edges of the vehicle of interest, and the energy converges towards the global minimum. However, we have experienced difficulties in choosing the parameters for the cool- ing schedule. 
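For reference, the two cooling schedules of Eqs. (4.35) and (4.36) can be written directly; the example parameter values in the closing comment are taken from the experiments described below.

import numpy as np

def logarithmic_schedule(C, K):
    """Classical schedule T_k = C / (1 + log k), k = 1..K (Eq. (4.35))."""
    k = np.arange(1, K + 1)
    return C / (1.0 + np.log(k))

def geometric_schedule(T0, Tf, K):
    """Geometric schedule T_k = T0 * (Tf / T0)**(k / K), k = 0..K (Eq. (4.36))."""
    k = np.arange(K + 1)
    return T0 * (Tf / T0) ** (k / K)

# With, e.g., T0 = 5.0, Tf = 0.01, and K = 2000, the geometric schedule does not
# drop too rapidly at first but approaches zero much faster than the logarithmic one.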
We ran the five epochs of the Metropolis algorithm for 18 sets of pa- rameters: To 6 {1.0,2.0,5.0, 100,200, 50.0}, T; = 0.01, and K E {2000, 2500, 3000} on a particular image to observe the behavior of the algorithm. Figures 4.17 and 4.18 show the resulting templates, their vehicle class, and the corresponding energy function. It can be seen that the segmentation and classification results are strongly affected by the parameters of the temperature schedule. In 12 of the 18 cases, the vehicle was correctly classified as a van, but in two of those cases, the shape of the template is not very accurate. In the remaining 6 cases, the vehicle was mistaken for a sedan or a pickup truck. Unfortunately, the best set of parameters for the temperature schedule varies with the input image. By studying the behavior of the algorithm on a number of images, we have noticed that our formulation of the energy function is correct in the sense that its minimum always corresponds to a template which correctly outlines the contour of the vehicle. So, the problem lies with the sensitivity of the annealing algorithm to the parameters of the cooling schedule. To maximize the chance of finding the global minimum of the energy function, we run each epoch of the Metropolis algorithm 123 Iteration 0 (E = —1.508) Iteration 200 (E = 2.437) Iteration 400 (E = 2.580) Iteration 600 (E = ——0.759) Iteration 800 (E = —1.503) = —1.851) Iteration 1400 (E = —1.885) Iteration 1600 (E = —1.893) Iteration 1800 (E = —1.917) Iteration 2000 (E = —1.928) Figure 4.16: Successive iterations of the vehicle segmentation algorithm. To = 1.0, K = 2000 To = 1.0. K = 2500 To = 1.0, K : 3000 E = -—1.89, Sedan E = —1.88, Sedan E = —1.87, Pickup To = 2.0, K = 2000 To = 2.0, K = 2500 To = 2.0, K = 3000 E = —1.89. Sedan E = —1.92, Van E = —1.91, Van To = 5.0, K = 2000 To = 5.0, K = 2500 To = 5.0, K = 3000 E = —1.91, Van E = —1.89, Sedan E = —1.92, Van Figure 4.17: Results of the Metropolis algorithm for different sets of parameters. 125 To = 10.0, K = 2000 To = 10.0, K 2 2500 To = 10.0. K = 3000 E = —l.84, Van = —1.88, Sedan E = —l.93, Van ,4, To = 20.0, K = 2000 To = 20.0, K = 2500 To = 20.0, K = 3000 E = —1.91, Van E = —1.93, Van E = —1.92, Van To = 50.0, K = 2000 To = 50.0. K = 2500 To = 50.0. K = 3000 E = —1.84, Van E = —1.92, Van E = —-1.92, Van Figure 4.18: Results of the Metropolis algorithm for different sets of parameters (cont’d). 126 with 18 different sets of parameters and retain the template which has the minimum energy. Of course, this adds to the computational requirement of the algorithm. During each iteration of the Metropolis algorithm. the N vertices are scanned clockwise. Let 9"” be the set of template parameters obtained after iteration k — 1. At iteration k, a sequence of N template parameter sets (8,1295, . . . ,Ofv) is created by sequentially updating each of the vertices. For notation purposes, we define 9'6 = 0"”. Each 91‘, is created by moving the nth point in 9:4. Then 9" is defined to be Ofv when all the vertices have been updated. A high-level description of our segmentation algorithm is given in Figure 4.19. The CPU time requirements (on a Sparcstation 20) for a 256x256 images are as follows. As mentioned earlier, the motion segmentation algorithm takes 7 s. The computation of the gradient magnitude and direction based on the Canny edge de- tector takes 3.5 5. Finally, each epoch of the deformable template algorithm takes 15 s for 2,000 iterations. 
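The outer search over vehicle classes and cooling-schedule parameters can be summarized by the sketch below; run_epoch is a hypothetical stand-in for one per-class annealing run of Figure 4.19, and the returned triple is an illustrative choice.

import itertools

def segment_vehicle(run_epoch):
    """Run one Metropolis epoch per vehicle class and per cooling setting,
    and keep the minimum-energy template.

    `run_epoch(vehicle_class, T0, K)` is assumed to return (energy, template)."""
    classes = ["sedan", "pickup truck", "hatchback", "station wagon/4x4", "van"]
    T0_values = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
    K_values = [2000, 2500, 3000]

    best = None
    for cls, T0, K in itertools.product(classes, T0_values, K_values):
        energy, template = run_epoch(cls, T0, K)
        if best is None or energy < best[0]:
            best = (energy, template, cls)
    return best   # (minimum energy, template, predicted vehicle class)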
4.2.5 Experimental Results The deformable template algorithm for finding the vehicle in the lane closest to the camera was tested on a total of 287 images. Figures 4.20 to 4.24 show a few of the examples of segmentation results. Note that in several of these images, the front vehicle occludes a number of other moving vehicles and the stationary background is rather complicated. It can be seen that the outline of the vehicles were accurately identified and the vehicles were correctly classified. Figure 4.25 show examples of vehicles which were incorrectly classified. In Fig- ure 4.25(a), a sedan was mistaken for a hatchback, because the trunk of the vehicle was not detected. In Figure 4.25(b), a van was classified as a pickup truck because 127 Input: Motion segmentation mask M,, estimated baseline (B, BI) and length (L0) of the vehicle, gradient magnitude image “VI”, and gradient direction image Vi begin Set Tf = 0.01 f_or; To e {1.0, 2.0, 50,100, 20.0, 50.0} fi)_r K E {2000, 2500, 3000} fo_r C 6 (Sedan, Pickup truck, Hatchback, Station Wagon, Van) Select an initial set of template parameters ('90 in C Evaluate E (90) Set (110.130 = 9° Set k = 0 /* K iterations of the Metropolis algorithm */ while I: < K Set k = k + 1 Define (9’5 = 9"” /* Examine all the vertices clockwise */ f_o; n = 1, . . . , N Choose new template parameters F /* The new template defined by the parameters F has the same vertices as the template defined by the parameters 0L1 except that point Pn has been moved to one of its four neighbors */ Evaluate E(F) Compute T), = T0(Tf/T0)k/K Compute p = exp{-[E(I‘) — E(®t-.)1/Tk} Generate u, 0 S u S l fipzlorUvM<4 0.21 32x32x32 8X8x8 0.34 16x16x16 16x16x16 0.68 8x8><8 32x32x32 0.65 4x4x4 64x64x64 0.61 2x2x2 128x128x128 0.56 Figure 5.2: Eight color patches. 16) in each dimension. Experiments with real data Figure 5.2 shows 8 of the 75 patches of color and Table 5.2 shows the color matching scores for all possible pairs. It can be seen that there are only two significant matching scores: 0.973 between (a) and (b) which are both black and 0.817 between (c) and (d) which are both red. All the other matching scores in Table 5.2 are much smaller. Thus, the color matching score is able to match similar colors and discriminate between different colors. To show more complete results, we performed a hierarchical clustering of the 75 color patches using the single linkage technique. We used the color matching score as a measure of the similarity between two color patches. Figure 5.3 shows the resulting 139 Table 5.2: Color matching scores for the eight color patches shown in Figure 5.2. 1 H (1)) 1 (c) 1 (.1) (e) 1 (f) 1 (g) 1 (hfl (a) 0.973 0.0230 0.001 0.000 0.000 0.000 0.303 (b) 0.044 0.017 0.000 0.000 0.000 0.300 (c) 0.817 0.000 0.000 0.000 0.037 ((1) 0.000 0.000 0.000 0.010 (e) 0.000 0.000 0.001 (f) 0.000 0.000 (g) 0.001 dendrogram. It can be seen that clusters of color such as red (R), black (Bk), white (W), brown (Br), yellow (Y). green (Gr). and blue (Bl) can easily be identified. Other colors such as light brown (Lb) and gray (Gy) are mixed with other clusters (white, blue, and brown) because the boundaries of the light brown and gray clusters are not as well defined. These results show that the proposed color matching score is able to match similar colors very well. 5.1.3 Baseline distribution of the color matching score How large should be the value of the color matching score between two color his- tograms before the corresponding colors can be called similar? 
111 order to assess the significance of a color matching score, we have to determine a baseline distribution under the null hypothesis of randomness. This will help in analyzing the results in Table 5.2. Baseline distribution under random color hypothesis Uniform color images of size 20x50 were created by randomly generating three num- bers between 0 and 255 for the three color components. Gaussian noise with mean 0 and standard deviation 100 was then added to the uniform color images. We 140 0.0 Figure 5.3: Hierarchical clustering of the 75 color patches based on the color histogram matching score. l-ll ...) . . . ”US“ .. . /, , . 1 115(1) 1- f ~' Cumulam c Distnhulmn \ (1.0 (1.1 (12 0.3 (1.4 0.5 [1.6 0.7 l) 8 0.9 1.0 Color Matching Score Figure 5.4: Cumulative distribution of the color matching score under the random color hypothesis. generated 10,000 pairs of images using this process and computed the 10,000 color matching scores. Figure 5.4 shows the cumulative distribution of the color match- ing score under this random color hypothesis. It can be seen that the probability of obtaining a color matching score greater than 0.2, under the null hypothesis, is less than 0.05. According to this result, the gray color patch in Figure 5.2(h) is similar to the back color patches in Figures 5.2(a)-( b), because the observed color matching score is significantly larger than 0.2! This shows that this null distribution defined by the random color hypothesis is not very appropriate for our application. In real scenes, the colors will never be as diverse as the randomly generated colors. Distribution for the 75 patches of color To get an idea about the distribution of color matching scores for real data, we plotted the cumulative distribution of the 2,775 color matching scores that were obtained in the 75x75 similarity matrix using the 75 color patches defined in the previous section. This cumulative distribution is shown in Figure 5.5. Now a color matching score greater than 0.7 is quite unusual since it has a probability of 0.05. 142 [(u ... . .- . . .. _. . . - .. 2. “qj a ., . ., . .. . . ., . MIT—__J 1 l ”.5(f !- I I Cumulative Distnhutmn 1H1 (H) {II 112 0.3 0.4 0.5 (16 0.7 0.8 0.9 1.0 Color Matching Scan: Figure 5.5: Cumulative distribution of the color matching score for the 75 color patches. 5.2 Edge Point Set Matching The color matching score defines a similarity measure between two objects based only on color. A drawback of this approach is that since histograms are used to compute the matching score, the spatial relationship between the neighboring pixels of different color is not utilized. As a result, two distinct objects containing the same set of colors will have a high color matching score even though the arrangement of the different colors might be very different in the two objects. To overcome this limitation, objects should also be matched using other attributes. A number of techniques for object (2D shape) matching using moments [89], Fourier descriptors [150], or normalized chord length distributions [164] have been presented in the literature. While many of these methods are invariant to size, ro- tation, and translation, they are not information—preserving, i.e., two objects might have the same shape features and yet appear to be very different. Some authors have presented matching methods based on feature points. Davis [39], and Ranade and Rosenfeld [157] presented algorithms to match feature points using relaxation techniques. 
Xu [185] proposed a method which requires solving the correspondence 143 problem between feature points. These techniques have the disadvantage of being computationally demanding when the number of feature points is large. A different approach is to match relational structures (graphs [162] or stars [31]) by computing subgraph isomorphism. but these techniques strongly rely on the construction of the relational structure. Also. finding subgraph isomorphisms is an NP—complete problem which makes this approach impractical. The proposed shape matching method is based on point matching. An object is stored as a set of edge points. and two objects are compared by measuring a distance between the two sets of edge points. The proposed distance measure is motivated by the Hausdorff distance between two edge maps (sets of edge points) associated with the objects of interest [93]. We assume that a segmentation algorithm has already been applied to the entire image to extract the edge points belonging to the object of interest. We investigated 24 different distance measures and, based on their behavior in the presence of noise, have determined the best distance measure, called the modified Hausdorff distance (MHD) for object matching. MHD, like the Hausdorff distance, is easy to compute and does not require the solution of the correspondence problem. We have studied the performance of MHD on several real images. We have also empirically determined baseline distributions for MHD under two different null hypotheses to determine the significance of the observed values of MHD. 5.2.1 Distance measures The classical measure of distance between two point sets A = {a1,...,aNA} and B 2 {b1,. .., [mg] of sizes NA and N3, respectively, is the Hausdorff distance defined as: H(.A,B) = max(h(A,B),h(B,.A)), (5.7) a 13> 3* 42930 fake Figure 5.6: Hausdorff distance between two point sets where A = {—1-} and B = {O}: (a) computation of h(A.B) = 11; (b) computation of h(B,A) = 6: H(A,B) = 11. where the directed distances are defined as: h(A.B)= maxminHa—b , (5.8) 06A (DEB h(B,A) = m6a3xrtr1€12|lb — a”, (5.9) and H.” is the distance norm (e.g., Manhattan or Euclidean distance). Figure 5.6 shows an example of the computation of the Hausdorff distance between two point sets. Huttenlocher et al. [93] proposed a distance measure to compare two sets of object edge points in the presence of occlusion. This distance measure is a modification of the Hausdorff distance, where the directed distance is defined as: hid-498) = ["5234 $15111“ — blla (5-10) where K224 denotes the Kth ranked value in the set of distances (one corresponding to each element of A). For the Hausdorff distance, K would be equal to NA since h(A,B) is the maximum distance over all points in set A. A suitable value of K is determined empirically. This ranked Hausdorff distance was motivated by the need to locate a possibly occluded object model in a. given image. Our interest is in object 145 matching where the two objects of interest have already been segmented from their respective images. Definition of the distance measures It is generally agreed that the distance between two points a and b in a coordinate system is the norm of the vector joining the two points. In the following discussion, we will assume that the Euclidean distance is used. (101,3) = [la — b” (5.11) The distance between a point a and a set of points 8 = [61,. . ..bNB} is commonly defined as the minimum of the set of distances d(a,b,),i = 1, . . . , N3: d(a,B) 2 Tell? 
Ha — 6H (5.12) There are many ways to define the directed distance between point set A = {ah . . . , aNA} and point set 3 = (61,. . . .bNB}. For the purpose of this study, 6 possible directed distance measures were defined as follows. The distances d2, (13, and d., are some of the generalized Hausdorff distances proposed by Huttenlocher et al. [93]. d1(A,B) = 13353. d(a,B) (5.13) (MAB) = 501.’;’;A(1(a,8) (5.14) (1.3.4.13) = 75KyéAd(a,B) (5.15) d,(.4,13) = 9011'32Ad(a,8) (5.16) d,(A,B) = Tea} 401,8) (5.17) (1.34.13) = \+ Z d(a,B) (5.18) A aEA 146 Table 5.3: 2’1 distance measures between two point sets. directed function distance f1 ] f2 ] f3 ] f4 ([1 Bl 02 D3 [)4 d2 05 De D7 08 (13 09 010 011 012 ([4 013 014 DIS 016 (15 1217 DIS 019 020 d6 021 D22 D23 D24 where “’11" 0. EA represents the Km ranked distance such that K/NA = 1%. For example, BOA/Hr 06,, corresponds to the median of the distances d(a,B),Va E A. We could have chosen arbitrarily many distances depending on the value of .r. We chose the three specific values of .r (50,75,90) as a representative set. These directed distance measures can then be combined in four different ways to produce a distance measure between two point sets. f1(d(A,B),d(B,A)) = min(d(A,B),d(B,A)) (5.19) f2(d(A,B),d(B,A)) = max(d(A,B),d(B,A)) (5.20) fs(d(A.B),d(B,A)) = MB) ‘; “31"” (5.21) f.(d(.4.3),d(13,.4))= N23415: : 21:38.41) (5.22) These four ways (Eqs.(5.19)-(5.22)) of combining six directed distance measures (Eqs.(5.13)-(5.18)) result in 24 possible undirected distance measures between two point sets as summarized in Table 5.3. Note that 018 corresponds to the well-known Hausdorff distance. 147 Input: A binary image I begin /* Initialization ‘1 fit; all i.j = 1 ...... \7 ifI[i,j] = 1 then D[i.j] '2 0 gLs_e D[i.j] = so Vi.j endfor /* Forward pass ‘/ {g all i.j = l ...... \" 0(3)] 2 n1i11{D[i,j],D[z'—1,j—1]+ @034 141+ 1, 0(1— 1.j+1]+\/2,D[i.j— 1]+1} endfor /* Backward pass */ fgrgall i,j= N,...,1 D[z',j] = min{D[i.j],D[i+1.j— 11+ fi,D[i + 1.)] + 1, D[i+1.j+l]+\/2.D[i.j+1]+l} endfor m Output: A distance image D where value at a pixel indicates the distance to the closest edge point Figure 5.7: Chamfering algorithm. Efficient computation of distance measures When the points are placed on a discrete grid (like an image), there is an efficient way to compute these distance measures. Let [,4 be an image of the point set A such that IA[i,j] = 1 if the point (i,j) belongs to A and 0, otherwise. Define [3, the image for the point set 8, in the same manner. The distance from one point a to a point set 3 can be efficiently computed using the Chamfering method [7] which computes a distance map D[i,j] for the given binary image I[z,j]. Each pixel in D contains a number corresponding to the distance to the closest point in the image I. A high-level algorithm to compute a distance map using the Chamfering technique is shown in Figure 5.7. For computing the distance between two point sets A and B, the Chamfering algorithm is applied to both the binary images IA and [3 to produce the chamfer 1318 images (or distance maps) DA and 03. Now, the distance (1(a. B) is simply DLz[a,-, a3], and the distance d(b.A) is DANA-.61]. where a = ((1,411) and b = (bhbJ). 5.2.2 Evaluation of the distance measures A distance measure for the purpose of object matching should have the following properties: (1) it should have a large discriminatory power. (2) its value should increase as the two objects differ more in their shapes. 
Metric properties We first analyze the metric properties of the 24 distance measures in Table 5.3. A distance measure (1 : .X’ x X —-) R is a metric if it obeys the following axioms: 0) 414.11) = 401.4) Vay E X (ii) 0 D13(A,C). The modified Hausdorff distance (.922), however, correctly suggests that 022(A,B) is minimum. Let us now evaluate the ratio of the distance value for the true match and the distance value for the next best match. This ratio is 1 for Dlo, 0.48 for 014, and 0.24 for 022. This again shows Figure 5.15: Edge images for four vehicles. Table 5.4: Values of Dlo, 014, 013, and 022 for the four vehicles shown in Figure 5.15. Note that MHD (022) has the best discriminatory power for object matching. [0,0 13 c D [1)... B c D A 2.0 5.0 3.0 A 2.8 10.0 5.83 B 7.0 2.0 B 10.4 7.1 c 6.0 c 12.2 (1)18 13 c 7) 022 B c D A 31.9 22.0 31.7 A 1.0 6.2 4.3 B 106.8 24.6 B 6.4 4.4 c 35.9 c 6.3 that the modified Hausdorff distance (022) has more discriminatory power than the generalized Hausdorff distances (010 and D14) proposed in [93]. The modified Hausdorff distance was also compared to the other distances on real edge images of people while they were walking inside the Engineering building at Michigan State University [44]. The four edge images of people shown in Figure 5.16 were obtained using the same segmentation algorithm which was used to locate mov- ing vehicles [46]. The two objects A and B in Figure 5.16 are known to be identical, so the distance value between them is expected to be significantly smaller. The val- ues of distance measures 010, 014, 018, and D22 are shown in Table 5.5. Again, it can be seen that MHD has more discriminatory power than the other three distance / _ ___r) €54 (ex/Lizzi; A B C D Figure 5.16: Edge images for four people. Table 5.5: Values of 010, D”, 018, and 022 for the four people shown in Figure 5.16. Note that MHD (D22) has the best discriminatory power for object matching. [010 B c D D... B c D A 3.8 6.0 5.2 A 7.0 8.5 7.1 B 6.7 5.0 13 9.7 15.0 C 4.2 C 8.1 0,, B c 73 022 B c D A 13.0 12.9 17.4 A 2.5 3.7 3.3 13 14.3 34.0 B 4.0 5.1 c 16.7 c 3.5 measures. The Hausdorff distance 018 shows that A is more similar to C than to B. The ratio of the distance value for the true match and the distance value for the next best match is 0.88 for 010, 0.99 for 014, and 0.76 for 022. 5.2.3 Baseline distribution of the modified Hausdorff dis- tance How small should be the value of the modified Hausdorff distance between two point sets before the corresponding two objects can be called similar? This is similar to the problem of determining cluster tendency of a given data and cluster validity of a given partition [96]. A baseline distribution under the null hypothesis of randomness 156 will help us to better analyze the results in Table 5.4. Baseline distribution under random point pattern hypothesis Random point patterns were generated in 256 x 256 images and the modified Haus- dorff distance was computed between 10,000 pairs of images. Figure 5.17 shows the cumulative distribution of the MHD values under the random point pattern hypothe— sis. Table 5.4 shows that for the 4 edge images in Figure 5.15, the modified Hausdorff distance, D22 took different values: 1.0, 4.3, 4.4, 6.2, 6.3, and 6.4. According to the plot in Figure 5.17, the probability of obtaining a MHD value of 1, 4.3, or 4.4 under the null hypothesis of randomness is extremely low (< 0.0001) whereas the probabil- ity of obtaining a value larger than 6 is close to 0.2. 
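The empirical baselines described next can be generated with a small Monte Carlo sketch like the following, which reuses the modified_hausdorff sketch given earlier; the number of points per random pattern is an assumption, since the thesis does not state it.

import numpy as np

def mhd_baseline(n_pairs=10000, n_points=300, img_size=256,
                 rng=np.random.default_rng(0)):
    """Empirical null distribution of MHD under a random point pattern hypothesis."""
    values = []
    for _ in range(n_pairs):
        A = rng.integers(0, img_size, size=(n_points, 2))
        B = rng.integers(0, img_size, size=(n_points, 2))
        values.append(modified_hausdorff(A, B))   # from the earlier sketch
    return np.sort(values)                        # read off quantiles / CDF

# P(MHD <= t) can then be estimated as np.searchsorted(values, t) / len(values).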
This implies that it is extremely unusual to observe a value of 1.0, 4.3, or 4.4 for MHD by chance. In other words, an observed value of 1, 4.3, or 4.4 for MHD would indicate that the two objects are indeed very similar. This interpretation poses a problem because it implies that objects B and D are also similar (D_22(B, D) = 4.4)! An explanation of this fallacy is the inappropriate choice of the null distribution. Our matching problem deals with matching edge point sets obtained from vehicle images. The resulting edge images present some perceptual organization which is not modeled by random point patterns. A more appropriate baseline distribution will be obtained if it is derived under the random vehicle hypothesis defined below.

Baseline distribution under the random vehicle hypothesis

A random vehicle can be generated according to the generic vehicle model shown in Figure 5.18. Each of the model parameters L, H, C_1, C_2, R_1, R_2, h, d_1, d_2, d_3, and d_4 is randomly generated from a uniform distribution over the following intervals:

  140 ≤ L ≤ 220            0.3H ≤ h ≤ 0.95H
  0.2L ≤ H ≤ 0.4L          0 ≤ d_1 ≤ 0.5L
  0.2L ≤ C_1 ≤ 0.25L       0.5d_5 ≤ l ≤ 0.7d_5          (5.24)
  0.75L ≤ C_2 ≤ 0.8L       0 ≤ d_2 ≤ 0.8d_1
  0.25H ≤ R_1 ≤ 0.35H      0.2d_6 ≤ d_3 ≤ 0.6d_6

Figure 5.18: Generic model of a vehicle.

These parameter values were empirically determined and lead to reasonable vehicle shapes and sizes (Figure 5.19). The perpendicular bisector of the roof of the vehicle is drawn to represent the separation between the two windows. At most two lines are drawn across the body of the vehicle to represent the "guard rail". Additional lines are randomly generated to simulate noisy edges inside the vehicle. Four 256 × 256 images of random vehicles generated by this model are shown in Figure 5.19.

Figure 5.19: Examples of random vehicles generated by the model.

The modified Hausdorff distance was computed between 10,000 pairs of random vehicle patterns generated in 256 × 256 images.
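A heavily simplified version of the random vehicle simulation is sketched below. Only L, H, and h are drawn from the intervals of Eq. (5.24); the body, cabin, wheel, and noise-line layout only loosely approximates the template of Figure 5.18; and the baseline position and sample size are assumptions. Pillow is used purely for rasterization.

```python
import numpy as np
from PIL import Image, ImageDraw
from scipy.ndimage import distance_transform_edt

def mhd(a, b):
    # Modified Hausdorff distance D22 between two binary edge images.
    d_ab = distance_transform_edt(b == 0)[a > 0].mean()
    d_ba = distance_transform_edt(a == 0)[b > 0].mean()
    return max(d_ab, d_ba)

def random_vehicle(rng, size=256):
    """Rasterize a random, simplified vehicle outline.  L, H, and h follow
    Eq. (5.24); everything else (baseline position, cabin corners, wheel
    placement, number of noise lines) is an illustrative approximation of
    Figure 5.18, not its exact geometry."""
    L = rng.uniform(140, 220)
    H = rng.uniform(0.2 * L, 0.4 * L)
    h = rng.uniform(0.3 * H, 0.95 * H)
    img = Image.new("L", (size, size), 0)
    drw = ImageDraw.Draw(img)
    x0, y0 = (size - L) / 2.0, 0.75 * size                 # assumed baseline position
    drw.rectangle([x0, y0 - h, x0 + L, y0], outline=255)   # vehicle body
    drw.polygon([(x0 + 0.25 * L, y0 - h), (x0 + 0.35 * L, y0 - H),
                 (x0 + 0.70 * L, y0 - H), (x0 + 0.80 * L, y0 - h)],
                outline=255)                               # cabin / roof
    R = 0.3 * H
    for cx in (x0 + 0.2 * L, x0 + 0.8 * L):                # two wheels
        drw.ellipse([cx - R, y0 - R, cx + R, y0 + R], outline=255)
    for _ in range(int(rng.integers(1, 4))):               # noisy interior edges
        xa, xb = np.sort(rng.uniform(x0, x0 + L, 2))
        y = rng.uniform(y0 - h, y0)
        drw.line([xa, y, xb, y], fill=255)
    return (np.array(img) > 0).astype(np.uint8)

rng = np.random.default_rng(1)
samples = np.array([mhd(random_vehicle(rng), random_vehicle(rng))
                    for _ in range(200)])                  # 10,000 pairs in the thesis
print("P(MHD < 3.0) ~", np.mean(samples < 3.0))
```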
Figure 5.20 shows the cumulative distribution of the distance values.

Figure 5.20: Cumulative distribution of MHD under the random vehicle hypothesis.

The significance of the distance values that were obtained for the real vehicle edge images in Figure 5.15 can now be determined from this plot. The probability of obtaining a distance value of 1.0 is extremely low (< 0.0001) whereas the probabilities of obtaining distance values greater than 4 are rather high (> 0.35). This implies that the two identical vehicles A and B are indeed similar, whereas none of the other match values for D_22 in Table 5.4 are significant. We note that the probability of obtaining a distance value less than 3.0 is only 0.1.

5.2.4 Registration of two edge images

The modified Hausdorff distance between two point sets A and B depends on the relative translation, rotation, and scale changes between the two point sets. The segmented vehicles do not necessarily have the same position, orientation, and scaling in the two images. Therefore, the two sets of edge points need to be registered prior to computing the distance between them.

Pose clustering is a well-known technique to estimate the transformation parameters between two sets of points. Pose clustering [168] considers all possible pairs of model and image features and computes the transformation (translation, rotation, and scaling) between them. Each pair of model/image features defines a transformation. The possible transformations are then accumulated in a discretized transformation space using a Hough-based approach. Peaks in the transformation space define the pose candidate hypotheses for the objects. All these hypotheses are verified to find the correct one.

In our image acquisition setup, if the two cameras are properly aligned with the road geometry, then we can assume that the orientation changes between the objects to be matched are negligible. Thus, we only consider 2D translation and scale changes, so the transformation space is 3-dimensional. The algorithm to identify the translation and scale parameters is shown in Figure 5.21. Two edge images E_1 and E_2 are input to the algorithm. The "on" pixels in these two edge images form the two edge point sets, which we also denote by E_1 and E_2 to simplify the notation. The transformation T that best registers the two edge images is defined by the 2D translation parameters (X, Y) and the scaling factor S.

  Input: Two edge images E_1 and E_2
  begin
    Initialize H[x, y, s] = 0  ∀ x, y, s
    for all s ∈ [s_min, s_max]
      /* Apply s to the edge points in E_1 */
      E_1^s[s(i − m_i) + m_i, s(j − m_j) + m_j] ← E_1[i, j]  ∀ i, j,
        where (m_i, m_j) is the center of mass of the edge points in E_1
      /* Store all the possible translations */
      for all (i_1, j_1) ∈ E_1^s and (i_2, j_2) ∈ E_2
        x = i_2 − i_1
        y = j_2 − j_1
        H[x, y, s] ← H[x, y, s] + 1
      endfor
    endfor
    H[X, Y, S] = max_{x,y,s} H[x, y, s] and T = (X, Y, S)
    E_1^T[S(i − m_i) + m_i + X, S(j − m_j) + m_j + Y] ← E_1[i, j]  ∀ i, j
  end
  Output: Two new edge images E_1^T and E_2, translation parameters (X, Y), and scale parameter S

Figure 5.21: Pose clustering algorithm to estimate the translation and scale parameters between the two edge images.

Figures 5.22(a) and 5.22(b) show two edge images E_1 and E_2, respectively, of the same vehicle observed by two different cameras at different locations and times. Figure 5.22(c) shows E_1^T after the transformation T = (X, Y, S) has been applied to E_1 so that it is registered with E_2. Note that E_1^T is more similar to E_2 than E_1 is to E_2.

Figure 5.22: Pose clustering to register two edge images: (a) first edge image E_1; (b) second edge image E_2; (c) transformation T = (−6, −5, 0.875) applied to E_1 to produce E_1^T.

Given two objects O_1 and O_2 with edge images E_1 and E_2, the edge matching score is defined as

D^e(O_1, O_2) = MHD(E_1^T, E_2) = \max( d(E_1^T, E_2), d(E_2, E_1^T) ),   (5.25)

where T is the transformation applied to register the two edge images E_1 and E_2, and d is the directed distance in the definition of the modified Hausdorff distance:

d(E_1^T, E_2) = \frac{1}{|E_1^T|} \sum_{p \in E_1^T} \min_{q \in E_2} \| p − q \|   (5.26)

d(E_2, E_1^T) = \frac{1}{|E_2|} \sum_{q \in E_2} \min_{p \in E_1^T} \| q − p \|   (5.27)

It takes between 5 s and 11 s to compute D^e on a Sparcstation 20 for a pair of 256 × 256 images (4 s to 10 s to do the pose clustering and 1.5 s to compute the modified Hausdorff distance).

5.3 Summary

We have proposed two matching scores S^c(O_1, O_2) and D^e(O_1, O_2) to compare two segmented objects O_1 and O_2 based on their color and shape.
The color matching score, SC(01, 02) is a modification of Swain and Ballard’s color histogram matching measure. We have empirically determined the optimal bin size that should be used to build the 3D color histograms. We have applied this color matching strategy to color patches extracted from images of outdoor scenes and demonstrated that it is 162 able to match similar colors. We have also derived a distribution of color matching scores to assess the significance of the observed color matching scores. The shape matching score 06(01, (92) is based on the distance between two sets of edge points representing the objects of interest. We have determined that, among the class of distance measures based on the Hausdorff distance, the modified Hausdorff distance (MHD) is best for matching two objects based on their edge points. We have shown that MHD has the following desirable properties: (i) its value increases mono- tonically as the amount of difference between the two sets of edge points increases, (ii) it is robust to outlier points that might result from segmentation errors, and (iii) it does not discard any of the edge points. The MHD was applied to edge images of several real objects that were segmented from both outdoor and indoor scenes. We observed that MHD is better than the other distance measures for the purpose of object matching. We have also determined a baseline distribution for MHD under two different null hypotheses: (i) random point pattern hypothesis and (ii) random vehicle hypothesis. These null distributions were used to determine the significance of the observed values of MHD. Such a statistical analysis is necessary in order to avoid detecting false matches which could occur by chance. Chapter 6 Vehicle Matching System This chapter is concerned with the implementation of the feasibility study for travel time estimation using vehicle color and shape matching. We have described the algorithms for image segmentation in Chapter 4 and the techniques for color and shape matching in Chapter 5. We will explain how we can identify matching vehicle pairs based on the color and edge matching scores. Once a pair has been found, the travel time for that particular vehicle is simply the difference between the time instants when it was observed by the two cameras. The important feature of the proposed matching system is that it does not need to identify all the matching pairs. Transportation experts have determined that the average travel time can be reliably determined with one matching pair per minute. The most important requirement. however, is that the system does not detect any false matches at all, as they might drastically affect the estimation of the travel time and other statistics. Thus, we propose very conservative threshold values to decide if a potential matching pair should be accepted. This chapter is divided into 5 sections. Section 6.1 proposes a machine vision system to segment the vehicles in the lane closest to the camera and identify matching 163 164 pairs in the two databases to estimate the average travel time between two points on a road network. 111 Section 6.2. we propose a computationally efficient method which does not rely on an elaborate segmentation of the front vehicle to identify the matching pairs. In Section 6.3. the results obtained with the complete vision system and with the "real-time” vision system are compared to the results that were obtained manually by examining the video tapes. 
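Before the individual modules are described, the data flow of Figure 6.1 can be summarized in a structural sketch. The record fields mirror the four features of Section 6.1.1, while the class name, the score-function signatures, and the acceptance callback are illustrative placeholders; the actual scores and the conservative decision rule are given in Eqs. (6.1)-(6.6).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VehicleRecord:
    # The four features stored for every segmented vehicle (Section 6.1.1).
    time_s: float        # observation time T, in seconds
    vclass: str          # sedan, pickup truck, hatchback, station wagon/4x4, or van
    color_hist: object   # 3D color histogram H
    edges: object        # edge point set E

def match_exit_vehicle(v2: VehicleRecord,
                       entry_db: List[VehicleRecord],
                       t_avg: float, t_win: float,
                       color_score: Callable, edge_score: Callable,
                       accept: Callable) -> Optional[VehicleRecord]:
    # Time-based indexing: keep entry vehicles observed inside the window
    # [T2 - Tave - Tc, T2 - Tave + Tc].
    lo, hi = v2.time_s - t_avg - t_win, v2.time_s - t_avg + t_win
    candidates = [v1 for v1 in entry_db if lo <= v1.time_s <= hi]
    # Vehicle class-based indexing: never match, e.g., a van to a sedan.
    candidates = [v1 for v1 in candidates if v1.vclass == v2.vclass]
    if not candidates:
        return None
    # Color and shape matching; `accept` implements the conservative
    # decision rule of Section 6.1.3 and returns a matched record or None.
    scored = [(color_score(v2, v1), edge_score(v2, v1), v1) for v1 in candidates]
    return accept(scored)
```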
Section 6.4 is concerned with on- site implementation issues of the system and proposes an algorithm to automatically select the interesting frames, i.e., frames that contain a vehicle in the center of the image. Finally, in Section 6.5 we address the robustness of our system to images obtained under various weather conditions and time of the day. 6.1 Travel Time Estimation This section describes an integrated system for travel time estimation. The complete system is shown in Figure 6.1. At an entry point, the vehicles passing in front of the camera are segmented and stored in a database T1 = {1211, . . . ,V,l\,l} of vehicles observed by the first camera. At the exit point. the same segmentation algorithm is applied to an observed vehicle V? to extract the four features which are used by the matching process, namely, observation time, vehicle class, color histogram, and edge point set. The matching system includes an indexing step to avoid computing matching scores between V} and all the vehicles in T1. In this section, we present the four features which are used to represent a vehicle, the indexing processes (time-based and vehicle class-based), and the matching strategy which decides if a potential pair should be accepted. Segmentation of the front vehicle Vehicle Attributes 1 Observation time Vehicle class Color histogram Edge point set L J Exit Points Segmentation of the front vehicle Database of Vehicles Observed by the First Camera Figure 6.1: Matching vehicles for travel time estimation. Vehicle Attributes Observation time Vehicle class Color histogram Edge point set Reduced . Time Class Database Color Shape indexing indexing matching matching Matching Scores Decision lndentify one Reject all matching pair the matches 166 6.1.1 Feature extraction \Ve apply the model—based segmentation algorithm described in Section 4.2 to extract the vehicle in the lane closest to the camera. The database T1 of vehicles observed by the camera at the entry point contains four pieces of information per vehicle V1.1' = 1.. . . . N1. These same four features are also extracted at the exit point to 1 represent the vehicle Vii-2. These features are: 0 Observation time T: time at which the vehicle passes in front of the camera. 0 Vehicle class V: one of the five vehicle categories (sedan, pickup truck, hatch- back. station wagon or 4x4. and van) determined by the segmentation algo- rithm. 0 Color histogram H: this histogram is built using the pixels which belong to the vehicle. However, since all the vehicles have gray looking windows, there is no need to include all the pixels in the computation of the histogram. Only the pixels inside the polygon defined by the vertices {(X1,Y1), (X2,Y2), (X3,Y3), (X6, 32;), (X7. Y7), (X8, Ed} on the vehicle template (from Figure 4.9) are used to compute the color histogram. 0 Edge point set E: only the edge points obtained from the Canny edge detector that are inside the vehicle template are retained to define the edge point set for shape matching. We also keep the boundary of the deformable template as part of the edge point set. Figure 6.2 shows an example of the four features that were extracted for a given vehicle. Figure 6.2(a) shows the input image. The result of the model-based seg- mentation algorithm identifying the front vehicle as a van is shown in Figure 6.2(b). In Figure 6.2(c), we show the polygonal region inside the vehicle that is retained to compute the 3D color histogram. For display purposes. 
the color histogram is shown 167 (C) (d) (9) Figure 6.2: Feature extraction: (a) input image; (b) vehicle class (van); (c) area used to compute the 3D color histogram; (d) color histogram; (e) edge point set. as 3 marginal (R, G, B) histograms in Figure 6.2(d). Finally, Figure 6.2(e) shows the edge point set which includes the edges inside the vehicle template (detected by the Canny operator) and the boundary of the template. This vehicle was observed at 4:42:11 pm on June 1, 1994. 6.1.2 Time-based and vehicle class-based indexing A vehicle V? observed by the second camera needs to be matched to all the vehicles in the database T1 = {V}, . . . , vim}- However, travel time information can be used to reduce the number of potential matches. We call this step time-based indexing. Let T.we denote the current average travel time which is available from the pre- C21 (0 .\'( SCt 168 viously established matches. If vehicle V? is observed by the second camera at time T2 then it should have been observed by the first camera approximately at time . 1 '1? — 'I’ave. Time-based indexing consists of determining a time window [T12 — Tave — T“ T? — Tave‘l' 71] of size 2T, to select a subset of vehicles in T1. This subset of vehicles T} C T1 is called the time-indexed database for vehicle V3. In addition to time-based indexing, we can also perform an indexing based on vehicle class. There is no need to match a van to a sedan. Since we know the vehicle category V, of vehicle V3. we can select only those vehicles in T} which have the same category V}. These vehicles form an even smaller subset T} C T} C T1 of potential matches, called the time-indexed and class-indexed database for vehicle V3. 6.1.3 Color and shape matching After time and vehicle class indexing, we are left with a small subset T} of N,- 3 N1 candidate matches for vehicle V?. For each pair (V?,V,-1J),V,-lj E T},j = 1, . . . , N,- we compute the color matching score SC(V,-2, V3) and the edge matching score D‘TV}, V31). Note that the color matching score is a similarity measure whereas the edge matching score is a distance measure. Furthermore, the color matching score takes a value between 0 and 1 while the edge matching score is a positive number usually greater than 1. We define the total matching score as follows: [SW-2,1” )1? t 2 l _ ' ‘J 5 my“) ‘ Dom-1,) ' (6.1) The N, total matching scores for all pairs (V3, V5), j = 1, . . . , N, are computed and will be used to determine if there is a match for vehicle V? in T1. Note that S‘(V?, V5) is a similarity measure which can take any positive value. The vehicles in T3 are ranked in decreasing order of total matching scores. The 01hr Tlal 169 vehicle V3“) defined by: V} = arg max 5"(V3.V,~l)) (6.2) ‘1” 13153} is called the potential matching candidate for VB. In order to accept the match (V,2.V,-lm). a number of conditions are tested. In Chapter 5. we determined from the cumulative distributions of the matching scores that a color matching score greater than 0.7 and an edge matching score less than 3.0 were quite unusual. Recall that we do not have to identify all true matches as long as we are sure that no false matches will be accepted. In order to be certain about a match, the matching scores for the vehicle V3“) have to be significantly better than the matching scores for the other vehicles in T}. We define 3", D6. and 5‘ as follows: 5 : SC(V,-2,V-‘ )[scnfiv1 )— 5°(V,-2,V-1 )] (6.3) 1(1) ‘(ll t(’2) . 
”(1),-2,11} )—De(v,?,vg ) De: ‘2’ ‘1’ (6.4) D€(V,-2,V-‘ ) '(x) s‘=5'*(v3.v.1 )[s'(v3,v.‘ )—.s"(v,."~’,v.‘ )j. (6.5) ‘(u 1(1) H2) These coefficients show the relative difference between the potential matching candi- date and the next candidate. Based on these coefficients and the matching scores, the match (Viz, Vii”) is accepted if: (SC(VE.V1 ) > 0.7) and (mi/3.12.1 ) < 3.0) ‘m '(1) (6.6) (3‘: > 0.25) or (D6 > 1.0) or (5" > 0.1) , otherwise, the match (V3, Vii”) is rejected. The first condition ensures that the poten- tial match is actually a good match and the second condition verifies that at least one 170 of the matching scores is significantly better than the other matching scores. meaning that the observed vehicle is unusual in shape or in color. 6.1.4 Experiments The proposed vision-based matching system was tested on the databases Saginaw- Coolidge and Saginaw-llaslett. Images contained in these databases are shown in Appendix A. Two Sharp Slim Cam V'L-L6‘2t' camcorders were placed 1.9 miles apart on Saginaw Street in East Lansing, Michigan on June 1, 1994 around 4:45 pm. A seven minute segment (4:39 pm to 4:46 pm) of the video tape from the first camcorder was digitized. The image sequences were segmented to extract a total of 149 vehicles (database T1). A six minute segment (4:43 pm to 4:49 pm) of the video tape from the second camcorder was also digitized and segmented to yield 138 vehicles (database T2). The model-based segmentation algorithm was tested on these 287 vehicles (149 + 138). As mentioned in Chapter 4. 91.6% of these vehicles were correctly classified. The matching algorithm was applied to all the 138 vehicles in T2. Note that the database Tl contains 149 vehicles. The cameras were placed 1.9 miles apart and the legal speed limit on Saginaw Street is 35 mph. Therefore, we initialized the average travel time between the two cameras to 3:30 min. We set the time interval T, to 40 s to identify vehicles whose speed is between ‘27 and 40 mph. After time-based indexing, the number of candidate matches per vehicle in T2 is reduced to between 17 and 43 (‘25 on the average, a reduction of 83%). After vehicle class-based indexing, this number is further reduced to between 1 and 23 (9 on the average). Thus, the indexing process reduces the size of the database T1 by 94%. on the average. A total of 8 matching pairs were identified. We manually verified that all the 171 identified matching pairs were indeed true matches and no false matches had been detected. Referring to the databases in Appendix A. the vision-based matching sys- tem identified the following matches ( 13.41), (36.61). (60.89). (64.81). (67.86). (78.99). (84.107), and (121,129). where the first number is the vehicle number in the Saginaw- Haslctt database and the second number is the vehicle number in the Saginaw-Coolidge database. 6.2 “Real-Time” Experiments The vehicle matching system for travel time estimation presented in the previous section involves two techniques that are quite time consuming, namely, the segmen- tation of the front vehicle using deformable templates, and the computation of the edge matching score using pose clustering for image registration. In this section, we present a much simpler approach. For segmentation purposes, a rectangular box which approximately contains the vehicle in the lane closest to the camera is extracted using motion information alone. 
Then in the matching process, color matching is used as an additional indexing step to reduce the number of edge matching scores that have to be computed. We will first describe the image seg- mentation technique used to locate the front vehicle. \Ve then present the matching strategy. 6.2.1 Locating the front vehicle In order to simplify the segmentation process and reduce the processing time, we only used the computationally efficient motion segmentation algorithm. The segmentation result consists of a rectangular box which approximately contains the front vehicle. As before. three frames f1, f2, and f3 containing the moving vehicle(s) are extracted 172 at times t1, {-2 ((1 + _V). and t3 (£1 + 12A,). respectively from the image sequence, where At is approximately 1/305. Figures 6.3(a)-(c) show an example of the three consecutive frames. These frames are blurred using a 3x3 averaging mask to produce images f1, f2, f3. The difference image is computed as described in Section 3.1 (Eq.(3.14)): Du.» = lfl(i.j) -f2(i.j)l x mm) 4312.211 (6.7) The baseline of the moving pixels in the difference image corresponds to the base- line of the front vehicle. This baseline can be easily extracted by summing the rows of the difference image. h(i)=Zd(i,j), i=1,..-V. (6.8) We assume that the image size is N x 111. The row B corresponding to the baseline is located at the first sharp transition in h(27) when it is examined from below (i = N). The topline of the moving pixels reasonably approximates the top of the front vehicle. This topline T is located at the first sharp transition in h(i) when it is examined from above (i = 1). See Figure 6.3(e) for extraction of B and T. The left and right sides of the front vehicle are estimated using the same technique. The pixels in a small neighborhood above the baseline are summed in columns. B 11(1):: (“11]“). j: 1.....111 (6.9) The left boundary (L) and right boundary (R) of the vehicle are located at the left and right sharp transitions in the plot of t(j) (see Figure 6.3(f)). The four constants (B, T, L, and R) define a rectangular box. The final rectangular box for the front vehicle in the three frame image sequence (Figures 6.3(a)-(c)) is shown in Figure 6.3(g). (f) 9. "" Figure 6.3: "Real-time" extraction of the front vehicle: (a) frame 1: (b) frame (c) frame 3: ((1) difference image; (e) profile of the row sums h(i); (f) profile of the column sums v(j) around the baseline: (g) location of the front vehicle. 174 (bl Figure 6.4: Feature extraction for the front vehicle in Figure 6.3: (a) color patch; (b) color histogram; (c) edges inside the rectangular box. 6.2.2 Feature extraction Once the front vehicle has been segmented, features describing its shape and color are extracted, so that this vehicle can be matched to other vehicles in the database. We also store the time at which the vehicle was observed. The 3D color histogram used by the color matching module is supposed to describe the color of the vehicle. Unfortunately, if we use the pixels inside the rectangular box, then some of these pixels might belong to the background. To avoid this problem, we build the 3D color histogram using a small rectangular patch extracted near the location of the doors of the vehicle. We use a color patch of fixed size (20x50 pixels) and a fixed location above the baseline of the front vehicle. Figure 6.4(a) shows the location of the color patch for the vehicle in Figure 6.3 and Figure 6.4(b) shows the 3 marginal (R, G, B) histograms. 
The shape of the vehicle is characterized by the set of edge points extracted by the Canny edge detector applied to those pixels that are inside the rectangular box. Figure 6.4(c) shows the edges for the vehicle in Figure 6.3. Figure 6.5 shows segmentation results for eight additional image sequences. The Figure 6.5: Example of segmentation of the front vehicle and extraction of the color patch for different scenes. Only the middle frame of the three-frame sequence is shown. rectangular area corresponding to the front vehicle and the location of its color patch are shown superimposed on the middle frame of the sequence. These results show that this rather simple algorithm is able to localize the front vehicle fairly accurately. The CPU times (on a Sparcstation 20) for various modules are as follows: 2 s for blurring the three 256 X 256 intensity frames: 0.3 s to compute the difference image; 0.001 s to extract the rectangular box and the color patch. The Canny edge detector takes an additional 2 s. The algorithm can be easily implemented on off—the—shelf hardware to achieve real—time processing. 6.2.3 Matching strategy We also propose a simpler matching strategy compared to the one described in Sec- tion 6.1. Time-based indexing is performed in exactly the same manner. The time- indexed match database T11 for vehicle V? is determined by retaining those vehicles 176 in T1 that were observed in the time interval [7:2 — Tm, — 71., T,2 — Ta", 4» Tc]. In this simpler algorithm we did not segment the vehicle. but only extracted a rectangular box around it. Therefore. we can neither determine the vehicle class nor use it for indexing. Thus, only time-based indexing is performed. Edge matching is a computationally expensive operation which should be avoided whenever possible. Recall from Chapter 5 that, on a. Sparcstation 20, color matching takes 0.05 8 while edge matching takes between 5 and 11 5. Therefore, we use color matching as an indexing step. For every vehicle V? E T2, color matching is used to compute the N, color matching scores associated with vehicles in the set T}. The color matching scores are ranked in decreasing order and the potential matching candidate for V? is the vehicle V3“) whose color matching score is maximum: V-l ) = arg max SC(V,-2,V,IJ),V,IJ E T} (6.10) ‘0 1g 151v, The decision tree approach shown in Figure 6.6 was implemented for matching. To avoid computing edge matching if unnecessary, a first decision is based on the color matching score only. If SC(V,2,V,1(1)) is less than 0.7, the potential match is .1 ,m) is sufficiently immediately rejected and the decision process is aborted. If SC(V,2, V large and larger than the other color matching scores, then the potential match is accepted and the decision process is aborted. Mathematically, this translates into SC(V,?,V}(1)) > 0.8 and 5.. > 0.05, (6.11) where 5'6 = 56023.12? )[SC(V,2,V-1 )—S°(V,2,V-‘ )]. (6.12) ‘(ll 1(1) '(21 In these cases, there is no need to compute any edge matching scores. If no decision 5 (V2. V‘ 1(1) )>0.7 yes 56(1)}, vgm) > 0.8 and SC > 0.05 ilO I10 REJECT ACCEPT SC(V,2, V30) > 0.7 and De(V,-2,V,-la) < 3.0 (V2, V-1 ) and S‘(V,2,V,-la) > 0.3 SC > 0.25 or DC >1.0 or St > 0.1 2(1) /\ ACCEPT (V2, V1) REJECT Figure 6.6: Decision tree for matching two vehicles. can be made based on the color matching scores alone. the matching system computes the edge matching scores. We decided to keep only the first two vehicles V3“) and V.1 as candidates for edge matching. 
The total matching scores are computed as in '(21 Section 6.1: l-S'CW2» V}, )l2 Dee/‘3. vs.) (6'13) SW3. V3,) = and the potential matching candidate for V,2 is the vehicle Vila whose total matching score is maximum. V-lazargmax{5t(V,2.V-l ).s‘(v3.v.1 )} (6.14) ‘ ’11) ‘(21 v}, = argmin{s‘(v3.v.‘ ),s‘(v3,v.1 )}. (6.15) ‘(11 ‘(21 We define SC. D6, and S" as follows: s = sen/3, v30) [52(V,2,V,‘a) — 511231239] (6.16) . Dew-2.12.1) — 01112.12!) Dc : z n, t in . n Dal/taut) (611) 9 = S‘M. v.1.) [S’(V.2.V.-‘.) — 5"(VE.V.-‘.>] (6.18) These coefficients show the relative difference between the potential matching candi- date and the other candidate. Based on these coefficients and the matching scores, the match (V,-2, V30) is accepted if: (sat/3,1230) >0.7) and (DC(V.-2,V.-‘.) <30) and (5‘(V.21V3.) >03) (619) (SC > 0.25) or (D2 > 1.0) or (52 > 0.1) otherwise, the match (V2,V,-la) is rejected. The first condition ensures that the po- 179 tential match is actually a good match and the second condition verifies that at least one of the matching score is significantly better. 6.2.4 Experimental results This “real-time" vision-based matching system was tested on the same two databases of 149 and 138 vehicles as the complete vision system described in Section 6.1. As before, time—based indexing is able to eliminate 83% of the candidate matches, on the average. For 81 of the 138 vehicles in T2. the decision was based on the color matching scores alone. In 74 of thoses cases, the largest color matching score was too low and the potential match was immediately rejected, and in 7 cases, the largest color matching score passed the decision rule 2 and the potential match was immediately accepted. In the remaining 57 cases. the two edge matching scores had to be computed, and only in 2 of those cases, the potential match was accepted. Thus. a total of 9 matching pairs were identified. Again. we manually verified that all the 9 identified matching pairs were indeed true matches and no false matches had been detected. Referring to the databases in Appendix A. this “real-time” vision-based matching system identified the pairs (24,46), (30,53), (44,71), (47,74), (64,81), (78,99), and (129,143) based on color information alone, and (23,45) and (41,67) using both color and shape information, where the first number is the vehicle number in the Saginaw- Haslett database and the second number is the vehicle number in the Saginaw—Coolidge database. Even though it appears that this simpler system performs better than the more complex system described in Section 6.1, the thresholds are not as discriminatory and might lead to a few false matches. Also. since in some cases, only color matching is 180 used. the confidence in the resulting matching pairs is not as high. In other words. this matching system should be used with caution. 6.3 Comparison with Manual Matching For performance evaluation. we compared the results of the vision-based systems to those that were obtained manually. The manual travel time estimation is done by viewing the video tapes and identifying identical vehicles. This strategy is quite strenuous (it took a graduate student 12 minutes to analyze the 6 minutes of video tape), and does not lead to identification of all the matching pairs. Humans tend to match unusual looking vehicles. In particular. in our database, the human matcher identified a stretch limousine. a mobile home, a red and grey 4 wheel drive, and an old station wagon with wooden sidepanels. 
Recall that travel time estimation does not require us to identify all the matching pairs, but only to match. on the average, one vehicle per minute so that the travel time estimate is reliable. Our human matcher identified 11 pairs. Seven of these pairs are in our databases T1 and T2: (15.40). (33.54). (35.60), (55.75). (57.77), (71.92), and (77.98). The other matched vehicles are not in our database because these vehicles were not located in the lane closest to the camera. Our segmentation algorithm only extracts vehicles in the lane closest to the camera. To compare the manually matching results with vision-based matching results, we plotted the travel time of a matched vehicle as a function of the time T,2 at which it was observed by the second camera. In Figure 6.7, the matches found manually are shown by the solid line, the matches found by the complete machine vision system are shown by the dashed line, and the matches found by the “real-time” machine vision system are shown by the dotted line. It can be seen that these three plots — Manual Matching ----- Com lete VISIQn System --~-- ‘Rea -Time‘ V1510n Syste Travel time (minzs) 4,200 2:30 3:00 3:30 4:00 4:30 5:00 L :44pm 4:43pm 4:40pm 4:42pm 4:43pm 4:43p" Time of day (hrzmin) Figure 6.7: Comparison of travel time estimates. are in agreement suggesting that the performance of our machine vision systems is comparable to the human observer who estimated the average travel time by looking at the video tapes. 6.4 Digitization and Storage Issues In this section. we address the problems of image digitization and storage. As men- tioned earlier, we collect the images using camcorders and store them on a video tape. Currently, our laboratory is not equipped with an automatic video digitizer, so we manually digitized all the images shown in the database in Appendix A. We will describe a technique which could be used to automatically digitize the video tapes if additional equipment is purchased. All the images used in this thesis were digitized manually as follows. Our Labo— ratory owns a color framegrabber, digicolor, made by DataCube. installed in a Sun 4/330 Sparcstation. This framegrabber is designed to digitize one frame at a time and works best with still images captured with a CCD camera. To handle video data, 182 we connect the VCR to digicolor. A frame can then be stored by pausing the VCR at a specific location on the tape, thus creating a. still image. and writing the digitized image on to a disk. To build our database of vehicle images. we manually selected the video frames which contained a vehicle in the lane closest to the camera and this vehicle was approximately in the middle of the image. Since our motion segmentation algorithm requires three frames of the moving object, we also manually digitized the preceding frame and the following frame. Manual frame selection and digitization is extremely time consuming because we need to examine the tape, pause the VCR when an interesting frame is encountered, select the three frames of interest and write the digitized frames to the disk one at a time. To show that the above operations can be done automatically, given the appro- priate equipment, we have manually digitized a sequence of 380 frames from the Saginaw-Coolidge tape. Every other frame in the sequence is shown in Figures A.29 to A.32 in Appendix A. This sequence contains 7 vehicles which passed in front of the camera and were located in the lane closest to the camera. 
We have developed an algorithm to automatically identify the frames where these vehicles are centered in the image. Given a three-frame sequence of intensity images containing moving vehicles, we have seen in Sections 4.2 and 6.2 that we can reliably extract the base- line of the vehicle in the lane closest to the camera. Our automatic frame extraction algorithm is also based on the same idea. The three frames f1, f2, and f3 containing the moving vehicle(s) are blurred using a 3x3 averaging mask to produce images f1, f2, and f3. The difference image is computed as described in Section 3.1: 190.1): |f1(i.j)—f2(i,j)l x mm) — f3(i,j)|- (6.20) 183 The vehicles of interest are located in the lane closest to the camera whose extent in the :r-direction (£1 and {2) can be determined if the geometry of the camera is known. The difference image is summed along the columns between {1 and {2: 52 23.0) = Z 00.11. (6.21) i=€1 Obviously, Dv(j) will be larger if there is movement between {1 and {-2 in column j. The left side Bl and right side 82 of the vehicle are determined by locating the column positions where D.,( j ) becomes larger than 0.0517,,ma,x when it is scanned from left to right and right to left, respectively. Figures 6.8(a)-(c) show an example three-frame sequence. The difference image is shown in Figure 6.8(d). The boundaries of the lane closest to the camera are shown in Figure 6.8(e) and the estimation of B, and 32 is shown in Figure 6.8(f). To avoid storing the same vehicle twice, we keep track of the state of the system. We assume that we start in the state NONE when there are no vehicles in the lane closest to the camera. The three-frame sequence contains a moving vehicle in the lane closest to the camera if 19...... > 1000, and (6.22) 82 — B; < 0.8Nc01, where NCO] is the number of columns in the image. The first condition verifies that there is something moving in the lane closest to the camera and the second con- dition verifies that there is only one vehicle. The center position of the vehicle is approximated by B] + 32 F: 2 (6.23) 184 (0 Figure 6.8: Estimation of the location of the baseline and length of the vehicle: (a) frame 1: (b) frame 2; (c) frame 3; (d) difference image; (e) location of the lane boundaries: (f) column summation and location of B, and 82. 185 If the center position of the vehicle shifts to the right side of the image (B > 1' c0I/2), then we detect a vehicle. store the three frames of interest and enter the state VEHICLE. When the center position of the vehicle occupies the left side of the image (B < 1 col/2). it means that the previous vehicle has left the field of view and a new vehicle has entered the field of view. We then switch back to the NONE state. The complete algorithm for automatic frame selection is summarized in Figure 6.9. Figure 6.10 shows an example of this algorithm. In Figure 6.10(a), ’Dmmu = 14, meaning that there are no vehicle in the lane closest to the camera. Also the state of the system is NONE. In Figure 6.10(b) as a vehicle is entering the field of view, We can see that ”Dam“ is getting larger. In Figure 6.10(c), the center position of the vehicle (indicated by the tick mark) just shifted to the right side of the image. Thus, the state of the system becomes VEHICLE and the 3-frame sequence is stored on a disk. Nothing is changed when this vehicle leaves the field of view as shown in Figure 6.10(d). We can detect that there are no vehicles in the lane closest to the camera in Figure 6.10(e) because ’Dv,max = 8. 
Then, in Figure 6.10(f), a new vehicle enters the field of view. It can be detected because Dun“ becomes larger than 1000 and the center position is in the left side of the image, thus the state of the system becomes NONE. The results of this algorithm for the complete 380 frame sequence are shown in Figure 6.11. In this figure, the time is represented along the y axis. Each horizontal line represents the length 82 — Bl for that particular frame. Thus, the slanted gray areas show the position of the vehicle in the image as it passes in front of the camera. The areas in light gray represent the times when there are no vehicles in the lane closest to the camera. The black segments show the times at which the three frames containing the moving vehicle centered in the image were extracted. It can be seen that we were able to correctly identify the 7 vehicles in the video sequence of 380 186 Input: A sequence of consecutive frames begin Input two consecutive frames f1 and f2 Blur f1 and f2 using a 3x3 averaging mask to produce f1 and f2 Status = NONE while new frames are available Input the new frame f3 Blur f3 to produce f3 mj=la°ncholA A ‘ . 29.0) = 25;..If1(i.j)— mm x 020.11— mm endfor v,max = maxigjchol vaj) £9;j=1,...,NCO1 if Dv(j) > 0.05Dvmax then Bl =j break 6 endfor fig '= Nco.,...,1 _i_f’D.,(j) > 0.0513max then 82 =j break endfor F = (Bl + B2l/2 i_f (Dv,max > 1000) m (82 — 81 < 0-8Ncol) then if (Status = NONE) m (B > Neel/2) the}; Status = VEHICLE Store f1. f2. and f3 en_di_f i_f (Status = VEHICLE) and (B < Ned/2) then Status = NONE endif . . . . fit-f2 fzf—f3 fl(_f2 f2<—f3 Mile end . Output: A list of 3-frame sequences containing the moving vehicles of interest Figure 6.9: Algorithm to automatically select interesting frames. 187 (a) frame 40 (b) frame 55 (c) frame 64 Dv.max : l4 Dv.max = 1624 D1;.max : 3293 StatezNONE StatezNONE State=VEHICLE (d) frame 75 (e) frame 85 (f) frame 93 Dv.max : 8037 Dv.max = 8 Dv.max = 6526 State=VEHICLE State=VEHICLE State=NONE Figure 6.10: Example of the automatic frame selection algorithm. 188 Column posiuon Figure 6.11: Results of the automatic frame selection algorithm. frames. Note that the four vehicles in the other lane in frames 151 to 197 and 253 to 301 were ignored because they are not located in the lane closest to the camera. Using this system, we can retain the 21 useful frames and discard the 359 redundant frames. The time requirement for this algorithm on a Sparcstation 20 for a sequence of 256x256 images is as follows: 0.7 s to blur the new frame. 0.006 s to evaluate DUO), 0.0001 s to decide if the three current frames should be kept or discarded, and 0.16 s to transfer the last two frames into the first two buffers so that the next frame can be stored in the third buffer. Note that the operations required by this algorithm (3x3 averaging, frame subtraction and multiplication, column summing, and relational and logical operations) are very simple. This algorithm can easily be implemented in real—time using an architecture sim- ilar to the MaxVideo 20 system available from Datacube. This architecture offers analog-digital modules, real-time convolvers, storage modules, arithmetic operations, etc. Also, depending on the application needs, specific modules and interconnection 189 patterns can be defined. Such a system could be used to process the frames as they are captured by the camera or played by the VCR and decide if the current set of three frames should be kept or discarded. 
By examining our video tapes, we have noticed that there is a gap of at least half a second and up to 3 seconds between each vehicle. Thus, if the current three frames have to be kept for further processing, the storage device should have enough time to store them. In summary, we have demonstrated that given the appropriate equipment, it is possible to automatically select the frames where the vehicles are centered in the images and store them for further processing in real time as the vehicles pass in front of the camera. 6.5 Effect of Weather Conditions and Time of Day In this section, we will study the robustness of our algorithms under different weather conditions and time of the day. All the databases of images which we acquired are shown in Appendix A. We have extensively tested our system on the databases Saginaw-Coolidge and Saginaw-Haslett. Images in these databases were captured on Saginaw Street in East Lansing, Michigan on a bright and sunny afternoon in June. The images in databases Maple-East and Maple- West were captured on Maple Road in Troy, Michigan on an overcast afternoon around 2:40 pm on May 24, 1993. Figure 6.12 shows some examples of color segmentation results. It can be seen that the results are very good. The algorithm was able to identify regions of different colors such as the wood sidepanel on the station wagon Figure 6.12(a), the drawing and lettering on the van in Figure 6.12(c), and the different color stripes on the van in Figure 6.12(f). The two vehicles in Figure 6.12(d) were well separated. Figure 6.13 shows some examples of model-based segmentation (using deformable 190 10 Figure 6.12: Color segmentation results on the Maple databases. 1.91 (a) Hatchback (b) Station wagon (c) Sedan ((1) Van (6) Pickup truck (f) Station wagon Figure 6.13: Model-based segmentation results on the Maple databases. templates) results. It can be seen that the algorithm was able to segment the front vehicle from the stationary background and the other moving vehicles even for unusual looking vehicle such as the covered truck in Figure 6.13(f). Only 6 vehicles in the database of 80 vehicles were misclassified. Thus, we attained a correct classification rate of 92.5%. The database Engineering was acquired on the Michigan State University campus on an overcast morning around 10:20 am on February 25. 1993. Some snow is present on the ground. Figures 6.14 and 6.15 show a few of the color segmentation and model- based segmentation results on the Engineering database. The color segmentation results are quite good. In Figures 6.14(a), however, the roof of the white vehicle was merged with the pile of snow in the center of the road because the two regions are the 192 Figure 6.14: Color segmentation results on the Engineering database. same of color and the Canny edge detector could not detect any boundaries between them. The model-based segmentation results are also good. Only one vehicle in this database of 11 vehicles was misclassified, resulting in a correct classification rate of 91%. The database Fog was acquired on the Michigan State University campus on an early foggy morning around 8:40 am on January 12, 1995. Because of the intensity of the fog, most of the stationary background is very blurry and pale. Figure 6.16 shows some color segmentation results. The algorithm was able to isolate the different colored regions in the cabin of the pickup truck in Figure 6.16(c) and in the hatchback in Figure 6.16(b). 
Due to the fog, however, the segmentation of the regions further away from the camera is not very good and the boundaries between the different regions are not accurate. The model—based segmentation results are very good as can 193 (a) Hatchback (b) Sedan (c) Van Figure 6.15: Model-based segmentation results on the Engineering database. be seen from Figure 6.17. The vehicles are close enough to the camera (around 20 feet) that they do not appear blurred by the fog and their edges are still sharp. Thus, the model—based segmentation process gives satisfactory performance. Only 1 vehicle in this database of 15 vehicles was misclassified. yielding a correct classification rate of 93%. The three databases Dusk were captured on the Michigan State University campus around 5:40 pm on January 10, 1995. At dusk, it is not possible to use the high speed shutter. because the CCD matrix does not receive enough light. For the other databases, we had used a speed of 1/1000 5 to avoid motion blurring. When we tried to use this speed at dusk. the camcorder warned us that it was not receiving enough light and we obtained the image shown in Figure A.24 in Appendix A. It can be seen that this image is extremely dark and image processing algorithms cannot be successfully applied to it. The images in the database Dusk-I shown in Figure A.22 were captured with a shutter speed of 1/100 5 and the images in the database Dusk-2 shown in Figure A.23 were captured at a regular speed of 1/30 s. It can be seen that the vehicles in these images are blurry due to their motion. We tried to apply the color segmentation algorithm to these images and obtained 194 Figure 6.16: Color segmentation results on the Fog database. Figure 6.17: Model-based segmentation results on the Fog database. 195 (b) Figure 6.18: Color segmentation results on the Dusk database. the results shown in Figure 6.18. It can be seen that these segmentation results are extremely poor. Very few homogeneous regions are obtained; the vehicles appear very dark and are merged with the region corresponding to the road. From these segmentation results. it is quite obvious that color or shape matching would not perform very well. The images are so dark that all the vehicles appear to be black and the Canny edge detector is not able to separate them from the background. We also applied the model-based segmentation algorithm and the results are shown in Figure 6.19. These results are a little bit more satisfactory than the color segmen- tation results. The vehicles in Figure 6.19(a) and 6.19(b) are reasonably segmented. However. about half of the vehicles in the database were very poorly segmented as can be seen from the example in Figure 6.19(c). To acquire images at night, special camcorders or cameras have to be used. Differ- 196 (b) Figure 6.19: Model-based segmentation results on the Dusk database. ent cameras present different sensitivities to light. The minimum amount of illumina- tion necessary to obtain a good image is measured in lux. Typical CCD cameras need a minimum illumination of 10 to 15 lux. A lux is defined as a unit of illumination equal to the direct illumination on a planar surface which is at a distance of one meter from a uniform point source of one candle. Color cameras, in particular, are generally less sensitive to low light than black and white cameras. A large number of black and white cameras are commercially available which are used for security purposes and require a minimum illumination of only 0.05 to 1 lux. 
Camcorders usually need a minimum illumination of 5 to 10 lux. The other important feature that a camera should have for IVHS applications is the high speed shutter option. We have seen that images which were not captured with a shutter speed of at least 1/1000 5 show motion blur around the edges of the moving objects. This motion blur makes the detection of edges more difficult and our algorithms are not as robust. Most cameras and camcorders have shutter speeds between 1/30 s and 1/2000 5. Some camcorders even offer speeds of up to 1/10000 s. The sensitivity of the camera varies with the shutter speed. For example. the camcorder that we used requires a minimum illumination of 0.7 lux at normal speed. 197 330 lux at 1/100 8, and 300 1th at 1/1000. This is the reason why it is difficult to use the high speed shutter at dusk when the brightness is only about 10 lux. For comparison. the brightness on a clear day at noon is 100,000 lux, 32.000 lux on a cloudy day at noon, and 1.000 lux on a clear day one hour before sunset. One has to be careful in choosing the right camera based on high speed shutter requirements. A video frame is composed of two fields containing the odd lines and the even lines, respectively. A regular shutter opens for 1 / 60 s to capture the first field and for another 1/ 60 s to capture the second field, thus achieving an acquisition rate of 30 frames per second. If the shutter is set at a higher speed, then the video rate is still 30 frames per second, but the following two scenarios are possible: (i) the shutter opens to get the first field, waits for about 1/60 s, and opens again to get the second field; (ii) the shutter opens to capture the entire frame and waits for about 1 / 30 s to capture the next frame. It is obvious that the second option will produce much better results. Using the first option, the edges of the moving objects will be jagged and more complex image processing operations will be needed. The Sharp Slim Cam camcorder that we used implements the second option, like most camcorders who have a high speed shutter option. CCD cameras, however, usually implement the first option, but sometimes offer both. 6.6 Summary We have designed a complete vision—based matching system to identify vehicle pairs in two databases of images captured at different locations of a road network. We were able to identify 8 correct matching pairs in a 6—minute video sequence. To reduce processing time, we also implemented a faster matching system which does not rely on the segmentation of the front vehicle. The confidence of the matching 198 pairs produced by this faster system is not as high because color information alone is used most of the time. We have compared the performance of the two vision- based matching systems with a manual approach for travel time estimation. We have observed that the three approaches produce comparable results. Both of our machine vision systems are able to identify a sufficient number of matching pairs to reliably estimate the travel time. We have addressed issues related to an on-site implementation of the machine- vision system. We have demonstrated that we can automatically select the frames of interest, namely the frames where the vehicle in the lane closest to the camera is centered in the image. Finally, we have experimented with different databases of images captured at different times of the day and under different weather conditions. 
We have shown that night vision is a problem and that our algorithms do not per- form well if the images are too dark. All the other experiments yielded very good segmentation and matching results. Chapter 7 Summary and Future Research This chapter summarizes the results of the research reported in this thesis. Several directions for future research are also outlined. 7. 1 Summary We have developed a complete matching system to estimate the travel time between two points in a road network. Vehicles are observed from the side of the road and the system determines if the same vehicle passes in front of two different cameras positioned at different locations in the road network. Using a sufficient number of matching pairs. the system can estimate the average travel time between the two cam- era locations. The system is divided into two main stages. The image segmentation stage consists of extracting the vehicle of interest from the stationary background and the other moving vehicles. The matching stage examines two segmented vehicles and determines if the two vehicles are identical based on color and shape information. We have proposed four image segmentation algorithms. The motion segmentation technique uses simple frame subtraction, multiplication, thresholding, and morpho— logical closing. It requires only 3 frames containing the moving objects in the video 199 200 sequence. We have demonstrated that this algorithm identifies the moving areas in the image reasonably well. The color segmentation algorithm combines region information obtained by a split- and-merge technique for color images and edge information. we have shown that fusing these two pieces of information produces better results in the sense that the boundaries between the regions are more accurate. These algorithms are rather general and can be applied to images in many appli- cation domains. To solve the segmentation problem in our specific domain, we have developed two model-based segmentation algorithms. The first algorithm fuses the results of color segmentation and motion segmen- tation to obtain an accurate contour of the moving object in the image. We use an active contour approach to enforce continuity and smoothness constraints on the contour of the moving object. This algorithm was tested on 136 images of moving objects (vehicles, walking people, robot, ..) captured indoors and outdoors. In all the cases, we obtained very good segmentation results. The segmentation algorithm based on motion and color information alone has the following disadvantage. If there are multiple moving objects in the image, then they are grouped into a single connected component. Additional model information must be incorporated to extract only the object of interest. In our application domain, we have defined a generic model of a vehicle and applied the paradigm of deformable templates to segment the vehicle in the lane closest to the camera from the back- ground and other moving vehicles. This algorithm has a number of novel features. A generic polygonal model of the vehicle was defined by examining a database of learning examples. This defines a set of constraints on the relative positions of the vertices of the polygonal shape. This same technique could be applied to any other polygonal shape. The imaging model fuses motion and edge information by including 201 two terms in the expression of the energy function. 
We force the contour of the de- formable template to align itself with the edges of the front vehicle by comparing its direction to the gradient direction of the pixels in the image. In addition to segment- ing the front vehicle. this technique is also able to classify it. into one of five categories (sedan. pickup truck. hatchback. station wagon or 4x4. and van). This model—based segmentation algorithm was applied to 393 images of vehicles and we obtained very good segmentation results with approximately 92% correct classification rate. For matching purposes. we have suggested to use both color and shape informa- tion. Color matching is performed by computing a similarity measure based on a comparison of the 3D color histograms for the two objects. Shape matching is per- formed by comparing the edges of the two objects. We have proposed a modified Hausdorff distance which is best suited for object matching. It is not as sensitive to noise as the Hausdorff distance and it does not ignore some of the edge points in the object like the ranked Hausdorff distances. We have demonstrated the properties of this measure on a number of synthetic and real images. In addition, we use pose clustering to register the two edge images and determine the translation and scaling parameters before computing the modified Hausdorff distance. We have tested the performance of the entire matching system on a database of 287 images (149 vehicles were observed by the first camera and 138 vehicles were observed by the second camera). We have shown how image acquisition time and vehicle class can be used to index the database and avoid computing matching scores for every pair of vehicles. The thresholds on the matching scores to determine if two vehicles are identical were determined by building a cumulative distribution of the matching scores under the null hypothesis of randomness. We were able to identify 8 matching pairs and verified that they were indeed true matching pairs. We also proposed a faster implementation of the complete matching system. This 202 implementation does not use, an accurate segmentation of the front vehicle but only extracts a rectangular box around it. It also uses color matching as an indexing process to avoid computing unnecessary edge matching scores. The two machine vision implementations were compared to a manual approach which consists of looking at the video tapes and isolating unusually looking vehicles to identify matching pairs. All three approaches produced similar results. Thus, we have shown that a machine vision approach to automatically estimate the average travel time between two points in a road network is feasible. We have also addressed the problem of on-site implementation by designing an algorithm which automatically selects the frames of interest in the video sequence. namely the frames where there is a vehicle in the lane closest to the camera and it is approximately centered in the image. We also briefly examined the effect of different weather conditions and time of the day by processing images that were captured during sunny afternoons. overcast mornings and afternoons, foggy mornings, and at dusk. Our algorithms could handle all tested conditions except dusk. mostly because our camcorder was not sensitive enough to low illumination conditions. 7 .2 Future Research There are a number of research issues which could be addressed in the future. The robustness of the color histogram matching module should be investigated further. 
We have tested the performance of the entire matching system on a database of 287 images (149 vehicles were observed by the first camera and 138 vehicles were observed by the second camera). We have shown how image acquisition time and vehicle class can be used to index the database and avoid computing matching scores for every pair of vehicles. The thresholds on the matching scores used to decide whether two vehicles are identical were obtained by building a cumulative distribution of the matching scores under the null hypothesis of randomness. We were able to identify 8 matching pairs and verified that they were indeed true matching pairs.

We also proposed a faster implementation of the complete matching system. This implementation does not use an accurate segmentation of the front vehicle but only extracts a rectangular box around it. It also uses color matching as an indexing step to avoid computing unnecessary edge matching scores. The two machine vision implementations were compared to a manual approach which consists of looking at the video tapes and isolating unusual-looking vehicles to identify matching pairs. All three approaches produced similar results. Thus, we have shown that a machine vision approach to automatically estimating the average travel time between two points in a road network is feasible.

We have also addressed the problem of on-site implementation by designing an algorithm which automatically selects the frames of interest in the video sequence, namely the frames where there is a vehicle in the lane closest to the camera and it is approximately centered in the image. We also briefly examined the effect of different weather conditions and times of day by processing images that were captured during sunny afternoons, overcast mornings and afternoons, foggy mornings, and at dusk. Our algorithms could handle all tested conditions except dusk, mostly because our camcorder was not sensitive enough to low illumination.

7.2 Future Research

There are a number of research issues which could be addressed in the future. The robustness of the color histogram matching module should be investigated further. In our experiments, the orientation of the sun at the two camera locations was approximately the same. We should study how the color matching score is affected if the traffic is not monitored on a straight road and the orientation of the light on the side of the car is drastically different. We suspect that there might be problems only on bright sunny days when the sun is low (early morning or late afternoon). This is a problem related to color constancy, and the most important question is: "Do we need to apply a color constancy algorithm, as described in Section 3.2.3, before computing the color matching score?"

In Chapter 4 we proposed an algorithm to segment the vehicle in the lane closest to the camera from the stationary background and the other moving vehicles using the deformable templates paradigm. We defined an energy function based on motion information, edge magnitude and directionality, and constraints on the shape of the deformable template. We have seen that this energy function is quite complicated and contains a large number of local minima. We chose a simulated annealing approach to perform the minimization and used a geometric temperature cooling schedule to limit the total number of iterations. However, as mentioned earlier, we had problems in choosing the parameters of the cooling schedule (starting temperature and maximum number of iterations). We had to run the simulated annealing algorithm for several sets of parameters and choose the schedule which returned the minimum energy value to segment the front vehicle. For practical purposes it is not feasible to implement this strategy on-site: we would need a large number of computers working in parallel on the same problem (for example, a cluster of workstations), and the resulting cost of the travel time estimation system would be exorbitant. Thus, the minimization algorithm for image segmentation using deformable templates, and its computational requirements, should be investigated further. There exist other approaches for minimizing functions with a large number of local minima. Genetic algorithms [76] have the advantage that the search proceeds in parallel in many areas of the search space, but they are extremely slow. Multistart approaches [10] also explore different areas of the search space; the approximate location and length of the vehicle, determined using motion information, can be used to intelligently select the different starting positions. One can also explore dynamic programming approaches [35], branch-and-bound techniques [19], iterated conditional modes [9, 41], and other cooling schedules for simulated annealing [171].
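The following schematic shows the structure of a simulated annealing loop with a geometric cooling schedule of the kind discussed above. The functions energy() and propose() and the parameter values t_start, alpha, and n_iter are hypothetical placeholders; in the actual system the energy would be the deformable-template energy and the proposal would perturb the template parameters.

import math
import random

def anneal(x0, energy, propose, t_start=1.0, alpha=0.95, n_iter=1000):
    """Generic simulated annealing with a geometric cooling schedule.

    At iteration k the temperature is t_start * alpha**k; moves that increase
    the energy by delta are accepted with probability exp(-delta / t).
    """
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start
    for _ in range(n_iter):
        cand = propose(x)                # problem-specific random perturbation
        e_cand = energy(cand)
        delta = e_cand - e
        if delta < 0 or random.random() < math.exp(-delta / t):
            x, e = cand, e_cand          # accept the move
            if e < best_e:
                best_x, best_e = x, e
        t *= alpha                       # geometric cooling
    return best_x, best_e

Running this loop from several starting configurations and keeping the lowest-energy result mirrors the multistart strategy mentioned above.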
The rules which constrain the shape of the deformable template were derived manually by examining a set of vehicles for each class. Recent publications [53, 116] tackle the problem of automatically deriving the prior probability on the shape of the template by determining an average image or prototype shape of the template from a set of training examples. A similar technique might be used to automatically establish the rules which constrain the shape of our polygonal templates. It might also be useful to include a priori information about the frequency with which each vehicle class is typically observed in real road scenes; in particular, we know that there are more sedans than pickup trucks.

During the matching stage, we compute matching scores based on color and edge information, along with a confidence measure on those matching scores. When computing the estimate of the travel time, it might be a good idea to weight the contribution of individual vehicle travel times in favor of those that are more likely to be correct matches.

The notion of color-based indexing needs further examination. In Section 6.2, we used color indexing to compute only two edge matching scores per vehicle. However, this idea does not really obey the definition of indexing, because the entire database still needs to be examined to compute the color matching scores and determine the best two candidates. Usually, indexing a database involves a number of comparisons of the order of log2 N, where N is the size of the database, and in the ideal case the number of comparisons should be independent of the size of the database. This was indeed the case when we used vehicle class indexing in Section 6.1: the database was divided into five bins, one for each vehicle class. The sensation of color, however, is much fuzzier than the definition of vehicle category. Since color is defined continuously, the division of the vehicle database into bins of different vehicle colors is not trivial. A better approach might be to use a tree structure to index the database using color information (a sketch of the underlying histogram comparison is given at the end of this chapter).

Finally, all the image segmentation algorithms and matching strategies proposed in this thesis use a number of user-defined parameters. Most computer vision algorithms suffer from this drawback. However, performance evaluation and automatic selection of tuning parameters is now beginning to receive some attention [156]. It would be desirable to develop an object matching system that can learn the "optimal" set of parameters from the given training images and the user requirements in terms of error probabilities.

In summary, we identify the following research topics for future study:

• Color constancy,
• Minimization algorithm for the deformable templates,
• Automatic derivation of a prototype shape,
• Color indexing,
• Automatic selection of tuning parameters.
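As a concrete reference for the color matching and color-based indexing discussed above, the sketch below builds a 3D color histogram and compares two histograms by histogram intersection, in the spirit of Swain and Ballard [172]. The bin count and the choice of intersection as the similarity measure are illustrative assumptions, not necessarily the measure used in the matching system.

import numpy as np

def color_histogram(image_rgb, bins=8):
    """Normalized 3D color histogram of an RGB image (H x W x 3, values 0-255)."""
    pixels = image_rgb.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical color distributions."""
    return np.minimum(h1, h2).sum()

A tree-structured index over such histograms (or over a few summary statistics of them) is one way to avoid comparing a query vehicle against every entry in the database.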
APPENDICES

Appendix A

Image Databases

This appendix shows the entire database of images on which the algorithms described in this thesis were tested. These images were acquired using either a GE HQ 9-9806 or a Sharp Slim Cam VL-L62U camcorder. The tapes were played on the VCR in the PRIP Laboratory and individual frames were digitized using the digicolor system, consisting of a Datacube frame grabber board connected to a Sun 4/330 Sparcstation. Table A.1 gives the date, time, location, camcorder type, shutter speed, and number of images for each database. Figures A.1 to A.28 show the middle frame of the three-frame color sequence. The database Sequence was digitized from the same video tape as the database Saginaw-Coolidge. A sequence of 380 frames was digitized to test the algorithm for frame selection described in Chapter 5. Every other frame of this database is shown in Figures A.29 to A.32.

Table A.1: Description of the databases (location, date, time, camcorder, shutter speed, and number of images).

Figure A.1: Database Saginaw-Coolidge: vehicles 1 to 20.
Figure A.2: Database Saginaw-Coolidge: vehicles 21 to 40.
Figure A.3: Database Saginaw-Coolidge: vehicles 41 to 60.
Figure A.4: Database Saginaw-Coolidge: vehicles 61 to 80.
Figure A.5: Database Saginaw-Coolidge: vehicles 81 to 100.
Figure A.6: Database Saginaw-Coolidge: vehicles 101 to 120.
Figure A.7: Database Saginaw-Coolidge: vehicles 121 to 140.
Figure A.8: Database Saginaw-Coolidge: vehicles 141 to 149.
Figure A.9: Database Saginaw-Haslett: vehicles 1 to 20.
Figure A.10: Database Saginaw-Haslett: vehicles 21 to 40.
Figure A.11: Database Saginaw-Haslett: vehicles 41 to 60.
Figure A.12: Database Saginaw-Haslett: vehicles 61 to 80.
Figure A.13: Database Saginaw-Haslett: vehicles 81 to 100.
Figure A.14: Database Saginaw-Haslett: vehicles 101 to 120.
Figure A.15: Database Saginaw-Haslett: vehicles 121 to 138.
Figure A.16: Database Maple-East: vehicles 1 to 20.
Figure A.17: Database Maple-East: vehicles 21 to 40.
Figure A.18: Database Maple-West: vehicles 1 to 20.
Figure A.19: Database Maple-West: vehicles 21 to 40.
Figure A.20: Database Engineering.
Figure A.21: Database Fog.
Figures A.22 to A.24: Dusk databases.
Figure A.25: Database People-In.
Figure A.26: Database People-Out.
Figure A.27: Database Objects-In.
Figure A.28: Database Objects-Out.
Figure A.29: Database Sequence: odd frames 1 to 95.
Figure A.30: Database Sequence: odd frames 97 to 191.
Figure A.31: Database Sequence: odd frames 193 to 287.
Figure A.32: Database Sequence: odd frames 289 to 379.

BIBLIOGRAPHY

Bibliography

[1] Mobility 2000 presents intelligent vehicles and highways systems, 1990 summary. Texas Transportation Institute, College Station, TX, 1990.
[2] O. Abou Khaled, N. Le Fort, M. Hassoun, and C. Laugier. Real-time decision system for the Prolab II demonstrator. In Proc. IEEE Intelligent Vehicles Symposium, pages 485–491, Paris, France, 1994.
[3] A. Amini, T. Weymouth, and R. Jain. Using dynamic programming for solving variational problems in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9):855–867, 1990.
[4] Y. Amit, U. Grenander, and M. Piccioni. Structural image restoration through deformable templates. Journal of the American Statistical Association, 86(414):376–387, 1991.
[5] C. Baghdassarian, H. Lange, H. Sahli, and C. Laurgeau. Recognition of arrows in the environment of road markings. In Proc. IEEE Intelligent Vehicles Symposium, pages 219–224, Paris, France, 1994.
[6] D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, NJ, 1982.
[7] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proc. International Joint Conference on Artificial Intelligence, 1977.
[8] R. Basri and S. Ullman. The alignment of objects with smooth surfaces. In Proc. IEEE International Conference on Computer Vision, pages 482–488, Tarpon Springs, FL, 1988.
[9] J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48(3):259–302, 1986.
[10] B. Betro and F. Schoen. Sequential stopping rules for the multistart algorithm in global optimisation. Mathematical Programming, 38:271–286, 1987.
[11] S. Beucher and M. Bilodeau. Road segmentation and obstacle detection by a fast watershed transformation. In Proc. IEEE Intelligent Vehicles Symposium, pages 296–301, Paris, France, 1994.
[12] J. R. Beveridge, J. Griffith, R. R. Kohler, A. R. Hanson, and E. M. Riseman. Segmenting images using localized histograms and region merging. International Journal of Computer Vision, 2:311–347, 1989.
[13] I. Biederman. Recognition by components: A theory of human image understanding. Psychological Research, 94(2):115–147, 1987.
[14] A. Blake. Boundary conditions for lightness computation in Mondrian world. Computer Vision, Graphics and Image Processing, 32:314–327, 1985.
[15] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.
[16] Transportation Research Board. Advanced vehicle and highway technologies. Technical Report No. 232, National Research Council, Washington, D.C., 1991.
[17] R. Bolles and P. Horaud. 3DPO: A three-dimensional part orientation system. International Journal of Robotics Research, 5(3):3–26, 1986.
[18] R. C. Bolles and R. A. Cain. Recognizing and locating partially visible objects: The local-feature-focus method. International Journal of Robotics Research, 1(3):57–82, 1982.
[19] G. Brassard and P. Bratley. Algorithmics: Theory and Practice. Prentice Hall, Englewood Cliffs, 1988.
[20] G. Brelstaff and A. Blake. Computing lightness. Pattern Recognition Letters, 5:129–138, 1987.
[21] R. Brooks. Model-based three-dimensional interpretations of two-dimensional images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):140–150, 1983.
[22] G. Buchsbaum. A spatial processor model for object colour perception. Journal of the Franklin Institute, 310(1):1–27, 1980.
[23] C. Cafforio and F. Rocca. Methods for measuring small displacements of television images. IEEE Transactions on Information Theory, 22:573–579, 1976.
[24] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[25] M. Cappello, M. Campani, and A. Succi. Detection of lane boundaries, intersections and obstacles. In Proc. IEEE Intelligent Vehicles Symposium, pages 284–289, Paris, France, 1994.
[26] I. Catling and B. McQueen. Road transport informatics in Europe - major programs and demonstrations. IEEE Transactions on Vehicular Technology, 40(1):132–140, 1991.
[27] H. Chen and T. Huang. Maximal matching of 3D points for multiple-object motion estimation. Pattern Recognition, 21(2):75–90, 1988.
[28] J.-L. Chen, G. C. Stockman, and K. Rao. Recovering and tracking pose of curved 3D objects from 2D images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 233–239, New York, NY, 1993.
[29] S.-W. Chen and A. K. Jain. Object extraction from laser radar imagery. Pattern Recognition, 24(6):587–600, 1991.
[30] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.
[31] J. K. Cheng and T. S. Huang. Image registration by matching relational structures. Pattern Recognition, 17(1):149–159, 1984.
[32] P. Chou and C. Brown. The theory and practice of Bayesian image labelling. International Journal of Computer Vision, 4:185–210, 1990.
[33] C.-C. Chu and J. K. Aggarwal. The integration of image segmentation maps using region and edge information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(12):1241–1252, 1993.
[34] S. D. Cochran and G. Medioni. 3-D surface description from binocular stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):981–994, 1992.
[35] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.
[36] Y. Coutance. La Couleur en Vision par Ordinateur: Application à la Robotique. PhD thesis, Laboratoire d'Automatique et d'Analyse des Systèmes du CNRS, Université Paul Sabatier, Toulouse, France, 1991.
[37] M. Cremer and H. Keller. A new class of dynamic methods for the identification of origin-destination flows. Transportation Research, 21B(2):117–132, 1987.
[38] A. Cumani, A. Guiducci, and P. Grattoni. Image description of dynamic scenes. Pattern Recognition, 24(7):661–673, 1991.
[39] L. Davis. Shape matching using relaxation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(1):60–72, 1979.
[40] E. D. Dickmanns, R. Behringer, D. Dickmanns, T. Hildebrandt, M. Maurer, F. Thomanek, and J. Schiehlen. The seeing passenger car 'VaMoRs-P'. In Proc. IEEE Intelligent Vehicles Symposium, pages 68–73, Paris, France, 1994.
[41] R. C. Dubes, A. K. Jain, S. G. Nadabar, and C. C. Chen. MRF model-based algorithms for image segmentation. In Proc. International Conference on Pattern Recognition, pages 808–814, Atlantic City, NJ, 1990.
[42] M.-P. Dubuisson and A. K. Jain. Object contour extraction using color and motion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 471–476, New York, NY, 1993.
[43] M.-P. Dubuisson and A. K. Jain. 2D matching of 3D moving objects in color outdoor scenes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 887–891, Seattle, WA, 1994.
[44] M.-P. Dubuisson and A. K. Jain. Fusing color and edge information for object matching. In Proc. IEEE International Conference on Image Processing, volume III, Austin, TX, 1994.
[45] M.-P. Dubuisson and A. K. Jain. A modified Hausdorff distance for object matching. In Proc. International Conference on Pattern Recognition, pages 566–568, Jerusalem, Israel, 1994.
[46] M.-P. Dubuisson and A. K. Jain. Contour extraction of moving objects in complex outdoor scenes. International Journal of Computer Vision, 14:83–105, 1995.
[47] M.-P. Dubuisson, A. K. Jain, and W. C. Taylor. Segmentation and matching of vehicles in road images. In Transportation Research Record, number 1412, pages 57–63, Washington, DC, 1993.
[48] M.-P. Dubuisson, A. K. Jain, and W. C. Taylor. A vision-based vehicle matching system. In Proc. IEEE Intelligent Vehicles Symposium, pages 266–271, Paris, France, 1994.
[49] M.-P. Dubuisson, S. Lakshmanan, and A. K. Jain. Vehicle segmentation using deformable templates. Technical Report CPS-94-65, Michigan State University, East Lansing, MI, 1995.
[50] J. Duncan and T.-C. Chou. On the detection of motion and the computation of optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(3):346–352, 1992.
[51] W. Efenberger, Q.-H. Ta, L. Tsinas, and V. Graefe. Automatic recognition of vehicles approaching from behind. In Proc. IEEE Intelligent Vehicles Symposium, pages 57–62, Detroit, MI, 1992.
[52] E. Eldridge. Truckers go high-tech. Lansing State Journal, May 2, 1993.
[53] R. Epstein and A. Yuille. Training a general purpose deformable template. In Proc. IEEE International Conference on Image Processing, volume 1, pages 203–207, Austin, TX, 1994.
[54] R. Ervin. An American observation of IVHS in Japan. Technical report, University of Michigan, Ann Arbor, MI, 1991.
[55] S. Estable, J. Schick, F. Stein, R. Janssen, R. Ott, W. Ritter, and Y.-J. Zheng. A real-time traffic sign recognition system. In Proc. IEEE Intelligent Vehicles Symposium, pages 213–218, Paris, France, 1994.
[56] M. Etoh and Y. Shirai. Segmentation and 2D motion estimation by region fragments. In Proc. IEEE International Conference on Computer Vision, pages 192–199, Berlin, Germany, 1993.
[57] M. Etoh, Y. Shirai, and M. Asada. Contour extraction by mixture density description obtained from region clustering. In Proc. European Conference on Computer Vision, pages 24–32, Santa Margherita Ligure, Italy, 1992.
[58] T. J. Fan, G. Medioni, and R. Nevatia. Recognizing 3D objects using surface descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1140–1157, 1989.
[59] F. Farrokhnia and A. K. Jain. A multi-channel filtering approach to texture segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 364–370, Maui, HI, 1991.
[60] O. D. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects. International Journal of Robotics Research, 5(3):27–52, 1986.
[61] C. Fennema and W. Thompson. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301–315, 1979.
[62] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1):77–104, 1990.
[63] P. J. Flynn. Saliencies and symmetries: Towards 3D object recognition from large model databases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 322–327, Urbana, IL, 1992.
[64] P. J. Flynn and A. K. Jain. BONSAI: 3-D object recognition using constrained search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(10):1066–1075, 1991.
[65] P. J. Flynn and A. K. Jain. 3D object recognition using invariant feature indexing of interpretation tables. CVGIP: Image Understanding, 55(2):119–129, 1992.
[66] D. Forsyth. A novel approach to color constancy. In Proc. IEEE International Conference on Computer Vision, pages 9–18, Tarpon Springs, FL, 1988.
[67] T. Fraichard, M. Hassoun, and N. Lefort. An electronic co-pilot: the ProLab II demonstrator. In Proc. European Prometheus Workshop on Collision Avoidance, Nürtingen, Germany, 1992.
[68] E. Francois and P. Bouthemy. Multiframe-based identification of mobile components of a scene with a moving camera. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 166–172, Maui, HI, 1991.
[69] G. Funka-Lea and R. Bajcsy. Vision for vehicle guidance using two road cues. In Proc. IEEE Intelligent Vehicles Symposium, pages 126–131, Detroit, MI, 1992.
[70] P. Gader, B. Forester, M. Ganzberger, A. Gillies, B. Mitchell, M. Whalen, and T. Yocum. Recognition of handwritten digits using template and model matching. Pattern Recognition, 24(5):421–432, 1991.
[71] E. B. Gamble, D. Geiger, T. Poggio, and D. Weinshall. Integration of vision modules and labelling of surface discontinuities. IEEE Transactions on Systems, Man, and Cybernetics, 19:1576–1581, 1989.
[72] D. Geman and S. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.
[73] D. Geman, S. Geman, C. Graffigne, and P. Dong. Boundary detection by constrained optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):609–628, 1990.
[74] S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM Journal on Control and Optimization, 24(5):1031–1043, 1986.
[75] B. Gidas. The Langevin equation as a global minimization algorithm. In E. Bienenstock, F. Fogelman, and G. Weisbuch, editors, Disordered Systems and Biological Organization, volume F20 of NATO ASI Series, pages 321–326. Springer-Verlag, 1986.
[76] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
[77] U. Grenander, Y. Chow, and D. M. Keenan. Hands: A Pattern Theoretic Study of Biological Shapes. Springer-Verlag, New York, 1991.
[78] U. Grenander and M. I. Miller. Jump-diffusion processes for abduction and recognition of biological shapes. Technical report, Monograph of the Electronic Signals and Systems Research Laboratory, Washington University, 1991.
[79] W. E. L. Grimson. Computational experiments with a feature based stereo algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7:17–34, 1985.
[80] W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric Constraints. MIT Press, Cambridge, MA, 1990.
[81] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(4):469–482, 1987.
[82] G. Healey and T. Binford. A color metric for computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 10–17, Ann Arbor, MI, 1988.
[83] D. J. Heeger. Optical flow using spatiotemporal filters. International Journal of Computer Vision, 2(4):279–302, 1988.
[84] V. von Holt. Tracking and classification of overtaking vehicles on Autobahnen. In Proc. IEEE Intelligent Vehicles Symposium, pages 314–319, Paris, France, 1994.
[85] B. Horn. Determining lightness from an image. Computer Graphics and Image Processing, 3:277–299, 1974.
[86] B. K. P. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
[87] S. Horowitz and T. Pavlidis. Picture segmentation by a tree traversal algorithm. Journal of the Association for Computing Machinery, 23(2):368–388, 1976.
[88] Y. Z. Hsu, H.-H. Nagel, and G. Rekers. New likelihood test methods for change detection in image sequences. Computer Vision, Graphics and Image Processing, 24:73–106, 1984.
[89] M. K. Hu. Visual pattern recognition by moment invariants. IEEE Transactions on Information Theory, 8:179–187, 1962.
[90] L. J. Hubert and J. Schultz. Quadratic assignment as a general data-analysis strategy. British Journal of Mathematical and Statistical Psychology, 29:190–241, 1976.
[91] L. Hurvich and D. Jameson. An opponent-process theory of color vision. Psychological Review, 64(6):384–404, 1957.
[92] D. Huttenlocher and S. Ullman. Recognizing solid objects by alignment. International Journal of Computer Vision, 5(2):195–212, 1990.
[93] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863, 1993.
[94] K. Ikeuchi and T. Kanade. Automatic generation of object recognition programs. Proceedings of the IEEE, 76(8):1016–1035, 1988.
[95] T. Ito and K. Yamada. Preceding vehicle and road lanes recognition methods for RCAS using vision system. In Proc. IEEE Intelligent Vehicles Symposium, pages 85–90, Paris, France, 1994.
[96] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
[97] A. K. Jain and F. Farrokhnia. Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24:1167–1186, 1991.
[98] A. K. Jain and P. J. Flynn, editors. 3D Object Recognition Systems. Elsevier, Amsterdam, 1993.
[99] A. K. Jain and R. Hoffman. Evidence-based recognition of 3D objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):783–802, 1988.
[100] A. K. Jain and S. G. Nadabar. MRF model-based segmentation of range images. In Proc. IEEE International Conference on Computer Vision, pages 667–671, Osaka, Japan, 1990.
[101] R. Jain, W. N. Martin, and J. K. Aggarwal. Segmentation through the detection of changes due to motion. Computer Graphics and Image Processing, 11:13–34, 1979.
[102] R. Jain and H.-H. Nagel. On the analysis of accumulative difference pictures from image sequences of real world scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):206–214, 1979.
[103] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 2:321–331, 1988.
[104] R. Kasturi and R. C. Jain. Computer Vision: Principles. IEEE Computer Society Press, 1991.
[105] J. K. Kearney, W. B. Thompson, and D. L. Boley. Optical-flow estimation: An error analysis of gradient-based methods with local optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(2):229–244, 1987.
[106] N. Kehtarnavaz, N. Griswold, and J. Lee. Visual control of an autonomous vehicle (BART) - the vehicle-following problem. IEEE Transactions on Vehicular Technology, 40(3):654–662, 1991.
[107] D. Kelly. Results of a field trial of the IMPACTS image processing system for traffic monitoring. In Proc. Conference on Vehicle Navigation and Information Systems, pages 151–167, Dearborn, MI, 1991.
[108] M. Kilger. A shadow handler in a video-based real-time traffic monitoring system. In Proc. IEEE Workshop on Applications of Computer Vision, pages 1060–1066, Palm Springs, CA, 1992.
[109] M. Kilger. Heavy traffic monitoring in real-time. In Proc. Scandinavian Conference on Image Applications, pages 1077–1083, Tromso, Norway, 1993.
[110] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671–680, 1983.
[111] G. Klinker, S. Shafer, and T. Kanade. A physical approach to color image understanding. International Journal of Computer Vision, 4:7–38, 1990.
[112] J. J. Koenderink and A. J. van Doorn. Internal representation of solid shape with respect to vision. Biological Cybernetics, 32(4):211–216, 1979.
[113] D. Koller, K. Daniilidis, T. Thorhallson, and H.-H. Nagel. Model-based object tracking in traffic scenes. In Proc. European Conference on Computer Vision, pages 437–452, Santa Margherita Ligure, Italy, 1992.
[114] D. Koller, N. Heinze, and H.-H. Nagel. Algorithmic characterization of vehicle trajectories from image sequences by motion verbs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 90–95, Maui, HI, 1991.
[115] D. Koller, J. Weber, and J. Malik. Towards realtime visual based tracking in cluttered traffic scenes. In Proc. IEEE Intelligent Vehicles Symposium, pages 201–206, Paris, France, 1994.
[116] K. F. Lai and R. T. Chin. Deformable contours: Modeling and extraction. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 601–608, Seattle, WA, 1994.
[117] S. Lakshmanan and D. Grimmer. A deformable template approach to detecting straight edges in radar images. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994.
[118] Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efficient model-based recognition scheme. In Proc. IEEE International Conference on Computer Vision, pages 238–249, Tarpon Springs, FL, 1988.
[119] E. H. Land. The retinex. American Scientist, 52:247–264, 1964.
[120] E. H. Land. The retinex theory of color vision. Scientific American, 237:108–128, 1977.
[121] E. H. Land. Recent advances in retinex theory. Vision Research, 26(1):7–21, 1986.
[122] E. H. Land and J. J. McCann. Lightness and retinex theory. Journal of the Optical Society of America A, 61(1):1–11, 1971.
[123] G. C. Lee. Reconstruction of Line Drawing Graphs from Fused Range and Intensity Imagery. PhD thesis, Michigan State University, 1992.
[124] H.-C. Lee. Method for computing the scene-illuminant chromaticity from specular highlights. Journal of the Optical Society of America A, 3(10):1694–1699, 1986.
[125] M. D. Levine. Vision in Man and Machine. McGraw Hill, 1985.
[126] D. G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:335–395, 1987.
[127] D. G. Lowe. Fitting parametrized three-dimensional models to images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(5):441–450, 1991.
[128] L. T. Maloney and B. A. Wandell. Color constancy: A method for recovering surface spectral reflectance. Journal of the Optical Society of America A, 3(1):29–33, 1986.
[129] K. V. Mardia, T. J. Hainsworth, and J. F. Haddon. Deformable templates in image sequences. In Proc. International Conference on Pattern Recognition, volume 2, pages 132–135, The Hague, The Netherlands, 1992.
[130] L. Matthies and A. Elfes. Integration of sonar and stereo range data using a grid-based representation. In Proc. IEEE International Conference on Robotics and Automation, pages 723–733, Philadelphia, PA, 1988.
[131] J. J. McCann, S. P. McKee, and T. H. Taylor. Quantitative studies in retinex theory. Vision Research, 16:445–458, 1976.
[132] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[133] P. G. Michalopoulos. Vehicle detection video through image processing: The Autoscope system. IEEE Transactions on Vehicular Technology, 40(1):21–29, 1991.
[134] D. L. Milgram. Region extraction using convergent evidence. Computer Graphics and Image Processing, 11:1–12, 1979.
[135] K. S. Miller, T. N. Harvey, P. W. Shuldiner, and C. Ho. Using video technology to conduct the 1991 Boston region external cordon survey. In Transportation Research Record, number 1412, pages 46–56, Washington, DC, 1993.
[136] M. I. Miller, A. Srivastava, and U. Grenander. Conditional-mean estimation via jump-diffusion processes in multiple target tracking/recognition. Submitted to IEEE Transactions on Signal Processing, 1994.
[137] H. P. Moravec. Sensor fusion in certainty grids for mobile robots. AI Magazine, 9(2):61–74, 1988.
[138] B.-A. Moshe. Methods to combine different data sources and estimate origin-destination matrices. In N. H. Gartner and N. H. M. Wilson, editors, Transportation and Traffic Theory, pages 459–481, 1987.
[139] H.-H. Nagel. Overview on image sequence analysis. In T. S. Huang, editor, Image Sequence Processing and Dynamic Scene Analysis, pages 2–39. Springer-Verlag, 1983.
[140] H.-H. Nagel. Image sequences - ten (octal) years - from phenomenology towards a theoretical foundation. In Proc. International Conference on Pattern Recognition, pages 1174–1185, Paris, France, 1986.
[141] H.-H. Nagel and W. Enkelmann. An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(5):565–593, 1986.
[142] S. Negahdaripour and A. K. Jain. Challenges in computer vision research; future directions of research. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 189–199, Urbana, IL, 1992.
[143] R. Nevatia and T. Binford. Description and recognition of complex-curved objects. Artificial Intelligence, 8:77–98, 1977.
[144] R. Ohlander, K. Price, and D. R. Reddy. Picture segmentation using a recursive region splitting method. Computer Graphics and Image Processing, 8:313–333, 1978.
[145] Y.-I. Ohta, T. Kanade, and T. Sakai. Color information for region segmentation. Computer Graphics and Image Processing, 13:222–241, 1980.
[146] S. Pankanti, A. K. Jain, and M. Tuceryan. On integration of vision modules. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 316–322, Seattle, WA, 1994.
[147] T. Pavlidis and Y.-T. Liow. Integrating region growing and edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(3):225–233, 1990.
[148] S. Peleg and H. Rom. Motion based segmentation. In Proc. International Conference on Pattern Recognition, pages 109–113, Atlantic City, NJ, 1990.
[149] A. P. Pentland. Perceptual organization and the representation of natural form. Artificial Intelligence, 28:293–331, 1986.
[150] E. Persoon and K. S. Fu. Shape discrimination using Fourier descriptors. IEEE Transactions on Systems, Man, and Cybernetics, 7:170–179, 1977.
[151] D. B. Phillips and A. F. M. Smith. Dynamic image analysis using Bayesian shape and texture models. In K. V. Mardia and G. K. Kanji, editors, Statistics and Images: 1, pages 299–322. Carfax Publishing Company, Abingdon, UK, 1993.
[152] G. Piccioli, E. de Micheli, P. Parodi, and M. Campani. Robust road sign detection and recognition from image sequences. In Proc. IEEE Intelligent Vehicles Symposium, pages 278–283, Paris, France, 1994.
[153] P. Planet, D. Pellerin, and J. Hérault. Classification automatique de véhicules aux péages autoroutiers. Recherche Transports Sécurité, pages 13–24, 1992.
[154] L. Priese, V. Rehrmann, R. Schian, R. Lakmann, and L. Bilderkennen. Traffic sign recognition based on color image evaluation. In Proc. IEEE Intelligent Vehicles Symposium, pages 95–100, Tokyo, Japan, 1993.
[155] N. S. Raja and A. K. Jain. Obtaining generic parts from range images using a multi-view representation. CVGIP: Image Understanding, 60(1):44–64, 1994.
[156] V. Ramesh, R. M. Haralick, X. Zhang, D. C. Nadadur, and K. Thornton. Automatic selection of tuning parameters for feature extraction sequences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 672–677, Seattle, WA, 1994.
[157] S. Ranade and A. Rosenfeld. Point pattern matching by relaxation. Pattern Recognition, 12:269–275, 1980.
[158] A. Rosenfeld and A. C. Kak. Digital Picture Processing. Academic Press, 1982.
[159] L. Saxton and G. Schoene. Federal IVHS program initiatives resulting from ISTEA of 1991. In Proc. International Congress on Transportation Electronics, pages 319–328, 1992.
[160] M. Schwarzinger, T. Zielke, D. Noll, M. Brauckmann, and W. von Seelen. Vision-based car following: detection, tracking, and identification. In Proc. IEEE Intelligent Vehicles Symposium, pages 24–29, Detroit, MI, 1992.
[161] I. K. Sethi and W. L. Brillhart. Traffic analysis including non-conforming behavior via image processing. In Proc. Conference on Vehicle Navigation and Information Systems, pages 193–201, Dearborn, MI, 1991.
[162] L. G. Shapiro, J. D. Moriarty, R. M. Haralick, and P. G. Mulgaonkar. Matching three-dimensional objects using a relational paradigm. Pattern Recognition, 17(4):385–405, 1984.
[163] A. Shio and J. Sklansky. Segmentation of people in motion. In Proc. IEEE Workshop on Visual Motion, pages 325–332, Princeton, NJ, 1991.
[164] S. P. Smith and A. K. Jain. Chord distribution for shape matching. Computer Graphics and Image Processing, 20:259–271, 1982.
[165] E. Sorensen and H. I. Christensen. Monitoring of road traffic. In Proc. Scandinavian Conference on Image Applications, pages 1085–1091, Tromso, Norway, 1993.
[166] F. Stein and G. Medioni. Structural indexing: Efficient 3-D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):125–145, 1992.
[167] J. Stewman and K. Bowyer. Creating the perspective projection aspect graph of polyhedral objects. In Proc. IEEE International Conference on Computer Vision, pages 494–500, Tarpon Springs, FL, 1988.
[168] G. C. Stockman. Object recognition and localization via pose clustering. Computer Vision, Graphics and Image Processing, 40:361–387, 1987.
[169] G. C. Stockman. Object recognition. In R. C. Jain and A. K. Jain, editors, Analysis and Interpretation of Range Images, pages 225–253. Springer-Verlag, 1990.
[170] G. Storvik. A Bayesian approach to dynamic contours through stochastic sampling and simulated annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10):976–986, 1994.
[171] P. N. Strenski and S. Kirkpatrick. Analysis of finite length annealing schedules. Algorithmica, 6:346–366, 1991.
[172] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11–32, 1991.
[173] Y. Tanaka, K. Kanayama, and H. Sugimura. Travel-time data provision system using vehicle license number recognition device. In Proc. IEEE Intelligent Vehicles Symposium, pages 353–358, Detroit, MI, 1992.
[174] W. B. Thompson. Combining motion and contrast for segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(6):543–549, 1980.
[175] W. B. Thompson and S. T. Barnard. Low-level estimation and interpretation of visual motion. Computer, 14:20–28, 1981.
[176] W. B. Thompson and T.-C. Pong. Detecting moving objects. In Proc. IEEE International Conference on Computer Vision, pages 201–208, London, England, 1987.
[177] C. Thorpe, M. H. Hebert, T. Kanade, and S. A. Shafer. Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3):362–373, 1988.
[178] C. E. Thorpe. Vision and Navigation: The Carnegie Mellon Navlab. Kluwer Academic Publishers, 1990.
[179] J. Ton and A. K. Jain. Registering Landsat images by point matching. IEEE Transactions on Geoscience and Remote Sensing, 27(5):642–651, 1989.
[180] B. Ulmer. VITA II - Active collision avoidance in real traffic. In Proc. IEEE Intelligent Vehicles Symposium, pages 1–6, Paris, France, 1994.
[181] B. A. Wandell. The synthesis and analysis of color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):2–13, 1987.
[182] J. Weng. Windowed Fourier phase: Completeness and signal reconstruction. IEEE Transactions on Signal Processing, 41(2):657–666, 1993.
[183] J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(8):806–825, 1992.
[184] M. Xie, L. Trassoudaine, J. Alizon, M. Thonnat, and J. Gallice. Active and intelligent sensing of road obstacles: Application to the European Eureka-Prometheus Project. In Proc. IEEE International Conference on Computer Vision, pages 616–623, Berlin, Germany, 1993.
[185] G. Xu. Is a single image sufficient as object model for recognition? Technical report, Osaka University, 1992.
[186] Y. Yakimovsky. Boundary and object detection in real world images. Journal of the Association for Computing Machinery, 23:599–618, 1976.
[187] S. Yalamanchili, W. N. Martin, and J. K. Aggarwal. Extraction of moving object descriptions via differencing. Computer Graphics and Image Processing, 18:188–201, 1982.
[188] K. Yoshino, M. Miyahara, K. Sakai, Y. Ito, and T. Negishi. In-tunnel traffic flow measuring system using ITV cameras. In Proc. Conference on Vehicle Navigation and Information Systems, pages 311–316, Dearborn, MI, 1991.
[189] A. Yuille, D. Cohen, and P. Hallinan. Feature extraction from faces using deformable templates. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 104–109, San Diego, CA, 1989.
[190] A. L. Yuille, P. W. Hallinan, and D. S. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99–111, 1992.
[191] J. Y. Zheng, M. Barth, and S. Tsuji. Qualitative route scene description using autonomous landmark detection. In Proc. IEEE International Conference on Computer Vision, pages 558–562, Osaka, Japan, 1990.
[192] Y.-J. Zheng, W. Ritter, and R. Janssen. An adaptive system for traffic sign recognition. In Proc. IEEE Intelligent Vehicles Symposium, pages 165–170, Paris, France, 1994.
[193] G. Zimmermann and R. Kories. What an autonomous robot can learn about its environment starting from scratch using image sequences? In A. K. Jain, editor, Real-Time Object Measurement and Classification, pages 215–226. Springer-Verlag, 1988.