LIBRARY
Michigan State
University

 

 

 

PLACE N RETURN BOX to roman this checkout ﬂom your rocord.
TO AVOID FINES Mum on or baton data duo.

DATE DUE DATE DUE DATE DUE
I
I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU I. An Afﬁrmative ActlaVEqnl Opportunity lmtttuion
W

m1

SEGMENTATION AND MATCHING OF MOVING VEHICLES FROM
COMPLEX OUTDOOR SCENES

By

Marie-Pierre Dubuisson

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements

for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1995

ABSTRACT

SEGMENTATION AND MATCHING OF MOVING VEHICLES FROM
COMPLEX OUTDOOR SCENES

By

Marie-Pierre Dubuisson

This thesis describes a machine vision system whichiis able to extract moving
vehicles from complex stationary backgrounds and match them based on color and
shape information. This vision system was developed in support of the Intelligent Ve—
hicle/ Highway Systems (IVHS) program whose main goals are to reduce the number
of accidents on the road, increase the trafﬁc density, and provide route guidance to
the travelers. The speciﬁc problem of interest in this thesis is to estimate the average
travel time between two points in a road network.

We propose an object matching system which includes a number of modules that
can be utilized in other application domains. In particular, we have developed four
image segmentation algorithms. The motion segmentation algorithm can identify the
moving areas in the image using a three—frame sequence. The color segmentation
algorithm combines edge information and a split-and-merge technique to identify
homogeneous regions of different colors in a color image. These two algorithms are
general purpose image segmentation algorithms.

The other two segmentation algorithms utilize object model information. We
propose an algorithm to extract an accurate contour of the moving object in the
image. This technique fuses color and motion information using the theory of active
contours to constrain the contour of the moving object to be smooth. The ﬁnal
image segmentation algorithm constrains the object to be a vehicle. We deﬁne a

generic model of a vehicle and use the theory of deformable templates to fuse motion

and edge information and obtain a segmentation and classiﬁcation of the vehicle of
interest.

Once the vehicles of interest have been separated from the stationary background
and the other moving vehicles, the matching strategy is based on color and shape
matching. For color matching, we compare the 3D color histograms of the two objects.
For shape matching, we propose a modiﬁed Hausdorff distance between point sets to
compare the edge images of the vehicles.

We evaluated the performance of the vision-based matching system on a database
of 287 vehicles. We were able to segment and correctly classify 91.6% of the vehicles.
We also demonstrated the efﬁcacy of time-based, vehicle class—based, and color-based
indexing schemes. We compared the machine vision results to the results that experts
obtained manually by looking at the video tapes and concluded that the machine

vision system is a feasible tool for travel time estimation.

To my husband Ray,
my parents Bernard and Francine,

and my sisters Sophie and Séverine

Acknowledgments

I would like to acknowledge all the people who have assisted me throughout my studies
at Michigan State University. I am particularly grateful to my advisor, Professor
Anil Jain, for his advice and encouragement. He provided me with numerous ideas,
suggestions and comments. I owe my great interest in research activities to him.
I would like to thank the other members of my dissertation committee. Professor
George Stockman always challenged me with new approaches and interesting ideas.
Professor William Taylor provided valuable knowledge about real-world problems
related to Intelligent Vehicle/Highway Systems. Dr. John Weng has always had
an open door to discuss ideas and general concepts in computer vision. Professor
Raoul LePage taught me the basics of stochastic processes and helped me understand
statistical approaches to image processing. I also wish to acknowledge Dr. Shridhar
Lakshmanan from the University of Michigan-Dearborn for his insight in the theory
of deformable templates. I would like to express my special thanks to late Professor
Richard Dubes who originally served on my dissertation committee. His wonderful
teaching ability, incredible patience, and extensive knowledge greatly inspired me in
pursuing research in pattern recognition.

My sincere thanks go to all the members of my family. I could not have ﬁnished
this dissertation without the love, patience. encouragement, and understanding of my
husband Ray. I would like to share all my academic accomplishments with him. I
also dedicate this thesis to my parents Bernard and Francine Dubuisson and my two
sisters Sophie and Séverine Dubuisson who sent me love and support from across the
Atlantic ocean through numerous phone calls and e-mail messages. I am also grateful

to my in-laws Marilyn and Ray Jolly, and Ann and Curt Blaszak for their love and

vi

encouragement. throughout the writing of this dissertation.

I would like to thank all my friends for their moral support. especially Barb Birch-
ler. Dave (.‘hesney, Barb Czerny. Maureen Galsterer. Edgar Kalns, Ron Sass, and
Steve Turner. also known as the “the lunch crowd”; Catherine Tamareille and Gary
\-"e1'cruysse. Carine Bourget and Marwan Kruntz, and Isabelle C'assagne and Mike
Demarte. Thanks also go to all the members of the Pattern Recognition and Image
Processing Laboratory for their interest and concern, especially Jinlong Chen, Yun-
tao Cui, Chitra Dorai, Hansye Dulimarta, Sally Howden, Qian Huang, Kalle Karu,
Jianchang Mao. Aniati Murni, Tim Newman, Sharathcha Pankanti, Nalini Ratha,
Dan Swets, Oivind Due Trier, Marilyn Wulfekuhler, Yu Zhong, and especially our
lab manager John Lees. I would also like to acknowledge the graduate students from
the Transportation Group in the Department of Civil and Environmental Engineer-
ing, especially Paul Dorothy, Sumeet Johar. Martin Kane, Gill Mosseri, and Sorowit
Naraupti for assisting me in collecting the data and analyzing the video tapes.

Financial support in the form of graduate research assistantships from the US.
Department of Transportation through the Great Lakes Center for Truck Transporta-
tion Research, the State of Michigan Department of Transportation, and the National

Science Foundation is gratefully acknowledged.

 

Contents

List of Figures
List of Tables

1 Introduction
1.1 Model-Based Object Recognition ....................
1.2 Object Matching .............................
1.3 Contributions of the Thesis .......................
1.4 Outline of the Thesis ...........................

2 Intelligent Vehicle/ Highway Systems and Applications
2.1 IVHS in the United States ........................

2.1.1 Technical committees .......................

2.1.2 Current operational tests .....................
2.2 IVHS in Europe ..............................
2.2.1 PROMETHEUS .........................
2.2.2 DRIVE ..............................
2.3 IVHS in Japan ..............................

2.4 Machine Vision for IVHS .........................
2.5 Travel time and Origin/Destination Matrices Estimation .......

2.6 Limitations of Machine Vision Systems .................

3 Image Segmentation

3.1 Motion Segmentation ...........................

XV

1
2
11
15
19

21
22
23
24
26
26
27
28
29
32
35

39
43

viii

3.1.1 Image subtraction techniques .................. 44
3.1.2 Gradient techniques ........................ 46
3.1.3 Feature-based techniques ..................... 49
3.1.4 Proposed method ......................... 50
3.1.5 Experimental results ....................... 55
3.2 Color Segmentation ............................ 60
3.2.1 Color spaces ............................ 61
3.2.2 Color Image Segmentation .................... 66
3.2.3 Color constancy .......................... 69
3.2.4 Proposed method ......................... 73
3.2.5 Experimental results ....................... 80
3.3 Summary ................................. 85
Model-Based Image Segmentation 86
4.1 Segmentation of the Moving Object ................... 89
4.1.1 Active contours .......................... 90
4.1.2 Proposed method ......................... 94
4.1.3 Experimental results ....................... 96
4.2 Segmentation using Deformable Templates ............... 99
4.2.1 Deformable templates ...................... 104
4.2.2 A Deformable Vehicle Template ................. 106
4.2.3 Image Model ........................... 109
4.2.4 MAP Segmentation ........................ 117
4.2.5 Experimental Results ....................... 126
4.3 Summary ................................. 131
Color and Shape Matching 133
5.1 Color Histogram Matching ........................ 133
5.1.1 Color indexing .......................... 134
5.1.2 Color matching .......................... 135

5.1.3 Baseline distribution of the color matching score ........ 139

5.2 Edge Point Set l\<Iatching .........................
5.2.1 Distance measures ........................
5.2.2 Evaluation of the distance measures ...............
5.2.3 Baseline distribution of the modiﬁed Hausdorff distance . . . .
5.2.4 Registration of two edge images .................

5.3 Summary .................................

6 Vehicle Matching System
6.1 Travel Time Estimation .........................
6.1.1 Feature extraction ........................
6.1.2 Time-based and vehicle class-based indexing ..........
6.1.3 Color and shape matching ....................
6.1.4 Experiments ............................
6.2 “Real-Time” Experiments ........................
6.2.1 Locating the front vehicle ....................
6.2.2 Feature extraction ........................
6.2.3 Matching strategy .........................
6.2.4 Experimental results .......................
6.3 Comparison with Manual Matching ...................
6.4 Digitization and Storage Issues .....................
6.5 Effect of Weather Conditions and Time of Day .............

6.6 Summary .................................

7 Summary and Future Research

7.1 Summary .................................

7.2 Future Research ..............................
A Image Databases

Bibliography

142
143
148

163
164
166
167
168
170
171
171
174
175
179
180
181
189
197

199
199
202

206

237

List of Figures

1.1

1.3
1.4
1.5
1.6

3.1
3.2
3.3
3.4
3.5
3.6

3.7
3.8
3.9
3.10

3.11
3.12

Template matching. ........................... 11
Image matching. ............................. 12
Model-based object recognition. .................... 12
Object matching. ............................ 12
Block diagram of an object matching system. ............. 14
An example of object matching ...................... 16

Estimation of average travel time and origin/destination matrices using

an object matching system. ....................... 33
A moving ramp edge ............................ 52
A two-dimensional moving ramp edge. ................. 54
Steps in the motion segmentation algorithm ............... 56
Motion segmentation algorithm. .................... 57

Results of the motion segmentation algorithm on some moving vehicles. 58

Results of the motion segmentation algorithm on some indoor and

outdoor moving objects. ......................... 59
Spectral responses of the human color sensors. ............ 61
The H S I color space. .......................... 63
Ohlander’s color segmentation algorithm. ............... 68

Computation of the reﬂectances by the algorithms based on the Retinex the-

ory. .................................... 72
The split-and-merge algorithm for segmenting a color image. . . . . 75
Steps in the color segmentation algorithm. ............... 81

3.13
3.14
3.15

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8

4.9

4.10
4.11
4.12
4.13
4.14
4.15
4.16
4.17
4.18

4.19
4.20
4.21
4.22
4.23
4.24

xi

Color segmentation algorithm. ..................... 82
Results of the color segmentation algorithm. .............. 83
Results of the color segmentation algorithm (cont’d). ......... 84
Effect of the line image force Enne on a snake. ............. 91
Effect of the edge image force Eedge on a snake. ............ 91
Effect of the termination image force Em,“ on a snake. ........ 91
Steps in the moving object segmentation algorithm ........... 97
Moving object segmentation algorithm. ................ 98
Results of the segmentation of the moving object. ........... 100
Results of the segmentation of the moving object (cont’d). ...... 101

Segmentation of the moving objects vs. segmentation of the moving

vehicle of interest .............................. 103
Deformable template deﬁning the generic model of a vehicle. . . . . 106
Five classes of deformable vehicle templates. ............. 107

Set of rules used to deﬁne the prior probability on the deformed vehicle. 110
Estimation of the location of the baseline and length of the vehicle. . 113

Edge detection. .............................. 114
Cross-sections of the energy function along two parameters Y3 and Y5. 118
Comparison of the logarithmic and geometric cooling schedules. . . . 121
Successive iterations of the vehicle segmentation algorithm. ..... 123

Results of the Metropolis algorithm for different sets of parameters. . 124

Results of the Metropolis algorithm for different sets of parameters

(cont’d). ................................. 125
Vehicle segmentation algorithm. .................... 127
Segmentation results for three sedans. ................. 128
Segmentation results for three pickup trucks. ............. 128
Segmentation results for three hatchbacks. ............... 128
Segmentation results for three station wagons or 4x4. ........ 129

Segmentation results for three vans. .................. 129

4.25

5.1

U1
[\3

O!
00

5.4

5.5

5.6
5.7
5.8
5.9
5.10
5.11
5.12

5.13

5.14

5.15

5.16

5.17

5.18

5.19
5.20

xii
Segmentation results leading to wrong classiﬁcations. .........

Computation of the color matching score. ...............
Eight color patches. ...........................
Hierarchical clustering of the 75 color patches based on the color his-
togram matching score. .........................
Cumulative distribution of the color matching score under the random
color hypothesis. .............................
Cumulative distribution of the color matching score for the 75 color
patches. ..................................
Hausdorff distance between two point sets. ...............
Chamfering algorithm. ..........................
Example of matching two objects using the ranked distances.

A typical random point pattern in a binary image. ..........
A typical random line pattern. .....................
Random line patterns with additive noise. ...............
Behavior of the Hausdorff distance (Dlg) as a function of the noise
levels K0 and K“. ............................
Behavior of the ranked distance measures (010 and D14) as a function
of the noise level Kc. ..........................
Behavior of the modiﬁed Hausdorff distance (022) as a function of the
noise level for 4 types of noise models. .................
Edge images for four vehicles. ......................
Edge images for four people. ......................
Cumulative distribution of MHD under the random point pattern hy-
pothesis. .................................
Generic model of a vehicle. .......................

Examples of random vehicles generated by the model. ........

Cumulative distribution of MHD under the random vehicle hypothesis.

130

136
138

140

141

142
144
147
149
149
151
151

152

152

153
154
155

157
158
158
159

5.21

O!
t0
M

6.2
6.3
6.4
6.5

6.6

6.7

6.8

6.9

6.10
6.11
6.12
6.13
6.14
6.15
6.16
6.17
6.18
6.19

A.1
A.2
A.3
A.4
A.5
A.6

xiii

Pose clustering algorithm to estimate the translation and scale param-
eters between the two edge images. ...................

Pose clustering to register two edge images ................

Matching vehicles for travel time estimation. .............
Feature extraction. ............................
“Real-time" extraction of the front vehicle ................
Feature extraction for the front vehicle in Figure 6.3. .........
Example of segmentation of the front vehicle and extraction of the
color patch for different scenes. .....................
Decision tree for matching two vehicles. ................
Comparison of travel time estimates. ..................
Estimation of the location of the baseline and length of the vehicle.
Algorithm to automatically select interesting frames. .........
Example of the automatic frame selection algorithm. .........
Results of the automatic frame selection algorithm. ..........
Color segmentation results on the Maple databases. ..........
Model-based segmentation results on the Maple databases. ......
Color segmentation results on the Engineering database. .......
Model-based segmentation results on the Engineering database.

Color segmentation results on the Fog database. ............
Model-based segmentation results on the Fog database. .......
Color segmentation results on the Dusk database. ...........

Model-based segmentation results on the Dusk database. .......

Database Saginaw-Coolidge: vehicles 1 to 20. .............
Database Saginaw-Coolidge: vehicles 21 to 40. ............
Database Saginaw-Coolidge: vehicles 41 to 60. ............
Database Saginaw-Coolidge: vehicles 61 to 80. ............
Database Saginaw-Coolidge: vehicles 81 to 100. ............
Database Saginaw-Coolidge: vehicles 101 to 120. ...........

160
161

173
174

181
184
186
187
188
190
191
192
193
194
194
195
196

208

xiv

A.7 Database Saginaw-Coolidge: vehicles 121 to 140.
A8 Database Saginaw-Coolidge: vehicles 141 to 149.

A9 Database Saginaw-Haslett: vehicles 1 to 20. .
A.10 Database Saginaw-Haslett: vehicles 21 to 40.
A.11 Database Saginaw-Haslett: vehicles 41 to 60.
A.12 Database Saginaw-Haslett: vehicles 61 to 80.
A.13 Database Saginaw-Haslett: vehicles 81 to 100.
A.14 Database Saginaw-Haslett: vehicles 101 to 120.
A.15 Database Saginaw-Haslett: vehicles 121 to 138.
A.16 Database Maple-East: vehicles 1 to 20. . . . .
A.17 Database Maple-East: vehicles 21 to 40. . . .
A.18 Database Maple- West: vehicles 1 to 20. . . .
A.19 Database Maple- West: vehicles 21 to 40. . . .
A.20 Database Engineering. .............
A.21 Database Fog. .................
A.22 Database Dusk-I. ...............
A.23 Database Dusk-2. ...............
A.24 Database Dusk-3. ...............
A.25 Database People-In. ..............
A.26 Database People-Out. .............
A.27 Database Objects-In. ..............
A.28 Database Objects-Out. .............
A.29 Database Sequence: odd frames 1 to 95. . . .
A.30 Database Sequence: odd frames 97 to 191. . .
A.31 Database Sequence: odd frames 193 to 287. .
A.32 Database Sequence: odd frames 289 to 379. .

0000000000000

0000000000000

List of Tables

1.1 A taxonomy of 3D object recognition systems. ............ 9
4.1 Classiﬁcation results for 287 vehicles. ................. 131
5.1 Hubert’s F statistic for different bin sizes. ............... 138
5.2 Color matching scores for the eight color patches shown in Figure 5.2. 139
5.3 24 distance measures between two point sets. ............. 146
5.4 Values of Du), 014, D18, and D22 for the four vehicles shown in Fig-

ure 5.15. .................................. 154
5.5 Values of Bio, 014, 013, and 022 for the four people shown in Figure 5.16.155
A.1 Deﬁnition of the databases. ....................... 207

XV

Chapter 1

Introduction

Machine vision is an interdisciplinary research area involving a number of disciplines,
including electrical engineering, computer science, statistics, and psychology among
others. It has applications in a large number of ﬁelds ranging from industrial in-
spection to interpretation of medical images. Efforts in machine vision were ﬁrst
motivated based on the observation that the human (biological) eye-brain system is
a powerful sensor-processor system that helps us interpret the images of the outside
world. Researchers have tried to understand and emulate the human eye and its
interactions with the brain for many years. but the time when a general purpose ma-
chine vision system will be available has yet to come. All the machine vision systems
currently available in the market, or in research and development stage are dedicated
to a speciﬁc task.

A machine vision system has multiple requirements: (1) it should assist a human
in diagnostic operations such as early detection of diseases and parts inspection; (2)
it should be included in systems to perform tasks that are too demanding for humans
(part sorting), or too dangerous (nuclear plant inspection), or simply impossible (in-
spection of the ocean ﬂoors or the planet Mars); (3) it should be able to perform tasks
more consistently and reliably at a lower cost and faster speed compared to humans.

The object matching system developed in this thesis attempts to meet the third
requirement. The problem which we address is the following: Given two sets of

numbered objects. determine the identiﬁcation number of the object in the ﬁrst set

(if it. exists) which is identical to a given object in the second set. A human being
might be able to tell if he/she has seen a particular object before. but if the two sets
are large and contain similar objects (for example, if all of them are vehicles), then
he/she will not be able to solve the problem without examining the two data sets at
the same time. A machine vision system. however, may not understand the content
of the images, or even what the objects are, but it is not likely to forget a shape, a

color, or a textured pattern.

1.1 Model-Based Object Recognition

An object recognition system is provided a database of 3D object models and its task
is to identify and localize the objects in a scene, i.e., determine the identity, position,
and orientation of the sensed objects. An object recognition system is characterized by
its model representation strategy, the sensing scheme used to acquire the image(s),
and the technique used for pose estimation. A large number of object recognition
systems are described in the book by Jain and Flynn [98]. In spite of over twenty ﬁve
years of research in 3D object recognition, there are numerous difﬁculties in designing

and building a general purpose object recognition system [142].

Model representation

The choice of a model representation strategy is probably the most crucial decision
to be made during the system design phase. The model representation will affect
the choice of the features to be extracted from the sensed images and the pose es-
timation technique to be used at recognition time. The simplest way to represent
two-dimensional objects is by using global features such as area, perimeter, moments,
Euler number and Fourier descriptors. This representation has the advantage of being
extremely compact, but is limited to the recognition of non-occluded objects.

Two- or three-dimensional objects can be explicitly represented as a set of features.
Point features are the simplest features that can be used to build an object model.

These point features can correspond to vertices in the case of polyhedra, landmark

points in the case of quadric surfaces (vertex of a cone or center of a sphere), or high
curvature points for arbitrary curved surfaces. Attributes such as the cornerness or
the amount of curvature can also be assigned to a point.

Polyhedra] objects are often described by their edges as wire—frames. In the case
of arbitrary objects. any contour feature that can be reliably extracted from the
sensed images can be used. Typical attributes for a contour feature are its length, its
orientation, its curvature, and its classiﬁcation (jump or crease edge). Contour (or
edge) features are the most common representation for 2D objects.

Polyhedra can also be represented as a set of planar surfaces and their adjacency
information. Planar and quadric surface patches can be used to approximate the
shape of more complicated objects. Their parameters can then deﬁne attributes of a
surface. The representation of a curved object as a set of NURB (non-uniform rational
B-spline) surfaces is becoming very popular in computer-aided design systems.

The representation of 3D objects as a list of volume elements is also known as
parts-based representation. The most well known representations include generalized
cylinders [143], geons [13], and superquadrics [149]. Raja and Jain [155] describe a
technique to obtain the generic part description directly from a range image.

Three-dimensional objects can also be implicitly represented as a set of 2D images
or 2D feature representations. If an object is described by a set of 2D descriptions,
then the recognition process is reduced to matching the sensed 2D view with all possi—
ble 2D appearances of the object. Common schemes for multiple view representations
include geometric aggregates, aspect graphs, and multiple 2D views.

Point and contour features as described earlier can be used to represent any of
the 2D appearances of a 3D object. Relationships between features are then used to
represent the object. This representation was used by Grimson [80] and Stockman -
[169].

The deﬁnition of aspect graphs is due to Koenderink and van Doorn [112]. An
aspect graph is commonly deﬁned as a graph in which a node represents a “stable
view” of the object (or aspect) and an arc represents the transitions between views by

going from one region to a neighboring region in the viewpoint space. Aspects can be

understood as topologically equivalent classes with respect to the object features that
are visible to the sensor. Stewman and Bowyer [167] proposed a method to automat-
ically generate the aspect graph of polyhedral objects under perspective projection.
Ikeuchi and Kanade [94] do not compute the aspect graph of an object. but they are
able to extract the different aspects of an object by tesselating the viewsphere into
60 view points and clustering the observed shapes into important shape classes (or
aspects).

Basri and Ullman [8] proposed to store a 3D object as a set of 2D images that
are obtained by looking at the object from different vieWpoints. The viewpoints
are obtained by tesselating the viewsphere into small patches. This representation

scheme was also adopted by Chen et al. [28] for object recognition based on silhouette

matching.

Sensing scheme

The most popular sensors for 3D object recognition are CCD cameras and range
ﬁnders.

Usual black and white (B / W) CCD cameras produce intensity images by assigning
a gray level value I(.r, y) to each pixel (:c. y) in the image. Color images composed of
three intensity images R(-.r,y), G'(.r,y), and B(;r,y) corresponding to the red, green,
and blue component, respectively, can be obtained using a color camera. Another
method of obtaining color images is to use a B/W camera and acquire the same
image three times using a red, a green, and a blue ﬁlter.

Multiple (two or three) CCD (color or B/W) cameras can be used in a stereo
vision setup to derive depth information about the scene being sensed. The epipolar
constraint can be used to solve the correspondence problem between features in the
two images, estimate the disparity between the two images, and compute the distance
between each point in the scene and the sensors, thus producing a depth map of the
scene.

Range ﬁnders directly produce a depth image where the gray level 2(2, y) at pixel

(2:. y) corresponds to the distance to the sensor. The sensors typically use active

illumination. Either an intensity grid or a sheet of light is projected onto the scene
and the projection of the light is observed by a camera. The depth at a given pixel
is estimated using the triangulation principle. A laser range ﬁnder usually produces
a dense depth map of the scene whereas systems using a light grid produce a sparse
depth map (only at those points where the grid lines intersect). Range finders produce
2%D images because. due to self occlusion of the object. some surface points are not

reached by the projected light while others are not seen by the camera.

Pose estimation

Pose estimation consists of determining the position and orientation of the object in
the image by searching for a match between features extracted from the model rep—
resentation and features extracted from the sensed image. This section describes a
number of well-known pose estimation techniques including interpretation tree search,
pose clustering, graph matching, optimization techniques. alignment, geometric hash-
ing, and evidence-based approach.

The interpretation tree scheme searches for all possible correspondences between
model features and scene features to construct candidate hypotheses. Each node in
the tree deﬁnes a possible match between a model feature and an image feature. The
bottom level of the tree contains the (leaf) nodes at which all the model features have
been matched; each leaf node corresponds to one hypothesis. One way of generating
all the hypotheses is to systematically search the tree in a depth-ﬁrst search approach.
The number of hypotheses generated by the scheme is an exponential function of the
number of features in the model and in the scene.

When the number of model/object features is large, it is not feasible to compute
all possible hypotheses. Solutions to prune the tree have been proposed in the liter-
ature. For example, Grimson [80] uses constrained search to ensure that candidate
hypotheses satisfy the constraints imposed by the geometry of the object. He deﬁnes
constraints such as the length of an edge, the angle between two edges, the distance
between two edges, the area of a surface patch. the distance between two surface

patches, and the angle between two surface normals. In a similar manner, Flynn and

(i

Jain [64] use three unary constraints (area. type, and radius of non—planar surfaces)
and four binary constraints (rotation. orientation. visibility. and parallelism of planar
surfaces) to prune the interpretation tree.

Each of the candidate hypotheses is veriﬁed by transforming the model according
to the pose hypothesis and comparing it with the image. The hypothesis whose pose
transformation brings the stored model closest to the sensed image will be retained
as the estimate of the pose. This technique can handle object occlusions by possibly
matching a model feature to the NULL feature.

Pose clustering [168] considers all possible pairs of model and image features and
computes the transformation (translation. rotation, and scaling) between them. Each
pair of model / image features deﬁnes a transformation. The possible transformations
are then accumulated in a discretized transformation space using a Hough-based
approach. Peaks in the transformation space deﬁne the pose candidate hypotheses
for the objects. Using the same veriﬁcation scheme, all hypotheses are veriﬁed to ﬁnd
the correct one.

Object features and their geometric relationships can be stored in a graph struc-
ture, where the nodes represent the features and their possible attributes and the arcs
represent the relationship between the features. When objects are modeled as graphs,
the recognition process is equivalent to the computation of subgraph isomorphisms.
If an object model matches the sensed object, then the graphs corresponding to the
two objects should be isomorphic. An example of 3D object recognition using graph
matching was published by Shapiro et al. [162]. Cheng and Huang [31] proposed a
method for matching star structures. Their method was applied to image registration,
but could also be applied to object recognition. In their 3DPO system, Bolles and
Horaud [17] describe a method which identiﬁes the best set of mutually consistent
pairings of image and model features by ﬁnding maximum cliques in the graphs.

Lowe [127] proposed an iterative approach based on Newton's method to minimize
the distance between a 3D object and corresponding 2D image contours. A similar
approach was used by Basri and Ullman [8] and Chen et a1. [28] to match an observed

silhouette with a set of model silhouettes observed from different vieWpoints. Lowe

‘

[126]. in his SCERPO system. combines edge elements into edge structures using
perceptual grouping. These edge structures are used to guide the search and are
matched one at a time by a technique. called incremental grouping.

The alignment method was proposed by lIuttenlocher [92] as a hypothesize and
test (verify) method. A limited number of features is considered to compute a pose
estimate. Huttenlocher shows that three corresponding image and model points de-
ﬁne a unique three-dimensional transformation under weak perspective projection
(orthographic projection plus scale). Once the pose has been estimated. the entire
model is used to verify the hypothesis.

Geometric hashing was ﬁrst introduced by Lamdan and Wolfson [118]. The
method reduces the search time by storing invariant features in a hash table. For
example, Stein and Medioni [166] deﬁne two structures called splash and superseg-
ment to index the hash table. Supersegments are polygonal approximations of 3D
curves at different scales and splash is a feature encoding the local surface normal
distributions on the surface. Flynn and Jain [65] used triples of planar and quadric
surface patches to index the precompiled interpretation table. The models are stored
in a hash table and indexed by their features. When a given feature is found in a
sensed image. it is searched in the hash table to ﬁnd a list of models containing that
feature. Hypotheses are then veriﬁed using the same scheme as the one used by the
interpretation tree scheme.

The evidence-based approach for object recognition was proposed by Jain and
Hoffman [99]. A discriminatory evidence rulebase, capable of differentiating between
a set of objects. provides salient information about individual surface patches and
pairs of patches. These evidence rules either refute or support hypotheses about the
presence of an object in the scene. Jain and Hoffman also proposed an algorithm to

automatically generate the rulebase.

Examples of object recognition systems

The combination of different model representation strategies, sensing modalities, and

pose estimation techniques has yielded a large number of object recognition systems

published in the literature. Some of these systems along with their characteristics are

presented in Table 1.1.

Indexing

Indexing is another important strategy in designing an object recognition system.
A recognition system is usually given a library (database) of object models. The
system ﬁrst has to decide which object is present in the scene before it can estimate
the position and orientation of the object. The simplest solution is to apply one of
the pose estimation techniques to each model in the library in a sequential fashion,
and the model that best matches the image features is chosen as the solution to the
recognition problem. I'sing this scheme. the recognition time increases linearly with
the number of object models in the library. This is one of the reasons why most
model-based object recognition systems have been tested on only a small number of
object models (see Table 1.1). Of course, model construction itselfis a time consuming
process. For a large model database, an indexing methodology is needed to cut down
the computational burden of the matching stage. An indexing scheme reduces the
list of candidate object models which need to be matched to the sensed object. To
be effective, the computational burden of an indexing mechanism must be less than
the computational cost of matching.

The most common indexing technique used by object recognition systems is based
on geometric features. Given a list of features derived from the representation scheme,
a list of models containing each feature is constructed. Then, at recognition time, if
the sensed object presents a particular feature, it is looked up in the model database
to produce a list of model hypotheses. If the features are relatively simple and are
associated with a large number of models, several features can be combined to reduce

the list of model hypotheses by using a simple voting scheme. Each model associated

Table 1.1: A taxonomy of 3D object recognition systems.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Authors Model Sensing Pose Data- Inde-

representation modality estimation base xing
size

Basri & multiple 2D intensity optimization 1 no

Ullman [8] views

Bolles & planes. cylinders range graph 1 no

Horaud [17] matching

Brooks [21] generalized intensity constrained 3 no
cylinders search

Fan, Medioni t; quadric surfaces range graph 10 yes

Nevatia [58] multiple views matching

Faugeras & quadric surfaces range search 3 no

Hebert [60]

Flynn & planes, cylinders, range constrained 20 no

Jain [64] spheres search

Flynn & planes. cylinders, range geometric 23 yes

Jain [65] spheres hashing

Grimson & planar surfaces range tree search 1 no

Lozano—Pérez [81]

Huttenlocher 8.: points intensity alignments 1 no

Ullman [92]

Ikeuchi 81. multiple aspects stereo tree search 1 no

Kanade [94] (from viewsphere)

Jain 8.; edges, surfaces range evidence— 10 no

Hoffman [99] based

Lamdan & points. edges intensity geometric 2 yes

Wolfson [118] hashing

Lowe [126] points, edges intensity optimization 1 no

Nevatia & generalized intensity graph 5 yes

Binford [143] cylinders matching

Stein & supersegments range geometric 9 yes

Medioni [166] splash hashing

Stockman [168] edges intensity pose 1 no

 

 

 

clustering

 

 

 

 

10

with a. given feature receives a vote.

\Vhen all the features have been examined, the number of votes attributed to a
model indicates the likelihood of the model being present in the scene. The mod-
els with the largest. number of votes form the candidate l1y}.)othesis list [18]. The
technique of geometric hashing described earlier is entirely based on the indexing
process.

Nevatia and Binford [143] used a description code to index the model database.
The description code is based on the characteristics of the model parts represented
as generalized cylinders. The simplicity of the description code (binary type for each
part) requires a large number of parts to be examined for the indexing procedure to
be efﬁcient.

Stein and Medioni [166] use their supersegment and splash attributes to index the
model database and quickly decide which objects might be present in the scene.

“Salient” features that have large discriminatory power can be used to effectively
index the model database. The challenge is to ﬁnd these salient features so that
the number of models associated with a given feature is small. A measure of the
saliency of a feature was proposed by Flynn [63] for a model database consisting of
96 polyhedra.

Swain and Ballard [172] proposed a technique that uses color features instead of
geometric features to index the database of models. They did not build an object
recognition system but showed how color indexing can be used to reduce the list of
model candidate hypotheses. Their technique is based on color histogram matching.
The color histogram of the model is compared to the color histogram of the sensed

color image to decide if a particular model might be present in the scene.

11

 

 

p———q_

”Mg/Ni
Cl
QM

Template Image

 

 

 

 

 

\

 

 

‘_7

Figure 1.1: Template matching.
1.2 Object Matching

Matching is a generic operation in image processing and computer vision, which is
used to determine the similarity of two entities (points, curves. subimages, objects,
etc.) of the same type. A number of different operations fall under the broad term of
“matching”. The simplest matching operation is based on template matching where
a template or a prototype of the object that is to be found in an image is provided
(see Figure 1.1). Template matching has been commonly used to detect speciﬁc
image features such as lines or corners [158], facial features (e.g., eyes and mouth)
[189]. handwritten digits [70]. etc. Template matching is performed by evaluating the
cross—correlation of a template and the corresponding window in the image.

Image matching consists of determining the correspondence between image fea-
tures such as points, lines, corners. or homogeneous regions (see Figure 1.2). The
input consists of two different images captured by different sensors in the case of im-
age registration [179], sensed by different cameras at the same time for stereo vision
[79], or captured at different times when used in motion analysis [27].

Model—based object recognition also uses matching techniques as described in
Section 1.1. Given a precise model of an object, the goal is to identify the pose
(position and orientation) of this object in the image (see Figure 1.3).

Object matching is deﬁned as follows. Two objects are sensed possibly in different

 

 

 

 

 

 

 

 

 

 

 

I

\ /
Scene 1 ‘ - — “' Scene 2

Figure 1.2: Image matching.

‘17

 

 

 

 

 

 

 

 

 

 

 

 

 

W
Model Scene

Figure 1.3: Model-based object recognition.

backgrounds by two independent cameras at different times. The goal is to determine
if the two objects are identical (see Figure 1.4).

According to this deﬁnition, object matching is different from template matching
and image matching. Object matching is also different from model-based object
recognition. An object matching system is usually simpler to construct than an
object recognition system. The analogy is similar to a speech veriﬁcation system
versus a speech recognition system. In an object matching system, precise models of

the 3D objects are usually not available and not required. The objects are observed in

 

 

 

 

 

 

   

 

 

 

 

 

 

Scene 1 Scene 2

Figure 1.4: Object matching.

13

approximately the same aspect, are entirely contained in the image. are not occluded,
but are sensed in different environments. More specifically. given two images II
and I; each containing an object of interest 01 and (92 in complex backgrounds,
respectively, the object matching system ﬁrst segments the objects of interest from
the background, and then determines the similarity between 01 and 02 by computing
a similarity measure 5(01, 0;). This similarity measure is then used to determine if
(91 and 02 are identical.

A block diagram of a typical object matching system is shown in Figure 1.5. The
ﬁrst camera observes a scene at time t1. The image sequences are segmented to extract
the object of interest. This is the main challenge of the object matching system. The
object of interest must have some particular property that the background does not
have so that segmentation can be easily performed. For example, if the object of
interest is the only object emitting heat in the image, then it can be segmented easily
given an infrared image. Also, motion cues can be used if the object of interest is
moving in a stationary background. Generic model information about the category of
objects that are of interest to the system can also be integrated in the segmentation
module. In our system, we have used motion. color. edge information, and model
information to extract the object of interest.

All the objects {0,1,2' = 1,...,N1} seen by the ﬁrst camera are stored in a
database. A second camera observes another scene at time t2. The same segmentation
algorithm is applied to the second image sequences to extract the object of interest.
The objects {03,3' = 1,... , N2} seen by the second camera are then considered in
sequence. For a given object (913, the object matching system tries to ﬁnd its match in
the database of objects {Obi = 1, . . . , N1}. Any of the matching techniques used for
model-based object recognition can be used to compute a similarity measure based

on shape information between (9,1 and 012-. In particular, our system uses pose clus-

Scene 1
Time I]

 

Segmentation

14

 

Knowledge

 

 

 

 

 

 

of the objects
of interest

 

 

 

 

   

 

Segmentation
of the objects
of interest

 

 

 

 

 

 

Database of objects
seen by the ﬁrst

   
   

. Indexing

 

camera

 

 

 

  

Reduced database
of objects seen by
the ﬁrst camera

 

 

 

. Matching

 

 

Matching
Scores

 

 

Decision

 

 

 

Identify the
matching pairs

Figure 1.5: Block diagram of an object matching system.

'

1.)

tering as part of the matching strategy. \Ve also compute a similarity measure based
on color information. Indexing techniques can also be used to avoid considering the
entire database. In our system. we index the database based on observation time (of
video frame containing the objects) and object category.

The matching system outputs a list of matching scores between the object 0,2
and the objects {Obi = 1.. . . , N1}. These matching scores are fed into a decision
module which determines which object in {0,1,1' 2 1....,N1} is identical to 032.
Note that there may not be an object in {0,1,1' = 1,..., NI} corresponding to the
object 012-. So, the object matching system needs to have a reject option. As an
example. Figures 1.6(a) and 1.6(b) show two images containing the same vehicle while
Figure 1.6(e) shows an image containing a different vehicle. The images were captured
at different locations and different times. Our object matching system was able to
extract the vehicles of interest in each image as can be seen from Figures 1.6(d)-(f).
The matching score between the two vehicles in Figures 1.6(d) and 1.6(e) was found
to be 0.514, whereas the matching score between the two vehicles in Figures 1.6(d)
and 1.6(f) is equal to 0.044. Thus, the system determined that the ﬁrst two vehicles

are identical whereas the other is quite different.

1.3 Contributions of the Thesis

This thesis proposes a complete object matching system that can handle a large
object database. The system is complete in the sense that given two sequences of
input images, it is able to automatically extract the objects of interest and identify
pairs of identical objects. The system was tested on a database of 287 vehicles. Our
segmentation algorithms were tested on more than 400 image sequences. The system

was developed to study the feasibility of an object matching approach to estimate

16

 

 

 

(e) (0

Figure 1.6: An example of object matching: (a) image sensed by the second camera;
(b)-(c) two different images sensed by the ﬁrst camera; (d)-(f) segmented objects of
interest; the matching scores based on color and edges are 0.514 for (d) and (e) and
0.044 for (d) and (f).

17

travel time in a road network as part of the Intelligent \ehicle/I-Iighway Systems
program. Note that the development and prototype construction phases still have to
be completed before the system can be used on site.

It is important to note that the objects which we are trying to match are very
similar in shape and are described by the same generic term in every day language, i.e.,
a vehicle. Typical object recognition systems reported in the literature can recognize
only a small number of objects which are very different in shape from each other, i.e.,
a doll and a toy car. Thus, an extensive testing of the various algorithms developed
here (segmentation, indexing, and matching) is an important contribution of this
thesis.

We propose two model-based segmentation algorithms. The ﬁrst technique takes
into account the fact that the contour of the object of interest should be continuous
and smooth. This model information is incorporated using an active contour which
enforces a smoothness constraint. We show that this algorithm produces a very
accurate contour of the moving object. This algorithm can be used in many situations
where the image acquisition setup can be controlled to allow only one moving object in
the camera’s ﬁeld of view at a given time (for example, by guiding the moving object
as it passes in front of the camera). In situations where the objects cannot be guided
and there are multiple moving objects in the ﬁeld of view, the active contour-based
segmentation algorithm groups all the moving objects into a single blob. To overcome
this limitation, we propose a model-based segmentation technique which assumes a
generic model of the object of interest and utilizes the theory of deformable templates.
The ﬁrst contribution of this technique lies in the deﬁnition of the vehicle template
as a polygonal shape. The constraints on the shape of the polygon are enforced by
a set of rules (one for each vertex coordinate) which were manually derived using a

set of training examples. We believe that the same approach could be used for any

18

template deﬁned by a polygonal shape.

The second contribution of the vehicle segmentation technique lies in the deﬁnition
of the likelihood function. In particular, the edge directionality term evaluates the
“edgeness” of the pixels on the boundary of the template by ensuring that the edges of
the deformable template coincide with strong edges having the same orientation in the
image. By using gradient magnitude and direction on the boundary of the template,
we implicitly correlate both sides (object and background) of the boundary. This
differs from other approaches where either the inside and the outside of the template
are modeled by two independent processes or the boundary of the template is modeled
by a line process.

Another important contribution of this thesis lies in the use of multiple cues for
segmentation and matching. The segmentation algorithm which produces an accurate
contour of the moving object fuses motion and color information. Although we do
not derive a general methodology for fusing multiple cues, we show that combining
region-based segmentation results (obtained using color information) with motion
information can help in identifying the moving object of interest. In the vehicle
segmentation algorithm, we demonstrate that multiple cues can be combined simply
by including additional terms in the expression of the likelihood energy function.
We show that the segmentation results are enhanced by the combination of multiple
cues. During the matching process also, we combine color information and shape
information to decide if two objects are identical. As a consequence, we have more
conﬁdence in the resulting matched pair of objects.

We deﬁne a distance measure between two point sets that is suitable for object
matching. This distance, which is a modiﬁcation of the Hausdorff distance, is less
sensitive to the noise introduced by the segmentation process and does not discard

any edge points that are part of the objects. We have used pose clustering to identify

19

the transformation (translation and scaling) between the two edge point sets so that
they can be registered before matching.

Other contributions include the motion segmentation algorithm which involves
only three frames in the image sequence and requires simple operations such as sub-
traction, multiplication, thresholding, and morphological closing. Our color segmen-
tation algorithm makes use of edge information in addition to region split-and—merge
to produce more accurate boundaries between different regions.

In summary, the main contributions of this thesis are:

0 A complete object matching system for objects which are similar in shape. The
matching system has been tested on a large number of images.

0 A feasibility study for travel time estimation on a road network.

0 Model-based segmentation of the object of interest.

0 Fusion of multiple cues for image segmentation and object matching.

0 Modiﬁed Hausdorff distance and pose clustering for edge point set matching.

0 Motion segmentation and color segmentation.

1.4 Outline of the Thesis

This thesis is divided into seven chapter. Chapter 1 deﬁned the elements and assump-
tions of the proposed objects matching system. Chapter 2 presents the different pro-
grams related to Intelligent Vehicle/Highway Systems (IVHS) along with the machine
vision systems related to IVHS applications. We also propose a direct application of
the proposed work on object matching in the domain of IVHS.

Chapter 3 describes image segmentation techniques which can be used to segment

the object of interest from its background. The ﬁrst technique is based on motion

20

information and the second technique uses color information. These general image
processing techniques do not require any model information.

Chapter 4 proposes two model-based image segmentation techniques to extract
the object of interest. The ﬁrst technique deﬁnes continuity constraints on the con-
tour of the object and fuses motion and color information using an active contour
model to segment the object of interest. The second technique is geared towards the
segmentation of vehicles. We deﬁne a generic model of a vehicle and use a deformable
template approach which fuses motion and edge information to segment and classify
the vehicle of interest.

Chapter 5 deﬁnes the two matching techniques which are used by the object
matching system. Color matching is done by comparing three-dimensional color his-
tograms. Shape matching is done by computing the modiﬁed Hausdorff distance
between two edge images which have been registered using pose clustering.

Chapter 6 is dedicated to the evaluation of the complete matching system. We
demonstrate on a large database of images that we can correctly identify pairs of
matching objects. We also propose a “real-time” version of the system which does
not require a very precise segmentation of the object of interest. The results of these
two vision-based matching systems are compared to results that experts obtained
manually by looking at the two databases of objects. We also tackle the problem of
on-site implementation of the matching system and demonstrate the effects of various
weather conditions and time of the day on the segmentation results.

Finally, Chapter 7 summarizes this thesis, evaluates the strengths and limitations

of the proposed system, and proposes future research directions.

Chapter 2

Intelligent Vehicle / Highway

Systems and Applications

Several government and industry funded projects are underway in different coun-
tries (USA, European countries, and Japan) to design and build prototypes of semi-
autonomous vehicles which will be driven on major “intelligent” highways. These
highway automation programs have basically the same goals, even though they take
different directions and emphasize different aspects of the program. These programs
along with their main characteristics are described in this chapter. An appropriate
phrase to describe these programs can be stated as: Smart cars driving on smart
highways to make drivers smarter. A more technical deﬁnition would be: Using
high technology electronics to integrate vehicles and highways for safety, efﬁciency,
productivity, mobility, environmental protection. and energy savings.

The role of these programs is to focus the efforts of public organizations who build
highways and private organizations who build vehicles to improve surface transporta-
tion. The motivations and rationales come from societal and economic concerns:

decrease the number of accidents on the highways to reduce medical and insurance

rv
Nu

costs. increase the traffic density to avoid building more roads for the increasing
number of vehicles. and provide information to drivers through information centers
and route guidance systems. Among the means that can be used to achieve these
goals. one can ﬁnd sensor—based description of the environment (including machine
vision, proximity sensors and range ﬁnders), map description of the roads and in-
tersections, inter-vehicle communications. communications with information centers,
etc. The IVHS programs are concerned with human factors (driver interface, driver
information, trafﬁc management centers), communications and information flows, ar-
chitecture, legal and institutional issues. socio—economic issues (public assistance and

environment), and vehicle performance.

2.1 IVHS in the United States

The Executive Committee of the Transportation Research Board (TRB) decided to
study advanced vehicle and highway technologies in 1988, at the time when the IVHS
concept started to receive attention. Since the study was proposed, interests in the
government, in universities. and in the private sector led to the creation of Mobility
2000 [1], a group of public and private sector administrators and experts that deﬁned
the initial stages of many of the most important conceptual tasks. The work of this
group resulted in the formation of IVHS America, a private, nonproﬁt membership
organization with support from government and industry to serve as a contact between
public and private sector activities in IVHS. An extensive description of the IVHS
program was published by TRB [16]. The project deﬁnes near—term goals that should
be achieved by 1996, mid-term goals to be attained by 2001, and long-term goals to

be met in 2011.

2.1.1 Technical committees

Technical committees were formed to deal with specific issues of IVIIS. These commit-
tees are divided into three main subprograms. three application subprograms which
use the technologies defined by the subprograms. and six committees which deal with

legal issues and human factors. The technical committees are listed below.

Advanced Traffic Management Systems (ATMS): The ATMS program is
dedicated to control and optimize the trafﬁc flow on road networks. It covers tech-

niques such as electronic toll collection. trafﬁc surveillance. and adaptive trafﬁc light

control.

Advanced Traveler Information Systems (ATIS): The ATIS program is used
to provide the traveler with information such as current location, trafﬁc conditions in

the area, and route guidance depending on the trafﬁc conditions.

Advanced Vehicle Control Systems (AVCS): The AVCS program mainly deals
with driver assistance and driver replacement issues such as collision warning, vision
enhancement, adaptive cruise control, dozing alert, and automatic chauffeuring. Ve-

hicle platooning on specially equipped highways also falls under AVCS.

Commercial Vehicle Operations (CVO): The CVO program includes weigh-
in-motion, automatic vehicle classiﬁcation, and automatic vehicle identiﬁcation for
trucks and commercial vehicles. The other goal of CVO is to alert the police, the ﬁre

department. and the ambulances of the occurrence of an accident on the highway.

Advanced Public Transportation Systems (APTS): The goal of APTS is

to improve operation of high occupancy vehicles, including buses, vans, and car-

 

21

pools through automatic vehicle location. signal preemption. and smart cards for fare

sharing.

Advanced Rural Transportation Systems (ARTS): The main concern of ARTS
is to consider driving conditions in rural areas and not limit the program to main high-
ways. It is the latest addition to the IVHS program. Given the lower congestion and
air pollution in rural areas, and the fact that most automobile fatalities occur on rural
roads. the relative emphasis of ARTS is on safety. It includes autonomous navigation

systems, enhanced vision, and Mayday signaling.

Other committees: The other committees are not purely “technical". but deal
with important issues related to society and human factors. They are the follow-
ing: beneﬁts, evaluation, and costs; safety and human factors; systems architecture;

standards and protocols; legal issues; and institutional issues.

2.1.2 Current operational tests

Under the "Corridors Program”. a number of corridors that would beneﬁt from IVHS
have been designated to provide long-term sites for multiple operational tests. A num-
ber of criteria are used to deﬁne a corridor. including a trafﬁc density at 1.5 times the
national average, severe or extreme ozone nonattainment, variety of transportation
facilities, and mix of passenger, transit, and commercial vehicle operations.

In addition to the Corridors Program, operational test projects are also supported.
Some of these tests were designed to enhance the trafﬁc conditions in urban areas.
The Information for Motorists (INFORM) project in Long Island, NY is designed
to improve motorist travel time by integrating surveillance and control of three free-

ways. The SMART project in Los Angeles is designed to provide congestion relief,

reduce accidents. reduce fuel consumption. and improve air quality using communica-
tion systems to advise travelers of current conditions and alternate routes. TravTek in
Orlando provides traffic congestion information. motorist services information. tourist
information, and route guidance to operators of 100 CM vehicles specially equipped
with an in—vehicle TravTek device. The Advanced Driver and \"(_=l1icle Advisory Nav-
igation Concept (ADVANCE) in Chicago is designed to evaluate the performance of
the ﬁrst large-scale dynamic route guidance system. [p to 5.000 private and com-
mercial vehicles are equipped with iii-vehicle navigation and route guidance system
units. More information about these operational tests can be found in [159].

Other test areas have been chosen for highway monitoring and commercial vehicle
operations. The States and Canadian provinces along Interstate highway l—75 are
involved in the ADVANTAGE I—75 project which will facilitate truck transportation
by equipping trucks with transponders so that they can travel along any segment
of I—75 with minimal stopping at weighing/inspection stations. The other CVO test
ﬁeld, HELP/Crescent also involves a large number of States and Canadian provinces.
It is designed to test an integrated heavy vehicle monitoring system using automatic
vehicle identification and classification, and weigh-in-motion. About 2.000 trucks
have already been equipped with transponders.

In addition, Guidestar is the statewide IVHS program in .\=‘Iinnesota. It is designed
to bring together ATMS and ATIS technology to reduce congestion and improve safety
throughout the State using the video imaging vehicle detection system AutoscopeTM.
The Forum for Advanced Safe Travel through Trafﬁc Routing and Advanced Control
(FAST-TRAC) combines ATMS and ATIS technologies in the Oakland County, MI.
The SCATS adaptive trafﬁc control system is being deployed in the urban street
network of Troy, MI. Also, vehicles borrowed from the three US. auto companies are

equipped with the Siemens Ali-Scout route guidance and information system.

 

26

The IVHS activity was just renamed Intelligent Transportation Systems (ITS) be-
cause it spans all transportation systems in general. and the first. ITS World Congress
was held in Paris at the end of 1994. Four teams were scheduled to finish the initial
phase of an ITS architecture and standard study for the Department of Transporta—
tion in December, 1994. General Motors announced that its Oldsmobile 88 LSS can
be equipped with a 82.000 navigation and information system that displays maps

and directions for drivers. This illustrates the success of the [VHS research and test

programs.

2.2 IVHS in Europe

The two largest European IVHS programs are PROMETHEUS (PROgraM for Euro-
pean Trafﬁc with Highest Efﬁciency and Unprecedented Safety) and DRIVE (Dedi-
cated Road Infrastructure for Vehicle safety in Europe). A brief description of these

two programs is given below.

2.2.1 PROMETHEUS

PROMETHEUS is part of the Eureka collaborative European research and develop-
ment initiative. It was initiated in 1986 by automotive companies and led by Daimler-
Benz as an 8—year program with a budget of $700 million. Participants include most
of the major European vehicle manufacturers, more than 70 research institutes and
universities, and more than 100 electronics ﬁrms.

PROMETHEUS research emphasizes on-board vehicle functions. It is divided into
seven subprograms, three of which are industry oriented while the others are research
oriented. The three automotive industry subprograms are the following: PRO—CAR

provides the technological development of vehicle-based systems, including vehicle

27

monitoring and intelligent driver assistance aids: PRO-NET deals with the devel-
opment of a communications network between vehicles: and PRO-ROAD develops
the communication systems between the vehicle and the roadside infrastructure and
control system to optimize trafﬁc efﬁciency. The four basic research projects are the
following: PRO-CHIP deals with the development of micro-electronics required to
allow the construction of information systems with the required capacity; PRO-ART
develops artiﬁcial intelligence tools required for autonomous navigation using. for ex—
ample, vision, knowledge—based systems. and sensor fusion: PRO-COM designs the
structure and architecture of the communication systems to be employed: and PRO-
GEN conducts the study and analysis of possible traffic scenarios. More information
on these subprograms can be found in [16. 26].

During the ﬁnal years, the program has emphasized test ﬁelds known as Com-
mon European Demonstrators (CED). There are currently about 80 CEDs which
demonstrate applications of vision enhancement, proper vehicle operation, collision
avoidance, vehicle-toovehicle communications. intelligent cruise control, lane keep—
ing, emergency systems, commercial fleet management, dual-mode route guidance
systems, travel information systems, and trafﬁc management systems. The most ad—
vanced CEDs developed by research institutions are ProLab II [2] which emphasizes
driver assistance and VITA II [180] which emphasizes autonomous driving. In the
light of these demonstrators, PROMETHEUS was declared a success when it was

completed in October, 1994.

2.2.2 DRIVE

DRIVE focuses on technologies for road infrastructure rather than in-vehicle tech-

nologies. It was launched by the European Commission in 1988 as a 3—year program

 

28

with a budget of $130 million. It was extended in 1991 as DRIVE II with a funding
of $240 million. DRIVES goals are to improve road safety and transport efﬁciency
and reduce environmental pollution.

The pilot trials fall mainly into two groups: POLIS for cities and CORRIDOR
for inter-urban highways. The program is structured around seven areas of major
operational interest: demand management; travel and trafﬁc information: integrated
urban traffic management: driver assistance and cooperative driving; freight and fleet

management; and public transport.

2.3 IVHS in Japan

Unlike the North Americans and the Europeans. the Japanese have not deﬁned a
large project for IVHS. but several independent projects related to IVHS. The ﬁrst
major IVHS project was launched in 1973 by the Ministry of International Trade and
Industry (MITI) under the title Comprehensive Automobile Trafﬁc Control System
(CACS). The goal of this project was to install two-way communications between
vehicles and the roadside to obtain cooperative route guidance.

Four different organizations have been created in an attempt to provide the neces-
sary interaction between long-term research programs and industry applications. In
1979, MITI created the Association of Electronic Technology for Automobile Traf-
ﬁc and Driving (known as JSK for its Japanese initials). This association counts
57 companies including non-Japanese ﬁrms. The Ministry of Construction created
two associations: Highway Industry Development Organization (HIDO) in 1984 and
Digital Road Map Association (DRMA) in 1988. The Traffic Management Technol-
ogy Association (TMT) was created in 1987 by the National Police Agency. More

information on these associations can be found in [54].

 

29

The two main progran'is: Road Automobile Communication System (RACS) and
Advanced Mobile Traffic Information and Communication System (A.\ITICS) have
completed their field tests and were merged to become. a single entity known as Vehicle
Information and Con'imunication Systems (VICS).

Research programs include the Next Generation Highway Trafﬁc System (NeGHTS)
program on the highway side and the Super Smart Vehicle System (SSVS) program
on the vehicle side. These programs are designed to achieve long term goals, explor-
ing such advanced concepts as lighting systems to illuminate the road surface for a
distance ahead of the vehicle and warnings from approaching vehicles about the driv-
ing conditions on the road ahead. More information on these highway automation
programs in Japan can be found in [54].

The Personal Vehicle System (PVS). a vision-based vehicle developed in the late
1980’s was able to make turns and avoid obstacles at a speed of 30 km/h. At present,
there are over 400,000 installed autonomous navigation systems and all major car
companies are offering navigation systems as part of their on—board television and
sound systems. These systems, however. do not offer dynamic route guidance incor—

porating live trafﬁc data. This information is provided through LED screens on the

highways.

2.4 Machine Vision for IVHS

Machine vision can be a powerful tool for solving some of the IVHS problems. In
the scope of the PROMETHEUS program, an entire subprogram (PRO-ART) was
designed to develop intelligent systems which could use computer vision tools to
analyze road scenes while driving on a highway. Any of the three main subprograms of

IVHS America, namely ATMS, ATIS, and AVCS can beneﬁt from the use of computer

30

vision techniques as can be seen from the few examples listed in this section.

In the domain of ATMS and ATIS which deal with trafﬁc surveillance and route
guidance, the cameras are stationary. either on the side of the road or above the road,
and the machine vision software detects the moving vehicles and analyzes their move—
ment. The simplest technique for evaluating the traffic condition consists of counting
the number of vehicles. The AutoscopeTM system developed at the University of
Minnesota [133] is designed to act as a loop detector. where the detection conﬁgura—
tion can be changed manually or dynamically by drawing lines on a graphics monitor
instead of physically embedding magnetic devices inside the road pavement. Yoshino
et al. [188] proposed a technique to count the number of vehicles in a tunnel where
the lighting condition is poor. Kelly [107] extensively tested IMPACTS, a system
developed at University College London to provide a qualitative description of the
spatial distribution of moving and stationary trafﬁc within a scene.

Some systems were designed to monitor the trafﬁc by tracking the moving vehicles,
analyzing the trafﬁc flow, and possibly detecting anomalies such as parked vehicles on
the side of the road and accidents [161]. Kilger [109] is able to track vehicles in dense
trafﬁc situations and compute the average speed of the vehicles and average trafﬁc
density. Sorensen and Christensen [165] are able to track all the moving objects in a
city trafﬁc scene and classify them as pedestrians, cars, vans, and trucks. Koller et
al. [113, 114] also analyze the movement of the moving vehicles inside a parking lot
or at intersections. In a later work, Koller et al. [115] track moving vehicles on the
highway and by evaluating the position and speed of all the vehicles, they are able to
identify stalled vehicles and vehicles which change lane.

The ATMS subprogram also deals with automatic toll collection which could uti-
lize computer vision techniques. Planet et al. [153] proposed a technique to recognize

the vehicle type when it comes to the toll booth so that the amount the driver has

 

31

to pay can be automatically computed.

For the purpose of AVCS which handles vehicle control for driver assistance, col—
lision warning, and road following, the camera is usually mounted on the vehicle.
The camera is pointed at the road to identify the various objects the way a hu-
man driver would. Road detection and road following have been extensively studied
[69, 177, 178]. Collision avoidance is also receiving some interest [68, 184]. Recently,
the tasks of lane detection for road following and obstacle detection have been com—
bined [11, 25, 95]. The detection of vehicles behind or passing the vehicle on which
the camera is mounted has been given some attention [51, 84]. For adaptive cruise
control, researchers are also interested in implementing systems that are able to follow
another vehicle [95, 106, 160].

The AVCS program is also concerned with the detection of landmarks on the
road. Zheng et al. [191] proposed a system which could recognize landmarks such as
buildings, groups of trees, or lamp posts so that the vehicle can appropriately change
its route to reach its ﬁnal destination. The vehicle also must recognize trafﬁc signs to
be able to stop at a Stop sign or a red trafﬁc light, respect the speed limit, yield to
pedestrians, etc. Researchers have spent a substantial amount of effort in road sign
detection and recognition [55, 152, 154, 192], including road marking such as arrows
[5].

The most impressive systems which integrate all these features are the vehicles
VITA II [180] developed by Daimler-Benz which has 18 cameras and VaMoRs-P [40]
developed by Universitat Bundeswehr Miinchen which has 6 cameras. Both vehicles
have been driven extensively over roads in Germany and other western European
countries and can reach highway speeds of 130 km/ h. The systems include vehicle
control (braking, acceleration, and steering), road detection and tracking (forward and

backward), obstacle avoidance, lateral object detection, vision bumper, and trafﬁc

32

sign recognition. The ProLab II vehicle [67] was (.leveloped by a group of French
universities and research institutions as a co-pilot system. The vehicle is equipped
with 8 cameras and performs a number of perception tasks such as road detection.

trafﬁc sign recognition, intersection detection, and obstacle detection to alarm, advise,

or assist the driver.

2.5 Travel time and Origin/Destination Matrices

Estimation

Let us assume that two cameras are placed on the side of a road at different locations in
a road network (see Figure 2.1). An object matching system as described in Chapter 1
would be able to extract the moving vehicles from the two image sequences observed
by the two cameras and determine correspondences between the vehicles based on
their color and shape. More precisely, when a vehicle is sensed by the second camera,
it can be matched against the database of vehicles sensed by the ﬁrst camera to
determine if the same vehicle ever passed in front of the ﬁrst camera. This strategy
will not work if two identical vehicles (same color and model) are in the road network
at the same time.

By recording the time at which each vehicle is observed, we can determine the time
it takes for a vehicle to travel between the two observation points. Current techniques
for estimating travel time through a network include the use of a “floating car” and
computing the “average speed”. These techniques are both labor intensive and suffer
from insufﬁcient amount of data. A vehicle and its driver can only collect one sample
of the travel time on each run. If the network is large, or the travel speed is slow,

this can result in only one observation in the time period of interest (rush hour) per

33

 

 

[ Object Matching System ]

 

 

 

 

Origin/Destination

[ Travel Time Estimation
Matrices Estimation

 

 

 

 

Figure 2.1: Estimation of average travel time and origin/destination matrices using
an object matching system.

day. Thus, a statistical analysis of changes in the travel time due to a change in
trafﬁc signal timing may not be possible because of the limited data set. The vehicle
matching system can provide a statistical estimate of travel time information after a
number of vehicles have been matched. This information can then be used by ATMS
to optimize the trafﬁc flow by adaptively changing trafﬁc signal timings. Real-time
travel time information can also be used by ATIS to locate congested areas in the
road network in real-time and alert information centers which will themselves warn
the drivers to avoid these areas.

The vehicle matching system can also be used to simply count the number of cars
that go from one point to another in a road network. This vehicle count can be used
off—line to estimate origin-destination matrices which are helpful in evaluating trafﬁc
patterns. Several recent studies have looked at algorithms to estimate an origin-

destination matrix using link trafﬁc counts [138, 37]. However, no solution has been

:54

found that is both accurate and non-data intensive. The vehicle matching system has
the potential of providing data from each entry point to all exit. points in the network
simultaneously, resulting in the origin-destination matrix. While this would require
multiple cameras, it would not be labor intensive for either data collection or data
analysis. The current road infrastructure is based on data that were collected over 30
years ago. Origin—destination matrices could be used to update this infrastructure by
building more roads where needed, enlarging already existing highways, or modifying
entry and exit points on the highway.

In the past, researchers have tried to match license plates instead of the entire
vehicle [135, 173]. This technique presents the advantage that a license plate number is
unique. Thus, two red Corvettes will not be mistakenly matched if they have different
plate numbers. However, this approach presents a number of difﬁculties. If the data
are manually collected, this technique requires observers stationed at each point of
interest. If the data are automatically collected, the camera needs to be focused on
the license plate, and the license plate has to be large enough in the sensed image
so that individual characters can be recognized. These constraints are difﬁcult to
overcome in dense trafﬁc situations, especially when the vehicles are moving at high
speed. In Japan, a system that estimates the average travel time using license plate
matching is already in place; a list of average travel times between various two points
in the network is displayed at different locations on the highway. Tanaka et al. [173]

report that there is usually a 1 to 5 minutes difference between the displayed time and

the actual driving time. Drivers were asked to answer a questionnaire and compare
the displayed time with the time it actually took to travel from one point to another.
Almost 88% of the people who answered the poll qualiﬁed the displayed time as
3C C Urate or almost accurate. However, this system takes advantage of the simplicity

of. t he Japanese license plate numbers which are composed of 4 unique digits. This

3 5

means that the character recognition system needs to handle only 10 classes (the
digits from 0 to 9). In the United States. such a system would have to distinguish
between 36 classes (the digits from 0 to 9 and the letters from A to Z). Miller et al.
[135] mention that their system still needs human intervention because the American
legislature does not require a speciﬁc pattern of letters, digits, and colors on a license
plate, making the character recognition task even more difﬁcult. The vehicle matching
system presents an alternative solution for estimating origin-destination matrices and

average travel time between two points in a road network.

2.6 Limitations of Machine Vision Systems

Machine vision systems have a number of limitations, especially in uncontrolled envi-
ronments. In outdoor scenes, the illumination cannot be controlled the way it might
be inside a laboratory, resulting in unpredictable shadows and highlights. An ideal
machine vision system for IVHS should be able to perform equally well in any type of
weather: bright sunshine (producing sharp shadows), cloudy skies (producing more
diffuse illumination and almost no shadows), or rain (adding “rain texture” to the
image). It should also be able to handle night-time vision. Unfortunately, such a
system does not exist yet. Most of the published machine vision systems dealing
with IVHS can handle day light situations only and a number of them are not robust
to strong shadows. The vision systems which handle night-time situations rely on
the detection of the headlights or taillights of the vehicles. But, these night—vision
systems cannot be used during the day time, because the necessary features could
not be extracted from the day—time images. Thus, for a machine vision system for
IVHS domain to work round the clock, a higher—level system is needed to analyze the

current illumination conditions so that the appropriate vision algorithm is employed.

36

Researchers in the computer (machine) vision community are motivated and chal—
lenged by the performance of the biological vision system: humans can interpret and
understand the outside world easily and rapidly based on visual cues. However. more
than 40 years of research in computer vision has shown that general machine vision
systems are very difﬁcult to build. Even for simple tasks, machine vision systems
require complex and compute-intensive processing. The computer vision community
often tends to ignore simpler solutions to a given problem that would not involve
machine vision techniques. A number of object recognition problems in speciﬁc ap-
plications could be solved using cheaper and more robust sensing and processing
methods.

Let us consider the application of object matching in the IVHS domain (see Sec-
tion 2.5). The computer vision approach consists of locating the moving vehicle,
extracting the appropriate features (color, shape, license plate, etc.), and matching
these features against similar features extracted from other vehicles. All this process-
ing is very easily performed by a human, but is very difﬁcult for a machine vision
system. The only advantage that a computer has over a human being is that, when
the number of vehicles to be compared gets large, a computer is able to remember
all the previous vehicles, whereas a human has to frequently refresh his/her memory
and examine the different vehicles.

It is important to note that our problem deﬁnition of object matching in Sec—
tion 1.2 does not assume any speciﬁc sensing modality. It simply states that after
the objects of interest have been extracted from the observed scenes, a matching
strategy is needed to decide if the two objects are identical based on their features
(see Figure 1.5). In our matching scheme, we have decided to use a color camera to
sense the environment and use motion, color, edges, and vehicle model information

to segment the vehicles of interest from the background. Then the matching strategy

37

utilizes two types of features. namely color and shape (edges). Other sensing schemes
could also be used to sense the images. in which case. the segmentation and matching
strategies are likely to be different. Different sensing modalities produce different
types of signatures. For example, a range sensor could be used to sense the depth
information (surface geometry) of the vehicle. Such sensors based on structured light
and laser radar are popular in target recognition, object recognition, and inspection
applications [29]. However, the use of active sensing in outdoor environments may not
give us sufﬁcient depth resolution to accurately distinguish between different types
of vehicles. The problems of specular reflections and safety issues associated with
the use of laser beams will also have to be taken into account. Nevertheless, if the
matching system was provided with reliable 2% range data, the segmentation prob-
lem might become easier (vehicles in the lane closest to the camera would have lower
depth values than the vehicles in the other lanes) and the matching system would be
less sensitive to ambient lighting conditions. However, the use of structured light or
laser radar will deﬁnitely increase the overall cost of the system.

Nevertheless, there does exist a simpler, cheaper, and more robust solution to the
problem of identifying individual vehicles. or as a matter of fact any type of object
(people, cows, doors in a hallway, etc.). This solution uses transponders. Transpon-
ders are devices that communicate using electromagnetic waves. Using transponders
requires placing a “tag” on the object to be identiﬁed, placing a reader at a speciﬁc
location, and using an antenna for communication. In read-only systems, upon the
reader’s request, the antenna transmits an electromagnetic signal that is used to ener-
gize the tag to enable it to transmit its identiﬁcation code. The reader can also store
information in read/ write tags. The frequency of the electromagnetic signal varies
from low (KHz), near microwave (915 MHz), to microwave (2.45 GHz) depending on

the application and the distance between the tag and the reader. Transponders are

38

active systems and issues related to wave propagation and interference need to be
addressed when designing a speciﬁc system.

Transponders could be used to solve the IVHS problem being considered in this
thesis. If all the vehicles were equipped with a tag with a unique identiﬁcation number,
readers could be placed at the points of interest in the road network, and the vehicles
passing in front of these readers would be automatically identiﬁed. Transponders are
already used to measure length and weight of trucks on highways [52]. It has been
shown that there is no need for the drivers to slow down for these measurement to
be accurate [52]. Transponders have also been used to count laps in the Indianapolis
500 circuit. Regarding other IVHS related problems, transponders could also be used
to automatically pay toll or automatically open parking lot gates.

Vehicle manufacturers in North America have not yet agreed to install transpon-
ders in noncommercial vehicles. They argue that it would violate individual privacy
since control centers and other government agencies (e.g., police) could determine the
location of a particular vehicle at any given time. One way to counter this argument
is to use more expensive writable tags. For example, to compute the average travel
time between two points in a road network, the reader at the ﬁrst location could write
a number in the tag of a particular vehicle which would then be read by the reader
at the second location. If the transponder is not used for data collection, it would
carry no useful information. Of course, a careful strategy would have to be devised
to avoid overwriting important information in a tag.

Until vehicle manufacturers agree to equip all vehicles with transponders and until
a sufﬁcient number of such vehicles are in use on roads, machine vision is the only

viable approach to automatically identify vehicles on the road.

Chapter 3

Image Segmentation

In this chapter, we address the ﬁrst requirement of the object matching system,
namely the image segmentation problem. As mentioned in Section 1.2, we need to
isolate the object of interest, namely the vehicle in the lane closest to the camera
from the stationary background and the other moving vehicles.

The process of image segmentation consists of dividing the image into nonoverlap-
ping homogeneous regions whose union is the entire image [87]. Let X = {(23, y), :r =
1,... ,N, = 1,. . . , 1M} be the domain of the image and let I(.I:. y) be the brightness
function deﬁned on X. A segmentation of X is a partition of X into subsets S,,

i = 1,... ,m for some m such that:
(a) X = Uin=15i
(b) SiﬂSj =0 Vi,j
(c) P(S,) = TRUE Vi

(d) P(S.-US.) =FALSE Viyéj

39

40

where the predicate P(S) is defined as

TRIIE if 3 (151-”11 that [III-ill — "l S (f. \7’ (any) 6 S

FA LS E otherwise

This predicate partitions an image based on uniformity of gray values. Other predi-
cates can similarly be deﬁned which utilize color and/or texture properties.

The homogeneous regions should be meaningful in the given application. For
example, for two—dimensional object recognition. the objects of interest need to be
separated from the background. whereas for natural scene analysis, image segmenta-
tion should result in regions corresponding to the sky, the trees, the roads, etc. A

meaningful image segmentation is based on the following general guidelines:

0 The regions resulting from an image segmentation should be uniform and homo-
geneous with respect to some characteristics, such as grey level, color, texture,
or motion.

0 Based on the same characteristics, adjacent regions should be sufﬁciently dif—
ferent.

o The boundaries of the regions should be simple and (spatially) accurately lo-
calized.

o The interior of the regions should also be simple in the sense that they should

not contain many holes.

Some segmentation techniques have been designed to handle a variety of discrim-
inatory characteristics (or features) while other techniques have been designed to
handle only a speciﬁc attribute such as color, texture, or motion. General image
segmentation techniques are based on clustering, region homogeneity, and boundary

detection.

41

Clustering techniques find clusters in a feature space such that each cluster rep-
resents a homogeneous region in the image. (f'lustering techniques are extensively
described in [96] and any one of the large number of partitional clustering techniques
can be used for image segmentation. The Euclidean distance metric is commonly
used to ﬁnd pixels which are sufﬁciently similar to each other to deﬁne a. cluster in
the feature space.

The set of features to be used by the clustering algorithm needs to be deﬁned
based on the given image segmentation task. Usually, the set of features that will
produce the best segmentation results is determined empirically.

Spatial information can also be taken into account in the clustering by specifying
the pixel coordinates as two additional features [97, 99]. Also, different features can
be assigned different weights to emphasize the importance of some features. When
using clustering techniques, one has to specify the number of clusters (or regions) to
be found in the image. Clustering programs usually provide the user with cluster
validity statistics that should be analyzed to determine the best clustering.

Region-based segmentation techniques can be divided into three categories: region
growing, region splitting, and split-and-merge.

Region growing techniques start with individual pixels or sets of 4 adjacent pixels
and group them to form regions. Additional neighboring pixels can be merged to the
initial seeds by a variety of methods. In single-linkage region growing, two pixels are
linked by an are if they are neighbors and if they have similar feature values. Pixels
linked together are assigned to the same region. The advantage of this technique is
that the resulting segmented image consists of connected regions. The main disad-
vantage of this technique is that it takes only a single are joining one region and a
neighboring one to merge the two regions. To avoid this problem, centroid—linkage

region growing can be used. It should be emphasized that the segmentation result

~12

depends on the order in which the pixels are scanned. and that region growing is very
sensitive to noise.

Region splitting techniqut-‘s start with the entire image and split it until homoge-
neous regions are formed. These methods work well for extracting global structures,
but are weak for detecting fine details in the image.

The split-and-merge technique developed by llorowitz and Pavlidis [87] uses both
a splitting step and a merging step in a quad-tree implementation. The image is
ﬁrst divided into square subimages. Subimages are then split. into 4 children subim-
ages if they do not satisfy a homogeneity criterion. If the four subimages have the
same parent and satisfy a. homogeneity criterion. then they are merged into a single
large subimage. A grouping process is then applied to eliminate the arbitrary region
boundaries imposed by the initialization step.

The split-and-merge algorithm was ﬁrst designed to handle image segmentation
based on gray level features. Let I)?” and If“ denote the maximum and minimum
gray level value in a subimage Xk. The criterion for splitting subimage X1, into 4

subimages is

(1,?“ —1,:"‘")> e (3.1)

where e is a given threshold value. The criterion for merging the 4 subimages Xkl,

max max max max ' min min min min

Edge-based segmentation techniques identify the boundaries between homoge-
neous regions in the image. Once the edge points have been extracted, the regions
consist of the pixels inside the contours. See [104, 158] for surveys on edge detec-

tion. A commonly encountered difﬁculty with this method is that the contours are

-13

often not closed and some additional processing has to be performed to identify the
homogeneous regions.

This chapter is divided into two sections. each describing a specific image seg-
mentation algorithm. The first section describes motion segmentation techniques
including a new approach based on frame subtraction using a 3-frame sequence. The
second section describes color spaces. color segmentation. color constancy and pro—
poses a color segmentation algorithm which combines the regions extracted by a
split-and-merge color segmentation and edge information obtained by the Canny edge

detector.

3. 1 Motion Segmentation

Motion detection and estimation have received increasing attention from computer
vision scientists during the past ﬁfteen years [139, 1410]. Numerous vision applications
can beneﬁt from the use of motion cue. Mobile robots or autonomous vehicles need
to locate the sides of the road and avoid obstacles during the course of navigation.
In military applications, one might want to automatically track and compute the
trajectory of a given target. Medical doctors would like to automatically detect the
moving organs, estimate their volumes, and possibly identify some anomalies.
Motion analysis can be a powerful tool for image segmentation, particularly in
complex scenes. Motion segmentation consists of extracting the moving object of
interest from its background (stationary or moving). Most commonly used techniques
are based on image subtraction and optical flow computation. We will describe these

techniques before presenting our approach and experimental results.

~14

3.1.1 Image subtraction techniques

Image subtraction is a simple technique that can extract a moving object. in a sta—
tionary background, but it cannot handle a moving background nor can it. estimate
the motion parameters (translation and rotation or velocity). The basic philosophy
of image subtraction is that the only gray level changes in an image occur at the

location of the moving object. Several variants of image subtraction are given below.

Subtraction from a reference frame

The current frame containing moving objects can be subtracted from a reference frame
of the same scene known to be static (without any moving objects). The problem
with this method is that outdoor scenes are never quite static (the wind blows the tree
branches and leaves, the position of the sun produces varying amount of shadows),
implying that frequent updates of the reference frame should be made.

This technique was used by Sethi and Brillhart [161] to detect moving vehicles on
a street and identify non-conforming behavior such as a vehicle parked on the side of
the road. But, they do not mention the method that was used to update the reference
frame. Shio and Sklansky [163] also used this technique to segment people crossing
a street. They obtain a background image at time t by mode ﬁltering in the time
domain, i.e., the most frequent value, over a time period, at a given pixel is used
in the background image. Kilger [108] updates the background image continuously
by masking the regions of motion using a Kalman ﬁltering approach in the following

manner:

B(+1= Bf + [01(1 — III!) + OQII’LHIt — By), (3.3)

where I is the input frame, B is the background image, M is the binary mask cor-

responding to the moving objects, and a1 and a; are updated based on an estimate

of the rate of change of the background. A similar approach is used by Sorensen and

Christensen [I65].

Subtraction using two frames

A difference image can also be obtained by subtracting two frames in the image
sequence. The background is not expected to change drastically in the short period
of time between two successive frames. Moving areas in the scene are highlighted
when the difference image is thresholded.

Jain and Nagel [102] accumulated difference images to compute masks of the
moving objects. Their technique is quite restrictive since it assumes that the object
has moved at least its length in the two frames. This implies that the object is small
enough to be contained in the ﬁrst frame and in the last frame after it has moved its
length. This technique was then enhanced by Jain et al. [101] to handle small amounts
of object motion by using only two consecutive frames. A different method based on
the same idea was introduced by Yalamanchili et al. [187]. After thresholding the
difference image, pixels that belong to the background are “off”, so are pixels that
belong to part of the object that overlaps between the two frames. Pixels will be
“on” where the object is subtracted from the background and vice—versa. These four
types of regions are used to decide which pixel belongs to the moving object. This
technique assumes that the objects are uniform in gray level and strong edges can be

extracted between the object. and the background.

Subtraction using three frames

Hsu et al. [88] proposed a method using three frames instead of two. Using three con-
secutive frames, two interframe differences are computed. They are combined using a

logical OR operator. Interframe differences are not computed using image subtraction

l6

and thresholding, but using the likelihood ratio introduced by Yakimovsky [186].

2
{fwd-NE + ( mp—m f’ 2
2 2

L:

 

(3.4)

LQC, x l‘P

where m and 'e denote the mean and variance of the gray values in a sample area in
the previous frame (P) and the current. frame (C). If the value of L, in a given area,
is greater than a given threshold, then this area is assumed to be part of a moving

object.

3.1.2 Gradient techniques

Horn and Schunck [86] established a motion constraint equation based on the assump-
tion that the gray level of any pixel in the image is constant over time. Therefore,
any change in gray level at a given pixel must be due to motion. Suppose a pixel
located at (:r,y) at time t moves to location (I + 6.1:. y + 6y) at time t + (it. Since the

intensity of the pixel should be constant,
l(r.y.t)= I(.r+6;r.y+6y.t+(ft). (3.5)

Expanding the intensity function by Taylor series, we obtain:

I(.r+6:r,y+6y,t+6t) = I(;r.y.t)+§ldx+,a—16y+2£
(.r y ()t

) 0 cit-l-liigher order terms. (3.6)

Neglecting the higher-order terms leads to the following motion constraint equation

81 01 01
5:611: + ady + f)?” — 0, (3.7)

which is conveniently written as
hu+hc+h=0. as)

where (u, r) = 6%» %’f) is the velocity at a given pixel (.r.g) in the image.

This equation is, of course. not sufficient to compute the velocity at a. point since
it only constrains the values of u and v to lie on a line in the velocity plane. Different
techniques have been used to combine a number of constraint equations into a system
with a unique solution. The advantage of techniques based on the motion constraint
equation is that they produce a dense flow ﬁeld. This flow ﬁeld can then be segmented

to extract the moving objects from the background. The main disadvantages are that

they assume a small displacement of the object between frames and have a tendency

to smooth the moving edges.

Local optimization approach

The local optimization technique assumes that neighboring pixels have the same ve—
locity. A total of N constraint equations from pixels in a local neighborhood are
grouped into a system with N equations and 2 unknowns (u and v). Theoretically,
N = 2 should be enough to solve the system, but, typically, one chooses N >> 2
to minimize the effect of measurement errors. The system of equations can then be
solved using a least squares method. This technique was used by Thompson and
Barnard [175] and Kearney et al. [105]. Thompson [174] used this technique in
Combination with contrast information to perform segmentation of the image into
background and moving objects. Peleg and Rom [148] used this technique to directly
estimate the motion parameters (translation and rotation coefﬁcients) by assuming

that the depth of the scene was known.

Global optimization approach

This approach assumes that the flow field should vary smoothly over the entire im—
age. Ilorn and Schunck [86] suggested minimizing the following energy function with

respect to the velocity field.
E = If (144+ 1.4, + 1,)? + 42(11er + Hmﬂdwy.
D

where /\ reflects the relative importance of the smoothness in the solution. The
problem with this technique is that the moving edges are smoothed in the process.
Nagel and Enkelmann [I41] introduced an oriented-smoothness assumption by slightly

modifying the above equation and were able to keep the moving edges sharp.

Clustering in the velocity space

The clustering approach consists of accumulating the constraint lines in a Hough
transform manner. Each constraint line is projected into the discretized velocity
plane, and contributes as one entry in each cell into which it falls. The cells with the
highest number of entries correspond to the velocities of the moving objects. More
details on this technique are given in the papers by Fennema and Thompson [61], and

Cafforio and Rocca [23].

Other approaches

Duncan and Chou [50] extracted moving contours (or temporal edges) by ﬁnding the
Zero crossings of the input image sequence convolved with the second-derivative w.r.t.
time of a temporal Gaussian smoothing function.

320
a2

 

St(:r,y,t)= (t)* [(17.y,t), (3.9)

where

 

(I'M): ' ex[)(—.s212). (3.10)

The moving contours are then combined with the motion constraint equation to
estimate the component of the optical flow that is normal to the moving contour.

A number of papers describe techniques which. instead of using the intensity field
in the motion constraint equation. use other dense matching primitives. Heeger [83]
uses the energies in the outputs of Cahor filters. Fleet and Jepson [6’2] and Weng

[1822] use phase information.

3.1.3 Feature-based techniques

The feature-based approach (or discrete approach) to estimating the optical ﬂow and
segmentng the moving object from the background consists of matching features ex-
tracted from two frames. Feature extraction and matching have been widely used for
computing disparity in a pair of stereo images as well as sequences of images pre-
senting motion. The correspondence problem is solved using features extracted from
the first frame and matching them to similar features extracted from the subsequent
frames. If the objects in the scene are subject to rigid motion, then features that oc-
cur close together with the same motion parameters can be grouped to form objects,
resulting in a segmentation of the image into moving objects and background.

Weng et al. [183] use edges and corners as image features to be matched from
one frame to the next. lx'oller ct al. [114] use local minima and maxima of the gray
level images. Their main contribution was the ability to compute the trajectory of
the moving object and characterize it using motion verbs. In a later paper, Koller
6t (II. [113] use edge segments to solve the correspondence problem. Zimmermann

and Kories [193] also use local minima and maxima in the image convolved with a

30

band-pass filter.

(_'umani rt (1!. [38] compute three features at each edge point and use them to
solve the correspondence problem. The features are the luminance representing the
value of the image at a contour point. the amplitude representing the total change
in luminance when crossing a contour. and the sharpness representing the deviation
of the luminance transition from the ideal step edge. They showed clearly how clus-
tering in the displacement space can be used to separate different. moving objects
with different motion parameters. Thompson and Pong [176] presented techniques
to handle different. types of features such as feature points. edges. or regions. The
motion parameters were checked using the rigid motion assumption.

The advantages of the feature—based techniques are that they can handle large
displacements between frames. the motion edges are not blurred, and they are partic-
ularly useful in the case of multiple moving objects. The main disadvantage is that
they produce a sparse flow fiech which makes it difficult to accurately segment the
moving object. In addition, the image must contain salient. features, usually implying
that it must be highly textured, and more importantly, the correspondence problem

between features needs to be solved.

3.1.4 Proposed method

This brief review on motion segmentation techniques shows that all the published
techniques have some disadvantages and cannot be directly applied to solve our mov-
ing object segmentation problem. In our application, the moving object of interest
is quite large in the image, so that only a few frames containing the entire moving
object are available in the image sequence. Also, since the objects are moving at

a relatively high speed, the displacement between frames is quite large. Gradient-

:31

based approaches for optical flow computation can only handle. small displacements.
Feature-based approaches to optical flow computation require solving the correspon-
dence problem between extracted features and produce a sparse flow field which
cannot be used to extract an accurate mask of the moving object. Image subtrac-
tion techniques using a reference frame (or background frame) use a large number
of frames that. are averaged to produce the reference frame. Image subtraction tech-
niques using two frames from an image sequence highlight the location of the moving
object in both frames. An approximate location of the moving object can be obtained
with better accuracy if three frames are used. It is quite obvious that motion seg-
mentation alone will not produce an accurate mask for the moving object. but a good
motion segmentation technique is capable of providing the approximate location of
the moving object.

The motion segmentation method proposed here is based on image subtraction.
Three frames are required for the computation; as long as the moving object is entirely
contained in each frame, they need not be consecutive frames in the image sequence.
The method locates the moving edges accurately as shown in Figure 3.1. Suppose
a ramp edge successively moves from a position in frame 1 (f1) to another position
in frame 3 (f3), then the difference between frames 1 and ‘2 will look like the solid
curve in Figure 3.1(b), and the difference between frames ‘2 and 3 will look like the
dashed curve. The only interval where both the frame differences are meaningful
(bold curve) is at the location of the ramp edge in the middle frame. If there is no
motion or if there is no strong edge at a given pixel, the difference value at that pixel
is not relevant.

This observation can be formulated mathematically as follows. Let us consider the
one-dimensional problem first: let 81(1'), 62(17), and 63(1) correspond to the intensity

Value function around the ramp edge in the first, second, and third frames, centered

fl f2 ’13—-
/
/
___/
(a)
Itl- {2| lfz" f3!
”__\
/ \
—— \\_——
(b)

Figure 3.1: A moving ramp edge: (a) positions in frames 1, ‘2, and 3; (1)) frame
differences and extraction of the moving edge.

at .rl. 1‘2, and 13, respectively: the thickness of the edge is 2Ar. Let a be the value

of the function before the ramp edge, b be the value of the function after the ramp

edge, and m = (b — (1)/231‘ be the slope of the ramp edge, then.

 

 

61(1?) 2]
f

62(1) 2‘]

83(2)) 2

a nggrl—Ax

Il—AJCSIgml‘I‘A-r

b $1+Ar§r<oo

a OSIﬂI-z—AJ:
m(r—.r2+Ar)+a .rg—AIS.rg.r2+AJ: (3-11)
b $2+A$§I<oo

a 033513—131:
m(;r—.r3+A;r)+a Ig—Axgrﬁm-I-Ax

b $3+Axgr<oo

Subtracting two edges at consecutive positions and taking the absolute value, we

f

 

0 0 3.1-Sn — A.1?
|m(.r — .rl + AM] .rl — Ar S .r 311+ _\.r
[61(.1')_ (2(1.” =< lb— ‘1] 1.1+ _\.r S .r S .r2 _ Ar
[b—m(.r-.r2+_Xr)—a[ .z'g—Arg.rs.r2+A.r
L0 .r2+_\.rSr<oc
[ (3.12)
0 031' Srg—Ar
Imfr — .1'2 + Aﬂl .1‘2 — Ax S .r 5 1'2 + A.1:
[62(I)—€3(I)l=< Ib—a] grz+Ar£.rS;r3—AJ:
[b—m(.r—.r3+A.r)—a[ .r3—AxgrSI3+Ax
k0 r3+ArSI<oo

 

Finally, if the two differences are multiplied. the only non-zero value is found at

the location of the edge in the second frame.

f
0 OS-rgxg—Aa:

lb " "If-'17 — 1’2 + AI) —— a]
lel($l—e2($)lxlezfl‘l-fsfl‘ll = <

 

x]m(I—;r2+AI)] J'g—AJ: 3.1: S$2+Ax
0 £2 + A.1: g :r < oo
(3.13)

It can be shown that the same results are obtained if the ramp edge does not move
its entire width of 2.3.x. In two dimensions, the problem is more difficult to formulate,
but the results are similar. Let us assume that an edge is thin and that its rotation
within the three frames under consideration is not too large. Then, the motion can
be described as shown in Figure 3.2. If the edges are projected onto a straight line,

the two—dimensional problem is equivalent to the one-dimensional problem where the

 

 

 

 

I‘ 2/
2”
\Q'\
\. /
\/
44— _/F _‘P-

 

 

 

 

 

 

 

Figure 3.2: A two-dimensional moving ramp edge.

edges have different slopes and different thicknesses in the three frames. It can be
shown that [61(X) — c2(x)| x [62(X) — c.3(x)] will be non-zero only in the interval
[x2 — AXg, x2 + Axg].

The algorithm for motion segmentation can be stated as follows. Three frames
f1, f2, and f3 are extracted at times t1, t2 (t1 + At), and t3 (t1 + 2At), respectively
from the image sequence. Each frame is a 256x256 image with 256 gray levels. They
are blurred using a 3x3 averaging mask to ensure that all edges are ramp edges to
produce the images f1, f2, and [3.

The difference image between frames 1, 2, and 3 is computed as follows:

130.1) = the.» — fztml x lfzf'iJ) — .f3(i,j)| (3.14)

where ‘x ’ denotes the multiplication and M denotes the absolute value operator. The
value of D(i,j) is low if the value of the pixel at position (1,j) has not significantly
changed in the three frames. This difference image highlights the pixels around the
edges of the moving objects.

This difference image is thresholded to keep only locations of signiﬁcant change. If

any pixel of the difference image is above a high threshold tgigh, then it is immediately

output. as is the entire connected region C that contains the point. and lies above a

low threshold I'd”. The resulting thresholded image 'D, is then defined by:

. .. l if’D(i.j)>/]{°W and(Lj)Ethere3(k.l)€Cs.t. P(k,[)>t2ish
thlvjl =
0 otherwise.

(3.15)

The thresholded image D, is used to produce a motion mask AA, such that
Jl/ls(i,j) = 1 if pixel (i,j) belongs to the moving object and 4M3(i.j) = 0, other-
wise. This mask is obtained by deleting isolated “on” pixels in ”D, and grouping the
remaining “on” pixels into a single blob through a Chamfering technique which is
equivalent to a morphological “close" operation.

Figure 3.3 outlines each step of our motion segmentation algorithm using 3 consec-
utive frames from an image sequence containing a moving vehicle. It can be noticed
that the background in this image contains several objects that have common bound-
aries with the moving vehicle. The mask boundary has been overlaid on the middle
frame in Figure 3.3(e) to show that even though the mask shape is not exact, mo-
tion segmentation performs reasonably well in providing an approximate mask at the
location of the moving object.

The algorithm is summarized in Figure 3.4. The CPU time requirements on a
Sparcstation 20 are the following: ‘2 s to blur the 3 frames. 0.5 s to compute and
threshold the difference image, and 4.5 s to derive the motion segmentation mask.

Thus. the total computation time is T 3.

3.1.5 Experimental results

The motion segmentation algorithm was tested on more than 400 image sequences of

moving vehicles and other moving objects in indoor and outdoor scenes. Figure 3.5

 

 

 

 

 

 

 

 

 

 

 

 

 

(8) (h)
Figure 3.3: Steps in the motion segmentation algorithm: (a) frame 1; (b) frame 2;
(c) frame 3; (d) difference image; (e) thresholded difference image; (f) signiﬁcant
connected components in the difference image; (g) motion segmentation mask; (h)
outline of the mask overlaid on frame 2.

Q!

 

Input: 3—frame sequence f1, f2. f3
begin
Blur the 3 frames to obtain f1, f2, f3 using a 3X3 averaging mask
/* Compute the difference image */ .
DUJ) = Ifim) — M231)! x 1mm — mum
/* Threshold the difference image */
Lfvtm) > 0.001 mg; was = 1 gls_e mm = 0
Find the connected components
fo_r each connected component C
i_f ,B (k,l) E C such that D(i,j) > 0.1
gm Erase C from D,
endfor
Grow the connected components using a Chamfering technique until they
form a single connected component
Fill in the gaps by scanning the image left to right and top to bottom
Shrink the connected components to its original size
gig
Output: Motion segmentation mask A4,

 

 

 

 

Figure 3.4: Motion segmentation algorithm.

shows some results of the motion segmentation algorithm on moving vehicles. The
ﬁgure shows the three consecutive frames and the motion segmentation mask overlaid
on the middle frame for all the examples. It can be seen from these images that the
stationary background is quite complicated, but the motion segmentation algorithm
can extract the moving objects reasonably well. Figure 3.6 shows some segmentation
results on other moving objects, namely a walking person, a mobile robot, an arm,
and a bicyclist. Again, the background is relatively complicated, especially in the
case of the bicyclist image. but the motion segmentation algorithm performs very

well.

 

Figure 3.5: Results of the motion segmentation algorithm on some moving vehicles.
Each row in this ﬁgure shows the three frames of an image sequence, followed by the
segmentation result.

 

Figure 3.6: Results of the motion segmentation algorithm on some indoor and outdoor
moving objects. Each row in this ﬁgure shows the three frames of an image sequence.
followed by the segmentation result.

60
3.2 Color Segmentation

The perception of color by humans is a phenomenon that is both physical and psy—
chological as it includes the interaction of the light with the surfaces of the objects
and the response of the brain to the complex stimulus perceived by the eye. Color
information seems to be crucial to the understanding of the outside world for the
purpose of detection and discrimination. The detection of an object is easier if the
color of the object is known and it is distinct from the background. This is the reason
why many sporting equipment manufacturers produce orange golf balls that are easy
to see on the green fairway. The importance of color is also found in nature as many
animals use color to attract a mate, while others adapt their color to appear invisible
in the foliage. Color also helps in distinguishing various objects in the environment.
For example, we can easily identify a taxi cab in the busy streets of New York by
its dominant yellow color. The use of color in computer image understanding can
improve the process of detection and discrimination. As a result, the analysis of color
images has received considerable attention for the past 20 years.

This section is divided into five parts. The ﬁrst part presents the color spaces that
can be used to represent color for the purpose of image processing. The second part
surveys a number of well-known color image segmentation algorithms for the purpose
of image understanding. The third part describes the process of color constancy
and briefly examines some of the algorithms presented in the literature. The fourth
part describes our approach to color segmentation, a fusion of split-and-merge region

segmentation and edge detection. Finally, we will present some experimental results.

Off: .

.‘DPfl Vin
. > n _A

.F‘Spgm

I
- I
‘w ..

'HIL :m

I,
t IQ ‘
Uri
{_ﬂ

Enter 1,

s. ‘
U ',I3
u-Jl'I‘id:

fl [“l.’ .
‘J i!

61

>

 

     
 

 

 

>3

IE M‘ M

'2

8

2

‘5

E

m l

400 500 600 700

Wavelength (nm)

Figure 3.7: Spectral responses of the human color sensors.

3.2.1 Color spaces

The perception of color in human vision is related to the response of the color sensors
(cones) at the back of the retina to different wavelengths in the visible electromagnetic
spectrum (400 to 700 nanometers). The Young-Hemholtz theory states that normal
human vision is trichromatic, which means that the combination of three types of
sensors produces the sensation of color. The relative responses of these three sensors
as a function of the wavelength A are shown in Figure 3.7. Note that as each sensor
responds to a range of wavelengths, they are referred to as the long (L), medium (M),
and short (S) wave sensors. The measurement at a point on an object surface for
a given sensor depends on the spectral composition of the illumination source f()\),
the reﬂectance function of the surface r()\), and the relative sensitivity of the sensor
h(A).

R = /f(/\)r(/\)h(/\)d/\ (3.16)

The need for representing and displaying colors has yielded a relatively large
number of different three-dimensional color spaces which are not necessarily related

to human color perception. These systems, described below, are also presented in

[6, 125, 36].

62

The [26' B space

The RGB (Red, Green. and Blue) system is based on three colors well-defined by
the three monochromatic lights of wavelengths 700 nm (red). 546.1 nm (green), and
435.8 nm (blue). An intuitive way of visualizing the colors from the RGB space is
to view the measurements as Cartesian coordinates. A chromaticity diagram can be
obtained by normalizing the three sensor measurements

R G B _,

=__—._—. :-——‘ b=__—_ .-
f R+G+B g R+G+B R+G+B Oh)

and plotting the perceived color as a function of any two.

The XYZ space

In 1931 the C.I.E. (Commission Internationale de I’Eclairage) adopted a standard for
color coding known as the XYZ color space. The XYZ coordinates can be obtained

from the RGB values using the linear transformation [145]:

X 0.618 0.177 0.205 R
Y = 0.299 0.587 0.114 G (3.18)
Z 0.000 0.056 0.944 B

The H S I space

The HS] (Hue, Saturation, and Intensity) space depicted in Figure 3.8 is deﬁned as
follows. The hue H is explained in every day language by color adjectives such as
red, yellow, or blue-green and corresponds to the notion of color. The saturation 5
measures the percentage of white in a color. It describes the difference between red
which is saturated and pink which is desaturated. The intensity is equivalent to the
gray level in a black and white image. It is the amount of light reﬂected by a given

surface. The H S I coordinates can be obtained from the RGB color components as

 

 

Figure 3.8: The HS] color space.

follows.

 

 

mG—B) q Immaaa) __R+G+B
QR—G—B’ ‘ R+G+B’ _ 3

H = arctan

The H51 space is close to the human description of colors. For example, light
yellow corresponds to (R = 140.6 2 140. B = 30,11 2 180.5 2 180, I —_— 205) while
dark yellow corresponds to (R = 100.6 2 100. B = 5.H = 180.5 = 180,] = 105).
Note that all the three coordinates R. G, and B are modiﬁed in the RGB space
whereas only I is modiﬁed in the [[5] space to differentiate between light yellow and
dark yellow. Thus, as the adjective “yellow” describes both the colors, the hue value
is the same for both the colors. Only the intensity value is changed to account for

the two adjectives “light” and “dark".

The opponent processes space

This space is also close to the human perception of colors, but it uses Cartesian

rather than cylindrical coordinates. Hurvich and Jameson [91] argue that this system

64
has neurological support and is in fact the way human beings represent colors. The

three coordinates Red-(Ireen (Ii—(i), \"<?llo\v-B111e (Y—BI). and White—Black (W-Bk)

are deﬁned as:
R—G = R, —— 26' + B. Y-Bl = 213 — It — C. W-Bk = If + G + B (3.20)

The YIQ space

The YIQ (Luminance, In-phase. and Quadrature) system constitutes the NTSC (Na-
tional Television System Committee) transmission system. It was deﬁned to minimize
the bandwidth of the signal to be transmitted. The YIQ coordinates can be obtained
by applying a linear transformation to the RGB original values.

Y 0.299 0.587 0.114 R
I = 0.596 —0.274 —0.322 G (3.21)
Q 0.211 —0.523 0.312 B

The Lab space

The Lab space was designed to agree with the Munsell color system. The Euclidean
distances between measurements in the space are directly proportional to the dif-
ferences in color as seen by the human visual system. The Lab coordinates can be

obtained from the RGB coordinates by:

f

0

X 1/3 Y 1/3 3/1/3 Z 1/3
() —<—> ,.._.. (—.> -<—> -
a 5 [x0 Y0 ] 00] r0 20 , (322)

Where X0, Y0, and Z0 are the values for the reference white.

3; 1/3
L = 25 (100—) - 16,

 

The L" V'l'l" space

The I..-"'V'IJIJ" space was designed to obtain a color solid for which unit shifts in
luminance and chrominance are uniformly perceptible. The f")"ll" coordinates can

be obtained from the KGB coordinates by:

. Y ”3
I = 13W'(u — (to), v- : 1:31;»"(6 — 60), w- : 25 (100?) — 16. (8.23)

0

_ ‘ — I — .—4‘.\—'_—_ >9 — _—6Y__
where no — 0.199, to — 0.308, a — .\'+15Y+32‘ and L — X+15Y+3Z'

The [11213 space

The [11213 space was deﬁned by Ohta et al. [145] using a set of eight images. At each
step of the recursive region splitting algorithm for color segmentation (see section
3.2.2), they applied a I{arhune11-Loe\Je transform to ﬁnd the best color features for

the current segmentation task. The features, obtained from the RGB tristimuli by:

 

 

Ilzﬂigﬂ. RTE. 13:20—34? (3.24)
appeared most often in that order. This means that, statistically, this coordinate
system is the best color space to account for color differences and suitable for seg-
mentation.

All these different color spaces have been used for image segmentation. The most
commonly used spaces are the RGB space because it is conveniently available at
the outputs of color cameras and digitizing systems. and the HS] space because of
its relationship to the human description of colors. The other color representation
SChemes are not used as often. There is no optimal color space and it seems that the

Choice of the best color space should be tied to the choice of the image segmentation

6 6

(or other processing) algorithm.

3.2.2 Color Image Segmentation

Here, we are concerned with color segmentation which divides the image into regions
based on the color attributes of the pixels. Thus. in the following sections, the most

commonly used techniques for color image segmentation will be described.

Clustering techniques

Any combination of the color coordinates deﬁned in Section 3.2.1 can be input to a
clustering program as feature values for every pixel in the image. Usually, the set of

features that will produce the best segmentation results is determined empirically.

Region-based segmentation

The split-and-merge algorithm developed by Horowitz and Pavlidis [87] for intensity
images was modiﬁed by Coutance [36] to perform color segmentation using the original
tristimuli values RGB. The decisions for splitting and merging are based on the
averages, m, and variances, v, of the pixel values for each of the color components.

For a subimage X). containing 72;, pixels, they are deﬁned as:

1
——Z R(,$y) m6,_g— Z C(a: y), (325)
Tl): (1' y) IEXk nk (r.y)€Xk
1
kl$~y16Xk
1 1 2
UR). = 71—- 2 (3(339) mRk) a 00k _ _ 2 (0(1) yl _ 77101:)» (3 26)
k (Ivy)exk 71k (1‘ tile)”:
1
ka — _ 2: (8(1), y) mBk)2

where R(.r.y), (_i(.r.y). and [3(.r.y) are the pixel values in the red. green. and blue

image, respectively. The “distance" between two subimages X,- and X) is deﬁned as

 

(I(.\','. X1) : [mp' — 771le + [IIIG‘ — mg) + [7723, — mBj . (3.27)

During the splitting step. a subimage X). will be divided into four subimages X)“,

Ar)”, Ar)”, and _Yk‘ if

(1) {123k + 1er + 193,} > s,.. and (3.28)

(2) 3746 [1.4151 d(.\',.,,.\',.,) > 3m,

where 3,, and sm are predetermined threshold values. Equivalently, during the merging
step, four subimages X)“, X)”, Xk3. and X),4 will be merged into a single subimage

Xk if

(1) {ka + va + lin} S 'SU (329)

(2) \7, l,j E [1.4] (“er ,.\’k]) S 8m.

Recursive region splitting using histograms

The recursive region splitting algorithm was developed by Ohlander [144] speciﬁcally
for color segmentation. The basic idea of the algorithm (see Figure 3.9) is the follow-
ing. The image is divided into connected regions using threshold values determined
by ﬁnding the best peaks in a set of feature histograms. The subregions are then
further segmented if necessary.

The main steps of the algorithm are given below:

Step 0: A mask corresponding to the entire image is placed at the bottom of a stack.

68

Histograms

Region mask

Giginnl mmk Original image

 

 

Picture slack
Thresholdmg

Figure 3.9: Ohlanderis color segmentation algorithm.

Step 1: One mask is taken from the top of the stack. Let 8 denote the region repre-
sented by the mask (the region without hatching in Figure 3.9).

Step 2: Histograms of color features are computed for the region S. Ohlander used
9 different features: R, G, and B; H, 3, and I; and Y, I, and Q.

Step 3: If any histogram shows a dominant peak, a pair of threshold values is de-
termined at the positions of the associated valleys. The image of the color
feature corresponding to that histogram is thresholded. The region S is thus
partitioned.

Step 4: Connected regions are extracted. For each connected region, a region mask
is generated and pushed down on the stack.

The process is repeated until the stack becomes empty.

This algorithm has some shortcomings. The ﬁrst problem is related to the non-
linearity of the transformations used to compute the hue, H, and the saturation, S.
The saturation is unstable if the intensity value is small, and the hue is unstable if

the saturation is small. Thus. one has to be very careful in using these color features.

69

The second problem is that since histograms are used to determine threshold values,
small homogeneous color regions cannot be detected. Thus textured regions in the
image are apt to be divided into a large number of meaningless segments.

Ohta cf (11. [1-15] enhanced the algorithm proposed by Ohlander to try to solve
these two problems. lising the Karhunen-Loeve transformation. they determined that
the best color features were I), 12, and [3 as described in Section 3.2.1. Thus, only
these three features were used by the segmentation algorithm. To solve the second
problem, they ﬁrst segmented the image into textured and nontextured regions and
applied the Ohlander’s algorithm only to regions without texture. Textured regions
were determined by the number of edge points (obtained by the Laplacian operator)

in a local neighborhood.

3.2.3 Color constancy

We know from our everyday experience that grass is green and tomatoes are red, inde-
pendent of the surrounding illumination. That is, whether an object is seen outside in
sunlight or inside under tungsten—ﬁlament lamp illumination, it will appear the same
to us. This phenomena is known as color constancy. The signal reaching the photore-
ceptors on the surface of the retina is the product of the spectral distribution of the
illuminant and the spectral reflectance of the object. Even though these two quanti—
ties are unknown, it seems that the human brain can extract the spectral reflectance
of the objects from the incoming signal. A number of psychologists and computer
vision researchers have tried to understand human color constancy and develop com-
puter vision algorithms in an attempt to extract the spectral reflectance of objects

from the intensity signal. This section briefly presents some of these algorithms.

The Retinex theory

The Retinex (from retina and cortex) theory was developed by Edwin Land and ex—
perimented by Land and .\lc('_'ann [119. I20. I21. 122. 131] as a possible interpretation
and understanding of some experiments they performed with color patches. The im-
portant statements of the Retinex theory are: (i) “The composition of the light from
an area in an image does not specify the color of that. area". (ii) “the color depends
only on the lightness in each waveband and lightness is independent of flux". and (iii)
“the color of a unit area is determined by a trio of numbers each computed on a single
waveband to give the relationship for that waveband between the unit and the rest of
the unit areas in the scene". This means that the reflectance of a surface is computed
using the intensity at that surface and the intensities at all the other surfaces in the
scene.

The Retinex algorithm developed by Land is the following. The relationship

between an area i and an areaj in an image for a waveband A is:

 

RA(I',j) = Z t(log [H1 ). (3.30)

A— [k

where Ik+l and [1,. are the intensities of neighboring regions on a path from areaj to
area i and

x if]:r| 2 threshold

0 otherwise.

According to this formula, if two neighboring regions have similar intensities, they do

not contribute to RA(i.j). The descriptor for area i in waveband A is

_ N_ H“ T.’
RA('I) : ZJ—1f\r(1])

 

. (3.32)

71

where N is the number of paths from any area j to area i. Three descriptors corre-

sponding to the 3 wavebands are then combined to determine the hue of area i.

Algorithms based on the retinex theory

The main characteristic of the Retinex algorithm is that it automatically ﬁnds the
edges in the image. If two neighboring regions belong to the same surface, they
have similar intensities. thus the quotient of their intensities will be close to 1 and
its logarithm will be close to 0. However. if the two regions belong to two different
surfaces and are thus separated by an edge, the absolute value of the logarithm of the
quotient of their intensities will be large enough so that a component will be added
to RA(i,j). Therefore. all the subsequent algorithms based on the Retinex theory use
the output of an edge detector.

In a typical image, we expect the incident light to vary smoothly, with no disconti-
nuities, while the reﬂectance should have sharp discontinuities at edges where objects
adjoin. The reflectance would then be relatively constant between such edges. The
color constancy algorithms based on the Retinex theory are divided into the following

steps described in Figure 3.10 for the one-dimensional case.

1. Differentiate (spatially) the image.

5‘”

Threshold to eliminate small values (corresponding to the small variations in
the illuminant) and keep the large values (corresponding to the large changes
in reflectance).

3. Integrate to obtain the reflectance.

Horn [85] uses the Laplacian operator for the differentiation step. The use of a
hexagonal tesselation of the image makes his algorithm quite unusual. Blake [14] used

a gradient operator to ﬁnd the edges in the image. His algorithm was then modiﬁed

72

Original signal ‘I/I\\¥

Differentiation

Thresholding ]
Integration _T—L—L

Figure 3.10: Computation of the reflectances by the algorithms based on the
Retinex theory.

by Bretstaff and Blake [20] to work on square tesselations (typical CCD images). The

Canny edge detector [24] was used to ﬁnd the edges.

Mathematical models

Several models have been proposed to decompose reflectance and illuminant into a
linear combination of some basis vectors or functions. Different color areas are com-
bined into a system of equations (linear or non-linear) and solved for the surface
descriptors. Buchsbaum [22] deﬁnes the illuminant as a linear combination of three
illuminant basis functions and the reflectance as a linear combination of three re-
flectance basis functions. The major assumption in his algorithm is that the estimate
of the illuminant can be obtained by averaging the radiances over the entire ﬁeld
of view. Maloney and \Vandell [128. 181] combine information from p sensors, m
illumination coefficients. and n reflectance coefﬁcients and do not assume them to
be equal to 3, but instead determine what these numbers should be. If the ambient
light is known, then the system can easily be solved. Otherwise. they prove that the

System will have a solution for some values of p. m, and n. Forsyth [66] presented

73

an algorithm which uses the underlying mathematical structure of the problem. The
basic principle can be explained in a simple way: If we observe an image containing
only green colors, then the illuminant cannot be bright red. This yields a rejection of

all the illuminants which cannot produce the colors present in the given image.

Physical approaches

These approaches use the physical model of the image formation process. Healey and
Binford [82] and Klinker et al. [111] assume that in specular areas (reflection of the
light onto the object), the reﬂectance of the object is lost and only the illuminant
is observed. Their algorithms consist of segmenting the image into highlight areas
and matte areas. The specular areas are then used to estimate the illuminant. The
illuminant can then be subtracted from the original image to estimate the reﬂectance.

Lee [124] uses the fact that in the CIE chromaticity diagram, if two lights are added
in various proportions, then the chromaticities of the mixtures lie on the straight line
connecting the two points. In the CIE diagram, the coordinates of the colors from
different points of the same surface will fall on the straight line connecting the illu-
minant color and the object color. Straight lines obtained from different objects will
all intersect at the point corresponding to the illuminant color. Once the illuminant

color is determined, the object colors can also be computed.

3.2.4 Proposed method

This section describes our approach to color segmentation. Several of the well-known
techniques for color segmentation described in Section 3.2.2 were tried on a number
of test images, and the best results were produced by the split-and-merge approach.
Therefore, it was chosen for the color segmentation algorithm.

The reason why the original tristimuli values RGB were used as features is mostly

74

hardware dependent since the digitizer available in our laboratory produces RG8 im-
ages. The color images were acquired by recording the scene containing the moving
object using a camcorder. The tape was then played on a VCR connected to the
digitizing boards to obtain the HOB images. The HS] color system was also tried
because it is supposed to describe perceived colors in a more natural way. Unfor-
tunately, the signals stored on a videotape and delivered by the \"CR are severely
distorted; the saturation and intensity signals are substantially attenuated and the
hue signal is very blurry. It can be recalled from the deﬁnition of the HS] color
system (Section 3.2.1) that the hue is not deﬁned for small saturations (center axis of
the cones). Also, in outdoor scenes. very few objects have saturated colors. making
it more difficult to evaluate the real hue of each object. These reasons motivated our
choice for the use of the RGB color space in the color segmentation algorithm.

The color segmentation algorithm is divided into ﬁve steps. The ﬁrst four steps
come from the split-and-merge algorithm for color images, and the last step consists
of using edge information to further merge uniform color regions and produce a better

segmentation.

The split-and-merge algorithm

The split—and-merge algorithm which we have used was adapted from Coutance’s color
split-and-merge technique [36] and comprises of the following four steps as shown in
Figure 3.11.

l. Initialization: The N x N image (where N = 2k) is divided into a number
of z x 2: square subimages. In our case, a 256 x 256 image is divided into 256 16 x 16

subimages. In each square subimage Xk, the mean and variance of the intensity values

 

 

 

 

   

Input image Initialization

Splitting

   

Merging Grouping

Figure 3 11 The split and—merge algorithm for segmenting a color image

for each of the three color components are computed as follows

.9
“M“Hl H

X 51037). I=R.G,B (3.33)

N
_II-inl H

+z—l
Z Z (1(2 Jl—mzf. I= R.G.B (3.34)
':_r j:y
where (1: y) are the coordinates of the upper left corner of the square subimage and

z is the dimension of the subimage (initially, z - 16)

2 Splitting: A subimage X0 is not “homogeneous if

(113, + va + vyk) > t.) (3.35)

A nonhomogeneous subimage is then further split into four square subimages X1, X2,

X3, and X4 of Size 2/2 x 3/2 This process is repeated until the 1nd1v1dual subimages

are homogeneous or have reached a minimum area of 4 pixels

item :3
4.

1n) » .
IAT‘DLI.“

76

3. Merging: If four adjacent subimages X1. X2, X3, and X4, each of size 2: x z,
with the same parent X0 have not been split in step 2 and they satisfy the homogeneity

condition

{lmR' — 77112,] + [7710. — mGJ] + [mg — m31|}<tm V i.j E [1.4] (3.36)

then they are merged to form a larger subimage X0 of size 2: x 22.
4. Grouping subimages: This step eliminates the arbitrary region boundaries
imposed by the initialization process. A square subimage Xk is merged with a region

7?.) if it satisﬁes the criterion:

{lmRk — MR,| + lmGk — Mal + [m3, — MB,|} < tg, (3.37)

where M3,, ill/[0,, and 1113, are the color component means for region R). A region
grows by adding neighboring subimages when they satisfy this criterion. The region
means are updated every time a new subimage is added to the region. When no more
subimages can be added to existing regions, a new region is created. The output of
the split—and-merge module is represented as a set of regions R = {R1, R2, . . . , Rm}.

The different threshold values listed above are parameters of the split-and-merge
algorithm. To ensure that regions are not merged prematurely and guarantee bet-
ter segmentation results, it is desirable to obtain an oversegmentation of the image.
Therefore, the grouping parameter needs to take a relatively small value. The pa-
rameter values (tm = 25, t, = 25, tg = 16) result in good segmentation results on a

large number (~ 100) of test images.

Edge detection

Edge detection is performed using the Canny edge detector [24]. The Canny edge
detector was developed to satisfy the following criteria: (1) good detection. (2) good
localization, and (3) only a single response to an edge. It consists of ﬁnding the zeros
of the second-directional derivative of the convolution of the image I with a Gaussian
ﬁlter G:

82

6?“; * 1) (3.38)

where :1: denotes the convolution operator, 5% denotes the second directional deriva-
tive, and the Gaussian ﬁlter G is deﬁned as:

exp ——+L if Jismsﬂ
G(:z:,y)= { 2 I 2 2 (3.39)

0 otherwise

In our implementation, two parameters need to be speciﬁed: a, the standard deviation
of the Gaussian ﬁlter and W, the width of the ﬁlter in pixels. Large values of a will
tend to ﬁnd edges that are more widely spaced. We have chosen a = 1.0 and W = 9.

The second stage of the Canny edge detector consists of thresholding the ﬁltered
image to keep only the signiﬁcant edges. Canny suggested the use of two thresholds.
If any part of a detected edge is above a high threshold th, then these points are
immediately output, as is the entire connected segment of the contour which contains
the points and which lies above a low threshold t). We have chosen t), = 80% and

t) = 60% of the largest magnitude value of the edge map.

Merging regions using edge information

The split-and-merge segmentation process described earlier produces a large number

of regions because of oversegmentation. The grouping stage of the split-and-merge

78

segmentation is sensitive to the order in which the subimages are scanned. so, it is not
totally reliable. A better segmentation can be obtained if these regions are further
merged using the edge information obtained from the Canny edge detector. This
merging also results in regions whose boundaries follow the edges of the objects in
the image.

The ﬁrst step consists of splitting regions that are broken by an edge. This is
done by simply ﬁnding connected regions. A connected region is deﬁned as a set of
4-connected pixels that are not edge pixels. Let p, denote the perimeter and n,- the
area of region 72,-. For each pair of regions R, and R}, we compute BU, the length
of the common boundary separating the two regions, and EU, the number of edge
points (obtained from the Canny edge detector) on this boundary. The second step
consists of merging regions based on the following three criteria.

1. Boundary strength: This criterion measures the strength of the boundary
between two regions. The larger the number of edge points on the boundary, the
larger the boundary strength. It is deﬁned as:

1 _ Eij
Fij — E; (3.40)
Note that F ,1]- is bounded between 0.0 and 1.0.

2. Similarity measure: This is the most obvious criterion for region merging.
It measures the similarity between two regions based on the color component averages
as follows.

F?)- = (mg. — 777.33)2 + (mg. — mg) )2 + (m3, — mBJ)2. (3.41)

3. Connectivity measure: This criterion was ﬁrst used by Beveridge et al.
[12] and is based on the heuristic that two regions should share a sufﬁciently large

boundary to be merged. It discourages the formation of regions with small necks

connecting larger areas and is deﬁned as:

. mintpap) .
F?)- = —-—' J - (3.42)
The regions are examined in increasing order of size, and a region R,- is merged

with a region ”R,- if the following three conditions are satisﬁed.

1. n: < 72,-; (7?..- has to be smaller than 72,);

2. F}, < t,; (Less than t, of the boundary between the two regions should corre-
spond to actual edges);

3. FE]- Ff’j = mindl‘ﬁl‘ﬁc} where R), satisﬁes the ﬁrst two conditions and F3]- F3 < to;

(Find the best region satisfying the ﬁrst two conditions (1) and (2) to be merged

with 7?..- based on the two criteria F3- and F3).

The two thresholds t, and tC were empirically chosen as 0.4 and 4.5, respectively.
Note that [‘31- is not included in the third condition, but is used in a separate condition.
This is because if there are no edge points between two regions, then P :1]. = 0. It does
not mean, however, that the two regions should be merged. It might happen that the
two regions are connected by a narrow neck and no edges were found on the short
boundary between the two regions. It is also possible that the Canny edge detector
did not detect the boundary edge elements in the intensity image, but the two regions
are actually quite different in color.

This color segmentation algorithm is applied to the middle frame of a 3-frame
sequence. Figure 3.12 shows the results on a typical color image. The split-and-
merge result is shown in Figure 3.12(b). It can be seen that a large number (1,550) of
regions were created due to oversegmentation but the advantage is that distinct image

regions were not merged by mistake. The result of applying the Canny edge detector

80

is shown in Figure 3.12(c). The final color segmentation result after integrating the
split—and-merge regions and the Canny edges is shown in Figure 3.12(d). We note that
the number of regions in the segmented image has decreased signiﬁcantly from 1.550
to 263. On the average, this integration of split-and—merge algorithm and the output
of the Canny edge detector reduced the number of regions by approximately 80%.
Also, the boundaries of the new regions are more accurate due to the information
provided by the edges.

A high-level color segmentation algorithm is given in Figure 3.13. The CPU time
requirements of this algorithm (on a Sparcstation 20) on a 256x256 image are as
follows: 0.6 s to initialize the quad-tree structure for the split-and-merge algorithm,
1.4 s for splitting, 0.1 s for merging, 0.2 s for grouping, 2 s to extract the edges using
the Canny edge detector, 1.5 s to compute edge boundary information between the
regions and evaluate the fusion criteria, and 30 s to perform the fusion between the
regions. Thus, the total computation time requirement for the color segmentation

algorithm is about 35 8.

3.2.5 Experimental results

This color segmentation algorithm was tested on more than 100 images of indoor
and outdoor scenes. Figures 3.14 and 3.15 show the results of the color segmentation
algorithm for images of vehicles and other objects. It can be seen that the color seg-
mentation algorithm performs reasonable well even in cases where the image contains

a large number of distinct regions.

81

 

 

 

 

 

 

Figure 3.12: Steps in the color segmentation algorithm: (a) input color image; (b)
split-and—merge color segmentation; (c) edge detection; (d) ﬁnal color segmentation
obtained by merging split-and-merge regions and edge information.

82

 

 

Input: An RGB color image and its corresponding gray-level image I
begin
/* Initialization of the quad-tree structure */
::)c = 16, Compute mm, mGk, 7723,, 133k, cm, 1in
/* Splitting */
f3; all cells 16
while 2), > 2
if UR. + va + Us. > 25
ﬂier; Create cells 161, [92, 1:3, and 1:4, 2),. = zk/2
ends/11$
endfor
/* Merging */
f_og all quadruples (191, kg, k3, k4) neighbors of same size
i—fflmR. _ mRJI 'l' lmG.‘ - mGJI + lmB.‘ _' mBJl} < 25 V iaj E [1741
£12.11 Merge cells 101, kg, k3, and 1.2,, 2), = zk, x 2

 

 

endfor
/* Grouping */
ﬁg k = 1,. . . ,C'

i_f§l I such that {ImRk — 313,] + lmgk — MQI + |m3k —MB,]} <16
then Merge cell It with region R)

endfor

Produce R = {R1,R2, . . . ,Rm}

Canny edge detection on I

/* Fusion of regions and edge information */

ﬁg; i = 1,. . . ,m
f_or_j = 1,... ,m

Compute ng, Eij, F3],

endfor

 

 

F?

tj’

and F?)-

 

 

fg i = , . . . , m
Rank the regions in increasing order of size {R(1), Rm, . . .}
Find the candidates R;c to be merged with Rm such that:
R;c is a neighbor of Rm, 72(1) < 72),, FE”), < 0.4
Choose the candidate R,- such that:
Pflurinj : minkfrfnkrim} —and Filijrilu < 4-5
Merge RH) and R,-
Delete Rm
Update B(1)j, E(1)J‘, [‘(ll)j" F31”, and Pig”)-
endfor
end
Output: Homogeneous regions R = {R1, R2, . . . , Rn}

 

 

Figure 3.13: Color segmentation algorithm.

83

 

 

 

 

 

Figure 3.14: Results of the color segmentation algorithm: the ﬁrst column shows the
input images and the second column shows the segmented images.

84

 

 

 

Figure 3.15: Results of the color segmentation algorithm (cont’d): the ﬁrst column
shows the input images and the second column shows the segmented images.

3.3

fl.

fr

vli

«CF
. _..

all;

1106

Y“, "O
ilUt

I: {jf

‘f‘L
. .

h

.l—I
.4

.41»

..
.fim

- il
2A

85
3.3 Summary

In this chapter, we have described two image segmentation algorithms for motion
segmentation and color segmentation.

The motion segmentation algorithm does not make any assumption about the
shape of the moving object and the number of stationary objects in the background.
As long as the moving object is entirely contained in the three frames, the displace-
ment of the object between two frames can be arbitrary. This algorithm has the
advantage over other motion segmentation algorithm in that it is very simple. It
does not involve the computation of the optical ﬂow in the image sequence. It does
not require a frequent update of a background image. It only needs three frames
containing the moving object in an image sequence. We have seen that the motion
segmentation results are very reasonable given the simplicity of the algorithm.

The color segmentation algorithm is based on a split-and-merge color segmentation
algorithm. It also integrates edge information to produce more accurate contours
around the homogeneous regions in the image. Again, we have seen that the results

are very satisfactory.

Chapter 4

Model-Based Image Segmentation

The two algorithms presented in Chapter 3 neither require nor incorporate any spe—
ciﬁc knowledge about the scene and are meant to solve generic image segmentation
problems. The two segmentation algorithms presented in this chapter utilize knowl-
edge about the scene and the objects expected to be present in the scene. In the
ﬁrst algorithm, we integrate motion and color information using the notion of active
contour models which constrain the object boundaries to be continuous and smooth.
In the second algorithm. we integrate motion and edge information using deformable
template models which describe the geometric properties of the objects expected to be
present in the scene. Both these model-based segmentation algorithms are designed
to integrate multiple cues.

Integration techniques combine information provided by different cues to produce
a more accurate representation of the desired quantity. A number of computer vision
tasks have beneﬁted from integration techniques. Some of these tasks along with the
integration strategies proposed in the literature are presented here.

Integration techniques have been useful for registering multiple images of the same

scene observed at different times and with slightly different view points by the same

sensor. This is typically the problem of registering Landsat satellite images to see the
evolution of water areas. agricultural ﬁelds. or urban areas. Ranade and Rosenfeld
[157] proposed a relaxation technique to match entities from multiple images. Ton
and Jain [179] used this technique to register Landsat images using the centroids
of homogeneous image areas as control points and the size of the areas as features
associated with the control points. In a different context, Chen and Medioni [30]
proposed an iterative method for registering multiple range images in order to build
a complete 3-dimensional model of an object.

Sensor fusion is another area of computer vision where integration techniques have
been employed. A typical example of sensor fusion is the integration of range and
intensity images for object modeling or object recognition. Limb boundaries and
crease edges are more accurately detected using intensity images, but range data
provide better shape information especially if the image is textured. Lee [123] fused
intensity and range images to automatically build line drawings of polyhedral objects.
Chou and Brown [32] proposed a method to fuse intensity and range images using
Markov random ﬁelds. Matthies and Elfes [130] and Moravec [137] proposed a grid-
based representation to fuse the data provided by an array of ultrasonic sensors on a
mobile robot to produce a single map of the environment. Moravec uses certainty grids
whereas Matthies and Elfes use occupancy grids which account for sensor uncertainty
using a Gaussian probability model.

In this thesis, we have integrated multiple cues or multiple image processing tech-
niques for a better segmentation of the observed scene. Integration of region seg-
mentation and edge detection has been reported in the literature. Pavlidis and Liow
[147] and Milgram [134] used it for the segmentation of intensity images, Nadabar
[100] segmented range images, and Farrokhnia and Jain [59] used it for textured im-

ages. In these situations, the integration techniques have been shown to produce

88

better segmentation results from two or more imperfect segmentations. Gamble et
al. [71] integrated stereo. motion. color. and intensity edges using Markov random
fields. Cochran and .\ledioni [34] fused a. disparity map obtained from stereo images
with intensity edges to obtain surface descriptions. Chu and Aggarwal [33] proposed
an algorithm using the maximum likelihood estimator and an iterative procedure to
satisfy contour curvature and compactness constraints to integrate multiple region
segmentations and edge maps, independent of the image sources. Pankanti et al.
[146] integrated stereo. region based segmentation. perceptual grouping, shape from
shading, and line labeling modules to produce an accurate depth map of the observed
scene. Etoh and Shirai [56] have described a technique for ﬁnding moving areas in the
image by clustering in the 8-dimensional space composed of the three color compo—
nents (R, G, B), the pixel position (I. y), and the spatio—temporal intensity gradients
(al/axﬂI/ayﬁI/Bt), thus fusing motion, color, and edge information.

This chapter is divided into two sections, each describing a speciﬁc model-based
segmentation algorithm. The ﬁrst algorithm combines color and motion information
through an active contour paradigm to extract an accurate contour of the moving ob-
ject in the image. The second segmentation algorithm uses the theory of deformable
templates to segment the vehicle of interest (the vehicle in the lane closest to the
camera) from the stationary background and the other moving vehicles. This algo—
rithm integrates a motion cue and edge information. The database of images used in
this chapter is shown in Appendix A along with a description of the image acquisition

process.

.l

89
4.1 Segmentation of the Moving Object

The motion segmentation and color segmentation algorithms described in Chapter 3
are general-purpose segi’nentation algorithms which do not assume any speciﬁc knowl-
edge about the objects present in the scene. However. we are concerned with the spe-
cific problem of obtaining a good segn'ientation of the moving vehicles. Unfortunately,
both of these segmentation algorithms have some limitations. Motion segmentation
provides us with a mask which is generally larger in size than the true size of the
moving object. Color segmentation produces a number of closed “uniform” regions,
but the object of interest (vehicle) itself gets split into several regions. Both these
algorithms do not take into account the fact that the boundary pixels on the object
constitute a closed geometric contour with strong continuity and smoothness prop-
erties. These constraints should be incorporated in a top-down mechanism to guide
the boundary detection or segmentation process.

This section describes an algorithm which is able to segment the moving object
from the stationary background and produce a more accurate contour of the moving

object. Two paradigms were used in designing this algorithm.

0 The results of motion and color segmentation are integrated to overcome the
limitations of the individual algorithms.

0 To ensure smoothness of the contour of the detected object we have used an
active contours (or snakes) paradigm. Active contours are pushed or pulled by
constraint forces towards image features. These constraining forces are used to

deﬁne an energy function which is to be minimized.

\Ve will ﬁrst describe the theory of active contours before presenting the segmen-

tation algorithm based on active contours.

9 0

4.1 .1 Active contours

Active contours were proposed by Kass at al. [103] as a. mean of ﬁtting contours
to image data. given user-speciﬁed constraints. Kass 6! al. deﬁne active contours
or snakes as “an energy-minimizing spline guided by external constraint. forces and
inﬂuenced by image forces that pull it towards image features such as lines and edges”.

The energy of a contour 21(3) in a parametric representation is deﬁned as:

Etotal : AIIEint(u(5)) 'l' Eimage(u(3)) + Econ(u(8))}ds (41)

Let us deﬁne each of the terms in the expression of the energy function. The
internal energy E)“, is deﬁned as a function of the ﬁrst- and second—derivatives of the

contour and forces the contour to be smooth.
Eintf'UI-Sll = 0(3)]Us(3)l2 + (stllussfsll2 (4'2)

The image energy Eimage pushes the contour towards image forces. It is a combi-

nation of three diﬂerent image forces.
Eimage(u(3)) : wlineElinefuisll "l' lwedgeEedge( 15(5)) 'I' wtermEterm(u(3)) (43)

The term Eune = I(:1:,y) attracts the contour to high or low intensity values in the
image depending on the sign of tonne as shown in Figure 4.1. The term Eedge =
—-|VI(:I:, y)|2 attracts the contour to image points with high gradient values as shown
in Figure 4.2. The term Eterm attracts the contour to line terminations as shown in
Figure 4.3, where the snake is pushed towards the subjective contour in the image.

The term Econ is the energy of a spring connecting a point P1 on the contour and

 

 

91

 

 

 

 

 

 

 

Image Initial active contour Final active contour

Figure 4.1: Effect of the line image force E“,1e on a snake.

 

 

 

 

 

 

 

Image Initial active contour Final active contour

Figure 4.2: Effect of the edge image force Eedge on a snake.

 

 

‘
‘ L
S \)

 

 

 

 

 

Image Initial active contour Final active contour

Figure 4.3: Effect of the termination image force E,erm on a snake.

 

 

92

a point P2 in the image plane. It is defined as:
L‘,,,,(u(s)) = —-A~(P1 — P2)2. (4.4)

The goal of the snakes algorithm is to minimize the total energy function. When
this energy is minimum. the snake is smooth and is "locked” to the image features.
Kass et al. proposed a variational approach for minimizing Emu) (Eq.(4.1)) and
solving for 11(3). This requires that the contour and the associated energy function be
represented in discrete form. The discrete formulation of the internal energy function

is obtained by approximating the derivatives by ﬁnite differences and using the vector

notation u,- = (.r,,y,~) = (.r(ih),y(ih)). Then,

. 01." 34‘
Elm“) = .—-—llui - llz'--1ll2 + ﬁllui—l — 204' + u:‘+1ll2- (4-5)

Let EC,“ = Eimage + Econ be the external energy function and let fr(i) = 8Eext/8x;
and fy(i) = 8Ecxt/(9y,, where the derivatives are approximated by ﬁnite differences if
they cannot be computed analytically. The Euler equations for minimizing the total

energy are:

Oifui - 114—11‘ ai+l(ui+l - 114) + (34—10174 — 2114—1 + 117)
‘23ifui-1 “ 211i + u£+1l+ 5141012“ — 211.41 + 11:42) (4-6)

+(fx(i).fy(i)) = 0-
These equations can be written in the matrix form as

fr , =
Ax+ (x y) 0 (4.7)

Ay + fy(x.Y) = 0»

iV
'1

‘l
Hitt -

1
1+. -1
“1.1

93

where A is a pentadiagonal banded matrix. They can be solved using an iterative

Euler method as follows.

x) = (A + 71)"l(‘/X1_1- frfxt—i-Yt—ill

yt = (A 'l' “/I)—1(‘)yt_1 —‘ fy(xt—lsyt-ll)a

where 7 is the Euler step at each iteration.

Amini et al. [3] proposed a dynamic programming approach for the minimization
of the total energy function. This approach does not require the computation of
high-order derivatives and it takes advantage of the discretization of the image. The

total energy can be expressed as a sum of subenergies as follows:

Etotalfulau29 - - - a “71) = E1(ula 112,143) + E2f‘U2, U3, U4) + ' ' ' + En—2(un—'Zsun—la unl-
(4.9)
The dynamic programming solution involves generating a sequence of functions of

two variables {S,}?;11(u,-+1, u.) where S,- is obtained as follows:

5411241, Ur) = Ti?{Si—l(uiy 117—1) + allui — lit—ill2 + dllui+i — 2114‘ + Iii—1H2 + Eext(ui)}-

(4.10)

At each stage, u,_l can move to m possible positions (usually its 8 neighbors in

the image grid) and 54(u,+1,u,-) is obtained by ﬁnding the position of u,_1 that will

minimize it. Note that 5,, is the total energy. The minimization sequence is iterated
until Eton) remains constant.

The external energy does not have to be limited to the forces mentioned by Kass

8t al.. For example, Etoh et al. [57] deﬁned two energies Eregion and Eedge to be used

for contour extraction by region clustering. The region force guides the contour to a

true boundary, and the edge force results in local adaptations.

94

4.1 .2 Proposed method

The goal is to fuse the set of regions R = {R1,R2 ..... Rn} produced by the color
segmentation algorithm with the motion segmentation mask A4, to obtain a more
accurate segmentation of the moving object .Vfo. Each region R,- is examined to
decide if it belongs to the moving object. If a large percentage of a region is inside
the mask M,, then it should be kept as part of the moving object. If it is mostly
outside, then is should be discarded. The outcome of the regions that are partly
inside A4, and partly outside is decided using the smoothness constraint enforced by
the active contour.

We have used only the internal energy to ensure smoothness of the contour and the
edge image force to push the contour towards the edges in the image. The discrete
formulation of our energy function for a contour containing N points {r,,y,},i =

1,...,N is given by

N
Etotal = ZIOI($1 _ flit-1)2 +(y1— yi—l)2l

i=1

+ ,8[(-Ti+l — 2113: + 134—112 + (yi+1 - 2y: + yi—ll2l (4-11)

— wedgelVIIIir yi)l2}

The dynamic programming approach [3] is more appropriate for minimization
because it does not require the computation of high-order derivatives. This method
requires an input image and an initial estimate of the contour. The initial estimate
0f the contour required by this algorithm is speciﬁed by the contour of the motion
mask M, This contour, as seen in Figure 3.3(f) is a good initial approximation of
the desired contour of the moving object. The input image is computed using the

Conﬁdence measure 72? which counts the number of pixels of region R,- that are inside

95
the motion mask AA... Each pixel in the image is assigned a gray level value as follows:

I(r,y) = "—fit(1~.y)e 72,. (4.12)

l

The parameter values used in the algorithm are a = 100, ,3 = 100, and wedge = 1.
We are now able to compute another conﬁdence measure for each region, this time
using the new active contour instead of the motion mask contour. Let n? denote
the number of pixels belonging to region R,- that are inside mask Ada deﬁned by the
smooth active contour.

Each region R,- is examined and the conﬁdence measures 713 and n? are used to
decide if the region R,- should be kept as part of the moving object mask Mo. The

fusion algorithm can be stated as follows.

1. If g:- 2 0.9 then R: 6 A40; (If at least 90% of the region is inside the motion

mask, then it is kept as part of the moving object);

to

If 21*:- S 0.2 then R, g 7M0; (If less than 20% of the region is inside the motion
mask, then it is discarded);

3. If 0.2 < ’51 < 0.9; (Undecided regions)

3.1. If E:- > gi- then R,- E Mo; (If the conﬁdence measure of the region has
increased after applying the active contour algorithm, then it is kept as a

part of the moving object);

3.2. If E?- < E;- then R.- ¢ A40; (If the conﬁdence measure of the region has

decreased after applying the active contour algorithm, then it is discarded).

All the regions and pixels that have been kept are grouped together to form a
mask. This mask is then smoothed using a 7 x 7 averaging ﬁlter to obtain the mask

A40 corresponding to the moving object.

96

This fusion algorithm was applied to the 3-frame sequence shown in Figures 3.3(a)-
(c) and the intermediate outputs of the algorithm are shown in Figure 4.4. Figure
4.4(a) shows the outline of the motion segmentation mask superimposed on the middle
frame of the 3-frame sequence. Figure 4.4(b) shows the results of the color segmen-
tation. Figures 4.4(c)-(d) show the application of the active contour algorithm. Note
that the highest gray values are assigned to the regions belonging to the moving ob-
ject, but some background pixels also receive a high gray value. The initial contour
is superimposed on the original image in Figure 4.4(c). The ﬁnal contour obtained
after a number of iterations (usually around 25) of the active contour algorithm is
shown in Figure 4.4(d). Note that the new contour includes fewer background pixels
than the motion mask M,. However, the object contour is smoother now. The ﬁnal
moving object mask Mo is shown in Figure 4.4(e) and its outline is superimposed on
the middle frame in Figure 4.4(f). This contour is more accurate than the contour
of the motion segmentation mask. The use of the active contour module is impor-
tant for achieving a good moving object segmentation since it helps in deciding the
assignment of about 5% of the regions in R = {R1,R2, . . . ,Rn}.

Figure 4.5 shows a high-level description of the algorithm. The CPU time re-
quirements of this algorithm on a Sparcstation 20 are the following. The energy
minimization step takes 15 s, the decision for each region takes 0.4 s, and computing
the ﬁnal moving object mask takes 10 s. Thus, the total computation time is about

25 5.

4.1.3 Experimental results

This algorithm was tested on 136 real image sequences of moving objects, both indoors

and outdoors. Figures 4.6 and 4.7 show the contour of the moving object mask Mo

97

 

 

 

 

 

 

(e) (1')

Figure 4.4: Steps in the moving object segmentation algorithm: (a) motion segmen-
tation mask Ma; (b) color segmentation; (c) initial active contour; (d) ﬁnal active
contour; (e) fused mask Mo; (f) contour of mask Mo overlaid on the original image.

 

98

 

Input: Motion segmentation mask AA, and homogeneous regions

R: {R19R29"'9Rn}

 

begin
fggi=1,...,n
n,’ = R,- 0 1M,
endfor

Determine ,the intensity image for the active contour algorithm

[($331) : i117" if (:r,y) E Rf

The initial contour is the boundary of M, containing N points v,-

/* Run the active contour algorithm using */

/* the dynamic programming approach */

while k < 100 _an_d ]Ediﬁ'] > 0

fo_r i = 1, . . . , N
v;_1 is allowed to move to any of its eight neighbors
S.(v.+1.v.) = min..-,{S'.-_1(v.-,v:_1)+100I|v.- — vi—lll2
+100llvi+1 — 202' +vi—1ll2 - IIVI($iayi)ll2}

endfor

E). = SN(0, N)

Ediﬂ’ = E: - Ek-l

k = k + 1
endwhile

The ﬁnal active contour produces a mask M a
/* Decision for each region */
fir; i = l, . . . , n

n? = R5 fl Ma

ifgf 2 0.9 then R.- 6 Mo

11‘ 5; g 0.2 then R.- 6 744—0;

i_fO.2 < 5,; < 0.9 then

if 22 then 72.- e M.

then R.- E III—o

 

V
e.

 

1:. I

. p1,;
/\
15’.

 

:3

I m

endfor
Smooth Mo using a 7x7 averaging mask

end
Output: Moving object segmentation mask Mo

 

 

Figure 4.5: Moving object segmentation algorithm.

99

overlaid on the middle frame of the threeeframe sequence. As can be seen from
Figures 4.6 and 4.7. these contours are reasonably accurate. Small details such as
the spare tire on the pickup truck in Figure 4.6(a) and the emergency light on the
top of the vehicle in Figure 4.7(d) were also extracted. The contour is correctly
extracted even if the object is made of different colors such as the custom van in
Figure 4.7(f) or the jeep in Figure 4.6(g). Figures 4.6 and 4.7 also show images where
the segmentation results are not perfect. For example, the roof of the white vehicle
shown in Figure 4.6(c) was not included in the moving object mask because it was
merged with the snow in the center area of the road during the color segmentation
process; the two regions (roof of the white vehicle and the background snow) have
the same color, the Canny edge detector did not ﬁnd any signiﬁcant edges between
them, and the fusion module could not correct this mistake. For the same reason,
the bumper of a vehicle was merged with the jacket of the walking subject in Figure
4.6(b). The contour of the moving arm in Figure 4.7(g) is not very accurate due to
the fact that the edges are not very well deﬁned because of motion blur. This blur
can be removed if the high-speed shutter option of the camcorder is used. However,
if the high speed shutter is used inside a building, the resulting image might be

underexposed.

4.2 Segmentation using Deformable Templates

The active contour algorithm described in Section 4.1.2 correctly segments all the
moving objects from the stationary background. Figure 4.8 shows an example of a
three-frame sequence containing multiple moving vehicles. It can be seen in Figure
4.8(d) that the fusion algorithm based on active contours groups all the moving

vehicles into a single moving object. In this case, classical region-based or edge—based

100

 

(g) (h) (0

Figure 4.6: Results of the segmentation of the moving object. The contour the moving
object mask M0 is superimposed on the middle frame of the 3-frame input sequence.

101

 

(g) (h) It (i)

Figure 4.7: Results of the segmentation of the moving object (cont’d). The contour
of the moving object mask .Mo is superimposed on the middle frame of the 3-frame
input sequence.

102

segmentation approach would not be able to identify the front vehicle. Motion-based
segmentation would not be able to separate the front vehicle from the other moving
vehicles in the image. Since our domain is restricted to vehicles and vehicles have
similar shapes, we have chosen to solve this segmentation problem using a deformable
template approach which utilizes speciﬁc knowledge about the geometrical properties
of the objects in the scene. Figure 4.8(e) shows that the proposed deformable template
approach can be used to successfully segment the front vehicle from the background
and other moving vehicles.

The theory of deformable templates was ﬁrst developed by Grenander and his
colleagues [4, 77]. Speciﬁc knowledge about the scene is incorporated through de-
formable template models which describe the geometric properties of the objects
present in the image. The image segmentation problem is posed as an equivalent
(Bayesian) energy minimization problem. A template 7(8) is deﬁned by a set of
parameters 9 = (01, . . . ,ON). The prior probability density function p(('9) incorpo—
rates speciﬁc knowledge about the objects of interest in the input image and speciﬁes
how the template is allowed to be deformed. The likelihood function p(XI9) (or
data ﬁdelity term) for an image X discriminates between different hypothesized seg-
mentations in terms of how “close” they are to the observed image. The deformable
template problem consists of ﬁnding the template 7'((:3) deﬁned by the parameters (:9
which best characterizes the objects in the image. More speciﬁcally, we are seeking
the template parameters (:3 which maximize the a posteriori probability p((:)|X). If
the a posteriori probability is expressed as a Gibbs distribution with energy function

E(€-)), then maximizing p(G|X) is equivalent to minimizing E (G)

103

 

(d) (6)

Figure 4.8: Segmentation of the moving objects vs. segmentation of the moving
vehicle of interest: (a) frame 1; (b) frame 2; (c) frame 3; (d) segmentation of the
moving objects; (e) segmentation of the moving vehicle of interest.

104

4.2.1 Deformable templates

Before describing our vehicle segmentation algorithm to extract. the vehicle of interest
(front vehicle) from the stationary background and the other moving vehicles, we will
present the general theory of deformable templates and major approaches described in
the literature. We will describe the three main components of a deformable template

algorithm, namely prior probability, likelihood function, and minimization technique.

Prior probability

The prior probability of the template is constructed in such a way that all the tem-
plates which fall in the set S of allowed shapes are assigned a high probability whereas
the templates which fall outside 5 are given a low probability. It is common to de-
ﬁne a continuous prior probability density function by lowering the probability of a
template as it moves farther away from S. The types of prior probability density
functions published in the literature can be classiﬁed into four classes: (i) a continu-
ity measure through a Markov random ﬁeld [103, 170]; (ii) a parameterized template
and probability density functions on the parameters of the template [117, 151, 190];
(iii) a template governed by a set of stochastic differential equations [77], and (iv)
an ideal template that can be deformed through a set of allowable transformations
(deformations) [4]. The goal of the prior probability density function in all these

models is to constrain the shape of the deformed template.

Likelihood

The purpose of the likelihood function is to evaluate the probability of an observed
image, given a set of template parameters. Although the deﬁnition of the likelihood
function is application dependent, three models have been commonly used in the

literature. Yuille et al. [190] and Kass et al. [103] deﬁne likelihood functions based

105

on edge or region potentials. Another alternative is to base the likelihood function
on a random ﬁeld description of the image as described in Geman et al. [73], Mardia
(gt al. [129]. and Lakshmanan and Grimmer [117]. The inside and the outside of
the template are modeled by two independent processes. Storvik [170] deﬁnes an
energy function that comprises of both of these models. Finally. Grenander and his
colleagues [4, 77] assume that the true image is deﬁned by the prior distribution on
the deformable gray level template and the observed image is generated by adding
an independent noise process to the true image. It, is important to note that if it is
difﬁcult to directly specify the likelihood, then it is better to deﬁne an energy function

and model the likelihood as a Gibbs distribution.

Minimization algorithm

The ﬁnal issue to be considered when solving an image segmentation problem using
deformable templates is the choice of the minimization algorithm. Gradient descent
is the most common optimization technique and was used by Kass et al. [103] and
Besag [9]. This method has the disadvantage of requiring very good initial parameter
estimates, otherwise it will converge to a local minimum. Stochastic methods based
on simulated annealing have the capability of avoiding local minima. Geman and
Geman [72] have used the Gibbs sampler and Lakshmanan and Grimmer [117] have
used the Metropolis algorithm to solve the segmentation problem. Another class of
optimization algorithms involves diffusion processes based on the Langevin equation
[74, 75]. Grenander and Miller [78] and Miller et al. [136] have also introduced the
concept of jump diffusion processes to explore connected and non-connected areas of
the hypothesis space.

The a priori knowledge about the images and objects of interest in our application

domain consists of the following information: (i) the object of interest is a vehicle,

 

97: new
‘97

 

 

91=(XI‘Y1)'

ﬁes: (X8.Y8)
Figure 4.9: Deformable template deﬁning the generic model of a vehicle.

(ii) it is located in the lane closest to the camera, (iii) it is moving, and (iv) its
edges in the image are well deﬁned. Items (i) and (ii) are directly related to the
deﬁnition of the deformable template (prior) whereas items (iii) and (iv) characterize
the image (likelihood). The following sections describe the three main components of

our segmentation algorithm based on deformable templates

4.2.2 A Deformable Vehicle Template

The ﬁrst step in designing a segmentation algorithm using deformable templates
consists of deﬁning the model and prior probability density function for the template.
Our deﬁnition of the deformable template of a vehicle is similar to the approach
adopted by Phillips and Smith [151]. We have deﬁned the template of a generic vehicle
as a polygon characterized by N vertices (P1, P2, . . . , PN) in the 2D plane as shown in
Figure 4.9. Let (9 = (01, 62, . . . . 0N), where 0, = (X,, Y,), deﬁne the parameters of the
deformable template. Each pair of successive points (Pi, Pi+1),i = 1,... , N deﬁnes
a line feature 59,- on the boundary of the template. We assume that the boundary is
closed, so for notation purposes, PN+1 : P1.

The shape of the deformable vehicle template was established by studying a
database of images containing typical vehicles which can be considered as our train-

ing set. Since there are large variations in vehicle shapes, it is unlikely that a single

107

VEV‘H

Sedan Pickup Truck

KN m (”a

Hatchback Station wagon and 4x4 Van

Figure 4.10: Five classes of deformable vehicle templates.

template can capture all the shapes. So, we have categorized all the vehicles into the
following ﬁve classes: sedans. pickup trucks. hatchbacks. station wagons and 4x45,
and vans. Each vehicle class is deﬁned by its own deformable template. Typical tem-
plates of these ﬁve classes are shown in Figure 4.10. Note that the sedan and pickup
trucks are modeled by a polygon with 8 vertices whereas the polygonal model of the
hatchbacks, station wagons and 4x45, and vans only have 7 vertices. Other vehicle
categories such as large trucks or motorcycles are not of interest in this thesis.

In order for a polygon to resemble a vehicle. some constraints on the relationships
between the different vertices must be satisﬁed. For example, ,2, should always be
above 993, 591 should always be located to the left of 997, etc. A set of rules on the
template parameters constrains the shape of the vehicle template. These rules were
empirically determined by studying the intraclass variability of the vehicles in our
training set. We used a training set of 80 vehicles of different categories. A generic
rule, say rk, speciﬁes the range of allowable values that a parameter (the X or Y
coordinate of one of the vertices) is allowed to take. Without a loss of generality, we
denote this parameter as Xk. The rule 7“), states: (.X’Lni" S X), S XE“). The thresh-
olds X F" and X?“ are functions of the other parameters (X1, Y1, . . . , XN, Yjv) of the
template. Within the interval [.YP‘“,XL““"], the prior probability for the parameter

X), is maximum. As Xk deviates from the bounds of the interval, its prior probability

108

decreases. More specifically. we define a function g(rk) for rule m. as follows:

I

_-\’k + ‘\'Lnin if A]. < ‘x'L'nin

gm) 2 4 0 if win 5 X. g X?“ (4.13)

 

I 'max 0 r ’n18x
k‘\k—‘\k lf.\k>.\k

The role of g(') is to penalize a template whose parameters fall outside their
allowed range. The constraint function (prior) C(O) for the template T(O) is then
deﬁned as:

R
C(e) = Z g(rk). (4.14)
[:21
where R is the total number of rules. The larger C(O), the more the template 7(9)
deviates from the set of allowed shapes.

The choice of the function g() is very important since its shape greatly inﬂuences
the shape of the energy function and the convergence rate of the algorithm. For the
choice of g(-), we experimented with the following binary function

, 0 if X3” 3 Xi. 3 X31“
9 (rt) = (4.15)
1 elsewhere

This type of function. however. is not as suitable for the following simple reason.
If the value of the parameter is outside its allowed range, then the values of its
neighbors are also outside the allowed range, resulting in all the template parameters
to have the same constraint function value. On the other hand. a linearly decreasing
slope towards the allowed range will push the template to reach states where all the
parameters satisfy the rules. This is the reason why we have chosen g(-) as deﬁned
in Eq.(4.13).

We have deﬁned a set of 17 rules for the deformable template as shown in Fig-

109

ure 4.11. There is one rule for each of the 16 vertex coordinates and an additional rule
to constrain the vehicle template to be a closed polygon. To simplify the notation.
we deﬁne the length of the template L = ii, — i}. It can be seen that most of the
rules involve the parameter L. allowing the prior probability of the template to be
independent of scale. The parameters B and BI represent the row position and start—
ing column position of the initial baseline and L0 is the initial estimate of the length
of the vehicle. These values are estimated using motion segmentation as described
in Section 4.2.3. The role of these three parameters is to constrain the position and
scale of the template. When studying our training set of images, we noticed that
four of the parameters. namely Y3. if], Y5. and Y} strongly influenced the shape of
the vehicle and its classiﬁcation as a sedan, pickup truck, hatchback, station wagon
or 4x4, or van. Thus. the rules for these parameters were designed separately for
the ﬁve different classes. The other parameters obey rules that are common to all
categories of vehicles. We will see in Section 4.2.4 how we have incorporated the class

speciﬁc rules to segment the front vehicle without knowing a priori the class of the

vehicle in the input image.

4.2.3 Image Model

Once the template shape and the prior probability have been deﬁned, we need to
deﬁne the likelihood which characterizes the “closeness" of the template to the image
data. Our likelihood function is deﬁned by an energy function which comprises of two
terms: a motion information term and an edge directionality term. This combination
of two image cues contributes to the novelty of our energy function. The energy term
based on motion ensures that the template is located within the moving areas in

the image, thus acknowledging the fact that the vehicle of interest is moving. The

110

 

All categories: Sedans:
8—2 S.Y1S 8+2 [2+0.1L SY3S Y2+0.25L
X3 3 X3 3 X1 - 0.9(X1 — X3) Y3 3 Y; 3 Y3 + 0.2L
X1 — 0.6L 3 X3 3 X1 — 0.2L Y4 + 0.251. 5 Yr, 3 Yi + 0.4L
X3 — 0.6L 3 X4 5 X3 - 0.1L Y5 + 0.2L 3 Y7 5 Y5 + 0.4L
X1 - 0.95(X1 - X4) 3 X5 3 X1 -— 1.05(X1 — X4)
X1 - 0.95(X1 — X3) 5 X6 3 X1 — 1.05(X1 — X3)
X5 3 X7 3 X1 — 0.75(.\'1 — X5)
B — ‘2 5 X8 3 B + ‘2
81—10 ngg 81+10
Y1 — ‘2 3 Yr; 5 ’1 + ‘2
’5 + 0.05L 3 Y6 _<_ Y5 + 0.3L
Y8—2 3Y7< Y8+2

Y1+Lo—25 gig; Y1+Lo+10

Pickup trucks: Hatchbacks:
’2 + 0.35L 5 Y3 S Y2 + 0.6L Y2 3 Y3 3 Y2
Y3 S Y4 S Y3+0.05L Y3+0.1L S Y4 S Y3+0.4L
Y4 + 0.1L 3 Y5 S '4 + 0.2L ’4 + 0.25L g Y5 S ’4 + 0.4L
Y5 + 0.2L _<_ Y7 3 Y5 + 0.4L Y6 + 0.2L 5 Y; 5 Y5 + 0.35L
Station wagons and 4x45: Vans:
Y2 S Y3 S Y2 Y2 S Y3 S Y2
Y3 S Y4 S Y3 + 0.2L Y3 S Y4 S Y3 + 0.1L
Y4 + 0.4L _<_ Y5 3 Y4 + 0.55L ’4 + 0.55L 3 Y5 5 Y4 + 0.9L
Y6 + 0.15L 3 Y7 5 Y6 + 0.4L Y6 + 0.05L 3 Y7 3 Y5 + 0.3L

 

 

 

Figure 4.11: Set of rules used to deﬁne the prior probability on the deformed vehicle.

111

edge directionality term ensures that the boundary of the template coincides with
the strong edges having the same orientation in the image. For every point on the
template boundary segment (line feature) pk, we evaluate the angle between the line
feature and the gradient direction in the image at that point. For a good match, this
angle should be 90°. We also include the gradient magnitude in our computation of
the energy function to ensure that the template is located at strong edge sites. Note
that the likelihood function does not assume that the inside, outside, and boundary
of the template are independent processes. Before describing the energy function in

more detail, we present the motion segmentation and edge detection processes.

Motion segmentation

Motion information is an important cue in our application since we know that the ve-
hicle of interest is moving. Motion segmentation based on 3-frame image subtraction
as described in Section 3.1.4 is used to extract the moving areas in the image. Note
that this motion segmentation algorithm does not use any domain—speciﬁc knowledge
about the type of objects expected to be present in the image. The resulting mo-
tion mask highlights the moving areas in the image which include both the vehicle of
interest and other moving vehicles.

The difference image is also used to compute the initial baseline position B and
BI, and an initial estimate of the length Lo of the vehicle (see Figure 4.12). These
values are needed to evaluate the constraint function C(@) in Eq.(4.14). The vehicle
of interest is located in the lane closest to the camera whose extent in the :r-direction
(£1 and {2) can be determined if the geometry of the camera is known. The baseline

of the vehicle is found by summing the rows of the difference image and locating the

112

maximum sum between 51 and 52.

B = arg max Zrltij). (4.16)
J

‘ 56662

The extent of the vehicle in the y-direction is determined by locating the column
positions 81 and B; where sharp changes in the sums 22g d(i,j) occur. The initial
estimate of the vehicle length is deﬁned as L0 = 32 — Bl. Note in Figure 4.12(g)
that. as a ﬁrst approximation of the vehicle baseline. this algorithm mostly identiﬁes
the vehicle shadow. The baseline and vehicle length are reﬁned by the deformable

template algorithm.

Edge information

The edges are detected in the middle frame in the 3-frame sequence. thus I : f2 in
the following. The gradients VII and V1,, in the .r and y directions, respectively, are

computed using the Canny edge detector [‘24]. The gradient magnitude

 

IIWHM) = Ween + vets) (4.17)
and the gradient unit length direction vector

1
67 i,’ = ___,_,. v1,. '.' ,W, ' ‘ 4.18
< J) mm,“ (. J). (. 1)) ( )

 

are computed at every point in the image. Figure 4.13 shows the resulting edge

images.

113

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(s)

 

Figure 4.12: Estimation of the location of the baseline and length of the vehicle:
(a) frame 1; (b) frame ‘2; (c) frame 3; ((1) difference image; (6) row summation and
location of B; (f) column summation and location of B; and 82; (g) the white line
shows the estimated baseline and length of the vehicle.

114

 

 

 

 

j"; “iii"

W2 '1}?! my;
'93:" "i '

 

Figure 4.13: Edge detection: (a) input image I; (b) edge magnitude image ”VI“; (c)

and (d) the two components of the edge direction image vector VI.

Likelihood function

‘The likelihood function is expressed by an energy function which comprises of a
motion information term and an edge directionality term. Motion information is
included in the likelihood function by defining an energy function (Hg) for each
line feature pr. on the boundary of the template that is maximum when all the points

in p1,. are inside the motion mask .Ms.

 

(r1(‘r7k) = -. (4.19)

where NA. is the number of pixel in line feature ,:;,.. Note that 0 S (Heck) S 1.

'The edge directionality term is deﬁned as follows. We restrict a line feature up), to
be located near pixels in the image that have a high gradient magnitude and whose
orientation is perpendicular to the gradient direction.

The length of the line feature 9%, deﬁned between the points (Xk, H.) and (Xk+1, Yk+1)

on the boundary of the template. is deﬁned as

 

Hall = t/txv... — -X’t)2+(1t+i — m2 (4.20)
and its unit length orientation vector is deﬁned as

1 f r - r
,3: = +—(.Xk+1—.\k,1‘k+1— n)‘. (4.21)
119k“

Note that a: is the same for all points on the template boundary between (Xk, Y1.)

and (XkH, n+1)-

116

The energy term involving the directionality of the line feature ,ci. is deﬁned as

21...)... [NV/Hun x12<EZ-V—i(zf.j)>]
(em) = _.. (4.22)

\/(Z(z.ne.~.. [IV/11031llzv/(Zwm. (1‘3: ' —i(i-J')))i2

 

 

where

hm = ‘ ‘ (4.23)

0 elsewhere

and ‘-’ denotes the dot product of two vectors. The ﬁrst term in the numerator
of Eq.(4.‘2‘2) is maximum when the pixels in the image along tpk have a high gradi-
ent magnitude. The second term is maximum when the gradient direction is per-
pendicular to the orientation of 92k. The term in the denominator has three major
roles. As a scaling factor. it ensures that [72(tpk) is not biased towards longer line
features. It also balances the importance of the gradient magnitude term and the gra-
dient direction term. Finally, this type of normalization makes the energy function
smoother while retaining its valleys. Using the Schwartz inequality. it can be shown
that 0 S Ug(apk) S 1.
The total energy function for template parameters 9 is then
1 .

N
(7(6)) = V Z [171(pk) + [Inert]. (4.24)
1 k=l

where N is the number of line features in the polygonal model. The factor l/N
normalizes the total energy function so that templates with different dimensionalities
(number of line segments) can be compared. Note that U(G)) is maximum when the

template perfectly coincides with the edges of the vehicle.

117
For the choice of /2(.). we considered the function

A

= —— 4.2"
1+ air] ( o)

II’(.1')
which was suggested by Blake and Zisserman [15]. In this function. .3 controls the
height of the curve at. the origin and 0 controls the spread of the curve. This function
is more appropriate for inﬁnite domains. However, in our expression of the edge
directionality term, h(-) is applied to a variable whose value varies between —1 and
+1. For different parameters A and a. we noticed that Eq.(4.25) resulted in very
narrow valleys of the energy function. The function l2.(~) that we propose in Eq.(4.23)

does not suffer from this problem and results in a smoother energy function.

4.2.4 MAP Segmentation

Given the constraint function and the likelihood function, the image segmentation
problem can be formulated in the Bayesian framework. In particular, we seek to

maximize the a posterior? (MAP) estimate deﬁned as

A

O = arg meaxp(OIX). (4.26)

Let us deﬁne the a posteriori probability distribution p((9|X) as a Gibbs distribution
with energy [3(9) 2 —U(G)) +C(O) which combines the constraint and the likelihood

energy functions:

mam = %exp{—E(e)}. (4.27)

where Z is a normalizing constant. Then, Eq.(4.26) reduces to

A

O = arg main E(O) = arg m(_)in[—U(O) + C(69)]. (4.28)

Energy

-l8l r

-1.82 >

-183 r

4.84

-1.85 r

-|.86

-l.87

4.88

 

 

.1

AM

k /

1

 

 

66 68 70 72 74 76 78 80 82 84
Y3

118

[Energy

 

 

A 4L

A L

 

J—

 

-|.88 P

122124126128130132134136138140142144146
Y5

Figure 4.14: Cross—sections of the energy function along two parameters Y3 and Y5.

So, the segmentation problem is reduced to an energy minimization problem. The

optimal template O is deﬁned as:

O = argngn [291770—

k=l

2m... [IIVIll(i,j) x we: - awn]

k=1

2mm... MIN)

 

1i{

N

1V):

 

+

 

 

“(206)6ka l]VI]|(i,j)>2\/<Z(i,j)em hf‘P—I: ' W(Z,J)))2 J

(4.29)

 

The energy function which is to be minimized is quite complicated. Figure 4.14

shows two cross-sections of the energy function E(O) as a function of the parameters

Y3 and Y5 (y-coordinates of two vertices on the template). Notice that, as expected,

the energy function contains many local minima. Thus, O cannot be easily determined

using a gradient-based method. To obtain a global minimum, we use the Metropolis

algorithm [132] based on simulated annealing.

Simulated annealing

The Metropolis algorithm minimizes an energy function E(('9), given a stochastic

transition matrix R, and an annealing temperature schedule Tk such that lim;MOO T1, =

0. An initial state ('90 is chosen and the algorithm sequentially constructs a sequence

of states 81, 92, . ..

F by

and set

where

119

as follows. Given a sequence of states ((90, 91,. . . , (9k), generate

P(r = no... . . . , o.) = [29,, (4.30)

[‘ with probability Pk
Gk+l : (4.31)

0;. otherwise

 

pk = min (exp { _[E(F)7:— E(®k)]} ,1) . (4.32)
k

The resulting sequence of states forms a Markov chain with one—step transition prob-

abilities given by

1 R..- if EU) 3 EU) and .- ¢ 3'

P19t+1=j19k=i1=l R,,exp{—[E(j)—E(i)]/Tk} ifE<j)>E(z‘)

 

. 1— ing,-me).+1 = mIOk = i] ifi =j
(4.33)

It has been proved that the temperature schedule T)c 2 C / (1 +log k) ensures that the

Markov chain converges to the global minimum of the energy function with probability

1 [110].

The transition matrix R determines how the next template parameters are chosen,

given the current template parameters. Recall that our template 7(9) is deﬁned as

a polygon consisting of N points in the 2D plane. The new location (Xi‘H, Ynk“) of

120

the nth vertex is determined according to the following transition probabilities

 

 

P[X,’;‘+l =1— 1.1:“ = j X}; = 1.1: =j]=1/4

P133?“ = 1+ H?“ =1 X3." = 231;." =j1=1/4 ( )
4.34

P[.\’,’f+‘ = 1,1?“ 23' — 1 X}; = 1.1": = j] = 1/4

 

P[x,‘:+l = 1,1?“ =j+1|Xff :13: = j] = 1/4.

In other words, each vertex R, of the template is allowed to move to any of its 4

neighbors with equal probability.

Segmentation algorithm

The ﬁrst requirement of the segmentation algorithm relates to the ﬁve different types
of templates (vehicles) which we have designed. As mentioned in Section 4.2.2, dif-
ferent classes of vehicles are constrained by different sets of rules. How can we use
class information in the segmentation algorithm if the class of the vehicle in the in-
put image is not known in advance? We chose to run the Metropolis algorithm as
many times as the number of classes. Our segmentation algorithm is divided into
ﬁve epochs, each epoch corresponding to a particular vehicle class (sedan, pickup
truck, hatchback, station wagon and 4x4, and van). In each epoch, the Metropolis
algorithm is started with a template belonging to the vehicle class being considered
and iterates while constraining the template to remain in the same class. The algo-
rithm then outputs the template for the vehicle class that has the minimum energy.
This approach has two advantages. First, the ﬁve epochs can be executed in parallel.
Second, in addition to segmenting the vehicle in the lane closest to the camera, the
algorithm is also able to classify the vehicle into one of ﬁve classes.

To apply the Metropolis algorithm, we need to deﬁne an appropriate cooling

121

 

_—
t\
v

 

logarithmic g
geometric -------

.1

Temperature

PJ'JJt-UIO‘QmC
T vaY_.

 

 

 

 

.._‘

M

t) 250 500 750 1000 1250 1500 1750 2000
Number of iterations

 

:—
V

Figure 4.15: Comparison of the logarithmic and geometric cooling schedules.

(temperature) schedule. The cooling schedule is typically deﬁned as

C

T :——- .
k l+1ogk’ (435)

where T), is the temperature at the kth iteration and C is a constant. Our experience
indicates that it is very difficult to choose a “correct” value for this constant C; if C is
too small, the algorithm converges to a local minimum relatively close to the starting
position, and if C is too large, the number of iterations required for convergence is
extremely large. Strenski and Kirkpatrick [171] have studied ﬁnite length annealing
schedules and have suggested the use of a geometric cooling schedule. We adopt their

strategy and choose the following cooling schedule
k
T -—.
T1. = To (.1) A , (4.36)

where To is the starting temperature and Tf is the ﬁnal temperature to be reached
after K iterations. Figure 4.15 shows the behavior of the geometric cooling schedule
compared to the logarithmic schedule. It can be seen that in the geometric schedule,
the temperature does not initially drop too rapidly, but it approaches the zero value

much faster.

122

Figure 4.16 illustrates the evolution of the template as the number of iterations
increases. The initial placement of the template is determined by the parameters B,
Bl, and L0, which estimate the initial position of the baseline and the initial length of
the vehicle. At the beginning of the algorithm the temperature is relatively high, so
the algorithm explores a large space of template parameters even if the corresponding
templates do not satisfy the constraints and have a relatively high energy. As the
temperature decreases, the constraints are satisﬁed, the outline of the template gets
closer to the edges of the vehicle of interest, and the energy converges towards the
global minimum.

However, we have experienced difficulties in choosing the parameters for the cool-
ing schedule. We ran the ﬁve epochs of the Metropolis algorithm for 18 sets of pa-
rameters: To 6 {1.0,2.0,5.0, 100,200, 50.0}, T; = 0.01, and K E {2000, 2500, 3000}
on a particular image to observe the behavior of the algorithm. Figures 4.17 and
4.18 show the resulting templates, their vehicle class, and the corresponding energy
function. It can be seen that the segmentation and classiﬁcation results are strongly
affected by the parameters of the temperature schedule. In 12 of the 18 cases, the
vehicle was correctly classiﬁed as a van, but in two of those cases, the shape of the
template is not very accurate. In the remaining 6 cases, the vehicle was mistaken for
a sedan or a pickup truck.

Unfortunately, the best set of parameters for the temperature schedule varies with
the input image. By studying the behavior of the algorithm on a number of images,
we have noticed that our formulation of the energy function is correct in the sense that
its minimum always corresponds to a template which correctly outlines the contour
of the vehicle. So, the problem lies with the sensitivity of the annealing algorithm to
the parameters of the cooling schedule. To maximize the chance of ﬁnding the global

minimum of the energy function, we run each epoch of the Metropolis algorithm

123

    
 
  

 

Iteration 0 (E = —1.508) Iteration 200 (E = 2.437)

   

Iteration 400 (E = 2.580) Iteration 600 (E = ——0.759) Iteration 800 (E = —1.503)

 

= —1.851) Iteration 1400 (E = —1.885)

 

Iteration 1600 (E = —1.893) Iteration 1800 (E = —1.917) Iteration 2000 (E = —1.928)

Figure 4.16: Successive iterations of the vehicle segmentation algorithm.

   

To = 1.0, K = 2000 To = 1.0. K = 2500 To = 1.0, K : 3000
E = -—1.89, Sedan E = —1.88, Sedan E = —1.87, Pickup

           
 

      
 

To = 2.0, K = 2000 To = 2.0, K = 2500 To = 2.0, K = 3000
E = —1.89. Sedan E = —1.92, Van E = —1.91, Van

 

To = 5.0, K = 2000 To = 5.0, K = 2500 To = 5.0, K = 3000
E = —1.91, Van E = —1.89, Sedan E = —1.92, Van

Figure 4.17: Results of the Metropolis algorithm for different sets of parameters.

125

 

To = 10.0, K = 2000 To = 10.0, K 2 2500 To = 10.0. K = 3000
E = —l.84, Van = —1.88, Sedan E = —l.93, Van

   
 

     
 

     
 

 

,4,

To = 20.0, K = 2000 To = 20.0, K = 2500 To = 20.0, K = 3000
E = —1.91, Van E = —1.93, Van E = —1.92, Van

     

To = 50.0, K = 2000 To = 50.0. K = 2500 To = 50.0. K = 3000
E = —1.84, Van E = —1.92, Van E = —-1.92, Van

Figure 4.18: Results of the Metropolis algorithm for different sets of parameters
(cont’d).

126

with 18 different sets of parameters and retain the template which has the minimum
energy. Of course, this adds to the computational requirement of the algorithm.

During each iteration of the Metropolis algorithm. the N vertices are scanned
clockwise. Let 9"” be the set of template parameters obtained after iteration k — 1.
At iteration k, a sequence of N template parameter sets (8,1295, . . . ,Ofv) is created
by sequentially updating each of the vertices. For notation purposes, we deﬁne 9'6 =
0"”. Each 91‘, is created by moving the nth point in 9:4. Then 9" is deﬁned
to be Ofv when all the vertices have been updated. A high-level description of our
segmentation algorithm is given in Figure 4.19.

The CPU time requirements (on a Sparcstation 20) for a 256x256 images are as
follows. As mentioned earlier, the motion segmentation algorithm takes 7 s. The
computation of the gradient magnitude and direction based on the Canny edge de-
tector takes 3.5 5. Finally, each epoch of the deformable template algorithm takes 15

s for 2,000 iterations.

4.2.5 Experimental Results

The deformable template algorithm for ﬁnding the vehicle in the lane closest to the
camera was tested on a total of 287 images. Figures 4.20 to 4.24 show a few of the
examples of segmentation results. Note that in several of these images, the front
vehicle occludes a number of other moving vehicles and the stationary background is
rather complicated. It can be seen that the outline of the vehicles were accurately
identiﬁed and the vehicles were correctly classiﬁed.

Figure 4.25 show examples of vehicles which were incorrectly classiﬁed. In Fig-
ure 4.25(a), a sedan was mistaken for a hatchback, because the trunk of the vehicle

was not detected. In Figure 4.25(b), a van was classiﬁed as a pickup truck because

 

127

 

Input: Motion segmentation mask M,, estimated baseline (B, BI) and length
(L0) of the vehicle, gradient magnitude image “VI”, and gradient direction image

Vi
begin
Set Tf = 0.01
f_or; To e {1.0, 2.0, 50,100, 20.0, 50.0}
ﬁ)_r K E {2000, 2500, 3000}
fo_r C 6 (Sedan, Pickup truck, Hatchback, Station Wagon, Van)
Select an initial set of template parameters ('90 in C
Evaluate E (90)
Set (110.130 = 9°
Set k = 0
/* K iterations of the Metropolis algorithm */
while I: < K
Set k = k + 1
Deﬁne (9’5 = 9"”
/* Examine all the vertices clockwise */
f_o; n = 1, . . . , N
Choose new template parameters F
/* The new template deﬁned by the parameters F has
the same vertices as the template deﬁned by
the parameters 0L1 except that point Pn
has been moved to one of its four neighbors */
Evaluate E(F)
Compute T), = T0(Tf/T0)k/K
Compute p = exp{-[E(I‘) — E(®t-.)1/Tk}
Generate u, 0 S u S l
ﬁpzlorU<p<lt_lm_aG§=Psls_e@§=9§_i

 

 

 

endfor
Set 9" = (9%,, .
1: E((-)") < E(€-)(To,11’,C)) th_er1 em, 1.30) = ok
_endw_hile
endfor
endfor
endfor
Output the best template O
gig
Output: The list of template parameters and a mask of the vehicle of interest

k

‘U

 

Figure 4.19: Vehicle segmentation algorithm.

 

128

 

Figure 4.20: Segmentation results for three sedans.

 

Figure 4.21: Segmentation results for three pickup trucks.

 

Figure 4.22: Segmentation results for three hatchbacks.

129

 

Figure 4.23: Segmentation results for three station wagons or 4x4.

 

Figure 4.24: Segmentation results for three vans.

130

 

Figure 4.25: Segmentation results leading to wrong classiﬁcations: (a) a sedan mis—
classiﬁed as a hatchback: (b) a van misclassiﬁed as a pickup truck: (c) a station wagon
misclassiﬁed as a hatchback.

the contrast at the bottom of the windows on the side of the van is sharper than the
contrast along the roof of the van. In Figure 4.25(c), a station wagon was mistaken
for a hatchback, because the template locked itselfon the edge between the road and
the sidewalk instead of on the edge along the roof of the vehicle.

A signiﬁcant advantage of the deformable template algorithm is that. in addition
to segmentation. it also classiﬁes the vehicle of interest. So, the segmentation results
can be evaluated by computing the classiﬁcation error. Table 4.1 shows the classi-
ﬁcation results for the entire database of 287 vehicles. Note that the rules used to
constrain the shape of the deformable templates were determined using a set of 80
training vehicles. It can be seen that we obtain very good classiﬁcation results since
only 24 of the 287 vehicles were misclassiﬁed. This leads to a correct classiﬁcation
rate of 91.6%. Incidentally. for matching purposes. it does not matter if the algo-
rithm misclassiﬁes a vehicle as long as it does it constantly for every image sequence
in which that vehicle appears. However, we have noticed that in our experiments, the

misclassiﬁcation of a vehicle was not consistent.

131

Table 4.1: Classiﬁcation results for 287 vehicles.

 

 

 

 

 

 

 

 

 

True Assigned class

class S PT] 11 SW ] V ]
Sedan 151 4 3 0 0
Pickup Truck 0 27 0 0 0
Hatchback 1 0 21 3 0
Station Wagon or 4x4 1 3 1 31 1
Van 2 2 1 2 33

 

 

 

 

 

 

 

 

4.3 Summary

In this chapter, we have described two image segmentation algorithms to extract an
accurate contour of the moving object in a 3~frame sequence based on motion and
color information. The object of interest is the moving vehicle in the lane closest to
the camera.

We have integrated the results of the motion segmentation and the color segmen-
tation algorithms using active contours to extract an accurate boundary of the moving
object. The integration process combines the strengths of both the algorithms. The
model information is used to enforce the boundary of the moving object to be both
continuous and smooth. We have demonstrated the performance of this segmentation
algorithm on a variety of moving objects both in indoor and outdoor scenes.

To resolve cases where there are multiple moving objects in the image which
cannot be handled by the segmentation algorithm based on active contours, we have
proposed a segmentation algorithm which utilizes knowledge about the scene and the
objects in the scene through the theory of deformable templates. We have represented
a generic model of a vehicle by a polygon and deﬁned an energy function based on
motion and edge information to locate the vehicle in the lane closest to the camera

and separate it from the other moving vehicles in the other lanes. Not only does this

132

algorithm extract the vehicle of interest, but it also classiﬁes it in one of five vehicle
categories. We have seen that the algorithm produces excellent segmentation results

and we obtained a correct classiﬁcation rate of 91.6% on a set of 287 images.

Chapter 5

Color and Shape Matching

The input to the object matching system is the two sets of objects that have been seg-
mented from their respective backgrounds. The system has to ﬁnd pairs of identical
objects from the two sets. Recall from the deﬁnition and assumptions of an object
matching system given in Chapter 1 that the objects are observed under approxi—
mately the same aspect and are not occluded. This chapter describes the techniques
Of color histogram matching and shape matching based on the Hausdorff distance be—
tWeen two point sets. These two matching techniques are used by the vehicle matching

SYStem described in Chapter 6.

5.1 Color Histogram Matching

The operation of color matching consists of deciding on the similarity of two objects
baSed on color information alone. The input to the color matching system is a list of
RG3 features for the pixels belonging to the object. The simplest measure of color
Similarity is based on comparing the average colors of the two objects. However, we
have Seen during the segmentation stage that the object of interest in our application

could be made up of regions having different colors (windows. wood panels, lettering,

133

134

...) If the color components were averaged over the entire object. then the average
color value would not. have sufficient discriminatory power.

V'Ve have modified the color indexing measure proposed by Swain and Ballard
[172] to deﬁne our color matching score. This section presents our deﬁnition of a
color matching score based on 3D histogram matching and studies its behavior to
determine the optimal bin size for the computation of the histogram. We demonstrate
the performance of the color matching score on real images. We also analyze the
cumulative distribution of the color matching scores to assess the signiﬁcance of a

particular score.

5. 1 . 1 Color indexing

Swain and Ballard [172] proposed a technique that uses color features instead of the
C0fnrrlonly used geometric features (see Chapter 1) to index a database of models.
They did not build a complete object recognition system, but showed how color in-
deXing can be used to reduce the list of model candidate hypotheses. Their technique
is based on color histogram matching. The color histogram of the object model is
COmpared to the color histogram of the entire sensed color image to decide if an object
model might be present in the scene.

A color histogram is a three—dimensional data structure which, at each location
(1', y, 2) contains the number of pixels H(:r, y, z) in the image having color components
3:, y, and 2:. In their algorithm. Swain and Ballard used the opponent processes color
Space (deﬁned in section 3.2.1). The similarity between a model histogram M(3:, y, z)

and an image histogram H(.r. y, z) is defined as

W):weminiﬁgg~ly->vM<w~z1x (...)

 

1 3 5

where [M] denotes the number of pixels in the model histogram. This measure takes
a value between 0 and l. and it is equal to 1 if the two histograms are identical.
The fact that the two histograms are identical does not mean that the image is the
same as the model. It only means that the model and the image contain the same
distribution of colors. This is the reason why color histogram matching alone cannot
be used for object recognition, but only for indexing the model database.

This indexing procedure was tested by Swain and Ballard on a model database
containing 66 models. and the experiments showed that the correct model was the

best match for 29 of the 32 test objects and it was always included in the top two

matches.

5. 1 .2 Color matching

Swain and Ballard were interested in using color indexing for object recognition.
Therefore, they measured the similarity of a model object and an entire image which
possibly contained that object. For object matching purposes, the problem is slightly
different since the two objects of interest have already been extracted from the images.
Given two objects (91 and (92 with color histograms H1(r,g,b) and H2(r,g,b), the

color matching score is deﬁned as:

.C _ Zr :9 :3 mian1(r,g.b),H2(r,g,b))
.5 (01,02) — min(]H1I,]H2I) . (5.2)

 

where [.l denotes the number of pixels. An example is illustrated in Figure 5.1 for
two one-dimensional histograms.

This dissimilarity measure also takes values between 0 and 1. If the two his-
tograms do not overlap at all, then the color matching score will be 0. If the two

histograms are identical, then SC((91,02) = 1. Even in the presence of imperfect

1365

H l(r)

H 20‘)

  

 

 

 

1
,rd
fa» 1‘1

Figure 5.1: Computation of the color matching score.

 

object extraction, the color matching operation behaves properly because as long as

one of the histograms is entirely contained in the other one, the color matching score

is still equal to 1.

Estimation of the bin size

The main difﬁculty in computing the color histogram matching score in Eq.(5.2) lies
in the choice of the bin size and or the number of bins used in constructing the
histograms. It is quite obvious that if one uses 256 bins in each of the 3 dimensions
of the histogram (thus deﬁning each bin to be of size 1), small amounts of noise in
the images might drastically change the value of the color matching score.

We have performed some experiments to evaluate the best bin size. We have cre-
ated 75 patches of color and manually clustered them into 9 color clusters, namely
black, brown, red, white, blue, light brown, yellow, green, and gray. We computed
the 75x75 similarity matrix of color matching scores and transformed it into a dis-
similarity matrix X by:

X(i,j)=1— 3°(i.j)- (53)

137

We then deﬁned an a priori 75x75 dissimilarity matrix Y in the following manner:

' . _ 0 if objects 1 and j have the same category label
1' (1.1) = (5.4)
1 otherwise

The normalized Hubert‘s F statistic [90] is designed to assess the ﬁt between the data

.X' and the a. priori structure Y. It is deﬁned as:

 

 

1 1 n—l n
f 2 {M Z 2 1X03) - nz.x1[Y(i~j) -— myl} (5-5)

0X0}, 1:1 j=i+l

where n = 75 is the size of the matrix and M = n(n —— l)/2 = 2. 775 is the number of
entries in the double sum (the number of elements in the upper triangular dissimilarity

matrices), and the moments are deﬁned by:

mx 2 (LIZZXILJ') 0} ‘2 ﬁZZXWw) “ 7713‘!

m = 322112111 0;, = {7221’3131— mt»

(5.6)

Note that 0 S F S 1 and F = 1 if the a priori clusters reflect the structure in the
data perfectly. We used the Hubert‘s F statistic to determine how well the similarity
matrices of color matching scores computed with different histogram bin sizes agree
with the clusters that were manually deﬁned.

Table 5.1 shows the value of the Hubert’s F statistic for different bin sizes. It can
be seen that the Hubert’s F statistic is low for large bin sizes because the histograms
are not precise enough. It reaches a maximum for a 1—dimensional bin size of 16.
Then, it slowly decreases for smaller bin sizes due to the noise in the data. Based on
these empirical results, we have computed the color histogram using 16 bins in each
dimension. It takes approximately 0.05 s to build the histograms and compute the

color matching score on a Sparcstation 20 for two 20x50 images using 16 bins (of size

138

Table 5.1: Hubert’s F statistic for different bin sizes.

 

 

 

 

 

 

 

 

[ Bin size Number of bins Hubert’s F statistics

128x128x128 2x2x2 0.20
64x64x64 4x4><4 0.21
32x32x32 8X8x8 0.34
16x16x16 16x16x16 0.68
8x8><8 32x32x32 0.65
4x4x4 64x64x64 0.61
2x2x2 128x128x128 0.56

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.2: Eight color patches.

16) in each dimension.

Experiments with real data

Figure 5.2 shows 8 of the 75 patches of color and Table 5.2 shows the color matching
scores for all possible pairs. It can be seen that there are only two signiﬁcant matching
scores: 0.973 between (a) and (b) which are both black and 0.817 between (c) and (d)
which are both red. All the other matching scores in Table 5.2 are much smaller. Thus,
the color matching score is able to match similar colors and discriminate between
different colors.

To show more complete results, we performed a hierarchical clustering of the 75
color patches using the single linkage technique. We used the color matching score as

a measure of the similarity between two color patches. Figure 5.3 shows the resulting

139

Table 5.2: Color matching scores for the eight color patches shown in Figure 5.2.

 

 

 

 

 

 

 

 

 

1 H (1)) 1 (c) 1 (.1) (e) 1 (f) 1 (g) 1 (hﬂ
(a) 0.973 0.0230 0.001 0.000 0.000 0.000 0.303
(b) 0.044 0.017 0.000 0.000 0.000 0.300
(c) 0.817 0.000 0.000 0.000 0.037
((1) 0.000 0.000 0.000 0.010
(e) 0.000 0.000 0.001
(f) 0.000 0.000
(g) 0.001

 

 

 

 

 

 

 

 

 

 

 

dendrogram. It can be seen that clusters of color such as red (R), black (Bk), white
(W), brown (Br), yellow (Y). green (Gr). and blue (Bl) can easily be identiﬁed. Other
colors such as light brown (Lb) and gray (Gy) are mixed with other clusters (white,
blue, and brown) because the boundaries of the light brown and gray clusters are not
as well deﬁned. These results show that the proposed color matching score is able to

match similar colors very well.

5.1.3 Baseline distribution of the color matching score

How large should be the value of the color matching score between two color his-
tograms before the corresponding colors can be called similar? 111 order to assess the
signiﬁcance of a color matching score, we have to determine a baseline distribution
under the null hypothesis of randomness. This will help in analyzing the results in

Table 5.2.

Baseline distribution under random color hypothesis

Uniform color images of size 20x50 were created by randomly generating three num-
bers between 0 and 255 for the three color components. Gaussian noise with mean

0 and standard deviation 100 was then added to the uniform color images. We

140

 

0.0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.3: Hierarchical clustering of the 75 color patches based on the color histogram
matching score.

l-ll

 

 

...) . . .
”US“ .. . /, , .

1

115(1) 1- f ~'

Cumulam c Distnhulmn
\

 

 

(1.0 (1.1 (12 0.3 (1.4 0.5 [1.6 0.7 l) 8 0.9 1.0
Color Matching Score

Figure 5.4: Cumulative distribution of the color matching score under the random
color hypothesis.

generated 10,000 pairs of images using this process and computed the 10,000 color
matching scores. Figure 5.4 shows the cumulative distribution of the color match-
ing score under this random color hypothesis. It can be seen that the probability of
obtaining a color matching score greater than 0.2, under the null hypothesis, is less
than 0.05. According to this result, the gray color patch in Figure 5.2(h) is similar
to the back color patches in Figures 5.2(a)-( b), because the observed color matching
score is signiﬁcantly larger than 0.2! This shows that this null distribution deﬁned
by the random color hypothesis is not very appropriate for our application. In real

scenes, the colors will never be as diverse as the randomly generated colors.

Distribution for the 75 patches of color

To get an idea about the distribution of color matching scores for real data, we plotted
the cumulative distribution of the 2,775 color matching scores that were obtained
in the 75x75 similarity matrix using the 75 color patches deﬁned in the previous
section. This cumulative distribution is shown in Figure 5.5. Now a color matching

score greater than 0.7 is quite unusual since it has a probability of 0.05.

142

 

[(u ... . .- . . .. _. . . - .. 2.
“qj a ., . ., . .. . . ., . MIT—__J 1
l

”.5(f !- I I

Cumulative Distnhutmn

 

1H1

 

(H) {II 112 0.3 0.4 0.5 (16 0.7 0.8 0.9 1.0
Color Matching Scan:

Figure 5.5: Cumulative distribution of the color matching score for the 75 color
patches.

5.2 Edge Point Set Matching

The color matching score deﬁnes a similarity measure between two objects based only
on color. A drawback of this approach is that since histograms are used to compute
the matching score, the spatial relationship between the neighboring pixels of different
color is not utilized. As a result, two distinct objects containing the same set of colors
will have a high color matching score even though the arrangement of the different
colors might be very different in the two objects. To overcome this limitation, objects
should also be matched using other attributes.

A number of techniques for object (2D shape) matching using moments [89],
Fourier descriptors [150], or normalized chord length distributions [164] have been
presented in the literature. While many of these methods are invariant to size, ro-
tation, and translation, they are not information—preserving, i.e., two objects might
have the same shape features and yet appear to be very different. Some authors
have presented matching methods based on feature points. Davis [39], and Ranade
and Rosenfeld [157] presented algorithms to match feature points using relaxation

techniques. Xu [185] proposed a method which requires solving the correspondence

143

problem between feature points. These techniques have the disadvantage of being
computationally demanding when the number of feature points is large. A different
approach is to match relational structures (graphs [162] or stars [31]) by computing
subgraph isomorphism. but these techniques strongly rely on the construction of the
relational structure. Also. ﬁnding subgraph isomorphisms is an NP—complete problem
which makes this approach impractical.

The proposed shape matching method is based on point matching. An object is
stored as a set of edge points. and two objects are compared by measuring a distance
between the two sets of edge points. The proposed distance measure is motivated
by the Hausdorff distance between two edge maps (sets of edge points) associated
with the objects of interest [93]. We assume that a segmentation algorithm has
already been applied to the entire image to extract the edge points belonging to the
object of interest. We investigated 24 different distance measures and, based on their
behavior in the presence of noise, have determined the best distance measure, called
the modiﬁed Hausdorff distance (MHD) for object matching. MHD, like the Hausdorff
distance, is easy to compute and does not require the solution of the correspondence
problem. We have studied the performance of MHD on several real images. We have
also empirically determined baseline distributions for MHD under two different null

hypotheses to determine the signiﬁcance of the observed values of MHD.

5.2.1 Distance measures

The classical measure of distance between two point sets A = {a1,...,aNA} and
B 2 {b1,. .., [mg] of sizes NA and N3, respectively, is the Hausdorff distance deﬁned
as:

H(.A,B) = max(h(A,B),h(B,.A)), (5.7)

a 13> 3*
42930 fake

Figure 5.6: Hausdorff distance between two point sets where A = {—1-} and B = {O}:
(a) computation of h(A.B) = 11; (b) computation of h(B,A) = 6: H(A,B) = 11.

 

 

 

 

 

 

 

 

where the directed distances are defined as:

 

 

h(A.B)= maxminHa—b , (5.8)
06A (DEB
h(B,A) = m6a3xrtr1€12|lb — a”, (5.9)
and H.” is the distance norm (e.g., Manhattan or Euclidean distance). Figure 5.6

shows an example of the computation of the Hausdorff distance between two point
sets.

Huttenlocher et al. [93] proposed a distance measure to compare two sets of object
edge points in the presence of occlusion. This distance measure is a modiﬁcation of

the Hausdorff distance, where the directed distance is deﬁned as:
hid-498) = ["5234 $15111“ — blla (5-10)

where K224 denotes the Kth ranked value in the set of distances (one corresponding
to each element of A). For the Hausdorff distance, K would be equal to NA since
h(A,B) is the maximum distance over all points in set A. A suitable value of K is
determined empirically. This ranked Hausdorff distance was motivated by the need

to locate a possibly occluded object model in a. given image. Our interest is in object

145

matching where the two objects of interest have already been segmented from their

respective images.

Deﬁnition of the distance measures

It is generally agreed that the distance between two points a and b in a coordinate
system is the norm of the vector joining the two points. In the following discussion,

we will assume that the Euclidean distance is used.

(101,3) = [la — b” (5.11)

The distance between a point a and a set of points 8 = [61,. . ..bNB} is commonly

deﬁned as the minimum of the set of distances d(a,b,),i = 1, . . . , N3:
d(a,B) 2 Tell? Ha — 6H (5.12)

There are many ways to deﬁne the directed distance between point set A = {ah . . . , aNA}
and point set 3 = (61,. . . .bNB}. For the purpose of this study, 6 possible directed
distance measures were deﬁned as follows. The distances d2, (13, and d., are some of

the generalized Hausdorff distances proposed by Huttenlocher et al. [93].

d1(A,B) = 13353. d(a,B) (5.13)
(MAB) = 501.’;’;A(1(a,8) (5.14)
(1.3.4.13) = 75KyéAd(a,B) (5.15)
d,(.4,13) = 9011'32Ad(a,8) (5.16)
d,(A,B) = Tea} 401,8) (5.17)
(1.34.13) = \+ Z d(a,B) (5.18)

A aEA

146

Table 5.3: 2’1 distance measures between two point sets.

 

 

 

 

directed function
distance f1 ] f2 ] f3 ] f4
([1 Bl 02 D3 [)4

 

d2 05 De D7 08
(13 09 010 011 012
([4 013 014 DIS 016
(15 1217 DIS 019 020
d6 021 D22 D23 D24

 

 

 

 

 

 

 

 

 

 

 

 

where “’11"

0.

EA represents the Km ranked distance such that K/NA = 1%. For example,

BOA/Hr

06,, corresponds to the median of the distances d(a,B),Va E A. We could have

chosen arbitrarily many distances depending on the value of .r. We chose the three
speciﬁc values of .r (50,75,90) as a representative set.
These directed distance measures can then be combined in four different ways to

produce a distance measure between two point sets.

 

 

f1(d(A,B),d(B,A)) = min(d(A,B),d(B,A)) (5.19)
f2(d(A,B),d(B,A)) = max(d(A,B),d(B,A)) (5.20)
fs(d(A.B),d(B,A)) = MB) ‘; “31"” (5.21)
f.(d(.4.3),d(13,.4))= N23415: : 21:38.41) (5.22)

These four ways (Eqs.(5.19)-(5.22)) of combining six directed distance measures
(Eqs.(5.13)-(5.18)) result in 24 possible undirected distance measures between two

point sets as summarized in Table 5.3. Note that 018 corresponds to the well-known

Hausdorff distance.

147

 

Input: A binary image I
begin
/* Initialization ‘1
ﬁt; all i.j = 1 ...... \7
ifI[i,j] = 1 then D[i.j] '2 0 gLs_e D[i.j] = so Vi.j
endfor
/* Forward pass ‘/
{g all i.j = l ...... \"
0(3)] 2 n1i11{D[i,j],D[z'—1,j—1]+ @034 141+ 1,
0(1— 1.j+1]+\/2,D[i.j— 1]+1}

 

 

endfor

/* Backward pass */

fgrgall i,j= N,...,1

D[z',j] = min{D[i.j],D[i+1.j— 11+ ﬁ,D[i + 1.)] + 1,
D[i+1.j+l]+\/2.D[i.j+1]+l}

endfor
m
Output: A distance image D where value at a pixel indicates the distance to the
closest edge point

 

 

 

Figure 5.7: Chamfering algorithm.
Efﬁcient computation of distance measures

When the points are placed on a discrete grid (like an image), there is an efﬁcient
way to compute these distance measures. Let [,4 be an image of the point set A such
that IA[i,j] = 1 if the point (i,j) belongs to A and 0, otherwise. Deﬁne [3, the image
for the point set 8, in the same manner.

The distance from one point a to a point set 3 can be efficiently computed using
the Chamfering method [7] which computes a distance map D[i,j] for the given binary
image I[z,j]. Each pixel in D contains a number corresponding to the distance to the
closest point in the image I. A high-level algorithm to compute a distance map using
the Chamfering technique is shown in Figure 5.7.

For computing the distance between two point sets A and B, the Chamfering

algorithm is applied to both the binary images IA and [3 to produce the chamfer

1318

images (or distance maps) DA and 03. Now, the distance (1(a. B) is simply DLz[a,-, a3],

and the distance d(b.A) is DANA-.61]. where a = ((1,411) and b = (bhbJ).

5.2.2 Evaluation of the distance measures

A distance measure for the purpose of object matching should have the following
properties: (1) it should have a large discriminatory power. (2) its value should

increase as the two objects differ more in their shapes.

Metric properties

We ﬁrst analyze the metric properties of the 24 distance measures in Table 5.3. A

distance measure (1 : .X’ x X —-) R is a metric if it obeys the following axioms:

0) 414.11) = 401.4) Vay E X
(ii) 0<d(x,y)<ooV.r,y€Xx7$y

(5.23)
(iii) d(:r,y)=0®:r=y Vr,y€X

(iv) (l(.r,y) S d(.r,:) + d(z,y) Vx.y,:: E X

It can be shown that the Hausdorff distance (018) is the only distance measure
in Table 5.3 that is a metric. 01,. . ., 016 are not metrics because D(A,B) = 0 does
not imply that A = B. This could be a severe problem for object matching since
two different objects might produce a matching distance of 0. For example, object
1 and object 2 in Figure 5.8 are different in shape, but the ranked distance between
them for any K < 64% is 0. All the other distance measures (017, 019,. . . , 024) are
not metrics because they violate the triangle inequality. The triangle inequality is
a desirable property for object matching because it states that two highly dissimilar

object shapes cannot be both similar to a third object shape.

1-19

v€7

Object 1 Object 2 Matching edges

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 5.8: Example of matching two objects using the ranked distances.

 

 

 

 

Figure 5.9: A typical random point pattern in a binary image.
Experiments on random point patterns

Several experiments were performed on synthetic images to determine the properties
of these 24 distance measures. Binary images (256x256) were obtained by generating
a random number of black points placed randomly (from a uniform distribution) on
the grid. A typical random point pattern in a binary image is shown in Figure 5.9.
Based on matching pairs of these randomly generated binary images, we determined
that 01, . . . , Dg, 013, 017, and 021 were equal to 0 most of the time, indicating that
these distance measures did not have sufficient discriminatory power. There does not
appear to be too much difference among the operators f2, f3, and f4, although f2
produces a larger spread in the values of the distance measures, resulting in a more
discriminatory power. These preliminary observations enabled us to eliminate most ‘
of the distance measures from further evaluation. We now restrict our attention to

only the following four distance measures: D10, 014, 013, and 022.

150

Experiments on random line patterns

we now study the behavior of the 4 distance measures (010,014,013, Dgg) in the
presence of noise. For this experiment. random line patterns were generated in 20'
256x256 images by randomly generating the locations of the end points of a random
number of lines. Figure 5.10 shows a typical random line pattern. A set of 50 noisy
binary images was then generated from each of these line patterns using the following
four different noise models. Figure 5.11 shows the random line pattern in Figure 5.10
in the presence of noise.

Feature noise: Adding feature noise consists of randomly perturbing the end
points of each of the lines in the noise-free image. Let Pf and P; be the locations of
the end points of the i‘h noise—free line, and let Q] and Q; be the corresponding new
locations after random perturbations. The feature noise for a given line is deﬁned as:

k.- = ((1);ng + 111210311
" 1113112111

 

and the total amount of noise in the image is:

, 1N -
An=§§kna

where N is the number of line features in the image.

Adding features: Line features can be added to the image by simply generating
more lines. Let n be the number of “on” pixels in the noise-free image and let m be
the number of “on” pixels in the noisy images. The amount of noise is computed as
K0 2 (m - n)/n.

Deleting features: A noisy image can also be obtained by deleting some of the

line features. The amount of noise is measured by the ratio KG) 2 (n — m)/n.

 

 

 

 

 

 

 

 

   

 
   

 

 

 

 

 

 

 

 

 

 

 

 

. /
(a) ((1)

Figure 5.11: Random line patterns with additive noise: (a) feature noise, Kn = 0.18;
(b) added features, KG 2 0.36; (c) deleted features, Kd = 0.35; (d) uniform noise,
Ku = 0.0032. The noise-free random line pattern is shown in Figure 5.10.

Uniform noise: Noise can be added by randomly ﬂipping pixels in the noise-free
image. Let F be the number of pixels whose values have been reversed. The amount
of noise is then measured by Ku = F/(RC) where R and C are the number of rows
and columns in the image.

The 4 distance measures (DID, D”, Bus, and 022) were then computed between
noise-free images and noisy images for various noise levels. Figure 5.12 shows the
behavior of the Hausdorff distance (DIS) as a function of K0 and Ku. It can be seen
that the distance values are large even in the presence of small amounts of noise and
the values remain almost constant as the amount of noise increases. This is due to
the fact that the Hausdorff distance value is set by the maximum distance among the
two point sets. This maximum distance can be set by the slightest amount of noise

and then the distance remains constant as the noise level is increased. In the context

 

 

 

l)
n .‘u «1 riff m In: I311

 

 

C
”U 1101 UUI ()[E‘ ”(N (1.05 (Mb
u

 

Figure 5.12: Behavior of the Hausdorff distance (018) as a function of the noise levels

K0 and K”.

 

 

 

 

 

 

1m 0.: 0.4 on 0.11 1.1)

Figure 5.13: Behavior of the ranked distance measures (DID and D”) as a function
of the noise level K".

of object matching, this is not desirable since a few outliers will perturb the distance
measure greatly, even though the two objects might be very similar in shape.

The generalized (ranked) Hausdorff distances (Dm and D”) were proposed to han-
dle outliers and occlusions [93]. Even these distance measures present some problems
for object matching. These problems are most obvious when we plot the values of 010
and 014 as a function of Ka (see Figure 5.13). The distances between the noise-free
image and the noisy image stay equal to 0 even after a signiﬁcant amount of noise
has been added to the image. This is because parts of the images are allowed to be
different to handle occlusions. The percentage of the image pixels (not necessarily
contiguous) that should match depends on the choice of the rank K.

The distance measure 022 is the only distance that has the desirable behavior for
object matching purposes. It can be seen in Figure 5.14 that 022 takes small values

in the presence of low levels of noise and takes large values as the noise level increases.

 

 

 

 

“x m 1; 14

2‘
9 HI 1: u

   

'3
1 511 itit‘ 150 3‘."
n 1 3 ﬁt)

  

-

1m (12 (1.1 _ lth m1 111 (1.0 u: 0.4 0.6 mt II) 11.0 11.2 (H 11.6 Mt I.” 0.0 0.0! 0.02 0. 3 0.04 0.05 0.1115
In K4 Ki]

3

 

 

 

Figure 5.14: Behavior of the modiﬁed Hausdorff distance (022) as a function of the
noise level for 4 types of noise models.

This distance, which we call the modiﬁed Hausdorff distance (MHD), is thus more

appropriate than the other distance measures for object matching purposes.

Experiments with real images

The modiﬁed Hausdorff distance (022) was used to match edge images of several
objects. Figure 5.15 shows the edge points of four of the objects in our vehicle
database. These edge images were obtained by applying the segmentation algorithm
based on motion, color, and edge information described in Section 4.1 and retaining
only the edges belonging to the object of interest. Note that the two objects A and
B are very similar in shape. For the purpose of this experiment, an outlier edge
point representing a possible segmentation error was added to the image of B (see
the arrow). The four distance measures, Dm, 014, D18, and 022 were computed for
pairs of objects and their values are reported in Table 5.4. The distance between the
ﬁrst two objects (A and B) should be small since they are known to be identical.
It can be seen that the Hausdorff distance (D18) does not behave properly in the
presence of an outlier since 018(A,B) > D13(A,C). The modiﬁed Hausdorff distance
(.922), however, correctly suggests that 022(A,B) is minimum. Let us now evaluate
the ratio of the distance value for the true match and the distance value for the next

best match. This ratio is 1 for Dlo, 0.48 for 014, and 0.24 for 022. This again shows

 

 

   

 

 

 

 

 

 

 

 

 

 

Figure 5.15: Edge images for four vehicles.

Table 5.4: Values of Dlo, 014, 013, and 022 for the four vehicles shown in Figure 5.15.
Note that MHD (022) has the best discriminatory power for object matching.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[0,0 13 c D [1)... B c D
A 2.0 5.0 3.0 A 2.8 10.0 5.83
B 7.0 2.0 B 10.4 7.1
c 6.0 c 12.2

(1)18 13 c 7) 022 B c D
A 31.9 22.0 31.7 A 1.0 6.2 4.3
B 106.8 24.6 B 6.4 4.4
c 35.9 c 6.3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

that the modiﬁed Hausdorff distance (022) has more discriminatory power than the
generalized Hausdorff distances (010 and D14) proposed in [93].

The modiﬁed Hausdorff distance was also compared to the other distances on real
edge images of people while they were walking inside the Engineering building at
Michigan State University [44]. The four edge images of people shown in Figure 5.16
were obtained using the same segmentation algorithm which was used to locate mov-
ing vehicles [46]. The two objects A and B in Figure 5.16 are known to be identical,
so the distance value between them is expected to be signiﬁcantly smaller. The val-
ues of distance measures 010, 014, 018, and D22 are shown in Table 5.5. Again, it

can be seen that MHD has more discriminatory power than the other three distance

 

 

/

_ ___r)

€54

   

(ex/Lizzi;

 

 

 

 

 

 

 

 

 

A B C D
Figure 5.16: Edge images for four people.

Table 5.5: Values of 010, D”, 018, and 022 for the four people shown in Figure 5.16.
Note that MHD (D22) has the best discriminatory power for object matching.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[010 B c D D... B c D
A 3.8 6.0 5.2 A 7.0 8.5 7.1
B 6.7 5.0 13 9.7 15.0
C 4.2 C 8.1
0,, B c 73 022 B c D
A 13.0 12.9 17.4 A 2.5 3.7 3.3
13 14.3 34.0 B 4.0 5.1
c 16.7 c 3.5

 

 

 

 

 

 

 

 

measures. The Hausdorff distance 018 shows that A is more similar to C than to B.

The ratio of the distance value for the true match and the distance value for the next

best match is 0.88 for 010, 0.99 for 014, and 0.76 for 022.

5.2.3 Baseline distribution of the modiﬁed Hausdorff dis-

tance

How small should be the value of the modiﬁed Hausdorff distance between two point
sets before the corresponding two objects can be called similar? This is similar to
the problem of determining cluster tendency of a given data and cluster validity of a

given partition [96]. A baseline distribution under the null hypothesis of randomness

156

will help us to better analyze the results in Table 5.4.

Baseline distribution under random point pattern hypothesis

Random point patterns were generated in 256 x 256 images and the modiﬁed Haus-
dorff distance was computed between 10,000 pairs of images. Figure 5.17 shows the
cumulative distribution of the MHD values under the random point pattern hypothe—
sis. Table 5.4 shows that for the 4 edge images in Figure 5.15, the modiﬁed Hausdorff
distance, D22 took different values: 1.0, 4.3, 4.4, 6.2, 6.3, and 6.4. According to the
plot in Figure 5.17, the probability of obtaining a MHD value of 1, 4.3, or 4.4 under
the null hypothesis of randomness is extremely low (< 0.0001) whereas the probabil-
ity of obtaining a value larger than 6 is close to 0.2. This implies that it is extremely
unusual to observe a value of 1.0, 4.3, and 4.4 for MHD by chance. In other words,
an observed value of 1, 4.3, or 4.4 for MHD would indicate that the two objects are
indeed very similar. This interpretation poses a problem because it implies that ob-
jects B and D are also similar (022(B,D) = 4.4)! An explanation of this fallacy is
the inappropriate choice of the null distribution.

Our matching problem deals with matching edge point sets obtained from vehicle
images. The resulting edge images present some perceptual organization which is not
modeled by random point patterns. A more appropriate baseline distribution will be

obtained if it is derived under the random vehicle hypothesis deﬁned below.

Baseline distribution under random vehicle hypothesis

A random vehicle can be generated according to a generic vehicle model shown in

Figure 5.18. Each of the model parameters L, H, C1, C2, R1, R2, 17, d1, d2, (1'3, and

 

 

 

f T
“If v I - /‘i
1

1

(15¢ -

ulatiu‘ Dislnhutiun

(‘um

...] ..

(11117
11,1

 

 

. . n. <
t_‘_~+_.__ ,’.2 . .~.. .4

S 6 7 8 0 II) It 12 I3 14 IS
Modiﬁed Hausdorff Distance

Figure 5.17: Cumulative distribution of MHD under the random point pattern hy-
pothesis.

(14 are randomly generated from a uniform distribution over the following intervals:

140 S L S 220 0.3H S h S 0.95H
0.2L S H S 0.4L 0 S d, S 0.5L
0.2L S C, S 0.25L 0.571;, S l S 0.7d5
(5.24)
0.75L S C2 S 0.8L 0 S (12 S 0.8011
0.25H S R, S 0.35H 0.2d6 S d3 S 0.6d6

These parameter values were empirically determined and lead to reasonable vehicle
shapes and sizes (Figure 5.19). The perpendicular bisector of the roof of the vehicle
is drawn to represent the separation between the two windows. At most two lines are
drawn across the body of the vehicle to represent the “guard rail”. Additional lines
are randomly generated to simulate noisy edges inside the vehicle. Four 256 x 256
images of random vehicles generated by this model are shown in Figure 5.19.

The modiﬁed Hausdorff distance was computed between 10,000 pairs of random

vehicle patterns generated in 256 x 256 images. Figure 5.20 shows the cumulative

‘7;
n
y
,1“

 

 

 

 

 

 

 

 

_ \
H //
e .6,
<~—> ,
Cl Radii R1 and R2
V C2 >
‘ L =

Figure 5.18: Generic model of a vehicle.

 

 

 

 

5.1; ..l SCI}!
”F5- (9_ 0

 

 

 

 

 

 

 

 

 

 

Figure 5.19: Examples of random vehicles generated by the model.

distribution of the distance values. The signiﬁcance of the distance values that were
obtained for the real vehicle edge images in Figure 5.15 can now be determined
from this plot. The probability of obtaining a distance value of 1.0 is extremely low
(< 0.0001) whereas the probabilities of obtaining distance values greater than 4 are
rather high (> 0.35). This implies that the two identical vehicles A and B are indeed
similar whereas none of the other match values for D22 in Table 5.4 are signiﬁcant.

We note that the probability of obtaining a distance value less than 3.0 is only 0.1.

5.2.4 Registration of two edge images

The modiﬁed Hausdorff distance between two point sets A and [3 depends on the
relative translation, rotation, and scale changes between the two point sets. The
segmented vehicles do not necessarily have the same position, orientation, and scaling

in the two images. Therefore, the two sets of edge points need to be registered prior

 

 

1.111) .. .. --. ..-”

”'56[ ,. .. .. .. ..5.. .. ,.. . .5

Cumulative Distnhutiun

 

I). qu ..5.. .. .. ... ..5. .. «.... . . .. ,. ...................;.._.......... .... .. .............5.....-_..' ....
001. 4 . . - ~ . ~ .
U.” '*

i ......_...i....._

 

 

 

I 4 7 10 13 I6 19 22 25 28 3| 34 37
Modiﬁed Hausdorff Distm

Figure 5.20: Cumulative distribution of MHD under the random vehicle hypothesis.

to computing the distance between them.

Pose clustering is a well-known technique to estimate the transformation param-
eters between two sets of points. Pose clustering [168] considers all possible pairs of
model and image features and computes the transformation (translation, rotation, and
scaling) between them. Each pair of model/ image features deﬁnes a transformation.
The possible transformations are then accumulated in a discretized transformation
space using a Hough—based approach. Peaks in the transformation space deﬁne the
pose candidate hypotheses for the objects. All these hypotheses are veriﬁed to ﬁnd
the correct one.

In our image acquisition setup, if the two cameras are properly aligned with the
road geometry, then we can assume that the orientation changes between the objects
to be matched are negligible. Thus, we only consider 2D translation and scale changes.
Therefore, the transformation space is 3-dimensional. The algorithm to identify the
translation and scale parameters is shown in Figure 5.21. Two edge images E1 and
E2 are input to the algorithm. The “on” pixels in these two edge images form the
two edge point sets which we also denote as E1 and E2 to simplify the notation.

The transformation T that best registers the two edge images is deﬁned by the 2D

160

 

Input: Two edge images E1 and E»;
be in
Initialize H[.r.y. s] = 0 VI. y, s
fgi; all s E [s,,,,,,, Smax]
/* Apply 3 to the edge points in E1 */
E[[s(i — mi) + m,,s(j — 771,) + my] (— E1[i,j] Vi.j
where (772,,mj) is the center of mass of the edge points in E1
/* Store all the possible translations */
£91: all (41,7116 Bi and (izajzl E E2

 

f

(L‘ = 72 - L1
9 = .72 “.71
H[:r,y,s] (— H[.r,y,s] +1
endfor
endfor

H[X, Y, S] = maxnw H[:r, y,s] and T = (X, Y, S)

E,T[S(z' — 771,-) + m, + X,S(j — m,-) + m,- + Y] <— E1[i,j] Vi,j
git:
Output: Two new edge images ET and E2, translation parameters (X,Y), and
scale parameter S

 

 

 

Figure 5.21: Pose clustering algorithm to estimate the translation and scale parame-
ters between the two edge images.

translation parameters (X, Y) and the scaling factor 5.

Figures 5.22(a) and 5.22(b) show two edge images E1 and E2, respectively, of
the same vehicle observed by two different cameras at different locations and times.
Figure 5.22(c) shows E? after the transformation T(X, Y, S) has been applied to E1
so that it is registered with E2. Note that ElT is more similar to E2 than E1 is to E2.

Given two objects (91 and 02 with edge images E1 and E2, the edge matching

score is deﬁned as:
0301.02) = MHD(E1T . E2) = maxtthf, 12.1.4053. 52)), (5.25)

where T is the transformation applied to register the two edge images E1 and E2,

161

 

 

 

   

 

 

 

 

 

 

 

 

 

Figure 5.22: Pose clustering to register two edge images: (a) ﬁrst edge image E;
(b) second edge image E2; (c) transformation T = (—6, —5,0.875) applied to E1 to
produce E3.

and d is the directed distance in the deﬁnition of the modiﬁed Hausdorff distance:

 

1
dET, = ' — .
( 1 52) IEFIPEngng qll (5 26)
l
1
dE,ET =—— ' — . 5.27
(2 1) $8,225,529?” P11 ( )

It takes between 5 s to 11 s to compute DC on a Sparcstation 20 for a pair of 256x256
images (4 s to 10 s to do the pose clustering and 1.5 s to compute the modiﬁed

Hausdorff distance).

5.3 Summary

We have proposed two matching scores 56(01, 02) and D"(01, 02) to compare two
segmented objects 01 and 02 based on their color and shape. The color matching
score, SC(01, 02) is a modiﬁcation of Swain and Ballard’s color histogram matching
measure. We have empirically determined the optimal bin size that should be used
to build the 3D color histograms. We have applied this color matching strategy to

color patches extracted from images of outdoor scenes and demonstrated that it is

162

able to match similar colors. We have also derived a distribution of color matching
scores to assess the signiﬁcance of the observed color matching scores.

The shape matching score 06(01, (92) is based on the distance between two sets of
edge points representing the objects of interest. We have determined that, among the
class of distance measures based on the Hausdorff distance, the modiﬁed Hausdorff
distance (MHD) is best for matching two objects based on their edge points. We have
shown that MHD has the following desirable properties: (i) its value increases mono-
tonically as the amount of difference between the two sets of edge points increases,
(ii) it is robust to outlier points that might result from segmentation errors, and (iii)
it does not discard any of the edge points.

The MHD was applied to edge images of several real objects that were segmented
from both outdoor and indoor scenes. We observed that MHD is better than the other
distance measures for the purpose of object matching. We have also determined a
baseline distribution for MHD under two different null hypotheses: (i) random point
pattern hypothesis and (ii) random vehicle hypothesis. These null distributions were
used to determine the signiﬁcance of the observed values of MHD. Such a statistical
analysis is necessary in order to avoid detecting false matches which could occur by

chance.

Chapter 6

Vehicle Matching System

This chapter is concerned with the implementation of the feasibility study for travel
time estimation using vehicle color and shape matching. We have described the
algorithms for image segmentation in Chapter 4 and the techniques for color and
shape matching in Chapter 5. We will explain how we can identify matching vehicle
pairs based on the color and edge matching scores. Once a pair has been found,
the travel time for that particular vehicle is simply the difference between the time
instants when it was observed by the two cameras.

The important feature of the proposed matching system is that it does not need
to identify all the matching pairs. Transportation experts have determined that the
average travel time can be reliably determined with one matching pair per minute.
The most important requirement. however, is that the system does not detect any
false matches at all, as they might drastically affect the estimation of the travel time
and other statistics. Thus, we propose very conservative threshold values to decide if
a potential matching pair should be accepted.

This chapter is divided into 5 sections. Section 6.1 proposes a machine vision

system to segment the vehicles in the lane closest to the camera and identify matching

163

164

pairs in the two databases to estimate the average travel time between two points
on a road network. 111 Section 6.2. we propose a computationally efficient method
which does not rely on an elaborate segmentation of the front vehicle to identify
the matching pairs. In Section 6.3. the results obtained with the complete vision
system and with the "real-time” vision system are compared to the results that were
obtained manually by examining the video tapes. Section 6.4 is concerned with on-
site implementation issues of the system and proposes an algorithm to automatically
select the interesting frames, i.e., frames that contain a vehicle in the center of the
image. Finally, in Section 6.5 we address the robustness of our system to images

obtained under various weather conditions and time of the day.

6.1 Travel Time Estimation

This section describes an integrated system for travel time estimation. The complete
system is shown in Figure 6.1. At an entry point, the vehicles passing in front of
the camera are segmented and stored in a database T1 = {1211, . . . ,V,l\,l} of vehicles
observed by the ﬁrst camera. At the exit point. the same segmentation algorithm
is applied to an observed vehicle V? to extract the four features which are used by
the matching process, namely, observation time, vehicle class, color histogram, and
edge point set. The matching system includes an indexing step to avoid computing
matching scores between V} and all the vehicles in T1. In this section, we present the
four features which are used to represent a vehicle, the indexing processes (time-based
and vehicle class-based), and the matching strategy which decides if a potential pair

should be accepted.

     

 

 

 

 

Segmentation of the front vehicle

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Vehicle Attributes 1
Observation time Vehicle class
Color histogram Edge point set

L J

Exit Points

 

 

 

 

 

 

 

Segmentation of

the front vehicle

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   
     
 

Database of Vehicles
Observed by the
First Camera

Figure 6.1: Matching vehicles for travel time estimation.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Vehicle Attributes
Observation time Vehicle class Color histogram Edge point set
Reduced
. Time Class Database Color Shape
indexing indexing matching matching
Matching
Scores
Decision
lndentify one Reject all
matching pair the matches

166
6.1.1 Feature extraction

\Ve apply the model—based segmentation algorithm described in Section 4.2 to extract
the vehicle in the lane closest to the camera. The database T1 of vehicles observed
by the camera at the entry point contains four pieces of information per vehicle
V1.1' = 1.. . . . N1. These same four features are also extracted at the exit point to

1

represent the vehicle Vii-2. These features are:

0 Observation time T: time at which the vehicle passes in front of the camera.

0 Vehicle class V: one of the ﬁve vehicle categories (sedan, pickup truck, hatch-
back. station wagon or 4x4. and van) determined by the segmentation algo-
rithm.

0 Color histogram H: this histogram is built using the pixels which belong to the
vehicle. However, since all the vehicles have gray looking windows, there is no
need to include all the pixels in the computation of the histogram. Only the
pixels inside the polygon deﬁned by the vertices {(X1,Y1), (X2,Y2), (X3,Y3),
(X6, 32;), (X7. Y7), (X8, Ed} on the vehicle template (from Figure 4.9) are used
to compute the color histogram.

0 Edge point set E: only the edge points obtained from the Canny edge detector
that are inside the vehicle template are retained to deﬁne the edge point set
for shape matching. We also keep the boundary of the deformable template as

part of the edge point set.

Figure 6.2 shows an example of the four features that were extracted for a given
vehicle. Figure 6.2(a) shows the input image. The result of the model-based seg-
mentation algorithm identifying the front vehicle as a van is shown in Figure 6.2(b).
In Figure 6.2(c), we show the polygonal region inside the vehicle that is retained to

compute the 3D color histogram. For display purposes. the color histogram is shown

167

 

   

 

 

 

 

 

 

 

 

 

 

 

 

(C) (d) (9)

Figure 6.2: Feature extraction: (a) input image; (b) vehicle class (van); (c) area used
to compute the 3D color histogram; (d) color histogram; (e) edge point set.

as 3 marginal (R, G, B) histograms in Figure 6.2(d). Finally, Figure 6.2(e) shows the
edge point set which includes the edges inside the vehicle template (detected by the
Canny operator) and the boundary of the template. This vehicle was observed at

4:42:11 pm on June 1, 1994.

6.1.2 Time-based and vehicle class-based indexing

A vehicle V? observed by the second camera needs to be matched to all the vehicles
in the database T1 = {V}, . . . , vim}- However, travel time information can be used
to reduce the number of potential matches. We call this step time-based indexing.

Let T.we denote the current average travel time which is available from the pre-

C21

(0

.\'(

SCt

168

viously established matches. If vehicle V? is observed by the second camera at time
T2 then it should have been observed by the ﬁrst camera approximately at time

. 1
'1? — 'I’ave. Time-based indexing consists of determining a time window [T12 — Tave —
T“ T? — Tave‘l' 71] of size 2T, to select a subset of vehicles in T1. This subset of vehicles
T} C T1 is called the time-indexed database for vehicle V3.

In addition to time-based indexing, we can also perform an indexing based on
vehicle class. There is no need to match a van to a sedan. Since we know the vehicle
category V, of vehicle V3. we can select only those vehicles in T} which have the same

category V}. These vehicles form an even smaller subset T} C T} C T1 of potential

matches, called the time-indexed and class-indexed database for vehicle V3.

6.1.3 Color and shape matching

After time and vehicle class indexing, we are left with a small subset T} of N,- 3 N1
candidate matches for vehicle V?. For each pair (V?,V,-1J),V,-lj E T},j = 1, . . . , N,- we
compute the color matching score SC(V,-2, V3) and the edge matching score D‘TV}, V31).
Note that the color matching score is a similarity measure whereas the edge matching
score is a distance measure. Furthermore, the color matching score takes a value
between 0 and 1 while the edge matching score is a positive number usually greater
than 1. We deﬁne the total matching score as follows:

[SW-2,1” )1?
t 2 l _ ' ‘J
5 my“) ‘ Dom-1,) '

 

(6.1)

The N, total matching scores for all pairs (V3, V5), j = 1, . . . , N, are computed and
will be used to determine if there is a match for vehicle V? in T1. Note that S‘(V?, V5)
is a similarity measure which can take any positive value.

The vehicles in T3 are ranked in decreasing order of total matching scores. The

 

 

01hr

Tlal

 

169

vehicle V3“) deﬁned by:

V} = arg max 5"(V3.V,~l)) (6.2)

‘1” 13153}
is called the potential matching candidate for VB.

In order to accept the match (V,2.V,-lm). a number of conditions are tested. In
Chapter 5. we determined from the cumulative distributions of the matching scores
that a color matching score greater than 0.7 and an edge matching score less than 3.0
were quite unusual. Recall that we do not have to identify all true matches as long
as we are sure that no false matches will be accepted. In order to be certain about
a match, the matching scores for the vehicle V3“) have to be significantly better than
the matching scores for the other vehicles in T}.

We deﬁne 3", D6. and 5‘ as follows:

5 : SC(V,-2,V-‘ )[scnﬁv1 )— 5°(V,-2,V-1 )] (6.3)

1(1) ‘(ll t(’2)

. ”(1),-2,11} )—De(v,?,vg )
De: ‘2’ ‘1’ (6.4)
D€(V,-2,V-‘ )

'(x)

 

s‘=5'*(v3.v.1 )[s'(v3,v.‘ )—.s"(v,."~’,v.‘ )j. (6.5)

‘(u 1(1) H2)

These coefﬁcients show the relative difference between the potential matching candi-
date and the next candidate. Based on these coefficients and the matching scores,
the match (Viz, Vii”) is accepted if:

(SC(VE.V1 ) > 0.7) and (mi/3.12.1 ) < 3.0)

‘m '(1)

(6.6)
(3‘: > 0.25) or (D6 > 1.0) or (5" > 0.1) ,

otherwise, the match (V3, Vii”) is rejected. The ﬁrst condition ensures that the poten-

tial match is actually a good match and the second condition veriﬁes that at least one

170

of the matching scores is signiﬁcantly better than the other matching scores. meaning

that the observed vehicle is unusual in shape or in color.

6.1.4 Experiments

The proposed vision-based matching system was tested on the databases Saginaw-
Coolidge and Saginaw-llaslett. Images contained in these databases are shown in
Appendix A. Two Sharp Slim Cam V'L-L6‘2t' camcorders were placed 1.9 miles
apart on Saginaw Street in East Lansing, Michigan on June 1, 1994 around 4:45
pm. A seven minute segment (4:39 pm to 4:46 pm) of the video tape from the ﬁrst
camcorder was digitized. The image sequences were segmented to extract a total
of 149 vehicles (database T1). A six minute segment (4:43 pm to 4:49 pm) of the
video tape from the second camcorder was also digitized and segmented to yield 138
vehicles (database T2). The model-based segmentation algorithm was tested on these
287 vehicles (149 + 138). As mentioned in Chapter 4. 91.6% of these vehicles were
correctly classiﬁed.

The matching algorithm was applied to all the 138 vehicles in T2. Note that the
database Tl contains 149 vehicles. The cameras were placed 1.9 miles apart and the
legal speed limit on Saginaw Street is 35 mph. Therefore, we initialized the average
travel time between the two cameras to 3:30 min. We set the time interval T, to
40 s to identify vehicles whose speed is between ‘27 and 40 mph. After time-based
indexing, the number of candidate matches per vehicle in T2 is reduced to between
17 and 43 (‘25 on the average, a reduction of 83%). After vehicle class-based indexing,
this number is further reduced to between 1 and 23 (9 on the average). Thus, the
indexing process reduces the size of the database T1 by 94%. on the average.

A total of 8 matching pairs were identiﬁed. We manually veriﬁed that all the

171

identified matching pairs were indeed true matches and no false matches had been
detected. Referring to the databases in Appendix A. the vision-based matching sys-
tem identiﬁed the following matches ( 13.41), (36.61). (60.89). (64.81). (67.86). (78.99).
(84.107), and (121,129). where the first number is the vehicle number in the Saginaw-
Haslctt database and the second number is the vehicle number in the Saginaw-Coolidge

database.

6.2 “Real-Time” Experiments

The vehicle matching system for travel time estimation presented in the previous
section involves two techniques that are quite time consuming, namely, the segmen-
tation of the front vehicle using deformable templates, and the computation of the
edge matching score using pose clustering for image registration.

In this section, we present a much simpler approach. For segmentation purposes,
a rectangular box which approximately contains the vehicle in the lane closest to the
camera is extracted using motion information alone. Then in the matching process,
color matching is used as an additional indexing step to reduce the number of edge
matching scores that have to be computed. We will ﬁrst describe the image seg-
mentation technique used to locate the front vehicle. \Ve then present the matching

strategy.

6.2.1 Locating the front vehicle

In order to simplify the segmentation process and reduce the processing time, we only
used the computationally efﬁcient motion segmentation algorithm. The segmentation
result consists of a rectangular box which approximately contains the front vehicle.

As before. three frames f1, f2, and f3 containing the moving vehicle(s) are extracted

172

at times t1, {-2 ((1 + _V). and t3 (£1 + 12A,). respectively from the image sequence,
where At is approximately 1/305. Figures 6.3(a)-(c) show an example of the three
consecutive frames. These frames are blurred using a 3x3 averaging mask to produce
images f1, f2, f3. The difference image is computed as described in Section 3.1
(Eq.(3.14)):

Du.» = lfl(i.j) -f2(i.j)l x mm) 4312.211 (6.7)

The baseline of the moving pixels in the difference image corresponds to the base-
line of the front vehicle. This baseline can be easily extracted by summing the rows

of the difference image.

h(i)=Zd(i,j), i=1,..-V. (6.8)

We assume that the image size is N x 111. The row B corresponding to the baseline is
located at the first sharp transition in h(27) when it is examined from below (i = N).
The topline of the moving pixels reasonably approximates the top of the front vehicle.
This topline T is located at the ﬁrst sharp transition in h(i) when it is examined from
above (i = 1). See Figure 6.3(e) for extraction of B and T.

The left and right sides of the front vehicle are estimated using the same technique.

The pixels in a small neighborhood above the baseline are summed in columns.

B
11(1):: (“11]“). j: 1.....111 (6.9)

The left boundary (L) and right boundary (R) of the vehicle are located at the left and
right sharp transitions in the plot of t(j) (see Figure 6.3(f)). The four constants (B,
T, L, and R) deﬁne a rectangular box. The ﬁnal rectangular box for the front vehicle

in the three frame image sequence (Figures 6.3(a)-(c)) is shown in Figure 6.3(g).

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(f)

 

9.
""

Figure 6.3: "Real-time" extraction of the front vehicle: (a) frame 1: (b) frame
(c) frame 3: ((1) difference image; (e) proﬁle of the row sums h(i); (f) proﬁle of the
column sums v(j) around the baseline: (g) location of the front vehicle.

174

 

 

 

 

 

 

 

 

   

(bl

Figure 6.4: Feature extraction for the front vehicle in Figure 6.3: (a) color patch; (b)
color histogram; (c) edges inside the rectangular box.

6.2.2 Feature extraction

Once the front vehicle has been segmented, features describing its shape and color
are extracted, so that this vehicle can be matched to other vehicles in the database.
We also store the time at which the vehicle was observed.

The 3D color histogram used by the color matching module is supposed to describe
the color of the vehicle. Unfortunately, if we use the pixels inside the rectangular box,
then some of these pixels might belong to the background. To avoid this problem,
we build the 3D color histogram using a small rectangular patch extracted near the
location of the doors of the vehicle. We use a color patch of ﬁxed size (20x50 pixels)
and a ﬁxed location above the baseline of the front vehicle. Figure 6.4(a) shows the
location of the color patch for the vehicle in Figure 6.3 and Figure 6.4(b) shows the
3 marginal (R, G, B) histograms.

The shape of the vehicle is characterized by the set of edge points extracted by
the Canny edge detector applied to those pixels that are inside the rectangular box.
Figure 6.4(c) shows the edges for the vehicle in Figure 6.3.

Figure 6.5 shows segmentation results for eight additional image sequences. The

 

Figure 6.5: Example of segmentation of the front vehicle and extraction of the color
patch for different scenes. Only the middle frame of the three-frame sequence is
shown.

rectangular area corresponding to the front vehicle and the location of its color patch
are shown superimposed on the middle frame of the sequence. These results show
that this rather simple algorithm is able to localize the front vehicle fairly accurately.
The CPU times (on a Sparcstation 20) for various modules are as follows: 2 s for
blurring the three 256 X 256 intensity frames: 0.3 s to compute the difference image;
0.001 s to extract the rectangular box and the color patch. The Canny edge detector
takes an additional 2 s. The algorithm can be easily implemented on off—the—shelf

hardware to achieve real—time processing.

6.2.3 Matching strategy

We also propose a simpler matching strategy compared to the one described in Sec-
tion 6.1. Time-based indexing is performed in exactly the same manner. The time-

indexed match database T11 for vehicle V? is determined by retaining those vehicles

176

in T1 that were observed in the time interval [7:2 — Tm, — 71., T,2 — Ta", 4» Tc]. In this
simpler algorithm we did not segment the vehicle. but only extracted a rectangular
box around it. Therefore. we can neither determine the vehicle class nor use it for
indexing. Thus, only time-based indexing is performed.

Edge matching is a computationally expensive operation which should be avoided
whenever possible. Recall from Chapter 5 that, on a. Sparcstation 20, color matching
takes 0.05 8 while edge matching takes between 5 and 11 5. Therefore, we use color
matching as an indexing step. For every vehicle V? E T2, color matching is used to
compute the N, color matching scores associated with vehicles in the set T}. The color
matching scores are ranked in decreasing order and the potential matching candidate
for V? is the vehicle V3“) whose color matching score is maximum:

V-l ) = arg max SC(V,-2,V,IJ),V,IJ E T} (6.10)

‘0 1g 151v,
The decision tree approach shown in Figure 6.6 was implemented for matching.

To avoid computing edge matching if unnecessary, a ﬁrst decision is based on

the color matching score only. If SC(V,2,V,1(1)) is less than 0.7, the potential match is

.1

,m) is sufﬁciently

immediately rejected and the decision process is aborted. If SC(V,2, V
large and larger than the other color matching scores, then the potential match is
accepted and the decision process is aborted. Mathematically, this translates into

SC(V,?,V}(1)) > 0.8 and 5.. > 0.05, (6.11)

where

5'6 = 56023.12? )[SC(V,2,V-1 )—S°(V,2,V-‘ )]. (6.12)

‘(ll 1(1) '(21

In these cases, there is no need to compute any edge matching scores. If no decision

 

5 (V2. V‘

1(1)

 

)>0.7

 

 

yes

 

56(1)}, vgm) > 0.8 and SC > 0.05

 

 

 

ilO

 

I10

REJECT

 

ACCEPT SC(V,2, V30) > 0.7 and De(V,-2,V,-la) < 3.0
(V2, V-1 ) and S‘(V,2,V,-la) > 0.3
SC > 0.25 or DC >1.0 or St > 0.1

2(1)

 

 

 

/\

ACCEPT (V2, V1)

REJECT

Figure 6.6: Decision tree for matching two vehicles.

can be made based on the color matching scores alone. the matching system computes
the edge matching scores. We decided to keep only the first two vehicles V3“) and

V.1 as candidates for edge matching. The total matching scores are computed as in

'(21
Section 6.1:
l-S'CW2» V}, )l2

Dee/‘3. vs.) (6'13)

 

SW3. V3,) =

and the potential matching candidate for V,2 is the vehicle Vila whose total matching

score is maximum.

V-lazargmax{5t(V,2.V-l ).s‘(v3.v.1 )} (6.14)

‘ ’11) ‘(21

v}, = argmin{s‘(v3.v.‘ ),s‘(v3,v.1 )}. (6.15)

‘(11 ‘(21

We deﬁne SC. D6, and S" as follows:

 

s = sen/3, v30) [52(V,2,V,‘a) — 511231239] (6.16)
. Dew-2.12.1) — 01112.12!)

Dc : z n, t in . n

Dal/taut) (611)

9 = S‘M. v.1.) [S’(V.2.V.-‘.) — 5"(VE.V.-‘.>] (6.18)

These coefﬁcients show the relative difference between the potential matching candi-
date and the other candidate. Based on these coefﬁcients and the matching scores,
the match (V,-2, V30) is accepted if:

(sat/3,1230) >0.7) and (DC(V.-2,V.-‘.) <30) and (5‘(V.21V3.) >03) (619)

(SC > 0.25) or (D2 > 1.0) or (52 > 0.1)

otherwise, the match (V2,V,-la) is rejected. The ﬁrst condition ensures that the po-

179

tential match is actually a good match and the second condition veriﬁes that at least

one of the matching score is significantly better.

6.2.4 Experimental results

This “real-time" vision-based matching system was tested on the same two databases
of 149 and 138 vehicles as the complete vision system described in Section 6.1. As
before, time—based indexing is able to eliminate 83% of the candidate matches, on the
average.

For 81 of the 138 vehicles in T2. the decision was based on the color matching
scores alone. In 74 of thoses cases, the largest color matching score was too low and the
potential match was immediately rejected, and in 7 cases, the largest color matching
score passed the decision rule 2 and the potential match was immediately accepted.
In the remaining 57 cases. the two edge matching scores had to be computed, and only
in 2 of those cases, the potential match was accepted. Thus. a total of 9 matching
pairs were identiﬁed. Again. we manually veriﬁed that all the 9 identiﬁed matching
pairs were indeed true matches and no false matches had been detected.

Referring to the databases in Appendix A. this “real-time” vision-based matching
system identiﬁed the pairs (24,46), (30,53), (44,71), (47,74), (64,81), (78,99), and
(129,143) based on color information alone, and (23,45) and (41,67) using both color
and shape information, where the ﬁrst number is the vehicle number in the Saginaw-
Haslett database and the second number is the vehicle number in the Saginaw—Coolidge
database.

Even though it appears that this simpler system performs better than the more
complex system described in Section 6.1, the thresholds are not as discriminatory and

might lead to a few false matches. Also. since in some cases, only color matching is

180

used. the confidence in the resulting matching pairs is not as high. In other words.

this matching system should be used with caution.

6.3 Comparison with Manual Matching

For performance evaluation. we compared the results of the vision-based systems
to those that were obtained manually. The manual travel time estimation is done
by viewing the video tapes and identifying identical vehicles. This strategy is quite
strenuous (it took a graduate student 12 minutes to analyze the 6 minutes of video
tape), and does not lead to identiﬁcation of all the matching pairs. Humans tend to
match unusual looking vehicles. In particular. in our database, the human matcher
identiﬁed a stretch limousine. a mobile home, a red and grey 4 wheel drive, and an
old station wagon with wooden sidepanels.

Recall that travel time estimation does not require us to identify all the matching
pairs, but only to match. on the average, one vehicle per minute so that the travel time
estimate is reliable. Our human matcher identiﬁed 11 pairs. Seven of these pairs are
in our databases T1 and T2: (15.40). (33.54). (35.60), (55.75). (57.77), (71.92), and
(77.98). The other matched vehicles are not in our database because these vehicles
were not located in the lane closest to the camera. Our segmentation algorithm only
extracts vehicles in the lane closest to the camera.

To compare the manually matching results with vision-based matching results, we
plotted the travel time of a matched vehicle as a function of the time T,2 at which it
was observed by the second camera. In Figure 6.7, the matches found manually are
shown by the solid line, the matches found by the complete machine vision system
are shown by the dashed line, and the matches found by the “real-time” machine

vision system are shown by the dotted line. It can be seen that these three plots

 

 

— Manual Matching
----- Com lete VISIQn System
--~-- ‘Rea -Time‘ V1510n Syste

 

 

 

Travel time (minzs)
4,200 2:30 3:00 3:30 4:00 4:30 5:00

 

L

 

 

 

:44pm 4:43pm 4:40pm 4:42pm 4:43pm 4:43p"
Time of day (hrzmin)

Figure 6.7: Comparison of travel time estimates.

are in agreement suggesting that the performance of our machine vision systems is
comparable to the human observer who estimated the average travel time by looking

at the video tapes.

6.4 Digitization and Storage Issues

In this section. we address the problems of image digitization and storage. As men-
tioned earlier, we collect the images using camcorders and store them on a video tape.
Currently, our laboratory is not equipped with an automatic video digitizer, so we
manually digitized all the images shown in the database in Appendix A. We will
describe a technique which could be used to automatically digitize the video tapes if
additional equipment is purchased.

All the images used in this thesis were digitized manually as follows. Our Labo—
ratory owns a color framegrabber, digicolor, made by DataCube. installed in a Sun
4/330 Sparcstation. This framegrabber is designed to digitize one frame at a time

and works best with still images captured with a CCD camera. To handle video data,

182

we connect the VCR to digicolor. A frame can then be stored by pausing the VCR
at a speciﬁc location on the tape, thus creating a. still image. and writing the digitized
image on to a disk.

To build our database of vehicle images. we manually selected the video frames
which contained a vehicle in the lane closest to the camera and this vehicle was
approximately in the middle of the image. Since our motion segmentation algorithm
requires three frames of the moving object, we also manually digitized the preceding
frame and the following frame. Manual frame selection and digitization is extremely
time consuming because we need to examine the tape, pause the VCR when an
interesting frame is encountered, select the three frames of interest and write the
digitized frames to the disk one at a time.

To show that the above operations can be done automatically, given the appro-
priate equipment, we have manually digitized a sequence of 380 frames from the
Saginaw-Coolidge tape. Every other frame in the sequence is shown in Figures A.29
to A.32 in Appendix A. This sequence contains 7 vehicles which passed in front of
the camera and were located in the lane closest to the camera. We have developed
an algorithm to automatically identify the frames where these vehicles are centered
in the image. Given a three-frame sequence of intensity images containing moving
vehicles, we have seen in Sections 4.2 and 6.2 that we can reliably extract the base-
line of the vehicle in the lane closest to the camera. Our automatic frame extraction
algorithm is also based on the same idea. The three frames f1, f2, and f3 containing
the moving vehicle(s) are blurred using a 3x3 averaging mask to produce images f1,

f2, and f3. The difference image is computed as described in Section 3.1:

190.1): |f1(i.j)—f2(i,j)l x mm) — f3(i,j)|- (6.20)

183

The vehicles of interest are located in the lane closest to the camera whose extent in
the :r-direction (£1 and {2) can be determined if the geometry of the camera is known.

The difference image is summed along the columns between {1 and {2:

52
23.0) = Z 00.11. (6.21)
i=€1

Obviously, Dv(j) will be larger if there is movement between {1 and {-2 in column j.
The left side Bl and right side 82 of the vehicle are determined by locating the column
positions where D.,( j ) becomes larger than 0.0517,,ma,x when it is scanned from left to
right and right to left, respectively. Figures 6.8(a)-(c) show an example three-frame
sequence. The difference image is shown in Figure 6.8(d). The boundaries of the lane
closest to the camera are shown in Figure 6.8(e) and the estimation of B, and 32 is
shown in Figure 6.8(f).

To avoid storing the same vehicle twice, we keep track of the state of the system.
We assume that we start in the state NONE when there are no vehicles in the lane
closest to the camera. The three-frame sequence contains a moving vehicle in the

lane closest to the camera if

19...... > 1000, and
(6.22)

82 — B; < 0.8Nc01,

where NCO] is the number of columns in the image. The ﬁrst condition veriﬁes that
there is something moving in the lane closest to the camera and the second con-
dition veriﬁes that there is only one vehicle. The center position of the vehicle is

approximated by
B] + 32

F:
2

(6.23)

184

 

 

 

 

 

 

 

 

 

 

 

(0

Figure 6.8: Estimation of the location of the baseline and length of the vehicle: (a)
frame 1: (b) frame 2; (c) frame 3; (d) difference image; (e) location of the lane
boundaries: (f) column summation and location of B, and 82.

185

If the center position of the vehicle shifts to the right side of the image (B > 1' c0I/2),
then we detect a vehicle. store the three frames of interest and enter the state VEHICLE.
When the center position of the vehicle occupies the left side of the image (B <
1 col/2). it means that the previous vehicle has left the ﬁeld of view and a new vehicle
has entered the ﬁeld of view. We then switch back to the NONE state. The complete
algorithm for automatic frame selection is summarized in Figure 6.9.

Figure 6.10 shows an example of this algorithm. In Figure 6.10(a), ’Dmmu = 14,
meaning that there are no vehicle in the lane closest to the camera. Also the state
of the system is NONE. In Figure 6.10(b) as a vehicle is entering the ﬁeld of view,
We can see that ”Dam“ is getting larger. In Figure 6.10(c), the center position of
the vehicle (indicated by the tick mark) just shifted to the right side of the image.
Thus, the state of the system becomes VEHICLE and the 3-frame sequence is stored
on a disk. Nothing is changed when this vehicle leaves the ﬁeld of view as shown in
Figure 6.10(d). We can detect that there are no vehicles in the lane closest to the
camera in Figure 6.10(e) because ’Dv,max = 8. Then, in Figure 6.10(f), a new vehicle
enters the ﬁeld of view. It can be detected because Dun“ becomes larger than 1000
and the center position is in the left side of the image, thus the state of the system
becomes NONE.

The results of this algorithm for the complete 380 frame sequence are shown in
Figure 6.11. In this ﬁgure, the time is represented along the y axis. Each horizontal
line represents the length 82 — Bl for that particular frame. Thus, the slanted gray
areas show the position of the vehicle in the image as it passes in front of the camera.
The areas in light gray represent the times when there are no vehicles in the lane
closest to the camera. The black segments show the times at which the three frames
containing the moving vehicle centered in the image were extracted. It can be seen

that we were able to correctly identify the 7 vehicles in the video sequence of 380

 

186

 

Input: A sequence of consecutive frames
begin
Input two consecutive frames f1 and f2
Blur f1 and f2 using a 3x3 averaging mask to produce f1 and f2
Status = NONE
while new frames are available
Input the new frame f3
Blur f3 to produce f3
mj=la°ncholA A ‘ .
29.0) = 25;..If1(i.j)— mm x 020.11— mm
endfor
v,max = maxigjchol vaj)
£9;j=1,...,NCO1
if Dv(j) > 0.05Dvmax then Bl =j break

6

 

 

 

 

 

endfor
ﬁg '= Nco.,...,1

_i_f’D.,(j) > 0.0513max then 82 =j break
endfor

 

F = (Bl + B2l/2
i_f (Dv,max > 1000) m (82 — 81 < 0-8Ncol) then
if (Status = NONE) m (B > Neel/2) the};
Status = VEHICLE
Store f1. f2. and f3
en_di_f
i_f (Status = VEHICLE) and (B < Ned/2) then Status = NONE
endif . . . .
fit-f2 fzf—f3 fl(_f2 f2<—f3
Mile

end .
Output: A list of 3-frame sequences containing the moving vehicles of interest

 

 

 

 

Figure 6.9: Algorithm to automatically select interesting frames.

187

     

(a) frame 40 (b) frame 55 (c) frame 64
Dv.max : l4 Dv.max = 1624 D1;.max : 3293

StatezNONE StatezNONE State=VEHICLE

           

(d) frame 75 (e) frame 85 (f) frame 93
Dv.max : 8037 Dv.max = 8 Dv.max = 6526
State=VEHICLE State=VEHICLE State=NONE

Figure 6.10: Example of the automatic frame selection algorithm.

188

Column posiuon

 

 

 

Figure 6.11: Results of the automatic frame selection algorithm.

frames. Note that the four vehicles in the other lane in frames 151 to 197 and 253
to 301 were ignored because they are not located in the lane closest to the camera.
Using this system, we can retain the 21 useful frames and discard the 359 redundant
frames.

The time requirement for this algorithm on a Sparcstation 20 for a sequence of
256x256 images is as follows: 0.7 s to blur the new frame. 0.006 s to evaluate DUO),
0.0001 s to decide if the three current frames should be kept or discarded, and 0.16 s
to transfer the last two frames into the ﬁrst two buffers so that the next frame can be
stored in the third buffer. Note that the operations required by this algorithm (3x3
averaging, frame subtraction and multiplication, column summing, and relational and
logical operations) are very simple.

This algorithm can easily be implemented in real—time using an architecture sim-
ilar to the MaxVideo 20 system available from Datacube. This architecture offers
analog-digital modules, real-time convolvers, storage modules, arithmetic operations,

etc. Also, depending on the application needs, speciﬁc modules and interconnection

189

patterns can be deﬁned. Such a system could be used to process the frames as they
are captured by the camera or played by the VCR and decide if the current set of
three frames should be kept or discarded. By examining our video tapes, we have
noticed that there is a gap of at least half a second and up to 3 seconds between each
vehicle. Thus, if the current three frames have to be kept for further processing, the
storage device should have enough time to store them.

In summary, we have demonstrated that given the appropriate equipment, it is
possible to automatically select the frames where the vehicles are centered in the
images and store them for further processing in real time as the vehicles pass in front

of the camera.

6.5 Effect of Weather Conditions and Time of Day

In this section, we will study the robustness of our algorithms under different weather
conditions and time of the day. All the databases of images which we acquired are
shown in Appendix A. We have extensively tested our system on the databases
Saginaw-Coolidge and Saginaw-Haslett. Images in these databases were captured on
Saginaw Street in East Lansing, Michigan on a bright and sunny afternoon in June.

The images in databases Maple-East and Maple- West were captured on Maple
Road in Troy, Michigan on an overcast afternoon around 2:40 pm on May 24, 1993.
Figure 6.12 shows some examples of color segmentation results. It can be seen that
the results are very good. The algorithm was able to identify regions of different
colors such as the wood sidepanel on the station wagon Figure 6.12(a), the drawing
and lettering on the van in Figure 6.12(c), and the different color stripes on the van
in Figure 6.12(f). The two vehicles in Figure 6.12(d) were well separated.

Figure 6.13 shows some examples of model-based segmentation (using deformable

190

 

       

10

Figure 6.12: Color segmentation results on the Maple databases.

1.91

 

(a) Hatchback (b) Station wagon (c) Sedan

 

((1) Van (6) Pickup truck (f) Station wagon

Figure 6.13: Model-based segmentation results on the Maple databases.

templates) results. It can be seen that the algorithm was able to segment the front
vehicle from the stationary background and the other moving vehicles even for unusual
looking vehicle such as the covered truck in Figure 6.13(f). Only 6 vehicles in the
database of 80 vehicles were misclassiﬁed. Thus, we attained a correct classiﬁcation
rate of 92.5%.

The database Engineering was acquired on the Michigan State University campus
on an overcast morning around 10:20 am on February 25. 1993. Some snow is present
on the ground. Figures 6.14 and 6.15 show a few of the color segmentation and model-
based segmentation results on the Engineering database. The color segmentation
results are quite good. In Figures 6.14(a), however, the roof of the white vehicle was

merged with the pile of snow in the center of the road because the two regions are the

192

 

 

Figure 6.14: Color segmentation results on the Engineering database.

same of color and the Canny edge detector could not detect any boundaries between
them. The model-based segmentation results are also good. Only one vehicle in this
database of 11 vehicles was misclassiﬁed, resulting in a correct classiﬁcation rate of
91%.

The database Fog was acquired on the Michigan State University campus on an
early foggy morning around 8:40 am on January 12, 1995. Because of the intensity
of the fog, most of the stationary background is very blurry and pale. Figure 6.16
shows some color segmentation results. The algorithm was able to isolate the different
colored regions in the cabin of the pickup truck in Figure 6.16(c) and in the hatchback
in Figure 6.16(b). Due to the fog, however, the segmentation of the regions further
away from the camera is not very good and the boundaries between the different

regions are not accurate. The model—based segmentation results are very good as can

193

     

(a) Hatchback (b) Sedan (c) Van
Figure 6.15: Model-based segmentation results on the Engineering database.

be seen from Figure 6.17. The vehicles are close enough to the camera (around 20
feet) that they do not appear blurred by the fog and their edges are still sharp. Thus,
the model—based segmentation process gives satisfactory performance. Only 1 vehicle
in this database of 15 vehicles was misclassiﬁed. yielding a correct classiﬁcation rate
of 93%.

The three databases Dusk were captured on the Michigan State University campus
around 5:40 pm on January 10, 1995. At dusk, it is not possible to use the high
speed shutter. because the CCD matrix does not receive enough light. For the other
databases, we had used a speed of 1/1000 5 to avoid motion blurring. When we tried
to use this speed at dusk. the camcorder warned us that it was not receiving enough
light and we obtained the image shown in Figure A.24 in Appendix A. It can be
seen that this image is extremely dark and image processing algorithms cannot be
successfully applied to it. The images in the database Dusk-I shown in Figure A.22
were captured with a shutter speed of 1/100 5 and the images in the database Dusk-2
shown in Figure A.23 were captured at a regular speed of 1/30 s. It can be seen that
the vehicles in these images are blurry due to their motion.

We tried to apply the color segmentation algorithm to these images and obtained

194

 

 

Figure 6.16: Color segmentation results on the Fog database.

 

 

 

Figure 6.17: Model-based segmentation results on the Fog database.

195

 

(b)

Figure 6.18: Color segmentation results on the Dusk database.

the results shown in Figure 6.18. It can be seen that these segmentation results are
extremely poor. Very few homogeneous regions are obtained; the vehicles appear
very dark and are merged with the region corresponding to the road. From these
segmentation results. it is quite obvious that color or shape matching would not
perform very well. The images are so dark that all the vehicles appear to be black
and the Canny edge detector is not able to separate them from the background.

We also applied the model-based segmentation algorithm and the results are shown
in Figure 6.19. These results are a little bit more satisfactory than the color segmen-
tation results. The vehicles in Figure 6.19(a) and 6.19(b) are reasonably segmented.
However. about half of the vehicles in the database were very poorly segmented as
can be seen from the example in Figure 6.19(c).

To acquire images at night, special camcorders or cameras have to be used. Differ-

196

     

(b)

Figure 6.19: Model-based segmentation results on the Dusk database.

ent cameras present different sensitivities to light. The minimum amount of illumina-
tion necessary to obtain a good image is measured in lux. Typical CCD cameras need
a minimum illumination of 10 to 15 lux. A lux is deﬁned as a unit of illumination
equal to the direct illumination on a planar surface which is at a distance of one meter
from a uniform point source of one candle. Color cameras, in particular, are generally
less sensitive to low light than black and white cameras. A large number of black
and white cameras are commercially available which are used for security purposes
and require a minimum illumination of only 0.05 to 1 lux. Camcorders usually need
a minimum illumination of 5 to 10 lux.

The other important feature that a camera should have for IVHS applications is
the high speed shutter option. We have seen that images which were not captured
with a shutter speed of at least 1/1000 5 show motion blur around the edges of the
moving objects. This motion blur makes the detection of edges more difﬁcult and
our algorithms are not as robust. Most cameras and camcorders have shutter speeds
between 1/30 s and 1/2000 5. Some camcorders even offer speeds of up to 1/10000
s. The sensitivity of the camera varies with the shutter speed. For example. the

camcorder that we used requires a minimum illumination of 0.7 lux at normal speed.

197

330 lux at 1/100 8, and 300 1th at 1/1000. This is the reason why it is difﬁcult to
use the high speed shutter at dusk when the brightness is only about 10 lux. For
comparison. the brightness on a clear day at noon is 100,000 lux, 32.000 lux on a
cloudy day at noon, and 1.000 lux on a clear day one hour before sunset.

One has to be careful in choosing the right camera based on high speed shutter
requirements. A video frame is composed of two ﬁelds containing the odd lines and
the even lines, respectively. A regular shutter opens for 1 / 60 s to capture the ﬁrst
ﬁeld and for another 1/ 60 s to capture the second ﬁeld, thus achieving an acquisition
rate of 30 frames per second. If the shutter is set at a higher speed, then the video
rate is still 30 frames per second, but the following two scenarios are possible: (i)
the shutter opens to get the ﬁrst ﬁeld, waits for about 1/60 s, and opens again to
get the second ﬁeld; (ii) the shutter opens to capture the entire frame and waits for
about 1 / 30 s to capture the next frame. It is obvious that the second option will
produce much better results. Using the ﬁrst option, the edges of the moving objects
will be jagged and more complex image processing operations will be needed. The
Sharp Slim Cam camcorder that we used implements the second option, like most
camcorders who have a high speed shutter option. CCD cameras, however, usually

implement the ﬁrst option, but sometimes offer both.

6.6 Summary

We have designed a complete vision—based matching system to identify vehicle pairs
in two databases of images captured at different locations of a road network. We
were able to identify 8 correct matching pairs in a 6—minute video sequence. To
reduce processing time, we also implemented a faster matching system which does

not rely on the segmentation of the front vehicle. The conﬁdence of the matching

198

pairs produced by this faster system is not as high because color information alone
is used most of the time. We have compared the performance of the two vision-
based matching systems with a manual approach for travel time estimation. We have
observed that the three approaches produce comparable results. Both of our machine
vision systems are able to identify a sufﬁcient number of matching pairs to reliably
estimate the travel time.

We have addressed issues related to an on-site implementation of the machine-
vision system. We have demonstrated that we can automatically select the frames
of interest, namely the frames where the vehicle in the lane closest to the camera
is centered in the image. Finally, we have experimented with different databases of
images captured at different times of the day and under different weather conditions.
We have shown that night vision is a problem and that our algorithms do not per-
form well if the images are too dark. All the other experiments yielded very good

segmentation and matching results.

Chapter 7

Summary and Future Research

This chapter summarizes the results of the research reported in this thesis. Several

directions for future research are also outlined.

7. 1 Summary

We have developed a complete matching system to estimate the travel time between
two points in a road network. Vehicles are observed from the side of the road and
the system determines if the same vehicle passes in front of two different cameras
positioned at different locations in the road network. Using a sufﬁcient number of
matching pairs. the system can estimate the average travel time between the two cam-
era locations. The system is divided into two main stages. The image segmentation
stage consists of extracting the vehicle of interest from the stationary background and
the other moving vehicles. The matching stage examines two segmented vehicles and
determines if the two vehicles are identical based on color and shape information.
We have proposed four image segmentation algorithms. The motion segmentation
technique uses simple frame subtraction, multiplication, thresholding, and morpho—

logical closing. It requires only 3 frames containing the moving objects in the video

199

200

sequence. We have demonstrated that this algorithm identiﬁes the moving areas in
the image reasonably well.

The color segmentation algorithm combines region information obtained by a split-
and-merge technique for color images and edge information. we have shown that
fusing these two pieces of information produces better results in the sense that the
boundaries between the regions are more accurate.

These algorithms are rather general and can be applied to images in many appli-
cation domains. To solve the segmentation problem in our speciﬁc domain, we have
developed two model-based segmentation algorithms.

The ﬁrst algorithm fuses the results of color segmentation and motion segmen-
tation to obtain an accurate contour of the moving object in the image. We use
an active contour approach to enforce continuity and smoothness constraints on the
contour of the moving object. This algorithm was tested on 136 images of moving
objects (vehicles, walking people, robot, ..) captured indoors and outdoors. In all
the cases, we obtained very good segmentation results.

The segmentation algorithm based on motion and color information alone has the
following disadvantage. If there are multiple moving objects in the image, then they
are grouped into a single connected component. Additional model information must
be incorporated to extract only the object of interest. In our application domain, we
have deﬁned a generic model of a vehicle and applied the paradigm of deformable
templates to segment the vehicle in the lane closest to the camera from the back-
ground and other moving vehicles. This algorithm has a number of novel features.
A generic polygonal model of the vehicle was deﬁned by examining a database of
learning examples. This deﬁnes a set of constraints on the relative positions of the
vertices of the polygonal shape. This same technique could be applied to any other

polygonal shape. The imaging model fuses motion and edge information by including

201

two terms in the expression of the energy function. We force the contour of the de-
formable template to align itself with the edges of the front vehicle by comparing its
direction to the gradient direction of the pixels in the image. In addition to segment-
ing the front vehicle. this technique is also able to classify it. into one of ﬁve categories
(sedan. pickup truck. hatchback. station wagon or 4x4. and van). This model—based
segmentation algorithm was applied to 393 images of vehicles and we obtained very
good segmentation results with approximately 92% correct classiﬁcation rate.

For matching purposes. we have suggested to use both color and shape informa-
tion. Color matching is performed by computing a similarity measure based on a
comparison of the 3D color histograms for the two objects. Shape matching is per-
formed by comparing the edges of the two objects. We have proposed a modiﬁed
Hausdorff distance which is best suited for object matching. It is not as sensitive
to noise as the Hausdorff distance and it does not ignore some of the edge points in
the object like the ranked Hausdorff distances. We have demonstrated the properties
of this measure on a number of synthetic and real images. In addition, we use pose
clustering to register the two edge images and determine the translation and scaling
parameters before computing the modiﬁed Hausdorff distance.

We have tested the performance of the entire matching system on a database of
287 images (149 vehicles were observed by the ﬁrst camera and 138 vehicles were
observed by the second camera). We have shown how image acquisition time and
vehicle class can be used to index the database and avoid computing matching scores
for every pair of vehicles. The thresholds on the matching scores to determine if two
vehicles are identical were determined by building a cumulative distribution of the
matching scores under the null hypothesis of randomness. We were able to identify 8
matching pairs and veriﬁed that they were indeed true matching pairs.

We also proposed a faster implementation of the complete matching system. This

202

implementation does not use, an accurate segmentation of the front vehicle but only
extracts a rectangular box around it. It also uses color matching as an indexing
process to avoid computing unnecessary edge matching scores.

The two machine vision implementations were compared to a manual approach
which consists of looking at the video tapes and isolating unusually looking vehicles
to identify matching pairs. All three approaches produced similar results. Thus, we
have shown that a machine vision approach to automatically estimate the average
travel time between two points in a road network is feasible.

We have also addressed the problem of on-site implementation by designing an
algorithm which automatically selects the frames of interest in the video sequence.
namely the frames where there is a vehicle in the lane closest to the camera and it is
approximately centered in the image. We also briefly examined the effect of different
weather conditions and time of the day by processing images that were captured
during sunny afternoons. overcast mornings and afternoons, foggy mornings, and at
dusk. Our algorithms could handle all tested conditions except dusk. mostly because

our camcorder was not sensitive enough to low illumination conditions.

7 .2 Future Research

There are a number of research issues which could be addressed in the future. The
robustness of the color histogram matching module should be investigated further.
In our experiments, the orientation of the sun at the two camera locations was ap-
proximately the same. We should study how the color matching score is affected if
the trafﬁc is not monitored on a straight road and the orientation of the light on the
side of the car is drastically different. we suspect that there might be problems only

on bright sunny days when the sun is low (early morning or late afternoon). This is a

2013

problem related to color constancy and the most important question is "Do we need
to apply a color constancy algorithm as described in Section 3.2.3 before computing
the color matching score?”

In Chapter 4 we have proposed an algorithm to segment the vehicle in the lane
closest to the camera from the stationary background and the other moving vehi-
cles using the deformable templates paradigm. \Ve have deﬁned an energy function
based on motion information, edge magnitude and directionality, and constraints on
the shape of the deformable template. We have seen that this energy function is
quite complicated and contains a large number of local minima. We have chosen
a simulated annealing approach to perform the minimization task and have used a
geometric temperature cooling schedule to limit the total number of iterations. How-
ever, as mentioned earlier, we had problems in choosing the parameters of the cooling
schedule (starting temperature and maximum number of iterations). We had to run
the simulated annealing algorithm for different sets of parameters and choose the
schedule which returned the minimum energy value to segment the front vehicle. For
practical purposes it is not feasible to implement this strategy on-site. We would need
a large number of computers working in parallel on the same problem (for example
a cluster of workstations) and the resulting cost of the travel time estimation system
would be exorbitant. Thus, the minimization algorithm for image segmentation us—
ing deformable templates and its computational requirements should be investigated
further.

There exist other approaches for minimizing functions with a large number of local
minima. Genetic algorithms [76] have the advantage that the search is done in parallel
in many areas of the search space, but their disadvantage is that they are extremely
slow. Multistart approaches [10] also explore different areas of the search space. The

approximate location and length of the vehicle, determined using motion information

204

can be used to intelligently select. the different starting positions. One can also explore
dynamic progran'uning approaches [35], branch-and-bound techniques [19], iterated
conditional modes [9. ~11] and other cooling schedules for simulated annealing [171].

The rules which constrain the shape of the deformable template were derived
manually by examining a set of vehicles for each class. Recent publications [53, 116]
tackle the problem of automatically deriving the prior probability on the shape of the
template by determining an average image or prototype shape of the template based
on a set of training examples. A similar technique might be used to automatically
establish the rules which constrain the shape of our polygonal templates. It might
also be useful to include the a priori information about the frequency with which
each vehicle model is typically observed in real road scenes. In particular, we know
that there are more sedans than pickup trucks.

During the matching stage, we compute matching scores based on color and edge
information and a conﬁdence measure on those matching scores. W'hen computing
the estimate of the travel time, it might be a good idea to weight the contribution
of individual vehicle travel times in favor of those that are more likely to be correct
matches.

The notion of color-based indexing needs further examination. In Section 6.2, we
used color indexing to compute only two edge matching scores per vehicle. However,
this idea does not really obey the deﬁnition of indexing because the entire database
still needs to be examined to compute the color matching scores and determine the
best two candidates. Usually, indexing a database involves a number of comparisons
of the order of log2 N where N is the size of the database, and in the ideal case, the
number of comparisons should be independent of the size of the database. This was
indeed the case when we used vehicle class indexing in Section 6.1. The database was

divided into five bins, one for each vehicle class. The sensation of color, however, is

205

much fuzzier than the definition of vehicle category. Since color is defined continu-
ously, the division of the vehicle database into bins of different vehicle colors is not
trivial. A better approach might be to use a tree structure to index the database
using color information.

Finally, all the image segmentation algorithms and matching strategies proposed in
this thesis use a number of user-deﬁned parameters. Most computer vision algorithms
suffer from this drawback. I-Iowever, performance evaluation and automatic selection
of tuning parameters is now beginning to receive some attention [156]. It will be
desirable to develop an object matching system that can learn the “optimal” set of
parameters from the given training images and the user requirements in terms of error
probabilities.

In summary, we identify the following research topics for future study:

0 Color constancy,
o Minimization algorithm for the deformable templates,
0 Automatic derivation of a prototype shape,

Color indexing,

0

Automatic selection of tuning parameters.

0

APPENDICES

Appendix A

Image Databases

This appendix shows the entire database of images on which the algorithms described
in this thesis were tested. These images were acquired using either a GE HQ 9-9806
or 3. Sharp Slim Cam VL-L6‘2U camcorder. The tapes were played on the VCR
in the PRIP Laboratory and individual frames were digitized using the digicolor
system consisting of a Data Cube frame grabber board connected to a Sun 4/330
Sparcstation. Table A.1 gives the date, time, location. camcorder type, shutter speed,
and number of images for each database. Figures A.1 to A.28 show the middle frame
of the three-frame color sequence.

The database Sequence was digitized from the same video tape as the database
Saginaw-Coolidge. A sequence of 380 frames was digitized to test the algorithm for
frame selection described in Chapter 5. Every other frame of this database is shown

in Figures A.29 to A.32.

207

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

27% m. as?) 922$ anmnv wag J :3. wimcwq ammm 852.com.
m m 8\ _ no .582 82 .8 saw 2558 am: 393339
2. a 8> mo Es8”: 82 .8 saw 83 .53 siege
8. m 8\ _ no 5.288 82 .8 so .3 2E Esteem
8 w 2.2 2 no 5.288 82 .8 so .3 22.2 238$
_ w 82\ _ 22m 58% 82 .2 :2. 82:8 am: .326
... m 8> 83% 5.22.8 82 .2 5:. 2566 am: 325
x m 2:.) 9.2m anvum m2: .2 :3. 2:7ch DmE ~-«m:Q
...._ ... 82> 2.5;. .828 82 .2 is 8955 2m: 8;
l: ... 82\ _ Sea .882 82 .8 .32 .2an am: 8:82.88
..2. m 82> 22;. 5.22.8 82 .8 82 229 282-382
2. ... 82> 8.5.5. .5228. 82 .8. 8: 22... Ewes:
7.2 m 82> 2.3m 583 82 .2 :8 8223 see 2.282358%
3: m :32) 9:25. Emomnv $3 4 :3. maize; amwm mEEooD$~Ezmem
T.&x:: .3 dz H Beam 232:5 eizoicmu BEE 3mm :23qu _ﬂ 0852

 

.mvwdﬁdudmv

2: 20 828.5 2.... 2.3.

 

208

 

 

Figure A.1: Database .S’aginair-C'nolidge: vehicles I to '20.

209

 

 

Figure .-\.2: Database .S'nga(r-C'oolidgfz vehicles 31 to 4D.

210

 

 

Figure .-\.3: Database Sag/maw-("ooildJ-e: vehicles 41 to 60.

211

 

 

l...‘ "‘2‘;’.£d?-J:.' ‘. M " ' *

Figure AA: Database Sagimm-C'oolidge: vehicles 61 to $0.

212

 

 

Figure A5: Database Saginau'-C'001i(lyc: vehicles 31 to 100.

213

 

”.-.—...“;

 

Figure A.6: Database Stigmau'—("001’1dge: vehicles 101 to 120.

214

 

 

Fi ure A.7: Database Satinutr-Coo/id 6: vehicles 1'21 to 140.
S J 9

215

 

 

Figure A.S: Database Sagimzu‘-C'oolidgc: vehicles 141 to 149.

216

 

A

 

Figure A3: Database Sagzrzaw-Hasirét‘: vehicles 1 to ‘20.

217

 

 

Figure .—\.10: Database 5(1gim‘1u'-Has/EH: vehicles 21 to 40.

218

 

 

Figure A.11: Database Saginaw-Haslett: vehicles 41 to 60.

219

 

 

Figure .-\.12: Database Saginuw-Has/cfi: vehicles 61 to 80.

220

Figure A.13: Database Saginaw-Haslett:

vehicles 81 to 100.

. -\I.:‘

 

221

 

 

Figure .-\.1~l: Database .Saginaw-Has/cz‘f: vehicles 101 to 1'20.

222

 

 

 

Figure A.17): Database Sag/nuir-Hmlt/f: vehicles 131 to 13$.

 

223

 

Figure .-\.16: Database .llaple-Easf: vehicles 1 to '20.

224

 

Figule ..-\ 17: Database .Uaple- East: vehicles '21 to 40.

225

 

 

Figure A.1S: Database .llaple-l-l’csf: vehicles 1 to 20.

226

 

 

Figure A.19: Database .llapIe-ll'rsf: vehicles '21 to ~10.

227

 

 

Figure A320: Database Engineering.

228

 

 

Figure .~\.21: Database Fog.

229

 

 

 

 

Figure .-\.23: Database Dusk-3.

 

Figure .~\.'_’~l: Database Desk-.3.

230

 

 

 

 

 

Figure Ail-5: Database People-In.

231

 

 

 

 

 

Figure A526: Database People-Out.

232

 

Figure A.27: Database Objects-In.

 

 

Figure A.28: Database Objects-Out.

Figure A.29: Database Sequence: odd frames 1 to 95.

 

Figure A.30: Database Sequence: odd frames 97 to 191.

 

Figure A.31: Database Sequence: odd frames 193 to 287.

 

Figure A.32: Database Sequence: odd frames 289 to 379.

 

BIBLIOGRAPHY

Bibliography

[0

[31

[4]

[51

Mobility 2000 presents intelligent vehicles and highways systems, 1990 sum-

mary. Texas Transportation Institute, College Station, TX, 1990.

O. Abou Khaled, N. Le Fort, M. Hassoun. and C. Laugier. Real-time decision
system for the Prolab II demonstrator. In Proc. IEEE Intelligent Vehicles

Symposium, pages 485—491, Paris, France, 1994.

A. Amini, T. Weymouth, and R. Jain. Using dynamic programming for solving
variational problems in vision. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 12(9):855—867, 1990.

Y. Amit, U. Grenander, and M. Piccioni. Structural image restoration
trough deformable templates. Journal of the American Statitical Association,

86(414y376—387,1991.

C. Baghdassarian, H. Lange, H. Sahli, and C. Laurgeau. Recognition of ar-
rows in the environment of road markings. In Proc. IEEE Intelligent Vehicles

Symposium, pages 219-224, Paris, France, 1994.

D. H. Ballard and C. M. Brown. Computer Vision. Prentice—Hall, Englewood

Cliffs, NJ, 1982.

[\D
00
\J

[Tl

[8]

[lol

[11]

[121

[131

[14]

[15]

[16]

238

H. G. Barrow. J. M. Tenenbaum, R. C. Bolles. and H. C. Wolf. Parametric cor—
respondence and chamfer matching: Two new techniques for image matching.

In Proc. International Joint Conference on Artiﬁcial Intelligence, 1977.

R. Basri and S. Ullman. The alignment of objects with smooth surfaces. In Proc.
IEEE International Conference on Computer Vision. pages 482—488, Tarpon

Springs, FL, 1988.
J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal

Statistical Society B, 48(3):259—302, 1986.

B. Betro and F. Schoen. Sequential stopping rules for the multistart algorithm

in global optimisation. ilrfathematical Programming, 38:271—286, 1987.

S. Beucher and M. Bilodeau. Road segmentation and obstacle detection by a
fast watershed transformation. In Proc. IEEE Intelligent Vehicles Symposium,

pages 296—301, Paris, France, 1994.

J. R. Beveridge, J. Grifﬁth, R. R. Kohler, A. R. Hanson, and E. M. Riseman.
Segmenting images using localized histograms and region merging. International

Journal of Computer Vision, 2:311-347, 1989.

I. Biederman. Recognition by components: A theory of human image under-

standing. Psychological Research, 94(2):115—147, 1987.

A. Blake. Boundary conditions for lightness computation in Mondrian world.

Computer Vision, Graphics and Image Processing, 32:314—327, 1985.
A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.

Transportation Research Board. Advanced vehicle and highway technologies.

Technical Report No. 232, National Research Council, Washington, D.C., 1991.

[17]

113]

[19]

[22]

[‘23]

239

R. Bolles and P. Horaud. ISDPO: A three-dimensional part orientation system.

International Journal of Robotics Research. 5(3):3~26. 1986.

R. C. Bolles and R. A. Cain. Recognizing and locating partially visible objects.
the local-feature—focus method. International Journal of Robotics Research,

1(3):57-- 2. m2.

Clo

G. Brassard and P. Bratley. Algorithmics Theory and Practice. Prentice Hall,

Englewood Cliffs. 1988.

G. Brelstaff and A. Blake. Computing lightness. Pattern Recognition Letters,

52129—138, 1987.

R. Brooks. Model-based three-dimensional interpretations of two-dimensional
images. IEEE Transactions on Pattern Analysis and Machine Intelligence,

5(2):140-150, 1983.

G. Buchsbaum. A spatial processor model for object colour perception. Journal

of the Franklin Institute, 310(1):1—‘27, 1980.

C. Cafforio and F. Rocca. Methods for measuring small displacements of tele-

vision images. IEEE Transactions on Information Theory, 22:573—579, 1976.

J. Canny. A computational approach to edge detection. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 8(6):679—698, 1986.

M. Cappello, M. Campani, and A. Succi. Detection of lane boundaries, inter-
sections and obstacles. In Proc. IEEE Intelligent Vehicles Symposium, pages

284—289, Paris, France, 1994.

['26]

[‘29]

[30]

[31]

[34]

210

l. (latling and l3. McQueen. Road transport informatics in Europe — Major
programs and demonstrations. IIL‘EE: 'Iinnsartions on l'elzicular Tic/urology,

.1()(1):1:i2-1.m. 1091.

11. Chen and T. lluang. Maximal matching of 3D points for multiple-object

motion estimation. Pattern Recognition, 21(2):?5-90. 1988.

J.-L. Chen, G. C. Stockman. and I\'. Rao. Recovering and tracking pose of
curved 3d objects from 2d images. In Proc. IEEE Conference on Computer

Vision and Pattern Recognition. pages 233-239. New York, NY, 1993.

S.—W. Chen and A. l\'. Jain. Object extraction from laser radar imagery. Pattern

Recognition, 24(6):.587—600. 1991.

Y. Chen and G. IVIedioni. Object modelling by registration of multiple range

images. Image and Vision Computing, 10(3):145—155, 1992.

J. K. Cheng and T. S. Huang. Image registration by matching relational struc-

tures. Pattern Recognition. 17(1):149—159. 1984.

P. Chou and C. Brown. The theory and practice of Bayesian image labelling.

International Journal of Computer Vision. 4:185—210, 1990.

C.-C. Chu and J. Ii. Aggarwal. The integration of image segmentation maps
using region and edge information. IEEE Transactions on Pattern Analysis and

Machine Intelligence. 15(12):l241—1252, 1993.

S. D. Cochran and G. Medioni. 3-D surface description from binocular stereo.
IEEE Transactions on Pattern Analysis and .lIachine Intelligence, l4(10):981—

994, 1992.

[35]

[:56]

[37]

[:38]

[39]

[401

[41]

[431

2-11

T. 11. (.‘ormen. C. F. Leiserson. and R. L. Rivest. Introduction to Algorithms.

.\“IcGraw—llill. 1991.

Y. (i'outance. La ('oulcur en lv'ision par Ordinattur: .rlpplication a la Robotique.
PhD thesis. Laboratoire (l'Automatique et d‘Analyse des Systemes du CNRS,

Fniversité Paul Sabatier. Toulouse. France, 1991.

M. Cremer and H. Keller. A new class of dynamic methods for the identification

of origin—destination flows. Transportation Research. 218(2):117—132, 1987.

A. Cumani. A. Guiducci. and P. Grattoni. Image description of dynamic scenes.

Pattern Recognition. 21(7):661—6T3, 1991.

L. Davis. Shape matching using relaxation techniques. IEEE Transactions on

Pattern Analysis and Machine Intelligence. 1(1):60—72, 1979.

E. D. Dickmanns. R. Behringer, D. Dickmanns, T. Hildebrandt, M. Maurer,
F. Thomanek, and J. Schiehlen. The seeing passenger car ‘VaMoRs-P’. In

Proc. IEEE Intelligent Vehicles Symposium. pages 68—73, Paris, France, 1994.

R. C. Dubes. A. K. Jain, S. G. Nadabar, and C. C. Chen. MRF model-based al-
gorithms for image segmentation. In Proc. International Conference on Pattern

Recognition, pages 808—814, Atlantic City, NJ, 1990.

M.-P. Dubuisson and A. K. Jain. Object contour extraction using color and mo-
tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition,

pages 471—476, New York, NY, 1993.

M.-P. Dubuisson and A. K. Jain. 2D matching of 3D moving objects in color
outdoor scenes. In Proc. IEEE Conference on Computer Vision and Pattern

Recognition, pages 887—891. Seattle. WA, 1994.

ltﬂ

[46]

[47l

[48]

[491

212

.\l.-P. Dubuisson and A. K. Jain. Fusing color and edge information for object
matching. In Proc. IEEE International (i'onft rence on Image Processing. volume

III. pages $l824l86. Austin. TX. 199-’1.

.\l.-I’. Dubuisson and A. K. Jain. A modified Hausdorff distance for object
matching. In Proc. 13th Intt rnational Conference on Pattern Recognition, pages

566—568. Jerusalem. Israel. 1994.

.\"I.-P. Dubuisson and A. K. Jain. Contour extraction of moving objects in
complex outdoor scenes. International Journal of Computer Vision. 14:83-105,

1995.

;\'I.-P. Dubuisson. A. K. Jain. and \V. C. Taylor. Segmentation and matching
of vehicles in road images. In Transportation Research Record. number 1412,

pages 57—63., Washigton, DC, 1993.

i\’I.-P. Dubuisson. A. K. Jain, and W. C. Taylor. A vision—based vehicle matching
system. In Proc. IEEE Intelligent I»"ehicles Symposium, pages 266—271, Paris,

France. 1994.

MsP. Dubuisson, S. Lakshmanan. and A. K. Jain. \r’ehicle segmentation using
deformable templates. Technical Report CPS-.9465, Michigan State University,

East Lansing, MI. 1995.

J. Duncan and T.—C. Chou. On the detection of motion and the computation of
optical flow. IEEE Transactions on Pattern Analysis and hIachine Intelligence,

14(3y346—352.1992.

[51]

[56]

[57]

[58]

243

W. Efenberger. ().—II. Ta. L. Tsinas. and V. Graefe. Automatic recognition of
vehicles approaching from behind. In Proc. IEEE Intelligent l’rhicles Sympo-

sium, pages 57 ~62. Detroit. MI. I992.
E. Eldridge. Truckers go high-tech. Lansing State Journal. May. ‘2 1993.

R. Epstein and A. Yuille. Training a general purpose deformable template.
In Proc. IEEE International Conference on Image Processing, volume 1, pages

203—2207, Austin. TX, 1994.

R. Ervin. An American observation of IVHS in Japan. Technical report, Uni-

versity of Michigan. Ann Arbor. MI. 1991.

S. Estable, J. Schick. F. Stein, R. Janssen, R. Ott, W. Ritter, and Y.-J. Zheng.
A real—time traffic sign recognition system. In Proc. IEEE Intelligent Vehicles

Symposium, pages 213—218, Paris, France, 1994.

M. Etoh and Y. Shirai. Segmentation and 2D motion estimation by region
fragments. In Proc. IEEE International Conference on Computer Vision, pages

192—199, Berlin. Germany, 1993.

M. Etoh, Y. Shirai, and M. Asada. Contour estraction by mixture density
description obtained from region clustering. In Proc. European Conference on

Computer Vision, pages 24—32, Santa Margherita Ligure, Italy, 1992.

T. J. Fan, G. Medioni, and R. Nevatia. Recognizing 3D objects using surface
descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence,

11(11y1140—1157.1989.

[59]

[60]

[GA

[62]

[63]

[641

[651

214

F. Farrokhnia and A. K. Jain. A nmlti-channel filtering approach to texture
segmentation. In Proc. IEEE (.l'onference on Computer Vision and Pattern

Recognition. pages 36-1—3370, Maui. HI. 1991.

O. D. Faugeras and .\I. Hebert. The representation. recognition. and locating

of 3—D objects. International Journal of Robotics Research. 5(3):L7—52. 1986.

C. Fennema and W. Thompson. Velocity determination in scene containing
several moving objects. Computer Graphics and Image Processing, 9:301—315,

1979.

D. J. Fleet and A. D. Jepson. Computation of component image velocity from
local phase information. International Journal of Computer Vision, 5(1):77—

104, 1990.

P. J. Flynn. Saliencies and symmetries: Towards 3D object recognition from
large model databases. In Proc. IEEE Conference on Computer Vision and

Pattern Recognition. pages 322—327, Urbana, IL, 1992.

P. J. Flynn and A. K. Jain. BONSAI: 3—D object recognition using constrained
search. IEEE Transactions on Pattern Analysis and lleachine Intelligence,

13(10y1066~1075.1991.

P. J. Flynn and A. K. Jain. 3D object recognition using invariant feature
indexing of interpretation tables. CVGIP: Image Understanding, 55(2):119—

129, 1992.

D. Forsyth. A novel approach to color constancy. In Proc. IEEE International

Conference on Computer Vision, pages 9—18, Tarpon Spring, FL, 1988.

 

[0'73

[69]

[ml

245

'l‘. Fraichard. M. lIassoun. and N. Lefort. An electronic co—pilot: the ProLab
II demonstrator. In Proc. European Prometheus Workshop on Collision Avoid-

ance. Nurtingen. Germany. 1992.

Ii. Francois and P. Bouthemy. ;\Iultiframe-based identification of mobile compo-

nents of a scene with a moving camera. In Proc. IEEE Conference on Computer

Vision and Pattern Recognition. pages 166—172, Maui, III, 1991.

G. Funka—Lea and R. Bajcsy. Vision for vehicle guidance using two road cues. In

Proc. IEEE Intelligent Vehicles Symposium, pages 126—131, Detroit, MI, 1992.

P. Gader, B. Forester. M. Ganzberger. A. Gillies. B. Mitchell, M. VVhalen,
and T. Yocum. Recognition of handwritten digits using template and model

matching. Pattern Recognition, 24(5):~121—432, 1991.

E. B. Gamble, D. Geiger, T. Poggio. and D. \Neinshall. Integration of vision
modules and labelling of surface discontinuities. IEEE Transactions on Systems,

.Wan, and Cybernetics, 19:1576—1581. 1989.

D. Geman and S. Geman. Stochastic relaxation. Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and

.lIachine Intelligence. 6(6):?21-741, 1984.

D. Geman, S. Geman, C. Graffigne. and P. Dong. Boundary detection by
constrained optimization. IEEE Transactions on Pattern Analysis and Machine

Intelligence. 12(7):609—628, 1990.

" ’ S. Geman and C.-R. Hwang. Diffusions for global optimization. SIAM Journal

on Control and Optimization. 24(5):1031—1043. 1986.

[T5]

[79]

[80]

[81]

[82]

216

B. Gidas. The Langevin equation as a global minimization algorithm. In
Ii. Bienenstock. F. Flogeman. and G. “leisbuch. editors. Disordered Systems
and Biological Organization. volume F20 of NA TO :15] Series. pages 321—326.

Springer-\erlag. 1986.

D. E. Goldberg. (lenetic .A-llgorithms in Search. Optimization. and lilac/zine

Learning. Addison-\Vesley, 1, 989.

U. Grenander. Y. Chow, and D. M. Keenan. Hands: A Pattern Theoretic Study

of Biological Shapes. Springer-Verlag, New York, 1991.

U. Grenander and M. 1. Miller. Jump-diffusion processes for abduction and
recognition of biological shapes. Technical report, Monograph of the Electronic

Signals and Systems Research Laboratory, Washington University, 1991.

W. E. L. Grimson. Computational experiments with a feature based stereo
algorithm. IEEE Transactions on Pattern Analysis and ll/Iachine Intelligence,

7:17—34, 1985.

W'. E. L. Grimson. Object Recognition by Computer: The Role of Geometric

Constraints. MIT Press, Cambridge, MA, 1990.

W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by search-
ing the interpretation tree. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 9(4):469—482, 1987.

G. Healey and T. Binford. A color metric for computer vision. In Proc. IEEE
Conference on Computer Vision and Pattern Recognition, pages 10-17, Ann

Arbor, MI, 1988.

[s3]

[84}

[39}

[90]

[91]

[92]

247

D. .l. lleeger. Optical flow using spatiotemporal filters. International Journal

of Computer Vision. 2(-l):279~302. 1988.

\v". von Holt. Tracking and classification of overtaking vehicles on Autobahnen.
In Proc. IEEE tnttttigmt lihictes Symposium. pages 31«1—3l9. Paris. France,

1994.

B. Ilorn. Determining lightness from an image. Computer Graphics and Image

Processing. 3:277—299. 1974.

B. K. P. Horn and B. Schunck. Determining optical flow. .irlrtiﬁciat Intelligence,

17(185—203).1981.

S. Horowitz and T. Pavlidis. Picture segmentation by a tree traversal algorithm.

Journal of the Association for Computing htachinery, 23(2):368—388, 1976.

Y. Z. Hsu. H.-H. Nagel, and G. Rekers. New likelihood test methods for change
detection in image sequences. Computer Vision. Graphics and Image Process-

ing, 24:73--106, 1984.

M. K. Hu. Visual pattern recognition by moment invariants. IEEE Transactions

on Information Theory, 8:179—187, 1962.

L. J. Hubert and J. Schultz. Quadratic assignment as a general data-analysis
strategy. British. Journal of Mathematical and Statistical Psychology, 29:190—

241,1976.

L. Hurvich and D. Jameson. An opponent-process theory of color vision. Psy-

chological Review, 64(6)::384—404, 1957.

D. Huttenlocher and S. Ullman. Recognizing solid objects by alignment. Inter-

national Journat of Computer In’ision. 5(2):195—212, 1990.

[9:3]

[94]

[96]

[97]

[98]

[991

[100]

[101]

248

D. P. Huttenlocher. G. A. Klanderman. and \V. J. Rucklidge. Comparing im—
ages using the Hausdorff distance. IEEE Transactions on Pattern Analysis and

.‘ltachine Intelligence. 15(9):850—863. 1993.

K. Ikeuchi and T. Kanade. Automatic generation of object recognition pro-

grams. Proceedings of the IEEE. 76(8):1016—1035, 1988.

T. Ito and K. Yamada. Preceding vehicle and road lanes recognition methods
for RCAS using vision system. In Proc. IEEE Intelligent Vehicles Symposium,

pages 85-90, Paris, France. 1994.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,

Englewood Cliffs, NJ. 1988.

A. K. Jain and F. Farrokhnia. Unsupervised texture segmentation using Gabor

filters. Pattern Recognition, 24:1167-1186, 1991.

A. K. Jain and P. J. Flynn (eds). 3D Object Recognition Systems. Elsevier,

Amsterdam, 1993.

A. K. Jain and R. Hoffman. Evidence-based recognition of 3d objects. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 10(6):?83—802,

1988.

A. K. Jain and S. G. Nadabar. MRF model-based segmentation of range images.
In Proc. IEEE International Conference on Computer Vision, pages 667—671,

Osaka, Japan, 1990.

R. Jain, W. N. Martin, and J. K. Aggarwal. Segmentation through the detection
of changes due to motion. Computer Graphics and Image Processing, 11:13—34,

1979.

[102]

[103]

[104]

[105]

[106]

[107]

[108]

[109]

[110]

219

R. Jain and II.-II. Nagel. On the analysis of accumulative difference pictures
from image sequences of real world scenes. IEEE Tranwctions on Pattern

Analysis and .ltachine Intelligence. l(2):206—21-’l. 1979.

M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models.

International Journal of Computer Vision, 2:321—331. 1988.

R. Kasturi and R. C. Jain. Computer Vision: Principles. IEEE Computer

Society Press, 1991.

J. K. Kearney, W. B. Thompson, and D. L. Boley. Optical-flow estimation:
An error analysis of gradient-based methods with local optimization. IEEE

Transactions on Pattern Analysis and Machine Intelligence. 9(2):229—244, 1987.

N. Kehtarnavaz, N. Grisworld, and J. Lee. Visual control of an autonomous ve-
hicle (BART) - the vehicle-following problem. IEEE Transactions on Vehicular

Technology, 40(3):654—662, 1991.

D. Kelly. Results of a field trial of the IMPACTS image processing system for
traffic monitoring. In Proc. Conference on Vehicle Navigation and Information

Systems, pages 151—167, Dearborn, MI, 1991.

M. Kilger. A shadow handler in a video-based real-time traffic monitoring
system. In IEEE Workshop on Applications of Computer Vision, pages 1060—

1066, Palm Springs, CA, 1992.

M. Kilger. Heavy trafﬁc monitoring in real-time. In Proc. Scandinavian Con-

ference on Image Applications, pages 1077—1083, Tromso, Norway, 1993.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated

annealing. Science. 2202671—680, 1983.

 

 

[111]

[112]

[11:3]

[114]

[115]

[116]

[117]

[118]

[119]

230

G. Klinker. S. Shafer. and T. Kanade. A physical approach to color image

understanding. International Journal of Computer Vision. 4:7~38. 1990.

J. J. Koenderink and A. J. van Doorn. Internal representation of solid shape

with respect to vision. Biological Cybernetics. 32(4):211~-2l6. 1979.

D. Koller. K. Daniilidis. T. Thorhallson, and H.-H. Nagel. Model—based object
tracking in traffic scenes. In Proc. European Conference on Computer Vision,

pages 437—452. Santa Margherita Ligure, Italy. 1992.

D. Koller. .\'. Heinze. and H.—H. Nagel. Algorithmic characterization of vehicle
trajectories from image sequences by motion verbs. In Proc. IEEE Conference

on Computer Vision and Pattern Recognition, pages 90—95, Maui, HI, 1991.

D. Koller, J. Weber, and J. lV’Ialik. Towards realtime visual based tracking in
cluttered traffic scenes. In Proc. IEEE Intelligent Vehicles Symposium, pages

201—206, Paris, France, 1994.

K. F. Lai and R. T. Chin. Deformable contours: Modeling and extraction. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages

601—608, Seattle, WA, 1994.

S. Lakshmanan and D. Grimmer. A deformable template approach to detecting
straight edges in radar images. submitted to IEEE Transactions on Pattern

Analysis and Machine Intelligence, 1994.

Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efﬁcient
model—based recognition scheme. In Proc. IEEE International Conference on

Computer Vision, pages 238—249, Tarpon Springs, FL, 1988.

E. H. Land. The retinex. American Scientist. 52:247—264, 1964.

p20]

[125]

[126]

251

E. H. Land. The retinex theory of color vision. Scientiﬁc American. 2372108-

E. H. Land. Recent. advances in retinex theory. Vision Research. 26(1):7—21,

1986.

E. H. Land and J. J. McCann. Lightness and retinex theory. Journal of the

Optical Society of America A, 61(1):1—11, 1971.

G. C. Lee. Reconstruction of Line Drau’ing Graphs from Fused Range and

Intensity Imagery. PhD thesis. Michigan State University, 1992.

H.-C. Lee. Method for computing the scene-illuminant chromaticity from spec-
ular highlights. Journal of the Optical Society of America A, 3(10):1694—1699,

1986.
M. D. Levine. Vision in Man and .Machine. McGraw Hill, 1985.

D. G. Lowe. Three-dimensional object recognition from single two-dimensional

images. Artiﬁcial Intelligence, 31:335—395, 1987.

D. G. Lowe. Fitting parametrized three-dimensional models to images. IEEE

Transactions on Pattern Analysis and htachine Intelligence. 13(5):441—450,

1991.

L. T. Maloney and B. A. VVandell. Color constancy: A method for recover-
ing surface spectral reflectance. Journal of the Optical Society of America A,

3(1y29—33,1986.

K. V. Mardia, T. J. Hainsworth, and J. F. Haddon. Deformable templates in
image sequences. In Proc. International Conference on Pattern Recognition,

volume 2, pages 132—135, The Hague, The Netherlands, 1992.

 

[1:10]

[1:31]

[132]

[133]

[136]

[137]

252

L. Matthies and A. Elfes. Integration of sonar and stereo range data using a
grid—based representation. In Proc. IEEE International Conference on. Robotics

and Automation. pages 723—733, Philadelphia. PA. 1988.

J. J. i\*lcCann. S. P. McKee, and T. II. Taylor. Quantitative studies in retinex

theory. Vision Research. 16:445458, 1976.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equation of state calculations by fast computing machines. The Journal of

Chemical Physics, 21(6):1087—1092, 1953.

P. G. Michalopoulos. Vehicle detection video through image processing: The
Autoscope system. IEEE Transactions on Vehicular Technology, 40(1):21—29,

1991.

D. L. Milgram. Region extraction using convergent evidence. Computer Graph-

ics and Image Processing, 1121—12, 1979.

K. S. Miller, T. N. Harvey, P. W. Shuldiner, and C. Ho. Using video technology
to conduct the 1991 Boston region external cordon survey. In Transportation

Research Record, number 1412, pages 46—56, Washington, DC, 1993.

M. I. Miller, A. Srivastava, and U. Grenander. Conditional-mean estimation
via jump-diffusion processes in multiple target tracking/recognition. submitted

to IEEE Transactions on Signal Processing, 1994.

H. P. Moravec. Sensor fusion in certainty grids for mobile robots. AI Magazine,

9(2):61-74, 1988.

[138]

[131)]

[140]

[141]

[142]

[143]

[144]

[145]

253

B.-A. Moshe. Methods to combine different data sources and estimate origin-
destination matrices. In N. H. (lartner and N. H. M. Wilson. editors. Trans-

portation and Traffic Theory. pages 459-481. 1987.

II.-Il. Nagel. Overview on image sequence analysis. In T. S. Huang, editor,
Image Sequence. Processing and Dynamic Scene Analysis. pages ‘2—39. Springer-

Verlag. 1983.

H.-H. Nagel. Image sequences — ten (octal) years —— from phenomenology
towards a theoretical foundation. In Proc. International Conference on Pattern

Recognition, pages 1174—1185, Paris. France, 1986.

H.-H. Nagel and W. Enkelmann. An investigation of smoothness constraints
for the estimation of displacement vector fields from image sequences. IEEE

Transactions on Pattern Analysis and IWachine Intelligence, 8(5):565—593, 1986.

S. Negahdaripour and A. K. Jain. Challenges in computer vision research;
future directions of research. In Proc. IEEE Conference on Computer Vision

and Pattern Recognition, pages 189—199, Urbana, IL, 1992.

R. Nevatia and T. Binford. Description and recognition of complex-curved

objects. Artiﬁcial Intelligence, 8:77-98, 1977.

R. Ohlander, K. Price, and D. R. Reddy. Picture segmentation using a recursive
region splitting method. Computer Graphics and Image Processing, 8:313—333,

1978.

Y.-I. Ohta, T. Kanade, and T. Sakai. Color information for region segmentation.

Computer Graphics and Image Processing, 13:222—241, 1980.

[116]

[147]

[148]

[149}

[150]

[151]

[152]

[153]

254

S. Pankanti. A. K. Jain, and M. Tuceryan. On integration of vision modules.
In Proc. IEEE (Conference on. Computer Vision and Pattern Recognition, pages

316~3‘Z‘Z, Seattle. WA, 1994.

T. Pavlidis and \".-'I‘. Liow. Integrating region growing and edge detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(3):?25—

233, 1990.

S. Peleg and H. Rom. Motion based segmentation. In Proc. International

Conference on Pattern Recognition, pages 109—113, Atlantic City, NJ, 1990.

A. P. Pentland. Perceptual organization and the representation of natural form.

Artiﬁcial Intelligence, 28:293—331, 1986.

E. Persoon and K. S. Pu. Shape discrimination using Fourier descriptors. IEEE

Transactions on Systems, hian, and Cybernetics, 7:170—179, 1977.

D. B. Phillips and A. F. M. Smith. Dynamic image analysis using Bayesian
shape and texture models. In K. V. Mardia and G. K. Kanji, editors, Statistics
and Images: 1, pages 299—322. Carfax Publishing Company, Abingdon, UK,

1993.

G. Piccioli, E. de Micheli, P. Parodi, and M. Campani. Robust road sign
detection and recognition from image sequences. In Proc. IEEE Intelligent

Vehicles Symposium, pages 278—283, Paris, France, 1994.

P. Planet, D. Pellerin, and J. Hérault. Classiﬁcation automatique de vehicules

aux peages autoroutiers. Recherche Transports Se'curite’, pages 13—24, 1992.

[15.11

[1:35]

[156]

[157]

[158]

[159]

[160]

[161]

L. Priese, V. Itehrmann. R. Schian, R. Lakmann. and L. Bilderkennen. Traffic
sign recognition based on color image evaluation. In Proc. IEEE Intelligent

Vehicles Symposium. pages 95—100. Tokyo. Japan. 1993.

N. S. Raja and A. K. Jain. Obtaining generic parts from range images using a

multi-view representation. CVGIP: Image Understanding, 60(1):44—64, 1994.

V. Ramesh, R. M. IIaralick, X. Zhang, D. C. Nadadur, and K. Thornton. Auto-
matic selection of tuning parameters for feature extraction sequences. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, pages 672—677,

Seattle. WA, 1994.

S. Ranade and A. Rosenfeld. Point pattern matching by relaxation. Pattern

Recognition, 12:269—275, 1980.
A. Rosenfeld and A. C. Kak. Digital Picture Processing. Academic Press, 1982.

L. Saxton and G. Schoene. Federal IVHS program initiatives resulting from
ISTEA of 1991. In Proc. International Congress on Transportation Electronics,

pages 319—328, 1992.

M. Schwarzinger, T. Zielke, D. Noll, M. Brauckmann, and W. von Seelen.
Vision-based car following: detection, tracking, and identiﬁcation. In Proc.

IEEE Intelligent Vehicles Symposium, pages 24—29, Detroit, MI, 1992.

I. K. Sethi and W. L. Brillhart. Trafﬁc analysis including non-conforming be-
havior via image processing. In Proc. Conference on Vehicle Navigation and

Information Systems, pages 193—201, Dearborn, MI, 1991.

[162]

[163]

[164]

[165]

[166]

[167]

[168]

[169]

2:36

L. G. Shapiro. J. I). Moriarty. R. M. I-Iaralick. and P. G. Mulgaonkar. Matching
three—dimensional objects using a relational paradigm. Pattern Recognition,

17(4y335—405.1934.

A. Shio and J. Sklansky. Segmentation of people in motion. In Proceedings of

the IEEE workshop on Visual .IIotion. pages 325—332, Princeton, NJ, 1991.

S. P. Smith and A. K. Jain. Chord distribution for shape matching. Computer

Graphics and Image Processing, 20:259—271, 1982.

E. Sorensen and H. I. Christensen. Monitoring of road trafﬁc. In Proc. Scandi-
navian Conference on Image Applications. pages 1085—1091, Tromso, Norway,

1993.

F. Stein and G. Medioni. Structural indexing: Efficient 3-D object recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(2):125—

145, 1992.

J. Stewman and K. Bowyer. Creating the perspective projection aspect graph
of polyhedral objects. In Proc. IEEE International Conference on Computer

Vision, pages 494—500, Tarpon Springs, FL, 1988.

G. C. Stockman. Object recognition and localization via pose clustering. Com-

puter Vision, Graphics and Image Processing, 40:361—387, 1987.

G. C. Stockman. Object recognition. In R. C. Jain and A. K. Jain, editors,

Analysis and Interpretation of Range Images, pages 225—253. Springer-Verlag,

1990.

 

[170]

[171]

[174]

[175]

[176]

[177]

[178]

‘257

C. Storvik. A Bayesian approach to dynamic contours through stochastic sam-
pling and simulated annealing. IEEE Transactions on Pattern Analysis and

zl/Iachine Intelligence. 16(10):976-986, 1994.

P. N. Strenski and S. Kirkpatrick. Analysis of finite length annealing schedules.

Algorith mica, 6:346-~366, 1991.

M. J. Swain and D. H. Ballard. Color indexing. International Journal of

Computer Vision, 7(1):11—3‘2, 1991.

Y. Tanaka, K. Kanayama, and H. Sugimura. Travel-time data provision sys—
tem using vehicle license number recognition device. In Proc. IEEE Intelligent

Vehicles Symposium. pages 353—358, Detroit, MI, 1992.

W. B. Thompson. Combining motion and contrast for segmentation. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 2(6):543—549, 1980.

W. B. Thompson and S. T. Barnard. Low-level estimation and interpretation

of visual motion. Computer, 14:20—28, 1981.

W. B. Thompson and T.-C. Pong. Detecting moving objects. In Proc. IEEE In-
ternational Conference on Computer Vision, pages 201—208, London, England,

1987.

C. Thorpe, M. H. Hebert, T. Kanade, and S. A. Shafer. Vision and navigation
for the Carnegie—Mellon Navlab. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 10(3):362—373, 1988.

C. E. Thorpe. Vision and Navigation: The Carnegie Mellon Navlab. Kluwer

academic publishers, 1990.

[179]

[130]

[181]

258

J. Ton and A. K. Jain. Registering Landsat images by point matching. IEEE

Transactions on Geoscience and Remote Sensing, 27(5):642—651, 1989.

B. Ulmer. \I'ITA II - Active collision avoidance in real traffic. In Proc. IEEE

Intelligent I’ehicles Symposium, pages 1—6. Paris, France, 1994.

B. A. VVandelI. The synthesis and analysis of color images. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 9(1):2—13, 1987.

[182] J. Weng. Windowed Fourier phase: Completeness and signal reconstruction.

IEEE Transactions on Signal Processing, 41(2):657-666, 1993.

[183] J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE

[184]

[185]

[186]

[187]

Transactions on Pattern Analysis and h‘Iachine Intelligence, 14(8):806—825,

1992.

M. Xie, L. Trassoudaine, J. Alizon, M. Thonnat, and J. Gallice. Active and
intelligent sensing of road obstacles: Application to the European Eureka-
Prometheus Project. In Proc. IEEE International Conference on Computer

Vision, pages 616—623, Berlin, Germany, 1993.

G. Xu. Is a single image sufﬁcient as object model for recognition? Technical

report, Osaka University, 1992.

Y. Yakimovsky. Boundary and object detection in real world images. Journal

of the Association for Computing Machinery, 23:599-618, 1976.

S. Yalamanchili, W. N. Martin, and J. K. Aggarwal. Extraction of moving
object descriptions via differencing. Computer Graphics and Image Processing,

18:188—201, 1982.

 

 

 

[188]

[189]

[190]

[191]

[192]

[193]

259

K. Yoshino. M. Miyahara. K. Sakai, Y. Ito, and T. Negishi. In-tunnel traffic
flow measuring system using ITV cameras. In Proc. Conference on Vehicle

Navigation and Information Systems, pages 311-310, Dearborn. MI, 1991.

A. Yuille, D. Cohen. and P. Hallinan. Feature extraction from faces using
deformable templates. In Proc. IEEE Conference on Computer Vision and

Pattern Recognition, pages 104—109, San Diego, CA, 1989.

A. L. Yuille, P. W. Hallinan, and D. S. Cohen. Feature extraction from faces us-
ing deformable templates. International Journal of Computer Vision, 8(2):99—

111, 1992.

J. Y. Zheng, M. Barth, and S. Tsuji. Qualitative route scene description using
autonomous landmark detection. In Proc. IEEE International Conference on

Computer Vision, pages 558—562, Osaka, Japan, 1990.

Y.-J. Zheng, W. Ritter, and R. Janssen. An adaptive system for traffic sign
recognition. In Proc. IEEE Intelligent Vehicles Symposium, pages 165—170,

Paris, France, 1994.

G. Zimmermann and R. Kories. What an autonomous robot can learn about its
environment starting from scratch using image sequences? In A. K. Jain, edi-
tor, Real- Time Object [Measurement and Classiﬁcation, pages 215-226. Springer

Verlag, 1988.