.2) u.
19"

.k
‘19

"a“.
“l
t’ ‘1;

V

2...:
31‘? r 1.
cf .
3.... ;
in.

. Iv

, ﬁx.
7’: W. ‘V a
. .1 LI 1"

a i
v13
..J....\:1
.r. z. I

.61; l.
S t .
v :H 9
V C
.21.. 3.1
» :a..»... x
.3\>b~d.v
2- 2.3. :1
2:“!
t 5.!

I.
s.

1.3.
6.1.») 1.. 1:
1:!

iv.
6!.
. L\\\
&.hian0iz‘.\..
1 y ‘20
It

3::
Iii.
c.

 

.I
~‘
.9. .ittxl
. r» x... .

‘AQL

V...-

..

i‘ a.

. u.) .:.
GK.

 

 

 

 

 

 

river’s. .3“-
. ca; ..1

 

it?

 

MIC HCIGAN STATE

IIIIIIIIIIIILII I

 

 

 

 

 

 

 

 

IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

01405 6422

This is to certify that the

dissertation entitled

ON RECOGNIZING AND TRACKING 3D CURVED
OBJECTS FROM 2D IMAGES

presented by

Jim-Long J. Chen

has been accepted towards fulﬁllment
of the requirements for

Ph.D. Computer Science

degree in

 

 

 

/Wj’£a€ Jar/7W

Major professor

Date 24 4,0? 9é

MS U is an Afﬁrmative Action/Equal Opportunity Institution 0-12771

 

LIBRARY
Michigan State
UnIversIty

 

 

 

PLACE IN RETURN BOX to roman this Checkout Irom your record.
TO AVOID FINES Mum on or Moro do" (100.

 

 

4

DATE DUE DATE DUE DATE DUE

4
“I,

 

 

I

 

 

 

 

 

__ IL:
I

:II;,:II;I

 

 

 

 

 

 

 

 

MSU I. An AIﬂnnotIvo Action/Equal Oppommlty lmtItqun
7 Wat-3:1

ON RECOGNIZING AND TRACKING 3D CURVED
OBJECTS FROM 2D IMAGES

By

Jin-Long J. Chen

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Computer Science Department

1996

ABSTRACT
ON RECOGNIZING AND TRACKING 3D CURVED
OBJECTS FROM 2D IMAGES
By
Jin-Long J. Chen

This thesis addresses the problem of recognizing arbitrary curved 3D objects from
single 2D intensity images. We propose a model-based solution within the alignment
paradigm involving three major schemes — modeling, matching, and indexing.

The modeling scheme consists of constructing model aspects for predicting the
object contour seen from any viewpoint. The indexing scheme generates from image
features hypotheses giving candidate model aspects and poses. A hypothesis group-
ing and ordering method is used to order model hypotheses based on prior knowledge
of pre—stored models and the visual evidence of the observed objects such that the
most likely model hypotheses are tested ﬁrst. The matching scheme aligns candidate
model edgemaps to the observed object edgemap and the results of alignment are
used to support/refute model hypotheses. Due to the unavailability of salient fea-
tures in objects with sculptured surfaces, matching is carried out by the Newton’s

method with Levenberg—Marquardt minimization. A hierarchical veriﬁcation strategy

is incorporated in the matching scheme to quickly eliminate false model hypotheses
from further consideration. Once a correct model is localized, a recognition success
is declared and the whole veriﬁcation procedure is terminated. If all candidate model
hypotheses are refuted, a recognition failure is reported. When combined into an in-
tegrated system, these three schemes make progress toward improving accuracy and
efficiency by pruning false model hypotheses and minimizing unnecessary verification
tests.

A prototype implementation has been tested in experiments conducted on a
database containing 658 model aspects to evaluate the performance of recognition
on 20 arbitrary curved objects, either non-occluded or partially occluded, seen from
several viewpoints. Bench tests and simulations show that a large database of many
kinds of objects including polyhedra and sculptured objects can be handled accu-
rately and efﬁciently. We have also applied this model-based alignment paradigm to
tracking a single moving object in a scene from an image sequence. Experimental
results indicate the viability of this tracking method. From these results, we conclude
that the proposed recognition—by-alignment paradigm is a viable approach to object

recognition and tracking.

To my mentor Joseph C. Berston

iv

ACKNOWLEDGMENTS

I would like to take this opportunity to thank my major advisor, Professor George
C. Stockman, for his constant encouragement, continuous ﬁnancial support, and ex-
cellent guidance throughout the course of this work. His efforts in making this learning
experience worthwhile are deeply appreciated.

Special thanks are reserved to my dissertation committee members, Professor
Anil K. Jain, Professor John Weng, and Professor Herbert Reynolds. The valuable
suggestions and comments extended by both Professor Anil K. Jain and Professor
John Weng brought this study to a successful completion. A particular debt of
gratitude is owed to Professor Herbert Reynolds for providing ﬁnancial support in
the early stage of my study at Michigan State University and also for serving on my
dissertation committee.

I also wish to thank my fellow Computer Science graduate students at the Pat-
tern Recognition and Image Processing Laboratory past and present — Jian-Chang
Mao, Sharathcha Pankanti, Qian Huang, Marie-Pierre Dubuisson, Tim Newman,
Sally Howden, Chitra Dorai, Yu Zhong, Yuntao Cui, Shaoyun Chen, Dan Swets,
Karissa Miller, and many others for their assistance and friendship and for making

V

my stay here a very enjoyable one. In addition, I am grateful to Jason Sperber
for his help in data acquisition. Financial support for this research from the Texas
Instruments is gratefully acknowledged.

Finally, as the important people in my life, I would like to express my deepest grat-
itude to my parents for their understanding, love, and moral support and in particular

my mentor, Joseph C. Berston, for his everlasting concern and encouragement.

vi

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction

1.1 Object Recognition by Alignment ...........
1.2 Problem Deﬁnition ...................
1.3 Matching Techniques Related to Alignment .....
1.4 Contributions of the Thesis ..............
1.5 Organization of the Thesis ...............

2 Modeling 3D Objects from Multiple 2D Images
2.1 Introduction .......................

2.2 Related Background ..................
2.3 Constructing Object Models ..............
2.3.1 Modeling the Object Silhouette ...........
2.3.2 Representing Internal Edges .............
2.3.3 Model Construction .................
2.4 Implementation Details .................
2.4.1 Aspect Model Acquisition ..............
2.4.2 Edgemap Extraction .................
2.4.3 Matching Model Edge Elements ...........
2.5 Experimental Results ..................
2.5.1 Error Analysis from an Ellipsoid ..........
2.5.2 Rotation Axis Effect .................
2.5.3 Aspect Breadth Effect ................
2.5.4 Model Alignment ...................
2.6 Summary ........................

3 Matching 3D Object Models to 2D Images
3.1 Introduction .......................

3.2 Related Background ..................
3.3 Matching Model Edgemaps to Observed Edgemaps .
3.3.1 Newton’s Method and Least-Squares Minimization

3.3.2 Heuristics used in Alignment ............
3.3.3 The Sensitivity of Pose Parameters to Object Shape
3.3.4 The Basin of Convergence ..............

vii

00000000000

ix

X

1
2
4
11
16
19

25
25
28
31
35
37
38
40
41
44
49
53
54
56
57
60
61

66
66
68
71

viii

3.4 Experimental Results ............................. 94
3.4.1 Results from Model Fitting Experiments ................. 96
3.4.2 Results from Monte Carlo Experiments .................. 99
3.5 Summary ................................... 106
4 Indexing to Model Aspects 108
4.1 Introduction .................................. 108
4.2 Related Background ............................. 112
4.3 General Framework for Indexing ....................... 115
4.3.1 Part Segmentation ............................. 115
4.3.2 Part Indexing ................................ 122
4.3.3 Hypothesis Grouping ............................ 129
4.3.4 Hypothesis Ordering ............................ 131
4.4 Experimental Results ............................. 138
4.4.1 The Indexing Result ............................ 139
4.4.2 Recognition Results ............................ 142
4.5 Summary ................................... 149
5 Object Recognition and Tracking 151
5.1 Object Recognition .............................. 152
5.1.1 Hierarchical Verification .......................... 153
5.1.2 Monte Carlo Experiments ......................... 156
5.1.3 Recognition Results ............................ 162
5.2 Motion Tracking ............................... 177
5.2.1 Previous Work ............................... 177
5.2.2 Proposed Method .............................. 179
5.2.3 Motion Segmentation ............................ 180
5.2.4 Experimental Results ............................ 184
5.3 Summary ................................... 192
6 Summary, Conclusion, and ﬁxture Research 195
6.1 Summary and Conclusions .......................... 195
6.2 Future Research ................................ 199

BIBLIOGRAPHY 203

2.1
2.2
2.3

3.1
3.2
3.3
3.4
3.5
3.6

4.1
4.2
4.3
4.4
4.5

5.1

5.2

5.3
5.4

5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12
5.13
5.14

LIST OF TABLES

The number of model aspects of 3D objects. ................ 43
The prediction error of the vehicle in Figure 2.14. ............. 53
Experimental Results of Model Alignment .................. 62
The partial derivatives of N with respect to 0, and d) ............ 76
The partial derivatives of R with respect to a, 0 and 43 ........... 77
Results from Model Fitting. ......................... 99
Convergence and Error Analysis: Edgemap vs. Silhouette. ........ 100
Convergence and Pose Accuracy vs. Et. .................. 104
Convergence and Error Analysis: Occlusion Effect. ............ 106
Some statistics for Data Set I14 with 578 test objects. ........... 139
The effect of using additional variant attributes in indexing. ....... 141
Hypothesis pruning via pose consistency grouping .............. 142
Performance of hypothesis voting schemes .................. 143
Results of model indexing and hypothesis veriﬁcation ............ 146
Number of veriﬁcations in ﬁtting correct model aspects to images as a

function of E. and 1m”. ......................... 157
Number of veriﬁcations in ﬁtting wrong models to images as a function of

E; and [max ................................. 158
Incorrect model-pose hypotheses that survived at the coarse level ..... 159
Number of veriﬁcations in ﬁtting models to their neighboring aspects as a

function of E, and Imp ......................... 160
Parameters for veriﬁcations at various levels ................. 161
Hierarchical veriﬁcation for model aspect: gorilla20 ............. 163
Recognition results for 28 non-occluded objects. .............. 168
Recognition results for 32 occluded objects. ................ 170
The effect of visual contour on indexing. .................. 172
Recognition results of objects not in the database .............. 173
Recognition results for 4 occluded objects. ................. 175
Results of motion tracking for a squirrel carving. ............. 185
Motion tracking across aspects. ....................... 191
Results of tracking the Ford Taurus. .................... 192

ix

LIST OF FIGURES

1.1 Examples of scenes containing jumbles of objects. ............ 5
1.2 The computation paradigm of the proposed recognition system. ..... 9
1.3 The major tasks of the proposed recognition paradigm ........... 10
1.4 Prototype edgemaps for construction of a model aspect ........... 20
1.5 Steps in the matching algorithm. ...................... 22
1.6 Hypothesis generation and pruning using part evidence ........... 23
1.7 Object tracking of image sequences of a moving squirrel. ......... 24

2.1 The viewing sphere can be tessellated into a regular polyhedra, each face

deﬁning an aspect. ............................ 32
2.2 Model aspect made from 5 overlapping images. .............. 33
2.3 Example input for construction of a model aspect .............. 34
2.4 The curvature method ............................. 36
2.5 Building a model using three images. .................... 39
2.6 The tessellation of the viewing sphere. ................... 42
2.7 The tripod for model construction. ..................... 43
2.8 Examples of model aspects of 3D objects. ................. 45
2.9 54 aspects of a plastic sprayer ......................... 46
2.10 54 aspects of a plastic lion toy. ....................... 47
2.11 The edgemap extraction algorithm. ..................... 48
2.12 The extracted edgemap of a pencil sharpener. ............... 48
2.13 The stereo matching algorithm for modeling internal edges ......... 50
2.14 The results from the stereo matching algorithm ............... 52
2.15 The error of the curvature method as a function of the aspect ratio f:- and

the rotation angle a. ........................... 56
2.16 An aspect of a superquadric model and its four test data .......... 58
2.17 The error of the curvature method as a function of the rotation axis and

the rotation angle for the object in Figure 2.16(a) ............ 58
2.18 The error of the curvature method as a function of aspect breadth for the

object in Figure 2.16(a). ......................... 59
2.19 Deformation of the images with respect to the predicted model. ..... 61
2.20 Examples of model alignment ......................... 63
2.21 More examples of model alignment ...................... 64
3.1 Synthesizing local point correspondences ................... 81
3.2 The edgemap alignment algorithm. ..................... 83
3.3 Matching edgemaps in an 1:2 scale ...................... 84

X

3.4
3.5

3.6
3.7
3.8
3.9
3.10
3.11
3.12
3.13

4.1
4.2

4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11
4.12

5.1
5.2
5.3
5.4
5.5
5.6
5.7
5.8
5.9
5.10
5.11
5.12
5.13

xi

The rotation of a smooth object. ...................... 85
A scene where the matching algorithm fails to recover the pose parameter

a ....................................... 87
The basin of convergence. .......................... 92
The effect of E't on the basin of convergence ................. 93
Steps in the model ﬁtting algorithm. .................... 95
Fitting a polyhedral model to a partially occluded object. ........ 96
Model ﬁtting examples of sculptured objects ................. 97
Model ﬁtting examples of occluded objects. ................ 98
Test objects used for studying the basin of convergence ........... 101
Objects for studying the effect of occlusion. ................ 105
The computational paradigm of the proposed indexing scheme ....... 111
Hoffman and Richard’s theory of curve partitioning: joining parts gener-

ally provides concavities in the resulting silhouette. .......... 117
The primitive codon types proposed by Hoffman and Richard. ...... 118
Segmentation of object silhouette into parts ................. 119
Examples of part segmentation on a variety of shapes. .......... 120
Part evidence survives occlusion. ...................... 121
Part segmentation survives noise and scaling. ............... 121
Part segmentation survives 2D transformation and 3D view distortion. 123
The effect of the number of object parts on voting schemes. ....... 140
Comparison of the performance of hypothesis voting schemes. ...... 144
Examples of hypothesis veriﬁcation via model fitting. ........... 147
More examples of hypothesis veriﬁcation via model ﬁtting. ........ 148
The hierarchical verification strategy. .................... 155
Examples of recognition ............................ 164
Examples of recognition ............................ 165
Examples of recognition ............................ 166
Objects for recognition (not existing in the database). .......... 174
Examples of false dismissal. ......................... 176
Results of the motion segmentation ...................... 183
Motion tracking of image sequences of a moving squirrel .......... 186
The evolution of parameter estimates during motion tracking of the squirrel. 187
Motion tracking on the Taurus image sequence. .............. 188
Tracking Ford Taurus from two model aspects ................ 190
Tracking Ford Taurus by changing model aspects. ............. 190
Tracking a moving vehicle in a parking lot .................. 194

Chapter 1

Introduction

One of the central problems of computer vision is the automatic recovery of the
structure and properties of three-dimensional (3D) objects in a scene from a single or
multiple two-dimensional (2D) images. The structure and properties sought include
the identities, positions, and orientations of objects. The model-based recognition
paradigm has emerged as a promising approach to this problem in that stored models
can be used to match against objects in the scene. This model-based recognition
paradigm is supported by human eye-brain biological processes where the image of a
scene is perceived by the eyes and the retinotopic patterns of objects in the scene are
recalled from memory [118]. This object recognition paradigm used to model human
biological processes [98, 118] or computer vision processes (8.9., [28, 42, 92]) assumes
some internal representation of visual input and some “computational” capacity to
match it against representations of models known to the viewer. Consequently, model-
based object recognition systems can be distinguished or classiﬁed according to their
model representation schemes and their object matching techniques [12, 31].

1

2

This chapter is organized with the next section introducing the problem of recog-
nizing 3D objects from 2D images. A few areas where opportunities of improvement
over current recognition systems exist are also outlined. Section 1.2 describes the
approach proposed in this thesis toward the object recognition problem. Since this
thesis is about model-based object recognition by alignment, Section 1.3 surveys some
of the most popular matching techniques related to alignment in use by the computer
vision community. Section 1.4 outlines the major contributions of this thesis. The

last section explains the organization and approach of this thesis.

1.1 Object Recognition by Alignment

Recognizing 3D objects from single 2D images has been an active research topic in
computer vision since its inception. A practical solution to this problem has many
applications such as inspection, automation, manipulation, navigation, security, etc.,
and will greatly impact the ﬁeld of intelligent robotics. Over the past twenty years,
many researchers have studied the object recognition problem and have made progress
in building experimental recognition systems. Most successful recognition systems use
polyhedra as object models and rely on matching special object features to features
extracted from the image. The most common strategy adopted in the computer
vision community is known as alignment [60] or hypothesize-and-verify [4, 35, 42,
55, 76, 117]. This strategy divides the problem into two parts: ﬁrst hypothesizing
correspondences of features extracted from both the object model and the image,

and then verifying the hypotheses. The veriﬁcation process requires the position and

3

orientation (i.e., pose) of the hypothesized object be computed. Many researchers
have developed techniques to make the search for correspondences more efficient by
applying photometric and geometric constraint and have also studied the problem of
pose estimation from feature correspondences. Despite these efforts, lack of efficiency
and accuracy in alignment and the capacity of handling objects with arbitrary curved
surfaces has caused current recognition systems to be less widely accepted in practice.

We briefly discuss a few areas where new research is needed.

Simple Object Model: The most dominant reason for the popularity of polyhedra
[40, 48, 60, 76] is that edge contours are relatively stable over small changes in view-
point such that reliable geometric invariants / features can be extracted for hypothesis
generation and veriﬁcation purposes. Unfortunately, the world is not simple and poly-
hedral; it is complex and full of arbitrary curved objects whose image features are
much more difﬁcult to interpret. To be widely used in practice, recognition systems
must accommodate sculptured surfaces and large model databases. Most researchers
in computer vision have focused on object models with simple surfaces, namely poly-
hedra [40], quadric surfaces [108], superquadrics [5, 89], or super-ellipsoids [50, 88],
or with slightly more complex ones such as algebraic surfaces [69, 64]. These surfaces
are not expressive enough to represent sculptured objects. Even when sculptured
objects are modeled by Boolean combinations of these primitives, they might not be

smoothly sewn together.

Inefficient Veriﬁcation Order: Hypothesis-generation procedures must compen-

sate for inaccurate features extracted from the image (6.9., due to noise or view

4

variation) by loosening constraints, thus increasing the number of incorrect hypothe-
ses that are generated. Accordingly, optimizing the order in which the hypotheses
are veriﬁed has become extremely important for the efficiency of recognition systems.
Systems that do not optimize the ordering of hypotheses tend to verify more hypothe-
ses than they should. Few systems make proper use of prior knowledge to guide the

order of the veriﬁcation so that the most likely hypotheses are veriﬁed ﬁrst.

Inaccurate Pose Estimation: Correct veriﬁcation of a hypothesis depends on ac-
curate localization of the hypothesized object. Recognition systems that use corre-
spondences of minimal sets of features to locate a hypothesized object often suffer
from an inaccurate pose estimate. One remedy is to increase the tolerances for the
veriﬁcation decision; however, this often leads to more recognition errors. The other
option is to reﬁne the pose estimate using more available image features: unfortu-
nately, approaches to such a pose reﬁnement are often sensitive to missing features
in the image (e.g., due to partial occlusion), indicating that the reﬁned pose estimate

may not be an improvement unless the number of missing features is insigniﬁcant.

1.2 Problem Deﬁnition

In this thesis, we propose a model-based recognition system capable of handling a
large database of 3D objects with sculptured (“free-form”) surfaces. The proposed

recognition system intends to solve the following problem:

Given a set of object models and an intensity image of a scene containing

a jumble, speciﬁc objects in the scene are to be recognized based on the

given single image.

This is the problem of recognizing objects with unknown 3D pose from a single 2D
image. The object of interest may be partially occluded by other scene objects and
the object models include both polyhedra and arbitrary curved objects. Examples
of scenes are shown in Figure 1.1. The proposed recognition system ﬁts within the
theoretical framework for the recognition-by-alignment paradigm espoused by Ull—
man [111] and Lowe [75]. Alignment is a two—stage process. Given a model object
and an image, the ﬁrst stage is to hypothesize an aligning transformation that would
bring the model object to a position and orientation in space that corresponds to
the projected image. The second stage is to verify the hypothesis by applying the
aligning transformation to the object model and then comparing the predicted ap—
pearance with the actual image. The degree of the match is used to refute/ support

the hypothesis, i.e., determine whether the image is in fact an instance of the model.

 

Figure 1.1: Examples of scenes containing jumbles of objects.

The proposed approach toward object recognition strives to exploit prior knowl-

edge to improve recognition performance and accuracy. Improvements are necessary

6

for object recognition systems to be practical for real applications. The proposed
approach toward improving efﬁciency and accuracy of the recognition is based on (1)
maximizing the use of prior knowledge to reduce unnecessary hypothesis veriﬁcations
and (2) reﬁning the pose estimate in a hill-climbing fashion to constrain the search.
We represent prior knowledge using feature saliency in pre—stored models and the
feature visual evidence in the images. For efﬁcient recognition, likelihood measures
based on prior knowledge are applied to select and order hypotheses for veriﬁca-
tion; they help remove relatively unlikely hypotheses from consideration and enable
the proposed recognition system to verify the most likely hypotheses ﬁrst. Robust
pose reﬁnement increases the accuracy of the pose estimate, and thus enhances the

accuracy of the veriﬁcation test as well.

The proposed object recognition system comprises three primary techniques: com-

plex object modeling, probabilistic indexing, and robust matching.

Complex Object Modeling: The proposed modeling scheme builds / learns 3D
models from sets of 2D images taken at controlled viewpoints. Our approach accounts
for the changes of contour features over smooth object surfaces due to small changes in
viewpoint. Curvature information about object surfaces is learnt from a set of images
arising from the object, enabling the modeling scheme to generate the object’s edge
appearance seen from any viewpoint. Consequently, the proposed modeling scheme

can handle both polyhedra and sculptured objects.

Probabilistic Indexing: The purpose of an indexing scheme is to use some im-

age features to hypothesize candidate models and candidate poses. The proposed

7

indexing scheme generates hypotheses based on local geometric features because they
are insensitive to partial occlusion. The proposed indexing scheme also increases the
tolerances of indexing features to accommodate noise and view variation. The prior
knowledge about the structure of pre—stored models and the visual evidence of image
features is used to compute the relative likelihoods of the hypotheses. The likeli-
hoods of these hypotheses are used to select hypotheses for veriﬁcation by removing
hypotheses with little supporting evidence while retaining those with strong support-
ing evidence. This means that incorrect hypotheses may be pruned because of their
small likelihoods. The resulting hypotheses are ordered by their likelihoods so that

the most likely hypotheses are tested ﬁrst.

Robust Matching: The proposed matching technique reﬁnes the pose estimate in
a hill-climbing fashion to constrain the search for correspondences between model
and image features. We adopt a least—squares minimization approach—similar to the
approach in [77]—to reliably estimate the object pose. This minimization technique
uses heuristics to robustly estimate pose: it is shown to be relatively insensitive to

outliers occurring due to partial occlusion.

The Recognition Task

The recognition problem addressed in this thesis is to identify and localize an arbitrary
curved 3D object from a single 2D view of a 3D scene (i.e., 3D-to-2D recognition).
Thus, an object has three translational and three rotational degrees of freedom that

must be recovered from 2D sensory data. The input to the recognition system is an

8

intensity image and a set of aspect models. The output from the recognition system is
an identity of the observed object and a transformation that best aligns the selected
model with the observed image features.

The imaging process is assumed to be approximated by weak perspective projec-
tion, i.e., orthographic projection plus a scale factor. It is also assumed that objects
are rigid. Under these assumptions, there are six unknown parameters for the image
of an object under a rigid-body transformation: three for rotation, two for translation,
and one for scaling [60].

Edge contour features are used for generating and verifying hypotheses. To allow
occlusion, only portions of edge contours are used for generating and verifying hy-
potheses of candidate models and their pose in an image. Thus, it is assumed that
objects are identiﬁable based on the shape of their edge contours. Since the proposed
recognition system can handle objects that have smoothly changing surfaces, the edge
contours are allowed to be slightly different over small changes in vieWpoint.

A block diagram in Figure 1.2 depicts the proposed recognition paradigm and some
major tasks addressed by this research. The modeling scheme consists of constructing
model aspects. The indexing scheme generates from image features hypotheses giving
candidate model aspects and poses which are then veriﬁed in the matching scheme
to refute/ support the hypotheses and also to estimate/ reﬁne the object pose of the
correct model. The localization of a correct model implicitly indicates the recognition
of the model. Once a correct model is localized, the whole veriﬁcation procedure is
terminated. If all candidate model hypotheses are refuted, a recognition failure is

reported.

Sequence of images
from scene with

Recognition

  
 
 

 
   
   

lrnages of model objects
seen from various

   

 

 

I I l
I : I
I I '
I I I
I I l
I I I
I l I
I I :
: moving objects : viewpoints I
I : .
I I |
I I I
l I I
I I |
I ' I
: Feature Extraction I Feature Extraction :
I j '
: ' I
I
i I Edge Contours :
l
I I I
' I
I Edge Indexms I Object Modeling '
: Contours Primitives I I
: (Object Silhouette) I I
I
I I i
I I
| I I
: v I Ob. I
| 100! l
I Hypothesis . Aspect Model I
' ‘ ' Database
: Generation : I
I I I
I I I
' Candidate Objects I '
: I : I
|
I Object : 2 1/2 D Edgemap |
I I I
I
I : .

     
 

 

 

  
    
  
     
   

 

Veriﬁcation

Figure 1.2: The computation paradigm of the proposed recognition system.

10

An example of the recognition of a partially occluded polyhedral object is given in
Figure 1.3. Figure 1.3(a) shows a typical scene containing two objects: a polyhedra
and a Y pipe. Figure 1.3(b) shows the edgemap of the image obtained from the feature
extraction. Figure 1.3(e) shows the model (which is a 2%D edgemap where the 3D
information is embedded in each edge element) obtained from the object modeling
scheme. Figure 1.3(d) depicts the evolution of convergence during the matching.
Figure 1.3(e) shows the ﬁt edgemap superimposed on the scene image, indicating
that the polyhedron is recognized and located. Figure 1.3 does not show the indexing

task, which is difﬁcult to demonstrate by pictures.

   

(d)

Figure 1.3: The major tasks of the proposed recognition paradigm. (a) A scene
image containing an occluded polyhedra and a Y pipe. (b) The edge map of the
scene obtained the feature extraction. (c) The polyhedra model obtained from object
modeling is used for matching (note that the 3D information is embedded in each
edge element). ((1) The evolution of convergence during the matching. (e) The ﬁtted
edgemap shown superimposed on the scene image. (Note that the indexing task is
not shown here.)

11
1.3 Matching Techniques Related to Alignment

All model-based recognition systems organize the recognition problem as a search for
a match: speciﬁcally, how to associate features extracted from the sensory data with
corresponding features extracted from the model [47]. Models can be constructed us-
ing salient geometric features such as arcs, lines, and corners. These features are local
in nature and provide locational and rotational information to be used for comput-
ing the object pose and some contextual information for constraining the matching
possibilities [104]. Thus, one way to structure the search problem is to consider the
space of all possible matches of sensory features and model features. This space is
deﬁned as correspondence space [47]. An alternative is to directly search for a trans-
formation in parameter space which causes each model feature to take on a position
in the transformation coordinate system that coincides with the position of the cor-
responding sensory feature. Below, we brieﬂy survey some of the existing matching
techniques in the literature related to alignment for the purpose of object recognition.
More reviews of related work can be found from surveys by Besl and Jain [12], Chin

and Dyer [31], and Suetens et al. [106].

Interpretation Tree

The interpretation tree (IT) search is probably the most prominent recognition
paradigm; see Grimson and Lozano—Perez [48]. The idea of interpretation tree search
is to explore all possible correspondences of features between the sensory data and

the model. A separate search is involved for each model, thus the search technique

12

itself does not involve indexing. Each node in the tree deﬁnes a matching pair of
model feature and sensory data feature, and each path (from root to leaf node) of
the tree deﬁnes an interpretation of the sensed features. An interpretation is consid-
ered to be consistent if there exists a rigid transformation which brings each model
feature to coincide with the corresponding sensory feature. One very straightforward
way of ﬁnding consistent interpretations is to simply do a systematic tree search of
all branches (e.g., backtracking or depth-ﬁrst search). This brute-force tree search is
exponential in the number of sensory features.

Several researchers have incorporated IT algorithms in alignment [17, 24, 40, 43,
49, 75]. The basic goal of these algorithms is to minimize the search by exploiting
some prior knowledge to efﬁciently prune and order the search. If the ﬁrst few chosen
matches are accurate, the IT/ Alignment algorithm can be very efﬁcient. Grimson
and Lozano—Perez [49] applied geometric (unary and binary) constraints to prune
the search. Bolles and his collaborators [16, 17] developed the Local Feature Focus
(LFF) method to avoid an exhaustive search. In this method, the most obvious fea-
ture matches (i.e., focus features) are used to guide the search for additional feature
matches that will add the most information to the current interpretation, thus re-
ducing the degrees of freedoms in the interpretation. Similar to the LFF method,
Faugeras and Hebert [40] used the rigidity constraint to select subsequent matches in
the interpretation tree. Flynn and Jain [43] used heuristic knowledge of the model
database to prune and order the interpretation tree for efﬁcient search. Lowe [75]
introduced the use of perceptual grouping of features to order the search. Camps [24]

took the approach of using probabilistic evidence to cut down the interpretation tree.

13

Interpretation tree search is a conservative approach since it considers all possibili-
ties in the correspondence space; however, this method is robust at the expense of

efﬁciency.

Geometric Hashing

While constrained search of the interpretation tree provides one method to reduce
the search time, an alternative would be to use invariant features (deﬁned on groups
of features of the sensory data and the model) as indices for the construction and
examination of tables containing the information needed to map the sensory fea-
ture to the model feature. This concept was introduced as “geometric hashing” by
Lamdan and Wolfson [71, 72]. The method contains aspects of both indexing and
matching and is based on the idea of representing an object by storing the object’s
transformation-invariant information in a hash table. At recognition time, similar in-
variants are derived from the sensory data, and are used to index into the hash table to
retrieve candidate correspondences with the model. Each candidate correspondence

constitutes a hypothesis whose validity is veriﬁed through pose information.

Pose Clustering

The pose clustering method [105, 103] uses aspects of both the correspondence space
and the parameter space. The basic idea behind this method is to consider all possible
pairs of corresponding sensed and model feature structures and then cluster on the

parameter space of the pose transformation derived from each sensed-model structure

14

pair. The feature structure may be composed of (a) 3 surface normals, (b) 2 edges,
or (c) 2 surface normals and 1 edge [104]. The task of matching is to put features in
each sensory-model structure pair into correspondence based on local evidence and
then derive the pose transformation in the process. Each pose transformation derived
deﬁnes a point in pose space where a vote is cast on behalf of this sensory-model struc—
ture pairing. A cluster of matching pairs in this space is a good indication that many
sensory data features are matched to corresponding model features. This method is
fast but can miss good solutions due to the rough approximation of geometric con-
straints. It also wastes signiﬁcant resources on regions of parameter space that could
be easily determined not to contain a solution. Breuel [20] used adaptive subdivisions
of parameter space to avoid such problems. He combined the idea of multiresolution
matching, Hough transform, search-based recognition, and bounded error recognition
into a simple, efﬁcient algorithm, whose performance is better than that of alignment

and pose clustering methods.

Alignment

Similar to the pose clustering method, the alignment method [60], or the hypothesize-
and-verify method [4, 35, 55, 76], also uses aspects of both the correspondence space
and the pose space. The basic idea is to ﬁnd a minimal set of correspondences of sen-
sory and model features sufﬁcient to determine the pose and then to use the rigidity
constraint to efﬁciently determine the other correspondences. For example, for 3D

object localization, Huttenlocher and Ullman [60] use three pairs of data and model

15

points to deduce the six parameters of the transformation. Each such transforma-
tion constitutes a hypothesis about the pose of the object. The veriﬁcation is done
by aligning the model with the sensory features, that is, applying the hypothesized
transformation to the model and using the transformed model to predict additional
features that might be evident in the sensory data. This guides the search for addi-

tional supporting matches for pose reﬁnement.

Pose Estimation via Model Fitting

A frequently required capability of an object recognition system is the localization
of identiﬁed objects in the image. Several researchers have developed techniques to
compute the object pose given the correspondences between model and image features
(pose estimation). Pose estimation techniques can be used by the alignment search
to generate an initial pose estimate. Typically, a few correspondences are used to
hypothesize the initial pose estimate; other correspondences are then found by local
search in the image, and the additional correspondences are used to reﬁne the pose
estimate. Pose estimation/ reﬁnement techniques normally involve minimization of
some error function over the free model parameters. Lowe’s [77] method minimizes
the least-squared error using a technique based on Newton’s method and Levenberg—
Marquardt minimization to iteratively compute projection and model parameters
that best ﬁt a 3D model to 2D image contours. Kriegman and Ponce [69] used an
elimination method to construct an implicit equation for image contours and then

determined the model parameters by ﬁtting the theoretical contour to the sensory

16

data. Ponce et al. [91] presented a global method for pose estimation by ﬁtting an
implicit algebraic surface to data points and a local method for pose reﬁnement us-
ing the entire image contour. Haralick et al. [51] explored the use of robust weight
functions in an iterative least—squared minimization for point-based pose estimation.
Kass et al. [63] deﬁned active contours, called snakes, that deform under “physical”
constraints in an energy minimization procedure in which the equilibrium state de-
termines the model parameters that yield the best match between the image and the

model.

1.4 Contributions of the Thesis

This thesis proposes a complete model-based object recognition paradigm that can
handle a large database of sculptured objects. A prototype implementation has been
developed which is complete in the sense that given an input image, it is capable
of detecting the identity and pose of an observed object. The system was designed
to study the feasibility of an alignment approach to recognize 3D sculptured objects
from 2D intensity images in a scene where the object of interest may be partially oc-
cluded. The system was also used to study applicability of the paradigm for tracking
a single moving object in a scene from an image sequence. The system was tested on
a database containing twenty 3D objects and eighty 2D objects with a total number
of 658 model aspects. Since most object recognition systems reported in the litera-
ture can recognize only a small number of either unoccluded polyhedral objects from

intensity images or curved objects from range images, the capacity of the new recog-

17

nition system to handle both polyhedra and sculptured objects in a more difﬁcult
scene is an important (overall) contribution of this thesis.

The ﬁrst contribution of this thesis lies in the implementation of the curvature
method proposed by Basri and Ullman [9] for object modeling. Our work both vali—
dates and extends the work of Basri and Ullman. We have explicitly included object
internal edges in addition to object silhouette for object modeling. A 3D object is
represented by a set of 2D images taken at controlled viewpoints. This model repre-
sentation is descriptive in the sense that the edge contour of an object viewed from
any given vieWpoint can be predicted by using only a small number of viewer—center
model aspects. Experimental results show the method is viable for modeling many
kinds of objects including polyhedra and arbitrary curved objects.

The second contribution of this thesis lies in the design of an iterative matching
technique and the use of two heuristics for matching. The pose estimation / reﬁnement
algorithm, which can handle objects with partial occlusion, does not assume the exis-
tence of salient features in the image, and thus, is directly applicable to smooth objects
with sculptured surfaces. Due to the unavailability of salient features for correspon-
dences, we develop an iterative matching technique for pose estimation / reﬁnement.
As with many other minimization techniques, this iterative matching technique re-
quires good initial parameter estimates to avoid converging to a local minimum. From
the experimental results, we demonstrate a high rate of convergence for a broad set of
initial parameter estimates. The two heuristics are used to improve the pose accuracy
and handle occlusion. The ﬁrst heuristic is adopted to synthesize correspondences in

order to reﬁne the pose parameters in the hill-climbing procedure and to discard

18

certain edge segments in order to allow for some occlusion. Since the silhouettes of
smooth objects have limited information, including internal edges in the ﬁtting can
help achieve more accurate pose. Thus, the second heuristic is used to maintain a
balance in the importance of alignment of silhouette and internal edges such that the
parameter ﬁtting is not overly biased to one type of edge contour.

The third contribution of this thesis lies in the use of part invariant features for in-
dexing and the use of hypothesis grouping and ordering for pruning false hypotheses.
We show that the proposed part representation is robust under global transformation
and in the presence of occlusion and spurious noise (i.e., local deformation). We have
incorporated the pose consistency constraint to group into clusters the (parts) hy-
potheses within the same model. We have also investigated various hypothesis voting
schemes which exploit prior knowledge of pre-stored models and the visual evidence
of the observed objects to determine the order of hypotheses for veriﬁcation. When
combined into a complete system, these techniques make progress toward improv-
ing accuracy and efﬁciency by pruning false hypotheses and minimizing unnecessary
veriﬁcation tests.

The fourth contribution of this thesis lies in the use of a hierarchical veriﬁcation
strategy. Although the indexing scheme has effectively pruned false model hypotheses,
the recognition system may still need to verify a few model hypotheses because the
system loosens the indexing attribute tolerance to allow noise and view variation in
the object silhouettes. This hierarchical verification strategy uses the ﬁtting error as
an indicator to reject false model hypotheses in the very early stage of the iterative

matching, resulting in a signiﬁcant reduction of the recognition time.

19

Another contribution of this thesis lies in the use of the recognition system for
object tracking. We have applied the recognition system for tracking a single moving
object in a scene from an image sequence. We have also studied how to track the
moving object across aspects, and demonstrated its applicability by tracking a real
moving car in an image sequence.

In Summary, the main contributions of this thesis are:

0 A complete object recognition paradigm that can handle partially occluded

objects with free-form surfaces in a database of 100 objects.

0 A model representation that is compact and can handle both polyhedra and

arbitrary curved objects.

0 A robust iterative matching technique that does not require feature correspon-

dences.
0 An indexing scheme that can prune false model hypotheses effectively.
0 A hierarchical veriﬁcation strategy that speeds the recognition.

0 The use of the recognition system for object tracking.

1.5 Organization of the Thesis

The remainder of this thesis is organized as follows. Chapter 2 presents our approach
toward object modeling. It sets forth a very compact multi-view representation of 3D

objects, and explains how this can be used to generate the object’s edge appearance

20

from any vieWpoint. The formal deﬁnition of model aspects is given. An example
of a model aspect is given in Figure 1.4. We also describe the setup of our imaging
conﬁguration for the acquisition of aspect models. Experimental results show that

the proposed modeling scheme can handle sculptured objects as well as polyhedra.

 

Figure 1.4: Prototype edgemaps for construction of a model aspect.

Matching 3D sculptured objects from a single 2D intensity image is treated in
Chapter 3. It is well-known that inferring 3D object shapes from projected 2D sil-
houettes may not yield a unique solution. Symmetric objects may generate the same
silhouette projections at various viewpoints. We propose an iterative matching tech-

nique which simultaneously ﬁts object internal edges and object silhouette to the

21

edge contours of an image. Two heuristics are adopted in this iterative matching
technique. One heuristic is used to synthesize correspondences in order to reﬁne the
pose in a hill—climbing fashion; the other heuristic is used to maintain a balance in the
importance of silhouette and internal edges during the ﬁtting. Results of ﬁtting on a
large set of real images are analyzed. We observe that ﬁtting simultaneously internal
edges and silhouette produces more accurate object pose than ﬁtting silhouette alone.
We conclude that our matching technique, although not perfect (mostly due to the
lack of a perfect segmentation algorithm), can reliably localize a partially occluded
3D object from an image. An example of matching is depicted in Figure 1.5.

In Chapter 4, we present an indexing scheme that operates on object silhouettes.
A robust part segmentation is introduced to decompose silhouettes into parts. The
mathematical formulas for part invariant features are speciﬁed. We explain how the
indexing scheme selects hypotheses using part invariant features. We present a hy-
pothesis grouping scheme which clusters consistent part hypotheses that stem from
the same model. We also present four different hypothesis voting schemes that order
model hypotheses based on the pre—compiled part saliency and the visual evidence
in the image. Experimental results indicate that the proposed indexing scheme gen-
erates a large number of model hypotheses, however, through hypothesis grouping
and ordering, a signiﬁcant number of veriﬁcation tests for false model hypotheses can
be avoided. An example is illustrated in Figure 1.6, where in our database of 658
model aspects, the scene of overlapped elephant and bear generates 5 model-aspect
hypotheses where elephant is ranked [11 and bear [[2 while the scene of occluded deer

and moose produces 98 hypotheses with moose ranking Ill and deer ranking [12.

22

 

Figure 1.5: Steps in the matching algorithm. (a) Intensity image synthesized from a
superquadric model. (b) Edgemap extracted from the image of (a). (c) The evolution
of convergence in the matching method. ((1) The ﬁtted edgemap shown superimposed
on the original image.

23

(a) (b) (c

@ IE %%

(e) (f) (g) 0‘)

Figure 1.6: Hypothesis generation and pruning using part evidence. (a)—(d) Parts
extracted from silhouette images of single bear, elephant, deer, and moose, respec-
tively. (e)—(h) Thicker lines indicate parts used for generating hypotheses for bear,
elephant, deer, and moose, respectively. (The images were obtained via anonymous
FTP from the Institute for Robotics and Intelligent Systems, Department of Electrical
Engineering, University of Southern California.)

In Chapter 5, we combine the modeling, matching, and indexing modules into a
recognition system. We propose a hierarchical veriﬁcation strategy to expedite the
recognition. Experimental results show that false model hypotheses can be rejected
at an early stage of veriﬁcation and that the recognition system can efﬁciently recog-
nize various kinds of objects. We also demonstrate the applicability of the complete
recognition system for tracking a single moving object in an image sequence. An
example of object tracking is illustrated in Figure 1.7.

Finally, Chapter 6 concludes the dissertation by evaluating the strengths and
limitations of the proposed system, identifying some unresolved issues, and giving

suggestions for future research.

24

 

(c)Frame t1 (d)Frame t2 (c)Frame t3 (f )Frame t4

    

(g)Frame t5 (h)Frame t3 (i)Frame t7 (j)Frame t8

Figure 1.7: Object tracking of image sequences of a moving squirrel. (a) Image of
frame to added with thresholded images of frame t1 - t3. (b)—(j) are the results of
model ﬁtting with the white contours indicating the ﬁtted edgemaps.

Chapter 2

Modeling 3D Objects from

Multiple 2D Images

2.1 Introduction

This thesis is concerned with the recognition of rigid free-form 3D objects from a single
intensity image. A model-based recognition system requires a representation that can
model arbitrary solid objects to an appropriate level of detail and can provide abstract
shape properties for matching purposes. A 3D object can be implicitly represented by
a set of 2D projections taken from all possible vieWpoints [47, 76, 93, 116], or explicitly
represented by an ob ject-centered geometric model [28, 29, 52, 121], or surface sweep
and volume descriptions [13, 21, 79, 89]. Good descriptions of object representation
may be found in the survey literature [12, 31].

The ultimate goal of vision is to recognize objects by seeing [78]. Many researchers
in the computer vision community believe that a complete, object-centered, 3D model

25

26

must be built in order to do the recognition, because the model can be used to match
against an instance of the model whenever it is viewed in a scene [48, 60, 75, 79].
Thus, most model-based vision systems for 3D object recognition require an object
modeling module to provide appearances of an object observed from viewpoints such
that the notable features and structural relations extracted from the sensed image
and from the appearance of the model can be matched to determine the object’s
identity and pose.

An alternative for 3D object representation is a set of retinotopic 2D patterns.
If all the possible 2D appearances of a 3D object are known, then recognition can
be done by simply matching the image against all the 20 appearances stored in
memory. However, this approach may give rise to the problem of requiring a very
large number of views of each object, potentially every possible view of each object.
As a consequence, this approach requires a large amount of memory, and different
views of the same object are treated as distinct objects. One way to alleviate this
memory burden is to describe objects by only representative views (i.e., aspects) in
which features extracted from the images remain unchanged within the range covered
by the view [39, 66]. For sculptured objects, the number of possible aspects may
still be very large. The advantages of using retinotopic models are that the modeling
scheme does not have to compute the vantage point and that the effect of perspective
projection from a 3D model is compensated [92]. The disadvantages are that the
representation is verbose and full of seams and that the indexing scheme must not
only produce candidate objects but also particular views of that object. The success of

this kind of modeling scheme for object recognition relies heavily on the completeness

27

of a full 3D representation of an object and the ability to index into a large set of
model views. This completeness depends on the complexity of the object and the
number of vieWpoints used [31].

The approach adopted in this thesis is a hybrid of the aforementioned two ap-
proaches. We construct a 3D object model from a set of 2D intensity images taken
from representative vieWpoints tessellating the viewsphere. These viewpoints are de-
ﬁned as aspects. Each aspect deﬁnes a local coordinate system and is associated
with ﬁve local images deﬁned relative to this local coordinate system. The curva-
ture method, devised by Basri and Ullman [9] for the construction of rigid objects
bounded by smooth surfaces, uses these ﬁve local images to generate a 2%D edgemap
which can be used to produce the appearance of the observed object seen from any
viewpoint within the aspect. The relative transformation between any two aspects is
known and the 3D information of each aspect obtained from the curvature method is
also available. This information can be fused together in order to build a complete
3D model in which the surface information is embedded in each aspect. However,
building a complete model is not necessary for many problems; one exception is when
used to track an object moving across aspects. Thus, in our approach, for recognition
purposes, we neither have to store every possible view of an object, nor do we have
to construct a complete geometric model.

This chapter is organized with the next section reviewing some existing techniques
for modeling arbitrary curved objects. Section 2.3 presents the proposed object mod-
eling scheme. Section 2.4 gives the implementation details. Section 2.5 shows results

from the error analysis of the curvature method and from the alignment of models

28

with observed objects. The last section summarizes the proposed approach.

2.2 Related Background

We are interested in recognizing 3D sculptured objects from single 2D intensity im-
ages. Much past work on this problem has dealt with the easier model domains of
polyhedra (e.g., [75, 85]) and ﬂat 3D objects (e.g., [60, 71]), or has used other sensory
information such as 3D range data (e.g., [43, 52]) or multiple images (e.g., [112]). In
the following, we would like to review some of the existing techniques for modeling
3D sculptured objects with the capacity of generating 2D views for recognition.

The most straightforward approach is to model a 3D object by a set of 2D images
taken from representative viewpoints. An object may be represented ( 1) by a set of
2D global feature vectors describing each possible stable view (e. g., [93, 114, 116]), or
(2) by a set of abstracted and precise geometric features (e.g., [47, 76, 60, 103]). The
advantages of the ﬁrst method are that the representation is extremely compact as
compared to other representations and that no correspondence between the sensory
data and the model needs to be established for matching. The limitations are (1)
that the representation is susceptible to occlusion and clutter because most features
can only be extracted from isolated objects, and (2) that the object can not be
accurately recreated or modeled from the features, nor can its pose be determined.
The major problem of the representation in the second method lies in matching
abstracted features. Features are local in nature, and each describes a portion of

the object. Each feature also contains locational and rotational information to be

29

used to recover the object pose and some contextual information to constrain the
matching possibilities. Localization can be used implicitly to solve the recognition
problem. Since many local features are used to model an object, the search space for
correspondences between image features and model features is exponential if some
constraints are not enforced. Moreover, salient features, such as arcs, lines, and
corners, may not be available for objects with sculptured surfaces to establish such
correspondence. However, unlike the global feature vector representation, partial
occlusion of objects can be handled.

Another promising approach is through building accurate geometric models of
objects and then generating views by projecting the geometric model along the di-
rection from the center of the viewsphere to a viewpoint. Some conventional CAD
systems can represent man-made geometric models efﬁciently, but may not easily rep-
resent natural objects consisting Of free-form surfaces. Several attempts have been
made to develop methods for building geometric models from multi-view range im-
ages [30, 52, 67, 100] or a sequence of intensity images [70, 97, 113, 121]. For example,
Higuchi et al. [52] presented a technique which uses a spherical representation, ob-
tained by deforming a discrete mesh to ﬁt the object surface, to merge multiple views
into a complete object model. Scales and Faugeras [97] recovered a complete ob-
ject surface from a sequence of images by fusing many views of the object extremal
boundaries together using known object motion and object internal edges to place
these views into a common coordinate frame.

Building geometric models from multi—view range images requires 3D data acqui-

sition and view registration [15, 100] to form a consistent connected model. Some

30

optical range imaging sensors, such as a laser range ﬁnder, are ready to provide
such range data from viewpoints. The relative positions of the sensor and the ob-
ject can either be chosen adaptively based on the object geometry or the positions
can be ﬁxed. The approximate transformations between viewpoints provide useful
information for merging data into a complete 3D data set. Range data obtained
from passive range ﬁnders are normally contaminated with noises and holes (i.e., im-
age locations with no depth values). Thus, constructing reliable free—form surfaces
from range data is important. One popular approach, adopted by several researchers
[30, 34, 52, 73, 80, 86], is to use deformable models to approximate a 3D surface by a
mesh represented by some kind of tessellation of the viewsphere or by a B-spline type
of mesh before deformation. During deformation, several force constraints, such as
internal forces for maintaining the regularity and smoothness of the mesh and exter-
nal forces for bringing the mesh to reach the object surface, are introduced to move
the mesh toward the object surface until all the elements of the mesh ”land” on the
object surface. The registration estimates the relative transformation between view-
points. Typically, the transformation is either assumed to be known, or it is solved
by extracting features from each view and then establishing correspondences between
them. Due to the unavailability of salient features from sculptured free-form surfaces,
it is difﬁcult to establish correspondences. This often results in iterative matching
techniques [11, 29, 37] for bringing a set of 3D points from range data as close as
possible to a set of primitives from a model. As is the case with any minimization
technique, this type of technique requires a good initial estimate of parameters that

is relatively close to the true transformation in order to avoid converging to a local

31

minimum.

Building geometric models from a sequence of intensity images requires that the
observer’s VieWpoints be controlled in order to generate a dense sequence of images
that allows incremental reconstruction of an unknown surface using the occluding
contour [32, 46, 70, 97, 113]. Under continuous observer motion, the visible rim gen-
erated by the object’s surface changes with the viewpoint, which affects the geometry
of the occluding contour and also reveals the shape information for the parts of the
surface over which the visible rim slides. A complete geometric model is then built by
reconstructing many views of an object obtained from known motion and integrating
local surface information over these views into a common coordinate system. The 3D
model of the object is represented as an object-centered mesh that can be rotated

and viewed from any direction.

2.3 Constructing Object Models

Object modeling is done for different purposes; models intended to support graphics
or manufacturing might not support recognition without signiﬁcant transformation of
data. For recognition with pose detection, the object modeling module must efﬁciently
provide as a function of VEWpoint (pose parameters) the silhouette of the object and
the internal edges created by surface marks, creases, or jumps. Edges caused by
self-occluding limbs of the object have been ignored thus far in our implementation.
Conceptually, they can be treated in the same manner as for limbs which create the

silhouette; in practice, the current algorithm will consider them to be internal edges

32

 

Figure 2.1: The viewing sphere can be tessellated into a regular polyhedra, each face
deﬁning an aspect.

and will ultimately discard them because of inconsistency in the stereo model. Some
conventional CAD systems can readily provide such edgemaps, and, as is seen below,
edgemaps can be extracted from superquadric models. In the rest of this section,
we address the process of creating models from multiple intensity images of complex

objects where use of a CAD system may be very difﬁcult.

Our modeling of a 3D object using a set of 2D images proceeds as follows. Consider
an object is placed at the center of a tessellated view sphere (see [56] for all the possible
tessellations). Each face of the resulting polyhedron deﬁnes an aspect (see Figure 2.1).
These aspects do not necessarily correspond to a topological similarity class, as has
been common in the literature [39, 66]. Reﬁnement of the tessellation could be used
to increase homogeneity, but, since our matching is based on geometric criteria, we
have not planned such reﬁnements. For each model aspect, ﬁve images are used to

create an edge-oriented 2%D representation of the object. Each aspect or view deﬁnes

33

a local coordinate system with the Z—axis deﬁned as the vector from the center of
the sphere through the center of that polygon face, and the directions of the X—axis
and Y—axis deﬁned by the horizontal and vertical lines of the image, respectively.
Of the ﬁve images, three are taken at the viewpoints on the sphere rotating around
the X-axis, and three are taken at the viewpoints on the sphere rotating around the
Y-axis (see Figure 2.2), with the central image common to both sets and taken along
the Z—axis. An example of a model aspect is given in Figure 2.3. The model silhouette

construction from these ﬁve images is speciﬁed in [9].

 

 

 

 

 

 

 

Figure 2.2: Model aspect made from 5 overlapping images.

It is important to emphasize that the object is modeled within an aspect in a
standard rotation (roll angle) about the Z—axis which is normal to the optical axis
of the central image used to create the aspect. Pan (rotation around the Y—axis

in Figure 2.2) and tilt (rotation around the X-axis) can be quantized into intervals

34

to organize the model aspects around an entire viewsphere. Roll angle is uncon-
strained and thus must be approximately computed from image features to initialize
the alignment optimization. Here, we assume a normalized zero-roll angle for the

object. Determination of roll angle during recognition will be discussed in Chapter 4.

//</T x/k/

 

Figure 2.3: Example input for construction of a model aspect.

Model aspects do not need to cover the entire viewsphere in cases where certain
views of the object are impossible, for example, from beneath a car. The infor-
mal requirement is that they cover the interesting views of an object and that they
provide adequate approximation. Also, for some purpose, e.g., recognition only or

pose computation within a constrained environment, it is not necessary to register

35

the coordinate frame of each model aspect to a global coordinate frame. However,
to continuously track an object with unconstrained pose, registration of each aspect
frame to a global frame is needed. To be used for recognition, all vieWpoints within
an aspect should produce more or less the same features which will be used to index
to that aspect. This means that tessellation of the viewsphere may be non-uniform
in practice, due to variation in views. Most of our experiments have been done with
a range of 20 degrees of rotation for one aspect. The appearance of an object at any
given viewpoint can be generated by ﬁrst indexing to the aspect of that viewpoint and
then using that model aspect to predict the object appearance. Indexing to a model
aspect will be treated in Chapter 4. We now turn to the mathematical formulas for

constructing a model aspect from ﬁve neighboring training images.

2.3.1 Modeling the Object Silhouette

The curvature method of Basri and Ullman [9] is a technique for modeling and re-
covering an object’s approximate edge appearance given a number of view-centered
2D projections. This modeling method is based on representing surface curvature of
points along the silhouette of an image central to the aspect being modeled. This
enables prediction of object silhouettes for viewpoints within that model aspect. The
basic idea is depicted in Figure 2.4. For convenience, we sketch the curvature method

adopted in this thesis.

Let X and Y be the main axes of the image plane, and let Z be the visual axis.

Consider a smooth object rotating by a rotation R around the vertical axis Y. Let

36

 

Figure 2.4: The curvature method. (a) A horizontal section of a smooth object. p
is a point on the rim, p; is an internal edge point, r is the curvature vector at p, o
is the center of the curvature circle, and c is the rotation center. The Z—axis is the
visual axis, and the Y-axis points out of the paper. (b) The object is rotated about
the Y-axis. f) is the new rim point approximated by Eq. (2.1) (based on [9]). fig is
the corresponding point of p; after rotation.

p = (2:, y,z) be a rim point on the object. Let r be the curvature radius parallel to
the X-axis. When the object is rotated by R, point p may cease to be a rim point,

and is replaced by a new rim point 1') = (5:, g, 2)‘ approximated by

p=R(p—c—r)+c+r (2.1)

where c = (re, yc, zc)t is the rotation center, and r = (r,, ry, 0)t is the curvature vector
at p. The meaning of Eq. (2.1) is as follows. The point 0 = p — r is the center of the
curvature circle. To predict the new rim point, we must ﬁrst move 0 to the rotation
center by 0 — c and then apply R to o — c. Let 6 = R(p — r — c). The new rim point
p is obtained by translating 6 with c and then r displacement. This method works
well as long as the circle of curvature provides a good approximation to the section

at p. Under weak perspective projection, the 2 component of [3 does not need to be

37

derived, thus, by modifying r to be (rt + wary + yc,0)’ and p to be (:r,y,z — zc)‘,

Eq. (2.1) can be changed into a simpler equation:

p = R(p -— r) + r. (2.2)

From Eq. (2.2), we can predict the position of f) for a rotation around an arbitrary
axis in 3D space, provided that the radii of curvature r: and ry, and the relative depth
2 — so at p are given. These parameters can be acquired from model construction [9]

and are stored as parts of the object model.

2.3.2 Representing Internal Edges

Internal edges of an aspect are those visible at any viewpoint within the aspect and
are caused by known discontinuities in albedo or surface normal or by artifacts such as
unpredictable illumination or shadows. Unlike silhouette edges, internal edges are not
usually occluded by the object surface due to a small rotation. Thus, corresponding
points on internal edges before and after rotation exist in two images, and stereo
matching can locate them in 3D. When an object is rotated by a 3 x 3 rotation

matrix R, an internal edge point p = (:L',y,z)t is transformed to a new position

.-

p = (i, :7, 2)‘ by
p = R(p — c) + c. (2.3)

Note that Eq. (2.3) is a special case of Eq. (2.1) with the curvature vector r = 0. Thus,

Eq. (2.3) can also be modiﬁed to get the form of Eq. (2.2) by letting r 2 (re, yc,0)t

38

t

and p= (:r,y,z—zc) .

2.3.3 Model Construction

We now show how to compute r using a 3—image stereo computation. Let A, B, and
C denote three images taken from three different vieWpoints in the aspect along a
circle of the viewing sphere perpendicular to the Y-axis (see Figure 2.5). Let ( be
the rotation angle rotating a camera from viewpoint A to viewpoint B, and V the
rotation angle rotating a camera from viewpoint A to viewpoint C. Equivalently, the
object may be rotated instead of the camera, which we often do. Let p1 = (r1, y, zl)‘,
p3 = (2:2,y, 22)t and p3 = (.153, y, 23)‘ be three corresponding points in images A, B,

and C, respectively. From Eq. (2.3), we have

:52 = (2:1 — zc) cosC + (21 — zc) sinC + are, (2.4)

$3 = (x1 — :cc) cos V + (21 - zc) sin V + $6. (2.5)

Solving these two equations for the two unknown parameters see and 21 — 26 yields

_ z1(cos(— cosu) — x2(1-— C08V)+$3(1— cosC)
zl—zc— (l—cosC)sinu—sin((l —cosu) ’ (2'6)

 

and

_ :cl sin(( - V) + 1728an — casinC

 

c _ (1—cos()sinV—sinC(1—cosu)°

With —-12'- < C, V < % and C at V, the denominator does not vanish.

Unlike matching the silhouettes, which are well-separated, matching internal edge

39

 

A

Figure 2.5: Building a model using three images. Points A, B, and C are the three
camera locations along a circle in space perpendicular to the Y axis (based on [9]).

points in the images gives rise to the correspondence problem, which is usually to
dious and error-prone. However, in our object modeling, the object is either rotated
around the X—axis or the Y-axis; thus, the stereo epipolar constraint will force the
corresponding points to be on either the same horizontal line or the same vertical line.
The rotation center in the X-axis, we, can be obtained by clustering on all possible
triples of edge points (31,22,153) from the same horizontal line with the constraint
2:; S .731 S $3 (because 1:2, 2:1, and 3:3 are in the right, central, and left images, re-
spectively). Eq. (2.7) can be introduced as a strong constraint to remove inconsistent
(r1,r2,x3) triples where the computed re, obtained by substituting 3:1, 3:2, and 233
into Eq. (2.7), deviates from the clustered :rc. Using a similar technique, we can also
resolve the ambiguity of point correspondence in the Y-axis and compute yc. Ex-
panding Eq. (2.3), one can notice that the rotation center so need not be computed
explicitly since the weak perspective projection model is used. Thus, similar to the
representation of boundary edges in the model, each internal edge point is associated

with the relative depth 2 — 2c and the rotation center (re, yc, 0).

40

For each model aspect, the following data is stored. If available, we store the
three rotational parameters that relate the local coordinate frame of the aspect to a
global coordinate frame for the entire model. The 4 rotation angles, VI, (3,113,, and
(v (as in Figure 2.5, where the subscripts denote the rotation axis), used to create
the 4 viewpoints are stored in order to deﬁne the limits of the aspect. Notice that
(re and yC are global to the aspect (and perhaps for the entire model), and thus are
stored only once; however, these two parameters are used to generate the predicted
image for each internal edge point using Eq. (2.3). For each silhouette point i, we
store (phri) where p; and r; are as in Eq. (2.2). Recall that re and y6 are already
combined in r; for each silhouette point i. Note that we have 3D information for each
internal edge point and approximate 3D information for a small neighborhood of each
silhouette point. Moreover, because of the construction process, we do not have to
do any hidden surface or line removal when using this model aspect to generate an

image edgemap.

2.4 Implementation Details

This section presents the implementation details regarding the acquisition of aspect
models from 2D images. We describe (1) how the viewsphere is tessellated into
aspects, (2) how the tripod is used to simulate the local coordinate system of an
aspect, and (3) how the ﬁve local images of an aspect are matched to derive the 2%D

edgemap of the aspect.

41
2.4.1 Aspect Model Acquisition

There are a variety of approaches for tessellating the viewsphere, depending on
whether they are object-independent or object-dependent. For object-dependent ap-
proaches [68], each object is processed separately, and the tessellation of the view-
sphere depends on the complexity of the object. The determination of optimal views
for an object is a difﬁcult task. Normally, it is carried out by ﬁrst tessellating the
viewsphere into a ﬁxed number of viewpoints, then integrating views, which share
the same object properties, into aspects [28, 36]. For object-independent approaches
like ours [27] and many others (e.g., [9]), the viewsphere is tessellated into a ﬁxed
number of viewpoints. As a consequence, objects are represented by a ﬁxed number
of views. Although this method is straightforward, it suffers from representing simple
objects with redundant views. In our approach, the viewsphere is tessellated into
latitude bands, each of which is then further divided along longitudinal strips (see
Figure 2.6(a)). Each viewpoint deﬁnes an aspect. The orientation of a vieWpoint is
then determined by two angles (1pm) as shown in Figure 2.6(b). Recall that each
aspect deﬁnes a local coordinate system and is associated with ﬁve local training
images. This kind of tessellation would allow some training images to be shared by
neighboring aspects, reducing the number of images required to model an object.
Another advantage is that it is easy to compute the orientation of a viewpoint and

consequently the relative orientation between any two viewpoints.

The setup for the acquisition of a model at a viewpoint is given as follows. A tripod

as shown in Figure 2.7 is used to obtain the orientation of each viewpoint (aspect) and

42

 

 

 

 

 

(b)

Figure 2.6: The tessellation of the viewing sphere. (a) The sphere is divided into cells
by meridians and parallels. (b) Points on the sphere are identiﬁed by their longitude
ib and latitude 7.

the ﬁve images associated with that aspect. The camera is ﬁxed along the Z—axis with
the image plane coinciding with the X-Y plane. The object of interest is mounted on
the tripod. The tripod allows two rotations. One is around the X-axis with rotation
angle 1,1). The other is around the Y-axis with rotation angle 7. The tripod is also
mounted on a rotating base which allows one rotation around the Y-axis with rotation
angle 6. The orientation of a viewpoint is given by the two rotation angles (t/J,7).
After the orientation of a viewpoint is determined, the Cartesian coordinate system,
as shown in Figure 2.7, yields the local coordinate system of the viewpoint with 11)
and 3 as the rotation angles around the local X-axis and Y-axis, respectively. These
two angles are used to allow the camera to grab the ﬁve local images of an aspect.

In our implementation, the following vieWpoints speciﬁed by (IP, 7) are selected:

it = 40°,60°,80°, and 7 = (2n + 1) - 10°,for n = 0, - - - , 17. For each aspect, both the

43

 

Figure 2.7: The tripod for model construction.

Table 2.1: The number of model aspects of 3D objects.

ject N Aspects ject N

54 18
54 18
54 18
54 6

54
54

sprayer 54
' 54
taurus 54
54

 

44

rotation angles around the local X-axis and Y-axis speciﬁed by 1,!) and ﬂ, respectively,
are :l:10°, giving an aspect breadth of 20°. This yields 54 aspects which cover a
band of the northern hemisphere with latitude from 30° to 90°. Table 2.1 lists a set
of acquired 3D models (see Figure 2.8) and the number of aspects for each model.
Figure 2.9 displays the 54 views of a plastic sprayer. The 54 views in Figure 2.10 are
of a plastic lion toy. For each such view, the procedures of the edge detection and the
curvature method are applied to derive the 2%D edgemap which is used to predict the
edge appearance of an object viewed within the aspect deﬁned by that view. These

two procedures are described in the next two sub-sections.

2.4.2 Edgemap Extraction

Extracting the silhouette of an object in an edgemap may be difﬁcult. An object
edgemap generated by the Canny edge detector [25] is normally contaminated with
noisy edges and background edges. It is rather difﬁcult to extract object edges unless
the background is carefully set up when images are taken. The ﬁgure—ground sepa-
ration is a difﬁcult problem. However, in the modeling phase, manual editing of an
edgemap is permissible. We use a software program called “CorelPaint” to remove
background edges from an edgemap and close the silhouette contour if necessary.
After the background edges are removed, the edgemap contains only object edges in—
cluding silhouette and internal edges. Some of the internal edges touch the silhouette
contour, making the automatic separation of these two types of edges difﬁcult. We

apply the algorithm described in Figure 2.11 to extract these two types of edges.

45

     

(a) camaro (b) elephant (c) gorilla (d) lion

 

(g) sprayer

)pig

)taurus

m)ug

       

(o) blockB

     

(r) cup (3) face (t) soap

Figure 2.8: Examples of model aspects of 3D objects.

46

 

Figure 2.9: 54 aspects of a plastic sprayer.

 

 

 

47

 

Figure 2.10: 54 aspects of a plastic lion toy.

48

 

Input: an edgemap Ea” containing the silhouette and internal edges of an object.
Output: the silhouette E, and the internal edges E,.
Processing:

(1) Find the two extreme edge locations in Eau at each horizontal scan line
if applicable, and ﬁll in edge points between these two extremes;

(2) Find the two extreme edge locations in Eau at each vertical scan line
if applicable, and ﬁll in edge points between these two extremes;

(3) “And” the edgemaps generated from (1) and (2);

(4) Locate the boundary of the binary image generated in (3), yielding
the object silhouette E,;

(5) Subtracting E, from Ea“ gives the internal edges Eg.

 

 

 

Figure 2.11: The edgemap extraction algorithm.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(b) (C)

Figure 2.12: The extracted edgemap of a pencil sharpener. (a) An edgemap containing
the internal edges and the silhouette. (b) A binary image resulting from ﬁlling edges
between boundary pixels. (c) The extracted silhouette from (b). The internal edges
can be obtained by subtracting silhouette from the edgemap in (a).

49

It should be noted that the algorithm speciﬁed in Figure 2.11 may not yield
the exact silhouette of an object if the object has a non-extreme concave contour
segment along the silhouette. In some cases, manual editing of the edgemap has to
be performed to ensure that accurate edgemaps are generated. An example of the

evolution of an edgemap generated in this manner is illustrated in Figure 2.12.

2.4.3 Matching Model Edge Elements

Once the silhouette and the internal edges of an object are separated, identifying
the corresponding points for the silhouette in the pictures is straightforward in the
procedure. When the rotation is about the Y-axis, the corresponding points must
lie on the epipolar line parallel to the X-axis. Similarly, when the rotation is about
the X-axis, the corresponding points must lie on the epipolar line parallel to the Y
axis. For each silhouette point, we only need to search in a small range for its cor-
responding point. Since the silhouette points are well separated, generally there is,
in most cases, only one matching candidate to be considered. In the case that there
are several candidates, the one with the shortest Euclidean distance is considered as
the true match. If the algorithm selects the wrong corresponding point, the outliers
will be removed by depth smoothness constraints enforced on the silhouette. Unlike
matching the silhouettes, matching internal edge points suffers from the correspon-
dence problem. We apply the heuristic algorithm described in Figure 2.13 to match

internal edges.

Four constraints are used to remove the inconsistent corresponding triples of inter-

50

 

 

Input: Five local edgemaps, E0, E1, Er, Eu, and E4, of an aspect where
E, is the central image;
E1 and E. are the left and right images rotated around the X-axis;
E, and E; are the up and down images rotated around the Y—axis.
Output: The 2%D edgemap for the aspect.

Processing:

(1) Extract the silhouettes and the internal edges from these ﬁve edgemaps
(as in Figure 2.11);

(2) For each 2:0 coordinate of internal edge point in E0, ﬁnd all possible pairs
of (2:1,2.) from E; and E, which satisfy :5, < ro < 23;;

(3) Derive the re coordinate of the rotation center and the relative depth 2

using Eqs. (2.6) and (2.7);
(4) Histogram on all xc’s and locate the peak 55,;
(5) Remove all triples (2:1, 2:0, 2:.) in which the derived 3:, deviates from :Ec;

(6) Use all 2’s from the consistent triples (2:1, $0, 9:.) which survive from step
(5) to compute the mean 2,, and the standard deviation 2,;

(7) Remove all triples (r1, so, an.) in which the derived 2 is not in the range of
[2,, — 22,, 2,, + 220];

(8) Repeat steps (1) to (7) for each yo coordinate of an internal edge point in
E0 to derive yc and 2;

(9) Remove the outliers from each internal edge contour in E0 using depth
smoothness constraint.

 

Figure 2.13: The stereo matching algorithm for modeling internal edges.

 

51

nal edges as depicted in steps (2), (5), (7), and (9) of the stereo matching algorithm
in Figure 2.13. The ﬁrst one is the stereo constraint as previously described. The
second one is the consistency constraint on the rotation center. Both constraints are
very effective in removing inconsistent triples from further consideration. Note that
internal edges are located on visible surfaces, the relative depth 2 of each internal edge
point should not differ much from neighbors. The third constraint uses the statistics
of the relative depth to remove inconsistent triples. Finally, the depth smoothness
constraint, which stipulates that the depth values along the same edge contour should
vary smoothly, is used to remove the outliers along each edge contour.

After the spatial coordinate of z, the rotation center (rmyc), and the curvature
vector [rm ry, 0] of a model aspect are derived, it is straightforward to apply Eqs. (2.2)
and (2.3) to predict the appearance of the model rotated around an arbitrary axis
within the aspect. An example is illustrated in Figure 2.14. Figure 2.14(a)-(e) give ﬁve
local edgemaps of an aspect representing the central, left, right, up, and down views
of a car (Ford Taurus), respectively. Figure 2.14(f) is the matched edgemap from
these ﬁve local edgemaps. Note that some of edge points in this matched edgemap
are removed due to inconsistency in the curvature or depth along the contour, or
due to the self-occlusion which causes some contours to appear in one edgemap but
not in the other. Figure 2.14(g)-(j) are the predicted edgemaps superimposed on
the original edgemaps given in Figure 2.14(b)-(e). Since these four edgemaps, as in
Figure 2.14(b)-(e), represent the extreme views of the aspect, the results as shown in
Figure 2.14(g)-(j) indicate the viability of the modeling scheme.

Table 2.2 lists the maximum (MAX) error and the root mean square (RMS) error

52

.f——

 

(s)

. ”DD
(C) (h)
(d) (1)

   

(e) (i)

Figure 2.14: The results from the stereo matching algorithm. (a)—(e) The ﬁve local
edgemaps for the curvature method. (f) The derived 2%D edgemap after the matching.
(g)-(j) The predicted edgemaps superimposed on the original edgemaps.

53

of the alignment of the four views in Figure 2.14. Large MAX and RMS errors in
left and right views indicate that there are several outliers in the alignment; the most
obvious one is the vertical edge segment near the rear window (see Figure 2.14(g));
since the edge segment is vertical, it has no effect on the up and down views. In
general, the prediction of the object’s edge appearance is good when the RMS error
is small. The outliers will be discarded as occluded edges in the matching process,
treated in Chapter 3, and thus they will have little effect on matching. The error
analysis of the modeling scheme and more examples of model alignment are given in

the next section.

Table 2.2: The prediction error of the vehicle in Figure 2.14.

 

 

 

 

 

Model View MAX Error (pixels) RMS Error (pixels)
left 7.0 1.78
right 5.0 1.46
up 5.0 0.83
down 4.1 0.68

 

 

 

 

 

2.5 Experimental Results

In this section, we are interested in investigating (1) what attributes to the modeling
error, and (2) how wide an aspect can be such that the modeling scheme can produce a
good approximation of the predicted appearance. We also intend to demonstrate the
vitality of the modeling scheme on modeling both polyhedra and sculptured objects.
Note that the model alignment in this section involves no projection, scaling, and

translation, but only rotation, and the prediction error means the error of alignment

54

in which only rotation is involved.

Four experiments are reported. The ﬁrst experiment was conducted with an el-
lipsoid to validate the theoretical error analysis of the curvature method in [8] using
our empirical results. The second and third experiments were conducted with syn-
thetic data generated from a superquadric model to answer the two questions listed
above. The fourth experiment with polyhedra and sculptured objects was conducted

to analyze the modeling error by aligning models with views from the same aspect.

2.5.1 Error Analysis from an Ellipsoid

The curvature method can predict exactly the appearance of objects with sharp
boundaries (for which the radius of curvature is zero) and the appearance of spherical
and cylindrical objects (for which the center of the curvature circle coincides with
the rotation axis). However, for smooth objects with arbitrary structures, the cur—
vature method only gives an approximation of their predicted appearance. In order
to demonstrate the properties of the curvature method, we applied the method to an
ellipsoid and analyzed the errors obtained.

To simplify the analysis, consider a canonical ellipsoid f; + g + i;— = 1 rotating
around the vertical (Y) axis. The error depends on the shape of the ellipsoid, that
is, the relative length of its axes, and it increases as the ellipsoid becomes elongated
in the Z-direction. Let p = (r, y) be a point on the silhouette of the ellipsoid (refer
to Figure 2.4), f) = (i, g) the new rim point replacing p when the ellipsoid is rotated

around the Y axis by an angle a, and f) = (5:, “) the approximated position of p

55

according to the curvature method. Let Ap = (Ax, Ay) denote p—p. The horizontal
section of the ellipsoid through p is an ellipse centered around p0 = (0, yo). Note that
y = g = g 2 yo because points p, f), p, and p0 all lie on the same horizontal section.

The relative error is deﬁned as

 

 

 

 

 

A . _ _
= II #13” = a: x (2.8)
IIPOPII :L‘
This error, as derived in [8], is given by
c2 c2 c2
E(;5,a) = cosa + 35(1— cosa) — cos2 a + -a—2 sin2 a. (2.9)

In the error analysis of alignment, we are interested in the absolute error deﬁned as

E' = IIAPII = IIPSPII ° E (2-10)

The error in Eq.(2.10) depends only on three parameters, the length of the vector
pgp, i.e., Ilpgpll, the rotation angle a, and the aspect ratio of the ellipsoid 022;.

To analyze the error of the predicted appearance, we used a superquadric model
to generate eight ellipsoids with aspect ratio (3;) of 1’3, 5, i, %, 2, 4, 9, and 16,
respectively. The aspect breadth was set to 90°. For each aspect ratio, ﬁfteen test
data were generated, each varied by rotating 2 degrees about the Y axis. For each
ellipsoid, we compute the alignment error, i.e., the absolute error E’ in Eq. (2.10).
Note that for the simplicity of the analysis, the value of a was ﬁxed and only the

value of c was varied when each ellipsoid was generated. Thus, “pr” = a because p

56

 

 

0.3 T ﬁ T *7 Y T Y V f Y Y 8 1* 1' ﬂ T ﬁf V Y T r 1 ‘7
III/2. _._ +—
m" --» 7 I "9" *—
o.2s - "m" .. -4- ...-.
'l/l +- .2. _.__
6 p
0.2 r
5 I [ﬁnmq
0.15 r

0.1 r

0.05 h 'I I p “H. ..... . _____ .u-v-
. ..... gum. ~~~~~ I"

0".
A A 4 ML 4 A ! A A A A A L A A ‘_ AL

Average Faror (Pixels)

Average Error (Pixels)
u a

 

  

 

 

 

 

 

 

o A A A A A .l 0
2468I0l214l618202224262830 24681012I416I8202224262830
Rotation Angle (Degrees) Rotation Angle (Degrees)

(a) (b)

Figure 2.15: The error of the curvature method as a function of the aspect ratio 7

and the rotation angle a. (a) f:- = ﬁ, %, i, and %. (b) if; = 2, 4, 9, and 16.

9

is a point on the silhouette of the ellipsoid before any rotation, implying E’ or E. As
shown in Figure 2.15, the error behaves differently in each of the two ranges: (1) when
c g a, and (2) when c > a. In the ﬁrst case, the ellipsoid’s width is larger than its
depth, the error assumes small values even for fairly large values of a. In the second
case, the ellipsoid’s depth is larger than its width, the error assumes larger values even
for fairly small values of a. As a function of a, the error function is symmetric and
the absolute value of the error increases monotonically with the absolute value of a.
Figure 2.15 also depicts this phenomenon. All these empirical results are consistent

with the theoretical results given in [8].

2.5.2 Rotation Axis Effect

The curvature method uses ﬁve local images of an object rotated around the X axis
and the Y axis within an aspect to predict the appearance of the object after a

rotation. Rotating an object about an arbitrary axis (not coincident with the X axis

57

or the Y axis) creates larger error than rotating the object about either the X axis
or the Y axis. This is because the former involves both the curvature radii r, and
r1, while the latter only involves either r, or r,. Note that the rotation axis N is
parameterized by (cosdsin 45,sin03in ¢,cos (,6). Let Nk,k = 0,- - - ,9 denote the 10
rotation axes with 45 = 90° and 0 = k - 10°. Then N0 and N9 correspond to the
X axis and the Y axis, respectively. Figure 2.16(a)-(e) show the ﬁve local images of
a model aspect with an aspect breadth of 40°. In this experiment, we generated 70
test data by rotating the aspect model (see Figure 2.16(a)) around these 10 rotation
axes with rotation angles, varied from 2° to 14°, each differing by 2° and denoted
as rot2, rot4 rot6, rot8, rot10, rot12, and rotll, respectively, in Figure 2.17. Each
predicted appearance of the model was aligned with the actual appearance of the
model to compute the prediction error of the curvature method. As the results show
in Figure 2.17, the prediction error tends to be larger with larger deviation of the
rotation axis from the two main axes, and also the prediction error becomes larger
as the rotation angle increases. Thus, we have to tolerate a few pixels of deviation

when model rotation angles are extreme.

2.5.3 Aspect Breadth Effect

As mentioned in the previous section, the viewsphere may be tessellated into a ﬁxed
number of viewpoints. This immediately leads to a question of how wide of an aspect
is appropriate. For objects with sculptured surfaces, the tessellation has to be as local

as possible to have better approximation of the predicted appearance. Wide aspects

58

  
   

      
         

 
 

 
   
   

“53:55.- ‘53:?"
Ill" (pp in”
4’0"". H ‘v . . I
"33:393.: ., :2: t-

553:: .é.' ‘:-'

 
   

(f)N4-2 (g)N5—6 (h)N6-10

Figure 2.16: An aspect of a superquadric model and its four test data. (a)-(e) The
ﬁve images of a model aspect with aspect breadth of 40°. (f )-(i) The test data in the
model aspect.

 

Average Error (Pixels)

 

 

 

O I A J l l l l A

N0 N1 N2 N3 N4 N5 N6 N7 N8 N9
Rotation Axis

 

Figure 2.17: The error of the curvature method as a function of the rotation axis and
the rotation angle for the object in Figure 2.16(a).

59

increase the chance of self-occlusion, and consequently degrade the approximation of
the predicted appearance. On the other hand, narrow aspects may represent some
simple objects with redundant views, and thus, require a large amount of storage,

but the accuracy of the predicted appearance is increased accordingly.

 

Average Error (Pixels)

 

 

 

 

0 1 1 1 1 1
20 3O 4O 50 60 70 80
Aspect Breadth (Degrees)

Figure 2.18: The error of the curvature method as a function of aspect breadth for
the object in Figure 2.16(a).

In order to investigate the effect of aspect breadth, seven aspects of a superquadric
model were constructed with aspect breadth varied from 20° to 80°. Figure 2.16 (a)-
(e) show one of the model aspects with aspect breadth of 40°. Figure 2.16(f)-(i) show
four sets of generated test data, N4 — 2, N5 — 6, N6 — 10, and N9 — 2, representing
the object being rotated around the rotation axes N4, N5, N6, and N9 with rotation
angles of 2°, 6° 10°, and 2°, respectively. Then, for each aspect, the curvature method
was applied to the 4 test data. The prediction error is shown in Figure 2.18; the

modeling error increases monotonically as the aspect breadth becomes larger. These

60

empirical results suggest that a smaller aspect breadth be adopted for tessellating the
viewsphere if a large rotation is required to predict the appearance of an object. Also,
a smaller aspect breadth makes the prediction more accurate, and consequently, makes
the matching (model ﬁtting), treated in Chapter 3, more efficient in the sense that

the ﬁtting would not wander around in parameter space due to ambiguous matches.

2.5.4 Model Alignment

To demonstrate the vitality of the curvature method, we chose twenty models for
alignment in which only rotation was involved. All objects were observed under
the weak perspective viewing model. Figure 2.19 shows the results of aligning the
predicted object model with the images. It can be seen in Figure 2.19(a) and (b)
that rotations create large deformations of the silhouettes. Figure 2.19(c) gives an
accurate alignment of the predicted object model with its corresponding sensed image.
A simple distance metric between the image edgemap and the aligned model appears

to be sufﬁcient to select the correct model.

Figures 2.20 and 2.21 illustrate the alignments of twenty predicted models with
their corresponding sensed images. Note that a model is speciﬁed by the model name
followed by an aspect number. For example, block/11 represents aspect I of the blockA
model. For models with only one aspect, the aspect number is not speciﬁed. The
white contours represent the predicted models. Those isolated white edge points (see
Figure 2.20(a) and (b) for larger pictures) indicate the outliers caused by the following

two effects: (1) the ﬁve training images were not well cleaned; and (2) the matching

61

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(a) (b) (C)

Figure 2.19: Deformation of the images with respect to the predicted model. (a)
A deformation of the pencil sharpener after a rotation of —10° around the vertical
axis. (b) A deformation of the pencil sharpener after a rotation of +20° around the
vertical axis. (c) Alignment of the pencil sharpener model with its corresponding
sensed image.

algorithm selected the wrong corresponding points. These two problems can be solved
by spending more time in editing and analyzing the ﬁve training images or the 2%D
edgemap derived from the ﬁve training images. Table 2.3 lists results of aligning these
twenty models with their corresponding sensed images. The average alignment errors
for all objects are within 1.0 pixel. It should be noted that accurate predictions were
achieved despite the fact that (1) the object had complex 3D shape; (2) the light
reﬂectance distorted some of the internal edges; and (3) only ﬁve images were used

to create a crude approximation of the radii of curvature.

2.6 Summary

In this chapter, we have presented a method to construct an aspect model for a com-
plex rigid object. We have studied and implemented the curvature method proposed

by Basri and Ullman to predict the new appearance of an object with smooth surface

62

Table 2.3: Experimental Results of Model Alignment.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Model Rotation Axis [12:5, nwnz] Rotation angle Avg. Error (pixels)
blockAl :1,0,0: +10° 0.90
blockBl :0,1,0j +10° 0.77
camar08 :1,0,0: -10° 0.62
can 0,1,0] +10° 0.56
cup :0,1,0: +10° 0.29
elephantl :0,1,0: +10° 0.29
face :1,0,0: -10° 0.77
gorillal 0,1,0: +10° 0.51
lion10 :1,0,0: +10° 0.79
mug? :0,1,0: -10° 0.60
phonel :1,0,0: -10° 0.43
pigl :0,1,0j +10° 0.81
sharpenerlO :0,1,0: + 10° 0.83
soap :1,0,0: -10° 0.26
sprayerl 0,1,0: +10° 0.46
squirrell9 :1,0,0: +10° 0.95
swan2 :1,0,0: -10° 0.69
taurus20 :0,1,0: +10° 0.43
truckl :1,0,0: +10° 0.60
zebral :0,1,0: +10° 0.50

 

 

 

 

 

 

 

63

   

(a) phonel

     

   

(c) camar08 ed)lephant1 (e) gorillal
(f) lion10g)pig1 (h) sharpenerlO

Figure 2.20: Exampr of model alignment. The white contour is the predicted model
superimposed on the corresponding sensed image.

 

(a) sprayerl (b) squirre119 (c) taurus20

 

 

(d) zebral (e) mug? (f) blockAl

 

 

(g) blockBl (h) truckl (i) can

   

(j) cup (k) face (1) soap

Figure 2.21: More examples of model alignment. The white contour is the predicted
model superimposed on the corresponding sensed image.

65

following a 3D rotation. A model aspect is constructed from 5 intensity images and
alignment is done with a single intensity image. We have shown that three pictures
are in principle sufﬁcient for approximating the radius of a curvature 1',c or ry, and
ﬁve pictures can be used to estimate the components 1'1. and 7'” independently. If
both components 7‘, and ry are present, the new appearance of an object following a
rotation about an arbitrary axis can be predicted.

The experimental results are supportive of our design. The proposed modeling
scheme was shown vital in modeling both polyhedra and sculptured objects. It was
also found to give accurate results for large transformations. In this scheme, an Object
is represented using a number of viewer—centered model aspects, rather than a single
ob ject-centered geometric model. Each model aspect covers a range of many possible
neighboring viewing angles. To represent the edge appearance of an object from any
viewpoint, a number of model aspects are required. The computations required in this
scheme during the prediction stage are simple; for example, no hidden line removal is
necessary. Our model construction is almost automatic when object prototypes are

available. Our work both validates and extends the work of Basri and Ullman [9].

Chapter 3

Matching 3D Object Models to 2D

Images

3.1 Introduction

In this chapter, we address the problem of matching 3D objects with arbitrary curved
surfaces to a single 2D intensity image, i.e., the object localization problem. We take
a general viewpoint, but we are interested in speciﬁc applications as well, such as de-
tecting the pose of industrial objects or bones in X-ray images. The work ﬁts within _
the theoretical framework for object recognition via alignment espoused by Ullman
[111] and Lowe [76]. Four stages are involved in this framework: (1) Model building,
as discussed in Chapter 2, for representing 3D objects; (2) image processing, including
segmentation, extracts object features (typically edge points); (3) conﬁgurations of
features are used to index to candidate models and candidate poses; and (4) determi-
nation of pose parameters that best align the projected 3D models and the observed

66

67

image features in order to verify object presence. Stages 1 and 2 were presented in
Chapter 2, and stage 3 will be treated in Chapter 4: in this chapter, we present con-
tributions to stage 4, that is, object localization via alignment. A matching method
is developed which can handle objects with partial occlusion and which also includes
internal object edges, in addition to object silhouette, in the matching process for a
more accurate pose estimation. Matching does not assume the existence of salient
local features in the image and hence is directly applicable to smooth objects. During
matching, a heuristic is adopted to synthesize correspondences in order to reﬁne the
pose parameters for diminishing the matching error between the model and observed
edgemaps.

We assume that (1) objects are rigid; (2) the object model used for pose esti-
mation is known; and (3) an approximate orientation of the object in the image is
also given. Assumptions 2 and 3 can be satisﬁed by the indexing scheme presented in
Chapter 4 because the indexing scheme will provide candidate object models and also
approximate particular views of the object models. The object may be partially oc-
cluded. Our program takes as input an object view roughly 20 degrees in breadth, an
object image, and a match tolerance. It outputs a reﬁned object pose when matching
converges and otherwise reports failure. In our experiments and core implementation,
we have implicitly assumed that the object is not rotated around the visual axis, i.e.,
a normalized zeroroll angle for the object as discussed in Chapter 2. A rough align-
ment of the image plane based on object features and image features will be discussed
in Chapter 4. This assumption ﬁxes one degree of freedom of the rotation and has,

no doubt, contributed to the performance reported. However, the method presented

68

applies to the general weak perspective case with six free parameters: one for scaling,
two for translation, and three for rotation.

Based on the assumptions, we deﬁne the problem more formally as follows.
Problem Statement: Let T be a set of transformations, including rotations in 3—
space, translations and scale changes, followed by an orthographic projection. Let
5M = {E1, E2, - - - En} be a set of view-centered 2%D edgemaps representing a 3D
aspect model M. Given an edgemap E0 of an observed object in a scene, an aligning
transformation T E T is sought such that the predicted edgemap of M, denoted as
Em, generated by the application of T matches E0.

This chapter is organized with the next section reviewing some existing techniques
for matching objects with sculptured surfaces. Section 3.3 presents the proposed
method for computing pose via matching a model to an observed image. An analysis
of the sensitivity of pose parameters to object shape and a study of the basin of
convergence are also presented in this section. Section 3.4 shows experimental results.

The last section summarizes the proposed approach.

3.2 Related Background

We are interested in locating 3D objects with sculptured surfaces from single 2D in-
tensity images. Several previous approaches to recovery of object pose from a single
image under perspective or weak perspective projection require that three or more
corresponding feature points be identiﬁed in both the object model and the sensed

image [41, 60, 76]. Such points are identiﬁed by salient features which are usually de-

69

pendent on the type of object and hence both the model and feature extraction need
to be specialized. The silhouette of a 3D object has also been used in object modeling
[9, 113, 121] and pose estimation [69, 107]. However, for a smooth object, the rim,
which we perceive as a silhouette, projected into an image moves and deforms over
the 3D surface according to the viewpoint position; thus, identiﬁcation of point cor-
respondences is frustrated by lack of salient and precise local features. Furthermore,
the silhouette alone does not provide for accurate location of smooth and somewhat
symmetric objects because the object silhouettes from different vieWpoints often ap-
pear similar. The lack of sensitivity of the silhouette to the pose parameters not only
precludes accurate pose estimation, but also causes wandering and slow convergence
in hill-climbing techniques. For this reason, we are forced to introduce the use of any
available internal object edges resulting from surface marks or creases to help reﬁne
object pose.

There have been efforts that have succeeded in locating smooth 3D objects from
their 2D image contours. Lowe [77] devised an iterative approach based on Newton’s
method for solving for projection and model parameters that best ﬁt a smooth 3D ob-
ject to matching 2D image contours. In his approach, the object is approximated by
piecewise polygonal surface patches which may be derived from a CAD model. Krieg-
man and Ponce [69] used an elimination method to construct an implicit equation
for the object image contours, and then determined the parameters of the implicit
equation by reducing it to a ﬁtting problem between the theoretical contour and the
observed data. This approach requires a precise geometric model of the object but is

very general regarding the type of features which may be used. Sullivan et al. [107]

70

approached the problem of pose estimation from image contours by using a combi-
nation of constrained optimization and nonlinear least—squares estimation techniques
to minimize the mean-squared geometric distance between the viewing cone and a
parameterized surface. Huttenlocher et al. [59] used the Hausdorff metric to corre-
late contour points of a model to those of an observed object. This method does not
require point correspondences and can handle partially occluded or distorted objects.
While this method is attractive for handling 2D translation, like many other methods
(e.g., [49, 55, 60, 76, 85]), including ours, its applicability to 2D rotations and 3D
transformations suffers from the need to search over many possibilities.

The proposed matching scheme has been used in our previous work [27] to locate
smooth objects based on the silhouettes. We were able to locate partially occluded
objects in a scene and efﬁciently track a moving object from a sequence of images.
While translation and scaling parameters were accurately recovered, rotation param-
eters could not be precisely obtained. In this chapter, we extend the previous method
to include the use of internal object edges to increase recognition accuracy and to
permit more accurate computation of pose. Signiﬁcant updating of both the mod-
eling and matching schemes was required. In the modeling scheme, as presented in
Chapter 2, the curvature method is augmented with the internal edges. In the match-
ing scheme, internal edges, in addition to silhouette, are used in the ﬁtting process
where a heuristic is adopted to ensure that the parameter ﬁtting is not overly biased
to either internal or boundary edges. In addition, we show results of a study of the
convergence of the matching method over the many viewpoints of an aspect. Our

convergence results are general and lend support to other techniques based on ﬁtting

_

71

[69, 77, 89, 107]. Our approach differs from the aforementioned approaches in several
aspects. First, objects are with sculptured smooth surfaces in which no piecewise
approximation or parametric model is used to describe object surfaces. Second, no
salient local features are used as matching primitives. Third, an efﬁcient 2D repre-

sentation of a 3D object is used.

3.3 Matching Model Edgemaps to Observed

Edgemaps

In Chapter 2, we have presented the curvature method for recovering an object’s
approximate appearance given a number of view-centered 2D projections. In this
section, we intend to show how to align model data to image data for computing
object pose and conﬁrming object presence. We want our method to be able to
tolerate image processing errors, image artifacts, and the presence of other objects
which may partially occlude the object of interest. Recall that from Chapter 2, an
aspect model is constructed as f) = R(p — r) + r where r = (rt, ry, 0)‘ is the radius
of curvature for rim points and r = (13¢, yc,0)‘ for internal edge points, and f) is the
new predicted point for p. Once an aspect model is constructed as above, it is simply
a matter of transforming and projecting new rim points and internal edge points 15 to
generate a predicted 2D edgemap. Under weak perspective projection, the rim point

or internal edge point 15 on the model object edgemap is given in the observed object

72

camera coordinate system by:

n=p+n (am

where t is a translation between origins of model and image coordinate systems. The
image coordinates (u, v) of an edge point f) = (53:, g, 2)‘ on the edgemap of the model
object are given by

(u,v) = (303,332) (3.2)

where s is a scale factor. Note that the Z coordinate of t has no effect on the resulting

image coordinates and can be set to 0 or ignored.

Combining Eqs. (3.1) and (3.2) with the curvature method, we obtain a transfor-
mation as follows:

q = S[R(p -— r) + r + t] (3.3)

where q = (u,v,0)‘, t = (tx,t,,,0)‘, S is the weak perspective projection matrix
with a scale factor 5, and R is the rotation matrix with rotation angle a, and ro-
tation axis N = (n1,n2,n3). We re—parameterize the rotation axis using the lon-
gitude 0 and the latitude 05 of the unit sphere: N = (121(0, ¢),n2(0, ¢),n3(0, 05)) =
(cos 0 sin d), sin 0 sin 43, cos ()5). The speciﬁc components of these two matrices are given

by:

300

5= 030;

 

 

000]

73

 

 

 

Ru Rn 313
R = 1221 322 R23
R31 R32 333
r ]
(l-nf)cosa+nf n1n2(l—cosa)—ngsina n1n3(1—cosa)+ngsina
= n1n2(1—cosa)+n3sina (1—n§)cosa+n§ n2n3(1—cosa)—nlsina
Lnln3(1—cos(1)—ngsinoz n2n3(1—cosa)+nlsina (1—n§)cosa+n§
ui

Object pose is thus modeled by 5 parameters; two for translation and 3 for orientation;
the 6th parameter, the scale factor 3, is associated with both the projection and object

size; we combine it with the pose parameters in our discussion below.

3.3.1 Newton’s Method and Least-Squares Minimization

The transformation in our 3D-to—2D alignment is a nonlinear operation: the best-ﬁt
parameters of the transformation can be obtained through a minimization technique
which uses a merit function to measure the goodness-of-ﬁt. Newton’s method is a
promising candidate for searching for the best-ﬁt parameters. As is the case with
any minimization technique, this method requires an appropriate initial guess of the
parameters to prevent from getting stuck in the basin of a local minimum. If the
aspect model is viable, we should be able to use any viewpoint of the aspect as an
initial guess, and thus initialization will come from indexing. We use the Levenberg—
Marquardt method, together with random walks to escape local minima [5, 50].
Although we do not assume any special feature points along the object boundary

or the internal edges, our method does extract a number of points where the matching

74

between object and model edgemaps is the worst. Newton’s method will diminish the
error between such point pairs if the correct aspect model is used. Our pose estimation
task is: given a number of corresponding silhouette points and internal edge points

identiﬁed in both the model object and the observed object, solve for t, s, and R (6

unknowns) to drive down the sizes of the worst gaps between the matching edgemaps.

Let D(Eo, Em(u'5)) be the mean—squared distance between certain points of the
observed edgemap E0 and the model edgemap Em, where 63 = (tr, ty, 3, a, 0, 05) is the
vector of pose parameters. In the minimization scheme, D(Eo, Em(di)) serves as the
merit function for searching for the best-ﬁt pose parameters (3. Let p; = (13;, y,, 2,), i =
1,2, . - . , N, be a set of 3D points of the model m, and let (u;, v.) be a set of 2D image
points of the observed edgemap E0. Recall that the z,- is the one modiﬁed by 26 as
discussed in Section 2.3. Let (Fu(pi;d3), Fv(p;;63)) be the aligning transformation as

deﬁned in Eq. (3.3), i.e.

Fu(Pi;¢3) = S[Rll($i - 7‘s) + R12(yi - Ty) + Rl3zi + 1‘; + ix],

Fu(Pi;U7) = S[Rzi($i — Tr) + R22(y.' - Ty) + 3232; + Ty + ty]

where jo’s are the components of the rotation matrix R.

Our goal is to minimize the mean-squared distance

N
D(Eo, £2.03» = ﬁzumpﬁa) — m): + (was) — v02}. (3.4)

I=l

75

The solution to Eq. (3.4) can be obtained iteratively by using Newton’s method:

W“) = a“) — H-l . [wt-5”)” (3.5)

where :3”) is the parameter estimate at iteration k, H is the second derivative matrix

(Hessian matrix) of D with respect to the parameters (.3 given by

 

 

 

321? N u(Pi;w __w_)( a 0F..(pa;d3) 3Fu(pi;c3)
awkawz —N ;{82 BF——_wk8wz (Fu(Pi, w) — 11;) + 602,. . 602,
OF F..(p.; d3) , - _ 6F..(p:;d3) 8F..(pi;d3) _
+ BLLJkaLUI (Fv(p|’w) U.) + 80)]: aw, }9 kal — 19 36

and [V(<Ii)] is the gradient of D with respect to the parameter (3 given by

N
22=3§1{-°————F“a('°"“I”(sips;as)—u.-)+

6w).

8Fu(Pi;‘3')

002k (Fv(p;;w) — v,-)}, k = 1, - - - ,6.

One of the most expensive aspects of implementing Newton’s method is often
the computation of the elements of the Hessian matrix. However, the calculation of
partial derivatives in closed form is straightforward given the aligning transformation:
Fu(pi;&3) and Fv(pi;u'3). The partial derivatives of Fu and F” with respect to t3, ty
and s are trivial. However, it is difﬁcult to calculate the partial derivatives of Fu
and F” with respect to R since there is no standard formulation of R in terms of
its three parameters. Most researchers represent R in terms of three Euler angles:
yaw, pitch, and roll. The partial derivatives of Cartesian coordinates with respect

to these parameters are strikingly simple as mentioned in [76]. However, we are

76

Table 3.1: The partial derivatives of l<I with respect to 0, and ct.

 

n1 n2 n3
0 —n2 n1

n n n n 2 2
— n +n
Uni+ni Qni+ni \/ 1 2

 

 

 

 

 

 

 

 

more interested in ﬁnding the rotation angle a and the direction of the rotation axis
N because N gives the normal of the ground plane and a may be used later to
derive the motion (trajectory) of the object. Given the parameterization of N as
previously described, it is straightforward to derive the partial derivatives of 111,702
and 123 with respect to their two parameters: 0 and 45. Table 3.1 gives these partial
derivatives for all combinations of n1,n2,n3,0 and a where 0’) ¢ krr, Vk = 0,1,- --

When 45 = k7r, W: = 0,1, - - -, mil/045 = (cosdcos qb,sin0cos (0,0).

Now we can build up the partial derivatives of R with respect to 0, 05 and 0 based

on the partial derivatives on Table 3.1. For example,

 

 

 

2
6:6“ = (1— cos a) 90°—”0: -—(1 — cos a)2n1%—nol= —2nln2(1 — cos a)
and
61211 On? 6n1_2n1n3(1 - cos a).
1—cosa 1—cosaZn ‘-
a¢ =( )— a¢ :( ) l— a¢— ¢n‘ + n2

The partial derivatives for all combinations of 0, gt), 0, R11, R12, R13, R21, R22 and R23

are listed in Table 3.2.

Given the parameterization F“ and F” and all partial derivatives of R with re-

spect to 0, 43 and a, we can accomplish our goal of computing the Hessian matrix

77

Table 3.2: The partial derivatives of R with respect to a, 6 and (b.

 

a 0 0‘)

R11 (Hf—USO, —2n1n2(1—Ca) 2nfn3(1—Ca)/A

R12 nlngSa—n3Ca (nf—ngxl—Ca) (2n1n2n3(l—Co,)+(n¥+ng)Sa)/A
R13 n1n3Sa+n2Ca —n2n3(l-Ca)+nlSa (n1(2n§—l)(l—Ca)+n2n3Sa)/A
R21 n1n250+n3ca (n?—n§)(1—Ca) (2n1n2n3(1-Ca)-(n¥+n§)Sa)/A
R22 (nz—lﬁ'a 2n1n2(1—Ca) 2ngn3(1—Ca)/A

323 ngngSa—nlCo, n1n3(1—Ca)+ngSa (n2(2n§—1)(1—Ca)—n1ngSa)/A

A=\/n¥+n§ Cuzcosa Sazsina

 

 

 

 

 

 

 

 

 

 

 

 

 

H and the gradient V(&3). We use the Levenberg-Marquardt algorithm to solve the
overall least-squares minimization problem. In the rest of this chapter, we will re-
fer to the matching technique as the Newton’s method with Levenberg—Marquardt

minimization, denoted as the N-L-M method.

3.3.2 Heuristics used in Alignment

We use heuristics to address three speciﬁc problems. First, because the silhouette of
smooth objects has limited information, in order to achieve more accurate pose, we
maintain a balance in the importance of silhouette and internal edge alignment as
ﬁtting progresses. Second, in order to allow for some occlusion, we provide a scheme
for discarding certain edge segments during matching. Third, because our object class
does not provide us with salient point features, we use a heuristic that synthesizes
“matching points of greatest distance” which are most useful in driving the match

error down.

78

Balancing the Fitting between Internal and Boundary Edges

The Newton’s method described above will usually converge as long as the match—
ing contours are the actual corresponding contours. Matching silhouettes is reliable
because the boundary edges are well separated [27]. However, for smooth objects,
small changes in any rotation parameter may have little effect on the silhouette. Not
only are internal edges represented exactly in the model aspect, but they are usually
closer to the viewing axis making their images more sensitive to rotation. Including
internal edges, when available, in the parameter-ﬁtting process is often necessary for
deriving accurate object pose. In doing so, we also need to consider relatively different

numbers of internal versus boundary edge segments.

Let Em and Im be the boundary and internal edges of the model edgemap, and
Bo and Io the boundary and internal edges of the observed edgemap respectively.
Deﬁne the mean-squared distances of certain boundary edges and internal edges as
D(Bo, Bm(&3)) and D(Io, Im(&3)) respectively. Then the merit function in Eq. (3.4) is

modiﬁed to '
ME... Ema» = 30w... Baez» + W-D(I.,1..(as))1 (3.6)

where W is the weight controlling the trade-off between minimizing D(Bo, Bm(a'3))
and minimizing D(Io, 1m ((5)). For large W, the parameter corrections, H'1 - [V(CJ)],
as in Eq. (3.5), will tend to favor ﬁtting the internal edges; for small W, the parameter

corrections will favor minimizing the ﬁtting error in boundary edges. To implement

ll

79

a balance, we adjust W at each iteration based on the ratio:

W‘k“) : 0(10, [Mu-5(0))
0(30, Bm(d3(‘=)))'

 

(3.7)

This heuristic does not guarantee convergence. However, it ensures that the parameter

ﬁtting will not be overly biased to one type of edge contour.

Measuring Errors and Identifying Correspondence

Newton’s method would behave well if we had at least three correct point correspon-
dences between E, and Em from which to compute the error in image matching. Point
correspondences are either ill-deﬁned or difficult to ﬁnd because salient viewpoint-
independent features such as line segments and corners may not be available from the
edgemaps of a smooth object. Finding point correspondences can be further compli-
cated if there is partial occlusion of the object. This situation suggests that a global
correspondence be established rather than local correspondences between features.
We ﬁrst establish global correspondence based on 2D template matching of internal
edges under the assumption that the matched object and model should have similar
contours and the overall distance between the corresponding contours is the smallest
of all possible correspondences. This assumption is justiﬁed for smooth objects un-
der a vieWpoint close to the correct viewpoint. Below, we describe a heuristic that
assigns local point correspondences based on the prior establishment of a good global
correspondence between the predicted edgemap Em and the observed edgemaps E0.

These synthesized local correspondences are then used to iteratively drive down the

80

ﬁt error.

Let Io and Im be the internal edges of the observed edgemap and the model
edgemap respectively (see Figure 3.1(a) and (b)). Matching starts by sliding 1m
over [0 in 2-space searching for a position where the total squared error distance
of the corresponding contours reaches its minimum. The error distance of the edge
point in [m to the nearest point in Io can be computed in constant time by using
a chamfering technique (this technique is also used by Huttenlocher et al. [59] in
the computation of Hausdorff distance). The chamfering technique [7] expands each
object contour point in all directions with an increasing value of gray level which
represents the distance from the observed object contours. The distance function
yields the distance, measured in pixels, from any image pixel to the nearest contour
point. After chamfering the observed contours once, the match error for N model
points can be computed in 0(N) time.

Once the global correspondence is established as above, we divide the bounding
box of the model edgemap into 36 regions (could be I: by k for another It) and locate
contour segments in each region if they exist (Figure 3.1(c)). On each model segment,
we designate the point that has the largest error distance as the search point, and
then use this point and the obtained error distance to search for a corresponding point
on the observed edgemap (Figure 3.1(d)). In a situation when the observed object
is partially occluded, the edgemap of the occluded portion does not have matches in
the model edgemap and tends to have large error distances in template matching.
To avoid including these incorrect point correspondences in Newton’s method, we

select as matches the best 50% of the point correspondences on the silhouettes and

81

The Observed Edgemap The Model Edgemap
(a) (b)

 

 

 

 

 

 

 

 

Figure 3.1: Synthesizing local point correspondences. (a) The observed edgemap
Ea. (b) The model edgemap Em. (c) The bounding box of Em is divided into 36 '
regions; the region is labeled if it contains a contour segment. ((1) The selected point
correspondence in each region if applicable (only silhouette point correspondences are

labeled).

 

 

 

82

the internal edges respectively. The error distance on each selected contour segment
will force the pose parameters in Newton’s method to be adjusted in such a way that
at each iteration the model contour segments will get closer to the observed contour

segments. The matching algorithm is sketched in Figure 3.2.

Note that the template matching would fail to synthesize local point correspon-
dences if the observed edgemap and the model edgemap were not in an approximate
scale. Here, we assume that the initial scale of the observed object is given. Deter-
mination of the initial scale value for matching will be discussed in Chapter 4. As
an example of showing that the scale is not ﬁxed, consider the car (Ford Taurus) in
Figure 3.3(a). Figure 3.3(b) shows the model edgemap of the car which is 2 times
larger than the observed edgemap. The initial parameter value for scale was set to
0.5. Figure 3.3(c) shows the evolution of the model edgemaps generated during the
iterations of matching. The white contour indicates the ﬁnal edgemap generated and

is shown superimposed on the original intensity image of the car as in Figure 3.3(d).

3.3.3 The Sensitivity of Pose Parameters to Object Shape

We show that the sensitivity of the rotation parameters is closely related to the
object shape. As mentioned earlier, silhouettes do not provide enough information
for accurate pose estimation, and including internal edges in the matching process is
often a necessity. Directly below we provide a theoretical characterization and later

on we show experimental evidence for this.

The basic idea of the sensitivity analysis for smooth objects is shown in Figure 3.4.

83

 

 

Input: an observed edgemap E0, a model edgemap Em,
an initial estimate of pose parameters L3 = (tx, ty, 3, 9, <15, a),
an image matching error threshold E, in pixel units, and
a limit on the number of iterations [max for failure.

Output: either failure or the best—ﬁt pose parameters (.3.

Processing: this algorithm computes d3 iteratively using Newton’s method
until the mean-squared distance D(Eo, Em(u'5)) is below E, or until
the maximum number of iterations [max have been tried.

In the ﬁrst case, we have alignment, and in the second case failure.
The iterations repeat the following steps:

(1) Establish global correspondence based on the template matching of
internal edges (or boundary edges in the case when only silhouettes
are used in alignment);

(2) Divide the bounding box of Em into 36 regions;

(3) In each region applicable, locate the point correspondence that has
the largest error distance (point correspondences are assigned
separately for internal edges and silhouettes);

(4) Select the best 50% of the point correspondences on the silhouettes
and the internal edges respectively;

(5) Use the point correspondences found in steps 3 and 4 to compute the
weight W in Eq. (3.7) and also to derive the Hessian matrix H and
the gradient V013), and then recompute 63 using Eq. (3.5);

6) Use 073 to predict the new Em and compute the modiﬁed D(Eo, Em(d3))
as in Eq. (3.6) on those contour segments in selected regions.

 

Figure 3.2: The edgemap alignment algorithm.

 

84

JD”
@ M

 

(C) (d)

Figure 3.3: Matching edgemaps in an 1:2 scale. (a) The observed edgemap. (b) The
model edgemap. (c) The evolution of convergence in the matching algorithm. (d)
The ﬁtted edgemap shown superimposed on the original image.

85

 

Figure 3.4: The rotation of a smooth object.

Let X and Y be the main axes of the image plane, and let the Z axis be the visual
axis. Consider an object rotated through some 0 angle about a rotation axis Bl with
direction (n1,n2,n3)‘. Let p3 be a rim point and [35 be the corresponding rim point
after rotation. Let p; be an internal edge point and f)"; be the corresponding point after
rotation. Let r be the curvature radius at p}. Note that r is a vector perpendicular
to both 1:1 and Z. The point 0 is the center of the circle of curvature of pi. Let
Apj = (ij,ij,Azj)‘ denote 153 — pi and Ap; = (Azg,Ay.-,Az.-)‘ denote [5; — pi.
Without loss of generality, let us assume that the rotation axis passes through the
origin 0 and points pg, 15;, p,-, and p} are on the same cross section. Let po be the
rotation center of the cross section through points pi, p}, pi, and p]. The following

statements hold for rotation R.

(l) Apj lies in a plane perpendicular to the rotation axis. That is, Apj . HI = 0.

-o

(2) Points lying on the rotation axis N are invariant to rotation R. That is,

R(Aﬁ) = AN or (R— I)(Al§l) = 0 where A is any real number, and I is the

86

3 x 3 identity matrix.

Note that points pj, p5, and 0 form a plane that is perpendicular to both l<l and
the Z axis, and a cross section is deﬁned as the intersection of the plane with the

object.

Proposition 3.1 If point 0 lies on the rotation axis HI, then p; is invariant to the

pose parameter a.

Proof: From the curvature method in Eq.(2.2), we have

155 = R(Pj - 1') +1?-

Rewriting the above equation, we obtain

1'5: - Pi = (R- I)(Pj--1‘)

Since 0 lies on the rotation axis N, we have pj — r = AN for some real number A.
From Fact (2), we obtain (R — I)(pj — r) = 0, i.e., Apj = 0. Thus, p; is invariant to

the pose parameter a. C]

From Proposition 3.1, we can see that the image of point p; will be of no help
in reﬁning a in the optimization because it does not change with a when rotated.
Equivalently, 9A1? is unaffected by the rotation. Thus, for a symmetric object such as
a circular cone, it is obvious that rotating the cone around its principle axis results in

no rotation angle information. Proposition 3.1 is clearly true for a totally symmetric

87

 

 

 

 

(a) (b)

Figure 3.5: A scene where the matching algorithm fails to recover the pose parameter
a. (a) The object is composed of three subparts. (b) The extracted silhouette of the
object.

object, such as a sphere, rotating about any axis: there is no change of silhouette
shape with respect to any rotation angle a. In fact, this proposition is consistent with
the prediction error of the curvature method in Eq. (2.9), where E(%,a) = 0 when
a = c, i.e., the cross section is a circle, which makes a irrelevant (and thus insensitive)

to the predicted appearance.

Deﬁnition 3.1 The local curve, from pj to 15,-, is said to be invariant to rotation R
up to a rotation angle ofa if the center of the circle of curvature 0 lies on the rotation

axis.

If many local curves of an observed object are invariant to rotation R up to some
rotation angle, then the pose parameter a can not be recovered accurately in the
parameter ﬁtting process. As an example, consider the object in Figure 3.5. Subparts

A and C are circular cylinders with their principle axes coincident with the rotation

88

axis. Subpart B is a cylinder with octagonal cross section. It is obvious that the
silhouettes of subparts A and C are invariant to rotation angle a, while the silhouette
of subpart B varies only within a rotation angle of 45 degrees. For designating point
correspondences on silhouettes, the contour segmentation scheme, as described in the
matching algorithm in Figure 3.2, will choose the best 50% contour segments, in which
the error measurements are among the smallest. In this case, contour segments D and
E, as in Figure 3.5(b), will not be selected (they will be treated as occluded portions
of the object because of their large error measurements). Then the silhouettes selected
for matching will remain unchanged regardless of the rotation angle a, resulting in a
loss of rotation angle information. Thus, our matching heuristics for the silhouette
behave badly in this case. However, a is sensitive to the internal edges within the

45° ambiguity intervals.

Proposition 3.2 Let p; be the internal edge points such that pi # AN for some

A 75 0. Then the pose parameter a is sensitive to p;.

Proof: Since p; is an internal edge point, the radius of curvature r = 0. Then the
curvature method becomes

ﬁ=Rm

for internal edge point p;. Rewriting the above equation, we obtain

AP: = 5a - Pi = (R - 1);».

Since pi # AN for some A 76 0, thus ”Pi -O|| > 0 (i.e., pi 3é 0). It follows

89

immediately that Ap; 7t 0. Thus, the pose parameter a is sensitive to pi. D

Proposition 3.2 also suggests that the pose parameter a is insensitive to any
internal edge point lying on the rotation axis. But they occur only at the intersections
of the rotation axis and the object surfaces because internal edge points are surface
points. For smooth objects, there are usually at most two such points existing (there
would be many for a polyhedron rotating about an edge). Moreover, our heuristics
in designating point correspondence on internal edge points will exclude these points
for alignment. Thus, the pose parameter a can be recovered for somewhat symmetric
and smooth objects if internal edges are used in the matching process. Accuracy
depends on the number of such points pi selected and the size of Api relative to the

pixel size.

Proposition 3.3 Let p = (x,y,z)‘ and f5 = (x,g,2)‘ be the corresponding points
before and after the rotation. Let the rotation axis be the Y axis, and the rotation
angle be a. Let po = (0,yo,0)‘ be the rotation center of the horizontal cross section
through p. Let Ap = (Ax,Ay,Az) denote p — p. Let I be the length of the vector
pgp and ,6 be the angle between the X axis and the vector pgp. Then

Ax = —2l sin(; + ﬂ) sin ;.

Proof: Note that the points p, f), and po all lie on the same horizontal section

implying that y = g = yo. Since the object is rotated through angle a about the

90

Y axis, we only need to consider the relationship between the coordinates (23,2)
of the original point and (ii) of the rotated point. Using polar coordinates in
the cross section, a point with polar coordinates (LB) rotates to (1,5 + a) where
I: W and ﬂ = tan‘1(z/x). Now x =1 cos B, so that 5: =1 cos(ﬂ+a). Hence

Ax

x—x= —2lsin(%+ﬂ)sin%. D

It can be inferred from Proposition 3.3 that for a ﬁxed I and a small non-zero a,
|Ax| is close to its maximum when ﬂ is close to g or 37", and |Ax| is close to 0 when
,8 is close to 0 or 7r. By the deﬁnition of the rim (i.e., the set of all points on the
visual surface, whose normal is perpendicular to the visual axis), ﬂ is either 0 or 7r
for boundary points. Thus, the rotation is less sensitive to points on the boundary
than to any other points lying on the same cross section. In fact, for a symmetric
object, we learn from Propositions 3.1 and 3.2 that when the object is rotated, the
boundary points remain unchanged, but not the internal surface points.

For simplicity of the analysis, let us assume that the object is rotated through
a about the Y axis. All the notations used here are consistent with the ones in
Figure 3.4. Let 1,- and 1,- denote the length from po to pi and 0 respectively, i.e.,
l,- = “po'ipill and l,- = ||p;o||. Let H,- be the angle between the X axis and the vector
p30, and ,6,- the angle between the X axis and the vector pin. From Proposition 3.3,
we have Ax; = —2l,~ sin(% +,B;) sin % and Ax,- = —2l,- sin(-‘;- +ﬂj) sin %. Then, IAxJ-I S
|Ax,-| if [lj sin(% + ﬂj)| S |l.-sin(% + 30]. Note that for rim point pj, the center of
the circle of curvature o is used to compute Ax,- because r remains unchanged after

rotation. For a very smooth object, if the object’s center of gravity is assumed to

be the rotation center, then both I,- and ﬂ,- tend to be small but r tends to be large,

91

causing [AxJ-I, the projection of the changes in the image, to be small. On the other
hand, I,- does not depend on the local curvature at p;, thus Ax,- depends on the values
of B,- and lg. In most cases, for this kind of object, I,- > 1,- and [3.] > LBJ-I, and
consequently, [Ax,-| 2 [ij]. Thus, the pose parameter a is more sensitive to the
internal edge points than to the silhouette points. An interesting case to consider is
a cube rotating about an axis through the center of two opposite faces. All silhouette
points have r = 0. For some aspects, 0 is more sensitive to silhouette points, and for

other aspects, 0 is more sensitive to internal edge points.

3.3.4 The Basin of Convergence

In this section, we assume that the ground truth of pose parameter estimates is known,
and we investigate the basin of convergence. In our implementation of nonlinear
least squares minimization, we use the Levenberg-Marquardt method to search for a
minimum. The Levenberg-Marquardt method is one of the most widely used methods
for nonlinear least squares minimization. This method uses a scalar parameter A to
force iterative convergence. The new form of the Hessian matrix H in Eq. (3.5) is

given by

I (1+/\)sz whenk=l
Hm = (3-8)

HH when k 9‘5 l

where H“, 16,1 2 l, - - - , 6, are the array elements of H. As A is increased, the solution
increasingly corresponds to pure gradient descent. For decreasing A, the problem

would gradually move back to the original Newton’s method. This phenomenon can

92

 

 

Enemy (MSE)
l)
‘ Starting Position
------------------------------------------ B
= Parameters
-> 1 Increasing ((hndient Descent) :t A. Decreasing
0 Local Minimum 0 GlobdMinimum

Figure 3.6: The basin of convergence.

be described as in Figure 3.6. A ball is moving downward on a bill by increasing A
until it reaches the valley, and then starts moving upward on a hill by decreasing A.
If the force of moving upward is not sufﬁcient to allow the ball to reach the top of the
hill, the ball will get stuck in the bottom of the valley. If the force is powerful enough,
the solution is searched for in the next valley. This will continue as long as the uphill
force is large enough. If the uphill of a global minimum is shallow, the search may
not stop at the global minimum. Thus, the iterations should be terminated when the

mean-squared distance error (MSE) is smaller than some error threshold Et.

Those valleys where the parameter estimates satisfy some error criteria in param-
eter space are deﬁned as global minima, otherwise, they are deﬁned as local minima.
For example, suppose that (I: and (I: are the pose parameter estimates and the ground
truth of pose parameters, respectively; let Eb be the error bounding box in parameter

space; then global minima are deﬁned as those pose parameter estimates (2: satisfying

93

 

 

 

 

 

 

 

 

   

 

 

 

 

E
2
3 8
i m E
O a:
3 E .21
0 0
.?=" Q 3’
5'0 0 o
3 > >
z < <
c
U
E
t Ell Etl Ell
Error Threshold: Et Error Threshold: Et Error Threshold: Et

(a) (b) (C)

Figure 3.7: The effect of Et on the basin of convergence. (a) The number of trials
converging to global minima. (b) The average MSE error of ﬁtting. (c) The average
converging time of ﬁtting.

|cb — ID] 6 Eb. We implicitly assume that the ﬁtting errors (MSE) of global minima
are smaller than those of local minima; this assumption is justiﬁed for pose parameter

estimates close to the ground truth which make the ﬁtting error small.

A valley with MSE below Et becomes an attractor, absorbing any ball in the
valley. Consider the number of searches which would terminate in global minima
if started at random in some region of parameter space (or equivalently, consider
the probability of searches reaching global minima). When Et is below the MSE of
the actual global minimum (i.e., the ground truth of pose parameters), denoted as
Etl (see Figure 3.7(a)), the number of global minima achieved remains ﬁxed because
there exist no attractors and converging to global minima depends solely on the
topology of the basin and the initial starting positions. When Et is increased, so is
the number of attractors which may include some valleys of local minima, and thus,
the number of searches converging to global minima is decreased. This phenomenon

also explains why the average MSE of convergence is increased as Et is increased (see

94

Figure 3.7(b)). The phenomenon depicted in Figure 3.7(c) is obvious because more
and more attractors exist as Et is increased further, and consequently, the average
number of iterations to convergence is decreased. Our results below illustrate these

concepts and also show how Et might be chosen in practice.

3.4 Experimental Results

The alignment algorithm presented in this chapter has been tested for both real
sculptured objects and synthetic objects made from a superquadric model [5]. The
algorithm has succeeded in deriving the object pose by aligning only the silhouettes
[27]. In this section, we intend to demonstrate: (1) the algorithm can handle partially
occluded objects and objects with internal edges; (2) the basin of the convergence
is broad; (3) the accuracy of the derived object pose can be further improved by
introducing object internal edges in the alignment process.

As an example of showing the intermediate stages of the alignment process, con-
sider the superquadric model, an ellipsoid, in Figure 3.8(a). Figure 3.8(b) shows the
extracted edgemap of the ellipsoid. Figure 3.8(c) shows the evolution of the model
edgemaps generated during the iterations of N-L-M method. The white contour in-
dicates the ﬁnal edgemap generated from N-L—M method. Figure 3.8(d) shows this
edgemap superimposed on the observed object. Figure 3.9 gives another example of
this alignment process. As can be seen in Figure 3.9(b), the observed object, a par-
tially occluded block, is ﬁtted by the model within only a few iterations as indicated

by the evolution of model edgemaps in Figure 3.9(a).

95

 

Figure 3.8: Steps in the model ﬁtting algorithm. (a) Intensity image synthesized
from a superquadric model. (b) Edgemap extracted from the image of (a). (c)
The evolution of convergence in the N-L—M method. ((1) The ﬁtted edgemap shown
superimposed on the original image.

96

   

(a)

Figure 3.9: Fitting a polyhedral model to a partially occluded object. (a) The evolu—
tion of convergence in the N-L—M method. (b) The ﬁtted edgemap shown superim-
posed on the original image. (Note the occlusion by the Y pipe.)

3.4.1 Results from Model Fitting Experiments

The alignment algorithm for locating objects in the scene was tested on a total of
sixty aspect models. Half of them are partially occluded objects. Figures 3.10 and
3.11 show a few of alignment examples. Note that in several of these images, the test
objects are partially occluded. As can be seen from the white contours in Figures 3.10
and 3.11, most of the observed objects are well ﬁtted. Table 3.3 shows the goodness of

the ﬁt. Note that EN is deﬁned the angle between N and N, i.e., cos’l IlNlll [I‘M], where

 

N as the ground truth for N. The translational parameters, (tmty), are recovered
within 3 pixels, the rotation angle, or, within 3 degrees, the rotation axis, N, within
0.2 radians and the scaling factor within 0.01. The ﬁnal 2 columns of Table 3.3 show
that alignment is achieved within 1.6 pixels MSE on the edgemaps within 10 iterations

of the hill-climbing procedure.

97

 

(a) can (b) camar08

 

(e) phone6 (f) swanll

Figure 3.10: Model ﬁtting examples of sculptured objects.

98

 

(c) mug26 (d) pigl

 

(e) trucklO (f) zebral

Figure 3.11: Model ﬁtting examples of occluded objects.

Table 3.3: Results from Model Fitting.

99

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Input parameters: Et = 1.6 pixels; Imax = 30 iterations
Object Pose Estimation Error Number of MSE
E" E,” E, E0, EN Iterations
(pixel) (pixel) (degree) (radian)

can 0 0 0.003 0.82 0.05 4 0.94
camar08 3 2 0.009 0.37 0.06 9 1.16
face* 0 0 0.002 2.20 0.20 7 1.39
phone6 1 0 0.001 0.40 0.06 6 1.08
sharpenerlO 3 0 0.001 2.39 0.13 8 1.36
swanll 0 0 0.001 1.05 0.15 4 0.99
blockAl 0 2 0.007 0.38 0.16 5 1.02
lion29 1 0 0.001 0.49 0.06 5 1.56
mug26 2 0 0.010 0.18 0.10 7 0.98
pigl 0 0 0.001 0.07 0.03 5 1.22
trucklO 2 2 0.009 0.74 0.07 10 0.95
zebral 3 3 0.010 0.31 0.05 8 0.76
Ep = “13 - mm,“ where 13 is the estimated parameter value of pm,e

*ground truth not actually known (only approximation).

 

3.4.2 Results from Monte Carlo Experiments

To investigate the convergence characteristics within a viewing aspect, the parame-
ters of rotation, a, 0, and 9b, in the aspect were sampled at equal intervals covering
the whole viewing aspect. Each sample of these three parameters, together with the
translation parameters (t,, t”) and the scaling parameter 3, forms the initial param-
eter estimates for the ﬁtting. The translational parameters (tmty) were randomly
selected within {-20, 20] pixels and 3 within [0.9,1.1]. Note that the indexing scheme
described in the next Chapter will justify using such initial estimates. 240 samples
were generated for each viewing aspect. Six test objects, as shown ﬁtted by their
corresponding models in Figure 3.12, were used to conduct the experiments. Three of

them are symmetric along some rotation directions. The purpose for selecting such

100

Table 3.4: Convergence and Error Analysis: Edgemap vs. Silhouette.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Number of trials: 240
Et = 1.0 pixels; Imax = 30 iterations; Eb is as deﬁned in Eq. (3.9)
Silhouette and Internal Edges

Object Avg Iteration [1 (Error < Et) ﬁlw — LB] G Eb
ellipsoid 9 240 240

cup 12 240 210
taurusl 10 240 182
phonel3 (occluded) 12 240 195
squash (occluded) 9 240 232
squirrel (occluded) 11 240 173

Silhouette Only

Object Avg Iteration [1 (Error < Et) [[Iw — 0| 6 Eb
ellipsoid 9 240 17

cup 6 240 44
taurusl 9 240 134
phone13 (occluded) 11 240 145
squash (occluded) 10 240 220
squirrel (occluded) 16 240 143

 

 

 

 

 

 

objects is to study the effect of object shape on the accuracy of the estimated object

pose.

Table 3.4 lists the number of trials converging: ﬁ( Error < Et), and the number of
the trials converging to global minima: [Hui—é] E Eb where w and (I) are the estimated
pose parameters and the ground truth parameters, respectively, and Eb is the error
bounding box in parameter space centered around the ground truth parameters C).
The criteria for global minima are, i.e., the range of Eb:

ItJr — t1] S 3 (pixels), |ty — t] S 3 (pixels), ls - s| S 0.01, ( )
3.9

[a - d] S 3 (degrees), ||N - N” S 0.2 (radians),

where N(0, d2) = (cos 0 sin ()5, sin 0 sin ()5, cos qb) is the direction of the rotation axis, and

101

 

(a) ellipsoid (b) squash

 

(c) cup (d) squirrel

   

(e) taurusl (f) phone13

Figure 3.12: Test objects used for studying the basin of convergence.

102

61, N, s, t}, and t], the ground truth parameters. ||N — NI] denotes the angle be-
tween N and N, i.e., cos"1 ll—er‘Tlfﬁﬁ' As can be seen in the column [1(Error < Et)
of Table 3.4, all 240 trials in each object experiment converged to (either global or
local) minima within an error threshold of 1.0 pixels, suggesting a broad basin of
convergence for the listed 6 object aspects. The quantity, [llw — (22] E Eb, indicates
the number of converged trials that actually converge to global minima. This quan-
tity also indicates the sensitivity of pose parameters to object shape. Consider the
ellipsoid in Figure 3.12(a) for an example. As shown in Table 3.4, all trials in the
ellipsoid experiment converged to global minima when internal edges together with
silhouette were used in the ﬁtting process. This is because the ellipsoid is made from
a superquadric model and the object can be precisely modeled. However, when the
silhouette alone was used in the ﬁtting process, only 17 (out of 240) trials converge
to global minima. The deterioration of converging to global minima is attributed
to the symmetry of the ellipsoid and the insensitivity of rotation parameters to the
object silhouette. This phenomenon is also depicted in Figure 3.8(c) where the model
silhouette converges to the object silhouette after 3 iterations but the pose is still far
from the ground truth as indicated by the cross mark. Another experiment showing
this phenomenon is conducted via the cup in Figure 3.12(c). The object is almost
symmetric around the rotation axis, and as a consequence, only 44 (out of 240) trials
converged to global minima when only the silhouette was used in the ﬁtting process.
But the situation is much improved when internal edges were incorporated into the
ﬁtting process as shown in Table 3.4. 210 (out of 240) trials were able to converge to

global minima. The squash in Figure 3.12(b) is a very smooth and symmetric object

103

made from a superquadric model. Although occluded, 232 (out of 240) trials are able
to converge to global minima when both silhouette and internal edges were used in
the ﬁtting process. The lack of internal edges in the ﬁtting process does not decrease
the number of convergence to global minima signiﬁcantly. This is because that, unlike
the ellipsoid and the cup, the squash is not symmetric about the rotation axis and
also the occlusion helps to break the symmetry of the shape.

It is observed in Section 3.3.3 that when objects are rotated, their silhouettes
may vary little while their internal edges may move signiﬁcantly in the image. Thus,
including the internal edges in the ﬁtting process should be helpful in determining
the object orientation. We have demonstrated in the above three experiments that
the pose parameters obtained by ﬁtting internal and boundary edges simultaneously
are much more accurate than by ﬁtting the boundary edges alone. Experiments con-
ducted on the other three objects, squirrel, taurusl and phone13 as in Figures 3.12(d),
(e) and (f), respectively, also reﬂect this property that incorporating internal edges in
the ﬁtting process can help improve the accuracy of pose parameters (see Table 3.4).
Since these three objects are not symmetric about the rotation axis, the convergence
rate to global minima does not deteriorate as much as that for the symmetric objects.

Table 3.5 lists the number of trials converging (anv), the average iteration (Iavg),
the average MSE (EM), and the number of trials converging to global minima (ﬁlw —
£2] E Eb). We can make the following observation from Table 3.5: (i) all trials
converge regardless of the value of Et, indicating the basin of convergence is broad.
(ii) The number of average iterations decreases as Et becomes more tolerant. This

attributes to the fact that when the MSE error threshold Et becomes more tolerant,

104

Table 3.5: Convergence and Pose Accuracy vs. Et.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Test object: taurusl (in Figure 3.12(e)).
Number of trials: 240
Imax: 30 iterations; Eb is as deﬁned in Eq. (3.9)
Et Silhouette and Internal Edges Silhouette Only
Neonv [avg Eavg lllw '- (Di 6 Eb Neonv [avg Eavg lllw "' 0| E Eb
1.5 240 5.5 1.3 33 240 4.4 1.3 29
1.4 240 6.4 1.2 52 240 5.2 1.2 33
1.3 240 7.4 1.2 86 240 6.2 1.2 44
1.2 240 8.3 1.1 113 240 7.3 1.1 60
1.1 240 9.6 1.0 145 240 8.7 1.0 92
1.0 240 10.8 0.9 182 240 9.3 0.9 134

 

 

more and more local minima attractors become effective, and the ﬁtting process is
terminated once it gets stuck in these local minima. This fact also contributes to the
following two observations: (iii) the average MSE increases as Et increases; and (iv)
the number of trials converging to global minima declines as Et is more tolerant. All

these results are consistent with the theoretical discussion given in Section 3.3.4.

As an example to see the effect of occlusion on convergence, we manually occluded
a squirrel by 0%, 15%, 35%, and 50% respectively as seen in Figure 3.13. Table 3.6
shows that for these cases all 240 trials converged, indicating the basin of convergence
is still broad even under severe object occlusion. Table 3.6 also shows that for these
cases the number of trials converging to global minima decreases as the percentage
of the occlusion increases. This is due to the limited information provided by the
unoccluded portion of the object; less constraints enforced on the ﬁtting lead to more
local minima attractors which cause fast convergence but deteriorate the performance

on the number of trials converging to global minima.

105

 

(c) occluded 35% (d) occluded 50%

Figure 3.13: Objects for studying the effect of occlusion.

106

Table 3.6: Convergence and Error Analysis: Occlusion Effect.

 

 

 

 

 

 

 

Test object: squirrel (in Figure 3.13)
Number of trials: 240
Et = 1.0 pixels; Imax = 30 iterations; Eb is as deﬁned in Eq. (3.9)
Occlusion Silhouette and Internal Edges

Percentage(%) Avg Iteration [1 (Error < Et) ﬂlw — (12] 6 Eb

0 14 240 234

15 12 240 228

35 11 240 173

50 9 240 108

 

 

 

 

 

 

3.5 Summary

In this chapter, a method for locating 3D objects with arbitrary curved surfaces from
a single 2D intensity image is outlined. Given an input image and a candidate object
model and aspect, the method will verify whether or not the object is present and
if so, report pose parameters. The modeling technique of Basri and Ullman [9], as
described in Chapter 2 and the alignment method of Lowe [77] are combined together
to produce a new technique for handling curved 3D objects. The model allows an
object edgemap to be predicted from pose parameters. Pose is computed via an
iterative search for the best pose parameters. Heuristics are used so that alignment
can succeed in the presence of occlusion and artifact and without resorting to use of

corresponding salient feature points.

Alignment experiments with both real and synthetic objects demonstrate a high
rate of convergence for a broad set of starting orientations in the same aspect. These
results are in agreement with those reported by Lowe [76]. We conclude that large

numbers of ﬁne aspects are not needed for modeling. The experimental results also

107

demonstrate that smooth objects can be handled and some occlusion can be tolerated.
Our experiments with superquadrics indicate that the alignment method can be used
with most conventional CAD systems.

Since the Newton’s method with Levenberg-Marquardt minimization is used to
iteratively diminish the error between the observed edgemap and the ﬁtted edgemap,
sensitivity of a parameter to the error influences the goodness of the ﬁt of the param-
eter. Normally, the more sensitive the parameter is, the more accurate the param-
eter ﬁt would be. It is obvious that the scaling factor 3 and the translation vector
(tx, ty) are very sensitive because a small perturbation in any of these three param-
eters would create a signiﬁcant error between the ﬁtted edgemap and the observed
edgemap. Thus, as shown in our experimental results, the estimates of these three
parameters tend to be more accurate than any others.

Finally, we have shown mathematically that aligning silhouettes alone provides
inaccurate pose parameters for rotation for objects which are somewhat symmetric
about the rotation axis. The same analysis shows that the rotation parameters are
much more sensitive to location of internal edges in the images. This principle is
supported by our experimental results. Thus, we conclude that including internal
edges in alignment is essential for an accurate pose estimation of many curved objects.
This result is general in the sense that it can be used by other published iterative
methods to reduce inefﬁcient wandering in the parameter space as well as inaccurate

orientation in convergence results.

Chapter 4

Indexing to Model Aspects

4.1 Introduction

Model-based object recognition requires that sensed data from an object be matched
to pre—stored model data. In the alignment paradigm, minimal sets of features are
used to form correspondence hypotheses between an image and a pre—stored object
model. For each such correspondence, a transformation is computed which brings
the model features into alignment with the image features, and model presence is
veriﬁed by back-projecting other model features into the image and searching for
image features at those locations. The entire process can be very expensive due to
the possible combinatoric correspondence of feature sets for generating hypotheses
and the exhaustive search at each veriﬁcation.

The most straightforward way for object recognition from a model database is by
linear search. That is, one can apply the alignment paradigm to each object in the
database in a sequential fashion, and the object model that best aligns the observed

108

109

image features is recognized. With this search strategy, the recognition time increases
linearly with the number of possible object models in the database. As the model
database size grows, the linear search strategy decreases in viability, some better
means of selecting object models are needed.

Indexing is one way to alleviate the computational burden. It is a two—stage
process. At compile time, feature sets derived from object models are encoded as
indexing vectors corresponding to entries in some appropriate data structure, such as
a hash table or tree, with pointers back to the corresponding models. This provides
the essential mechanism for indexing and fast retrieval. At run-time, feature sets
derived from the image form index vectors for fast access of pre—stored models. Thus,
correspondence hypotheses are recovered without resorting to comparison of all pairs
of model / image feature sets.

Although correspondence hypotheses can be quickly recovered through indexing,
there may still exist some ambiguity among the indexed hypotheses. This situation
arises when a single index vector derived from the sensed image retrieves many com—
peting match hypotheses. This problem can be overcome by providing the indexing
scheme with some additional information, such as the probabilistic peaking effect
used in [87] or the indexing function used in [10], to determine which hypothesis is
more likely to give a correct interpretation for the data. This additional informa-
tion not only helps in grouping hypotheses into consistent clusters, but also provides
a mechanism to rank the match hypotheses before sending them to the veriﬁcation
stage. These rankings will in general mean that correct interpretations are likely to

be veriﬁed ﬁrst, resulting in an improved over-all efﬁciency in the recognition process.

110

Our approach toward the indexing task within the alignment paradigm is as fol-
lows. As described in Chapter 2, we model a 3D object by a collection of 2%D views
(called model aspects) made from 5 local images taken by rotating the viewpoint up,
down, left and right of a central vieWpoint (see Figure 2.2). The 3D information is
made from the central edgemap of each model aspect and is used for pose estima-
tion. The silhouette of the central image is extracted and segmented into parts whose
invariant features are derived for indexing. Thus, each model aspect consists of a
2%D edgemap for pose estimation (i.e., veriﬁcation) and part invariant features for
indexing. Both model building and index building phases can be processed off-line.
During the recognition phase, we ﬁrst extract the part invariant features from the
sensed image to index into the model database for candidate models and poses. We
then apply some heuristics to determine the order of the hypotheses for veriﬁcation.
Finally, veriﬁcation is carried out by ﬁtting the 2%D edgemap of each candidate model
to the observed edgemap (i.e., pose estimation), as done in Chapter 3. The result
of the ﬁtting is then used to refute or accept the model hypothesis. Once a correct
model is found, the veriﬁcation procedure is terminated. A diagram of the proposed
indexing scheme is given in Figure 4.1.

This chapter is organized with the next section reviewing some existing techniques
for indexing 3D models from 2D images. Section 4.3 presents the general framework
for the proposed indexing scheme. Section 4.4 shows experimental results. The last

section summarizes the proposed approach.

 

 

Hypotheses Generation via Indexing

1
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
I

J

 

Hash Table

 

 

CodeG’i) —"
<Pj. Mm) <Pk. Mn)....

 

 

 

 

 

 

 

 

Rankl: Mm —’ (111113.117)
m2.- Ml —’ 012.115.118.11”)

 

] Model Ordering Table ]

 

 

 

o: modeII-In —’ Some
mdeIMm—rejected

 

 

 

 

,
[

[Pi-<Pj.Mm> (=H1)]
[Pi-(RIM) (=H2) ]

 

 

 

 

 

 

 

 

 

 

Part
Segmentation

E

n

c

Invariant °

Extraction '

P, n

t
l Aspect Model l
[gm/l

r— a

[

Cluster Table

 

 

 

 

O
O
O
L J
‘ ﬁ W
Corresponienee
Table

 

Hypotheses le—’ (HI.H3.H7)(H6)(H|L] Ode-m

 

 

 

Voting [ Mn —- (HLH5.H8.H12)(H4.H9) ] m

 

 

 

 

 

 

 

 

ﬁrm —» HI. 113.116. 117...]

 

 

[Mn - 112.114. 115. 118.... ]

 

 

 

Figure 4.1: The computational paradigm of the proposed indexing scheme.

.------------------..-----------------------------------------------------

112
4.2 Related Background

We are interested in indexing to 3D models from single 2D intensity images. Index-
ing techniques are becoming widespread in the ﬁeld of computer vision for solving
the recognition problem and the model / sensory feature correspondence problem. In-
dexing involves extracting an over-constraining indexing coordinates from a group of
image features and then interpreting these features through a pre—compiled lookup
table [115]. Much work in the past on indexing has dealt with the easier problems
of 2D models (e.g., [101, 65]), flat 3D models (e.g., [72, 96]), or 3D models in which
more information is available, such as with range data (e.g., [43, 102]), or multiple

images [81].

When compiling an index, the most efﬁcient storage possible can be achieved by
using view-invariant shape descriptors that are unaffected by object pose, perspective
projection, and the intrinsic parameters of the camera [44]. For each underlying
model feature set, only a single index vector is sufficient to identify the object. Thus,
view—invariant descriptors are ideal candidates to be used to directly index into the
model database for fast object identiﬁcation. However, it has been shown [22, 33]
that there are no general-case view-invariant features for any number of 3D points
under common of projection model (perspective, weak perspective, or parallel). As
a result, most work in indexing using invariant descriptors has been restricted to 2D
objects [23, 101] or 3D planar objects [72, 85, 96], or to some special-case invariants,
which are only view invariant under special conﬁgurations of 3D points or set of

views [22, 44, 110]. For example, assuming that the object points are planar and

113

the projection is approximated by the weak perspective projection model, Lamdan
[72] was able to extract affine coordinates which are invariant to view. Burns et al.
[22] have deﬁned and discussed afﬁne coordinates and other special-case invariants
under weak perspective projection model. Forsyth et al. [44] have also assumed
a planar surface and have shown that an invariant, depending on the coefficients
of the conics, can be obtained by ﬁtting two conics to two image curves projected
from this surface. Thus, as pointed out in [22], “invariants that do not require any
restrictions are impossible, and recognition systems using unrestricted invariants can
not be expected to be as effective as those using special-case schemes”.

One way to circumvent the above problem is to represent a 3D object by multiple
2D views. Each view is treated as a model, then the indexing problem is reduced to
the problem of indexing a 2D object using 2D invariants, which have been successfully
used by many researchers [19, 33, 61, 109]. For example, Clemens and Jacobs [33]
grouped four point features as a structure for indexing, three of which were used to
generate the object pose and one of which was used to check if the pose consistency
is preserved. Groups that violate the pose constraint were discarded, resulting in a
considerable amount of reduction in time and space in indexing. Jacobs [61] provided
a more space-efﬁcient construction reducing 2D sheets to two 1D lines, each embedded
in a 2D space. Thompson and Mundy [109] grouped pairs of vertices to retrieve
hypothetical viewing transformations for their models. Due to the lack of speciﬁcity
in the indices, a voting scheme was used to integrate the noisy pieces of information
into meaningful hypotheses.

Breuel [19] argues that to effectively support indexing, one needs a representation

114

of features that is invariant under 2D afﬁne transformation (translation, rotation
and scaling), under occlusion of features, and under the addition of spurious features.
However, using this kind of indexing scheme for 3D object exacerbates the recognition
time by vastly expanding the model database. This raises the issue of determining
how many distinctive views of an object model should be stored in the database.
The aforementioned methods discretize the indices and ﬁll the hash table with
the quantized values. Attempting to cover all possible views of a feature set by
ﬁlling in a discrete look-up table is a formidable task, especially in high-dimensional
indexing spaces. One approach to this problem is by tessellating the viewsphere into
a large ﬁxed number of views and then integrating neighboring views which share
same feature sets into aspects [36]. Another promising approach is by using some
linterpolating functions, such as Radial Basis Functions (RBF) proposed by Poggio
and Edelman [90], to approximate the regions between sample views. Each distinct
interpolated feature set, covering a small region of views, is ﬁlled into the discrete
look-up table. In this way, a full range views of an object can be modeled from
a relatively small number of views. Beis and Lowe [10] used a learning method to
generate indexing functions which interpolate between the index vectors sampled at
various views about the stored models. The indexing functions provide probability
distributions describing the possible interpretations of each index value. The resulting

probabilities are used to rank the match hypotheses for veriﬁcation.

115
4.3 General Framework for Indexing

In this section, we develop the framework for the proposed indexing scheme (see Fig-
ure 4.1). From the original problem assumptions, the following limits apply. First,
the partitioning rule, proposed by Hoffman and Richards [54], is used to segment
2D shape into parts. Second, part invariant features are derived for indexing; the
indexing is carried out via hashing. Third, parts stemming from the same model
are integrated together to form groups of consistent hypotheses. Finally, four vot-
ing schemes are proposed to rank the hypotheses in order to reduce the number of

veriﬁcations required during the recognition.

4.3.1 Part Segmentation

The notion of recognition based on “parts” has become increasingly popular. Perhaps
the most compelling support for this idea is based on recognition in the presence of
occlusion. Partial occlusion of the object renders global shape descriptors ineffective
for indexing or recognition. In the alignment paradigm, veriﬁcation or refutation of
a candidate model (i.e., pose estimation) is done in a global manner which considers
portions of model features and portions of sensed features. Occluding objects and
artifacts in feature extraction may degrade the match value, but the alignment method
is still effective because fragments of matching data can be integrated into a single
global model-pose hypothesis. The indexing phase, however, must operate without
a hypothesis. Partitioning a sensed shape into parts seems inevitable for this phase

of object recognition. We use the central silhouette of an aspect model to index

116

the aspect, thus, views within the aspect should have more or less same indexing
features. The success of the indexing phase counts on a reliable partitioning scheme
that should be invariant to similarity transformation, robust to local deformation,
and stable with global transformation. A signiﬁcant aspect of a partitioning scheme
is the way decomposed parts behave under various changes that may arise in the
visual projection. First, changes in shape due to translation, rotation, and scaling of
the 2D shape must not effect the part structure within the shape. Second, changes
are localized to certain portions of the shape. It is natural to demand that local
deformations of the shape should not effect those parts remote to the change. Third,
changes arising from global deformations, such as changes in viewing direction or
viewpoint within a model aspect, or deformation of the boundary due to noise and
spurious data, should only have a small effect on the 2D shape and its part structure.

How should parts of a shape be computed? Several approaches in the past suggest
either the decomposition of its interior region or the segmentation of its boundary.
Region-based parts include a description based on the combination of primitives such
as generalized cylinders [14, 79], and superquadrics [5, 89], or the neck—based and
limb-based description proposed in [99]. In contrast, contour-based segmentations
partition the boundary at zero-crossings of curvature [83] or at high-curvature points
[45]. Primitive-based boundary partitioning schemes approximate the boundary as
a combination of primitives such as constant curvature segments [119] or using a
curvature primal sketch (e.g., cranks, bends, bumps, and ends) [3]. A signiﬁcant
departure from the traditional boundary-based techniques is the method proposed

by Hoffman and Richards [53] who advocate that part decomposition should precede

 

 

Figure 4.2: Hoffman and Richard’s theory of curve partitioning: joining parts gener—
ally provides concavities in the resulting silhouette.

part description. In contrast to primitive—based approaches, they propose a theory of
parts which depends not on the shape of parts, as described by primitives, but rather
on general principles underlying their formations or “regularities of nature”. The
transversality principle is an example of such a regularity, asserting that “when two
3D entities are combined together to form a composite object, concavities are created
at the join”. This principle provides very general rules for segmenting either 2D or 3D
objects into parts [54]. Part boundaries are naturally delimited by the concavities. In
the case of 2D plane curves or silhouettes, these concavities, as indicated by the small
arrows in Figure 4.2, appear as cusps and occur at minima of negative curvature.

Thus, our rule for partitioning 2D shapes is as follows:

 

Partitioning Rule: segment a curve at concave cusps (or minima of

 

negative curvature) to break the shape into its parts.

 

 

This partitioning rule leads to a representation of the shape boundary based on
codons [53]. Codons are formed by partitioning curves at minima of curvature; the

maxima and zeros of curvature are used to describe the shape of each segment, result-

 

Max+ Max+ Max+
Min- Min+ Min+ Min- Min- Min-
0+ 0' 1“ 1' 2

Figure 4.3: The primitive codon types proposed by Hoffman and Richard.

ing in ﬁve types of codon: 0+, 0', 1+, 1‘, and 2, shown in Figure 4.3, where maxima
of curvature are shown as C] and labeled Max+ if positive or Max- if negative, minima
of curvatures as a and labeled as Min+ if positive or Min- if negative, and zero of
curvatures as O; the arrow shows the traversal of the direction; if the direction of
rotation is counterclockwise, the curvature is positive; otherwise, it is negative; the
ﬁgure is placed to the left of the direction of traversal, and the background is to
the right. There is one more codon that describes the degenerate case of a straight
line, i.e., segment with an inﬁnite number of points of zero curvature. Rosin [94] has
extended the ﬁve standard codon descriptors to include all cases where a codon is

neighbored by a straight segment or the case where the curve is not necessarily closed.

The codon representation is very sensitive to noise due to the computation of cur-
vature, which can cause many spurious curvature extrema. To suppress noise and keep
gross shape, the curves should be smoothed when computing curvature. Mokhtarian
and Mackworth have shown that cusps are formed when there is a self-intersecting
loop on the contour, and the curve is smoothed with a large-scale Gaussian ﬁlter
[82]. Fortunately, a self-intersecting loop cannot occur on the object silhouette, and

thus, no phantom part boundary would be created by curve smoothing. However,

119

the codon representation is still unstable, preventing it from being a reliable part

descriptor by itself.

PT 115

Pl

(a) (b)

Figure 4.4: Segmentation of object silhouette into parts. (a) Boundary of squirrel
carving segmented into 7 parts. (b) Boundary of an automobile (Taurus) segmented
into 4 parts.

Examples

We illustrate our part segmentation scheme on several examples of man—made and
sculptured objects. We use a Gaussian ﬁlter with a kernel width of 55 pixels to smooth
a curve before the part segmentation scheme is applied to the curve. The results of
the segmentation scheme applied to the silhouettes of a 3D wood carving of a squirrel
and a model car are shown in Figure 4.4. The segmentation of the car in Figure 4.4(b)
might be considered by some to be unnatural. Were the wheels extracted along with
the body, the situation would change; however, we have had difﬁculty in reliably
extracting the wheels. In theory, the part representation is immune to rotation,

translation, and scaling. Although occasionally the representation may be affected by

120

noise and quantization (i.e., some parts may be slightly distorted), the segmentation
technique provides appropriate representations in practice. Moreover, because the
computation of curvature is local, occlusion of some parts of the object does not affect
other parts that are within a small distance from the occlusion. More examples are
given in Figure 4.5 to demonstrate the generality of our part segmentation algorithm;

the decompositions are natural for a variety of objects.

Figure 4.5: Examples of part segmentation on a variety of shapes.
Figure 4.6 demonstrates the robustness of the part segmentation scheme in the
presence of occlusion. Portions of occluded and unoccluded versions of the objects

with identical shapes have identical parts, as required by the robustness criterion.

Figure 4.7 illustrates the robustness of the process under noise and scaling. The top

row of Figure 4.7 shows the contours scaled 100%,75%,50%, and 25% from left to

121

gr ae-m «as
Qﬁhﬁir

(e) (s) (11)

Figure 4.6: Part evidence survives occlusion. (a)—(d) Parts extracted from silhouette
images of single bear, elephant, deer, and moose, respectively. (e)—(h) Thicker lines
indicate parts used for generating hypotheses for bear, elephant, deer, and moose,
respectively.

 

M
M

Figure 4.7: Part segmentation survives noise and scaling. The black dots indicate the
boundaries of segmented parts.

 

 

 

122

right, respectively. Gaussian noise with 0 mean and 2 pixels standard deviation was

added to the contours. The segmentation remains stable as shown in the bottom row

of Figure 4.7.

Figure 4.8 illustrates the robustness of the part segmentation scheme under global
transformation due to view variations. The segmented parts of two sculptured objects,
pencil sharpener and phone, are shown in Figure 4.8 (a) and (b) respectively. The
extreme views of these two model aspects (i.e., the object is rotated 10° around the
vertical axis to the left) were used for part segmentation to generate parts as shown
in Figure 4.8 (c) and (d). This is the worst case within a model aspect. The parts
near the pencil sharpener handle are distorted due to self—occlusion while the rest of
parts remain intact. Although the part segmentation scheme produces exact parts for
the phone, their shapes seem to vary a little. This means that the indexing scheme
must compensate for the part variation by loosing the tolerances of indexing features
extracted from parts. In Figure 4.8(e) and (f), we show the segmented contours of
(c) and ((1) under 2D transformation (rotation and translation). Figure 4.8 indicates
that our part representation is invariant under 2D transformation but slightly variant
under 3D transformation due to self-occlusion and the change of views (however, some

of the parts retain similar structure).

4.3.2 Part Indexing

Some important design criteria for an indexing scheme are: (1) the indexing prim-

itives should be invariant under translation, rotation and uniform scaling; (2) the

123

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(C) (d)

 

 

 

 

 

 

 

 

(C) (0

Figure 4.8: Part segmentation survives 2D transformation and 3D view distortion.

124

indexing scheme should be robust under partial occlusion of the object; (3) the in-
dexing scheme should be able to handle spurious and noisy data. All these three
criteria are met by our part segmentation scheme. Under this segmentation scheme,
object boundaries can be reliably segmented into parts that are generally invariant
to rotation, translation, and scaling. In addition, the degradation of the represen-
tation due to occlusion is roughly proportional to the amount of occlusion. In this
section, we derive invariant quantitative attributes from the parts and show how these

attributes can be used for indexing.

Indexing Attributes and Invariants

Given a part (curve) C, let p; = (x,,y;),i = 0,1,-- - , N — 1 be the points of C. If C
is not closed, then it can be closed by appending the ﬁrst point (pN 2 p0) so that
C encloses a 2D region. This allows us to compute region attributes from C. We
deﬁne the following attributes to encode the part: assuming that no portion of C is

occluded, these attributes satisfy the criteria for indexing.

Compactness: Let A and P be the enclosed area and perimeter of C respectively.

The compactness of the part, <(C), is deﬁned as P2/A where

N—l

P = 2 \/($1- 115.41)2 + (311' — 31:41)2

i=0

 

”-1 171' 311'

1
71:5:

$i+1 yi+1

125

where ] - ] denotes the determinant. The quantity c has been shown to be invariant

to translation, rotation, and scaling [6, 74].

Roundness: the roundness of C, T(C), is the sum of the absolute curvatures
along C, i.e., 25:31 ]K,] where K,- is the curvature at point p;. The quantity 7' can be
shown to be invariant to scaling, and the curvature K. is also known to be invariant

to rotation and translation.
Skewness: Let pN/g be the midpoint of the part and pm the midpoint of the line
connecting p0 and pN_1. The skewness, of the part, p(C) is deﬁned as

p(C) : CO —1 POP-N—l ' PmﬁN/z
IIPoP'iv—illllpm5N/2H

 

Moment invariants: modiﬁed moments using only the curve boundary were

deﬁned in [26] as follows

mpq=L$quds, for p,q:0,l,2,...

 

where f0 is a line integral along the curve C and ds = \/(dx)2 + (dy)2. The modiﬁed

central moments are thus similarly deﬁned as

upq=/C(IC-i)”(y-12)qu, (4.1)

126

where 5: = mlo/moo and g = mar/moo- For a digital curve C, Eq. (4.1) becomes

1in = Z (x — 573)“?! _ glq- (42)

(23.1060

Normalizing the central moments, we obtain

77 _ ”W
W _ p+q+1
#00

 

forp+q=2,3,---

These normalized central moments are used to derive the following:

(151 = 7720 + 7702

9252 = (7120 “ 7702)? + 477i1

453 = (7730 — 3771202 + (7703 — 37721)2

(254 = (7730 + 7712)) + (7703 + 7721)2

¢5 = (7230 — 3771200730 + 1712)l(7730 + 1712)2 — 3(7721 + 7703)2l + (31721 — 7703)
X0721 + 7703)l3('730 + '712)2 - (7721 + 770:3)2l

d)e = (7720 — 7702)[(7730 + 1712)” - (1721 + 7703)2l + 471110730 + 7712)('721 + 1703)

¢7 = (37721 — 1703)(7730 + 7712)l(7730 + 1712)2 " 3(7721 + 7703)2l + (37712 — 7730)

X0721 + 7103)[3(7730 + 7112)2 — (7721 + 7703)2l-

Hu [58] showed that the quantities, 45;,1 S i S 7, are invariant to translation, rota-
tion, and scaling. Since the quantities 45,-,2 S i S 7 are very small and have little

discriminating power, we only use 051 for indexing.

127

Convexity: The codon representation is susceptible to spurious extrema resulting
from noise, preventing it from being a reliable part descriptor by itself. However, it
still can be used to determine whether a part is concave or convex: let 0(C) denote
this shape property of a part C. Obviously, a can provide some discriminating power
for part recognition.

We encode each part C using the above attributes quantized in some appropriate
manner. All encoded parts serve as indexing primitives to search a hash table for
model hypotheses. The endpoints of each part, p0 and pN_1, encode the pose of the
part. A third point, such as the point of maximum absolute curvature encodes how
the contour lies relative to the endpoints. Note that for each pair of the corresponding
scene part and model part, these three point correspondences is sufﬁcient to determine
the scale factor of the object size and the roll angle of the image plane as mentioned
in Chapters 2 and 3. If more than one pair exists, the roll angle can be derived in a
least—square error minimization fashion. The aforementioned (variant) attributes can
be used in grouping consistent hypotheses (indexed from parts) into clusters and also
provide an approximate alignment between a sensed part and a model part similar
to it, and thus provide an initial pose hypothesis for the pose estimation algorithm
(i.e., the Newton’s method with Levenberg-Marquardt minimization) which searches

for a global alignment using all parts.

Sketch of Indexing Scheme

Our indexing scheme discretizes the indices and ﬁlls the hash table with quantized

values. The advantage is that the run-time indexing step can be done in constant

128

time. The disadvantage is that the performance of indexing may be degraded when
many part hypotheses collide in one hash bin. This arises due to the standard way
these methods have dealt with noisy images. During table construction, an entry is
made in each hash bin within an error radius 6 of the actual stored model index value.
Then all relevant hypotheses can still be accessed by looking in just one bin. The
number of collisions in a bin depends on the degree of speciﬁcity of the bin. More
speciﬁcity means ﬁner bin divisions, but the probability of false dismissal increases
accordingly. On the other hand, less speciﬁcity means coarse bin divisions, and the
probability of false alarm increases. The only way to deal with this noise issue is,
for each part indexing attribute, to search the c-volume of index space around the
index (i.e., range search) at run-time, and then to form the union of the retrieved
hypotheses. This certainly reduces the original advantage of the approach. The sketch
of our indexing scheme using range search is given next (see Figure 4.1 for clarity).
Each 2%D edgemap of a model aspect requires a few kilobytes of storage while
the attributes of each part indexing to the model aspect requires a few dozen bytes.
Each model part results in one record being inserted into the hash table. Using the
parts from the sensed image, the index step produces a set of hypotheses for those
models needing to be veriﬁed. The hypotheses are then grouped together based on
their model candidates. A good indexing scheme should index into a set of N models
in 0(logN) time or better, provided that there are no more than logN relevant
hypotheses because the system cannot afford to refute too many hypotheses. Given
part representation (a, c, 7', p, (111), the indexing scheme performs a range search using

(a, c :1: cc, 7' :l: 6,, p :1: ep, (b1 i 5,7,, ), where e, is the error tolerance of attribute x, for

129

the parts having attributes within this range in a database. The tolerance allows for
errors due to quantization, noise, and view variations of shape within a model aspect
(recall each model aspect is represented by the attributes extracted from the central
viewpoint of that aspect). The range search can be implemented using a k — d tree
structure with a worst-case running time of 0(k . Nl‘l/k) [95]. We implement the
range search using a hashing scheme. For each part attribute, we encode its range and
then use each encoded index within this range to hash into an open hash table with
b buckets. The average-case search time would be 0(N / b) for each part. Making b a

signiﬁcant fraction of N brings the search time down to a constant for most searches.

4.3.3 Hypothesis Grouping

For indexing to be effective, it is important that some data-driven grouping mecha-
nism produce sets of parts likely to come from a single object [75]. Multiple scene
parts drawn from the same observed object should yield similar relative orientation
with parts bound to the same model candidate (e.g. poses should cluster). Scene
parts, drawn from different objects but indexed to the same model candidate or
drawn from the same object but indexed to the same wrong model candidate, are not
likely to have consistent orientations relative to their corresponding parts in the model
candidate. Thus, the correct model candidate will tend to have one large group of
hypotheses while the wrong model candidates will tend to have several small groups
of hypotheses. Consequently, when hypotheses ordering is enforced on the groups of

consistent hypotheses, the correct model candidate will tend to have a better rank

130

(rank is deﬁned as the order of the model candidates in the veriﬁcation phase).

Let {H : p,- —+ (p[,Mm)} be the hypothesis that the scene part p,- corresponds
to a part p:- in the model candidate Mm. Let ’H = {H1,H2,---,H,,} be the n
hypotheses generated from the indexing of all the scene parts. We divide these 72
hypotheses according to the models to which the scene parts index and store them
into a correspondence table where the model Mm serves as a key and the hypotheses
Hm = {Hm,,-~,Hm,,} (with 71", E ’H) as entries. For each model candidate Mm
with a hypothesis list 71",, we group the consistent hypotheses based on the relative
orientation between the two corresponding parts within a hypothesis as follows. Let
(x1, y1) and (xn, yn) be the two end points of a part p. The orientation of p is deﬁned
by the angle 9 = tan-1%. Let (H,- : (p,- -—> (p[,Mm)} and {Hj : (p; —+ (pg, Mm)}
be two hypotheses from the hypothesis list of the same model candidate with the
orientations, 0;, 0:, 0,, and 0;- for parts pg, p], p], and pf,- respectively. Then two

hypotheses H,- and H ,- are said to be consistent if

K91 - 91') - (92 - 9;)l S Co, (43)

where 69 denotes some bounds on the tolerance of orientation difference. This con-
sistency grouping of hypotheses may seem to be quite complex because we need to
check every hypothesis against all others, but in fact, we do not need to do so. We
can rearrange Eq.(4.3) as ](9,- - 0:) — (0, — 0;)] S 69 and quantize the difference of
orientation using 269 as a unit into cells in the range of [—271,27r], then, H,- and H j

are said to be consistent if 0,- — 0] and 0,- — 0;- fall into the same cell. In this way,

131

hypotheses in the same cells are grouped together. Thus, hypotheses in 'Hm for model

M... are divided into some disjoint groups Hm, - - - ,Hm, with Hm, 0 71m, = 0,Vi 74 j.

4.3.4 Hypothesis Ordering

There are a few problems associated with our indexing scheme described above, espe-
cially when a large database is considered. The model hypotheses are non-uniformly
distributed in the hash table, resulting in some entries with a large number of model
hypotheses. This situation occurs frequently when the database contains many ob-
jects with similar parts. To minimize the recognition time, our indexing scheme should
exploit prior knowledge to reduce the number of veriﬁcations during the recognition
because the veriﬁcation procedure is computationally expensive. This is accomplished
by verifying the most likely model hypothesis ﬁrst and then terminating the veriﬁca-
tion procedure once a correct model is identiﬁed. Thus, our indexing scheme relies
heavily on the voting scheme for hypothesis ordering to reduce the recognition time.
Four voting schemes are studied below to make efﬁcient use of prior knowledge for

hypothesis ordering.

Majority Voting

Majority voting is the most common voting scheme: the votes are accumulated for
model hypotheses on the basis of the presence of their parts in the scene. The basic
idea behind this voting scheme is based on the fact that multiple parts from the
same scene object should support the same model candidate. Thus, if we accumulate

votes from all scene parts and rank the model hypotheses according to the total

132

votes received, then the model hypotheses with the largest number of votes are good

starting points for the veriﬁcation process.

Let M = {M1,M2,---,MN} be a set of all models in the database, and ’P =
{p1,p2, - . - ,pm} a collection of scene parts. Let M’H = {(p;, Mk),i = 1,- - - ,m,k =
i1, - - - , in} represent the model-hypothesis list retrieved from the hash table by every
scene part p,- E 'P where Mk, I: = i1, - - ~ , in are model candidates. Deﬁne a character-
istic function 6(p, M) as

lifpisapartofM
5(p,M)=

0 otherwise
where p is the part identiﬁer and M the model identiﬁer. Then the number of votes

that model M receives is

12.....(M): £360.. M).

where RMV(M) denotes the score of the model hypothesis M obtained via the ma-
jority voting scheme. The most likely candidate models M "' have the largest number

of votes M" = arg maxM RMV(M).

Bayesian Voting Using Part Occurrence

The prior knowledge incorporated in the majority voting scheme described above
assumes that each model candidate, in the model-hypothesis list indexed by a scene
part, has the same likelihood of being identiﬁed by that scene part. This assumption

may not reflect the real situation where some model candidates are more likely to

133

appear in the scene than others. Thus, given a scene part, the goal is to identify the
model candidate which most likely accounts for the presence of this scene part. To
achieve this goal, we adopt a decision-theoretic approach using a Bayesian framework
where the measure of the discriminatory power of a part for a model is deﬁned in

terms of posterior probability (similar to the Bayesian framework in [120]).

Let P(Mk) be the prior probability that model M1, is observed, and P(p,-]Mk) be
the likelihood function where P(p,-]M,-) > P(p,-]M,~) means that part p,- is more likely
to be observed when model M,- is seen than when model MJ- is seen. Let P(Mk]p;)
be the posterior probability that reflects the updated belief that model M), appears

in the scene after the part p,- is observed.

In order to derive the posterior probability P(Mk]p,-), we need to estimate prob-
abilities P(p,~]Mk) and P(Mk). If we have no other information, we assume 13(Mk)
is the same for every model in M. Once a certain part, pg, is decided to be used
for indexing for model candidates, we compute the likelihood function P(p,-|Mk) by

counting the number of occurrences of part p.- in the model Mk:

0(1)“ Mk)

P ,M 2
(pl 1‘) ZVprEMg 0(P5,Mk)

 

(4.4)

where 0(p,-, Mk) denotes the number of occurrence for a particular part p,- in model
Mk. Note that P(p,-]Mk) = 0 if p,- does not exist in the model Mk. Let M; E M be

the set of model candidates retrieved from the hash table by using p,- as the index.

134

The posterior probability P(Mk]p,-) is then computed as

P<Mk)P(p.IMk)

p(Mk|P.') = ~ ~ -
ZVMkEMl P(Mk)P(Pile)

 

(4.5)

This formula is used to determine the distribution of the posterior probabilities among

the models in M f indexed by 1),. That is

M; P(Mk]p,')“ — 10

MEM

Deﬁne P(Mk]p,-) as the indexing power ofp, for the model Mk. If E(Mk]p;) = 1.0, this
means that p, is unique to the model Mk. In other words, if a part detected in the
scene corresponds to p,, it is certain that the model M), is in the scene. On the other
hand, if the indexing power of a part is not 1.0, several models containing the part
may be in the scene. Suppose M,- and M ,- are two of the model candidates indexed
by pg, then [P(M,-]p,-) > P(Mj]p,)] indicates the belief that model object M,- is more
likely to appear in the scene. This indexing power is a decision-theoretic measure of
saliency of a part for identifying a model under the Bayesian framework whereas a
saliency measure based on population [43] is somewhat ad hoc. Once the posterior
probability P(Mk]p,-) is computed for every model candidate Mk 6 M f and every

scene part p,- E P, we can rank the model hypotheses M 6 M71 by

BMW) = i ems-160.. M) (4.6)

i=i

where RBVO(M) denotes the score of the model hypothesis M obtained via the

135

Bayesian voting scheme using part occurrence.

Bayesian Voting Using Visual Relevance

The estimated prior probability P(p,~]Mk) in Eq.(4.4) assumes that parts belonging to
the same model have equal probability (i.e., Vp, E Mk,i = l, - - - ,n, P(p,-]Mk) = ﬁ).
This estimation overlooks the fact that some parts may be more likely to identify the
object than other parts belonging to the same model. For example, a model M is
composed of two parts p1 and p2 with p1 occupying 10% of the contour and pg 90%.
Identifying p1 in the scene does not give strong belief that model M is present in the
scene; however, if p2 is observed in the scene, then we have much higher conﬁdence
that model M is present in the scene. Thus, we assign the prior probability to
parts of the same model based on their relative visual relevance. This measure uses

the length of the part normalized with respect to the size of the model. Then the

likelihood function in Eq.(4.4) is changed to

[’(pii Mk)
EVpﬁM; L(p.', Mk)

 

1301,1114.) = (4.7)

where 11(1):, M1,) denotes the length of part p,- occurring in model M1,. The posterior
probability P(M]p,-) is given by the same formula as in Eq.(4.5).

Note that the weights W(p,-) of scene parts p;,i = 1, . - - ,m were assigned equally
in BVO, i.e., W(p,-) = 1 / m, because parts were equally likely to occur in the scene,

and thus, were not considered in Eq.(4.6). Here we assign the weights of scene parts

based on their relative length, i.e., W(p,-) = Q(p,-)/Z:;1 Q(p,-) where Q(p,-) denotes

136

the length of pg. Then the ordering of model hypotheses is based on
RBWR(M) = Z 1°(Mlp:)W(pe)5(p.-, M) (4-8)
1:1

where Raw/R(M) denotes the score of the model hypothesis M obtained via the
Bayesian voting scheme using visual relevance. The justiﬁcation for incorporating
the weight of scene parts into the hypothesis ordering process is the same as the

explanation in the example given above.

The Hybrid Voting Scheme

The three aforementioned voting schemes are biased in ranking the model hypotheses.
The majority voting (MV) scheme tends to favor model candidates with many parts
because model candidates with fewer parts are less likely to accumulate the same
number of votes as model candidates with many parts. On the other hand, both
Bayesian voting schemes (BVO and BVVR) tend to favor model candidates with
fewer parts because the Bayesian approach will give higher posterior probabilities to
model candidates with fewer parts than model candidates with many parts. BVVR
is less biased than BVO, but a part with larger visual relevance has higher chance of
being distorted when the object is slightly transformed, resulting in a lower rank in
the correct model hypothesis. Thus, we need to design an unbiased voting scheme
for ordering the model hypotheses regardless of the number of parts in the model

candidates.

Let S (M, pg) represent the score for model M when pg is observed in the scene

137

and N (pg) the total number of model hypotheses indexed by pg. Let P(M]pg) be the
indexing power of pg as deﬁned in BVVR. Then S(M, pg) is computed as

ll-O + P(M|P1)N(Pi)l5(PbM) if 13(MIP1) < 1-0
5(M1Pi) = (49)

20.0 if P(M]pg) = 1.0
Note that P(M]pg) is the indexing power of pg for model M and when it is 1.0, it
indicates that model M is present in the scene, thus a very large score is given reﬂect-
ing this fact. Those model hypotheses with the indexing power P(M]pg) less than
the average m will have P(M]pg)N(pg) less than 1.0; in this case, MV outweighs
BVVR. On the other hand, those model hypotheses with the indexing power greater
than the average will have score higher than 1.0; then, BVVR outweighs MV. Thus,
models with more parts will have less indexing power on their parts and 5' (M, pg) is
scored mainly from MV, while models with less parts will have more indexing power
on their parts and S(M, pg) is scored mainly from BVVR. In this way, the voting
scheme balances the score between the models with various numbers of parts. Then

the ranking of the model candidates is assigned by

RHYBRID(M) = f: 5(M,Pi)

where RHYBRID(M) denotes the score of the model hypothesis M obtained via this

hybrid voting scheme.

138
4.4 Experimental Results

To demonstrate the viability of the proposed indexing scheme for 3D object recog-
nition, we intend to show the following: (1) the indexing scheme can provide a rea—
sonable set of model hypotheses for veriﬁcation/refutation; (2) model hypotheses
are pruned signiﬁcantly by the proposed voting schemes which are based on image
evidence and prior model constraints; and (3) the system can recognize partially—

occluded objects.

Experiments were conducted using a database of twenty 3D objects and eighty
2D objects with a total of 658 aspect models. Four sets of test data were used in

simulations:

0 Data Set [11: a copy of all model contour parts from the database (perfect data);

a Data Set [[2: 2D distortion on Data Set [11 (2D distortion includes 2D transfor-

mation and Gaussian noise added to the contour);

a Data Set [[3: 3D distortion on Data Set [[1 (an extreme view of each model
aspect was used) plus Gaussian noise 0 mean and 2 pixels standard deviation

added to the contour;

a Data Set [34: 2D distortion on test data set [13 (i.e., 2D plus 3D distortion on

perfect data which is similar to real data).

139
4.4.1 The Indexing Result

To evaluate the performance of our indexing scheme, we are interested in (1) how well
the indexing scheme prunes model-aspect hypotheses and (2) how well the ordering of
hypotheses limits the number of the veriﬁcations required. Four voting schemes, MV,
BVO, BVVR, and HYBRID as described in Section 4.3.4, were used to study how they
order the hypotheses. Table 4.1 shows some statistics obtained from these four voting

schemes where the attribute tolerance criteria:(eg, 6,, ep, em) = (10,0.1,10,0.01) was

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

used.
Table 4.1: Some statistics for Data Set [14 with 578 test objects.
Number Frequency Total Number of Voting Scheme’s Total Rank
of Parts Hypotheses Generated RMV RBVO RBVVR RHYBRID
1 15 167 89 89 89 89
2 35 12154 1243 294 834 714
3 64 21568 3057 744 1171 1533
4 103 37099 4199 2419 1977 1975
5 121 45741 3528 2342 2272 2150
6 107 47473 1271 2855 1841 877
7 72 34398 1314 2815 1636 964
8 38 18234 727 2020 1803 826
9 21 10643 293 1104 741 491
11 2 947 2 98 27 2
] Expected Value 395 (228424/578) T 27 ] 25J 21 L 16

 

 

We can make the following observations from Table 4.1: (i) The number of ex-
pected aspect-model hypotheses indexed (395) is relatively large compared to the
total number of aspect models (658) in the database; however, the expected rank, for
example, 16 for RHYBRID, is relatively small. This means than most of the aspect-

model hypotheses can be ruled out by ordering the retrieved hypotheses. (ii) For

140

objects with more than ﬁve parts, MV outperforms (i.e., has lowest total rank) the
Bayesian voting schemes, BVO and BVVR; for objects with less than ﬁve parts,
BVVR and BVO outperform MV. This result veriﬁes our hypothesis stated in the
Section 4.3.4 that MV tends to favor models with more parts, while BVO and BVVR
favors models with fewer parts. This phenomenon is also depicted in Figure 4.9. (iii)

HYBRID outperforms the other three voting schemes.

THE EFFECT OF PARTS ON VOTING SCHEME
45m I j I 1 I I f ﬂ

 

IIMVII '

4000
3500
3000
2500
2000

TOTAL RANKING

1500
1000
500

 

 

 

 

1 2 3 4 5 6 7 8 9 l 1
NUMBER OF PARTS

Figure 4.9: The effect of the number of object parts on voting schemes. Data Set [[4
and the attribute tolerance criteria (e,, e,, ep, 5451) = (10, 1,10, 0.01) were used.

Table 4.2 lists the number of average hypotheses generated (Nh), number of model
aspects in the database (Now), the reduction rate (R), deﬁned as (Ndb — Nh)/Ndb),
and the indexing percentile, deﬁned as (ng, — RHmeD)/(Ndb — 1). As indicated in
Table 4.1, the indexing attributes do not provide a good discriminating mechanism

(395 out 658 hypotheses were generated). This is because they are scale invariant.

141

If scale variant attributes, such as the area and perimeter of a part, in addition to
scale invariant attributes, were used in indexing, the aspect-model hypotheses could
be pruned signiﬁcantly as shown in Table 4.2 where the reduction rate increases
from 40% to 87.1% and the indexing percentile increases from 0.977 to 0.992. The
signiﬁcant reduction in the number of hypotheses passed to the veriﬁcation procedure

makes that indexing scheme more practical.

Table 4.2: The effect of using additional variant attributes in indexing.

 

Data Set [14 with 578 test objects was used
Attribute tolerance criteria (6,, e,, e,,, ed”) = (10,1,10, 0.01)

 

 

 

Indexing Scheme Ndb ] N), R (%) Rug/BR”) Indexing Percentile
Scale Invariant 658 395 40.0 16 0.977
Scale Variant & Invariant 658 85 87.1 6 0.992

 

 

 

 

 

 

 

 

 

To avoid those indexed hypotheses randomly clustered on aspect-models, we adopt
the pose consistency grouping on the retrieved hypotheses. Table 4.3 shows the total
ranking before and after the hypothesis grouping. For the HYBRID voting scheme,
the total ranking drops from 13579 to 9621 with a pruning rate of 29.1%. This
result indicates that the retrieved hypotheses should be grouped based on their pose
consistency before the hypothesis ordering to improve the recognition performance.

A thorough study of the performance of the voting schemes on four sets of test data
is given in Table 4.4. It can be seen from Table 4.4 that the tolerance on attributes
plays an important role in indexing: when the tolerance is loose, the number of
indexing failures decreases while the number of veriﬁcation required, as indicated

by the average rank, increases; and when the tolerance is tightened, the number of

142

Table 4.3: Hypothesis pruning via pose consistency grouping.

 

Data Set [[4 with 578 test objects was used
Attribute tolerance criteria (6,, 6,, ep, ed”) = (10,1,10, 0.01)
Hypothesis Voting Scheme MV BVO BVVR HYBRID
Total Ranking Before Pose Consistency 18112 15939 14294 13579
Total Ranking after Pose Consistency 15723 14780 12391 9621
Hypothesis Pruning Rate (%) 13.2 7.3 13.3 29.1

 

 

 

 

 

 

 

 

 

 

 

 

veriﬁcations required decreases but the number of indexing failures increases. The
choice of the attribute tolerance depends on the percentage of indexing failures (false
dismissal) a system allows. Table 4.4 also shows that HYBRID outperforms the
other three voting schemes in all four test data sets. A more detailed comparison is
depicted in Figure 4.10. These results suggest that the HYBRID voting scheme should
be used in ordering hypotheses to reduce the number of veriﬁcations required during
the recognition. The execution time for the indexing scheme to generate hypotheses

at run-time takes about 0.1 seconds on a SUN Sparc 20.

4.4.2 Recognition Results

We have shown the viability of the proposed indexing scheme for 3D object recogni-
tion through hypotheses grouping and ordering. The efﬁciency of the indexing scheme
can also be achieved if tighter tolerances of indexing attributes are used but at the
risk of more indexing failures, a situation which most recognition systems attempt
to avoid. Sixty test images containing objects with sculptured surfaces (half of them

containing partially occluded objects), were used to evaluate the performance of the

Table 4.4: Performance of hypothesis voting schemes.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ATCti Data Average Ranking Indexing
SCt [1 RMV Rsvo ] Revvn ] RHYBRID Failures
1 1.00 1.00 1.00 1.00 0
I 2 1.06 1.05 1.04 1.06 31
3 1.33 1.26 1.41 1.41 446
4 1.41 1.32 1.45 1.44 458
1 1.12 1.13 1.14 1.12 0
5 2 1.67 2.52 2.11 1.46 l
3 17.61 14.47 11.93 10.56 25
4 19.00 15.44 12.85 11.81 28
1 1.55 1.59 1.76 1.29 0
10 2 3.46 4.80 5.43 2.89 0
3 26.39 24.69 20.96 15.69 2
4 27.20 25.57 21.44 16.65 2
1 2.29 2.33 2.74 1.50 0
15 2 5.55 6.73 7.43 3.40 0
3 32.11 30.75 25.66 19.99 0
4 34.37 31.92 26.36 20.77 0
1 3.27 3.19 3.87 1.75 0
20 2 6.87 7.91 8.86 3.48 0
3 36.95 38.09 30.11 24.05 0
4 37.94 40.42 31.12 24.16 0

 

 

 

 

 

 

 

 

 

 

ATC denotes attribute tolerance criteria (eg, e.,, e,,, e¢,)

ATCﬁn is deﬁned as (n, 0.1n, n, 0.001n)

 

 

PERIORMANCE OF VARIOUS VUIING SCHEMES

 

AVERAGE RANKING

 

 

 

 

12 3 4 5 6 7 8 9101112131415161718191)
ammonium/1

(a)

m OF VARIOUS VO'IING SCHEMES

 

AVERAGE RANKING

 

 

 

 

0
I2 3 4 3 6 7 8 91011121314151617181920
A'I'I'RIBU'IETOIERANCECRHERIA

(C)

144

AVERAGE RANKING

AVERAGE RANKING

_
O

45

35

N w ‘ u 0 ‘3 on ‘O
V V I V V

WCE OF VARIOUS VOTING SCHEMES

 

T T I 1 V V I V V V I V V I V f V T

 

 

 

 

12 3 4 5 6 7 8 91011121314131617101920
ammm

WWVARIOUS VOTINGSC’HEMES

VVVVVVVVVVVVVV U ' V V

 

 

 

A L A A A A A j A L A A A

 

o A A A A A
I 23456789101112131415161718193)
ATI‘RIBU'IETOLERANCECRI’IEIIA

(d)

Figure 4.10: Comparison of the performance of hypothesis voting schemes. (a) Data
Set [[1 with 658 objects. (b) Data Set [[2 with 658 objects. (c) Data Set [[3 with 578
objects. ((1) Data Set [[4 with 578 objects.

145

recognition system. Figures 4.11 and 4.12 show a few of recognition results. The rest
of the results will be presented in the next chapter. Table 4.5 illustrates the perfor-
mance of the recognition system. Figures 4.11 and 4.12 show the test images after
indexing and recognition via alignment: the aligned model edgemaps are overlaid on
the input images. Table 4.5 gives details on the indexing and alignment steps. It can
be seen from Table 4.5 that the indexing step generates too many hypotheses (column
1) from most of the test images, but through hypotheses grouping and ordering, only
a few hypotheses (column 2) need to be checked in the veriﬁcation phase. For the test
images cup and soap, they both are convex objects, as can be seen in Figure 4.11(c)
and (d), and contain only one part. As a result, only one hypothesis was generated.
However, the indexing would fail if they are partially occluded. The indexing failure
issue will be addressed in Chapter 5. The reason for the low number of required veri-
ﬁcations was that a tight attribute tolerance criteria was used to prune the incorrect
hypotheses and that only a small portion of the parts corresponding to the correct
model was occluded (see Figure 4.12). The ﬁnal 2 columns of Table 4.5 illustrate
that alignment was achieved within 1.4 pixels MSE on the edgemaps within 18 iter—
ations of the hill-climbing procedure. Note that only the non-occluded contours of
the test object were used to compute the MSE. The goodness of the ﬁt is also shown
in Table 4.5. In most cases, the pose parameters were achieved within reasonable

ranges.

146

Table 4.5: Results of model indexing and hypothesis veriﬁcation.

 

Number of models in the database: 658.
Attribute tolerance criteria (e,, e,, ep, 6451) = (50,10, 5.0, 0.005)
Input parameters for alignment: E = 1.6 pixels and [max = 30 iterations.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Object Model Indexing Hypotheses Veriﬁcation (Pose Estimation)
N HG Rank Eg, Egg E, E0, EN MSE ﬁlter
blockBl 102 1 l 0 0.001 0.7 0.03 0.83 4
camar018 26 1 l 1 0.000 1.7 0.09 1.06 18
cup 1 l l 1 0.006 0.2 0.20 1.05 12
soap 1 1 2 0 0.002 0.4 0.05 1.02 3
taurusl 113 1 0 2 0.005 0.3 0.28 1.31 5
truck4 217 1 3 2 0.012 0.5 0.04 0.95 8
elephantl] 160 l 2 0 0.010 1.0 0.06 0.97 10
gorillal] 167 2 3 3 0.005 0.8 0.46 1.26 4
lionlO] 141 2 2 1 0.005 0.5 0.24 1.37 6
phone13f 60 l 0 2 0.007 0.7 0.03 0.86 5
sprayer7f 35 l 1 0 0.010 1.7 0.33 0.98 3
swan2f 131 1 2 1 0.007 0.4 0.13 0.75 3

 

 

 

 

 

 

 

 

 

 

 

 

 

N HG represents the number of hypotheses generated.

E? = ]] 13 —- pgguell where 13 is estimated parameter value of pm”.
The measure unit for Egg & Egg is pixels, for E0, is degrees, for EN is radians.
] indicates that the test object is partially occluded.

 

 

147

   

(b) camar018

   

(d) soap

 

(e) taurusl (f) truck4

Figure 4.11: Examples of hypothesis veriﬁcation via model ﬁtting.

148

 

a elephantl b orillal
g

 

      

(e) swan2 (f) sprayer7

Figure 4.12: More examples of hypothesis veriﬁcation via model ﬁtting.

149
4.5 Summary

We have proposed a new shape—based indexing scheme applicable to a large class of
rigid curved objects. Indexing is based on invariant attributes extracted from parts
on the object silhouette. We have presented a part representation that is invariant
under 2D afﬁne transformation (rotation, translation and scaling), and that provides
robustness for effectively supporting indexing in the presence of occlusion and spurious
noise. We have incorporated the pose consistency constraint to increase the accuracy
of selecting model—aspect hypotheses. We have also investigated various hypothesis
voting schemes that exploit prior knowledge to order model-aspect hypotheses for
efﬁcient search for candidate models. When combined into a complete system, these
techniques make progress toward improving accuracy and efﬁciency by pruning false

hypotheses and minimizing unnecessary veriﬁcations.

The work on indexing was framed within the general alignment paradigm pro-
posed by David Lowe [76]. Our recognition experimental results have shown that we
have made good progress toward a general alignment system where models can be
taught via input imagery. This system should work well on some existing recognition
problems. Although we have only used shape features, the paradigm can easily ac-
commodate other types, such as color or texture features. Moreover, we could also
include shape features derived from the internal object edges. In our experiments,
we have made stronger assumptions on the quality of object contours available from
the input image than were made by Lowe. The part theory applies, however, to open

contours so it is not required that an entire object is segmented from the background.

150

Surely, degraded input implies increased ambiguity and computational load. We will

study recognition in more difﬁcult imagery in future work.

Chapter 5

Object Recognition and Tracking

This chapter is concerned with the integrated recognition system as well as the appli-
cability of this system for object tracking. We have described the object representa-
tion in Chapter 2, the matching algorithm in Chapter 3, and the indexing techniques
in Chapter 4, In this chapter, we propose a hierarchical veriﬁcation strategy to cut
down the recognition time and explain how we can track a moving object in an image
sequence.

The important feature of the integrated recognition system is that it does not
need to verify all the false model hypotheses at a full length. False model hypotheses
can be repudiated at an early stage of the veriﬁcation process. Only hypotheses of
correct models need to be veriﬁed at a full length. Thus, we propose conservative
threshold values to determine when a false model hypothesis should be rejected during
the veriﬁcation process.

This chapter is divided into 3 sections. Section 5.1 proposes a veriﬁcation strat-
egy to expedite the recognition. Section 5.2 demonstrates the applicability of the

151

152

recognition system for tracking moving objects in image sequences. A summary of

the experiments is given in Section 5.3.

5.1 Object Recognition

The object recognition system operates within the alignment paradigm. Four stages
are involved in this framework: (1) model building, (2) feature extraction, (3) model-
aspect hypothesis generation via indexing, and (4) model—pose hypothesis veriﬁcation
through aligning the 3D projected model with the object image features. Both model
building and index construction are processed off-line. The computation time for
feature extraction is ﬁxed. Thus, the recognition time is primarily determined by the
efﬁciency of hypothesis generation and veriﬁcation. The indexing scheme must com-
pensate for noise and view variation by increasing the indexing attribute tolerance,
thus, increasing the number of incorrect model-aspect hypotheses that are generated.
To minimize the recognition time, the indexing scheme exploits some prior knowl-
edge to reduce the number of veriﬁcations required during recognition. However, the
recognition system may still need to repudiate a few false model-pose hypotheses be-
fore the correct one is veriﬁed. For example, the experimental results in Section 4.4
indicate that a signiﬁcant percentage of model-aspect hypotheses can be removed
from consideration via hypothesis ordering, but for a loose indexing attribute toler-
ance, such as ATCtl5 in Table 4.4, on the average, about 10 model-pose hypotheses
still need to be tried before the correct one is veriﬁed. This slows down the recogni-

tion signiﬁcantly. Consequently, shortening the veriﬁcation time for false model-pose

153

hypotheses becomes a necessity for the recognition system to be more efﬁcient. As
described next, we incorporate a hierarchical veriﬁcation strategy into the recognition

system to expedite the recognition.

5.1.1 Hierarchical Veriﬁcation

The model-pose hypothesis veriﬁcation is carried out through robust pose estimation,
as described in Chapter 3, which uses Newton’s method with Levenberg-Marquardt
minimization (N—L—M). The experimental results in Chapter 3 show that for correct
model aspects with good initial parameter estimates relatively close to true parameter
estimates, the N-L—M method converges within a few iterations. But for wrong model
aspects, the N-L—M method tends to wander around the parameter space, resulting
in a slow convergence to a local minimum or no convergence at all. Accordingly, a
tremendous amount of computation time is wasted on verifying incorrect model-pose
hypotheses. Thus, it is essential to reduce the number of iterations in the N-L-M

method when incorrect model-pose hypotheses are veriﬁed.

The edgemap matching algorithm (i.e., the N-L—M method), presented in Fig-
ure 3.2 of Chapter 3, uses two parameters: the maximum number of iterations al-
lowed, Imax, and the matching error threshold, E, to determine when the algorithm
should terminate. For a correct model aspect, a large value of Imax and a small value
of Eg tend to lead the algorithm to either converge with a mean-squared distance
error (MSE) less than E or to select the best match after Imax iterations are tried.

In either case, the veriﬁcation is thoroughly carried out, and the result can be used

154

to determine whether to accept or refute a model-pose hypothesis. However, such an
exhaustive veriﬁcation for an incorrect model-pose hypothesis is neither necessary nor
possible—the incorrect model-pose hypothesis can be detected after a few iterations
of the N-L—M method.

To expedite the recognition, we propose a hierarchical veriﬁcation strategy as

follows. Three situations may occur during the veriﬁcation process:

(1) ﬁtting the correct model aspect to the object image features;

(2) ﬁtting the wrong model to the object image features;

(3) ﬁtting the correct model but with wrong aspect to the object image features.

Normally, for those model-pose hypotheses where the model aspect and the observed
object stem from the same object model but different aspects, the ﬁtting error in—
dicated by the MSE is in between that from situation (1) and that from situation
(2). We can vary the matching error threshold, E, and the maximum number of
iterations allowed, Imax, to control the thoroughness of a veriﬁcation. Three levels
of veriﬁcation, coarse, middle, and ﬁne, are deﬁned based on the various values of
1mg. and E. During the recognition, model-pose hypotheses are veriﬁed in the order
of their rank obtained from indexing. Each ranked model-pose hypothesis is ﬁrst
veriﬁed at the coarse level; if it does not survive at the coarse level, it is removed
from further veriﬁcation; if it survives, then it is veriﬁed at the middle level; if it sur-
vives again at this level, then it continues to be veriﬁed at the ﬁne level; otherwise,
it is rejected. The speciﬁed error threshold of the ﬁne level will be used to accept a

model-pose hypothesis whenever the ﬁtting error is below this error threshold during

155

 

 

Input: a set of ranked model-pose hypotheses generated via indexing;

three sets of threshold values, (Limp, Ef), (1m Egm), and (liar, E!)

for coarse, middle, and ﬁne level of veriﬁcation, respectively.

Outputs: either failure or the model identity and pose parameters 175.

Processing: this veriﬁcation strategy uses the N-L—M method to compute the
pose parameters 1.3 and the ﬁtting error M S E iteratively for each
hypothesis until the correct one is found; otherwise report failure.

The iterations repeat the following steps:

(1) Get the next lower ranked model-pose hypothesis;
(2) Initialize the iteration counter: Iter = 1;

(3) If M S E S E{, then accept the current hypothesis
(i.e., report the pose parameter (3 and the model identity)
and terminate the whole veriﬁcation process;

(4) If M S' E > E: and I ter > 15,”, then reject the current hypotheses and
go to step (1); otherwise go to the next step;

(5) If M S E > E,“ and I ter > 1:3“, then reject the current hypotheses and
go to step (1); otherwise go to the next step;

(6) If M S' E > ng and I ter > 15,“, then reject the current hypotheses and
go to step (1); otherwise go to the next step;

(7) Increase I ter by 1 and go to step (3).

 

Figure 5.1: The hierarchical veriﬁcation strategy.

 

156

veriﬁcation. Once a ranked model-pose hypothesis is accepted, the whole veriﬁcation
process terminates; otherwise this hierarchical veriﬁcation process continues with the
next lower ranked model—pose hypothesis. The hierarchical veriﬁcation strategy is

sketched in Figure 5.1.

5.1.2 Monte Carlo Experiments

In this section, we describe several Monte Carlo experiments on a large number of
test images conducted to study what empirical values for E and Imax should be used

for veriﬁcations at the coarse, middle, and ﬁne level, respectively.

Fitting Correct Model Aspects to Images

In this experiment, six randomly chosen views for each of 16 models and one view
for each of 4 single model aspects in our database were generated as the test images.
Thus, we had 100 randomly selected test images for 100 (out of 576) model aspects.
These 100 model-pose hypotheses were tested via the N-L—M method.

Table 5.1 shows the results, which were collected after ﬁtting 100 model aspects
to their corresponding test images. The cell (Imax, E) in Table 5.1 indicates the
number of model-pose hypotheses in which the N-L—M method converges within Imax
iterations and with an MSE less than E. For example, for [max = 5 iterations and E
= 4.0 pixels, 98 (out of 100) model-pose hypotheses are accepted (i.e., survive); two
model-pose hypotheses did not survive because either bad initial parameter estimates
or small values of [max were used in the N-L—M method. We used another set of

initial parameter estimates for these two model-pose hypotheses, and in both cases,

157

Table 5.1: Number of veriﬁcations in ﬁtting correct model aspects to images as a
function of E, and Imax.

 

100 model-pose hypotheses were used.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Imax Error Threshold: Eg (pixels)
(iterations) 1.0 1.5 2.0 2.5 3.0 4.0 5.0 6.0
3 26 46 63 80 89 96 97 99
4 39 60 72 87 95 97 99 99
5 45 63 81 91 95 98 99 99
6 47 66 85 93 96 98 99 100
7 48 68 88 93 97 99 99 100
8 51 70 89 95 97 99 99 100
9 51 74 91 96 97 99 99 100
10 52 75 92 97 98 99 100 100
11 52 75 93 97 99 100 100 100
12 53 76 94 98 100 100 100 100

 

 

 

 

 

 

 

 

 

the N-L-M method converged within 5 iterations and with a ﬁtting error less than
2.0 pixels MSE. The N-L-M method also converged within 15 iterations when the
original initial parameter estimates were used. This suggests that for veriﬁcation at
the ﬁne level, the N-L—M method should use a few sets of initial parameter estimates
and a large value for [max to ensure that correct model-pose hypotheses are always
accepted. It can be seen from Table 5.1 that with Imax = 6 iterations and E = 6.0
pixels, all 100 model-pose hypotheses survive. Thus, for this set of model aspects,
[max = 6 iterations and E = 6.0 pixels are good criteria for veriﬁcation at the coarse

level to refute wrong model-pose hypotheses.

Fitting Incorrect Model Aspects to Images

In this experiment, we hoped to demonstrate that most wrong model-pose hypotheses

are refuted through veriﬁcation at the coarse level. For each of the aforementioned

158

Table 5.2: Number of veriﬁcations in ﬁtting wrong models to images as a function of

E and Imax.

 

 

 

 

 

 

 

 

 

1000 model-pose hypotheses were used.
1...” Error Threshold: Eg (pixels)
(iterations) 1.0 2.0 2.5 3.0 4.0 5.0 6.0
1 0 0 0 0 0 1 3
2 0 0 0 l 1 3 4
3 0 0 1 1 2 3 7
4 0 0 1 1 3 3 7
5 0 0 1 l 3 3 9
6 0 0 1 1 3 4 11

 

 

 

 

 

 

 

 

 

 

100 test images, 10 (excluding the correct one) randomly selected model aspects were
used to form 10 model—pose hypotheses. All 1000 model-pose hypotheses were tested

using the N-L-M method.

Table 5.2 shows some results from this experiment. As can be seen in Table 5.2,
with Eg = 6.0 pixels and Imax = 6 iterations (i.e., the criteria for veriﬁcation at
the coarse level), only 11 out of 1000 model-pose hypotheses survived the coarse
level. Table 5.3 lists the ﬁtting errors of these 11 model-pose hypotheses. It shows
in Table 5.3 that the best 3 model-pose hypotheses are either with the same object
model (sprayer7-sprayer10) or with neighboring aspects (zebraI-zebr020 and phone 7-
phone6). This suggests that the goal for veriﬁcation at the middle level is to refute
those model-pose hypotheses in which the model aspect and the observed object stem

from the same object model but different aspects, or from similar object models.

159

Table 5.3: Incorrect model-pose hypotheses that survived at the coarse level.

 

 

 

 

 

 

 

 

 

 

 

 

Model Aspect Test Image MSE (pixels)
sprayer7 sprayerlO 2.3
zebral zebra20 3.7
phone7 phone6 3.8
blockB7 soap 4.9
camaro20 taurus2 5.4
lion29 lion10 5.4
elephantl gorilla16 5.5
soap taurusl 5.8
gorillal gorilla16 5.8
soap taurusS 5.9
soap zebra20 5.9

 

 

 

 

 

Fitting Models to Neighboring Aspects

To study the effect of neighboring aspects on the ﬁtting error, 100 randomly selected
model-pose hypotheses were generated, in which model aspects and their correspond-
ing test images were from neighboring aspects. Table 5.4 lists the results of these
100 model-pose hypotheses. As shown in Table 5.4, only 5 out of 100 model—pose
hypotheses survived when Imax and E were 12 iterations and 3.0 pixels, respectively.
Note that with [max = 12 iteration and E = 3.0 pixels, all correct model-pose hy-
potheses survive as shown in Table 5.1. This suggests that for veriﬁcation at the
middle level, Imax = 12 iterations and E = 3.0 pixels are good criteria to remove
model—pose hypotheses in which model aspects and test images are from the same
object model but different aspects.

From the results of above three experiments, we are able to determine the empirical
values for Imax and E to be used in veriﬁcations at various levels. These values are

shown in Table 5.5. Note that two sets of initial parameter estimates are used for

160

Table 5.4: Number of veriﬁcations in ﬁtting models to their neighboring aspects as a
function of E and Imax.

 

 

 

 

 

 

 

 

 

 

 

 

 

100 model-pose hypotheses were used.
Imax Error Threshold: Eg (pixels)
(iterations) 1.0 2.0 2.5 3.0 4.0 5.0 6.0
3 0 0 0 3 13 23 40
4 0 0 0 3 17 27 45
5 0 0 2 4 20 28 50
6 0 0 2 4 20 32 51
7 0 0 2 4 20 36 52
8 0 0 3 4 20 39 56
9 0 0 3 4 21 43 57
10 0 0 3 5 24 44 59
11 0 0 3 5 25 45 61
12 0 0 3 5 26 47 63

 

 

 

 

 

 

 

 

 

 

veriﬁcation at the ﬁne level to prevent the N-L—M method from getting stuck in local
minima. Normally, one set of initial parameter estimates is sufﬁcient for the N-L-M
method to converge because the basin of convergence is broad as shown from the
experimental results in Chapter 3. The extra set of initial parameter estimate is used
to further guarantee the convergence of the N-L—M method. Thus, if a model-pose
hypothesis, using one set of initial parameter estimates in the N-L-M method, passes
the veriﬁcation at the middle level, but fails at the ﬁne level, then the other set of
initial parameter estimates is used for the N-L—M method to test this model-pose
hypothesis again. If one set of initial parameter estimates causes the N-L-M method

to converge, the other set will be ignored.

161

Table 5.5: Parameters for veriﬁcations at various levels.

 

 

 

 

 

Level [gmat (iterations) Eg (pixels)
coarse 6 6.0
middle 12 3.0
ﬁne 18 2.0

 

 

 

 

 

An Example

We provide an example to demonstrate how this hierarchical veriﬁcation strategy ex-
pedites the recognition. We use the test image for model aspect gorilla20 to index
into the database. 137 model-aspect hypotheses are retrieved. The correct model
aspect, gorilla20, is ranked as the 5th among these 137 model—aspect hypotheses. Ta-
ble 5.6 lists the number of iterations required for the ﬁrst ﬁve model-pose hypotheses
using the hierarchical veriﬁcation strategy. The input parameters for veriﬁcations at
various levels as shown in Table 5.5 were used. Since model aspect gorilla3 is neigh-
boring to model aspect gorillaQO, the corresponding model-pose hypothesis passes
the veriﬁcation at the coarse and middle levels, respectively; the rejection at the ﬁne
level costs totally 18 iterations in the N-L—M method. Model aspect gorilla13 is from
the same object model with model aspect gorilla20; the corresponding model-pose
hypothesis only passes the veriﬁcation at the coarse level; the rejection at the middle
level costs 12 iterations. Both model-pose hypotheses for lion31 and sharpener5 are
rejected at the coarse level, and each rejection costs 6 iterations. When the correct
model aspect gorilla20 was used in alignment, the N-L—M method converged with an
MSE less than the speciﬁed error threshold E of the ﬁne level; the N—L—M method

accepted the hypothesized model aspect and pose, and any further testing was ter-

162

minated. The total cost for the hierarchical veriﬁcation was 47 iterations as shown
in Table 5.6. The total cost for the non—hierarchical veriﬁcation would be 77 (i.e.,
18 x 4 + 5) iterations; each wrong model-pose hypothesis costs 18 iterations, and the
correct one costs only 5 iterations. Let I N and I H denote the number of iterations
required for the non—hierarchical and hierarchical veriﬁcation, respectively. Deﬁne the
speedup of using the hierarchical veriﬁcation over the non-hierarchical veriﬁcation as
S = (IN — IH)/IH. Then the speedup in this example is 63.8%. It is straightfor-
ward to derive the upper and lower bounds of S. Suppose the rank for the correct
aspect-model hypothesis is N, and the corresponding model-pose hypothesis costs
m iterations for veriﬁcation. The upper bound of 5' occurs when the N-l incorrect

model-pose hypotheses, ranked from 1 to N—l, are tested and rejected at the coarse

 

level. Then S = “Mm-13f: ES’SN'IHM. Thus, when N is very large, the upper
bound of S is approximately 200%. The lower bound of S occurs when N = 1, i.e.,
there is no wrong model-pose hypothesis to reject; in this case S = 0. Note that the
hierarchical veriﬁcation strategy would gain some speedup only if it can reject wrong

model-pose hypotheses at the early stage of the N-L—M method. More examples are

given in the next section.

5.1.3 Recognition Results

In this section, we intend to show the efﬁciency and viability of the proposed recogni-
tion system. The hierarchical veriﬁcation strategy is incorporated in the recognition

system to improve the efﬁciency of the recognition. The viability of the recognition

163

Table 5.6: Hierarchical veriﬁcation for model aspect: gorilla20.

 

 

 

 

 

 

 

Model Indexing Hypothesis Veriﬁcation
Rank Model Aspect Coarse Middle Fine Iterations
1 gorilla3 pass pass reject 18
2 gorilla13 pass reject 12
3 lion31 reject 6
4 sharpener5 reject 6
5 gorilla20 accept 5

 

 

 

 

 

 

 

 

system is demonstrated through alignment results of sixty real sculptured objects; 28
of them are with non-occluded objects, and 32 of them are with partially occluded
objects. Tables 5.7 and 5.8 give details on the indexing and alignment steps. Fig-
ures 3.10, 3.11, 4.11, 4.12, 5.2, 5.3, and 5.4 show the test images after indexing and
recognition via alignment: the aligned model edgemaps are overlaid on the input

images.

Recognition of Non-occluded Objects

Table 5.7 lists the number of model-aspect hypotheses generated (NH) via in-
dexing, the rank of the correct model aspect (R1,), the goodness of the ﬁt
(E13,E¢y,E,,Ea,EN), the ﬁtting error (MSE), the number of iterations required
for the hierarchical veriﬁcation (In), the number of iterations required for the non-
hierarchical veriﬁcation (IN), and the speedup of using the hierarchical veriﬁcation
(3), deﬁned as (IN — IH)/IH. 3 (out of 28) test images contain objects with only
one part in their silhouettes; these three objects are cup and soap in Figure 4.11(c)

and (d), respectively, and blockBIO in Figure 5.2(c). As shown in the column NH

164

 

(l) taurus3 (k) taurus4 (l) taurusS

Figure 5.2: Examples of recognition.

165

         

(k) camaro27 (l) camaro36

Figure 5.3: Examples of recognition.

166

     

(b) elephant20

 

        

(k) swan3 (l) zebralO

Figure 5.4: Examples of recognition.

167

of Table 5.7, both soap and cup have only 1 model-aspect hypothesis generated via
indexing, but blockBIO has 10. Since bloc/c810 has only one part, all the 10 generated
model-aspect hypotheses for blockBIU will have the same scores weighted by any of
the hypothesis ordering schemes as presented in Chapter 4; thus, the correct one is
ranked as the 5th, i.e., the average rank. The rest of the test images contain objects
with several parts in their silhouettes, and consequently, the number of model-aspect
hypotheses generated from indexing increases. However, the HYBRID hypothesis
ordering scheme as presented in Chapter 4 has effectively ranked the correct model-
aspect hypothesis. As can be seen from the column R), of Table 5.7, 27 (out of 28)
test objects are ranked as the ﬁrst, which implies only one veriﬁcation. This indicates
that the HYBRID hypothesis ordering scheme has effectively pruned the incorrect
model-aspect hypotheses using the prior knowledge about the pre—stored models and
the visual evidence of the observed objects. As indicated in the column 3 of Table 5.7,
the integrated recognition system does not gain any speedup by using the hierarchical
veriﬁcation strategy because most of the correct model-aspect hypotheses require only
one veriﬁcation; however, the overall recognition time for each of these 27 objects,
indicated by the column I H of Table 5.7, is smaller than the recognition time for the
test object, bloc/c810, which has gained some speedup (S = 128%). Thus, the overall
speed of the recognition system is not evaluated by the speedup S but by the total
iterations required for veriﬁcation.

The columns MSE and I H of Table 5.7 illustrate that alignment was achieved
within 2.0 pixels MSE on the edgemaps and within 28 iterations of the N-L-M method.

The goodness of the ﬁt is also shown in Table 5.7. In most cases, the pose parameters

168

Table 5.7: Recognition results for 28 non—occluded objects.

 

Number of model aspects in the database: 658.

Indexing attribute tolerance criteria (eg, 6,, ep, em) = (5.0, 1.0, 5.0, 0.005)

Parameters for hierarchical veriﬁcation are speciﬁed in Table 5.5.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Object Indexing Hypothesis Veriﬁcation (Pose Estimation)
NH R1, Eg, E, E, E0 EN MSE I” IN S(%)

blockBl 102 l 1 0 0.001 0.7 0.03 0.83 4 4 0
blockB4 114 1 2 0 0.006 0.1 0.33 0.77 5 5 0
blockB7 91 1 2 1 0.001 0.6 0.04 0.90 4 4 0
blockBlO 10 5 3 0 0.004 1.5 0.12 1.04 28 64 128
blockB13 116 1 3 0 0.005 0.7 0.06 1.02 4 4 0
blockBl6 111 1 1 0 0.005 0.5 0.21 0.96 4 4 0
camar08 223 1 3 2 0.009 0.4 0.06 1.16 9 9 0
camar018 26 1 1 1 0.004 0.4 0.03 1.06 16 16 0
cup 1 1 1 1 0.006 0.2 0.20 1.05 12 12 0
face* 113 1 0 0 0.002 2.2 0.20 1.39 7 7 0
phonel 50 1 0 0 0.005 0.7 0.08 1.24 4 4 0
phone6 71 1 1 0 0.000 0.4 0.06 1.08 6 6 0
phone7 82 1 1 0 0.002 1.0 0.08 1.28 5 5 0
sharpenerlO 88 1 3 0 0.001 2.4 0.13 1.36 8 8 0
soap 1 1 2 0 0.002 0.4 0.05 1.02 3 3 0
swanll 64 1 0 0 0.001 1.0 0.15 0.99 4 4 0
swanl3 42 1 2 0 0.002 1.7 0.08 0.97 6 6 0
taurusl 113 l 0 2 0.005 0.3 0.28 1.31 5 5 0
taurus2 173 1 3 0 0.002 2.0 0.06 1.47 3 3 0
taurus3 153 1 3 0 0.002 2.3 0.04 1.65 3 3 0
taurus4 47 1 4 0 0.008 1.2 0.25 1.20 5 5 0
taurus5 58 1 2 0 0.004 0.8 0.08 0.89 3 3 0
taurus8 61 1 2 0 0.002 2.1 0.04 1.22 6 6 0
taurusl3 30 l 3 1 0.000 1.1 0.21 1.38 6 6 0
taurusl4 69 l 3 0 0.007 0.9 0.11 1.16 5 5 0
taurule 147 1 8 1 0.018 4.9 0.39 1.58 18 18 0
taurus20 82 1 l 1 0.005 1.6 0.12 1.92 4 4 0
truck4 217 1 3 2 0.012 0.5 0.04 0.95 8 8 0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(tutu, s,a, N) represents the object pose where (it, ty) is 2D translation, 3 the

scale factor, and a and N the rotation angle and axis, respectively.

E, denotes I] f: — pug,” where ,3 is the estimated value of the ground truth pm”.
* ground truth not actually known (only approximation).

 

 

169

were achieved within reasonable ranges. Note that for the test object taurusIQ, the
large error in t, (8 pixels off the ground truth) was due to the large error in s (1.8%
off the ground truth). The error in tr was compensated by the error in 3, resulting in
a reasonable ﬁtting error (1.58 pixels MSE) on the edgemaps. Nevertheless, the con-
vergence of the ﬁtting was very slow (18 iterations), and also the errors in rotational
parameters were quite large; 4.9 degrees off the ground truth for the rotation angle
a, and 0.39 radians off the ground truth for the rotation axis N. This provides an
example in which an acceptable ﬁtting error does not necessarily guarantee the accu-
racy of pose parameters. As discussed in Chapter 3, the accuracy of pose parameters
is affected by the shapes of the objects. The execution time for each iteration of the
N-L—M method, excluding I/O, is about 0.1 second on a Sparc 20, and the execu-
tion time for the indexing scheme to generate hypotheses for each test image, again

excluding the time on loading the database into memory, takes about 0.1 second.

Recognition of Oecluded Objects

Results with occluded objects are given in Table 5.8: the notations are the same
as those in Table 5.7. Note that in the column R), of Table 5.8, ﬁve test objects,
blockAI, block/15, block/19, blockA 14, and blockA16, have the symbol —, which denotes
indexing failure. Because these ﬁve objects have only one part in their silhouettes and
they are partially occluded, indexing fails. This is one break point of the indexing
scheme. Without the help from indexing to prune the model-aspect hypotheses, a
lot of veriﬁcations need to be performed. For example, for the test object blockAI,

34 model-pose hypotheses needed to be tried ﬁrst because they were generated from

Table 5.8: Recognition results for 32 occluded objects.

170

 

Indexing attribute tolerance criteria (eg, 6,, ep, e¢,) = (5.0,1.0, 5.0,0.005)

Number of model aspects in the database: 658.

Parameters for hierarchical veriﬁcation are speciﬁed in Table 5.5.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Object Indexing Hypothesis Veriﬁcation (Pose Estimation)

NH 12,. ngngyl E, [Ea] EN MSE] 1,, 1 IN [S(%)
blockAl 34 - 0 2 0.007 0.4 0.16 1.02 2081 6233 199
blockA5 42 - 5 3 0.015 1.6 0.13 0.88 4200 6305 199
blockA9 44 - 1 2 0.010 1.2 0.43 1.09 4212 6321 200
blockA 14 8 - 4 1 0.001 2.2 0.60 1.03 2000 5996 200
blockA16 44 - 0 0 0.003 0.1 0.33 1.29 4210 6319 199
camaro20 126 1 1 0 0.001 0.7 0.03 0.83 4 4 0
camaro27 25 l 2 0 0.009 0.2 0.14 1.79 4 4 0
camaro36 189 1 5 1 0.017 1.8 0.61 1.62 3 3 0
camaro37 170 1 1 2 0.005 1.2 0.03 1.74 2 2 0
can 181 1 0 0 0.003 0.8 0.05 0.94 4 4 0
elephantl 160 1 2 0 0.010 1.0 0.06 0.97 10 10 0
elephant20 239 25 4 5 0.006 2.0 0.16 1.06 179 437 144
gorillal 167 2 3 3 0.005 0.8 0.46 1.26 10 22 120
gorilla20 137 5 3 0 0.009 0.5 0.01 0.85 47 77 64
lion10 141 2 2 1 0.005 0.5 0.24 1.37 12 24 100
lion29 240 12 1 0 0.000 0.5 0.06 1.56 77 203 164
mug7 194 59 1 1 0.008 3.9 0.80 1.02 388 1054 173
mug26 216 33 2 0 0.010 0.2 0.10 0.98 223 583 161
phonelO 103 32 1 0 0.000 0.0 0.15 1.78 196 568 190
phonel3 60 1 0 2 0.007 0.7 0.03 0.86 5 5 0
pigl 204 1 0 0 0.000 0.1 0.03 1.22 5 5 0
sharpenerl 226 1 3 3 0.005 3.1 0.02 0.97 3 3 0
sprayerl 30 l 3 1 0.013 0.2 0.15 1.17 10 10 0
sprayer7 35 1 1 0 0.010 1.7 0.33 0.98 3 3 0
sprayer25 28 l 1 0 0.002 0.1 0.10 1.73 11 11 0
squirre119 218 1 2 2 0.010 0.3 0.12 1.61 5 5 0
squirrel25 181 107 1 1 0.020 2.9 0.14 1.75 701 1913 173
swan2 131 1 2 1 0.007 0.4 0.13 0.75 3 3 0
swan3 146 l 3 1 0.008 0.1 0.12 0.93 10 10 0
trucklO 78 1 2 2 0.009 0.7 0.07 0.95 10 10 0
zebral 222 9 3 3 0.010 0.3 0.05 0.76 68 152 124
zebra20 265 5 5 5 0.009 1.1 0.01 1.19 27 75 178

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(t3, ty,s,a, N) represents the object pose where (tmty) is 2D translation, 3 the

scale factor, and a and N the rotation angle and axis, respectively.
B? denotes “p - pggucll where p is the estimated value of the ground truth pgme.

 

 

171

indexing, but none of them would be accepted; then the rest of the model aspects
are tested in some random order; on the average 317, i.e., (658-34) / 2, model-pose
hypotheses are veriﬁed, resulting in a tremendous number of veriﬁcation attempts.
The speedup gained by using the hierarchical veriﬁcation strategy almost reaches
the upper bound, i.e., S = 200%. In this case, indexing does not help the system
to alleviate the burden of veriﬁcation; on the contrary, it increases the recognition
time by requiring the system to perform an extra NH/2 tests (were the indexing
not provided, the system only needed to verify 329 model-pose hypotheses on the
average). As can be seen in the column S of Table 5.8, the recognition system always
gained some speedup whenever the correct model—aspect hypothesis is not ranked as
the ﬁrst. The column MSE of Table 5.8 illustrates that alignment was achieved within
2.0 pixels MSE on the edgemaps. Due to the indexing failures, the recognition system
tested a lot of false model-pose hypotheses, resulting in a huge number of iterations
of the N-L—M method, as indicated by the column I H in Table 5.8. The goodness of
the ﬁt is also shown in Table 5.8. In most cases, the pose parameters were achieved
within reasonable ranges, although in general not as good as the pose parameters for
non-occluded objects as shown in Table 5.7.

The column R), in Table 5.8 shows that the performance of the hypothesis ordering
scheme deteriorated when the observed object was partially occluded. However, as
long as the percentage of visual contour of non-occluded object parts is signiﬁcant,
the hypothesis ordering scheme still could effectively rank the correct model-aspect
hypothesis; for 14 (out of 32) test objects, the correct model aspect ranked ﬁrst.

Table 5.9 lists the number of non-occluded object parts (NNp), the number of scene

172

Table 5.9: The effect of visual contour on indexing.

 

 

 

 

 

 

 

 

 

 

 

 

Object NNP Nsp VC (%) N}; R];
elephant20 l 9 28.8 239 25
gorillal 2 11 25.7 167 2
gorilla20 l 7 19.7 137 5
lion10 3 8 24.1 141 2
lion29 2 9 15.3 240 12
mug7 1 9 13.3 194 59
mug26 1 10 13.6 216 33
phonelO l 5 22.8 103 32
squirrel25 1 11 17.2 181 107
zebral 2 8 25.9 222 9
zebra20 3 10 23.2 265 5

 

 

 

 

 

 

 

 

 

 

parts (Nsp), the percentage of the visual contour of non-occluded object parts in
the whole scene contour (VC), the number of model-aspect hypotheses generated via
indexing (NH), and the rank of the correct model-aspect hypothesis (Bk). As can
be seen in Table 5.9, the reason for the high number of required tests for the test
objects elephant20, mug7, mu926, phone] 0, and squirre125 is that these objects are
partially occluded and only a small portion of the parts corresponds to the correct
model aspect; for the test images mug7, only one part, i.e., the handle of the mug, (out
of 9 scene parts) is able to index to the correct model aspect and its visual contour
is relatively small (13.3%) as compared to the whole visual contour in the scene (see
Figure 5.4 (d)). The number of hypotheses generated depends on the number of scene
parts and the rarity of each scene part in the database. Normally, more scene parts
will generate more hypotheses. The rank of the correct model aspect depends on the

saliency of non-occluded parts in the object and the percentage of visual contour of

173

these parts in the scene. Thus, there is no particular regularity in Table 5.9, but, in
general, a smaller percentage of visual contour of non-occluded object parts means a

lower rank of the correct model-aspect hypothesis.

Recognition Failure

Several examples are given here to demonstrate two types of recognition errors,
namely false alarm and false dismissal. Figure 5.5 shows the test images containing
objects not existing in the model database. Table 5.10 lists the number of hypothe-
ses generated via indexing, the number of hypotheses accepted, and the number of

hypotheses rejected. For vehicles shown in Figure 5.5(a)-(d), edge contours (shown as

Table 5.10: Recognition results of objects not in the database.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Indexing attribute tolerance criteria (e,, e,, e,,, em) = (4.0, 0.4,4.0,0.005)
Parameters for hierarchical veriﬁcation are speciﬁed in Table 5.5.
Test Object Indexing ]] Hypothesis Veriﬁcation (Matching)
ﬁHypotheses ]] ﬁAccepted ﬁRejected

hatchback 69 0 69

sedan 71 0 71

wagon 60 0 60
sports-car 87 0 87

faceA 86 0 86

faceB 49 0 49

bcirc 4 0 4

jar 66 0 66
mushroom 8 0 8

pipe 40 0 40

pot 20 0 20

taco 11 0 11

top 29 0 29

vase 74 0 74

eg 0 - -

semi 0 - -

 

 

 

 

 

 

 

 

174

white contours superimposed on the intensity images) were manually edited to extract

a)hatchback b) sedan )wagon d)sports—car

 

 

(e) faceA n (f) faceB (g) bcirc

   

(i) mushroom (j) pipe (k) pot (l) taco

     

(m) top (0) egg (p) semi

Figure 5.5: Objects for recognition (not existing in the database).

vehicle silhouettes for indexing, but the entire edge contours were used for matching;
the indexing scheme produced several model hypotheses on taurus and camaro, but
the matching scheme refuted these hypotheses because of their large ﬁtting errors.
For test objects (not having similar models in the database) in Figure 5.5(e)—(n), the

recognition system failed to recognize them as indicated in the third column of Ta-

175

ble 5.10. Two face images were also manually edited to extract face silhouettes for
indexing. Note that in the last two rows of Table 5.10, the recognition system did not
generate any hypothesis for test objects egg and semi. Consequently, the recognition
system did not need to perform any veriﬁcation test for these two objects as indicated
by the symbol —. These two objects are automatically rejected as new objects at the
indexing phase. The results in Table 5.10 indicate that the recognition system does
not produce any false alarm for these 16 test objects.

Results with occluded objects are given in Table 5.11: the notation is the same

as in Table 5.10. The ﬁrst row of Figure 5.6 shows four test objects swan13, phonel,

Table 5.11: Recognition results for 4 occluded objects.

 

 

 

 

 

 

 

Indexing attribute tolerance criteria (e.g., e,, e,,, 6,5,) = (4.0, 0.4, 4.0, 0.005)
Parameters for hierarchical veriﬁcation are speciﬁed in Table 5.5.

Test Object Indexing Hypothesis Veriﬁcation (Matching)

[lHypotheses ﬁAccepted tiRejected

swanl3 58 0 58

taurusl 62 0 62

phonel 86 0 86

cup 58 0 58

 

 

 

 

 

 

 

 

taurus], and cup; the second row of Figure 5.6 gives the extracted parts of these four
objects; the third row of Figure 5.6 shows the extracted parts of the corresponding
models. As can be seen in the second and third rows of Figure 5.6, the occlusion
effect makes the extracted parts completely different, causing the indexing scheme to
fail to produce the corresponding hypotheses for veriﬁcation test. Table 5.11 shows
that the recognition system fails to recognize these four objects. The last row of

Figure 5.6 depicts the results of model ﬁtting, demonstrating that these four objects

176

do have corresponding models in the database and that the matching stage can verify
this. The recognition system has false dismissal error for these four occluded objects.
Additional examples of false dismissal were also given in the ﬁrst 5 rows of Table 5.8.
Thus, the recognition system breaks if the indexing scheme fails to generate correct

model hypotheses for veriﬁcation; this only occurs when all the parts of an observed

object are occluded.

    
  

phonel taurusl Cllp

 

rib

 

 

 

 

 

 

 

 

 

 

 

Figure 5.6: Examples of false dismissal. Row 1 contains four test objects. Row 2
presents the parts extracted from the test objects. Row 3 gives the parts extracted
from the models. Row 4 shows the results from model ﬁtting.

177
5.2 Motion Tracking

Motion detection and tracking of moving objects from image sequences are important
capacities for any vision system designed to operate in an uncontrolled environment.
Tracking in computer vision, however, is still in the development stages and has had
few applications in industry. It is hoped that tracking combined with other technolo-
gies may produce effective vision systems. Important applications which would be
impacted by a successful tracking system include (1) military target acquisition, (2)
military and commercial surveillance, (3) trafﬁc monitoring, and (4) robotics. For ex-
ample, recognizing and tracking parts on a moving conveyor belt would enable robots
to pick up the correct parts in an unconstrained environment.

In this work, we consider tracking with a stationary camera. Our assumption is
that only the object of interest is in motion and other objects are considered as still
background. Section 5.2.1 contains a brief review of some of the previous work related
to tracking. A description of the proposed tracking method in given in Section 5.2.2.
A motion segmentation scheme for extracting edges of the moving object is presented
in Section 5.2.3. In Section 5.2.4, we show the results of tracking objects in three

image sequences.

5.2.1 Previous Work

In general, there are two approaches toward the problem of tracking moving objects
from image sequences. One is recognition-based tracking and the other is motion-based

tracking. Recognition-based tracking (e.g., [1, 18, 77]) is actually a modiﬁcation of

178

object recognition. This approach attempts to recover the pose of an object at each
frame. Once correspondences between image and model features are determined,
changes in positions of image features in successful frames are used to update the
pose of the object. To accurately recovered the object pose at each frame requires
an exact geometric model of the object. Moreover, the tracked object has to be
recognizable. Since object recognition is a high-level operation that can be costly to
perform, the performance of the tracking system is limited by the efﬁciency of the

recognition system as well as the types of objects that are recognizable.

Motion—based tracking systems differ signiﬁcantly from recognition-based systems
in that they depend completely on motion detection to track the moving object. They
have the capability of tracking any moving object regardless of size or shape. Motion-
based techniques (e.g., [2, 84]) can be further subdivided into feature-based tracking
and optical ﬂow tracking. Feature-based tracking requires stable feature extraction
prior to establishing correspondences between the model features and the sensed
features in consecutive frames. Correspondence establishment may become difﬁcult
if the image includes many objects and feature extraction is ambiguous. On the
other hand, gradient-based tracking needs additional constraints (e.g., smoothness
[57]) to compute the optical ﬂow. Although this technique produces good results
for smooth object surfaces, the constraints increase the estimation error around the
object boundary. Both techniques are sensitive to noise and do not yield satisfactory

results for the real images typically encountered in practical applications.

179
5.2.2 Proposed Method

The proposed approach to the motion tracking problem is recognition-based. The
pose estimation is carried out by the Newton’s method with Levenberg-Marduardt
minimization (N-L-M) to force a convergence. The N—L-M method requires a good
initial estimate of parameters that is relatively close to object pose to avoid converging
to a local minimum. For tracking an object over a sequence of frames, an automated
guess of initial estimate of parameters can be acquired by (1) using the pose estimate
from the previous frame plus a velocity estimate for each parameter derived from the
previous two frames [77], or (2) by using the pose estimate from the previous frame
(if the motion of the object is small and smooth) and renewing the iterations of the
N-L-M method. We have selected the latter because the N-L—M method can recover
the translation parameters within a few iterations regardless of the initial estimate
of translation parameters, and a coarse object view (orientation) is sufﬁcient to form
a good initial parameter estimate for the N-L-M method. However, the tracking
system needs to pay attention to the situation when the object being tracked moves
across aspects. In this case, the model aspect for the previous frame can no longer be
used to ﬁt the current frame, but the pose estimate from the previous frame can still
be used as a good initial parameter estimate. The problem can be solved by using
the mean-squared distance error (MSE) of the ﬁtting as an indicator to determine
if a new model aspect should be introduced. For example, suppose that the system
initially uses model aspect Mg to track a moving object. The system continues to

use Mg for tracking the object until the ﬁtting error at a particular frame t,- is over

180

the user’s speciﬁed error threshold Eg. This indicates that the system should change
model aspect. The system can use the silhouette of the tracked object at frame t,
to index into the database to generate model-aspect hypotheses, among which the
ﬁrst neighboring model aspect of Mg is chosen to continue the tracking. The system
selects the next lower ranked neighboring model aspect if the previous one fails. This

procedure continues until all frames in the image sequence are tracked.

In fact, we can use the pose estimate of the previous frame plus a velocity estimate
obtained from previous two frames to predict the pose of the current frame; if the pose
estimate is out of the aspect boundary of the current model aspect, we can use the
motion information obtained to select the neighboring model aspect to continue the
tracking. This method deﬁnitely outperforms the aforementioned method because the
neighboring model aspect can be automatically selected without resorting to indexing.

This is future work to improve the tracking system.

5.2.3 Motion Segmentation

For the tracking system to be effective, we need to obtain edgemaps of the tracked ob-
ject from images. The use of motion information for object segmentation from image
sequences has been demonstrated as a successful technique when under the assump-
tion that the images contain only one single moving object. Thus, our implementation

is based primarily on this technique.

Edgemaps of the images are created using the Canny edge detector [25]. The

background noise edges are excluded by ANDing a mask on the edgemap. Such a

181

mask is generated through motion segmentation. In practice, the simplest imple-
mentation of motion segmentation is image subtraction, in which the image from the
previous frame is subtracted, pixel by pixel, from the image of the current frame.
When images are subtracted from one another, the absolute intensity difference of
the background tends to be very small while the absolute intensity difference along
the object boundary is large as compared to that of the background. An apprOpriate
threshold to segment an intensity-subtracted image into a binary image can be ob-
tained by a simple Gaussian thresholding technique [62]. The resulting binary image
is dilated several pixels in all directions to ensure that the mask includes the object
boundary. The mask may cover small portions of background edges. As a conse-
quence, the obtained edgemap may contain some small background edge contours.
Since the N-L-M method can handle partially occluded objects, the generation of the

mask need not be perfect.

The motion segmentation technique proposed by Dubuisson and Jain [38] is
adopted to generate the mask. The basic idea is as follows. Let f1, f2, and f3 be three
image frames extracted from the image sequence at times t1, t2, and t3 respectively.
They are blurred by a 3 x 3 averaging mask to smooth out the background and to
preserve the ramp edges around the moving object. Let f1, f2, and f3 represent the
blurred images of f1, f2, and f3, respectively. The difference image between frames

1, 2, and 3 is then computed as follows:

0011') = lfl(iij) — f2(iaj)l * 1.620.132 _ f3(i9j)l

182

where ‘*’ denotes the bitwise multiplication operation and ] - ] denotes the absolute
value operator. This difference image D enhances the pixels around the boundary
of the moving objects. Let tff‘” and 13'9" be two thresholds speciﬁed by the user.
The difference image D is then thresholded at tff‘” to keep only the pixels where a
signiﬁcant intensity change occurs. This thresholding procedure produces an image

Dg deﬁned by:

. . 1 if 0(1, j)>tf,°w
D¢(27]):

0 otherwise

Connected components are then extracted from Dg using an 8-connected neighbor-
hood. A connected component is removed if none of its pixels has a value larger
then 12"". This cleaning procedure generates an image C with “on” pixels around
the boundary of the moving object. The next step is to dilate the components in
C until all isolated components are connected into one single connected component
using a chamfering technique to record the distance of dilation. The internal holes
of this single connected component are then ﬁlled with “on” pixels. The ﬁnal step
is to erode the connected component on the boundary back to its original size. The
information for erosion is embedded in the chamfered distance image. The resulting

image is a mask covering the moving object.

Figure 5.7 presents the results of the described motion segmentation algorithm
using three image frames from an image sequence containing a moving vehicle. Fig—
ure 5.7(d) shows the generated mask in which black pixels are the thresholded regions
from motion segmentation and gray pixels are the results from dilation and erosion of

the thresholded regions. Figure 5.7(e) shows the edgemap of the vehicle obtained by

183

using the mask to ﬁlter out most of the background. Figure 5.7(f) shows the result of
alignment indicated by the white contours overlaid on the image. This demonstrates
that the extracted edgemap for alignment need not be perfect; however, for indexing,
the boundary of the edgemap should be as good as possible to avoid testing many
false hypotheses; nevertheless, indexing is not required if the object does not move

across aspects or if the motion information is incorporated in the tracking system to

select neighboring aspects.

  

(d) (C)

Figure 5.7: Results of the motion segmentation. (a) Frame 1. (b) Frame 2. (c)
Frame 3. (d) Mask for frame 2. (e) The edgemap of frame 2 obtained by using the
mask to exclude the background. (f) The ﬁtted edgemap (from the N-L—M method)
superimposed on frame 2.

184
5.2.4 Experimental Results

In this section, we intend to demonstrate the viability of the tracking system on both
indoor and outdoor moving objects and explain how the tracking system handles the
object moving across aspects. The motion tracking method was tested on three image
sequences. Two of them were simulated image sequences taken in the lab to study
the validity and accuracy of the method. One image sequence taken from a moving

vehicle at a parking lot was used to demonstrate the applicability of the method.

Tracking A Moving Squirrel Carving

The ﬁrst experiment was conducted with an image sequence from a moving squirrel
carving as shown in Figure 5.8. The image sequence was generated using a cam-
era mounted on a translation stage and the object on a rotating base to simulate
the motion. On every image frame grabbed, the translation stage was moved 5mm
along the X axis and the rotating base was rotated 2 to 5 degrees counterclockwise.
Table 5.12 shows the results of the tracking. The model aspect and the initial param-
eter estimates for the ﬁrst frame were provided by the indexing scheme. The initial
parameter estimates of the subsequent frames were obtained from the ﬁt parameter
estimates of the previous frame. The ﬁnal column of Table 5.12 illustrates that track-
ing was achieved with 1.3 pixels MSE on the edgemaps within 8 iterations of the
N-L—M method. From the results shown in Table 5.12, it is not difﬁcult to see that a
smaller angle of rotation 01 results in a smaller MSE but not necessarily small errors

in 3D pose parameters, which is consistent with the error analysis presented in Sec-

185

tion 2.5.1. The aligned model edgemaps are overlaid on the image sequence as shown
in Figure 5.8. Figure 5.9 depicts the evolution of the parameter estimates within and
across frames. It is obvious that the ﬁt translation parameters are consistent with the
trajectory of the motion as indicated in Figure 5.8(a). The parameters of the direc-
tion of rotation axis, 0 and 45, are less sensitive, especially when the rotation angle, a,
is small (for example, see the E9 values at frames t4, t5, and t6 in Table 5.12). This is
because small or results in small error (see Section 2.5.1) and small error leads to the
insensitivity of the related rotational parameters (See Section 3.3.3). This explains
why the errors for the rotation axis, E9 and Egg, are large for these three frames.
For the same reason, these three frames tend to need more iterations for convergence
because the N-L-M method may have to wander in the parameter space (due to the

random walk) before convergence.

Table 5.12: Results of motion tracking for a squirrel carving.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frame]! 0 E, E1” E, E0, E9 E, ﬂIteration MSE (pixel)
to —15° 1 0 0.0 1.1° 10° 00° 4 1.22
t1 —13° 1 0 0.0 1.6° 10° 00° 4 1.20
t2 —8° 1 0 0.0 25° 30° 04° 7 1.18
t3 -5° 1 0 0.0 35" 1.7° 0.1° 4 1.09
t., —2° 0 0 0.0 20° 68° 25° 8 0.98
t5 4° 1 0 0.0 03° 92° 1.7° 7 0.99
t9 7° 0 0 0.0 05° 94° 65° 7 1.19
t7 12° 1 1 0.0 0.2° 2.2° 4.9° 4 1.21
t9 15° 1 0 0.0 0.4° 1.5° 31° 6 1.27

(t,, t,, s, a, 0, d) represents the pose parameters.
E, = If) — page] where 35 is the estimated value of the ground truth pgmg.
The measurement unit for Etx and E1” is pixels.

 

 

186

 

(g)Frame t5 (h)Frame t6 (i)Frame t7 (j)Frame t3

Figure 5.8: Motion tracking of image sequences of a moving squirrel. (a) Image of
frame to added with thresholded images of frame t1 - t8. (b)-(j) are the results of
model ﬁtting with the white contours indicating the ﬁtted edgemaps.

Tracking Across Aspects

The second experiment was conducted with an image sequence generated from a
moving Ford Taurus 1:20 scaled toy model as shown in Figure 5.10. The Taurus
model was mounted on the tripod (see Figure 2.7) to simulate the motion across two
neighboring aspects diagonally, i.e., from aspect taurus] to aspect taurus20. Ten test

images were generated by equally sampling the orientations between these two neigh-

187

Evolution of Parameter Estimates during Motion Trac ' g

I

I

5

 

T
I

I
I
I
l

I
l
l
I

     

 

TIL

 

 

III-11.11111
5555I'e5111
III-"111.1
IIIIl‘IIII
IIIII‘IIIL

5555‘5551;

 

 

 

 

Translation
tx

Translation
0’

 

55 1111 ""5

55"-55'5555

1.55-5.5---

555'5'5'55

..... r---

55 {{{{ .5555

I
1
1
1
1
j
1
1
1
1

 

 

 

 

0.995 '
0.990

1.005 "

Scale Factor lax)

S

flu.

 

 

55555

 

 

 

 

 

w..

555

r11

 

 

 

1111.....1111.

11111.. It...

5551-

5'5Ae

5'51e

5'5..-

555...

-551-

 

 

  

20

Rotation Angle 0 n
a

Rotation Axis

Rotation Axis
41

t7 t8

t6

t4

1]

Initial Estimate t0

Figure 5.9: The evolution of parameter estimates during motion tracking of the squir-
rel. The black square boxes indicate the ground truth parameters. The vertical dotted

line denote the convergence of the N-L-M method at a particular frame.

188

  

(f) Frame t6

    

(g)Frame t7 (h)Frame t3 (i)Frame t9 (j)Frame 110

Figure 5.10: Motion tracking on the Taurus image sequence. The white contours
indicate the ﬁtted edgemaps overlaid on the image sequence.

boring aspects. In this image sequence, frames tl—t5 belong to aspect taurusl while
frames ts—tlo belong to aspect tauru320. The aligned model edgemaps are superim-
posed on the image sequence as shown in Figure 5.10. The goal of this experiment
is to study how the system handles the situation when object motion is across two
neighboring aspects of a model. Figure 5.11 shows the results of tracking without
changing model aspects. Both curves indicate that the ﬁtting error degrades signiﬁ-
cantly when the system uses the wrong model aspect for tracking. The intersection
of these two curves suggests the equilibrium point where the system should change

the model aspect to continue tracking the object. This raises questions about (1)

189

when such a decision should be made, and (2) which neighboring aspect should be
used to continue the tracking. The answers lie in Table 5.13 which lists the ﬁrst three
ranked model-aspect hypotheses and their ﬁtting errors (MSE). The correct model-
aspect hypotheses for frames tl—tlo are either ranked as 1 or 2 from indexing. The
ﬁtting error is getting larger as the vehicle approaches the aspect boundary which
is in between frames t5 and t6. Suppose that 2.2 pixels MSE is the user’s speciﬁed
error threshold for the tracking system to change from one model aspect to another.
Without any prior knowledge, the ﬁrst frame (i.e., frame t1), has to resort to the
recognition system to ﬁnd out the object model aspect and identity. For frame t1, the
indexing scheme generates 19 model-aspect hypotheses, and the veriﬁcation stage ac-
cepts the ﬁrst model-aspect hypothesis, taurusl, as the correct model aspect because
the MSE is below the user’s speciﬁed error threshold. The system uses taurusl to
continue tracking the vehicle without resorting to indexing until frame t9 where the
MSE is above the user’s speciﬁed error threshold (see Figure 5.11); then the system
uses the object silhouette at frame t9 for indexing; since model aspect tauru320 is
the ﬁrst hypothesized model aspect neighboring to taurusl, the system selects it for
veriﬁcation; the 1.8 pixels MSE of ﬁtting tauru320 to frame t6 indicates that tauru320
is the right model aspect, so the system uses it to continue the tracking. The results

of this tracking is also depicted in Figure 5.12.

Tracking Using A Real Image Sequence

The image sequence in this experiment consists of an ordered list of frames:

f1, f2, - - - , fn. For the purpose of the demonstration, a sequence containing n = 120

MSE (Pixels)

MSE (Pixels)

190

Tracking Ford Taurus with Two Model Aspects
5.5 I I I I I r I I
II aSchtl II _._ f
"aspect20" -----*

 

 

 

 

 

'11 12 13 14 15 t6 t7 18 19 110
ImageFrame

Figure 5.11: Tracking Ford Taurus from two model aspects.

Tracking Ford Taurus By Changing Model Aspects

 

 

 

 

2.2 1 T , ,
”tracking" —.—-

2 . .
1.8 - .
1.6 ~ ~
1.4 r i
1.2 - .

1 - .

t
0.8 I I I I I I I I
tl 12 t3 t4 t5 t6 t7 t8 19 110
Image Frame

Figure 5.12: Tracking Ford Taurus by changing model aspects.

191

Table 5.13: Motion tracking across aspects.

 

N HG denotes the number of hypotheses generated via indexing.
Indexing attribute tolerance criteria: (6;, (2,, ep, em) = (4.0, 0.4,4.0, 0.005)
MSE (unit = pixels) denotes the ﬁtting error in the N-L—M method.

 

 

 

 

Frame N HG Rank 1 Rank 2 Rank 3 Correct
No. Model MSE Model MSE Model MSE Model
t1 19 taurusl 0.9 taurule 5.2 taurusg 5.5 taurusl
t2 21 taurusl 0.9 taurule 5.3 taurusl8 4.3 taurusl

 

t3 20 taurusl l. 1 taurusl 8 4.9 mug26 10.1 taurusl
t4 20 taurusl 1 .4 tauruslS 5.1 mug26 10.2 taurusl
t5 23 taurusl 2.1 taurusl8 6.1 mug26 10.6 taurusl
t6 28 phonel 14.1 taurus20 l .8 mug16 8.3 taurus20
t7 27 phonel 15.0 taurus20 1.6 mug16 8.6 taurus20
t3 27 taurus20 1.4 mug34 12.8 mug25 12.0 taurus20
t9 31 taurus20 1.2 mug5 11.3 mug15 9.0 taurus20
ha 24 taurus20 1.0 zebra23 13.5 camaroll 7.4 taurus20

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

frames was used as a test (see Figure 5.13). This image sequence was captured by
a cam-corder and stored on a video tape. The object is a Ford Taurus vehicle mov-
ing around a parking lot. One frame out of ten was used to perform the tracking
(ftlvftga - ° - ,ftm which is a total of 12 frames). Figure 5.13 shows these twelve se—

lected frames and the result of the tracking.

Note that although the motion segmentation provides satisfactory masks, it is
still difﬁcult to automatically extract the exact boundary of the vehicle, especially
the boundary underneath the vehicle caused by the shadow, as can be seen in Fig-
ure 5.7(e). The imperfect boundary extraction may render the indexing scheme inef-
fective. We manually cleaned the top portion of the boundary and leave the bottom
portion untouched. The corresponding model aspects for these twelve frames via in-

dexing are given in Table 5.14. The fitting error (MSE) for each frame is also given

192

in Table 5.14. As mentioned in Chapter 3, in order to allow a full 3D rotation of
an object, we need to recover the roll angle of the image plane such that the model
can brought into alignment with the observed object. This roll angle as discussed in
Chapter 4 can be recovered from the corresponding part features of the model and
the sensory data, but we have not implemented this stage yet in the current system.
This will be left for the future work. As a result, the fitted edgemaps for these 12
frames are a little tilted as shown in Figure 5.13, and the MSE values as indicated in
Table 5.14 are a little large as compared to the MSE values for those simulated ones

in the previous two experiments.

Table 5.14: Results of tracking the Ford Taurus.

 

 

 

 

 

 

 

Frame Model Aspect MSE (pixels) Frame Model Aspect MSE (pixels)
t1 taurusl5 1.9 t7 taurusl2 2.2
t2 tauruslS 2.2 t3 taurusl2 1.9
t3 taurusl4 2.1 t9 taurusll 1.7
t, taurusl4 1.7 t 10 taurusll 2.1
t5 taurusl3 1.8 t“ taurusll 2.4
t6 taurusl2 2.5 tn taurusll 2.3

 

 

 

 

 

 

 

 

 

5.3 Summary

We have combined the modeling, matching, and indexing modules together with a
hierarchical veriﬁcation strategy into an integrated recognition system. The hierar-
chical veriﬁcation strategy uses three sets of threshold values in the matching module
to quickly remove from further verification the false model-pose hypotheses generated

from the indexing module. This hierarchical veriﬁcation strategy has shown to be vi-

193

tal and efficient in expediting the recognition, especially when the indexing module
fails to reduce the number of verifications required due to the partial occlusion of the
observed object. Our recognition experimental results have shown that we have made
good progress toward a general alignment system. This system should work well on
some existing recognition problems.

We have demonstrated the applicability of the recognition system for tracking sin-
gle moving objects in image sequences. We have also studied how the tracking system
handles the object moving across model aspects. Our experimental results from both
simulated and real image sequences have shown the viability of the tracking method.
However, for the tracking method to work really well on real image sequences, the
current system needs to recover the roll angle of the image plane to allow a full 3D

rotation of the moving object. This is left as the future work.

194

 

(j) Frame tm ' (k) Frame tn (1) Frame in

Figure 5.13: Tracking a moving vehicle in a parking lot.

Chapter 6

Summary, Conclusion, and Future

Research

6.1 Summary and Conclusions

The research addressed in this thesis dealt with recognizing arbitrary curved 3D
objects from single 2D intensity images. We have developed a complete object recog-
nition system within the alignment paradigm. This paradigm involves three major
schemes. The modeling scheme consists of constructing model aspects for predict-
ing the object appearance seen from any viewpoint. The indexing scheme generates
hypotheses about candidate model aspects and poses which are then veriﬁed in the
matching scheme to support /refute the hypotheses and also to estimate/reﬁne the
object pose of the correct model.

In Chapter 2, we approached the problem of modeling 3D sculptured objects
using multi—view 2%D representations. The modeling scheme adopted the curvature

195

196

method of Basri and Ullman [9] to generate the predicted appearance of a 3D object
with a smooth surface following a 3D rotation. In this modeling scheme, a 3D object
is represented by a number of view-centered model aspects, rather than by a single
ob ject-centered geometric model. Each model aspect covers a range of many possible
neighboring viewing angles within the aspect. Using a number of model aspects, the
contour of an object seen from any vieWpoint can be predicted. The computations
required in this modeling scheme during the prediction stage is simple, and the model
construction is almost automatic when object prototypes are available. This modeling
scheme was applied to 20 arbitrary curved 3D objects with 578 model aspects, and the
experimental results are supportive of the design. We conclude that this modeling
scheme is viable in modeling many kinds of objects including both polyhedra and
sculptured objects.

In Chapter 3, we studied the problem of pose estimation from a known model
aspect and a 2D edgemap extracted from the scene. The proposed pose/refinement
algorithm, which can handle objects with partial occlusion, does not assume the ex-
istence of salient features in the image, and thus, is directly applicable to smooth
objects with sculptured surfaces. Due to the unavailability of salient features in ob-
jects with sculptured surfaces, we developed an iterative matching technique using
the Newton’s method with Levenberg—Marquardt minimization to estimate/ reﬁne the
object pose in a hill-climbing manner. As with any minimization technique, this itera-
tive matching technique requires good initial parameter estimates to avoid converging
to a local minimum. Two heuristics were adopted to improve the pose accuracy and

handle occlusion. The ﬁrst heuristic was used to synthesize feature correspondences

197

for pose estimation/refinement of objects with partial occlusion. The second heuris-
tic was adopted to to maintain a balance of alignment between object internal edges
and object silhouettes. Many conclusions were drawn from this study. The matching
algorithm was applied to 6 test objects, either partially occluded or non-occluded,
with 240 samples of initial parameter estimates in each object experiment. The re-
sults from these Monte Carlo experiments support the hypothesis that ﬁtting internal
edges and silhouette simultaneously can produce more accurate estimation of pose pa-
rameters than fitting the silhouette alone. The results also demonstrate a high rate
of convergence for a broad set of starting orientations within a model aspect. The
matching algorithm was also applied to 60 test model aspects, and the results indi-
cated that alignment was achieved with 1.6 pixels mean—squared distance error on
the edgemap within 10 iterations of the hill-climbing procedure. These results are
consistent with those reported by Lowe [76]. We conclude that large number of ﬁne
aspects are not needed for modeling. The experimental results also demonstrate that
arbitrary curved objects can be handled and some occlusion can be tolerated. These
results are not tied to the modeling scheme from Chapter 2. The matching algorithm
can work with a feature-based CAD model which can predict an object’s edgemap
when given a viewpoint.

In Chapter 4, we studied the problem of indexing candidate model aspect and pose
from a model-aspect database using 2D silhouettes. A part segmentation scheme was
proposed to decompose silhouettes into parts whose invariant attributes are used for
indexing. A hypothesis grouping scheme was also proposed to cluster hypotheses

stemming from the same model. We also studied four voting schemes for ordering

198

hypotheses based on prior knowledge of pre—stored models and the visual evidence of
the observed objects such that the most likely hypotheses are veriﬁed ﬁrst. When
combined into a complete system, these techniques make progress toward improving
accuracy and efﬁciency by pruning false model hypotheses and minimizing unneces-
sary veriﬁcation tests. Several experiments were performed on a database containing
658 model aspects, and the results support the hypothesis that HYBRID is the most
efﬁcient hypothesis ordering scheme in pruning false model hypotheses. Experimen—
tal results also indicate that the indexing scheme generates many hypotheses, but,
through hypothesis grouping and ordering, a significant number of veriﬁcation tests
for false models can be avoided. The success of the indexing scheme also indicates
that the proposed part representation is robust under global transformation and local
deformation.

We have combined the modeling, matching, and indexing schemes into a
recognition-by-alignment paradigm together with a hierarchical veriﬁcation strategy.
The hierarchical veriﬁcation strategy uses three sets of threshold values in the match-
ing scheme to quickly eliminate from further consideration the false model hypotheses
generated from the indexing scheme. The matching algorithm was applied to 100
randomly selected model aspects with totally 1200 test cases to select appropriate
threshold values at various levels. These empirical threshold values were used to
perform the recognition experiments on 60 test model aspects. The results indicate
that the integrated recognition system is viable in handling a large database of many
kinds of objects and the hierarchical veriﬁcation strategy is effective in expediting the

recognition, especially when the indexing scheme fails to reduce the number of false

199

model hypotheses generated due to partial occlusion of the observed object.

We have applied the recognition system to tracking a single moving object in a
scene from an image sequence. We have also studied how to track the moving object
across aspects. The recognition system was applied to 3 either simulated or real image
sequences to perform object tracking. Experimental results indicated the viability of
the tracking method and also demonstrated the applicability of the tracking method
by tracking a real car in an image sequence.

In this thesis, we have proposed a model-based computational paradigm for recog-
nition of arbitrary curved 3D objects from single 2D intensity images. Experimental
results show that this computational paradigm is viable in handling a large database

of many kinds of objects including polyhedra and sculptured objects.

6.2 Future Research

We have presented a model-based object recognition paradigm which shows that
recognizing 3D objects with sculptured surfaces from 2D images is feasible. However,
there are still a number of research issues to be addressed in the future. The most
prominent one is the ﬁgure-ground separation. In this thesis, we have implicitly
assumed a reasonable extraction of the object from its background. Figure-ground
separation is a major obstacle to cognitive psychology and computer vision in general
and to the alignment in particular. Thus, future work should include an investigation
of a better ﬁgure-ground separation algorithm.

In Chapter 2, we have used a 3—image stereo matching algorithm to compute the

200

curvature and 3D information from edgemaps for object modeling. This 3—image
stereo matching technique has applied several constraints/ heuristics to eliminate am-
biguous matches of object internal edges. However, the predicted appearance of
objects shown in the experiment results indicates that there are still some out-
liers due to inaccurate matches in this stereo matching algorithm. Thus, the con-
straints/heuristics used in this stereo matching algorithm for object modeling should
be further studied.

In Chapter 3, we have proposed a matching algorithm for pose estima-
tion / reﬁnement. We have deﬁned a merit function based on the mean-square distance
error between the observed edgemap and the edgemap of a model aspect. We have
seen that this matching algorithm requires reasonable initial parameter estimates to
avoid converging to local minima. We have incorporated the random walk feature in
the matching algorithm to escape local minima. However, it is a stochastic process,
which may not always lead to a global minimum convergence. Moreover, this random
walk feature sometimes causes the matching algorithm to wander around in param-
eter space, resulting in a slow convergence. Thus, it might be better to remove this
random walk feature in the matching algorithm if a good initial parameter estimate
can be obtained for pose reﬁnement or, perhaps to involve some other kind of search
through parameter space. A good initial parameter estimate may be obtained through
a coarse-to—ﬁne subdivision of the parameter space; the mean—square distance error
may be used as a similarity measure to determine the best initial parameter estimate.
All these require further investigation.

In Chapter 4, we have proposed an indexing scheme for generating model hy-

201

potheses from a database. This indexing scheme uses shape features obtained from
the object silhouette for indexing and is effective in pruning false model hypotheses;
however, the discriminating capacity of shape features is not very powerful. It is likely
to be useful to incorporate other types, such as color or texture features, or shape
features and relational features from internal edge contours. Also, it might be a good
idea to design a similarity measure between shape features for a better hypothesis
ordering scheme to avoid unnecessary verification tests.

The use of part invariant features derived from object silhouettes enables the
indexing scheme to effectively generate a small set of candidate model hypotheses;
but this indexing scheme fails when all parts of an observed object are partially
occluded. This suggests that the silhouette may need to be segmented using a small
scale for smoothing; however, this could also lead to more noisy and unnecessary
parts which may degrade the indexing. Toward this end, a scale-space approach to
part segmentation might be an important issue to be explored in the future.

The proposed object recognition paradigm in this thesis assumes that an object’s
orientation is ﬁxed when the object is modeled on a viewsphere. Consequently, the
matching technique assumes a zero—roll angle of the image plane. This assumption
ﬁxes one degree of freedom of the rotation and has no doubt and contributes to the
accuracy of the pose reported. The detection of this roll angle can be done by least-
square solution using part correspondences between the model and the image in the
indexing scheme. Thus, it is desirable to incorporate this roll angle detection in the
indexing scheme and then to use the detected roll angle for the matching algorithm

to allow a full 3D rotation of an object.

202

Finally, we have applied the integrated recognition system to track a single mov-
ing object in an image sequence. The tracking method applies the indexing scheme
to select the model aspect for tracking when the observed object moves across as-
pects. Although the indexing scheme can effectively generate the candidate neigh-
boring model-aspect hypotheses, this approach is clumsy. Thus, future work should
incorporate the motion information detected during the tracking to determine the

neighboring aspect automatically without resorting to indexing.

 

 

Bibliography

[1] Y. Aloimonos and D. Tsakiris. On the mathematics of visual tracking. Image

Vision Computing, 9(4):235—251, 1991.

[2] Y. Aloimonos, I. Weiss, and A. Bandopadhay. Active vision. Int. Journal of

Computer Vision, 2:333—356, 1988.

[3] H. Asada and M. Brady. The curvature primal sketch. IEEE Trans. on Pattern

Analysis and Machine Intelligence, 5(1):2—14, 1983.

[4] N. Ayache and O. D. Faugeras. HYPER: A new approach for the recognition
and positioning of two—dimensional objects. IEEE Trans. on Pattern Analysis

and Machine Intelligence, 8(1):44—54, 1986.

[5] R. Bajcsy and F. Solina. Three—dimensional object representation revisited. In

Proceedings Ist Int. Conf. on Computer Vision, pages 231—240, London, 1987.

[6] D. H. Ballard and C. H. Brown. Computer Vision. Prentice-Hall, Englewood
Cliffs, New Jersey, 1982.

203

 

204
[7] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Paramet-

ric correspondence and chamfer matching. In Proceedings Int. Joint Conf. on

Artiﬁcial Intelligence, pages 22—25, Massachusetts, August 1977.

[8] R. Basri. The alignment of objects with smooth surfaces: Error analysis of the
curvature method. In Proceedings IEEE Conf. on Computer Vision and Pattern

Recognition, pages 341-346, Illinois, June 1992. h]

[9] R. Basri and S. Ullman. The alignment of objects with smooth surfaces. In Pro-

ceedings 2nd Int. Conf. on Computer Vision, pages 482—488, Florida, December

 

Tﬁ .
_\ . . .~ . x
I

1988.

[10] J. S. Beis and D. G. Lowe. Learning indexing functions for 3—D model-based
object recognition. In Proceedings IEEE Conf. on Computer Vision and Pattern

Recognition, pages 275-280, Seattle, WA, 1994.

[11] P. Besl and N. McKay. A method for registration of 3-D shape. IEEE Trans.

on Pattern Analysis and Machine Intelligence, 14(2):239—256, 1992.

[12] P. J. Besl and R. C. Jain. Three—dimensional object recognition. ACM Com-

puting Surveys, 17(1):75—154, 1985.

[13] I. Biederman. Recognition-by-Components: A theory of human image under-

standing. Psychological Review, 94(2):115—147, 1987.

[14] T. O. Binford. Visual perception by computer. In IEEE Conference on Systems

and Control, Miami, 1971.

 

205

[15] G. Blais and M. D. Levine. Registering multiview range data to create 3D
computer objects. IEEE Trans. on Pattern Analysis and Machine Intelligence,

17(8y820—824,1995.

[16] R. C. Bolles and R. A. Cain. Recognizing and locating partially visible objects,
the local—feature—focus method. Int. Journal of Robotics Research, 1(3):57—82,

1982. "

[17] R. C. Bolles and P. Horaud. 3DPO: A three-dimensional part orientation sys- ‘

tern. Int. Journal of Robotics Research, 5(3):3—26, 1986.

 

[18] A. J. Bray. Tracking objects using image disparity. Image Vision Computing,

8(1y4—9,1990.

[19] T. M. Breuel. Indexing for visual recognition from a large model base. AI Lab

Memo, MIT, 1990.

[20] T. M. Breuel. Fast recognition using adaptive subdivisions of transformation
space. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition,

pages 445—451 , 1992.

[21] R. A. Brooks. Symbolic reasoning among 3D models and 2D images. Artiﬁcial

Intelligence, 17:285—349, 1981 .

[22] J. B. Burns, R. S. Weiss, and E. M. Riseman. View variation of point—set and
linesegment features. IEEE Trans. on Pattern Analysis and Machine Intelli-

gence, 15(1):51—68, 1993.

[23]

[24]

[25]

[25]

[27]

[28]

[29]

[30]

206

A. Califano and R. Mohan. Multidimensional indexing for recognizing visual
shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(4):373—

392, 1994.

O. I. Camps. PREMIO: The Use of Prediction in a CAD-Model—Based Vi-
sion System. PhD thesis, Department of Electrical Engineering, University of

Washington, 1992.

J. Canny. A computational approach to edge detection. IEEE Trans. on Pattern

Analysis and Machine Intelligence, 8(6):679—698, 1986.

C.-C. Chen. Improved moment invariants for shape discrimination. Pattern

Recognition, 26(5):683—686, 1993.

J.-L. Chen, G. C. Stockman, and K. Rao. Recovering and tracking pose of
curved 3D objects from 2D images. In Proceedings IEEE Conf. on Computer

Vision and Pattern Recognition, pages 233—239, New York, June 1993.

Sei-Wang Chen. 3-D Representation and Recognition Using Object Wings. PhD

thesis, Michigan State University, 1989.

Y. Chen and G. Medioni. Object modeling by registration of multiple range
images. International Journal of Image and Vision Computing, 10(3):145—155,

April 1992.

Y. Chen and G. Medioni. Fitting a surface to 3-D points using an inﬂating
balloon model. In Proceedings IEEE 2nd CAD-Based Vision Workshop, pages

266—273, Champion, PA, 1994.

I?“ ”mu-"“1

[31]

[32]

[33]

[34]

[35]

[36]

[37]

207
R. T. Chin and C. R. Dyer. Model-based recognition in robot vision. ACM

Computing Surveys, 18(1):67—108, March 1986.

R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Pro-

ceedings 3rd Int. Conf. on Computer Vision, pages 616—623, 1990.

D. T. Clemens and D. W. Jacobs. Space and time bounds on indexing 3-
D models from 2-D images. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 13(10):1007—1017, 1991.

L. D. Cohen and 1. Cohen. F inite—element methods for active contour models
and balloons for 2D and 3D images. IEEE Trans. on Pattern Analysis and

Machine Intelligence, 15(11):1131—1147, 1993.

M. Dhome, M. Richetin, J .-T. Lapreste, and G. Rives. Determinination of the
attitude of 3D objects from a single perspective view. IEEE Trans. on Pattern

Analysis and Machine Intelligence, 11(12):1265—1278, 1989.

C. Dorai and A. K. Jain. COSMOS — a representation scheme for free-form
surfaces. In Proceedings of the International Conference on Computer Vision,

pages 1024—1029, Boston, MA, June 1995.

C. Dorai, J. Weng, and A. K. Jain. Optimal registration of multiple range views.
In Proceedings of the 12th International Conference on Pattern Recognition,

volume 1, pages 569—571, Jerusalem, Israel, October 1994.

 

II.

[38]

[39]

[40]

[41]

[42]

[43]

[44]

208

M.-P. Dubuisson and A. K. Jain. Object contour extraction using color and mo-
tion. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition,

pages 471—476, New York, June 1993.

D. Eggert and K. Bowyer. Computing the perspective projection aspect graph of
solids of revolution. IEEE Trans. on Pattern Analysis and Machine Intelligence,

15(2y109—128,1993.

O. Faugeras and M. Hebert. The representation, recognition, and locating of

3D objects. Int. Journal of Robotics Research, 5(3):27—52, 1986.

M. A. F ischler and R. C. Bolles. Random sample consensus: A paradigm for
model ﬁtting with application to image analysis and automated cartography.

Communications ACM, 24(6):381—395, 1981.

P. J. Flynn. CAD-Based Computer Vision: Modeling and Recognition Strate-
gies. PhD thesis, Department of Computer Science, Michigan State University,

1990.

P. J. Flynn and A. K. Jain. Bonsai: 3D object recognition using con-
strained search. IEEE Trans. on Pattern Analysis and Machine Intelligence,

13(10y1066—1075,1991.

D. Forsyth, J. L. Mundy, A. Zisserman, C. Coelho, A. Heller, and C. Rothwell.
Invariant descriptors for 3-D object recognition and pose. IEEE Trans. on

Pattern Analysis and Machine Intelligence, 13(10):971-991, 1991.

 

 

 

[45]

[46]

[47]

[48]

[49]

[50]

[51]

[52]

209

H. Freeman. Shape description via the use of critical points. Pattern Recogni-

tion, 10:159—166, 1978.

P. Giblin and R. Weiss. Reconstruction of surfaces from proﬁles. In Proceedings

Ist Int. Conf. on Computer Vision, pages 136—144, London, 1987.

W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric

Constraints. MIT Press, Cambridge MA, 1990.

W. E. L. Grimson and T. Lozano—Perez. Model-based recognition and local-

ization from sparse range or tactile data. Int. Journal of Robotics Research,

3(3):3—35, 1984.

W. E. L. Crimson and T. Lozano—Perez. Localizing overlapping parts by search-
ing the interpretation tree. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 9(4):469——482, 1987.

A. D. Gross and T. E. Boult. Error of ﬁt measurements for recovering parametric
solids. In Proceedings 2nd Int. Conf. on Computer Vision, pages 690—694,

Tampa, FL, 1988.

R. M. Haralick, H. Joo, C.-N. Lee, X. Zhuang, V.G. Vaidya, and M. B. Kim.
Pose estimation from corresponding point data. IEEE Trans. on Systems, Man,

and Cybernetics, 19(6):1426—1446, 1989.

K. Higuchi, H. Delingette, M. Hebert, and K. Ikeuchi. Merging multiple views
using a spherical representation. In Proceedings IEEE 2nd CAD-Based Vision

Workshop, pages 124—131, Champion, PA, 1994.

210
[53] D. D. Hoffman and W. A. Richards. Codon constraints on closed 2d shapes.

Computer Vision, Graphics, and Image Processing, 31:265—281, 1985.

[54] D. D. Hoffman and W. A. Richards. Parts of recognition. Cognition, 18:65—96,

1985.

[55] R. Horaud. New methods for matching 3D objects with single perspective views.

IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(3):401—412, 1987.

[56] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.

[57] B. K. P. Horn and B. G. Schunck. Determining optical ﬂow. Artiﬁcial Intelli-

gence, 17: 185—203, 1981.

[58] M. K. Hu. Visual pattern recognition by moment invariants. IEEE Trans.

Information Theory, 12:179—187, 1962.

[59] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing images
using the Hausdorff distance. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 15(9):850—863, 1993.

[60] D. P. Huttenlocher and S. Ullman. Object recognition using alignment. In

Proceedings Ist Int. Conf. on Computer Vision, pages 102—111, London, 1987.

[61] D. W. Jacobs. Space efﬁcient 3D models indexing. In Proceedings IEEE Conf.

on Computer Vision and Pattern Recognition, pages 439—444, 1992.

[62] A. K. Jain and M. P. Dubuisson. Segmentation of x-ray and c-scan images of

ﬁber reinforced composite materials. Pattern Recognition, 25(3):257—270, 1992.

[63]

[64]

[65]

[66]

[67]

[68]

[69]

[70]

211

M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Int.

Journal of Computer Vision, 2(1):322—331, 1987.

D. Keren, D. Cooper, and J. Subrahmonia. Describing complicated objects by
implicit polynomials. IEEE Trans. on Pattern Analysis and Machine Intelli-

gence, 16(1):38—53, 1991.

T. F. Knoll and R. C. Jain. Recognizing partially visible objects using feature
indexing hypotheses. IEEE Journal of Robotics and Automation, 2(1):3—13,

1986.

J. J. Koenderink and A. J. van Doorn. Internal representation of solid shape

with respect to vision. Biological Cybernetics, 32(4):211—216, 1979.

V. Koivunen and J .-M. Vezien. Multiple representation approach to geometric
model construction from range data. In Proceedings IEEE 2nd CAD-Based

Vision Workshop, pages 132—139, Champion, PA, 1994.

M. R. Korn and C. R. Dyer. 3-D multiview object representations for model-

based object recognition. Pattern Recognition, 20(1):91—104, 1987.

D. J. Kriegman and J. Ponce. On recognizing and positioning curved 3D ob-
jects from image contours. IEEE Trans. on Pattern Analysis and Machine

Intelligence, 12:1127—1137, 1990.

K. N. Kutulakos, W. B. Scales, and C. R. Dyer. Building global object models
by purposive vieWpoint control. In Proceedings IEEE 2nd CAD-Based Vision

Workshop, pages 169—182, Champion, PA, 1994.

'd'.’ hf'sg. w

 

It.

‘17: '5”.~_ _'A. , - 0. re. ’

[71]

[72]

[73]

[74]

[75]

[76]

212

Y. Lamdan, J. T. Schwartz, and H. J. Wolfson. Object recognition by afﬁne in-
variant matching. In Proceedings IEEE Conf. on Computer Vision and Pattern

Recognition, pages 335—344, 1988.

Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efﬁcient
model-based recognition scheme. In Proceedings 2nd Int. Conf. on Computer

Vision, pages 238—249, 1988.

C.-W. Liao and G. Medioni. Surface approximation of a cloud of 3D points.

In Proceedings IEEE 2nd CAD-Based Vision Workshop, pages 274—281, Cham-

pion, PA, 1994.

C. C. Lin and R. Chellappa. Classification of partial 2—D shapes using fourier de—
scriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(5):686—

673, 1987.

D. G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic

Publishers, Boston, 1985.

D. G. Lowe. Three-dimensional object recognition from single two-dimensional

images. Artiﬁcial Intelligence, 31(3):355—395, 1987.

[77] D. G. Lowe. Fitting parameterized 3-D models to images. IEEE Trans. on

Pattern Analysis and Machine Intelligence, 13(5):441—450, 1991.

[78] D. Marr. Vision. Freeman, New York, NY, 1982.

 

t~‘om_. - v '

[79]

[80]

[81]

[82]

[83]

[84]

[85]

213

D. Marr and H. K. Nishihara. Representation and recognition of the spatial
organization of three-dimensional structure. In Proceedings Royal Society of

London, volume 200, pages 269-294, 1978.

T. Mclnerney and D. Terzopoulos. A ﬁnite element model for 3D shape re—
construction and nonrigid motion tracking. In Proceedings of the International

Conference on Computer Vision, pages 518—523, Berlin, Germany, May 1993.

R. Mohan, D. Weinshall, and R. R. Sarukkai. 3D object recognition by indexing
structural invariants from multiple views. In Proceedings of the International

Conference on Computer Vision, pages 264—268, Berlin, Germany, May 1993.

F. Mokhtarian and A. K. Mackworth. A theory of multi-scale, curvature-based
shape representation for planar curves. IEEE Trans. on Pattern Analysis and

Machine Intelligence, 14:789—805, 1992.

E. E. Molios. Shape matching using curvature processes. Computer Vision,

Graphics, and Image Processing, 43:203—226, 1989.

D. Murray and A. Basu. Motion tracking with an active camera. IEEE Trans.

on Pattern Analysis and Machine Intelligence, 17(5):518—522, 1995.

D. W. Murray and D. B. Cook. Using the orientation of fragmentary 3D edges
segments for polyhedral object recognition. Int. Journal of Computer Vision,

2(2):153-169, 1988.

 

 

[86]

[87]

[88]

[89]

[90]

[91]

[92]

[93]

214

C. Nastar and N. Ayache. Fast segmentation, tracking and analysis of de—
formable objects. In Proceedings of the International Conference on Computer

Vision, Berlin, Germany, May 1993.

C. F. Olson. Probabilistic indexing for object recognition. IEEE Trans. on

Pattern Analysis and Machine Intelligence, 17(5):518—522, 1995.

A. P. Pentland. From Pixels to Predicates. Ablex, 1986.

A. P. Pentland. Recognition by parts. In Proceedings Ist Int. Conf. on Computer

Vision, pages 612—620, London, 1987.

T. Poggio and S. Edelman. A network that learns to recognize three-dimensional

objects. Nature, 343:263—266, 1990.

J. Ponce, D. J. Kriegman, S. Petitjean, S. Sullivan, G. Taubin, and B. Vijayaku-
mar. Representations and Algorithms for 3D Curved Objects Recognition. In
A. K. Jain and P. J. Flynn, editors, Three-Dimensional Object Recognition

Systems, pages 327—352. Elsevier, 1993.

N. S. Raja. Obtaining Generic Parts from Range Images using a Multi- View

Representation. PhD thesis, Michigan State University, 1992.

A. P. Reeves, R. J. Prokop, S. E. Andrews, and F. P. Kuhl. Three-dimensional
shape analysis using moments and Fourier descriptors. IEEE Trans. on Pattern

Analysis and Machine Intelligence, 10(6):937—943, 1988.

215

[94] P. L. Rosin. Multiscale representation and matching of curves using codons.

Computer Vision, Graphics, and Image Processing, 55(4):286—310, 1993.

[95] H. Samet. The Design and Analysis ofSpatial Data Structures. Addison-Welsey,

New York, 1990.

[96] J. T. Schwartz and M. Sharir. Identiﬁcation of partially obscured objects in two
and three dimensions by matching noisy characteristics curves. Int. Journal of

Robotics Research, 6(2):29—44, 1987.

[97] W. B. Seales and O. D. Faugeras. Building three-dimensional CAD / CAM mod—
els from image sequences. In Proceedings IEEE 2nd CAD-Based Vision Work—

shop, pages 116—123, Champion, PA, 1994.

[98] R. Sekuler and R. Blake. Perception. Alfred A. Knopf, Inc., New York, 1985.

[99] K. Siddiqi and B. B. Kimia. Parts of visual form: Computational aspects. IEEE

Trans. on Pattern Analysis and Machine Intelligence, 17(3):239—251, 1995.

[100] M. Soucy and D. Laurendeau. A general surface approach to the integration of a

set of range views. IEEE Trans. on Pattern Analysis and Machine Intelligence,

17(4):344—358, 1995.

[101] F. Stein and G. Medioni. Structural indexing: Efﬁcient 2—D object recognition.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(12):1198—1204,

1992.

 

[102]

[103]

[104]

[105]

[106]

[107]

[108]

216

F. Stein and G. Medioni. Structural indexing: Efﬁcient 3-D object recogni-
tion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(2):125—

145, 1992.

G. C. Stockman. Object recognition and localization via pose clustering. Com-

puter Vision, Graphics, and Image Processing, 40:361—387, 1987.

G. C. Stockman. Object Recognition. In Ramesh C. Jain and Anil K. Jain,
editors, Analysis and Interpretation of Range Images, pages 225—253. Springer-

Verlag, 1990.

G. C. Stockman, S. Kopstein, and S. Benett. Matching images to models for reg-
istration and object detection via clustering. IEEE Trans. on Pattern Analysis

and Machine Intelligence, 3(3):229—241, 1982.

P. Suetens, P. Fua, and A. J. Hanson. Computational strategies for object

recognition. ACM Computing Surveys, 24(1):5—61, March 1992.

S. Sullivan, L. Sandford, and J. Ponce. On using geometric distance ﬁts to
estimate 3D object shape, pose, and deformation from range, CT, and video
images. In Proceedings IEEE Conf. on Computer Vision and Pattern Recogni-

tion, pages 110—115, New York, June 1993.

G. Taubin. Estimation of planar curves, surfaces and nonplanar space curves de-
ﬁned by implicit equations, with applications to edge and range image segmenta-
tion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(11):1115—

1138, 1991.

[109}

[110]

[111]

[112]

[113]

[114]

[115]

217

D. W. Thompson and J. L. Mundy. Three-dimensional model matching from an

unconstrained vieWpoint. In International Conference on Robotics Automation,

pages 208—220, 1987.

L. W. Tucker, C. R. Feynman, and D. M. Fritzsche. Object recognition using
the connection machine. In Proceedings IEEE Conf. on Computer Vision and

Pattern Recognition, 1988.

S. Ullman. An approach to object recognition: Aligning pictorial descriptions.

AI Memo No. 931, M.I.T. AI Laboratory, 1986.

S. Ullman and R. Basri. Recognition by linear combinations of models. IEEE

Trans. on Pattern Analysis and Machine Intelligence, 13(10):992—1006, 1991.

R. Vaillant and O. Faugeras. Using extremal boundaries for 3-D object model-
ing. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(2):157—173,

1992.

T. P. Wallace, 0. R. Mitchell, and K. Fukunaga. Three-dimensional shape
analysis using local shape descriptors. IEEE Trans. on Pattern Analysis and

Machine Intelligence, 3:310—323, 1981.

A. S. Wallack and J. F. Canny. Efﬁcient indexing techniques for model based
sensing. In Proceedings IEEE Conf. on Computer Vision and Pattern Recogni-

tion, pages 259—266, Seattle, WA, 1994.

[116]

[117]

[118]

[119]

[120]

[121]

218

Y. F. Wang, M. J. Magee, and J. K. Aggarwal. Matching three-dimensional
objects using silhouettes. IEEE Trans. on Pattern Analysis and Machine In-

telligence, 6(4):513—518, 1984.

M. D. Wheeler and K. Ikeuchi. Sensor modeling, probabilistic hypothesis gen-
eration, and robust localization for object recognition. IEEE Trans. on Pattern

Analysis and [Machine Intelligence, 17(3):252—265, 1995.

J. M. Wolfe. The Mind's Eye. Freeman and Company, New York, 1986.

D. M. Wuescher and K. L. Boyer. Robust contour decomposition using a con-
stant curvature criterion. IEEE Trans. on Pattern Analysis and Machine In-

telligence, 13(1):41—51, 1991.

J. H. Yi and D. Chelberg. Rapid object recognition from a large model database.
In Proceedings IEEE 2nd CAD-Based Vision Workshop, pages 28—35, Cham-

pion, PA, 1994.

J. Y. Zheng. Acquiring 3-D models from sequences of contours. IEEE Trans.

on Pattern Analysis and Machine Intelligence, 16(2):163—178, 1994.

 

"‘illlillll’lllilllE5