CAD-BASED COMPUTER VISION: MODELING AND RECOGNITION STRATEGIES

By

Patrick Joseph Flynn

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1990

Abstract

CAD-Based Computer Vision: Modeling and Recognition Strategies

By Patrick Joseph Flynn

In this thesis we develop a computer vision system for recognizing and locating three-dimensional objects in an industrial setting, where the objects have been designed on a commercial solid modeler (CAD system). Such a vision system can aid in automating labor-intensive manufacturing tasks such as inspection, classification, and packaging for shipment. The problem to be attacked is stated as follows: given a depth map containing one or more 3D objects which are piecewise-planar, piecewise-quadric, or piecewise-surface-of-revolution, identify and locate those objects which are instances of predefined 3D models stored in an object database.

Of particular importance to an object recognition system is the representation of 3D objects. The representations produced by present-day CAD systems are not organized efficiently for object recognition, and some salient features are not stored explicitly. One major contribution of this work is the development of a novel approach to object representation. We link a 3D boundary representation with powerful viewpoint-dependent features to produce augmented relational graph representations of objects.

One of the popular computational models for object recognition is interpretation-tree search. We have implemented a search-based recognition module which iteratively proposes identity and pose hypotheses for combinations of scene surfaces, and then discards hypotheses not in agreement with the scene content. Several hypothesis verification steps are used to prune the search space being explored; these steps involve comparison of the scene entities in the hypothesis and model features, and tests are based on intuitive geometric constraints such as proximity, visibility, and relative orientation.

This object recognition system is evaluated with both synthetically-generated depth data from CAD models of twenty objects and depth data acquired from a triangulation-based 3D range finder. An input image is segmented into connected surface patches using a region-based local-feature clustering technique, followed by patch merging at domain-independent and domain-dependent levels. Surface patches are classified using regression and surface curvature estimates. Classified patches, their parameters, edge positions, and edge lengths serve as input to the identification/localization module. This system is implemented on a graphics workstation in the C and Common LISP languages, and tested on a variety of 3D objects.
To my parents, and Laurie June Starr

Acknowledgements

The research described in this thesis would not have been possible were it not for the continuous support of my thesis advisor, Professor Anil Jain. Through my years in graduate school, he has constantly been supportive during the rough times, and cracked the whip over me during my occasional lazy periods. I deeply appreciate our hundreds of hours of discussions, and value him as a close friend. I also gratefully acknowledge the assistance of Professor George Stockman, another of my committee members, with whom I have spent many fruitful hours in conversation. I also appreciate the assistance and interest of Professor Lionel Ni, Professor Richard Hill, and Professor Erik Goodman, all members of my thesis committee. I thank Dr. Goodman for making the facilities of the Case Center for Computer-Aided Engineering and Manufacturing available to me during my thesis research. I spent three enjoyable months during the summer of 1987 as an intern in the Automation Sciences Laboratory of the Northrop Research and Technology Center in Palos Verdes, California. Dr. H. R. Keshavan, the laboratory director and the sixth member of my thesis committee, has been generous with his time and resources during and after my internship at Northrop. I value his many comments and practical advice highly. Professors Richard Dubes and Mihran Tuceryan have also given me many useful ideas over the past few years. I appreciate their guidance. I also value the counsel of Dr. Torfinn Taxt of the Norwegian Computing Centre and Professor Guna Seetharaman of the University of Southwestern Louisiana, both of whom spent several months at MSU as visiting scholars. Dr. Paul Besl of General Motors Research Laboratories and Professor W. Eric L. Grimson of the Artificial Intelligence Laboratory at the Massachusetts Institute of Technology gave advice at crucial points during my research.

The Pattern Recognition and Image Processing Laboratory has been an excellent environment in which to carry out graduate research and yet preserve one's sanity during the high-pressure periods. In no small part, this is due to the concern and encouragement of others, particularly my graduate student comrades. Many thanks to all current and former graduate students for their help, especially Dr. Rick Hoffman, Dr. Gongzhu Hu, Dr. Sei-Wang Chen, Dr. Bruce McMillin, Henry Davies, Ray Klein, thyng Lee Hoffman, Joe Miller, Farshid Farrokhnia, Deborah Trytten, Narayan S. Raja, Satish Nadabar, Tim Newman, Arun Nanda, and Sally Howden. Special thanks go to David Marks, our laboratory manager, for his assistance and the occasional trip out for pizza. Thanks to Drs. Roland Zapp and Marvin Siegel of the Electrical Engineering Department, and Gregory Hoshal and Bill Byrne, for helping me get my start in research as an undergraduate.

This research has been supported by the National Science Foundation under grants IRI-8600371 and CDA-8806599, by the Department of Computer Science, by the GTE Graduate Fellowship Program, and by the Northrop Research and Technology Center. I also appreciate the financial support of Dr. Reddiar S. Anbalagan of Innovision Corporation of Madison, Wisconsin. Finally, I acknowledge with love the support of my parents, my family, my wife's family, and especially my wife Laurie, who kept me motivated.

Table of Contents

List of Tables
List of Figures
1 Introduction
1.1 Manufacturing and Computer Vision
1.2 Three-Dimensional Sensing
1.3 3D Object Recognition
1.4 Synthesis: CAD-Based 3D Vision
1.5 Outline of the Dissertation
2 Representation: Formalism, CAD Models, and Vision Models
2.1 Introduction
2.2 Formal Properties of Geometric Representations
2.3 Representation Schemes
2.3.1 Representations used in CAD Systems
2.3.2 Representations Used in Vision Systems
2.4 CAD Models to Relational Graphs
2.4.1 Node Attributes
2.4.2 Binary Features
2.4.3 Higher-order Relations
2.4.4 Calculating Connectivity
2.4.5 Viewpoint-Dependent Features
2.4.6 Implementation Details
2.4.7 Examples
2.5 Object Model Database
2.6 Summary
3 3D Sensing
3.1 Introduction
3.2 Survey: Range-Finding Techniques
3.2.1 Radar Sensors
3.2.2 Triangulation-Based Sensing
3.2.3 Stereo Techniques
3.2.4 Shape-From-X
3.2.5 Other Approaches
3.3 Technical Arts 100X scanner
3.4 Summary
4 Segmentation and Surface Classification
4.1 Survey
4.1.1 Superquadric Classification
4.1.2 Curvature-Based Segmentation
4.1.3 Edge Detection in Range Imagery
4.2 Initial segmentation: Local feature clustering
4.2.1 Surface Normal Estimation
4.2.2 Sampling and Clustering
4.3 Refined Segmentation: Cleaning and Merging
4.4 Surface Classification
4.4.1 Testing for Planes
4.4.2 Estimating Curvature
4.4.3 Testing and Parameter Estimation for Cylinders
4.4.4 Testing and Parameter Estimation for Spheres
4.5 Examples
4.6 Summary
5 3D Object Recognition and Localization: The BONSAI System
5.1 Introduction
5.2 Survey
5.3 Interpretation Tree Search for 3D Recognition
5.3.1 Notation and Definitions
5.3.2 NULL Associations
5.3.3 Ranking Scene Entities
5.3.4 Pruning and Predicates
5.3.5 Heuristic Search Termination
5.3.6 IT Search Algorithm
5.3.7 Model Database Pruning
5.3.8 Geometric Predicates
5.4 Pose Estimation
5.4.1 Nonsymmetric Model Pose Estimation
5.4.2 Rotationally-Symmetric Objects
5.5 Hypothesis Verification and Extension
5.6 Analysis of Interpretation Tree Search
5.6.1 IT Search as Sequential Hypothesis Testing
5.6.2 Unary Test
5.6.3 Binary Test
5.6.4 Learning Error Probabilities
5.6.5 Search Tree Size
5.6.6 A Simulation Study
5.7 Summary
6 Experiments
6.1 Pose Estimation Accuracy: Synthetic Imagery
6.2 Recognition Accuracy: Real Imagery
6.3 Cluttered Scene Processing
6.4 Discussion
7 Conclusions and Recommendations for Future Research
A Program Input and Output
A.1 Model-builder
A.2 Segmentation Output
Bibliography

List of Tables

2.1 IGES representation and inferred quantities for vision models
2.2 Geometric Primitives and Orientation Attributes
2.3 Relations produced by the geometric inference engine
2.4 Vertices and facets for the polyhedral object
3.1 Range sensor requirements for different application areas. Adapted from [68].
4.1 Segmentation from signs of mean (H) and Gaussian (K) curvatures (from [10]).
5.1 Object Recognition Systems
5.2 Experimental results and empirical bounds for recognition of a curved object
6.1 Nonsymmetric objects: mean and standard deviation of errors in axis orientation, axis rotation, and translation.
6.2 Symmetric objects: mean and standard deviation of errors in axis of symmetry and translation.
6.3 Confusion matrix for recognition accuracy experiments
6.4 Recognition results for ten cluttered scenes.

List of Figures

1.1 Design dichotomies in object recognition systems.
1.2 Block diagram of the CAD-based computer vision system
2.1 A 'nonsense' object.
2.2 A 3D object. Various representations for this part appear in the text.
2.3 A CSG tree for the example object of Figure 2.2.
2.4 Possible ambiguities in boundary representations.
2.5 A portion of the B-rep for the example object.
2.6 Volumetric approximation of the example object.
2.7 A twisted superquadric.
2.8 A sweep representation of the example object.
2.9 Three geons and their qualitative features
2.10 Qualitative changes in the line drawing of a cube.
2.11 Effects of standoff on visibility of a 2D object
2.12 The aspect graph for a cube
2.13 Contour and surface types for three objects with identical line drawings.
2.14 Subdivision of icosahedron faces into 16 subfaces. (a) Icosahedron. (b) Subdivision of an icosahedron face into 16 subfaces. (c) Resultant 320-sided polyhedron. The view directions are chosen to lie at the center of the triangular faces.
2.15 Images of a polyhedral object. (a): edge map. (b): range image
2.16 The CSG model for the polyhedron.
2.17 A portion of the relational graph for the polyhedron
2.18 Hidden-line and synthetic range images of a rotationally-symmetric part.
2.19 A portion of the relational graph generated for the rotationally-symmetric part.
2.20 Synthetic range images of twenty objects.
2.21 Pseudo-intensity images of twenty objects.
3.1 Triangulation-based range sensing
3.2 The range sensing worktable
3.3 Range values for a single laser stripe.
3.4 Calibration of the 100X scanner.
3.5 Range and intensity images from the 100X scanner.
3.6 Crease and jump edges.
3.7 Real range images of twenty database objects.
3.8 Pseudo-intensity images of twenty database objects.
4.1 A functional block diagram of the segmentation and classification procedure. Dashed boxes indicate data structures; solid boxes indicate processing steps.
4.2 Range (a) and pseudo-intensity (b) images of a scene containing two objects.
4.3 Surface normals estimated from the range image of Figure 4.2
4.4 Segment labels from clustering and after merging
4.5 Decision tree for surface classification
4.6 Sensor and local coordinate systems.
4.7 Minimum (a) and maximum (b) curvature directions for the range image of Figure 4.2
4.8 Example of segmentation and classification procedures. (a) Range image. (b) Shaded image. (c) Surface normals. (d) Initial segmentation from clustering. (e) Final segmentation.
5.1 Example of IT search for a 2D object
5.2 Algorithm SEARCH: Constrained search of the interpretation tree.
5.3 Sequential application of geometric predicates
5.4 Translation estimation: aligning cylinder axes.
5.5 The full interpretation tree for s = 3, m = 4, allowing reuse of model entities, without heuristic search termination. Thick branches indicate NULL associations.
5.6 The full interpretation tree (with heuristic termination) for s = 3, m = 4, t = 2, allowing reuse of model entities.
5.7 The restricted interpretation tree (without heuristic termination) for s = 3, m = 4.
5.8 The restricted interpretation tree (with heuristic termination) for s = 3, m = 4, t = 2.
5.9 5th and 95th percentile bounds on search (from simulation). Points are actual experiments with image data.
6.1 A misinterpretation of the TAPEROLL object.
6.2 Recognition results for ADAPTER/CURVBLOCK scene. (a): range image. (b): pseudo-intensity image. (c): segmentation (modified; see text). (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.3 Recognition results for BALL2/PROPANE scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.4 Recognition results for BIGWYE/TAPEROLL scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.5 Recognition results for BLOCK2/HARRISCUP scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.6 Recognition results for BLOCK4/HALFSPH scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.7 Recognition results for CAP/BOX2INCH scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.8 Recognition results for COLUMN1/BLOCK1 scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.9 Recognition results for COLUMN2/BALL scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.10 Recognition results for GRNBLK3/PISTON scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
6.11 Recognition results for HUMP/AGPART2 scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.
A.1 A portion of the IGES file for the polyhedron
A.2 S-expressions generated automatically from the polyhedral model
A.3 LISP description for the scene of Figure 4.8

Chapter 1

Introduction

This thesis describes an object recognition system employing geometric models of 3D objects constructed on a computer-aided design (CAD) system and range (depth) imagery; given these two inputs, it identifies instances of the object models present in the scene and estimates their location and orientation. Such a system can be used successfully in automated manufacturing environments to perform inspection, assembly, sorting, bin-picking, and classification tasks with the aid of a robotic manipulator. In this chapter, we motivate the object recognition and localization problem, the role of CAD models, and the use of range imagery. An outline of the remainder of the thesis concludes the chapter.

1.1 Manufacturing and Computer Vision

Modern manufacturing enterprises are becoming increasingly automated. Computer technology is present in one form or another in almost every manufacturing plant, regardless of the products produced, the quantity produced, or the intended consumer. The primary goals of automation are

- improved product quality,
- reduced manufacturing cost, and
- reduced lead time.

Benefits derived from automation can include increased production rates and improved consistency. The primary requirement for successful automation in manufacturing is the integration of information between design, fabrication, testing, and inspection. Integration allows design integrity to be maintained throughout this sequence. The most important information involved in this integration is geometric (specifically, solid models of the objects to be fabricated).

One of the most successful applications of computers in manufacturing has been the use of CAD software to design mechanical parts. Over the past decade, many draftsmen have been replaced by engineers with desktop workstations; product specifications are converted to mathematical descriptions of the parts to be manufactured, and these descriptions are stored 'on-line', without resorting to paper copies.
In situations where the mechanical parts can undergo stresses, finite-element techniques and simulation studies can validate a design against specifications, often eliminating or reducing the fabrication and destructive testing of prototypes. Finally, when a design is found to meet all manufacturing requirements, the geometric model of the object residing in the computer can be translated into process plans and programs for computer-driven numerical control machines which fabricate the part from stock materials.

The scenario described above is being carried out at many manufacturing facilities. However, the usefulness of geometric models in automated manufacturing certainly does not end there. The last decade has seen much progress in computer vision techniques and their applications in industrial environments. Vision systems are used in the factory for a number of tasks [68]:

- Metrology: verification of dimensioning tolerances;
- Surface Inspection: examination for proper finish and absence of other surface flaws;
- Integrity and Placement Verification: checking for presence and proper positioning of subparts;
- Model Acquisition: construction of an object description from an image or a series of images;
- Navigation: provide information about the environment to a teleoperated or autonomous vehicle;
- Feature Localization: identify and locate a particular feature (such as a hole or corner) on an object, without finding the overall position and orientation of the object;
- Recognition: identification of a part as one of several that might be seen by the sensor; and
- Object localization: estimation of the position and orientation of the part in the sensor's field of view.

In many of these vision applications, geometric information about the manufactured part(s) is useful or even essential. For example, to verify that a given dimension of the part being viewed is within a specified tolerance of the design value, one needs to know the design and tolerance values. Clearly, the geometric model produced by the part designer can provide much, if not all of the object information required by an industrial vision system. Paradoxically, though, the integration of CAD information into industrial vision systems has only recently been attacked in its own right [11]. As computer vision research matures and industrial applications of vision assume more importance in manufacturing, this issue of integration is becoming an area of serious research. Part of the contribution of this thesis is a method of integrating geometric models designed on a mechanical CAD system into a model-based 3D object recognition and localization system.

Of the several industrial vision problems outlined above, the latter two (recognition and object localization) have received much attention from computer vision researchers but they remain difficult and challenging problems. Computer vision research has recognized that two fundamental problems underlie the recognition and localization task: representation of 3D objects, and matching of stored representations to the sensed image of the scene. The representation problem has been studied by researchers in computer graphics, CAD, and vision; computer vision systems can benefit from the prior work done by others. However, the matching problem has only recently been addressed in sufficient generality to be of interest in automated manufacturing.
In the sequel, we will refer to the tasks of recognition and localization jointly as the recognition problem; in Chapter 5, recognition and localization are attacked as a single unified task. The second major contribution of this thesis is the implementation of a 3D object recognition system which takes as its input a range image (depth map) of a scene with one or more objects, possibly in self-occluding or mutually-occluding positions, and produces as its output a listing of (known) objects in the scene along with their locations and orientations.

1.2 Three-Dimensional Sensing

The discussion above has motivated the usefulness of computer vision in the manufacturing environment, and suggested a multifaceted role for object geometry in applications involving sensing of the manufactured parts. However, the actual extraction of surface geometry is a difficult problem. Many prior vision systems have employed intensity sensors such as CCD (charge-coupled device) cameras to obtain images. The resulting images contain information about surface finish, ambient and active lighting, and other extraneous information as well as the desired information about the scene geometry. The scene geometry is present in implicit form, and the difficult task is to extract it explicitly. Computer vision researchers continue to attack the problem of extracting depth information from intensity images using a variety of 'shape-from-X' approaches such as shape from shading [59], shape from texture [121], and shape from occluding contours [115]. (Many of these approaches are supported by biological evidence; human beings do quite well at recovering shape from intensity imagery. Some animals use different sensing strategies to find the 'shape' of their environment, e.g., bats use a radar-like sense.)

Recent years have seen the development of a variety of techniques for sensing depth (range) information directly. A survey of such techniques appears in Chapter 3; here, we broadly classify range sensors into one of four types [7]:

- Radar sensors: time-of-flight, AM, and FM detectors;
- Structured light sensors: triangulation-based techniques;
- Stereo: methods using multiple cameras or light sources; and
- Miscellaneous methods: Moire fringe analysis, holographic interferometry, and depth-from-focus.

Structured light sensors obtain depth information by projecting a line, grid, or other pattern on the scene, which is sensed by an intensity sensor at a known location. Knowledge of the pattern's geometry and the camera's position allows the depth to be derived from the distorted pattern seen by the intensity sensor. In this dissertation, images of 3D parts are obtained from a structured light scanner [113] in our laboratory; images from other sensors (e.g. radar range finders) could also be used.

1.3 3D Object Recognition

A 3D object recognition system employs data from two sources to interpret a scene:

- A model database, containing representations of the objects to be recognized, and
- A scene representation, which summarizes the scene content extracted from the sensed image(s).

The process of identifying a model object which is present in the scene is referred to as matching, interpretation, recognition, or identification. Many systems, including the one developed in this dissertation, also perform localization, the estimation of a rigid rotation and translation required to bring the coordinate system of the identified model into correspondence with the scene coordinate system.
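To make the notion of localization concrete, the following sketch recovers a rigid rotation R and translation t from a handful of corresponding model and scene points using the standard SVD (Kabsch-style) least-squares construction. This is only an illustration of what a pose estimate is; it is not the pose-estimation procedure used by the thesis system (that is developed in Chapter 5), and Python with NumPy is used here for brevity even though the actual system is implemented in C and Common LISP.

```python
import numpy as np

def rigid_pose(model_pts, scene_pts):
    """Least-squares R, t such that R @ model_point + t ~ scene_point,
    given corresponding 3D points (SVD-based construction)."""
    M = np.asarray(model_pts, dtype=float)     # shape (n, 3)
    S = np.asarray(scene_pts, dtype=float)     # shape (n, 3)
    mc, sc = M.mean(axis=0), S.mean(axis=0)    # centroids
    H = (M - mc).T @ (S - sc)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = sc - R @ mc
    return R, t

# Toy check: rotate a hypothetical model by 30 degrees about z and shift it.
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([5.0, -2.0, 1.0])
model = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3.0]])
scene = model @ R_true.T + t_true
R, t = rigid_pose(model, scene)
print(np.allclose(R, R_true), np.allclose(t, t_true))   # True True
```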
We will defer a survey of existing object recognition systems to Chapter 5 and limit our discussion to a taxonomy of recognition techniques. Figure 1.1 illustrates two design dichotomies in object recognition systems. Designers must choose a representation strategy for their object models; the primary distinction deals with dimensionality. Are models represented internally as a series of 2D views (e.g. silhouettes), or are model curves and surfaces expressed geometrically in a common 3D coordinate system? Additionally, the nature of the sensed image affects the recognition strategy. The two major types of imagery in use today are intensity (or 2D) imagery, and range (or 3D) imagery. (Not all researchers would agree with this classification. Some point to work in shape-from-shading as an argument that intensity imagery is 'more than 2D'; some view range imagery as 2.5D because data from the 'underside' of the object is not obtained, and reserve the 3D label for stacks of density maps, e.g. MRI images, from which data on all surfaces of an object can be obtained.) The choice of object representation strategy and image type results in the following four types of object recognition systems:

- 2D-2D (2D image, 2D model) matching strategies are often practical in constrained recognition environments.
- 2D-3D (2D image, 3D model) strategies have been popular topics of research because they combine the descriptive power of 3D models with inexpensive and fast 2D sensing strategies.
- 3D-2D (3D image, 2D model) techniques have not been studied in any depth. Matching a 3D image against a 2D model is wasteful of the image data. For example, we would be discarding almost all of the range data in a system which matches 2D silhouette models to 3D images.
- 3D-3D (3D image, 3D model) methods have received much attention recently; our system falls in this category.

Figure 1.1: Design dichotomies in object recognition systems.

An additional point of departure in this area is the source of object models. One contribution of our work is the use of CAD models in the recognition system; many prior systems use hand-designed models, incomplete or coarse models, or models learned from multiple images.

3D representations are studied in detail in Chapter 2. We should note, however, that the issue of power in object representation has historically been a barrier to the use of 3D vision systems in industrial applications. Most of the early systems employing 3D models were limited to either polyhedral or generalized cylinder representations for objects. Neither approach captures enough variation in 3D geometry to adequately and unambiguously represent a large class of manufactured parts. A versatile vision system using 3D models should allow objects constructed using constructive solid geometry (primitive shapes and Boolean combinations), sweeps, profiling, and extrusion methods to be represented exactly. Our system is an improvement over polyhedral representations, but does not attempt to represent objects with complex sculptured surfaces.

1.4 Synthesis: CAD-Based 3D Vision

In previous sections we have identified problems in the manufacturing domain that can be addressed using computer vision techniques. We have also highlighted the central role of object geometry in these applications. Range sensors can provide a computer with a dense sampling of 3D points on an object's surface.
This thesis describes a vision system which recognizes and locates three-dimensional objects in an image of a scene obtained by a range sensor. Models of the objects to be recognized are constructed using a commercial CAD system (GEOMOD [110]), and stored in a database. Each object is described in the database as a list of surfaces (planes, cylinders, and spheres) and bounding curves (lines, circular arcs, and parametric curves). Higher-level view-dependent and view-independent geometric features are inferred from the geometric model produced by the CAD system, producing a vision model used during the matching stage. A block diagram of the system appears in Figure 1.2.

Figure 1.2: Block diagram of the CAD-based computer vision system.

One important contribution of this thesis is its approach to creation of the object models against which the scene description is matched. Rather than creating models by hand, or building models by merging information from several views of the object, our models are constructed automatically from a CAD database. A number of geometric inferencing steps are performed on the CAD descriptions in order to extract object features that are useful in matching. The end result of the inferencing system is a relational graph model that contains a large amount of explicit geometric information about the objects. This information is necessary in building efficient object recognition strategies. (Automatic construction of vision models from CAD models is useful in environments where new objects are being designed. In some environments, however, one might wish to create vision models for objects which already exist. If so, the construction of models from multiple views [91, 116] might be preferable to the effort of measuring the object, entering the measurements into the CAD system, and building a vision model using our approach. In a sense, model-building from multiple views and model-building from CAD models are complementary operations, and each has its place in 3D vision systems for industrial object recognition.)

The scene information to be processed is in the form of a range image, which contains explicit measurements of distance between the sensor and objects in the scene. This image is segmented to extract homogeneous (smooth) surface regions. From regions, geometric features (principal surface curvatures and directions), which characterize the 'local' shape of the surface at each point, are calculated. These features are used to classify surfaces (e.g. planar, spherical) and estimate their parameters (e.g. the radius and orientation of a cylindrical patch). The extraction of smooth surfaces and their parameters from the image is done in the segmentation and classification module in Figure 1.2. The output of the segmentation step is a list of classified surfaces in the image.

The recognition module (named BONSAI) takes this information and attempts to match it against objects in the model database. The strategy used for recognition is the interpretation-tree search paradigm. BONSAI makes tentative hypotheses of identity for scene surfaces and model surfaces, attempts to extend these hypotheses to include additional scene and model surfaces, and rejects those hypotheses which are inconsistent with respect to the model under consideration. A number of constraints are used to reject hypotheses so that the search tree can be pruned.
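The following sketch shows the general shape of such a constrained interpretation-tree search. It is not BONSAI itself (the actual algorithm, its NULL associations, and its geometric predicates are developed in Chapter 5); the predicate functions `unary_ok` and `binary_ok` are placeholders standing in for the consistency tests described there, and Python is used only for illustration.

```python
# Minimal interpretation-tree search: each scene surface is paired with a model
# surface or with NULL (a wildcard), and a branch is pruned as soon as the newest
# pairing is inconsistent with the earlier ones.
NULL = None

def consistent(assignment, scene, model, unary_ok, binary_ok):
    """Check the newest (scene, model) pairing against all earlier pairings."""
    s_new, m_new = assignment[-1]
    if m_new is NULL:
        return True                       # wildcard pairings are never pruned
    if not unary_ok(scene[s_new], model[m_new]):
        return False                      # e.g. surface types or areas disagree
    for s_old, m_old in assignment[:-1]:
        if m_old is not NULL and not binary_ok(scene[s_new], scene[s_old],
                                               model[m_new], model[m_old]):
            return False                  # e.g. relative orientations disagree
    return True

def interpret(scene, model, unary_ok, binary_ok, assignment=()):
    """Yield every consistent interpretation as a list of (scene index, model index)."""
    k = len(assignment)
    if k == len(scene):
        yield list(assignment)
        return
    for m in list(range(len(model))) + [NULL]:   # try each model surface, then NULL
        extended = assignment + ((k, m),)
        if consistent(extended, scene, model, unary_ok, binary_ok):
            yield from interpret(scene, model, unary_ok, binary_ok, extended)
```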
The second major contribution of this work is a detailed examination of the constraints on interpretation provided by object models and scene geometry. These constraints allow inconsistent hypotheses of object identity and pose to be rejected early in the recognition process, allowing a larger model database to be examined efficiently.

1.5 Outline of the Dissertation

The remainder of this dissertation is organized as follows. In Chapter 2, we survey three-dimensional object representation, describe the integration of geometric models constructed on a CAD system into our vision system (with particular emphasis on geometric inference), and describe our model database. Chapter 3 outlines the principles of 3D range sensing, describes the Technical Arts 100X sensor used to obtain our range images, and contains example images of the objects used in testing of the BONSAI system. Chapter 4 describes the segmentation and classification procedures used to obtain a symbolic scene description from an input depth map. Chapter 5 describes our search-based recognition system. Experiments using a variety of object models and sensed scenes appear in Chapter 6. Finally, Chapter 7 summarizes our results and recommends future directions for research in 3D object recognition.

Chapter 2

Representation: Formalism, CAD Models, and Vision Models

2.1 Introduction

In Chapter 1, we motivated the commonality of geometric knowledge between design, fabrication, and visual tasks such as 3D object recognition and localization. In essence, we wish to have one logical (not necessarily physical) geometric knowledge base for the entire manufacturing cycle, with representational modifications as necessary, driven by the requirements of the processes involved. While a complete discussion of such representations and modifications is beyond the scope of this dissertation, it is important to discuss in some detail popular geometric representations and describe those used in this thesis.

In this chapter, an outline of the mathematical properties of solid models is followed by descriptions of the popular methods for representing three-dimensional objects in commercial CAD systems. Then, we focus on the geometric modeling system (GEOMOD) [110] used in this research to create CAD models. A survey of representations used in computer vision research follows. The remainder of the chapter describes our CAD-to-vision model inferencing system, including examples of the model construction process for a few typical 3D objects used in this thesis.

2.2 Formal Properties of Geometric Representations

In this section, we discuss the mathematical formalism of 3D solid representations. Excellent surveys of solid modeling appear in Requicha [95], Requicha and Voelcker [93, 94], and Brown [23].

Figure 2.1: A 'nonsense' object.

A representation method [95] M is a relation involving sets containing solid objects
A member m.- 6 M is an ordered pair (o.-,s,-), where o.- 6 0 is an object (or a collection of objects) drawn from a set of objects and s: 6 S is a representation (a model) of o.- drawn from the set of syntactically-correct representations (Requicha observes that S is a language generated by a grammar using a finite alphabet). Let D C 0 and R C S be the domain and range of M. Elements 0 6 (0 - D) are not representable by M, and elements 3 6 (S — R) are syntactically-correct representations with no corresponding physical object (they are often called invalid representations). Once the concept of representation has been expressed mathematically, we can ask some standard questions about the relation M: o How large is the set D? This allows us to judge the power of the representation scheme. A representation M1 admitting only polyhedra is less powerful than a scheme M 2 admitting natural quadrics, because the domain of M1 is a proper subset of the domain of M2. 0 How large is R? If R = S, the representation scheme does not admit ‘nonsense’ objects (e.g. an object with a ‘dangling’ surface; see Figure 2.1). While it is difficult to imagine a human designer intentionally creating such ‘nonsense’ objects, a procedure for automatically detecting and flagging such models would be useful in situations where they are generated by a computer. 12 / Figure 2.2: A 3D object. Various representations for this part appear in the text. 0 Are representations complete (unambiguous)? A representation s is unambigu- ous if (51,3) 6 M A (02,3) 6 M => 01 = oz. The method M is unambiguous (formally, left unique) if (01,8) 6 M A (02,s) E M => 01 = 02 Vane; E DV3 6 R. If M is ambiguous, representations do not contain enough information to re- construct the 3D object. In 3D vision, complete models are desirable for a pragmatic reason: they can allow synthetic images (intensity and range) to be produced from the model. 0 Are representations unique? Unique (formally, right unique) schemes are func- tions from 0 to 5; they satisfy (0,81) 6 M A (0,32) 6 M => 31 = s2 V06 D V81,82 6 R. If not, more than one representation in R corresponds to a single physical object. In commercial CAD systems, this situation can easily arise, since different design methodologies can produce the same object. The ‘ideal’ 3D representation method produces only valid representations, is unique and unambiguous (and hence one-to-one), and also has a rich domain of representable Objects. Most 3D representation methods fail one of these criteria; this partially explains the variety of approaches used by researchers to represent 3D solids. To some degree, the particular representation strategy used in model-building can be ‘hidden’ from the end user of the modeling software. Most commercial solid 13 modelers allow the user to build objects using constructive solid geometry (CSG) techniques (primitives and Boolean operations; see below). However, the internal representation of the solid so constructed need not be a CSG representation. Indeed, some solid modelers maintain multiple representations of objects under construction, - in order to offer the user flexibility of design, speed of display, and representational power. For example, the GEOMOD software used in this research contains three distinct representations of each object. o A boundary representation (B-rep), where curves and surfaces are nonuniform rational B-splines. e A ‘CSG-like’ tree describing the design history in terms of primitives and oper- ations. 
e A polyhedral approximation to the object’s boundary (which is useful for dis- playing the object during the design cycle). Hoffmann briefly discusses the issues of abstraction in solid modeling, and the use- fulness of multiple representations [58, pp. 614]. Of course, the use of multiple internal representations introduces a coherency problem. How does the modeler en- sure that all of the internal representations describe identical objects? Discussions of this problem are beyond the scope of this dissertation. 2.3 Representation Schemes This section describes the dominant 3D object representation methods in use today: CSG, B-rep, volumetric techniques, sweep representations, and view-centered ap- proaches. Most of these representation schemes has been used by vision researchers for the representation of 3D shapes. Besl and Jain [9] have examined 3D modeling from the vision perspective. Sarnet [98] gives detailed explorations of data structures suitable for internal representations of 2D and 3D solids. The model-building system developed in this thesis combines the representational power of B-reps with view- centered information to,produce a relational object model; to our knowledge, such a combination of approaches is unique. For reasons of clarity, a common mathematical notation for models produced by different representation schemes is adopted. All representations will be denoted by 14 ’R with a subscript specifying the type; e.g. Rosa will represent a CSG model of a given object. In addition, we will give an example of most representation schemes for the object shown in Figure 2.2. 2.3.1 Representations used in CAD Systems Here, we discuss representation methods used in many CAD systems. Computer vision research employing these representations is noted in the descriptions. Wire-frames A wire-frame object representation is a graph RWF = {VWFr £WF}! where the vertices are 3D points on the object surface (typically corners or other distinguished points), and the graph edges represent a physical edge on the object. These representations do not contain surface information and are therefore not com- plete, since the ‘solid’ portion of the wire frame is not defined in the graph structure. Few 3D vision systems have employed wire-frame representations. Constructive Solid Geometry (CSG) In CSG-based modeling, a finite set of primitive shapes (often simple shapes such as right circular cylinders, spheres, and boxes, but occasionally more general shapes or halfspaces) are combined using regularized Boolean operations to produce the desired solid. A CSG model is stored as a tree Rose = {Vcsofcsa}, with vertex set Vase and edge set 8050. Edges of the tree enforce precedence among the Boolean operations used to construct the object. If the vertex set is partitioned into subsets containing leaf and interior vertices, VCSG = vleo] U vinteriora then V1,. [’8 members are ordered pairs of the form um, = (type,parameters) 15 where type denotes the type of primitive (i.e. cylinder, cone, box), and parameters gives both the intrinsic parameters of the primitive (e.g. the cylinder’s radiusand length) and the location and orientation of the primitive in 3D. Each member of 11mm," is one of the (regularized) Boolean operations UNION (Us), INTERSEC- TION (he), and DIFFERENCE (—:tr).1 A CSG tree for the object in Figure 2.2 appears in Figure 2.3. The intrinsic and location parameters of the primitives at the leaves are not shown. 
Lin and Chen [77] developed a 3D vision system which extracts CSG object representations from range data and matches them against a CSG model database. Surface Models (B-reps) A more powerful alternative to the CSG representation is the boundary representa- tion. A B-rep can be viewed (conceptually) as a triple RBrep = {SBrepi IBrepa QBrep}a with San, the set of surfaces of the object, 13",, a set of space curves describing the intersections between the surfaces in 53",, and gBrep = {vBrepa EBrep}: a graph describing the surface connectivity. Vertices in VB", represent the surfaces in Sam», and edges express connectivity. Care must be taken to resolve any possible ambiguities in representation due to the sense (or polarity) of neighboring surfaces. Consider the two objects in Figure 2.4. The only difference between these two objects is the sense of the top cylindrical patch (convex on the left, and concave on the right). This difference could be expressed in the connectivity graphs 93",, (although the difference in sense might also be expressed via different pararnetrizations of the cylinders or the cylinder-plane creases). Figure 2.5 shows a portion of the B-rep model for the object in Figure 2.2 (the entire representation would be difficult to draw in a legible way on paper). Note the presence of individual surfaces and connectivity information (the small arcs). Surfaces (i.e., elements of 53",) can be represented in a variety of ways [8]. Three of the most popular approaches are lA regularized Boolean operation will produce a resultant object without dangling 2D or 1D artifacts. Regularized operations are conventionally denoted with a star following the logical operand (64- Ur)- g? c @ -*: difference U*: union (3 & Figure 2.3: A CSG tree for the example object of Figure 2.2. 17 Convex c linder Concave cylinder 't Figure 2.4: Possible ambiguities in boundary representations. Figure 2.5: A portion of the B-rep for the example object. 18 e the implicit surface form 5 = {(1.31%) = flint/,2) = 0}, e the general parametric form 3 = {(3,31, 2) 2 1: = f1(u,v),y = f2(uiv)sz = f3(urv)a ("a”) 6 D C R2}’a’nd e the graph surface form 3 = {(x.y.z) : z = f(r.y).(x,y) e D c R’}, Clearly, the graph surface is just a special case of the general parametric surface; we highlight it because range data is usually viewed as samples from a graph surface (the range value 2 is a function of a point in the my plane). The problem of fitting surfaces to range data can be tackled using regression techniques [39] when the graph surface representation is used. Polyhedral models (in which the elements of 5 are planes and elements of I are lines) are the simplest boundary representations. Polyhedra are also very compact descriptions, requiring four scalars per surface. The primary drawback of polyhe- dral models is their representational power; objects with curved surfaces cannot be represented exactly, and polyhedral approximations to curved objects are not unique. The next ‘simplest’ surface representation is the quadric surface. Implicit quadrics, defined by S = {(r, y, z) : 011:2 + aw2 + aaz2 + airy + 051:2 + aeyz + 071' + 033/ + 092 + aio = 0}, can represent many of the surface types encountered in manufacturing, such as spheres, cylinders, planes, and cones. If the implicit quadric is represented in gen— eral form, ten parameters are required per surface patch. However, Goldman [44] has noted that if the quadric type and its geometric parameters are stored instead, as few as five parameters are required. 
This ‘geometric’ representation for quadric primitives enjoys other advantages over the implicit equation when the problem of fitting quadrics to sensed data is considered. If one knows the surface type a priori, one can employ nonlinear optimization to get the best approximating quadric of that type, whereas traditional least-squares techniques pay no attention to surface type in fitting the surface. In [39] we explored the nonlinear fitting method as part of a 19 surface classification strategy; one of the techniques developed there will be used in Chapter 4 for surface classification. Most B-rep CAD systems use the general parametric form to represent surfaces; in particular, GEOMOD uses nonuniform rational B-spline surfaces of the form 5 = {(r,y,z) : :1: = f1(u,v),y = f2(u,v),z = f3(u,v),where 2?: 27-1-1 Bik(")Bjt(v)hijPij [fl f2 f3] = r1: 1 1;! B:r(u)B,1(v)h.-,° ° i=1 (2.1) Here, n and m are the number of 3D control points R3 = [pa-,3 pyg- ng'j] in the u and v parameter directions (for a total of nm points {P.fl), I: and l are the orders of the B-spline curves along the u and v directions, and the B;k(u) and Bj;(v) are the values of the ith B-spline basis polynomial of degree 1: at u (respectively, the jth degree-l basis polynomial at v), and the h.)- are control point weights associated with the P5,, specifying their relative contribution to the surface shape. B-rep surface models have been slow to gain acceptance among vision researchers; two reasons for this can be cited. First, few of the true three-dimensional model-based recognition systems have required the functionality that can be expressed through the CAD methods. For example, recognition systems which work exclusively with polyhedral objects have no need to employ curved surface patches and bounding curves. Secondly, the popular surface representations in CAD (tensor-product splines) often convey little intuition about the object surface being modeled. In this respect natural surfaces like quadrics and planes have greater utility as modeling primitives. The difficulties of fitting more complicated surfaces to image data, and employing the power of those descriptions for higher-level tasks have also hindered their use by vision researchers. Volumetric Models One way to represent an object is to place it in some reference coordinate system and subdivide the volume it occupies into small ‘primitive’ volume elements (often called vozels). Most volumetric descriptions employ box-shaped primitives (cubes or parallelepipeds) as vaels. If only one size of box is used such descriptions can occupy a large amount of space, since a large number of boxes might be required to approximate a large object with a complicated boundary. This volumetric model can 20 Figure 2.6: Volumetric approximation of the example object. be represented internally as 7aver-cl = {Uzi Iva lz)sp19 p2, - - - PnLWhel‘e l,, I”, and l, are the dimensions of the primitive box, and each p,- fixes the location of one of the boxes. Figure 2.6 shows a voxel representation of the example object. A more accurate representation could be obtained (at the cost of storage and processing time) by shrinking the box primitives. Some compression in size can be achieved by encoding the occupied volume in octree format [98]. Other representations based on volumetric primitives (such as spheres) have been proposed [88]. A fairly recent addition to the set of modeling primitives are the family of su- perquadric (or superellipsoidal) solids. 
The canonical superquadric’s shape is con- trolled by two parameters 61 and 62; the expression for a point on a canonical su- perquadric at latitude n and longitude w is given by (:r, y, z) = (cos‘1 n - cos" w,cos“ n - sin" w,sin" 17), and the entire closed solid is obtained by varying 17 through the domain [0,1r], and to through the domain [0, 2r]. These volumetric primitives can then be deformed by bending, twisting, and tapering [4] to create a wide variety of curved primitives. Boolean combinations can be used to represent arbitrarily complicated shapes. Most ‘interesting’ 3D objects would be composed of a number of superquadric primitives placed in a common coordinate system. Figure 2.7 shows a shaded image of a double- pyramid superquadric (e; = 62 = 2) which has been twisted about one axis. To our knowledge there is no commercial CAD system employing superquadric primitives 1C. ted superquadr' lS : A tw . lgurc _.4 F 22 internally. Since most interest in superquadrics has come from vision researchers, this is not surprising [103]. Most existing work with superquadric primitives has dealt with fitting problems rather than the development of recognition systems employing superquadrics as an object representation scheme. Feature-based Models The solid modeling approaches discussed above describe objects in terms of primitives (primitive surfaces, curves, or volumes). While such descriptions capture the shape of an object, they do not explicitly describe design features such as holes, grooves, or notches. Recent research in CAD has led to the development of feature-based model- ers [19, 5], which allow engineers to design objects at the feature level, avoiding the use of geometric primitives where possible. In addition, these modelers can maintain the semantics associated with features, such as e a typical set of machining instructions to fabricate the feature (drilling, reaming, etc.), o the local relationships among feature components (which part of the hole is hollow?), and 0 constraints on manufacturability in terms of feature parameters (for example, we may only wish to allow ‘hole’ features with certain radius values, because our selection of drill sizes is limited). Feature-based models will make downstream processing of mechanical designs (such as milling and assemblability analyses) less time-consuming; the inferences performed for those tasks are less complicated, as the reasoning starts at the feature level, rather than the local geometry level. Feature-based models often employ B-reps as an internal representation. The eXplicit organization of local geometry into design features may allow vision systems to match on the features themselves, rather than their component entities. Park and Mitchell [89]) developed a visual inspection system employing feature-based CAD models. 23 GEOMOD Solid Modeler In this section we describe GEOMOD [110], the solid modeler used in this disserta- tion to create 3-D models for our vision system. We also discuss the Initial Graphics Exchange Specification (IGES), the industry—standard format for interchange of ge- ometric data. Comprehensive surveys of the state of the art in mechanical CAD are given in [93, 94, 95]. GEOMOD is a versatile solid modeling system using boundary representations. 1. Geometric primitives can be combined in a CSG-like fashion, 2. A 2—D profile can be extruded along a space curve or revolved around an axis, 3. A series of cross-sections can be assembled and a surface grown (‘skinned’) between them, 4. 
3D points may be specified, linked into facets, and curved surfaces fit to the facets. GEOMOD also offers deformation capabilities, including operations such as bend, blend, stretch, and tweak, to modify existing object geometry and produce different objects. If the CSG-like assembly route is taken, the following primitives are available to the user: blocks, cones, cylinders, spheres, tubes, and quadrilaterals. The normal Boolean operations are available for aggregation of these primitives. The interface between CSG operations and sweeping, skinning, and faceting operations is seamless, because all surfaces are represented internally as nonuniform rational B-spline surfaces (NURBS) [114]. The NURBS surface form was given in Equation (2.1). N URBS are quite powerful in terms of their representational capability. Unlike nonrational B-splines and most other parametric surfaces, they can represent quadric surfaces exactly. The complexity of obtaining intersection curves between these sur- faces is simplified because of the uniform representation. These intersection curves are not necessarily exact, however; for example, the intersection between two quadrics is a fourth-degree algebraic curve [76], and may not admit an exact parametric represen- tation. However, an approximating parametric spline curve of any desired accuracy can be easily obtained. 24 CAD Output: The IGES standard The CAD user community has largely agreed on a data interchange format for CAD systems, allowing designs created on one CAD system to be used by others. This format, known as the IGES standard [85], defines standard design entities for 2D drawings and 3D objects, as well as electrical diagrams, finite-element definitions, and other manufacturing quantities. We concen- trate our attention on that portion of the IGES standard devoted to the description of 3D geometry. Within IGES, 3D solids are specified as lists of the following geometric primitives: 0 Lines, circular arcs, conic sections, parametric cubic spline curves, and NURBS curves. 0 Planes, ruled surfaces, tabulated cylinders, surfaces of revolution, bicubic para- metric spline surfaces, and N URBS surfaces. Several of these primitives will be described in detail in Section 2.4.1. Many CAD software products advertising ‘IGES compatibility’ actually imple- ment only a subset of the IGES specification. For example, GEOMOD will write IGES files, but it only describes objects in terms of a subset of the available primi- tives. An IGES file can also be read by GEOMOD, but some primitive types are not accepted by the system. Alternative Representations in IGES; Invisible Primitives The IGES format can often specify objects in two different ways. Lines, circular arcs, parametric cubic curves, planes, natural quadrics, and surfaces of revolution can be expressed either as special cases (called the analytic form by GEOMOD), or as N URBS (the GEOMOD rational form). A NURBS description is more powerful (i.e. not all objects composed of piecewise-NURBS surfaces bounded by NURBS curves can be represented exactly using only planes, natural quadrics, and surfaces of revolution), but intuitive geomet- ric information about the object (e. g. symmetry axes, radii, and surface type) is not eXplicit in the NURBS forms. For example, consider a cone C, of height 1, with its axis coincident with the y-axis, its vertex located at (0,1,0), and a radius of 1.0 at the base (the rz-plane). How can this curved surface be represented within the IGES specification? 
One representation for this surface is the NURBS form which, in the context of Equation (2.1), has

- $n = 8$, $m = 2$,
- eighteen control points $P_{ij}$,
- eighteen control point weights $h_{ij}$,
- twelve knot values in one parameter, and
- four knot values in the other parameter.

Given this information, how do we decide that the parametric surface is a right circular cone? Once we have decided that it is conical, how do we obtain the geometric parameters (namely, the coordinates of the vertex, the vertex angle, and the orientation of the axis)? In addition to the complicated NURBS description, we can also describe the cone as a surface of revolution: a linear generatrix

$g = \overline{p_1 p_2}$, $p_1 = (1, 0, 0)$, $p_2 = (0, 1, 0)$,

revolved about a linear axis

$A = \overline{p_3 p_4}$, $p_3 = (0, 0, 0)$, $p_4 = (0, 1, 0)$.

Note that the axis of the cone is not uniquely determined; either endpoint (or both) could be 'extended' to lengthen the axis without changing the revolved surface. This representation of the cone surface is called the analytic representation by GEOMOD. In the analytic form of this cone, $g$ and $A$ are geometric primitives but are not visible edges on the object; they are referred to as invisible entities in the IGES specification. Despite the fact that they cannot be sensed (and therefore should not be matched to scene entities during the object recognition process), these invisible entities convey valuable information about surface shape (i.e. the axis of symmetry, the vertex location and vertex angle, etc.).

Analytic IGES: Motivation

In constructing our object representation, we start with the analytic IGES output from the CAD software. Lines, circular arcs, planes, and surfaces of revolution are described explicitly in terms of their geometric parameters (see Section 2.4.1). This representation allows us to easily extract important attributes and relations from the CAD model. More general surface types (i.e. NURBS surfaces) are not handled by our current modeling system.

Our use of the analytic IGES form rather than the rational form (in which all primitives are expressed as NURBS curves or surfaces) is motivated primarily by some computational considerations. It is possible (in principle, at least) to recover the surface type and geometric parameters from the NURBS description. Let us examine the recovery process associated with the case of a NURBS description of a quadric surface. The implicit equation of a quadric surface is

$f_{quadric}(x, y, z) = a_1 x^2 + a_2 y^2 + a_3 z^2 + a_4 xy + a_5 xz + a_6 yz + a_7 x + a_8 y + a_9 z + a_{10} = 0.$

Given a set of points $\{(x_i, y_i, z_i), i = 1, \ldots, n\}$ (which can be obtained by sampling the parametric NURBS form), a quadric can be fit (if $n = 10$) by solving a system of ten equations in ten unknowns. However, the finite precision of the floating-point numbers can adversely affect the quality of a solution to the system. The traditional method for overcoming this difficulty is to perform regression: insist $n > 10$ and obtain the estimate $\hat{f}_{quadric}$ which minimizes

$\epsilon^2 = \sum_{i=1}^{n} \hat{f}_{quadric}^2(x_i, y_i, z_i).$

Invariant-based methods for classifying quadrics from the set of coefficients obtained via regression have been published [49], but planar, cylindrical, and conical surfaces require that an invariant of the parameter vector be tested against zero. When the sensor data are contaminated by noise, these tests will always fail unless some thresholds are established. The behavior of these invariants with respect to noise has not been studied, making the choice of such a threshold an open problem.
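The least-squares quadric fit described above can be sketched concretely. The following C fragment is a minimal illustration only: to avoid the trivial all-zero solution it fixes the constant term $a_{10} = -1$ and solves the resulting 9 x 9 normal equations with a naive Gaussian elimination; the constraint choice and the solver are our own assumptions, not necessarily those used in the dissertation.

```c
/* Sketch: fit an implicit quadric a1*x^2 + ... + a9*z + a10 = 0 to n >= 10
 * sampled surface points by least squares.  We fix a10 = -1 (an assumption
 * made for illustration) and solve the 9x9 normal equations directly. */
#include <math.h>

void quadric_fit(const double *x, const double *y, const double *z,
                 int n, double a[10])
{
    double AtA[9][9] = {{0}}, Atb[9] = {0};
    for (int k = 0; k < n; k++) {
        /* design row: monomials paired with coefficients a1..a9 */
        double r[9] = { x[k]*x[k], y[k]*y[k], z[k]*z[k],
                        x[k]*y[k], x[k]*z[k], y[k]*z[k],
                        x[k],      y[k],      z[k] };
        for (int i = 0; i < 9; i++) {
            Atb[i] += r[i];                      /* right-hand side (target 1) */
            for (int j = 0; j < 9; j++)
                AtA[i][j] += r[i] * r[j];
        }
    }
    /* Gaussian elimination with partial pivoting */
    for (int c = 0; c < 9; c++) {
        int p = c;
        for (int r2 = c + 1; r2 < 9; r2++)
            if (fabs(AtA[r2][c]) > fabs(AtA[p][c])) p = r2;
        for (int j = 0; j < 9; j++) {
            double tmp = AtA[c][j]; AtA[c][j] = AtA[p][j]; AtA[p][j] = tmp;
        }
        double tmp = Atb[c]; Atb[c] = Atb[p]; Atb[p] = tmp;
        for (int r2 = c + 1; r2 < 9; r2++) {
            double f = AtA[r2][c] / AtA[c][c];
            for (int j = c; j < 9; j++) AtA[r2][j] -= f * AtA[c][j];
            Atb[r2] -= f * Atb[c];
        }
    }
    for (int c = 8; c >= 0; c--) {               /* back substitution */
        a[c] = Atb[c];
        for (int j = c + 1; j < 9; j++) a[c] -= AtA[c][j] * a[j];
        a[c] /= AtA[c][c];
    }
    a[9] = -1.0;                                 /* the fixed constant term */
}
```

The classification step that follows such a fit (testing invariants of the coefficient vector against zero) is exactly where the thresholding difficulty mentioned above arises.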
IGES as a representation We wish to stress the following point: IGES was intended to be used as a data interchange standard, not an object representation method. In our work, we are treating the IGES information as a description of the 3D object, and limit our attention to a subset of the IGES 3D curves and surfaces. One of the shortcomings in the IGES standard deals with the completeness of representations for solid objects [120]. The mathematical notion of completeness is given in Section 2.2; briefly, a representation is complete if it corresponds to only one physical object. While the latest revision of the IGES standard (version 4.0) allows 3D objects to be specified as CSG trees (which are complete) [85], the descrip- tions employing curve and surface primitives alone are not complete. The primary shortcoming is the lack of explicit connectivity information. In most solid modelers, 27 connectivity [25] is expressed in a graph on the set of surfaces, edges, and vertices. Since we are limited in our work to the ‘partial’ B-rep produced by IGES, the lack of connectivity is a concern. 3D object recognition strategies make extensive use of surface and edge connectivity during the matching process. Once a range or intensity image has been segmented into regions, we know that adjacent image regions shar- ing a crease edge are connected. Many object recognition systems in the literature perform matching by establishing a correspondence between a ‘kernel’ image feature (such as a long dihedral edge or a large region) and a model feature, and then ‘grow’ the match to adjacent model and image features to which the kernel is connected. If connectivity is not explicitly represented in the model, it must be re-inferred when the vision models are being constructed. While we have made some attempts in this direction [40], our need for vieWpoint-dependent information provided an opportunity to extract topological information reliably using additional information provided by the solid modeler. This process is described later, in Section 2.4.5. 2.3.2 Representations Used in Vision Systems The following representations have been used primarily by researchers in computer vision. Sweep Representations One popular method for shape representation which is suitable for objects whose volumes are swept by generating curves with respect to one or more axes. We will concentrate on objects which are representable as a single such swept volume, and stipulate that several such volumes are required to represent ‘interesting’ objects such as industrial parts. In the vision community, sweep representations have been well- known as Generalized Cylinder (GC) representations [1, 86, 22]”. A single swept volume is characterized by e an axis (which may be straight or curved); and e a cross-sectional area which sweeps out the volume as it travels along the axis. 2Many CAD systems offer sweeping as a mechanical design option, but few systems employ it as an internal representation. 28 2D profile Extrusion Direction Figure 2.8: A sweep representation of the example object. The cross-section need not contain the axis, and it may undergo translation, rotation, scaling, or all three as it is being swept. Most researchers, however, assume that the cross-section is planar and that its plane is perpendicular to the axis curve at all times. Figure 2.8 shows that the example object can be represented as a linear extrusion of a 2D profile. Here, the cross-section is indeed planar, and the axis is straight. 
It is interesting to note that some sweep operations correspond closely to industrial fabrication operations. The example above showed linear extrusion. If the cross-section was also rotating, we would obtain a twisted object (e.g. a gear). If the shape is constant but the scale changes monotonically, the result is a constant-taper object. An object with n swept subparts might be represented as follows: R_sweep = {S_1, S_2, ..., S_n}, with each of the S_i being a pair S_i = {a_i, A_i}, where a_i is an expression (parametric or otherwise) for the axis, and A_i is the area as a function of position on the axis.

Geons: Qualitative Representations

Most solid representations used in CAD or vision systems are quantitative in nature; primitive surfaces (or volumes, etc.) are specified in terms of numerical parameters. Such detailed descriptions are necessary in fabrication, simulation, and some vision tasks (such as metrology). However, if we are performing a visual recognition task and the objects to be recognized can be distinguished by examining some qualitative features of the segmented primitives, representations which capture only those variations might offer advantages in processing.3 The idea of qualitative representation was proposed by Biederman [12], who developed a catalog of 36 geons (geometric ions), with each member of the catalog having a unique set of four qualitative features:

- Edge: straight or curved;
- Symmetry: rotational/reflective, reflective, asymmetric;
- Size Variation: constant, expanding, expanding/contracting;
- Axis: straight or curved.

Figure 2.9 shows three geons with their qualitative features. Most interesting 3D objects would be composed of a number of connected geons. Biederman proposes an edge-based procedure for segmenting images into their geon components, using nonaccidental alignments and concavities to locate the places where geons join. Over 154 million qualitatively different objects can be constructed from three geons and inter-geon relationships, considering all possible relative sizes, arrangements, and affixments of the three components [12]. Therefore, the geon models are a very versatile coarse description of 3D objects. Bergevin and Levine [6] have developed a model-building system which attempts to form geon models from 2D line drawings.

While the geon representation has intuitive appeal, the lack of quantitative information limits its usefulness in environments where discrimination between qualitatively similar, but quantitatively different objects is performed. If an assembly line is making a 'family' of parts which differ only in scale, they would have identical geon representations, making discrimination between the different sized items impossible without additional (quantitative) information.

3However, a qualitative representation cannot (in general) be used to synthesize an image of the object.

Figure 2.9: Three geons and their qualitative features (one with straight edges, rotational and reflective symmetry, constant size, and a straight axis; one with curved edges, rotational and reflective symmetry, constant size, and a straight axis; and one with straight edges, reflective symmetry, expanding size, and a straight axis).

Multi-View Representations and Aspect Graphs

One method for describing a 3D object is to describe some or all of its possible 2D projections. These projections are often known as 2D or view-centered models. In some applications, the number of stable positions the object can assume on a supporting surface is small.
Suitable features calculated from those views can be used in recognition via statistical, syntactic, or graph-theoretic approaches [31]. In most of these systems, separate views of the same object are stored separately, as if they were actually different objects. Dudani et al. [34] developed one of the earliest recognition systems employing multiple views. The model set consisted of 132 images of six different aircraft. Each view was represented by fourteen numerical features, which were invariant moments calculated from the boundary and silhouette of the object in the given view. The matching module considered each database view independently of the others. Other representation methods attempt to combine all these vieWpoint-specific models into a single data structure. There has been a recent burst of activity among several computer vision researchers seeking algorithms which compute the aspect graph of both polyhedral and nonpolyhedral objects [35, 104]. The aspect graph 31 O (a) (b) (C) Figure 2.10: Qualitative changes in the line drawing of a cube. summarizes the qualitative appearance of an object from every possible vieWpoint. The meaning of ‘qualitative’ differs from researcher to researcher; often, the quali- tative appearance involves the relative positions of junctions between creases on the object surface and its silhouette (i.e. the ‘line—drawing’ topology). Figure 2.10 shows three qualitatively different line drawings of a cube. Figure 2.10(a) shows a ‘general’ view, in which three surfaces are visible, along with seven vertices. When the cube is rotated so as to obscure one of the faces (Figure 2.10(b)), one vertex disappears along with the edges incident on it. A further rotation eliminates an additional face and two vertices (Figure 2.10(c)). In this discussion, we will define a ‘qualitative’ change as a topological change in the line-drawing of the object 4. Discussion of a representation based on views of an object requires a definition of ‘view’ along with the number and type of parameters which define it. In the context of intensity images, the type of projection assumed in the image formation process must be specified. Under the perspective imaging model, the absolute distance (the standofl) between the object and the sensor distorts the position and orientation of objects in the image. Many authors avoid the problems of perspective distortion by assuming an orthographic projection model. Perspective distortion is most pro- nounced when the sensor is very close to the object being sensed; if we can guarantee that the standoff is fairly large in relation to the size of the object being viewed, the orthographic assumption is not unreasonable to make. Figure 2.11 illustrates the difference between orthographic and perspective projections in terms of visible sides of a 2D object. The sides of the object are labeled 1 through 6 and are drawn with wide dark lines. The sensor can be positioned anywhere outside the object. The gray ‘An alternative notion of qualitative appearance was proposed by Ikeuchi and Kanade [62]; they define two views as qualitatively identical if the same faces are visible in both. 32 Figure 2.11: Effects of standoff on visibility of a 2D object lines divide the plane into regions; each region is labeled with the visible faces of the object in that region. Notice five small triangular regions‘associated with faces 1, 2, 3, 4, and 5. If the (1D) sensor is placed within those regions, only the associated face will be visible. 
At larger standoffs, these effects diminish and the orthographic projection is a reasonable assumption.

Unlike intensity images, range images are not projections; therefore, the image-plane positions of image artifacts will not undergo perspective distortion. In a practical sense, this means that we need not worry about a change in standoff producing a change in the image. The set of parameters defining a view has three members, namely the three components of a rigid rotation that we impose on the object to orient it in the field of view of the sensor. Most researchers ignore rotations about the viewing direction, since such rotations will not affect the object's qualitative appearance. Therefore, a view is specified by two angles, or (alternatively) a 'viewing direction' expressed as a unit vector. The space of view directions lies on a sphere of unit radius called the viewsphere.

With the constraints specified above, we are now in a position to define the aspect graph structure. The aspect graph of an object has the form

$G_{aspect} = \{V_{aspect}, E_{aspect}\}.$

Vertices $v \in V_{aspect}$ represent open connected subsets of the viewsphere from which the qualitative appearance of the object is the same. Edges $e \in E_{aspect}$ between vertices represent so-called visual events, changes of viewpoint in which the qualitative appearance changes. Visual events occur at viewpoints on the boundary of two or more regions of the viewsphere associated with vertices. When the object is viewed from viewpoints lying on these boundaries, the object's qualitative appearance is undergoing a transition; these viewpoints are called accidental. Figure 2.12 shows an aspect graph for a cube. The labeling of the cube's faces is also shown in the figure. At each node in the graph, between one and three faces are visible. Traversing an arc between nodes corresponds to a change in viewpoint that either occludes a face or makes one visible.

Formal approaches to aspect graph construction seek an exact partitioning of the view sphere. Such techniques analyze some form of geometric model for the object and obtain the transition boundaries mentioned above using a 'catalog' of allowable visual events for the particular object. A thorough exploration of the competing techniques is beyond the scope of the present discussion. However, the two major drawbacks of the formal techniques for aspect graph construction are

- Large time and space complexity (for a non-convex polyhedron with n sides and assuming orthographic projection, one existing algorithm [43] requires O(n^8) time and O(n^8) storage to compute the aspect graph);5 and
- A lack of implemented algorithms for computing the aspect graphs of non-polyhedral objects. Recent work has highlighted some formidable numerical problems that must be solved before reliable implementations will become available [104].

5The construction of aspect graphs would typically be performed off-line (not during recognition). Searching (at recognition time) the data structure produced by these algorithms might be time-consuming, however.

Additionally, the current work on 'formal' aspect graphs has emphasized their construction from geometric models, rather than their use in recognition. Once the difficult problems of actually building the aspect graph are solved, the important work on recognition must take place.

Figure 2.12: The aspect graph for a cube.
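To make the aspect graph structure concrete, the following C declarations sketch one way such a graph might be stored. The field names and the choice of a representative viewing direction per node are our own illustration, not a structure taken from the dissertation or from any particular aspect graph algorithm.

```c
/* Minimal sketch of an aspect-graph data structure.  Each node covers a
 * connected region of the viewsphere over which the line-drawing topology
 * is constant; each edge records a visual event between adjacent regions. */
typedef struct {
    int     n_visible;        /* number of faces visible from this aspect   */
    int    *visible_faces;    /* indices of the visible model faces         */
    double  rep_view[3];      /* a representative viewing direction         */
} AspectNode;

typedef struct {
    int from, to;             /* indices of the two adjacent aspect nodes   */
    int face_changed;         /* the face that appears or disappears        */
    int appears;              /* nonzero if face_changed becomes visible    */
} VisualEvent;

typedef struct {
    int          n_nodes, n_events;
    AspectNode  *nodes;
    VisualEvent *events;
} AspectGraph;
```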
Rather than obtaining an exact partitioning of the viewing sphere for each object, a discrete sampling of the viewsphere is taken, images of the object are synthesized for each sample point, and some appropriate features are computed from each image (see Section 2.4.5). Optionally, sets of viewpoints producing similar feature lists are merged into a view class and a representative image is stored with that class. The pragmatic approach to viewpoint-dependent representation was adopted in this dissertation; our modeling system incorporates such view-dependent features as sensed area and simultaneous visibility into a viewpoint-independent model, providing additional information that has proven useful to the recognition procedures.

Chen and Stockman [29] have introduced a new object representation scheme which associates surface type with contour information. This model is built from several range images of the object. Each edge pixel in the image is analyzed to extract its type (silhouette edge, interior jump edge, or crease edge), and the edge points are grouped into contours of the same type. The label for each contour segment is then augmented with information about the surfaces on either side of the edge (unknown, planar, convex, concave, or saddle). Hence, each contour is identified by a triple of attributes, and is called an object wing. Figure 2.13 shows the contour and surface types for three objects which have identical line drawings: a bowl, a half grapefruit, and a clam. The combination of line and surface labelings produces a different wing description of each object. Chen and Stockman show that with five surface types (null, planar, concave, convex, saddle) and four contour types (silhouette, jump, convex crease, concave crease), only 34 of the 100 triples containing two surfaces and one contour correspond to realizable wings. One important feature of the wing representation is that it is designed to accommodate errors in segmentation. This is a very desirable property, because the extraction of edges (especially crease edges) from range imagery is quite difficult.

In our object modeling system, relational graphs constructed from B-reps are augmented with some surface-type-specific information from a uniform viewsphere sampling. No grouping of views into view classes is performed.

Figure 2.13: Contour and surface types for three objects with identical line drawings: a bowl, a half grapefruit, and a clam.

Our vision model for an object is an attributed relational graph with vertex set V, containing geometric primitives and attributes:

V = {(v_i, a_i) | i = 1, ..., n}, v_i ∈ {circular-arc, line, parametric-spline-curve, plane, surface-of-revolution},

where a_i is an attribute set associated with v_i; here, attributes are the geometric parameters of the particular curve or surface. The edge set E contains attributed binary relations between nodes. Binary relations can be both unidirectional (e.g. the inheritance relation between a surface of revolution and its linear axis) and bidirectional (e.g. the angle between two lines). Our relational graph contains redundant information. The computational burden of relational graph construction is not incurred at object recognition time, however; the transformation of IGES models to relational graphs need only be applied when a new object definition is created and the corresponding vision model is needed. Each object is handled separately. So, addition of an object to the database does not change the representations of existing models.
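As an illustration of the node and edge sets just defined, the following C declarations sketch one possible in-memory layout of an attributed graph node. The type and field names are our own (the actual system stores these entities as Common LISP structures, as described in Section 2.4.6); the sketch simply mirrors the information listed above: a primitive type, its geometric parameters, its per-view visible areas, and a list of attributed relations.

```c
/* Sketch of a node in the attributed relational graph (names are ours). */
typedef enum {
    PRIM_LINE, PRIM_CIRCULAR_ARC, PRIM_SPLINE_CURVE,
    PRIM_PLANE, PRIM_SURFACE_OF_REVOLUTION
} PrimType;

typedef struct Relation {
    const char      *name;     /* e.g. "angle", "adjacent", "is-axis-of"  */
    int              other;    /* index of the related node               */
    double           value;    /* attribute of the relation, if any       */
    struct Relation *next;
} Relation;

typedef struct {
    PrimType   type;
    double    *params;         /* geometric parameters of the primitive   */
    int        n_params;
    double    *visible_area;   /* one entry per sampled viewpoint         */
    int        n_views;
    Relation  *relations;      /* attributed arcs incident on this node   */
} GraphNode;
```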
Our choice of geometric features (attributes and relations) is motivated by sensory considerations. We attempt to preserve explicitly the geometric information which most concisely describes the objects in the database: orientation vectors, radii, and surface type labels. Additionally, these quantities can be computed directly from range data: radii and orientation vectors can be found from the principal curvatures and their directions (see Chapter 4). The vision models we construct contain both viewpoint-independent features such as surface intrinsics (radii) and relative quantities (inter-patch orientations), and viewpoint-dependent features (e.g. patch areas).

The geometric inferencing performed here first examines every IGES entity (including invisible entities) in the CAD model individually to calculate node attributes. Then, all pairs of visible entities are processed, and binary relations are produced. Finally, the set of synthetic views is generated and analyzed to produce the view-dependent features for each model surface. In terms of processing time, the view-dependent feature calculations dominate the time needed for vision model construction.

2.4.1 Node Attributes

In this section we define attributes for geometric primitives used in the relational graph. These attributes are either stored explicitly in the analytic IGES representation or calculated by our inference system.

Lines and Curves

A. Line segments. In the IGES specification, a line segment L is characterized by a starting point p_1 and an ending point p_2. The segment's length ||L|| = ||p_2 - p_1|| is also computed and stored as a node attribute for L. Henceforth, we will refer to line segments as lines (with the implicit understanding that we actually mean segments).

B. Circular Arcs. In IGES, a circular arc A is specified in its own (arc-centered) coordinate system (x_a, y_a, z_a), in which the plane of the arc is parallel to the x_a y_a-plane, and displaced from it along the z_a axis. This 'canonical' curve is then transformed into model-centered coordinates by a 3 x 4 transformation (rotation and translation) matrix. The five quantities in the IGES description of the arc are:

- the parallel distance along the z_a axis between A and the x_a y_a-plane;
- the location (x_ac, y_ac) of A's center in arc-centered coordinates;
- the location (x_as, y_as) of A's starting point in arc-centered coordinates;
- the location (x_ae, y_ae) of A's ending point in arc-centered coordinates; and
- a transformation matrix R_A mapping arc-centered coordinates to model-centered coordinates.

Our system computes the following additional attributes for each circular arc primitive:

- the starting and ending angles of rotation in the (x_a, y_a)-plane;
- the fraction of a circle produced by the rotation;
- the coefficients of the plane (in model-centered coordinates) containing the arc;
- the radius of the arc; and
- the location of the arc's center in model coordinates.

C. Parametric Spline Curves. In the analytic IGES files produced by GEOMOD, any model curve which is not linear or circular is expressed as a parametric cubic spline curve C, possibly with multiple segments. The range of parameter values is divided into subsets by break points, and a different parametric cubic curve applies in each segment (curves whose parameter ranges are adjacent are required to join, however). These curves have many parameters.
The ones of interest to us are:

- a 'planarity' flag (curves with this flag set lie in a plane);
- the number n of curve segments;
- the n + 1 breakpoints {t_1, ..., t_{n+1}} in parameter space; and
- coefficients of n + 1 vector-valued univariate polynomials, describing the (x, y, z) coordinates of the n segments defined by the breakpoints.

The curve parameter t lies in the interval [t_1, t_{n+1}]. One additional parameter is present when C is planar. In that case, the IGES specification requires that the z polynomial must be constant-valued. The curve thus defined needs to be transformed into the model-centered coordinate system using a 3 x 4 transformation matrix. In fact, such a transformation can be supplied even for nonplanar curves; however, the GEOMOD software does not produce transformations for other than planar curves.

D. Composite Curves. The IGES specification allows for a hierarchical representation of object curves. A composite curve entity is simply a list of subcurves, each of which can be any IGES curve type, such as the three mentioned above. The curves are required to 'join', i.e. the endpoint of the first subcurve is the beginning of the second, etc. In our models, composite curves are used only as the 'bounding' curve of a planar surface (discussed below). We compute no additional attributes for composite curves at present. Much of the geometric information about the curve is embodied in the subcurves, and can easily be retrieved from them if required.

Surfaces

A. Planes. An IGES planar entity P contains the following information:

- Coefficients: four scalars a, b, c, and d, such that points in P satisfy ax + by + cz = d. We choose the signs of the coefficients so that (a, b, c) is normal to the plane and points out of the solid object being described.
- Bounding curve: a pointer to another IGES entity which describes a closed curve that encloses P.

It is important to note that the bounding curve does not (in general) coincide with the object edges which bound the planar surface under consideration. For example, consider one of the two planar end patches on a right circular cylinder. The 'physical' bounding curve for this planar patch is a circular arc, but the IGES specification allows the bounding curve to be piecewise-linear (e.g. a rectangle), or even a closed parametric curve. In other words, the bounding curve need not correspond to any object edges; it suffices that the bounding curve enclose the 'valid' portion of the plane.

It is often useful to reduce the dimensionality of geometric computations to 2D when planes are involved (for example, it is easier to decide that a point lies in the bounded portion of the plane if the point and bounding curve are transformed into 2D than to do the computations in 3D directly). For this reason, we calculate two attributes for each model plane. First, we find a 4 x 4 homogeneous transformation matrix R_P which maps 3D points into a 'plane-centered' coordinate system (x_p, y_p, z_p), with z_p = 0 on the model plane. This transformation is obtained from the four coefficients a, b, c, and d. A bounding box for the valid region of the plane in the plane-centered coordinates is also calculated. The IGES bounding curve is transformed into (x_p, y_p, z_p) coordinates and sampled. The bounding box of this sampled curve is stored in the vision model. Finally, each IGES plane has a list of areas corresponding to the viewpoints on the uniformly sampled viewsphere; the methods for synthesizing images and estimating the visible area appear in Section 2.4.5.
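One way to construct the plane-centered transformation from the coefficients (a, b, c, d) is sketched below. The choice of in-plane axes is arbitrary and this particular construction is our own illustration, not necessarily the one used by the system; it only guarantees that points satisfying ax + by + cz = d map to z_p = 0.

```c
/* Sketch: build a 4x4 homogeneous transform mapping points on the plane
 * a*x + b*y + c*z = d into a plane-centered frame with z_p = 0. */
#include <math.h>

void plane_centered_transform(double a, double b, double c, double d,
                              double T[4][4])
{
    double len = sqrt(a*a + b*b + c*c);
    double n[3] = { a/len, b/len, c/len };       /* unit normal -> new z axis */

    /* pick a vector not parallel to n, orthogonalize it to get the x axis */
    double t[3] = { 1.0, 0.0, 0.0 };
    if (fabs(n[0]) > 0.9) { t[0] = 0.0; t[1] = 1.0; }
    double dot = t[0]*n[0] + t[1]*n[1] + t[2]*n[2];
    double u[3] = { t[0]-dot*n[0], t[1]-dot*n[1], t[2]-dot*n[2] };
    double ul = sqrt(u[0]*u[0] + u[1]*u[1] + u[2]*u[2]);
    u[0] /= ul; u[1] /= ul; u[2] /= ul;

    /* new y axis = n x u (right-handed frame u, v, n) */
    double v[3] = { n[1]*u[2]-n[2]*u[1], n[2]*u[0]-n[0]*u[2], n[0]*u[1]-n[1]*u[0] };

    /* rows of the rotation are the new axes; the translation moves the
     * plane onto z_p = 0, since n.p = d/len for every point p on the plane */
    for (int j = 0; j < 3; j++) {
        T[0][j] = u[j]; T[1][j] = v[j]; T[2][j] = n[j]; T[3][j] = 0.0;
    }
    T[0][3] = 0.0;
    T[1][3] = 0.0;
    T[2][3] = -d / len;
    T[3][3] = 1.0;
}
```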
B. Surfaces of Revolution. For a surface of revolution S, the following properties are given explicitly in the IGES description:

- a pointer to the axis of revolution A (a line entity),
- a pointer to the generatrix G (the line or curve which is revolved about the axis to 'sweep' out the surface), and
- the starting and ending angles of rotation for the sweeping operation.

Additional attributes calculated for the vision model include

- a more precise surface classification:
  1. cylindrical if G is linear and parallel to A;
  2. conical if G is linear and not parallel to A;
  3. spherical if G is a circular arc and A passes through the center of the arc;
  4. toroidal if G is a circular arc and A does not pass through the center;
  5. generic if G is nonlinear and noncircular.
- if the surface is cylindrical, the radius;
- if the surface is conical, the vertex location and vertex angle;
- if the surface is spherical, the radius and center location;
- the orientation of the axis.

The radius of a cylinder is given by the perpendicular distance between G and A. The vertex angle of a conical surface of revolution is obtained from the scalar product of the orientation vectors of A and G. The cone's vertex is located at the intersection of A and G. If S is spherical, the center and the radius are inherited from G. In addition, the visible area of S, indexed by viewpoint number, is calculated for each point on the uniformly sampled viewsphere.

The radii of spherical and cylindrical surfaces of revolution are often highly informative features for object recognition. In addition to storing these radii along with the corresponding nodes in the relational graph, we also compute a global list of radii for each model. Examination of this list when processing an input image allows models with dissimilar curved surfaces to be rejected by the object recognition module.

Table 2.1 summarizes the geometric primitives provided in the IGES description and the quantities inferred by our system.

Table 2.1: IGES representation and inferred quantities for vision models

  Entity                    Analytic IGES representation          Additional features in vision model
  Line                      starting and ending points            length
  Circular Arc              canonical representation,             angles, fraction of circle, plane
                            transformation                        coefficients, radius, center location
  Parametric Spline Curve   planarity flag, number of segments,   -
                            breakpoints, polynomials
  Composite Curve           pointers to subcurves                 -
  Plane                     coefficients, bounding curve          canonical transformation, canonical
                                                                  bounding box, visible area list
  Surface of Revolution     axis, generatrix, angles of rotation  type, orientation of axis, type-specific
                                                                  attributes, visible area list

In many manufacturing environments, additional attributes not dealing with the geometry of the object surfaces (such as surface finish, paint color, a fabrication script, or material type) accompany the geometric attributes. If these features were considered useful for object recognition, they could also be expressed as node attributes.
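The surface-of-revolution classification rules listed above map directly onto a small decision procedure. The following C sketch is our own illustration of those rules: the representation of the generatrix (a flag plus a direction or an arc center), the assumption of unit-length direction vectors, and the tolerance EPS are simplifications introduced here, not details of the dissertation's implementation.

```c
/* Sketch of the surface-of-revolution classification rules. */
#include <math.h>

typedef enum { SOR_CYLINDER, SOR_CONE, SOR_SPHERE, SOR_TORUS, SOR_GENERIC } SorType;

#define EPS 1e-6

/* distance from point p to the line through q with unit direction dir */
static double dist_point_line(const double p[3], const double q[3], const double dir[3])
{
    double w[3] = { p[0]-q[0], p[1]-q[1], p[2]-q[2] };
    double t = w[0]*dir[0] + w[1]*dir[1] + w[2]*dir[2];
    double d[3] = { w[0]-t*dir[0], w[1]-t*dir[1], w[2]-t*dir[2] };
    return sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);
}

SorType classify_sor(int gen_is_line, int gen_is_arc,
                     const double gen_dir[3],    /* direction, if generatrix is a line */
                     const double arc_center[3], /* center, if generatrix is an arc    */
                     const double axis_pt[3], const double axis_dir[3])
{
    if (gen_is_line) {
        /* parallel generatrix and axis -> cylinder, otherwise cone */
        double cr[3] = {
            gen_dir[1]*axis_dir[2] - gen_dir[2]*axis_dir[1],
            gen_dir[2]*axis_dir[0] - gen_dir[0]*axis_dir[2],
            gen_dir[0]*axis_dir[1] - gen_dir[1]*axis_dir[0]
        };
        double s = sqrt(cr[0]*cr[0] + cr[1]*cr[1] + cr[2]*cr[2]);
        return (s < EPS) ? SOR_CYLINDER : SOR_CONE;
    }
    if (gen_is_arc) {
        /* axis passing through the arc center -> sphere, otherwise torus */
        return (dist_point_line(arc_center, axis_pt, axis_dir) < EPS)
                   ? SOR_SPHERE : SOR_TORUS;
    }
    return SOR_GENERIC;  /* nonlinear, noncircular generatrix */
}
```

Because the classification is applied to exact CAD geometry rather than to noisy sensed data, a very small fixed tolerance suffices here.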
2.4.2 Binary Features

In this section we describe the binary features derived from examinations of pairs of IGES geometric primitives. These features are stored as attributed arcs between the attributed nodes in the relational graph representation. These binary relations fall into three categories: orientation relations, proximity relations, and miscellaneous relations.

Orientation Relations

One local, intrinsic object property that has been used extensively in 3D object recognition systems is orientation; specifically, the relative orientation of object faces. One of the reasons we have focused on a subset of the IGES representation containing only certain geometric primitives is that most of those primitives contain an orientation attribute of some sort. Table 2.2 lists the orientation attributes for each of the IGES entities in our system.

Table 2.2: Geometric primitives and orientation attributes

  Primitive                 Orientation parameter
  Line                      Direction vector
  Circular Arc              Normal to plane of arc
  Parametric Spline Curve   none
  Plane                     Normal vector
  Surface of Revolution     Axis direction

An orientation relation is produced by our geometric inference engine for each pair of geometric primitives for which an orientation attribute exists. In other words, entity pairs containing one or two parametric spline curves produce no relation. The orientation relation for two distinct primitives prim_i and prim_j has the form

θ_ij = angle(prim_i, prim_j),

where θ_ij is the angle between their orientation attributes. Thus, the orientation relation between two lines L_1 and L_2 is simply the angle between them. A less intuitive example is the angle relation between a plane and a surface of revolution. If the normal to the plane is parallel to the axis of the surface, then θ = 0.

In the section on node attributes, we discussed a global feature list constructed from the radii of curved patches. A similar list is constructed from the orientation relations. As above, the purpose of this list is to provide evidence for a certain model's plausibility at recognition time. For example, an angle between two planar surfaces observed in the scene can be used to select models for matching in the recognition module. An observed angle of 45° might cause us to include in the model subset being searched all models with such an angle between two patches. Unfortunately, not all orientation relations are equally salient for recognition. Some of our models contain a large number of 90° and 0° angles between faces. Observation of such an angle in the scene is not especially informative, as it implies that we should search all models with such an angle. The extraction of salient features (unary and binary) from models is an area of future research.
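The orientation relation itself reduces to the angle between two unit vectors. The small C helper below is a sketch of that computation; the function name is ours, and the clamping guards against round-off pushing the dot product slightly outside [-1, 1].

```c
/* Sketch of the orientation relation: the angle between the orientation
 * attributes of two primitives, each represented by a unit vector
 * (line direction, plane normal, or axis of revolution). */
#include <math.h>

double orientation_angle(const double a[3], const double b[3])
{
    double dot = a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
    if (dot >  1.0) dot =  1.0;     /* clamp against round-off */
    if (dot < -1.0) dot = -1.0;
    return acos(dot);               /* angle in radians, in [0, pi] */
}
```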
Proximity Relations

Proximity relations summarize the distance between entities. For pairs of curves (lines, circular arcs, and parametric spline curves), we instantiate a distance-interval relation and fill its two slots with the maximum and minimum distances between the curves. The only proximity relation between a curve and a surface primitive is coincident: this relation is added if the curve lies entirely within the finite surface primitive. If two model surfaces are adjacent (i.e. they share a visible curve), an attributed adjacent relation is added between the two surfaces, with the shared edge appearing as an attribute in the relation.

It is possible, in principle, to calculate distance-interval relations for edge-surface and surface-surface pairs as well as edge-edge pairs. Sampling of the primitives is not an appropriate strategy because of the large amount of computation time that would be consumed in just evaluating distances. Analytic techniques can be applied, at least in theory; the minimum and maximum distance between two primitives can be calculated by a direct solution to a least-squares distance criterion [82]. Implementation of this solution presents practical difficulties, as mentioned below.

Consider the process of finding the minimum and maximum distance between two parametric cubic spline curves

$C_1(u) = (x_1(u), y_1(u), z_1(u))$ and $C_2(v) = (x_2(v), y_2(v), z_2(v)).$

The squared Euclidean distance between the curves (as a function of u and v) is given by

$D(u, v) = (x_1(u) - x_2(v))^2 + (y_1(u) - y_2(v))^2 + (z_1(u) - z_2(v))^2,$

which is a sixth-order function of two variables. Finding the minimum and maximum values of D involves locating values $(u_i, v_i)$ where $\partial D / \partial u$ and $\partial D / \partial v$ are zero, and the determinant of the Jacobian is positive. There will typically be several such values. Since closed-form solutions for roots of fifth-order polynomials do not exist, some numerical techniques must be employed. Each value must be checked against the domains for the parameters u and v to discard distance measurements made outside the 'valid' portion of the curves. The analogous situation for a parametric curve and a parametric surface adds an additional parameter. If the distance between two parametric surfaces is to be minimized, the minimum and maximum of a sixth-order function of four parameters has to be found. In our present work, we felt that sampling provided reasonable results for curve-curve pairs, and coincidence and adjacency relations for curve-surface and surface-surface pairs were sufficient.

Miscellaneous Relations

There are a few binary relations between geometric primitives which we categorize as either identity, containment, or correspondence. One example of a containment relation is coincidence between a pair of planes. Suppose our model contains a visible circular arc and a visible plane which is coincident with the arc (we do not require the bounded plane to contain the plane of the arc). The coincident-with(plane_i, plane_j) relation between these two primitives means that they have the same geometric parameters, but the physical surfaces on the model are disjoint. If a model curve lies within the two-dimensional bounding box calculated for a planar primitive, an in-plane-bbox relation is present. Identity relations express either inheritance of geometric information (e.g. a surface of revolution inherits the orientation of the line entity used as its axis), or a complete correspondence between curves (e.g. two line entities with coincident endpoints have an identical relation between them). Finally, correspondence relations are used to flag similar surfaces of revolution (similar in type, or in type and radius).

Table 2.3: Relations produced by the geometric inference engine

  Name               Type            Description
  angle              orientation     angle between the orientation attributes of two primitives
  distance-interval  proximity       minimum and maximum distances between two curves
  coincident         proximity       a curve lies entirely within a finite surface primitive
  adjacent           proximity       two surfaces share a visible curve (the shared edge is an attribute)
  coincident-with    containment     two primitives have the same geometric parameters but disjoint surfaces
  in-plane-bbox      containment     a curve lies within the 2D bounding box of a plane
  member-of          containment     a curve is a component of a composite curve
  identical          identity        two curves are in complete correspondence
  is-axis-of         identity        a line entity serves as the axis of a surface of revolution
  same-radius        correspondence  two surfaces of revolution are similar in type and radius

2.4.3 Higher-order Relations

At present, our system computes attributes for single primitives (unary relations), and binary relations between pairs of primitives. It is possible to define tertiary or even higher-order feature types, especially in the case of polyhedral objects.
For example, in a ‘blocks world’ allowing trihedral objects, the tertiary corner feature type might be quite valuable at recognition time (if one can insure that corners are accurately and reliably detected). At present, we choose not to extend our geometric inference engine in that direction; however, such relations could easily be added. The corner relation for polyhedra could be detected by detecting coincident endpoints of line primitives. ‘Corners’ also exist for some curved objects. However, the utility of explicit corner relations might be limited due to the difficulties encountered when trying to detect such features. An underlying principle in our object recognition system is that the more reliably a feature can be estimated, the greater the value it has for recognition. From that viewpoint, surfaces are more valuable than edges, because they can be extracted more easily from range data than edges. Corners are less reliable than edges, since small changes in the locations of segmentation boundaries in an image can change the location of a corner feature. Table 2.3 lists each relation produced by the geometric inference engine, its type, and a brief description. 47 2.4.4 Calculating Connectivity Earlier, we highlighted a drawback of IGES as a representation: the lack of explicit inter-surface connectivity information. Connectivity is easily extracted from the uni- versal file generated by GEOMOD (see Section 2.4.5), which contains point lists making up the model surfaces. Our procedure steps through the polygonal surface approximations for each pair of model surfaces extracting points on their common edge. As an additional feature (used during match validation), a 3D bounding box for the edge is calculated and stored along with the connectivity information. 2.4.5 Viewpoint-Dependent Features In the previous section, we noted that patch areas are added to the node attributes of model surfaces. Why are surface areas important in 3D object recognition, and how are they calculated? Compelling arguments have been made for the use of viewpoint- dependent quantities in model-based vision. In fact, some recognition systems classify objects by comparing the sensed scene against ‘sample’ views stored in a database, without referring to a model-centered description at all [37]. One significant drawback for the use of viewpoint-dependent description is related to model organization. How are ‘representative’ viewpoints chosen? How are the features calculated from each viewpoint linked? How many vieWpoints are ‘enough’ for a particular object, sensor, or application? Many researchers are attacking these questions; recent work on aspect graphs has produced interesting results, but complete systems have not yet appeared (see Section 2.3.2). In our work, we have found that an integration of both viewpoint-invariant and viewpoint-dependent features in a model-based 3D object recognition system offers powerful advantages over the use of either type of description by itself. However, we wish to stress that the viewpoint-dependent features play a subordinate role in object recognition. The primary source of geometric knowledge about objects is the CAD model; surfaces in the model are augmented with the viewpoint-dependent information. Representative Views: The Uniform View Sphere We take a practical view of the construction of viewpoint-dependent feature lists. 
A set of viewpoints on the view sphere, which provides a near-uniform sampling of this finite space, is selected. Then, for each viewpoint in that set, a range image of the object is generated, along with a synthetic segmentation which associates each surface point in the synthetic range image with the corresponding surface in the CAD model. This range image and segmentation are then used by the feature-extraction routines to estimate the viewpoint-dependent features of interest.

A total of 320 points on the viewsphere are generated from a subdivided icosahedron (the 20-sided regular polyhedron), with its vertices on the unit sphere. Ballard and Brown [3] give an algorithm for constructing a subdivided icosahedron. Let t denote the golden ratio $(1 + \sqrt{5})/2$. Then the coordinates of the 12 vertices of the icosahedron are given by

$(0, \pm a, \pm b), \; (\pm b, 0, \pm a), \; (\pm a, \pm b, 0),$

where

$a = 1 / \sqrt{1 + t^2},$   (2.2)
$b = t / \sqrt{1 + t^2}.$   (2.3)

Each of the triangular faces of the icosahedron is split into 16 subtriangles by dividing its sides into four line segments of identical length and forming triangles. The new interior vertices are not guaranteed to lie on the unit sphere, however; they must be 'pushed out' to the sphere along their direction vector. After the 320 triangular faces have been obtained, a view direction for each facet is taken as the vector normal to the facet. Therefore, we obtain 320 viewpoints from which synthetic images of the object are generated. Figure 2.14 shows the icosahedron, the subdivision process, and the resulting 320-sided polyhedron.

Figure 2.14: Subdivision of icosahedron faces into 16 subfaces. (a) Icosahedron. (b) Subdivision of an icosahedron face into 16 subfaces. (c) Resultant 320-sided polyhedron. The view directions are chosen to lie at the center of the triangular faces.
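The subdivision of a single icosahedron face can be sketched as follows; applying it to each of the 20 faces yields the 320 sampled view directions. This is our own illustration: it uses the normalized centroid of each subtriangle as its view direction (a simplification of the facet normal described above), and all function and variable names are assumptions.

```c
/* Sketch: generate the 16 view directions contributed by one icosahedron
 * face whose unit-sphere vertices are A, B, C. */
#include <math.h>

static void normalize(double v[3])
{
    double n = sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    v[0] /= n; v[1] /= n; v[2] /= n;
}

void subdivide_face(const double A[3], const double B[3], const double C[3],
                    double dirs[16][3])
{
    int k = 0;
    /* lattice point P(i,j) = A + (i/4)(B-A) + (j/4)(C-A), 0 <= i+j <= 4 */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4 - i; j++) {
            /* "upright" subtriangle with corners (i,j), (i+1,j), (i,j+1):
             * its centroid sits at offsets u = (i+1/3)/4, v = (j+1/3)/4   */
            double u = (i + 1.0/3.0) / 4.0, v = (j + 1.0/3.0) / 4.0;
            for (int c = 0; c < 3; c++)
                dirs[k][c] = A[c] + u*(B[c]-A[c]) + v*(C[c]-A[c]);
            normalize(dirs[k]); k++;

            if (i + j < 3) {
                /* "inverted" subtriangle (i+1,j), (i,j+1), (i+1,j+1) */
                u = (i + 2.0/3.0) / 4.0; v = (j + 2.0/3.0) / 4.0;
                for (int c = 0; c < 3; c++)
                    dirs[k][c] = A[c] + u*(B[c]-A[c]) + v*(C[c]-A[c]);
                normalize(dirs[k]); k++;
            }
        }
    }   /* k ends at 16: ten upright and six inverted subtriangles */
}
```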
Synthetic Range Image and Segmentation

Given a point on the viewsphere (which represents a viewing direction), we synthesize a range image and a corresponding segmentation by using a polyhedral approximation to the object. In addition to an IGES object description, GEOMOD can create a text file known as a 'universal' file; this allows models to be exchanged between GEOMOD software executing on different hardware platforms. The universal file contains a large amount of information about the object, its creation history, and several geometric specifications. Among these specifications is a list of 3D polygons which lie on the object surface. During the design process, this polygonal approximation can be refined to any desired accuracy; this allows curved surfaces to be well approximated for the task of synthetic range image generation. The models we designed for this dissertation had their polyhedral approximations refined so that the maximum deviation from a polygon to its corresponding curved surface was 0.001 inch, comparable to the noise level of our 3D sensor.

We now introduce some notation for the polyhedral approximation produced by GEOMOD. An object O is a list

O = {M_1, ..., M_n},

where each M_i is a model surface appearing as a single surface in the IGES file. A common point list

P = {p_1, ..., p_m}

contains all of the 3D points serving as vertices of the planar facets making up the approximation. Each surface M_i is further decomposed into a list of polygonal facets:

M_i = {f_{i1}, ..., f_{i n_i}},

where the f_{ij} are the facets and n_i is the number of facets in model surface i. Each facet is defined by its bounding vertices; here, the indices of the points in the point list P are used:

f_j = {k_{j1}, ..., k_{j n_j}}.

While the notation is somewhat cumbersome, it is a concise way to express the hierarchical structure of the object (the object built from surfaces, each surface built from a number of facets, each facet being built from 3D points). The common point list assures that connectivity information is not lost if the object undergoes a transformation.

The polygonal approximation to the object of interest is easily extracted from the universal file produced by GEOMOD. When a synthetic range image is desired, the point list P is transformed (rotated) according to the desired viewpoint. Then, a process known in the computer graphics literature as 'polygon scan conversion' is applied to each polygon [96]. Scan conversion of a single polygon is accomplished by finding the subrectangle of the image which is occupied by the transformed polygon and sequentially filling in the pixels within this box with the appropriate range information. At the same time, an auxiliary 'label' image is updated with the model surface number corresponding to the polygon being scanned. In this way, a synthetic range image and segmentation are generated simultaneously from the polyhedral approximation to the 3D object. Contrast this approach with the ray-casting method used by Hoffman [56]; the ray-casting approach sequentially examines rays through the pixels in the output range image and finds those model surfaces which intersect that ray. As the number of model surfaces (here, facets) increases, polygon scan conversion becomes more computationally efficient than ray-casting.

Our use of the universal file and its polyhedral approximation has one disadvantage. The universal file format is specialized to the GEOMOD software, and the code which makes use of it must be modified for use with a different modeling system. However, a polyhedral approximation to an object is usually available from CAD systems.

Viewpoint-Dependent Features: Visible Area and Simultaneous Visibility

From the synthetic range image and its segmentation, we can produce estimates of the area of visible patches in the image. These quantities (computed for each model surface and view) are used to prune the interpretation tree produced during the recognition phase. In addition, a binary feature defined on pairs of patches is also easily calculated. Two patches are never simultaneously visible if, for all views, their areas are never simultaneously positive. This visibility feature also provides constraints on interpretations during the recognition procedure.

Our procedure for calculating area proceeds as follows. For each view, the synthetic segmentation (the label image) is examined in 2 x 2 blocks of pixels. If all four labels in a block are identical, the visible area of the corresponding model surface is incremented by the 3D area of the 2 x 2 block. This increment is calculated by breaking the 2 x 2 block into two triangular facets and adding the areas of those triangles together.

While our system only calculates area and visibility features from the CAD models, additional features might also prove useful. Ikeuchi and Kanade [62] have used EGI-based features in their vision system. We could also augment the view-independent edge information extracted from the universal file with visible edge characteristics calculated from the sample views (such edge features might not be very useful in occluded scenes, however).
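The visible-area computation just described amounts to a single pass over the label image. The following C sketch is our own illustration of that pass; it assumes the synthetic range image stores a 3D coordinate per pixel and uses names chosen here for clarity.

```c
/* Sketch of the visible-area computation: walk the label image in 2x2
 * blocks and, when all four labels agree, add the 3D area of the block
 * (split into two triangles) to that surface's running total. */
#include <math.h>

static double tri_area(const double a[3], const double b[3], const double c[3])
{
    double u[3] = { b[0]-a[0], b[1]-a[1], b[2]-a[2] };
    double v[3] = { c[0]-a[0], c[1]-a[1], c[2]-a[2] };
    double w[3] = { u[1]*v[2]-u[2]*v[1], u[2]*v[0]-u[0]*v[2], u[0]*v[1]-u[1]*v[0] };
    return 0.5 * sqrt(w[0]*w[0] + w[1]*w[1] + w[2]*w[2]);
}

/* labels: rows x cols surface labels (negative = background)
 * points: rows x cols 3D coordinates, one per pixel
 * area:   per-surface accumulators, zeroed by the caller              */
void accumulate_visible_area(const int *labels, const double (*points)[3],
                             int rows, int cols, double *area)
{
    for (int r = 0; r + 1 < rows; r += 2) {
        for (int c = 0; c + 1 < cols; c += 2) {
            int l = labels[r*cols + c];
            if (l < 0) continue;
            if (labels[r*cols + c+1]     != l ||
                labels[(r+1)*cols + c]   != l ||
                labels[(r+1)*cols + c+1] != l)
                continue;
            const double *p00 = points[r*cols + c];
            const double *p01 = points[r*cols + c + 1];
            const double *p10 = points[(r+1)*cols + c];
            const double *p11 = points[(r+1)*cols + c + 1];
            area[l] += tri_area(p00, p01, p11) + tri_area(p00, p11, p10);
        }
    }
}
```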
The selection of the correct 'mix' of view-dependent and view-independent features for object description is still an open problem.

2.4.6 Implementation Details

The geometric inferencing system has been implemented on Sun-4 workstations in C and Common LISP. After designing the object using the GEOMOD software, a file containing the IGES information is translated by a C program into LISP S-expressions. Each object is identified by a name (e.g. cube, hurdle, sundial). Instances of object types are LISP structures, with corresponding slots and values. Each primitive in the object description has a unique label in the graph structure. For example, if we have an object labeled baseball-bat represented as a single surface of revolution, the axis of the revolved surface would be associated with a structure named baseball-bat-line-n, where n is an integer that distinguishes this line from all other baseball-bat-line-m in the model, m ≠ n (actually, n is taken from the position of the entity in the IGES file, and is unique for all entities of any type in the model). The structure contains slots for the parameters of the line (starting and ending points and length). Binary relations affecting an entity are stored in a slot labeled RELATIONS. The visible-area features calculated from the viewsphere sampling are placed in a slot labeled AREA.

Most computations of the node attributes and binary relations are performed in C. These attributes and relations are assembled into the graph structure within LISP, and connectivity information is constructed. We chose to use C for numerical processing because of its speed advantages over LISP. Once the graph is available, however, LISP allows code which traverses the graph structure to be easily developed. The viewpoint-independent features calculated directly from the IGES file take a few minutes to compute. Generation of synthetic views and calculation of patch areas for each view takes between twenty minutes and one hour per model, depending on the complexity of the polyhedral approximation in the universal file.

2.4.7 Examples

In this section we give examples of relational graphs constructed for two of the objects in our database. Our CAD models were constructed from physical prototypes whose dimensions were measured by hand. In the manufacturing environment, CAD models are often built from designs without reference to physical examples.

Polyhedron

Figure 2.15 shows an edge map and a corresponding synthetic range image of a polyhedral object.

Figure 2.15: Images of a polyhedral object. (a) edge map; (b) range image.

This object has twelve vertices (3D points) and nine faces. Table 2.4 gives the coordinates of the vertices and the list of faces, expressed as a sequence of vertices which bound the face when traversed in order. For example, face 1 is the plane bounded by a path through the following vertices: 1, 9, 10, 5, 6, 2, returning to 1. The units of the 3D vertex coordinates are inches. A CAD model was constructed from a prototype of this object using CSG techniques, as shown in Figure 2.16. First, a parallelepiped of dimensions 4.37 x 3.15 x 2.87 was instantiated. A second parallelepiped was subtracted, leaving the L-shaped block. This block was intersected with a bounded halfspace, which removed a corner of the L-shaped object, leaving the block in its final form. A portion of the IGES description of the object appears in Section A.1 of the Appendix.
An IGES file contains a header with information about the software producing the file and the machine it was executed on, along with some global parameters. Following the header is a directory section, which contains pointers to the geometric data for each model entity, along with some flags (we are only interested in the flag indicating the visibility or invisibility of the object). Following the directory is the parameter data section, which contains the geometric parameters for each entity.

Table 2.4: Vertices and facets for the polyhedral object.

  (a) Model vertices
  Point   Coordinates
  1       (0.00, 0.00,  0.00)
  2       (4.37, 0.00,  0.00)
  3       (4.37, 0.00, -2.87)
  4       (0.00, 0.00, -2.87)
  5       (1.65, 1.57,  0.00)
  6       (4.37, 1.57,  0.00)
  7       (3.00, 1.57, -2.87)
  8       (1.65, 1.57, -2.87)
  9       (0.00, 3.15,  0.00)
  10      (1.65, 3.15,  0.00)
  11      (1.65, 3.15, -2.87)
  12      (0.00, 3.15, -2.87)

  (b) Point lists for model facets
  Facet   Point list
  1       1 9 10 5 6 2
  2       2 6 3
  3       6 7 3
  4       5 8 7 6
  5       5 10 11 8
  6       9 12 11 10
  7       9 1 4 12
  8       1 2 3 4
  9       11 12 4 3 7 8

Figure 2.16: The CSG model for the polyhedron.
Figure 2.17: A portion of the relational graph for the polyhedron (edges, surfaces, and adjacencies).

The C language part of the inference engine produces approximately 2000 lines of LISP code; selected lines appear in Section A.1 of the Appendix. About 400 of the S-expressions are descriptions of nodes in the relational graph and 'housekeeping' statements allowing the graph to be traversed easily. The remainder of the S-expressions are binary relations including connectivity and visible-area quantities. A graphical rendition of the relational graph appears in Figure 2.17. Only a few of the relations are drawn on the graph.

A curved object

Figure 2.18, parts (a) and (b), show hidden-line and synthetic range images of a curved part with rotational symmetry. The IGES file for this part contains four cylinders, five planes, ten circular arcs, and a few additional invisible primitives. A few of the nodes and edges in the relational graph are depicted in Figure 2.19.

Figure 2.18: Hidden-line and synthetic range images of a rotationally-symmetric part.
Figure 2.19: A portion of the relational graph generated for the rotationally-symmetric part.

2.5 Object Model Database

Twenty solid models were constructed for use with the BONSAI recognition system. All were created using the GEOMOD software; relational graphs were constructed from the IGES and universal file descriptions using the methods described above. The models were constructed using the CSG approach, using measurements taken on physical examples. In the brief descriptions below, we will note that several models are simplified representations of the actual physical objects. These simplifications (usually the covering of object holes) were made primarily because surfaces within deep concavities on the object are unlikely to be detected by our range sensor, much less correctly classified. If a different range sensing strategy (such as FM radar) were used, we might wish to increase the level of detail of the models. For now, we concentrate only on representing the gross shape of the object using three types of primitive surfaces.

Synthetic range images of the twenty objects appear in Figure 2.20. A corresponding collage of 'pseudo-intensity' images for the same objects appears in Figure 2.21. (The intensity value at each pixel in the pseudo-intensity collage is proportional to the z-component of the surface normal at the corresponding location in the range image; Fan et al. [37] state that artificial shading of this type makes better use of the dynamic range of displayable gray values.) We now briefly describe the objects used in our experiments.

1. ADAPTER: This object is a fitting used to change the diameter of plastic piping.
We covered the ends of the object with paper for the purposes of design and image acquisition; therefore, the model and our images do not contain the hole through the center. Shadowing effects would make reliable extraction of the parameters of the inner curved surfaces difficult in any case. The solid model consists of three planes (all parallel), and two cylinders of slightly different radii (1.125 and 1.375 inches).

2. AGPART2: This rotationally-symmetric object was obtained from a machine shop on the Michigan State University campus. Its solid model consists of five cylinders and five planes.

3. BALL: This object is a child's ball made of rubber, with a radius of 3.5 inches.

4. BALL2: This object is a spherical globe from a lighting fixture; its radius is 3 inches.

5. BIGWYE: This object is a plastic plumbing part known in the trade as a 'wye' because it is used to join two pipes at an angle. The join angle is 45 degrees, and the arms of the object have different outer radii. As with the ADAPTER object, we judged that the parameters of the inner surfaces would be difficult to obtain because of shadowing artifacts, so the ends of the pipe sections were covered with paper and specified as planes in the solid model.

Figure 2.20: Synthetic range images of twenty objects.
Figure 2.21: Pseudo-intensity images of twenty objects.

6. BLOCK1: This polyhedron was constructed from wood. It has eight planar faces.

7. BLOCK2: This polyhedron contains ten faces.

8. BLOCK4: This convex polyhedron has six faces.

9. BOX2INCH: This object is a cube with a side length of 2 inches.

10. CAP: This object is the cap from a can of glass cleaner. We covered the open end of the cap with paper to avoid shadowing problems with parameter estimates of the inner surfaces.

11. COLUMN1 and COLUMN2: These two objects are similar, consisting of a cylinder and a rectangular block. The cylinder is centered in a square face for the COLUMN1 object. The rectangular block is larger in the COLUMN2 object and the cylinder is affixed near to one side of a planar face, rather than being centered.

12. CURVBLOCK: This object is the most complicated in our database. It consists of thirteen surfaces (twelve planes and one cylinder).

13. GRNBLK3: This object is a nine-sided polyhedron.

14. HALFSPH: This object is a hemisphere of radius 1.25 inches.

15. HARRISCUP: This model was constructed from a coffee cup. Because the handle cannot be represented exactly as a planar, spherical, or cylindrical primitive (and the segmentation procedure would usually label it unclassifiable anyway), we did not attempt to model it. We did model the inside of the cup, however.

16. HUMP: This object consists of a half-cylinder cutout affixed to a rectangular block.

17. PISTON: This object is a large cylindrical piece of metal. The object from which the model was constructed has a complicated crankshaft attachment that we did not attempt to model.

18. PROPANE: This object is a cylindrical propane tank for use with a small outdoor grill.
We did not model the threaded fitting used to attach the tank to the grill.

19. TAPEROLL: This object (a roll of masking tape) is a cylindrical ring of inner radius 1.5, outer radius 2.5, and height 1.5 inches.

2.6 Summary

In this chapter, we have described a geometric inference engine which produces a relational graph representation of a 3D object from a geometric model specified using the IGES design standard along with a polyhedral approximation. While the GEOMOD solid modeler was used in our work, any 3D CAD system which generates object descriptions using the 'analytic' IGES form could be used instead. In addition to the process of translation from one data format to another, the inference engine extracts higher-level information from the CAD model and stores it explicitly in the new data structure. The higher-level features will allow the search space explored during the object recognition stage to be pruned early. The view-independent features extracted from the IGES model description are supplemented with view-dependent features (areas and simultaneous visibility lists) for each model surface taken from 320 views of the object.

The contribution of the work reported in this chapter is twofold.

• The model-building step in machine vision has traditionally assumed one of two forms: construction by hand, or construction by learning from images (often requiring human interaction). The availability of object geometry in CAD databases motivated our construction of a model-building system which constructs relational graphs automatically, without human intervention.

• Secondly, we desire to separate a strategy for recognition of 3D objects into on-line and off-line phases. The off-line procedure described here allows the recognition process to become a traversal of an existing database rather than the construction of such a database at recognition time.

Chapter 3
3D Sensing

3.1 Introduction

The model-based vision system described in this dissertation uses range images (also known as depth maps) as its input. Range images enjoy many advantages over traditional intensity maps for computer vision tasks, but hardware (and software) for sensing depth is often more complicated and expensive than intensity sensors. In addition, depth sensors typically take longer (seconds to minutes) to produce an image than intensity sensors.

The specification of sensing methods (the actual physics of the sensor) has often been decoupled from the vision task by researchers; their philosophy regarding sensing implies that as long as the format of the image input to the processing step agrees with the output of the sensor, any sensor providing that sort of data may be used. In practice, the interface between sensing and processing is not a sharp boundary; consider the extraction of range maps from intensity imagery using shape-from-X techniques. In this situation, we would still be tempted to call the input to the processing stage a range image, and lump the processing necessary to derive it from the intensity information into the sensing process. Some recent work has addressed the impact of sensor specifications on visual tasks [54, 62].

While many range sensors provide data in an R_ij (grid or raster) format, R representing range and (i, j) representing image plane coordinates as above, other range sensors simply provide lists of Cartesian coordinates (x_i, y_i, z_i), i = 1, ..., n, for some (usually large) value of n.
In the first format, connectivity in image plane coordinates is explicit. When the image is presented as a list of 3D points, connectivity between neighboring points can be established by placing the data in a k-D tree or similar data structure [42]. Clearly, the R_ij form can easily be transformed into Cartesian form, but the converse transformation is not possible in general. Most vision research employing range data has employed the R_ij format for synthetic and real imagery.

The problem domain often dictates the design parameters of the range sensor [68]. Table 3.1 lists a few application areas and the constraints in terms of spatial resolution and data bandwidth.

Table 3.1: Range sensor requirements for different application areas [68].
    Application Area          Description
    Integrity and Placement   Detect missing or misaligned components
    Metrology                 High-precision measurement
    Surface Inspection        Exhaustive scan for flaws and assembly defects
    Modeling                  Construct a solid model from multiple images
    Recognition               Recognize and locate objects in an image
    Navigation                Obtain scene data for an autonomous vehicle
                              (spatial resolution on the order of 1-10 inches)

3.2 Survey: Range-Finding Techniques

In this chapter we briefly describe most of the techniques currently being used to acquire range data from a scene. This discussion borrows heavily from two excellent surveys, Besl [7] and Nitzan [87], and is meant to give a flavor for different sensor types rather than a detailed description of methods; for details, the reader is referred to these surveys, or Jarvis' earlier work [69].

In attempts to group range sensors into classes, it is often informative to describe them in terms of the following properties.

• Active vs. passive: an active sensor illuminates its environment. A passive sensor does not.

• Monocular vs. multinocular: a monocular sensor produces one image. A multinocular sensor produces more than one. Binocular stereo sensors process two intensity images in order to obtain depth. Some laser radar systems produce registered range and intensity images. Multiple images can provide complementary information, but the question of integration of visual information has only recently been addressed by researchers.

Not all sensors provide absolute depth information; some only produce the relative change in depth between distinguished scene points. Photometric stereo sensors do not produce range data at all, but instead produce an image of surface normals. In some applications, surface orientation may be the most important feature to obtain, making such sensors preferable to other types.

3.2.1 Radar Sensors

Radar-based range sensors are often labeled active devices because the sensor illuminates the object with a light source (often a laser), and range is calculated based on the reflected energy. Most range sensors of this type employ lasers as the illumination source, yielding the term laser range finder.

Perhaps the most intuitive use of radar techniques for range sensing is simple time-of-flight measurement: since electromagnetic radiation travels at a constant speed, the distance to a point on an object's surface can be determined by directing a pulse of laser light to that point and measuring the time interval needed for the reflected pulse to arrive back at the sensor.
This sensing method is quite feasible for long-distance sensing, such as in laser radar range finders for target acquisition, where the sensor is located several hundred meters from the objects to be sensed. Such sensors often produce both range and intensity images (the intensity is derived from the amplitude of the returning laser light pulse); Jain et al. have investigated sensor fusion issues for automated target recognition using such data [65]. Small-standoff time-of-flight range finders have been constructed but the precision of the optical equipment must be high as the time intervals to be measured are extremely short [119]. An alternative to measuring pulse flight time is to continuously illuminate the surface point with an amplitude-modulated laser signal. The phase difference between the incident and reflected energy is proportional to the range value sought. AM range sensors’ most significant problem is the ambiguity interval: the range value is determined only up to a value within an interval whose length is inversely proportional to the frequency of the amplitude modulation. In other words, for a sensor returning an N-bit range value, the AM frequency may be increased, providing greater range resolution in a smaller ambiguity interval, or it may be decreased, enlarging the ambiguity interval but coarsening the range value quantization correspondingly [81, 99]. 67 Instead of modulating the amplitude of the laser signal, the frequency of the light radiation can be linearly swept between two values in a certain time interval (the sweep period), and range can be estimated using coherent heterodyne detection. Besl [7] reports on a few sensors of this type. 3.2.2 Triangulation-Based Sensing Triangulation sensors are very inexpensive to build, requiring none of the preci- sion optics and/or electronics of the laser range finders described above. A single monochrome camera and a light projector are sufficient hardware to construct a tri- angulation sensor based on structured light [60]. The software to derive depth is often more complex, however. The basic idea of triangulation sensing is straightforward: a point (z,y,z) in the 3D world images to a point (u,v) in the image plane. A priori specification of the geometric relationship between the light source and the camera allows us to relate a 3D point (illuminated by the light source) and its 2D projection (a ‘bright’ pixel in the intensity image produced by the camera) by similar triangles. Frequently, the geometric relationship is obtained through a calibration procedure. A simple diagram of a triangulation sensor appears in Figure 3.1. A variety of methods for triangulation sensing have been developed into commer- cial and ‘home-brewed’ sensors. Some sensors project an individual point of light on the object and move the point over the object surface to obtain image data (either jo or (:r,y,z) format). Other sensors use projected lines of light, grids, coded stripes, or synthetic texture to obtain multiple range points from a single image. The sensor used in our research uses the triangulation principle to derive depth; details are given in a later section of this chapter. 3.2.3 Stereo Techniques Passive stereo techniques employ constraints between two or more cameras to derive depth. In binocular stereo, the image plane coordinates of two cameras are related by a known transformation. 
Identification of points in the two images which correspond to the same 3D point in the scene allows us to calculate the coordinates of that 3D point. The difficult problems in stereo vision are the selection of ‘distinguished’ points in the two images and the establishment of correspondences. Typically, high- gradient (i.e., edge) points are chosen as candidate points for correspondence, and 68 Laser Stripe Projector Figure 3.1: Triangulation-based range sensing heuristic search procedures attempt to form the correspondences [112]. Extensions to three cameras (trinocular stereo) have recently been proposed [50, 106]. Stereo vision systems have been used in outdoor scenes for navigation [105]. 3.2.4 Shape-From-X Can we recover the shape of a 3D object from a single intensity image? Researchers have approached this problem from many directions, producing techniques known collectively as ‘shape-from-X’ methods. Horn [59] develops the shape-from-shading method by exploiting constraints on the reflectance map (which relates surface orien- tation to image brightness). Photometric stereo [59] is another technique, employing multiple light sources and monocular intensity images in order to recover 3D shape. The photometric stereo method does not recover absolute depth, however. While shape-from-shading recovers surface orientation directly, it can be used to estimate range values as well. Vaillant and Faugeras [115] developed a system to estimate the depth map along occluding image contours (shape-from-contour). Nalwa [84] in- ferred surfaces of revolution from bilaterally symmetric image contours. Blostein and 69 Ahuja [13] extract surface shape by estimating the perspective distortion of texture elements extracted automatically from the unknown scene (i.e. the texture primitives themselves are selected from the image, not specified a priori). 3.2.5 Other Approaches A novel approach to range estimation is to examine a pattern of Moire fringes pro- jected onto the scene entities [7]. A Moire pattern is produced by passing light through two gratings which ‘interfere’ with one another through a slight angular offset. Often the two gratings are identical and contain a pattern of equal-width opaque and trans- parent stripes. When two of these gratings are overlaid with different orientations, a pattern of interference fringes is produced. Moire fringe images are often obtained by projecting light through one grating and placing the other grating in front of the camera lens. Relative depth between centers of adjacent interference fringes is fairly easily computed, but the absolute depth is harder to find, requiring some ‘ground truth’ about the true location of a few points in the scene. Many additional varia- tions on the Moire theme have been described in the literature, but the difficulty in extracting absolute depth has restricted the application of these sensors. Fairly flat objects can be measured using holographic interferometry in a process similar to the Moire technique above. Instead of two gratings producing an inter- ference pattern on the scene, two lasers (with identical polarization) illuminate the object, also producing an interference pattern on the surface. The pattern can be analyzed much as a Moire pattern to yield relative depth between adjacent bright stripes in the pattern. Depth from focus techniques exploit the law of thin lenses to obtain range [73]. 
Given a lens of focal length f, a point z units in front of the lens, and a distance w behind the lens upon which the point at z is focused,

    1/w + 1/z = 1/f.

Assuming that the thin-lens model holds, range can be determined for an illuminated scene point as long as it can be determined when the point is in focus on the image plane. Depth-from-focus seems to have high potential for the future in vision research; several such sensors have been constructed [90, 111].

Fresnel diffraction illuminates the scene with coherent light passed through a diffraction grating. The physics of coherent light diffraction state that the light reproduces the grating pattern at periodic intervals. Within these intervals range can be estimated from the local contrast in the image, as the 'focused-ness' of the diffraction grid image depends upon relative position within the interval. This has formed the basis for a few range sensors.

3.3 Technical Arts 100X Scanner

This section describes the Technical Arts 100X scanner [113], a structured-light range finder using a laser for active illumination. A photograph of our range sensing workbench appears in Figure 3.2. Figure 3.1 illustrates the general principles of the sensor and the sensor coordinate system.¹

¹The origin of the sensor coordinate system is fixed during the calibration procedure.

Figure 3.2: The range sensing worktable.

A tube creates the coherent laser radiation with a power level of 5 milliwatts. The beam travels horizontally from the tube's aperture, reflects off of a mirror, and strikes an oscillating mirror. The oscillating mirror creates a vertical triangular plane of light which falls on the objects on the work table. The sensing footprint is about 12 x 12 inches. Instead of using a mirror to move the plane of light, two motorized stages are used to position the object under the plane of light. One stage translates in the x direction, and the other moves the object in the y direction. At present, we do not perform any y translations of the object during a scan.

A CCD camera (with a 240 x 240 sampling grid) is positioned so that its optical axis intersects the plane of laser light. The angle between the camera axis and the z direction is 45 degrees. The image and sensor coordinate systems are aligned so that position along the stripe (the y coordinate) is proportional to the row number in the image plane, and height above the tabletop (the z coordinate) is proportional to the column number in the image produced by the CCD camera. For each constant-x range profile, the corresponding profile in the image plane is an irregular, broken contour. Calibration of the position of bright spots in the profile against their known positions in 3D allows the y and z coordinates to be determined. This process is applied to each row of the CCD camera image, yielding up to 240 depth values along a slice. After the slice is processed, the x stage can be translated and the process repeated. Note that the distance between slices need not be the same as the distance between adjacent rows of a slice. In our work, we adjust the number of slices and the spacing between slices in order to keep the aspect ratio between 0.5 and 2.0. Figure 3.3 shows a range image with a highlighted column corresponding to a single scan, and the 3D points extracted during that scan.

Figure 3.3: Range values for a single laser stripe.

The output of the 100X scanner is a grid-format range image

    R = {r_ij = [x_ij y_ij z_ij] | x_ij = k_x j, y_ij = k_y i, 1 ≤ i ≤ m, 1 ≤ j ≤ n}.
Note that we know the k, and k, values, so the 3D coordinates of each image point are known. We typically calibrate the sensor to produce measurements in inches (although it could easily be calibrated to operate in metric units as well). To calibrate the sensor, an object called a ‘calibration gauge’ is placed under the laser stripe, as 72 Figure 3.4: Calibration of the 100X scanner. shown in Figure 3.4. The central V-shaped portion of the gauge is 3.5 inches deep and 7.0 inches across. Repeated scans of this object are acquired, and the pitch and roll of the CCD camera are adjusted after each scan, until the calibration error falls below thresholds supplied by the manufacturer. If desired, the calibration can be improved by placing an additional object (of known dimensions) under the laser stripe, and requiring the sensor to produce the correct range estimate on the additional object’s surface. The y and z coordinates of the range pixels are quantized to 0.001 inch. While some noise is present in the z coordinate, our estimates show that the noise standard deviation is less than 0.01 inch. The shadowing effects mentioned earlier (as well as specular reflections, surface markings, or any other artifact which prevents the CCD camera from seeing a portion of the stripe)” yield pixels in the image where the range data is invalid. For convenience, we can view this as a flag image f= {fij} 2The red laser light will be absorbed by a green surface. Highly specular (mirrorlike) surfaces will usually not reflect much of the incident light back to the CCD camera. 73 Figure 3.5: Range and intensity images from the 100X scanner. whose value at a given location is TRUE if the pixel is valid and FALSE if the pixel’s 3D coordinates should be ignored. In addition to pixels which would lie on the object surface if they were visible, pixels lying on the ‘background’ of the image are also flagged as FALSE, because the range sensing platform is painted black, and it absorbs the incident laser radiation. In addition to the range and flag images, the 100X scanner produces an intensity image, which measures the amplitude (brightness) of the visible laser stripe at each range pixel. This image is registered with the range data, and has invalid pixels at the same image-plane locations as the range image R. We note that the intensity image is measuring ‘active intensity’, rather than the photometric behavior of the surface under ambient room lighting. Figure 3.5 shows range and intensity images for the objects shown in Figure 3.3. The intensity image is brightest at concavities in the scene, where more of the laser’s energy is concentrated. The intensity information may prove useful in image segmentation strategies employing fusion of the two images; we are using only the range data in this thesis. Like intensity images, range images contain edges; they form the boundaries be- tween smooth surfaces of an object, or the boundary between two objects, or the boundary between the object and the ‘background.’ Like most researchers, we iden- tify two types of edges in range imagery. A depth discontinuity is labeled a jump edge. Jump edges are not intrinsic properties of the object being sensed, but instead depend on the vieWpoint. Crease edges are caused by a discontinuity in the surface 74 Figure 3.6: Crease and jump edges. normal of the object, and are therefore intrinsic. These two edge types are illustrated in Figure 3.6. In Chapter 4 we will survey a few techniques for extracting these edges from range imagery. 
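As a concrete illustration of the two edge types, the C fragment below labels the boundary between two neighboring range pixels as a jump edge, a crease edge, or neither, using the depth gap and the angle between surface normals. It is only a sketch under assumed threshold parameters and naming; it is not one of the detectors surveyed in Chapter 4.

    #include <math.h>

    /* Edge label between two neighboring range pixels.
     * Illustrative sketch only: the function name, thresholds, and the
     * point/normal representation are assumptions, not the thesis code. */
    enum edge_label { NO_EDGE, JUMP_EDGE, CREASE_EDGE };

    enum edge_label classify_edge(const double p[3], const double q[3],
                                  const double np[3], const double nq[3],
                                  double jump_tol,    /* depth gap, inches  */
                                  double crease_tol)  /* normal angle, rad  */
    {
        /* Jump edge: the 3D distance between adjacent pixels is large.   */
        double dx = p[0] - q[0], dy = p[1] - q[1], dz = p[2] - q[2];
        if (sqrt(dx * dx + dy * dy + dz * dz) > jump_tol)
            return JUMP_EDGE;

        /* Crease edge: the unit surface normals disagree by a large angle. */
        double dot = np[0] * nq[0] + np[1] * nq[1] + np[2] * nq[2];
        if (dot >  1.0) dot =  1.0;          /* guard the acos() domain */
        if (dot < -1.0) dot = -1.0;
        if (acos(dot) > crease_tol)
            return CREASE_EDGE;

        return NO_EDGE;
    }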
It should be noted that higher-order edges do exist; for example, a curvature edge is present where the second derivative is discontinuous. However, difficulties in reliable extraction of second-order properties from range data have limited the use of higher-order edges in scene analysis.

Figure 3.7 shows real range images of the twenty objects in our CAD model database. The models are described in Section 2.5. Corresponding pseudo-intensity images appear in Figure 3.8. Synthetic range and pseudo-intensity images of the models appeared in Figures 2.20 and 2.21, respectively. Shadowing and occasional 'drop-outs' (isolated missing pixels) are visible in the real images.

Figure 3.7: Real range images of twenty database objects.

Figure 3.8: Pseudo-intensity images of twenty database objects.

3.4 Summary

This chapter has explored methodologies for 3D sensing, concentrating on active techniques (radar and structured light). We presented the principles of operation of the 100X 3D scanner used in this dissertation. While many existing range sensors are too slow, too expensive, or too difficult to manufacture to make them attractive in industrial environments, the technology is constantly improving and applications well-suited to range data abound [68]. Improvements in sensing technology are needed to make range sensors more attractive to manufacturing; in addition, strategies for low-level and high-level processing of range data are also vital if 3D sensing is to assume a more prominent role in industrial inspection.

Chapter 4
Segmentation and Surface Classification

In this chapter we describe procedures which are applied to range images to extract surfaces and edges. Two processes are involved. In the segmentation process, the pixels of the range image are grouped into regions called segments. Ideally, we wish to have one image segment for each connected portion of a smooth (at least C¹-continuous) surface on the object. We will occasionally use the term patch as a synonym for 'segment'. Segmentation is accomplished by clustering of image pixel coordinates and associated surface normals, followed by cleaning and merging procedures. In the classification process, we attempt to characterize the object surface producing an image segment. We currently classify patches as planar, spherical, or cylindrical. These classifications are based on linear and nonlinear fitting techniques and curvature estimators.

The chapter begins with a survey of previous research on range image segmentation. Then, the procedures we employ in our research are explained and demonstrated on two sample images.

4.1 Survey

Range image segmentation and classification are popular topics of research. This section surveys some of the recent directions in the area.

4.1.1 Superquadric Classification

In Chapter 2, we introduced the superquadric volumetric primitives. Solina and Bajcsy [103], Ferrie et al. [38], Gupta et al. [48], Allen and Michaelman [2], and Boult and Gross [17] have all addressed the problems in superquadric fitting. Essentially, the problem is a straightforward application of nonlinear optimization. The goal is to fit a single (possibly deformed) superquadric model to a set of n 3D points {(x_i, y_i, z_i) | 1 ≤ i ≤ n}.
A nondeformed superquadric in general position has eleven parameters: two shape parameters (ε1, ε2), a scale factor for each axis (a1, a2, a3), a rigid rotation expressed as three Euler angles (φ, θ, ψ), and a translation vector (Δx, Δy, Δz). If global deformations are incorporated into the fitting process, the number of parameters will be larger. The Levenberg-Marquardt nonlinear optimization technique [101] seems to be the method of choice for performing the fitting. The superquadric model is expressed as an implicit function of the (x, y, z) coordinates and the model parameters, defined to take the value zero when a point lying on the superquadric is plugged into the function with the correct parameters. Like most nonlinear optimization strategies, Levenberg-Marquardt requires an initial estimate of the model parameters. Much of the recent research cited above has dealt with the merits of various objective functions, and methods for evaluating the quality of a superquadric fit. Often, the range data is assumed to be presegmented; to our knowledge, only Ferrie et al. have placed model fitting into the broader context of segmentation.

4.1.2 Curvature-Based Segmentation

Segmentation procedures based on qualitative interpretations of surface curvatures have been used extensively over the last several years [10, 36, 100, 117, 122]. Our detailed discussion of curvature extraction appears later in this chapter. For the moment, we will give an intuitive description. Principal curvatures are intrinsic quantities defined at each point p on a surface. There are two directions tangent to the surface along which the surface curves the most or least. On nondegenerate surfaces, these directions are mutually orthogonal; they are known as principal directions, and the corresponding curvatures are called principal curvatures. On a cylindrical surface, for example, one principal curvature is zero, and the corresponding direction is parallel to the cylinder's axis. The other principal curvature direction is perpendicular to the axis, and the curvature value is the inverse of the cylinder's radius. The signs of these curvatures depend on our choice of coordinates. A negative principal curvature implies that the surface is curving 'downward' with respect to the coordinate frame; conversely, a positive curvature value means that the surface is curving 'upward' in the associated direction.

Most qualitative segmentation schemes use mean and Gaussian curvatures rather than the two principal curvatures. Mean curvature is the average of the two principal values, and Gaussian curvature is their product. Qualitative segmentation can be performed using the signs of mean and Gaussian curvature. Table 4.1 shows the eight valid classifications for a point based on the two signs. Most strategies segment an image by classifying each pixel as one of the eight types. Connected regions of the same type are identified as primitive image surfaces.

Table 4.1: Segmentation from signs of mean (H) and Gaussian (K) curvatures (from [10]).

             H < 0           H = 0              H > 0
    K < 0    saddle ridge    minimal surface    saddle valley
    K = 0    ridge           plane              valley
    K > 0    peak            impossible         pit

Fan et al. [36] use an edge-based strategy. Directional curvature estimates taken from four directions are combined to obtain principal curvatures and directions. Extrema and zero-crossings in principal curvature locate candidate edge points, which are then grouped into closed contours, producing an image segmentation.
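The pixel labeling implied by Table 4.1 amounts to a sign test on H and K. The C sketch below returns one of the eight surface labels given curvature estimates and a zero tolerance; the tolerance value and all names are illustrative assumptions, not part of the cited work.

    #include <math.h>

    /* Qualitative surface labels from the signs of mean (H) and Gaussian (K)
     * curvature, following Table 4.1.  eps is the tolerance below which a
     * curvature estimate is treated as zero; it must be tuned to the noise
     * level of the estimates. */
    enum hk_label { PEAK, PIT, RIDGE, VALLEY, SADDLE_RIDGE, SADDLE_VALLEY,
                    FLAT, MINIMAL_SURFACE, HK_IMPOSSIBLE };

    enum hk_label hk_classify(double H, double K, double eps)
    {
        int hs = (fabs(H) < eps) ? 0 : (H < 0 ? -1 : 1);
        int ks = (fabs(K) < eps) ? 0 : (K < 0 ? -1 : 1);

        if (ks < 0)                      /* K < 0 row of Table 4.1 */
            return (hs < 0) ? SADDLE_RIDGE :
                   (hs > 0) ? SADDLE_VALLEY : MINIMAL_SURFACE;
        if (ks == 0)                     /* K = 0 row */
            return (hs < 0) ? RIDGE :
                   (hs > 0) ? VALLEY : FLAT;
        /* K > 0 row */
        return (hs < 0) ? PEAK :
               (hs > 0) ? PIT : HK_IMPOSSIBLE;
    }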
4.1.3 Edge Detection in Range Imagery In addition to the edge-based segmentation scheme of Fan et al. [36] (which used edge pixels suggested by extrema or zero-crossings in curvature), other general edge- finders suitable for range imagery have been proposed. The two edge types most often considered in range imagery are the jump and crease edges. A jump edge occurs at a point of range discontinuity, and jump edge pixels can be detected using any of the operators developed for edge detection in intensity imagery, such as the Sobel or Prewitt operators [3]. A crease edge occurs where the surface normal is discontinuous, and techniques from intensity image processing are not directly applicable. Inokuchi et al. [63] examine a ring of pixels centered at the image point of interest and classify the point as a jump edge, convex (roof) edge, concave (valley) edge, or non-edge point, by examining the Fourier spectrum of the depth values in that ring (the ring of depth values is viewed as a periodic 1D signal). Their detector was not tested with curved surfaces, however, and even with FFT hardware, the edge—detection process 81 can be time-consuming, since the Fourier transform must be computed for every pixel in the image. . . Mitiche and Aggarwal [80] developed a statistical test for crease edges. At a range pixel of interest, several partitionings of its neighborhood are considered. A partitioning splits the neighborhood into half-neighborhoods, and a planar surface is fit to each ‘side.’ Under appropriate assumptions, the difference in slope between the two planes in a ‘good’ partitioning serves as a test statistic. Thresholds on the statistic produce a decision procedure which can be applied at each range pixel, producing an edge or no-edge labeling of the image. The primary drawback of edge-based segmentation is inevitable fragmentation of the edges. If the edges do not form a closed network, they must be linked together using heuristic techniques. For this reason, edge-based range image segmentation has not received as much attention from researchers as region-based approaches. Nadabar and Jain [83] have combined our local feature clustering approach to range image segmentation with an edge-based segmentation strategy using a Markov random field model for the edge process, with good results. 4.2 Initial segmentation: Local feature clustering A functional block diagram of the range image segmentation and region classification process appears in Figure 4.1. Our technique for obtaining a preliminary segmen- tation was used by Hoffman and Jain [57] to segment images obtained from an AM radar range sensor. Surface normals are estimated at each pixel where the range value is valid (see Section 3.3) using principal components analysis; three components of the normal vector (n2, ny, nz), together with the (2:, y, 2) position of the point, produces a six-dimensional vector at each pixel in the image. A sampling of these vectors is input to a squared-error clustering procedure. The output of the clustering step is a preliminary labeling of image regions. The clustering technique usually oversegments the input image (i.e., it produces more distinct image patches than there are smooth surfaces in the scene); this is remedied through application of a region merging oper- ation. This type of procedure is often referred to in the literature as split-and-merge segmentation. After the preliminary segmentation, a surface classification is obtained for each 82 connected surface patch. 
connected surface patch. Planar faces are extracted first, using an error measure calculated from a principal components planar fit to the 3D points in the patch. Nonplanar surfaces are further subclassified using estimated surface curvatures. At each surface point, we estimate minimum and maximum curvature and their associated directions. The surface is then classified as being spherical, cylindrical, or 'unknown' using curvature values and directions. For each type of surface, the appropriate geometric parameters (e.g. radius, orientation) are estimated. If adjacent image patches are found to be of the same surface type with similar parameters, they are merged at this time. Finally, the estimated information for each patch is represented as a number of LISP S-expressions and serves as input to the recognition system.

The following notation will prove useful throughout the chapter.

• The m x n range image produced by the 100X scanner is denoted as R:

      R = {r_ij = [x_ij y_ij z_ij] | 1 ≤ i ≤ m, 1 ≤ j ≤ n}.

  Note that the pixel's coordinates form a row vector. We will refer to the elements r_ij of the image as pixels (or range pixels); it is important to remember that a range pixel has three coordinates. Henceforth, we will not show the bounds on the row and column indices.

• In Chapter 3, we noted that not all of the mn pixels in the image contain valid data, and defined a Boolean flag image

      f = {f_ij},

  in which a TRUE value for f_ij means that the corresponding range pixel r_ij contains a valid 3D measurement, and a FALSE value indicates that the pixel should be ignored in further processing.

• Surface normals are estimated at each pixel r_ij which is itself valid and has at least two valid neighboring pixels in its 5 x 5 neighborhood. This produces a normal image

      N = {n_ij = [nx_ij ny_ij nz_ij]},

  where nx_ij, ny_ij, and nz_ij denote the components of the surface normal along the x, y, and z coordinate axes, respectively.

• Our clustering-based segmentation technique uses both the pixel and the corresponding surface normal to assign a segment label. It is convenient to define a composite image

      C = {c_ij = [r_ij n_ij]}

  with six measurements at each pixel.

• One of the outputs of the segmentation and classification procedures is a labeling of the input image R:

      £ = {l_ij}.

  The labels are nominal-valued and take integer values for the sake of convenience. By convention, l_ij is zero if the flag value f_ij is FALSE. Intermediate labelings produced during processing will be denoted £1, £2, etc.

Figure 4.1: A functional block diagram of the segmentation and classification procedure (range image, surface normal estimation, clustering, minimum-distance labeling, size and connectivity filtering, domain-independent merging, patch classification, and description). Dashed boxes indicate data structures; solid boxes indicate processing steps.

The goal of segmentation is to obtain a labeling in which regions correspond with the connected 'faces' of the objects in the scene. In our work, objects are composed of smooth surfaces joined at crease edges (where the directional derivatives are discontinuous). The faces, then, are the individual patches; we wish the labels to take on the same value for all the pixels belonging to the patch, and different values for pixels outside it. Figure 4.2 shows range and pseudo-intensity images of a scene containing two of our database objects. Later figures will show the intermediate and final results produced from the range image.
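One way to carry the notation above in an implementation is sketched below in C. The structure and field names are illustrative assumptions only; they mirror the range, flag, normal, composite, and label images defined in this section, not the data structures of the thesis code.

    /* One range pixel r_ij and one unit surface normal n_ij. */
    typedef struct { double x, y, z; }    point3;
    typedef struct { double nx, ny, nz; } normal3;

    typedef struct {
        int      rows, cols;   /* m and n                                     */
        point3  *r;            /* range image R, row-major, rows*cols entries */
        char    *valid;        /* flag image f: nonzero where r_ij is valid   */
        normal3 *n;            /* normal image N (undefined where invalid)    */
        int     *label;        /* labeling: 0 wherever f_ij is FALSE          */
    } range_scene;

    /* Convenience accessor for the composite 6-vector c_ij = [r_ij n_ij]. */
    static void composite_vector(const range_scene *s, int i, int j, double c[6])
    {
        const point3  *p = &s->r[i * s->cols + j];
        const normal3 *q = &s->n[i * s->cols + j];
        c[0] = p->x;  c[1] = p->y;  c[2] = p->z;
        c[3] = q->nx; c[4] = q->ny; c[5] = q->nz;
    }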
4.2.1 Surface Normal Estimation Local surface orientation has been a popular image feature in range image analysis, and some attention has been paid to the problem of reliably estimating the normal image N from the range image 72. The input depth map 'R. provides positional data on the scene surfaces, and N provides estimates of the surface orientation at those points. Orientation is defined locally, but is not intrinsic, since rotation of the object with respect to a fixed coordinate frame rotates the normal vectors as well. Most techniques for estimating surface normals employ the range data in a small neighborhood of the pixel of interest. Furthermore, these techniques make independent estimates at each image pixel for which the normal is desired. We broadly classify normal estimators as analytic or discrete [41]: 85 (a) (b) Figure 4.2: Range (a) and pseudo-intensity (b) images of a scene containing two objects. 0 Analytic approaches fit a plane to the image data in a neighborhood of the pixel of interest and extract the surface normal from the parameters of the fitted plane. 0 Discrete techniques use subsets of the neighboring image points to obtain sev- eral estimates of the surface normal and summarize them, producing a single estimate. Regression-based analytic techniques have been the most popular in the literature, and our method for surface normal estimation fits in this category. As an example of a competing discrete technique, consider the range point r,-,- and the eight pixels surrounding it, forming a 3 x 3 block. Twenty-eight estimates of the normal at n,- 8 can be calculated by choosing all 2 pairs of points from the eight neighbors of r,-,- and finding the normal to the triangle formed by the pair and r,-,-. Presumably, these estimates would be averaged to produce a final value. Heuristics to reduce the computation time and improve the numerical stability of the result would normally be applied. However, a desire for least-squares optimality of the normal vector’s defining plane leads one directly to regression techniques. The regression estimator used to obtain the surface normal should reflect the type 86 and relative power of noise source (or sources) which contaminate the range observa- tions. The approach taken by many researchers is to assume that the measurement error in the range measurement only affects the z coordinate, and that the sampling grid coordinates .r and y are error-free. Under this assumption, a linear model for depth at the point of interest r,-,- = [351' y,-,' 2.3] is z = a1: + by + c, where a, b, and c are the parameters to be estimated. Let 9 denote a neighborhood of the range point r,-,- containing k > 2 ‘valid’ range measurements. Provided that the I: points are not all collinear, the least-squares solution (6, b, 5) to the system of lc equations in 3 unknowns provides an estimate for the normal vector n,-,-, namely fig,- = (—&,—b,1) . \/1 + a2 + b? This technique fits a plane to the local surface measurements which minimizes the distance in the z-value only from the fitted plane to those measurements. Not all range sensors provide data which is guaranteed to be error-free in the a: and y-coordinates, however. Our sensor, for example, can contain systematic errors in the y coordinate due to misalignment of the laser stripe. The .r coordinate is obtained from the position of the motorized stage, and positional errors can be caused by the inherent nonrepeatability of the stepper motor and by quantization effects. 
In cases where errors in all three coordinates are to be considered, another least- squares procedure for fitting planes to the local range measurements can be defined using the techniques of principal component analysis [66, 70]. This method proceeds as follows. The 3 x 3 sample covariance matrix R = (rid) of the 3D points is obtained. This symmetric matrix has as the diagonal elements ru, r22 and r33, the sample variance of the r, y, and z coordinates, respectively. The off-diagonal elements are the sample covariances between pairs of coordinates. Using standard techniques [102], the three eigenpairs {(A,,v,~) |i = 1, . . . , 3} of R can be obtained (A,- is an eigenvalue of R, and v,- is the corresponding eigenvector). Without loss of generality, we will assume that the eigenpairs have been sorted so that A; 2 A; 2 A3 2 0. Since R is always nonnegative definite, all eigenvalues are nonnegative. 87 Figure 4.3: Surface normals estimated from the range image of Figure 4.2. The physical interpretation of the eigenvalues and eigenvectors leads to our esti- mate of the surface normal. When we normalize our set of (z,y,z) points by sub- tracting its sample mean, and form a new measurement by projecting the normalized data onto V], the sample variance of these projected points is A]. Similarly, A2 and A3 are the sample variances obtained when the data is projected onto the eigenvectors v2 and V3, respectively. When we know that our data points are obtained from a locally flat surface, v1 and v2 define a plane that approximates the underlying sur- face. Hence, the eigenvector associated with the smallest eigenvalue is an estimate of the surface normal, and A3 is the variance of the data with respect to that plane. Additionally, the principal components rotation is optimal in the sense that A3 is the minimum obtainable variance for a l-D projection of the data. Geometrically, the plane obtained from v; and v; is the minimum perpendicular error planar fit to the given 3D points. The above procedure is applied at each pixel in the range image; the output of this process is the normal image N. Figure 4.3 shows the surface normals estimated for the image of Figure 4.2. 4.2.2 Sampling and Clustering To obtain an initial labeling £1=iluj|15i5m,13j$n}, 88 we treat segmentation as an unsupervised statistical classification problem (a clus- tering problem) in a six-dimensional feature space. Cluster analysis is “the process of classifying [features] into subsets that have meaning in the context of a particqu problem” [66]. A full exploration of clustering techniques is beyond the scope of this dissertation; the interested reader is referred to Jain and Dubes [66] for a presentation of techniques with emphasis on computational issues. Our application of clustering attempts to identify connected image regions corre- sponding to smooth object surfaces from local geometric features (positions and local orientations). The input to a clustering procedure is an ordered set of feature vectors (typically called patterns), and the output of the procedure is one or more clusterings, each clustering being an ordered set of labels for the input patterns. The group of patterns having the same label is called a cluster. The patterns in our application are six-dimensional elements from the composite image C. This approach to segmenta- tion [64, 57], is often called local feature clustering, since the image features involved in clustering are local properties. 
The (x_ij, y_ij, z_ij) coordinates of image points are useful in clustering because they will help insure that the output clustering consists of connected surface patches. The surface normal components (nx_ij, ny_ij, nz_ij) attempt to place range pixels with different local orientations into separate clusters. Clearly, these six coordinates constitute a 'minimal' set of features upon which we can obtain a segmentation in which surface patches are connected and segment labels change across significant orientation changes in the image of the object's surface.

In principle, we could input the six-dimensional feature vector at every pixel in the composite image C into our clustering procedure, and obtain a label for every pixel directly. Unfortunately, clustering algorithms are expensive in terms of memory consumption and processing time; for this reason, we only cluster a subset of the pixels. This subset is obtained by taking pixels from every kth row and column, with k chosen as small as possible so that the total number of pixels selected from the image does not exceed 1000.

We now discuss the clustering algorithm (called CLUSTER) used for range image segmentation. CLUSTER is a two-phase clustering program which, given an a priori upper bound k_max on the number of clusters, generates a sequence of clusterings containing 1, 2, ..., k_max clusters. Like many clustering programs, CLUSTER assigns labels to the input patterns in an attempt to minimize the squared error for the clustering being constructed. For ease of notation, we denote the n six-dimensional feature vectors obtained from the composite image C as {c_1, ..., c_n}, with c_i = (c_i1, ..., c_i6) (it is useful to visualize this set of patterns as an n x 6 matrix). Now let the labeling obtained from a k-cluster solution be denoted {l_1, ..., l_n} (a label for each pattern), and let the k cluster centers¹ be m_1, ..., m_k, with m_i = (m_i1, ..., m_i6). The squared error is given by

    E² = Σ_{i=1}^{n} Σ_{j=1}^{6} (c_ij - m_{l_i, j})²,

where l_i is the class label obtained for the ith input pattern c_i.

The major steps in the CLUSTER program are as follows:

1. An initial set of k_max clusterings is generated, with the ith clustering containing i clusters. The first clustering places all patterns in one cluster. To obtain the second clustering, two cluster centers are defined. One is the centroid of the patterns, and the other is coincident with the pattern farthest from the first cluster center. Each pattern is then labeled with either 1 or 2 depending on which of the two cluster centers it is closest to. A clustering with j+1 clusters is obtained from a clustering with j clusters by choosing the pattern farthest from the current clustering as the new cluster center. The squared error is calculated for each clustering in this sequence.

2. A new set of clusterings is created, starting with the k_max-cluster solution obtained at the end of Step 1. If any two clusters in that solution can be merged to produce a (k_max - 1)-cluster solution with lower squared error than the previous (k_max - 1)-cluster solution, the resultant clustering replaces the previous (k_max - 1)-cluster solution. Then the (k_max - 1)-cluster clustering is examined in the same way to produce a (possibly new) (k_max - 2)-cluster clustering, and so forth.

Steps 1 and 2 are repeated in sequence until none of the k_max clusterings change during a pass. The resultant sequence of clusterings is then used as the output of the clustering program, along with the cluster centers and other statistics.

¹A cluster center is the centroid of the patterns assigned to that cluster.
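The two quantities that drive this procedure, the squared error of a labeling and the assignment of patterns to their nearest cluster centers, can be written compactly. The C sketch below is a simplified rendering of those computations with illustrative names and zero-based labels; it is not the CLUSTER program itself.

    #include <stddef.h>

    /* Squared error E^2 for a labeling of n six-dimensional patterns.
     * patterns is n x 6 (row-major), centers is k x 6, label[i] is in 0..k-1. */
    double squared_error(const double *patterns, const int *label,
                         const double *centers, size_t n)
    {
        double e2 = 0.0;
        for (size_t i = 0; i < n; i++) {
            const double *c = patterns + 6 * i;
            const double *m = centers  + 6 * (size_t)label[i];
            for (int j = 0; j < 6; j++) {
                double d = c[j] - m[j];
                e2 += d * d;
            }
        }
        return e2;
    }

    /* Assign each pattern the index of its nearest (Euclidean) cluster center;
     * the same rule is used later to label every pixel of the composite image. */
    void nearest_center_labels(const double *patterns, const double *centers,
                               int *label, size_t n, int k)
    {
        for (size_t i = 0; i < n; i++) {
            const double *c = patterns + 6 * i;
            int best = 0;
            double best_d2 = 1e300;
            for (int l = 0; l < k; l++) {
                const double *m = centers + 6 * (size_t)l;
                double d2 = 0.0;
                for (int j = 0; j < 6; j++) {
                    double d = c[j] - m[j];
                    d2 += d * d;
                }
                if (d2 < best_d2) { best_d2 = d2; best = l; }
            }
            label[i] = best;
        }
    }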
Once a set of clusterings (containing k_max solutions with 1, 2, ..., k_max clusters) has been obtained from CLUSTER, how is the 'best' clustering chosen? Like Hoffman and Jain, we examine a statistic M(k) developed by Coggins and Jain [32] which reflects the isolation and compactness of the clusters. We specify k_max = 16 to the CLUSTER program and choose the k-cluster solution, where k is as close to k_max as possible and M(k) is a local maximum. Choosing a clustering with 'too many' clusters is preferable to choosing a clustering with 'too few' clusters. Subsequent classification steps will not break image regions corresponding to a single cluster; they only merge adjacent regions with similar parameters. Therefore, we are able to accept clusterings which break connected surfaces into more than one cluster.

At this point, we have chosen one of the k_max clusterings and therefore, we have a label for each of the input patterns, which are a subset of the image pixels. In addition, we have the k cluster centers m_i, i = 1, ..., k. To construct the preliminary labeling £1, we use a minimum-distance classifier. The label l1,ij for composite image pixel c_ij is the index l for which the Euclidean distance

    D = ||m_l - c_ij||

is smallest; in other words, we assign each pixel the index of its closest (in six dimensions) cluster center.

4.3 Refined Segmentation: Cleaning and Merging

The initial labeling £1 is often contaminated by the following undesirable artifacts:

• Non-connected patches with the same label. Clustering attempts (through use of the (x, y, z) coordinates of the image points) to create patches which are compact, but connectivity in the image grid is not enforced.

• Patches with an extremely small number of pixels relative to the other image segments.

• Differently-labeled patches belonging to the same object surface. This artifact is known as oversegmentation, and is an unavoidable result of our choice of a large value for the number of clusters.

A recursive connected components labeling algorithm is employed to remedy the first problem. It takes the label image £1 as its input and produces a new labeling £2 as output; 8-connected regions in £2 are preserved, but the label values are made unique. After connected component labeling, the number of pixels in each segment is counted, and segments smaller than a threshold value are discarded. Classifications of such segments are frequently unreliable and offer little discriminatory power during the object recognition stage. The output of this cleaning step is a label image £3. The boundaries between patches are then examined to estimate the change in surface orientation across the boundary, and if the change (averaged over all boundary pixels) is below a threshold value, the patches are merged. Finally, the entire image is relabeled so that the segment numbers begin with 1 and end with k, the number of distinct segments. The output of the relabeling step is £4.

Figure 4.4: Segment labels from clustering and after merging.

Figure 4.4 shows the initial segmentation £1 of the range image in Figure 4.2, and the output £4 of the merging step. Pixels with the same gray level in the label images belong to the same region.

4.4 Surface Classification

Given a range image and a labeling, our next task is to characterize each segment as planar, spherical, cylindrical, or unknown, and estimate the parameters for the specific surface type.
Our recognition system relies on the geometric parameters of image segments to generate and validate hypotheses of identity and position for the objects in the image. We place the geometric parameters of our surface types into three classes: orientation parameters, location parameters, and intrinsic parameters.

• Orientation parameters convey information about the orientation of the surface under consideration. For example, the plane's orientation parameter is its normal vector. Spheres do not possess an orientation parameter. A cylinder's orientation parameter is a vector in the direction of its axis.

• Location parameters convey information about the position of the surface in 3D. For example, the plane's location parameter is a scalar, namely the minimum (perpendicular) distance to the origin from the plane. A sphere's location parameter is the 3D point at its center. For a cylinder, we take the location parameter to be any point on its axis.

• Intrinsic parameters are surface features that do not depend on the position or orientation of the object. Planes have no intrinsic parameters, since they are completely specified by their normal (an orientation parameter) and their distance from the origin (a location parameter). Spheres and cylinders have one intrinsic parameter, the radius.

Why should we estimate these particular parameters, and why are they important? The domain of surfaces which can be completely characterized by these three families of parameters includes all of the surfaces of revolution, including the three surface types we employ. So, these parameters are both intuitive and descriptive. This approach can be extended to other families of surfaces (for example, generalized cylinders with straight axes). It should be noted that more complicated surfaces will have more intrinsic parameters (and possibly more orientation parameters), providing additional constraints to be used at recognition time. In Chapter 5, we will construct predicates from orientation and intrinsic parameters; they will allow us to discard incorrect hypotheses of object identity. The location and orientation parameters can also be used to constrain the degrees of freedom in an estimate of the location and orientation of the object as a whole. These parameters are relatively easy to extract from the segmented range image using appropriate methods.

The hierarchical classification procedure is shown in Figure 4.5. A plane is fit to the 3D points in the segment, and accepted if the mean-squared error of the fit is below an empirically-determined threshold.

Figure 4.5: Decision tree for surface classification (Test 1 separates planes from curved patches; Test 2 separates cylinders from spheres; patches failing all fits are labeled unknown).

If the threshold is exceeded, however, we use the minimum curvature direction and corresponding maximum curvature value to estimate the orientation, location, and radius parameters for a hypothesized cylindrical patch. A squared-error statistic is also calculated for this hypothesis; if it exceeds a threshold, the cylindrical fit is rejected, and the patch is labeled as 'unknown'. If, after the planar hypothesis fails, the minimum curvature is not close to zero, we attempt a spherical fit using a nonlinear optimization algorithm, and again calculate a squared-error statistic. If the statistic lies below a threshold, the spherical fit is accepted; otherwise, the patch is labeled as 'unknown.' The classified image surfaces are expressed as LISP S-expressions for use by the recognition module (described in Chapter 5).
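For concreteness, the information carried for one classified patch can be grouped as in the illustrative C record below. The thesis emits the same information as LISP S-expressions; the structure and field names here are assumptions made for exposition, grouped by the orientation, location, and intrinsic parameter classes described above.

    /* One classified surface patch, as handed to the recognition module. */
    enum surface_type { SURF_PLANE, SURF_SPHERE, SURF_CYLINDER, SURF_UNKNOWN };

    struct patch_description {
        int               id;         /* segment label in the label image      */
        enum surface_type type;

        /* Orientation parameter: plane normal or cylinder axis direction.    */
        double orientation[3];        /* unused for spheres                    */

        /* Location parameter: plane distance to the origin, sphere center,   */
        /* or a point on the cylinder axis.                                    */
        double distance;              /* planes only                           */
        double point[3];              /* spheres and cylinders                 */

        /* Intrinsic parameter: radius for spheres and cylinders.             */
        double radius;

        double rms_error;             /* goodness of the accepted fit          */
        int    pixel_count;           /* number of pixels in the patch         */
    };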
4.4.1 Testing for Planes

The strategy used to estimate surface normals is again used here; the primary difference between these two applications of the principal components fit is the number of points used in the fit. Earlier, we used points in a small neighborhood (5 x 5) around a range pixel of interest. Here, we use all of the range pixels in the image segment (typically several hundred 3D points). Given the set of 3D points, we find the sample covariance matrix R as before, and extract its eigenvectors and corresponding eigenvalues. The eigenvector v3 = (v31, v32, v33) corresponding to the smallest eigenvalue λ3 is normal to the best-fitting plane (in a squared-perpendicular-distance sense). The coefficients v3i provide the orientation parameter for the plane. We then estimate the location parameter by finding the estimated minimum distance between the plane and the origin. The implicit equation of the fitted plane is given by

    v31 x + v32 y + v33 z + d = 0,                                   (4.1)

where d is the unknown location parameter to be determined. The estimate d̂ is obtained by plugging the centroid (x̄, ȳ, z̄) of the given 3D points into Equation (4.1), producing

    d̂ = -(v31 x̄ + v32 ȳ + v33 z̄).

The validity of the planar fit is tested by comparing the RMS (root mean-square) error of fit E_rms for the plane to a predetermined threshold T_plane. E_rms is given by

    E_rms = sqrt( (1/n) Σ_{i=1}^{n} (v31 x_i + v32 y_i + v33 z_i + d̂)² ).

We do not have to calculate E_rms explicitly; it is simply √λ3. Note that the units of E_rms are length (in our case, inches). Our first surface classification test, then, is

Test 1: IF E_rms < T_plane, classify the surface as planar with the estimated orientation and location parameters.

In our experiments, T_plane was chosen empirically. If the noise contaminating the range data follows a multivariate normal distribution with a diagonal covariance matrix, a χ² test for planarity can be developed [39]. However, the multivariate normal assumption does not hold for our range data, and since the threshold is dimensioned as length, we obtain empirical thresholds from a priori assumptions about the type and parameters of the surfaces to be encountered. The threshold value used in our experiments was T_plane = 0.05 inch. This test would classify very gently curved surfaces (say, spheres with a radius of 20 inches) as planar; but if we employ realistic assumptions about the surfaces to be expected, we can 'tune' this threshold (and the others used in the segmentation procedure) for the expected set of objects.

4.4.2 Estimating Curvature

The remaining tests for surface classification use surface curvatures and their associated directions to estimate the parameters of the various surface types. This section discusses our approach to curvature estimation; a more detailed discussion of several methods appears in [41]. Curvature estimation has been used extensively in range image analysis; most researchers have used curvature signs to make qualitative decisions about surface type [10, 36, 122, 64]. Only recently have researchers attempted to infer quantitative properties (specifically, the geometric parameters of quadrics) from curvature [14]. Quantitative interpretations allow for more detailed classifications of surfaces; however, the noisiness of estimated curvatures makes contamination of parameter estimates more likely.

Surface curvatures are intrinsic properties defined at a point on a C²-continuous surface.
Focusing on an interior point p, there exists a tangent plane Tp and an infinite number of orthonormal basis vector pairs for T p. Figure 4.6 shows a surface, the global coordinate system, the tangent plane at a point, and one choice of an orthogonal basis for the tangent plane. How does the surface curve at p when we travel in a certain direction in Tp? The direction of travel defines a curve embedded in the surface. Except in cases where the direction of travel does not affect the surface curvature (e.g. a plane, where curvature is everywhere zero, or a sphere, where the curvature is everywhere the inverse of the radius), there is an unique pair of directions, orthogonal in the tangent plane, for which the surface curvature attains a maximum in one direction and a minimum in the other. These directions are called principal directions, and their associated surface curvatures are the principal curvatures; they will be denoted vmimvmz, and k,,,.-,,, km“, respectively. The surface types used in this work have some interesting curvature properties. As mentioned above, the principal curvatures of planar surfaces are everywhere zero, 96 Figure 4.6: Sensor and local coordinate systems. and the direction vectors are not uniquely defined. On spherical surfaces, the curva- tures are equal to the inverse of the sphere’s radius, and again, the directions are not uniquely defined. On a cylindrical surface, however, the directions are uniquely de- fined, and useful for parameter estimation. The absolute maximum curvature is equal to the inverse of the cylinder’s radius, and the corresponding direction is perpendic- ular to the cylinder’s axis. The minimum curvature is zero, and the corresponding direction is parallel to the axis. As a matter of fact, all surfaces of revolution have a similar property; one principal direction is always perpendicular to the axis (a parallel of the surface), and the other direction is always orthogonal to the axis (a meridian of the surface). Our strategy for curvature estimation is an analytic technique [41]: at each pixel of interest p, curvature is estimated by 0 extracting the neighborhood 9,, of p, o centering the data in the neighborhood to place p at the origin, 0 estimating the tangent plane T p, a fitting a bicubic surface to p and its neighbors (using the local coordinate system defined by Tp and the surface normal), and o extracting the principal curvatures and directions from the coefficients of the fitted surface. We now explain the details of each step. Extraction of the neighborhood 0;, of the range pixel p is easily performed since the image is in ‘raster-scan’ format; we can simply create a list of the 3D coordinates 97 of the pixels in a I: x k neighborhood (k odd) centered at p. The size k governs the accuracy of the curvature estimate. First of all, we require 1: > 3 to ensure that the system of equations solved to fit the bicubic surface is overdetermined. In our work, we choose I: = 9. Small values of I: involve fewer pixels in the estimates, and variation due to measurement errors is more pronounced. Larger neighborhoods perform more implicit smoothing of the range data as part of the regression process, and variation of the curvature values (and their associated directions) due to noise will be reduced. A large neighborhood size will also reduce the number of estimates produced from the segment, because we insist that 0;, be a full neighborhood, i.e. 
Extraction of the neighborhood \Omega_p of the range pixel p is easily performed since the image is in 'raster-scan' format; we can simply create a list of the 3D coordinates of the pixels in a k x k neighborhood (k odd) centered at p. The size k governs the accuracy of the curvature estimate. First of all, we require k > 3 to ensure that the system of equations solved to fit the bicubic surface is overdetermined. In our work, we choose k = 9. Small values of k involve fewer pixels in the estimates, and variation due to measurement errors is more pronounced. Larger neighborhoods perform more implicit smoothing of the range data as part of the regression process, and variation of the curvature values (and their associated directions) due to noise will be reduced. A large neighborhood size will also reduce the number of estimates produced from the segment, because we insist that \Omega_p be a full neighborhood, i.e. if any of the range pixels in \Omega_p are invalid due to shadowing or edge effects, then a curvature estimate will not be produced at that pixel. Through this restriction, we are ensuring that estimates are produced only in the interior of the segment. Since we are using these estimates to find parameters of curved image surfaces, the interior values, in our opinion, are the most reliable; curvatures estimated near physical or shadow edges should not be allowed to contaminate the ensemble of estimates within the image surfaces.

A centering step is applied to the list of 3D points in order to place the pixel of interest p at the origin. This centering allows partial derivatives of the local surface fit (see below) to be computed more efficiently. We then obtain a local coordinate system (x', y', z') for the data using the principal components projection described earlier to estimate surface normals. This procedure essentially estimates the tangent plane T_p and projects the data into a new coordinate system defined by the three eigenvectors extracted as part of the principal components process. The subsequent regression steps perform their fits with respect to this new coordinate system. This projection is illustrated in Figure 4.6, which shows a curved surface with range measurements made in a sensor coordinate frame (x, y, z), the tangent plane T_p at a range pixel of interest p, and the basis vectors for the local coordinate frame (x', y', z'). We choose the local coordinate system so that the x' and y' direction vectors (denoted v_{x'} and v_{y'}) span the tangent plane, and z' is the direction normal to T_p. Note that the center pixel remains at (0, 0, 0) in this new coordinate system, since the transformation is a rigid rotation.

At this point we have a list of k^2 3D points {(x'_i, y'_i, z'_i)} in the local coordinate frame defined at p = (0, 0, 0). We then solve the linear least squares problem associated with the bicubic surface fit

    z' = f(x', y') = a_1 x'^3 + a_2 x'^2 y' + a_3 x' y'^2 + a_4 y'^3 + a_5 x'^2 + a_6 x' y' + a_7 y'^2 + a_8 x' + a_9 y' + a_{10}.

When we plug each of the k^2 3D points into this equation, we obtain a system of k^2 equations in the 10 unknowns a_i. The least-squares solution can be obtained using standard techniques; we employ routines in the LINPACK linear algebra software library [33]. The surface fit to the neighborhood of p is commonly called a graph surface, since it is the graph of the function f(x', y').

The mean and Gaussian curvatures are related to the two invariants of the Hessian matrix of the graph surface function f [100]:

    \mathcal{H} = [ f_{x'x'}  f_{x'y'} ; f_{y'x'}  f_{y'y'} ] = [ 2a  2b ; 2b  2c ].

Gaussian curvature K is the determinant of \mathcal{H}, and the mean curvature H is one-half the trace of \mathcal{H}. The two principal curvatures are obtained from H and K by

    k_1 = H + \sqrt{H^2 - K},    (4.2)
    k_2 = H - \sqrt{H^2 - K}.

The required partial derivatives are easily obtained from the graph function f; their values at p = (0, 0, 0) are given by

    f_{x'x'}(0, 0) = 2a_5,    (4.3)
    f_{y'y'}(0, 0) = 2a_7,
    f_{x'y'}(0, 0) = a_6,
    f_{x'}(0, 0) = a_8,
    f_{y'}(0, 0) = a_9.

In addition to the curvatures themselves, we are interested in their direction vectors. These vectors, expressed as linear combinations of the x' and y' basis vectors v_{x'} and v_{y'}, are given by

    v_1 = ((a - c) + s) v_{x'} + 2b v_{y'}        if a >= c,
        = 2b v_{x'} + (s - (a - c)) v_{y'}        otherwise,    (4.4)

    v_2 = -2b v_{x'} + ((a - c) + s) v_{y'}       if a >= c,
        = ((a - c) - s) v_{x'} + 2b v_{y'}        otherwise,

where s = \sqrt{(a - c)^2 + 4b^2} and a, b, c are given by the Hessian entries above (a = a_5, b = a_6/2, c = a_7).
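The curvature-extraction step can be summarized in a short Python/NumPy sketch; it assumes the neighborhood has already been centered and rotated into the local (x', y', z') frame, and follows Equations (4.2)-(4.4) above. It is not the thesis implementation (which uses C and LINPACK), and the names are placeholders.

    import numpy as np

    def principal_curvatures(local_pts):
        """Principal curvatures/directions from a bicubic graph-surface fit.

        local_pts : (k*k, 3) neighborhood in the local frame, pixel of interest at the origin.
        Returns (k1, k2, d1, d2); directions are unit vectors in the local frame.
        """
        x, y, z = local_pts[:, 0], local_pts[:, 1], local_pts[:, 2]
        # Design matrix for the ten monomials a_1 x^3 + a_2 x^2 y + ... + a_10.
        A = np.column_stack([x**3, x**2*y, x*y**2, y**3,
                             x**2, x*y, y**2, x, y, np.ones_like(x)])
        coef, *_ = np.linalg.lstsq(A, z, rcond=None)
        a, b, c = coef[4], coef[5] / 2.0, coef[6]   # f_xx = 2a, f_xy = 2b, f_yy = 2c at the origin
        H = a + c                                   # mean curvature (half-trace of the Hessian)
        s = np.sqrt((a - c) ** 2 + 4 * b * b)       # equals sqrt(H^2 - K)
        k1, k2 = H + s, H - s                       # signed principal curvatures, Eq. (4.2)
        if s < 1e-12:                               # umbilic: directions not uniquely defined
            d1, d2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
        elif a >= c:                                # Eq. (4.4), first case
            d1 = np.array([(a - c) + s, 2 * b, 0.0])
            d2 = np.array([-2 * b, (a - c) + s, 0.0])
        else:                                       # Eq. (4.4), second case
            d1 = np.array([2 * b, s - (a - c), 0.0])
            d2 = np.array([(a - c) - s, 2 * b, 0.0])
        return k1, k2, d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)

The case split in Equation (4.4), reproduced in the sketch, avoids a degenerate (zero-length) direction vector when b is small; at a true umbilic the directions are arbitrary.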
Figure 4.7: Minimum (a) and maximum (b) curvature directions for the range image of Figure 4.2.

At this point we must clarify our notion of minimum and maximum curvatures. The sign of a principal curvature indicates the 'direction of bending' of a surface, but does not influence its geometric parameters. For example, the principal curvature of a cylinder in a direction perpendicular to the axis (and in the tangent plane) can be either the inverse of the cylinder's radius or the negative of the inverse of the radius, depending on whether we are estimating curvature on the 'inside' or the 'outside', respectively. If we decide that the surface is indeed cylindrical, all we need to examine are the absolute values of the principal curvatures; the signs are discarded. For that reason, we will define the maximum principal curvature k_max as the maximum absolute principal curvature. Similarly, k_min, the minimum principal curvature, will refer to the minimum absolute principal curvature. This convention allows us to test for cylinders versus spheres by examining the minimum curvature, without regard to the effects of the curvature signs. For consistency, we will denote the principal directions corresponding to k_min and k_max as v_min and v_max, respectively. Figure 4.7 shows minimum and maximum curvature directions for the image of Figure 4.2.

4.4.3 Testing and Parameter Estimation for Cylinders

Recall that Test 1 was used to decide whether a segment in the range image was planar or nonplanar. If the results of this test suggest that the surface is nonplanar, we next examine the ensemble of curvature estimates and directions to classify the segment as cylindrical, spherical, or 'unknown'.

How do cylinders differ from spheres in terms of curvature? If one travels on a cylinder in a direction parallel to the axis, the surface does not curve. Formally, the minimum (absolute) principal curvature is zero, and the corresponding direction is parallel to the axis. Given the normal and the minimum curvature direction, the maximum curvature direction is uniquely determined; the corresponding principal curvature is the inverse of the radius. In contrast, all points on a spherical surface are umbilics, where the principal curvatures are equal and the principal directions are not uniquely defined. Hence, a simple test for cylindrical versus spherical patches is to compare a central estimate of the minimum curvature against zero. When we are dealing with noisy data, the curvature estimates will never be exactly zero; therefore, an empirically-determined threshold is used.

In previous work [41], we employed the median of curvature estimates on a smooth patch; this was an attempt to avoid the effects of outlying estimates. Since a reasonably-sized segment in a segmented range image will have several hundred pixels, we will obtain several hundred curvature estimates. In our opinion, the median is preferable to the mean, and may be preferable to trimmed averages: an ensemble of curvature estimates obtained from a single smooth surface is usually contaminated with a large number of outlying estimates. All of the surface types considered in this thesis have constant principal curvatures, so the median has been the most reliable estimator of the principal curvature of the entire patch.

Test 2: IF the median minimum curvature k_min is less than T_cylinder, attempt to fit a cylinder to the surface; otherwise, attempt to fit a sphere.

In our experiments, T_cylinder is 0.15 inch^{-1}. In theory, this would allow us to classify spheres with radii up to 6.66 inches.

Assume that Test 2 is passed, and therefore we wish to fit a cylinder to the segment. We can obtain estimates of the orientation, location, and radius parameters by examining the minimum curvature direction v_min, the surface normal, and the maximum curvature estimate k_max, all estimated at the surface point producing the median minimum curvature estimate. Recall that the orientation parameter of a cylinder is a vector in the direction of the axis. This vector should be parallel to the minimum curvature direction estimates; therefore, we use v_min as an estimate of the axis direction. The radius of the cylinder is taken as r = 1/k_max.

Estimation of the location parameter (any point on the axis) presents a problem. Given a point p on the surface, the radius estimate r, and the surface normal n_p, a point on the axis could have either of the two values p +/- r n_p, depending on whether the surface is convex or concave. Here, we test both cases. The RMS error test described below is applied under each assumption; if the surface is indeed cylindrical, one of the two cases will produce a smaller error. We accept the 'better' solution, conditioned on its satisfying our threshold test. We have never encountered an image in which both hypotheses satisfy the RMS criterion.

To test the estimated cylindrical fit, we calculate the average sum of squared distances from the observed data points to the hypothesized cylinder whose parameters were given above. The error for each individual pixel is calculated as follows. The minimum distance between the hypothesized axis (determined by the location and orientation parameters) and the observed point is calculated; it is the length of the line segment joining the point and the line, and perpendicular to the line. The difference between this value and the predicted radius is the error. Our overall error of fit is the square root of the average squared error over all pixels.

Test 3: IF the error measure lies below T_cylinder^rms, classify the patch as a cylinder with the estimated orientation, location, and radius.

Recall that this test is applied twice: once on the hypothesis that the cylinder is 'negative' (the surface is concave as viewed), and once on the hypothesis that the cylinder is 'positive' (convex as viewed). If the RMS error computed for either of these hypotheses is below the threshold, the surface classification is accepted, with the parameters of the correct hypothesis. If neither error measure is below T_cylinder^rms, however, the surface is labeled as an 'unknown' type.
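The cylinder decision and fit (Tests 2 and 3) can be sketched as follows, again in Python/NumPy for illustration only. The per-pixel curvature estimates are assumed to come from the previous step; the RMS threshold value is a placeholder, since only the curvature threshold (0.15 inch^{-1}) is fixed above, and the names are not those of the actual implementation.

    import numpy as np

    def classify_cylinder(points, normals, k_min, k_max, v_min,
                          t_cyl_curv=0.15, t_cyl_rms=0.05):
        """Median-curvature test (Test 2) and cylinder fit/verification (Test 3).

        points, normals : (n, 3) segment points and unit surface normals.
        k_min, k_max    : (n,) absolute minimum/maximum principal curvature estimates (1/inch).
        v_min           : (n, 3) minimum-curvature directions in the sensor frame.
        Returns a dict with axis, radius, point-on-axis and RMS error, or None.
        """
        i = np.argsort(k_min)[len(k_min) // 2]         # pixel giving the median minimum curvature
        if k_min[i] >= t_cyl_curv:                     # Test 2 failed: caller should try a sphere
            return None
        axis = v_min[i] / np.linalg.norm(v_min[i])     # orientation parameter
        radius = 1.0 / k_max[i]                        # intrinsic parameter r = 1/k_max
        best = None
        for sign in (+1.0, -1.0):                      # convex and concave hypotheses p +/- r n_p
            p_axis = points[i] + sign * radius * normals[i]
            diff = points - p_axis
            dist = np.linalg.norm(np.cross(diff, axis), axis=1)   # point-to-axis distances
            rms = np.sqrt(np.mean((dist - radius) ** 2))
            if best is None or rms < best['rms']:
                best = {'axis': axis, 'radius': radius, 'point': p_axis, 'rms': rms}
        return best if best['rms'] < t_cyl_rms else None          # Test 3

Testing both signed hypotheses and keeping the smaller RMS error mirrors the convex/concave ambiguity discussed above.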
4.4.4 Testing and Parameter Estimation for Spheres

If the median minimum curvature test (Test 2) indicates that the surface is not 'locally flat' in one direction in the tangent plane, we attempt to fit a sphere to the 3D points in the segment. The technique used was described in [39]. Points (x, y, z) on a sphere of radius r and center location (x_0, y_0, z_0) satisfy

    f_sphere(x, y, z; r, x_0, y_0, z_0) = (x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2 - r^2 = 0.

This implicit form lends itself to solution via either linear or nonlinear least squares methods. We use a nonlinear technique here; it allows us to adjust the geometric parameters directly to optimize the fit to the data, whereas linear methods would adjust parameters which are a mixture of the geometric parameters. The error assigned to a point p = (x, y, z) with respect to the hypothesized sphere is the squared implicit residual

    E_p^2 = f_sphere^2(x, y, z; r, x_0, y_0, z_0),

which is small when p lies near the sphere. The parameter vector which minimizes \sum_p E_p^2 is obtained using the Levenberg-Marquardt nonlinear optimization technique [101, 92]. This method requires formulas for the partial derivatives of the objective function E_p^2 with respect to the geometric parameters. Like many nonlinear optimization methods, the Levenberg-Marquardt procedure is iterative, and requires an initial estimate of the parameter vector. As an initial estimate of the radius, we provide the inverse of the median minimum principal curvature. By projecting the corresponding point 'into' the sphere along the estimated surface normal, we can also obtain an estimate of the center location. However, the spherical objective function seems to be particularly easy to optimize (and indeed it has only one minimum), so any reasonable estimate would probably work as well; only the speed of convergence would be affected.

Twenty-five iterations through the optimization procedure have proven to be sufficient to accurately fit a sphere. In fact, the procedure terminates as soon as the overall error measure \sum_p E_p^2 has either dropped below a threshold value or changed very little between two iterations. After termination, we examine the overall error measure to judge the goodness of fit.

Test 4: IF \sum_p E_p^2 < T_sphere, classify the segment as a sphere with the estimated parameters.

If Test 4 fails, we label the segment as 'unknown' and do not attempt a further classification of that segment.
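A minimal sketch of the sphere fit follows, using SciPy's Levenberg-Marquardt solver in place of the custom routines referenced above. Note two deliberate simplifications that are ours, not the thesis's: the residual here is the geometric distance to the sphere rather than the implicit form f_sphere, and the threshold value is a placeholder.

    import numpy as np
    from scipy.optimize import least_squares

    def fit_sphere(points, k_min_med, p_med, n_med, t_sphere=0.05):
        """Nonlinear sphere fit and threshold test (a sketch of Test 4).

        points       : (n, 3) segment points.
        k_min_med    : median absolute minimum principal curvature of the segment.
        p_med, n_med : the point yielding that median estimate and its unit normal.
        Returns (center, radius) if the fit passes the threshold, else None.
        """
        r0 = 1.0 / k_min_med                     # initial radius: inverse median curvature
        c0 = p_med - r0 * n_med                  # project 'into' the sphere along the normal
                                                 # (sign flips for a concave spherical patch)
        def residuals(params):
            cx, cy, cz, r = params
            return np.linalg.norm(points - [cx, cy, cz], axis=1) - r

        sol = least_squares(residuals, x0=[*c0, r0], method='lm')   # Levenberg-Marquardt
        err = np.sqrt(np.mean(sol.fun ** 2))     # RMS residual at the solution
        return (sol.x[:3], sol.x[3]) if err < t_sphere else None

As noted above, the objective has a single minimum, so even a crude initial estimate should converge; the curvature-based initialization mainly speeds up convergence.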
4.5 Examples

The classification procedures were applied to the eight segments (see Figure 4.4) obtained from the range image of Figure 4.2. The LISP code created from the classification procedure appears in Figure A.2 of Section A.2 of the Appendix. Summarizing the output, we obtained

• A plane with implicit equation 0.174x + 0.121y + 0.977z - 1.299 = 0 and area of 4.75 square inches,
• A plane with implicit equation -0.612x + 0.426y + 0.666z + 1.536 = 0 and area of 2.30 square inches,
• A plane with implicit equation 0.170x + 0.122y + 0.978z + 0.291 = 0 and area of 3.82 square inches,
• A plane with implicit equation -0.775x - 0.591y + 0.223z - 1.482 = 0 and area of 4.47 square inches,
• A plane with implicit equation -0.934x + 0.207y + 0.289z - 1.346 = 0 and area of 2.63 square inches,
• A plane with implicit equation -0.930x + 0.209y + 0.301z - 1.346 = 0 and area of 0.93 square inches,
• A plane with implicit equation 0.303x - 0.068y + 0.950z - 0.384 = 0 and area of 0.04 square inches, and
• A cylinder with radius 0.660, direction vector (-0.921, 0.224, 0.319), a point (0.355, 0.534, -1.862) on the axis of symmetry, and area of 2.48 square inches.

A Sun 4/280 computer required 5 minutes of CPU time to produce the segmentation and classification output from the range image.

Figure 4.8 shows the range, pseudo-intensity, surface normal, initial segmentation, and final segmentation images for a scene with two objects (and significant occlusion). The LISP S-expressions generated from the classified image appear in Figure A.3 in Section A.2 of the Appendix.

Figure 4.8: Example of segmentation and classification procedures. (a) Range image. (b) Shaded image. (c) Surface normals. (d) Initial segmentation from clustering. (e) Final segmentation.

4.6 Summary

This chapter has described a technique for range image segmentation and classification. Segmentation is performed by a sequential split-and-merge procedure based on squared-error clustering.
Classification employs regression, curvature estimation, and nonlinear optimization. The output of the segmentation and classification procedure is a label image and a symbolic scene description (LISP S-expressions). We consider the segmentation procedure to be the weakest link in the overall object recognition and localization system. One major drawback is its inability to classify more complicated surfaces such as cones, other general surfaces of revolution, or developable surfaces (i.e. cylinders with non-elliptical cross sections). Our recog- nition module can handle many such surfaces as long as they can be characterized by orientation, location, and intrinsic parameters. However, robust detectors for such surfaces are difficult to develop. The other drawback of this segmentation and classification module is the lack of a feedback path from the model database. Once segments are identified, surfaces are classified using a data-driven approach. In the manufacturing environment, however, we can often state a priori that only a certain set of objects will be sensed. If we have geometric models for this allowable set of parts, a model-driven segmentation procedure can exploit the geometric knowledge in our manufacturing database to provide constraints on segmentation. The benefits of such feedback extend beyond classification, however. If we identify a scene surface as one of a small number of model surfaces from different objects, the very act of classification has suggested can- didate scene interpretations. We have implemented a very simple model database pruning procedure based on this idea in Chapter 5, but the propagation of interpre- tation constraints from classification to recognition (in the context of model-driven classification) may provide powerful new techniques for scene interpretation. Chapter 5 3D Object Recognition and Localization: The BONSAI System bon - sai \ (')b6n-'si, 'bon-., 'ban-, also ban-.2? \ n, pl bonsai [Jp] (ca. 1929): a potted plant (as a tree) dwarfed by special methods of culture; also: the art of growing such a plant. ——from Webster’s Ninth New Collegiate Dictionary 5.1 Introduction This chapter describes BONSAI, the recognition module of the model-based vision system developed in this thesis. Relational graphs extracted from CAD models are compared with the scene graph returned by the segmentation and classification mod- ules, with object identities and estimates of location and orientation (pose) as output. BONSAI recognizes objects by searching an Interpretation Tree (IT). The search paradigm maps correspondences between a subset of scene features and a subset of model features into a tree structure. Nodes in the IT represent interpretations I =< object, {(31, M1,), . . . , (51‘, M1,”, (R,t) >, containing one or more associations (5,, Mj) between single scene entities and single model entities‘. Not all associations between model and scene entities are possible; predicates applied to interpretations allow BONSAI to reject inconsistent hypotheses. Figure 5.1 illustrates the search process for a simple 2D object. The object model consists of several line segments and one circular arc, labeled M; through M3, which (together with their parameters and interrelationships) describe the silhouette of the lobj act identifies the object model under consideration, and R and t are estimates of the rotation and translation; see Section 5.3. 106 107 object. The scene description also consists of line segments and a circular arc, labeled S; through 53’. 
However, some model segments are missing and others have been partially occluded. Gray boxes mark tree nodes where one of four predicates (logical functions of an interpretation) indicate that the interpretation is invalid. When an interpretation fails one or more predicates, search continues with a different associa- tion. In the example, only one interpretation involving all the scene entities satisfies the predicates. In general, however, a successful interpretation need not associate every scene en- tity with a model entity; indeed, sometimes we must avoid interpreting a scene entity, because it lies on a different object than the one currently under consideration (e.g. in multi-object scenes). Scenes containing such ‘spurious’ entities are called cluttered; the presence of clutter has a dramatic impact on the computational complexity of IT search. Prior theoretical work [45, 46] has shown that in cluttered environments (where scene entities belonging to the objects of interest are mixed with entities from other objects), the combinatorics of the search process can make a naive implemen- tation of IT search impractical. BONSAI’s solution is to use the search procedure only long enough to obtain a minimal hypothesis of identity, namely one which pro- duces an unique pose estimate. This estimate is then fed to a cooperative module which attempts to verify the hypothesis by synthesizing an image of the object and a segmentation from the CAD database, and comparing the synthetic image to the input scene. If the input and synthetic images are similar, the synthetic segmentation accompanying the synthetic range image is compared against the scene segmentation, in order to extend the hypothesis (by adding additional associations) without going back to the search procedure. The extended hypothesis is then re-verified. In effect, the recognition module creates very small interpretation trees, only a few levels deep, preferring to spend time verifying candidate matches rather than searching deeper for associations (this observation inspired our choice of the name BONSAI for the system). The chapter is organized as follows. Section 5.2 surveys related work on 3D object recognition. The interpretation tree paradigm is also introduced (in a general form) there; Section 5.3 specializes the concept of the interpretation tree to our environment, ”We assume that the scene entities have been sorted; this ordering defines the sequence of inter- pretations. The sorting step can greatly influence the performance of the recognition procedure. We address this problem in a later section. 108 M4 “9D“ 86' M7 84 S /// S3 M1 32 Model Scene Predicates used in pruning: t: curve type (unary) 1: length (unary) o: orientation (binary) Figure 5.1: Example of IT search for a 2D object. 109 and gives a high-level description of BONSAI’s recognition approach. Section 5.3.7 discusses the system’s technique for selecting a subset of models from the database to attempt to match with the scene description. Section 5.3.8 details the steps in in- terpretation after the selection of a particular candidate model; particular attention is paid to the geometric predicates used to prune the search process. Once the inter- pretation has adequately constrained a pose transformation, the hypothesis is tested using synthetic imagery. Details are discussed in Section 5.5. Section 5.6 discusses interpretation tree search as sequential tests of hypotheses and eXplores error models. 
Simulations for estimating the average time complexity of search are also discussed. Remarks on timing, the potential for parallel implementation, and other summary comments appear in Section 5.7. Recognition results produced by this system are described in Chapter 6. 5.2 Survey This section describes several model-based object recognition systems which have ap- peared in the computer vision literature. Some of these systems have employed CAD models, and others use alternative representations. Table 5.1 contains a summary of each system in terms of database size (including the number of objects in the pub- lished papers), sensing modality, model source, matching strategy, allowable object surfaces, and the presence of occlusion in the scene. Our survey emphasizes complete systems and 3D-3D matching methods, and is by no means exhaustive; the reader should refer to Besl and Jain [9], and Brady et al. [18] for more complete surveys. The papers cited below also contain surveys of prior work. Making comparisons between systems in terms of “recognition power,” “speed,” or “accuracy” is a difficult task. Often, design assumptions are implied rather than stated explicitly, complexity analyses are avoided, and noise models are ignored or incompletely specified. Unlike many fields, the computer vision community has yet to agree on a database of realistic, representative images to test algorithms. While it is important to avoid straightjacketing the freedom of researchers to pursue uncon- ventional strategies in scene interpretation, the results they obtain must be placed in context in order to be useful. The development of standards for evaluation of image analysis systems will be a significant step toward wider acceptance of such systems outside the research community. 110 Table 5.1: Object Recognition Systems CAD-based vision systems vary widely. Some researchers (e.g. Brooks [21]) im- plemented their own geometric modeler and developed the model-based vision system on that platform. Other work [51, 79, 71] uses experimental modelers which are not commercial products. Systems based on experimental modelers can be specialized to take advantage of the unique features of that modeler, but such specialization limits the ease of technology transfer, since manufacturing enterprises do not (in general) use experimental modelers. In order for CAD-based vision systems to be accepted in the marketplace, they must adopt the prevailing standards for geometric representation. To our knowledge, our system is the only one which does so. One of the earliest model-based 3D object recognition systems was ACRONYM [21, 20]. This system laid the foundation for much of the subsequent work in 3D ob- ject recognition. In ACRONYM, objects are represented as a hierarchy of generalized cylinders, and models are constructed using a custom-built geometric modeler. In addition to models for specific objects, classes of models can be built from the indi- viduals. Relationships between class members are embedded in a restriction graph, and the object graph contains individual model primitives and relations between them. 111 Examples of classes and class members described in the papers are wide-bodied air- craft, with members 747 (further subclassified as 7473 and 747SP) and L—1011,.and electric motors, with members gas pump, industrial motor, and carbonator motor. Within-class variations are embodied in constraints on the parameters applicable to that class. 
Variations are of three types: size, structure (presence or absence of a subpart), and relationships (relative positioning of subparts). Generalized cylinders project as strips (‘ribbons’) in the 2D image plane, and their ends (assumed in ACRONYM to be circular) project as ellipses. A line finder is used to extract ribbons and ellipses from the image, and these entities serve as input to the image interpretation system. Predictions of image entities which are invariant over a large range of camera and model parameters are made and matched against the extracted image artifacts. These local matches are combined in a fashion which satisfies the existing model constraints, thus producing as refined an identification as possible, along with pose information. Besl and Jain [9] observe that this system can be misled if the low-level input (linear structures) is of consistently poor quality. Also, ACRONYM as described uses only edge information. It is possible that performance could be improved if predictions could be made about the allowable surface types in the input image. This might make it easier to incorporate range information into the sensed data used by ACRONYM. Ikeuchi [62] has proposed an off-line method for generation of an interpretation tree from CAD models of objects composed of planar and cylindrical faces. Objects are represented as a collection of prototypical views (equivalence classes), and small changes in pose within a view class are described as linear shape changes, in which the topology of a line drawing describing the object does not change. A nonlinear shape change occurs between two line drawings from different equivalence classes. Ikeuchi’s interpretation tree represents possible associations of model entities with scene entities. A fully-instantiated IT enumerates all possible assignments, and is computationally burdensome to construct. In Ikeuchi’s paper, the interpretation tree is constructed as follows. A set of sixty view directions corresponding to the patches on a tessellated sphere surrounding the object are defined. From each view direction a synthetic image of the object is generated. Those surface patches (cylindrical and planar) which satisfy some lighting and size constraints are labeled as observable for that view”. Shape groups are assembled by partitioning the sixty views into 3This system uses the photometric stereo technique to extract shape information; in order to be 112 classes using the information about the visible surface patches. The interpretation tree contains paths from the root (an object with no classified patches) through intermediate nodes which correspond to a partial match between some model patches and some image patches, to a leaf node which represents a consistent interpretation of a particular view. Objects in an unknown position are classified and their pose extracted using a two-phase approach. Features are computed from the sensed data (example features are inertia, shape, EGI, and curvature measures). Intermediate classifications are per- formed at each nonterminal node of the interpretation tree using one of these features. Features for the sensed object are compared with the same features calculated from the CAD model, and the most discriminatory feature (selected by hand) is used at each decision point. Once a terminal node has been reached, the object and its rough pose have been identified. 
The remaining processing determines the pose of the object using additional feature comparisons that constrain the three rotational degrees of freedom between the sensor and the sensed object. Examples of this technique are presented for several scenes, employing registered intensity, orientation (needle map), and derived range data. The technique seemed accurate in determining the pose of an object (no numerical figures were reported), but its power in discriminating between different objects was not demonstrated. Bolles et al. [15, 16] developed an industrial part recognition system known as 3DPO (3-Dimensional Part Orientation) for identifying and locating parts in a bin- picking application: several copies of a single part in a box are to be located and presumably removed by a robotic arm. The CAD model of the part contains volu- metric, surface, and edge information, as well as lists of adjacent object features like bounding edges and holes. Models are comprised of planar and cylindrical patches bounded by circular and linear edges. In addition to this information, the model also contains lists of feature types to be used in matching (like straight and curved edges), accompanied by instances of those feature types in the object. A polyhedral approx- imation of the object is also stored in the model, and is used primarily to generate range images from hypotheses of identity and pose. The recognition paradigm (as in Ikeuchi’s work [62]) is search-based. Here, the visible to the sensor, a surface must be illuminated by three light sources. Those surfaces which are shadowed during illumination by one or more of the sources are not detectable by the stereo system, although they may be visible to the intensity sensor. 113 search process is constrained by restricting the initial matching to just a small num- ber of features. Initial match hypotheses are based on edges. Edges with similar properties are grouped into classes and independent strategies are pursued for each class (these strategies are constructed by hand). By matching on only one feature and verifying the hypothesized match with additional features, considerable pruning of the search space is obtained. The authors report that as few as three features are typically needed to verify a correct match hypothesis. They demonstrate some impressive results on real scenes, although they only worked with a single object model. Hansen and Henderson [51] make a strong case for the integration of CAD and computer vision, and outline a recognition system employing a strategy tree. Object models in this system are constructed using AlphaJ, a solid modeler developed at the University of Utah. A variety of features are extracted from the CAD models: pointwise normals, curvatures, and surface classifications (employing curvatures), pla- nar region approximations, edges and arcs, and apparent symmetries (for surfaces of revolution). An important contribution of this work is a systematic approach to the selection of features for recognition. Feature filters are employed to select the most powerful (discriminatory) features to use in recognizing a specific object, and discard the others. For example, surface curvatures may be powerful features for recognition of sculpted surfaces, but are essentially useless in recognition of polyhedra. A prop- erly constructed feature filter would reject curvatures when compiling a recognition tree for the polyhedron, and include them in the strategy tree for the sculpted object. 
A strategy tree is a generalization of the tree classifier. One such tree is generated for each object in the database. Recognition of an object and computation of its pose corresponds to a descent of the strategy tree for that object. The strategy tree is constructed by placing the most discriminatory features at the first level below the root. Subtrees under the first level are known as corroborating evidence subtrees (CESs), which direct a search for evidence that the first-level features are supported, and attempt to refine the pose estimate. If a first-level feature is detected and sup- ported by the CES, a hypothesis is generated about the object’s identity and pose. Verification of the generated hypothesis can be done either using frame correlation between the sensed image and a synthesized range image of the model, or through matching of features predicted from the model and the hypothesis with image fea- tures. This system was demonstrated only on a single polyhedral object. Its ability to 114 discriminate between different objects and its performance with curved objects were not demonstrated. Kak et al. [71] propose the use of sensor-tuned representations as an intermediary between a CSG-based solid modeler (the PADL-2 system [24]) and sensor data (range imagery). Their intermediate representation is an attributed graph in which nodes represent object surfaces (with‘attributes such as surface type and surface area), and edges represent inter-surface relationships (with attributes such as edge type, and edge angle). The generation of such an attributed graph from a CSG model is relatively straightforward as long as the modeler maintains a surface-based model in parallel with the CSG model (which many do). The attributed graph construction is then mainly a process of assembling the adjacency information not explicitly present in the parallel representation and inserting it in appropriate slots in the model graph. Range images in Kak et al.’s system are analyzed as follows. Local operators are used to estimate surface normals and curvatures at each surface point. Jump edges are extracted by thresholding on local changes in depth. Crease edges are obtained by thresholding on the surface curvatures, and are labeled as convex or concave. After extraction and labeling of the three types of edges, the remaining range points are grown into connected regions. Histograms of surface curvatures provide a label for each region from one of the following: planar, generalized cylindrical, generalized spherical, and toroidal. Inter-surface relations are constructed by forming an adjacency matrix for all possible pairs of patches and estimating the orientation change between adjacent patches. The recognition paradigm used in Kak’s work is graph matching. Given the attributed graphs derived from objects constructed in the solid modeler, and another graph derived from the sensed image, the task is to find a graph-subgraph isomorphism between the scene subgraph and one of the model graphs. The combinatorics of this matching problem can be burdensome if the model database is large. The recognition system is discussed in detail and demonstrated on a few simple scenes. Chen and Kak [27] developed 3D-POLY, perhaps the fastest 3D object recognition vision system described in the literature. 
Object models are constructed by merging multiple views; primitive surfaces are restricted to be planar or quadric, and match- ing employs geometric feature sets similar to those used in our system (orientation and location parameters). Surface and edge features are organized into sets which 115 allow a pose estimate to be obtained when a model set is associated with a scene set". Appropriate selection of such sets drastically restricts the number of possible inter- pretations of the scene primitives. Once a transformation (pose estimate) has been obtained it is verified by comparing the orientation of appropriate transformed model features and associated scene features (such orientations should be similar). These heuristics allowed the complexity of the recognition process to be reduced. 3D-POLY was tested on two scenes containing instances of two different models. Recognition times for the two scenes were 9 and 4 seconds (on a Sun 3 computer); note, however, that the input to the system is a processed range image. The acquisition of the range data and initial segmentation can be time-consuming. Grimson and Lozano-Pérez [47] developed a system which identifies and estimates the pose of polyhedral objects from sparse range or tactile data. The input to this system is a set of surface points (assumed to be accurately measured) and estimated surface normals at those points (which can be noisy). A correct match in their system corresponds to a path in the interpretation tree, which enumerates all possible assignments of sensed surface points to model surfaces. Two procedures are used to effectively prune the interpretation tree, avoiding the combinatorial problems inherent in a brute-force search. First, four constraints between surface points are used to restrict the search. These constraints embody allowable distances between two points assigned to two model surfaces, the tolerance in orientation difference between the normals at the two points and their associated surfaces, tolerance in projection of a vector joining two surface points on their surface normal, and the correct sign of the triple product of three surface normals at sensed points with respect to the corresponding sign derived from the model. Only assignments of point pairs to model surface pairs which satisfy all four constraints are allowed. After interpretation tree pruning based upon the local constraints mentioned above, several possible interpretations will often survive, each corresponding to an object identification and pose estimate which was admissible with respect to the local constraints. These interpretations are then verified by computing the transforma- tion required to map the model corresponding to the interpretation into the sensed data, and then rejecting interpretations in which the sensed data points fall outside the bounds of their corresponding model faces. If multiple interpretations survive this verification process, they usually reflect object symmetries or artifacts of the ‘This idea is similar in spirit to the local feature set employed in 3DPO [15]. 116 model-construction process, and are all returned as valid interpretations. Simulation experiments with synthetic data (generated from the models) showed that this method effectively prunes the search space”. High-resolution range data from a triangulation sensor was also used as input to the system. Two range images of a polyhedron were processed to extract points lying on planar faces, producing nine points on one object and 11 on the other. 
For the 9-point image, pruning and transformation computation produced three plausible interpretations. The ll-point image produced two plausible interpretations. Hoffman et al. [55, 57, 67] developed a complete system for 3D object recognition from range data. Object models were constructed on a custom CAD system, and model features used were: 0 global features: object silhouette perimeter, number of connected background components, and the number of such components within the convex hull of the object silhouette. s intra-patch features: a trinary label indicating planarity, convexity, or concavity, surface area, elongation, and a boundary fit measure. 0 inter-patch features: edge type (none, jump, adjacent), normal angle, minimum and maximum distances, boundary angle, a boundary fit measure, and the size of the jump between the two patches. Objects in the scene are recognized as follows. The unknown view is segmented using local feature clustering (see Chapter 4). The initial segments are classified, producing the same features as those listed above. The combination of the initial segmentation with the feature list is referred to as the initial representation. Then, object model information is used to perform knowledge-based merging of patches in the initial rep- resentation, producing a modified representation of the scene specific to each model. These modified representations are then input to the recognition module, which con- tains codified evidence values for each object model. The similarity between the features in the model database and each of the modified representations is evaluated, and if one model produces a high similarity value, the unknown object in the scene is identified as that model. A method for generating the codified evidence rules was ”In experiments on simulated data, only a few interpretations survived the pruning process, out of a total population greater than 10’”. 117 also developed. This system gave impressive results (less then 1% misclassification rate) using 290 test views drawn from a ten object database. Vemuri and Aggarwal [118] developed an object localization system employing range data as the source of both object models and input images. Object models are constructed by placing the object of interest on a turntable and merging the 3D images for a number of views (produced by rotating the table) at the range pixel level (interframe registration of the range views is estimated from the position of a line drawn on the turntable). Smooth surfaces are obtained from the set of points using qualitative classifications from surface curvature estimators (see Section 4.1.2). When the system is presented with an input image, curvature features are estimated and regions are classified as above. A candidate set of ‘likely’ models is selected by examining the similarity between the segmentations of the image and those of the models. The localization procedure is then applied to each model, and an error measure is calculated for each candidate model. The model producing the smallest error is accepted as the true identity. Localization is attacked as a combination of a single-point correspondence problem (which finds the translation component of pose) and an alignment problem (which fixes the rotational component of pose). Briefly, range pixels in the input image and the model are candidates for correspondence if they lie on lines of identical constant principal curvatures. The search procedure used to extract such candidates is not described in detail. 
Once such candidates have been identified, a rigid rotation which aligns the Darboux frames (the coordinate systems defined by the surface normal and the two principal curvature directions) on the surface at the given points is computed. A few test images were generated from models constructed from real range views, and errors in the rotational component of the estimated pose ranged from two to five degrees for objects without rotational symmetry. Fan et al. [37] have also developed a 3D vision system employing multiple views as a model source. Unlike Vemuri and Aggarwal, an input image is matched against indi- vidual images of the database objects rather than one model constructed from those images. The same processing is performed on range images forming the database and the unknown views to be classified. Connected patches are extracted using the edge-based technique described in Section 4.1.2. Each patch is described in terms of orientation, curvature, occlusion, and location parameters. Heuristics are used to group patches in the image (which may come from more than one object) into 118 classes corresponding to single objects, and patches within each class are linked with attributed relations summarizing the type of adjacency and a probability of connec- tion. The output of the image analysis phase is a set of graphs describing scene objects. Given an input image and its output graphs, a screening module selects database views judged ‘likely’ to match the input image graph. Construction of asso- ciations between entities in a single model image and the scene is then performed, and estimates of object pose are produced concurrently. An analyzer module attempts to cope with segmentation or interpretation errors by modifying the graph structure. The database consisted of 32 views of ten different objects. This system was tested on three scenes containing occlusion and multiple objects, and correct interpretations were obtained for all objects in these scenes. In his thesis, Chen [28] proposed a search-based 3D object recognition system. The underlying representation of object models and sensed images uses wing features6 rather than specific surface types such as planes and quadrics. Given a set of scene wings and several models described as wing sets (arranged by viewpoint), recognition involves the following steps: 0 Indexing: the selection of candidate models and views of those models. Models are rejected from consideration if the sets of scene wings and model wings do not intersect. Models surviving this test are arranged in a priority queue, using a measure of similarity between the sensed wing set and the model wing set as the priority function. e Consistent Labeling: The scene wing set is matched to each view of the candidate model set using the interpretation tree search technique. 0 Parameter Estimation: Hypotheses obtained from the search process produce several pose estimates. 0 Decision: The ‘best’ match is detected by detecting clusters in the parameter space of transformations obtained in the estimation step. The parameter estimation and decision modules draw on earlier work by Stock- man [108, 109]. Pose estimates are produced for each correspondence between a scene wing and a model wing. If we view the ensemble of estimates produced from ”See Section 2.3.2 for a brief description of wing features. 119 all hypotheses as a cloud of points in the pose parameter space, we would expect a compact cluster to be present at the correct pose value. 
Detection of these clusters is addressed in [108] and [28]. Lowe [78] and Huttenlocher and Ullman [61] have also examined 2D-3D matching. Lowe uses the principles of perceptual grouping to identify and rank intermediate-level structures from nonaccidental relationships (proximity, collinearity, and parallelism) between tokens (line segments) in the image plane. Given these segment groupings and a model of the object to be located, an iterative procedure attempts to ‘align’ corresponding structures, or (equivalently) find the rigid model transformation which, when projected to the image plane, produced the observed image. Huttenlocher and Ullman developed a non-iterative sequential procedure for aligning corresponding point triplets obtained from the model and the scene. Candidate image points used to estimate the transformation are extracted from a multiscale representation. 5.3 Interpretation Tree Search for 3D Recogni- tion Our computational model for 3D object recognition and localization is heuristic search of the interpretation tree. The IT embodies all possible sequences of associations (matchings) between a model surface and a scene surface. Descent in the tree implies an increasing level of commitment to a particular hypothesis of identity and pose for a single model object. If the system decides (based on the scene content) that scene patches could match more than one of the database objects, a sequence of IT searches is performed, one for each ob ject". Figure 5.1 illustrated the concept of the interpretation tree. We now develop the formalism. 5.3.1 Notation and Definitions The input to the model-based 3D vision system is a set of s scene surfaces (or scene entities): S: {S1,Sg,...,5.} 7Section 5.3.7 describes our techniques for candidate model selection. 120 The model under consideration contains m surfaces (or model entities): M = {erM’bH-rMm} An interpretation (or a hypothesis) is a triple containing the identity of the model object under consideration, an ordered set of associations between scene and model surfaces, and a rigid transformation which aligns the model with the scene entities involved in the interpretation. I = < 0,A,T > = < object, {(S',,M.-,),. . . (S,,M.~,.)},(R,t) > Several associations are required before the transformation (R,t) is uniquely deter- mined. A hypothesis with a unique transformation is termed verifiable. The IT is constructed by forming all possible interpretations which contain asso- ciations ordered on the scene entities 5;; in other words, 1' = < object, {(51, M1),(S'2,M4), (53,M9)},(R,t) > is an element of the IT, but i I = < object, {(51, MI), (33, M4), (52, M9)}, (R,t) > is not, because the S; are not interpreted in order. Heuristics for ordering scene entities are discussed in a later section. At the root of the IT, the association set A is empty. At the first level down from the root, the association set contains a single interpretation for SI. At the second level, .4 contains associations for 51 and 32, and so on. We define the depth d(.‘[) of an interpretation to be the number of associations: 0((1) = “All - Alternatively, the depth of an association is its distance from the root node. 5.3.2 NULL Associations The typical scene to be considered in this thesis contains more than one object. Therefore, the scene entity set 5 will contain entities from more than one model. Additionally, segmentation errors can produce scene entities which correspond to no model. 
We reconcile the complexity of the scene entity set with the constraint that 121 we look for one model at a time by allowing the NULL match”. Define a NULL match as an association of the form (3., NULL), and meaning that we mark S,- as having been used (and therefore not subject to sub- sequent association with a model entity), but that it has no effect on the computation of a pose transformation or on verification of a hypothesis. We define a correct association as one corresponding with the true scene interpre tation. We will reserve the terms ‘correct’ and ‘incorrect’ for non-NULL associations; a NULL association is considered neither correct nor incorrect. 5.3.3 Ranking Scene Entities Inherent in a search-based approach to recognition is the sequential consideration of scene entities. Since we desire that search be performed efficiently, the order in which scene entities are selected is important. There are two competing goals to be addressed when we order the scene surfaces. Obviously, we prefer to match ‘more reliable’ entities early in the search, to avoid taking the NULL branch (since it is considered after all model entities have been tried). The reliability of parameter estimates for scene entities is related to their visible area. On the other hand, it is important that we reduce the number of subtrees to be searched as close to the root node as possible; pruning is much more powerful near the root than it is further down in the tree. This goal suggests that we should attempt to employ the most discriminatory scene entities near the root node. These two goals can conflict, because the most discriminatory entity is not necessarily the largest patch, and vice versa. At present, we are using patch area as a criterion for entity ranking prior to search. Associations made at the first level of the tree involve the largest scene entity, level-2 associations involve the next largest scene entity, and so on. 5.3.4 Pruning and Predicates As we illustrated in the example of Figure 5.1, not all interpretations are valid. Clearly, the presence of. an incorrect association implies an incorrect interpretation. 8or ‘wild-card’ match [45]. 122 If we reject an interpretation Io =< object,Ao, (Ro,to) > as invalid, all interpretations I). =< object, A;,(Rk, tr) > with A; C A; are also invalid and need not be generated (much less checked for validity). This is the essence of pruning; we wish to test interpretations as they are generated, reject ones which are invalid, and avoid searching the subtrees rooted at the incorrect interpretations. The specific techniques we use to prune incorrect interpretations are discussed in Section 5.3.8. Our general approach is to apply several predicates to each interpreta- tion as it is generated. A general predicate ’P is a mapping ’P : 8 -+ {TRUE, FALSE}, where 8‘ is the set of possible interpretations I. These predicates return TRUE if they are satisfied by the interpretation, and FALSE if the constraints of the predicate are violated and the interpretation and its subtree should be pruned. Since the scene and model entities are surface primitives with orientation, location, and intrinsic parameters (i.e. geometric parameters), our predicates are geometric in nature. Intuition suggests one way to test interpretations with the predicates: simply ex- amine the consistency of the ‘newest’ association in the hypothesis. Such tests are unary, as they only involve one scene entity and at most one model entity. 
However, unary tests alone often do not provide sufficient pruning power to significantly re- duce the size of the IT, primarily because of their locality. Binary predicates, which examine all pairs of associations involving the ‘newest’ one, allow us to reject inter- pretations in which binary relationships between a pair of scene entities do not agree with the relationships between the corresponding pair of model entities. 5.3.5 Heuristic Search Termination We are now in a position where we could give an algorithm for IT search. Such an algorithm would take as its input the scene description and a model description, build sequences of interpretations, and reject invalid ones. When an interpretation 123 at the bottom of the tree (i.e. one which has associations for all scene surfaces, some associations possibly NULL) is generated and accepted by the geometric predicates, we would save that interpretation as a possible solution, and perform a verification step (which we discuss in Section 5.5). The interpretation with the highest measure of validity (if any) would then be accepted, and the scene primitives involved could be removed from consideration. The ‘new’ scene can then be re-interpreted, and so on. Examination of the IT search in the context of 3D recognition from geometric features yields additional constraints which we can bring to bear on the interpretation problem. As we shall see in Section 5.4, each successive correct non-NULL association provides constraints on the components of the pose estimate. This fact provides us with a mechanism to terminate the search before reaching the leaf nodes, i.e. before associating all surfaces in the scene with either a model surface or NULL. As soon as we have an unique pose estimate, we can terminate the search process and immediately begin the verification step. As a matter of fact, verification of a correct hypothesis can also suggest interpretations for the remaining scene surfaces. This technique, called hypothesis extension, is discussed in Section 5.5. 5.3.6 IT Search Algorithm Putting together the results of the previous sections, we now propose the algorithm (called SEARCH), the heart of BONSAI. This algorithm will compute a list of plau- sible hypotheses and their scores. A pseudocode version of this algorithm appears in Figure 5.2. The internal details of the procedures labeled COMPUTEPOSE and VALIDATE are discussed in later sections. The list of verified hypotheses is placed on an output list named PLAUSIBLEJIYPOTHESES. 5.3.7 Model Database Pruning Previous sections have outlined the interpretation process for a single model. Our database contains twenty different object models, each an attributed relational graph. It is computationally unattractive to apply the IT search algorithm once for each model in the database if we can infer that most of the objects cannot be present in the scene. This section examines techniques for pruning the model database by examination of the scene content prior to IT search. Just as scene entities and their 124 Algorithm SEARCH(I =< ID, A, (R,t) >) Input: Model ID with surface set M Input: Scene surface set {81, . . . 
, S_s}
Output: PLAUSIBLE_HYPOTHESES

FOR EACH predicate P_j DO
    IF NOT P_j(I) RETURN
ENDFOR
(R, t) <- COMPUTE_POSE(I)
IF (R, t) is unique THEN
    score <- VERIFY(I)
    Place (score, I) on PLAUSIBLE_HYPOTHESES
    RETURN
ENDIF
m <- ||M||
M_unused <- M - { model surfaces already associated in A }
IF (d(I) = s) RETURN
FOR i = 1 TO m DO
    IF M_i in M_unused THEN
        A' <- A U {(S_{d(I)+1}, M_i)}
        I' <- < ID, A', (R, t) >
        SEARCH(I')
    ENDIF
ENDFOR
A' <- A U {(S_{d(I)+1}, NULL)}
I' <- < ID, A', (R, t) >
SEARCH(I')
END Algorithm SEARCH

Figure 5.2: Algorithm SEARCH: Constrained search of the interpretation tree.

relationships constrain interpretation for a single model, they also restrict the set of models to be considered. In this thesis, we employ a few simple techniques for reducing the database of models to examine. However, a detailed off-line examination of the distribution of geometric parameters and relations between model entities can 'flag' certain features as salient (i.e. highly discriminatory, informative, and powerful). Exploiting this saliency at run-time (when objects are presented to the sensor) is a topic of future research, and could supplement or even partially supplant IT search as a recognition mechanism.

Our model database pruning procedure rejects models from the model database if the scene contains features that the models do not. Specifically, we look for curved scene surfaces, and reject models which do not contain surfaces with similar intrinsic parameters. We also examine the distribution of angles between the orientation parameters of patches joined at crease edges, and reject models without similar inter-patch angles.

Intrinsic Properties of Curved Surfaces

The presence of intrinsic shape parameters allows us to avoid the consideration of object models which do not possess surfaces with parameters similar to those in the scene. As part of the process of bringing a new object model into the recognition system, we compute a radius list, which contains a sorted list of the radii of curved model surfaces. Since our system utilizes only spherical and cylindrical surfaces, radii are the only intrinsic shape parameters. However, this notion could easily be generalized to more complicated surface types, by defining additional lists (such as a vertex-angle list for conical surfaces, a shape-parameter list for superquadric surfaces, etc.).

When a segmented image and its symbolic description are made available to the recognition system, the intrinsic parameters of all curved image surfaces are extracted and sequentially compared to the radius lists of each model. Those models which have entities of radii similar to the scene entities are considered for IT search; the remaining models, which do not contain primitives with similar intrinsic parameters, are eliminated from further consideration.

Inter-patch angles

In addition to examination of unary (patch-specific) features like the intrinsic parameters, we can also reject models based on the absence of salient binary relationships between patches that were observed in the scene. We have implemented a model database pruning step which examines the angles between the orientation parameters of primitive surfaces in the scene and the models. As an off-line process, we create a histogram of inter-patch angles (at present we use a 90-bin histogram with each bin being one degree 'wide'). At recognition time, the inter-patch angles for scene surfaces connected by a crease edge are compared with the histograms for each model, and models which do not have similar angles are not searched.
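Both pruning filters can be summarized in a short Python sketch. The data structures and tolerance values shown here are illustrative assumptions, not the system's actual representation; in particular, the angle histogram is assumed to be a 90-bin NumPy array as described above.

    import numpy as np

    def prune_models(scene, models, radius_tol=0.35, angle_tol_deg=3):
        """Reject models lacking the curved-surface radii or crease angles seen in the scene.

        scene  : dict with 'radii' (radii of curved scene patches, inches) and
                 'crease_angles' (degrees, between patches joined by crease edges).
        models : dict name -> {'radius_list': sorted radii, 'angle_hist': 90-bin np.array}.
        Returns the names of the models to hand on to IT search.
        """
        survivors = []
        for name, m in models.items():
            # Radius-list test: every curved scene patch needs a model radius of similar value.
            radii_ok = all(any(abs(r - mr) < radius_tol for mr in m['radius_list'])
                           for r in scene['radii'])
            # Angle-histogram test: every observed crease angle must fall in (or near)
            # a non-empty bin of the model's inter-patch angle histogram.
            hist = m['angle_hist']
            def angle_ok(a):
                lo = max(0, int(a) - angle_tol_deg)
                hi = min(len(hist), int(a) + angle_tol_deg + 1)
                return hist[lo:hi].sum() > 0
            if radii_ok and all(angle_ok(a) for a in scene['crease_angles']):
                survivors.append(name)
        return survivors

Loose tolerances keep the filter conservative, in line with the design goal stated below: a model wrongly retained only costs search time, while a model wrongly rejected cannot be recovered.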
We restrict our angle measurements to those between connected patches in an attempt to prune only within scene objects. One can easily imagine multi-object scenes in which a salient angle is observed between surface primitives belonging to different objects; the presence of such an angle might prevent a model or set of models from being rejected.

These simple model database pruning procedures were designed specifically to be conservative. We do not wish to reject the correct model, and the tolerances on intrinsic parameters and angle differences are deliberately loose in order to minimize the chances of such a rejection. One unfortunate byproduct of this conservative approach is that more models must be searched; however, Algorithm SEARCH can also reject models (by producing no valid hypotheses), so the additional investment in computation is preferable to obtaining an incorrect answer.

5.3.8 Geometric Predicates

In this section, we describe the unary and binary predicates used to prune the IT. As outlined in Algorithm SEARCH, these predicates are applied sequentially to each new hypothesis as it is generated (by adding a new association to the existing set). If any one of these predicates is not satisfied, the new interpretation is rejected and its subtree is pruned. This procedure is depicted graphically in Figure 5.3, in which the area, type, intrinsic parameter, rotation, orientation, and visibility predicates are tested in sequence and failure of any test prunes the interpretation. The pruning power of these predicates changes with the model under consideration and the depth in the tree. It would be interesting to study the effect of ordering these predicates. If those predicates which provide the most pruning power at the current level were applied first, some increase in speed might be observed, although the improvement would probably be marginal, since the predicates are not computationally demanding.

Figure 5.3: Sequential application of geometric predicates.

We also note that the predicates are not totally independent of one another; an incorrect interpretation is often rejected by more than one predicate. In such situations, we would hope to use the 'cheaper' predicate before the others, so that rejection of the interpretation is as efficient as possible. These issues are topics for future research.

Unary Predicates

Three unary predicates have been implemented in the current version of BONSAI. All apply a test to the newest association in the interpretation being evaluated. In order to simplify the discussion, we assume that the newest association is non-NULL. Because NULL associations should not affect the interpretation process either positively or negatively, these predicates take the value TRUE when the newest association is NULL.

The first is the area predicate P_area(I), defined as follows. Let the 'newest' association in I be (S_i, M_m). Define A_{S_i} to be the estimated area of S_i (calculated during segmentation and classification), and A_{M_m} to be the maximum estimated area of M_m, calculated from the 320 synthetic views of the model generated as part of the model-building process. Then

    P_area(I) = TRUE if A_{S_i} <= A_{M_m}, and FALSE if A_{S_i} > A_{M_m}.

This predicate should appeal to intuition: we reject associations between scene primitives and model primitives for which the area estimated in the scene is larger than the largest possible area of the model patch.
Empirical performance shows that this predicate is most powerful (i.e. it rejects the most interpretations) at the top of the IT, because the first surfaces to be interpreted are the largest and the least likely to be self-occluded or occluded by other objects in the scene”. Another unary predicate used by BONSAI is the type predicate ptype Given the interpretation I with ‘newest’ association (5;, Mm), ptypeII) is TRUE if S.- and Mn, are the same primitive type (i.e. plane, cylinder, sphere). ”If we assumed that the scenes to be interpreted contained only one object, we could strengthen this predicate by requiring the model and scene areas for an association to agree within some tolerance value for at least one of the 320 model views. As an added benefit, this improved predicate would reject both incorrect models, and incorrect views of the correct model. 129 . The third unary predicate is the intrinsic parameter predicate pintrinsicr which is defined only for curved surfaces. This predicate is applied after ptyper so we are assured that the newest association’s component surfaces are both spherical or both cylindrical. The intrinsic parameter predicate returns TRUE if the intrinsic parameters (radii) of S,- and Mm. are within a specified tolerance value of one another (our implementation uses an empirically-chosen tolerance of 0.35 inches). Binary Predicates BONSAI employs four binary predicates to prune invalid IT subtrees. These pred- icates examine all pairs of associations in the interpretation I under consideration, and return FALSE if the constraint specific to the predicate is violated for one or more of the pairs. As in the unary case, we will limit our discussion to situations where the newest association is non-NULL. Additionally, we note that if the other association in one of the pairs is NULL, the constraint is not applied, and the final decision of the predicate is not affected. Thus, if the current predicate contained the associations ,4 = {(51, M4), (3,, NULL), (Saws). (54,M:)}, the conditions of the predicate would be applied to the pairs [(51, M4)r (54: M1”, [(S3r M8): (54, MIN: but not to [(52,NULL), (54, M1)]. In Section 5.4, we will describe our method for estimating the rotational and trans- lational components of the object pose (this pose transformation aligns the model with the image). Each pair of non-NULL associations not involving one or more spheres10 produces an estimate of the model rotation. If these estimates vary widely, the in- terpretation is incorrect and should be pruned. This idea is the basis for the rotation validity predicate protationalr which is TRUE if the model rotations estimated for each pair of associations agree within a prespecified tolerance (chosen to be 0.25 radians, or 14 degrees), and FALSE otherwise. The orientation predicate is perhaps the most intuitive binary predicate. As above, association pairs containing one or more spheres are excluded from this predicate since 10The rotation is estimated from the patch’s orientation parameter, and spheres lack such a parameter, so we cannot estimate rotation from a pair of associations containing a spherical surface. 130 spheres do not possess orientation parameters. This predicate examines the difference in orientation parameters between the scene primitives in the association pair and the corresponding model primitives. 
Let p_s1 and p_s2 be the orientation parameters of the two scene primitives in the association pair, and let p_m1 and p_m2 be the corresponding model primitive orientation parameters. Then

    A_s = cos^{-1} [ (p_s1 . p_s2) / ( ||p_s1|| ||p_s2|| ) ]   and   A_m = cos^{-1} [ (p_m1 . p_m2) / ( ||p_m1|| ||p_m2|| ) ],

and the predicate is satisfied only if |A_s - A_m| is less than a threshold (6 degrees in our experiments) for all association pairs.

The visibility predicate is a binary predicate related to the unary area predicate discussed above. Section 2.4.5 described the construction of area lists for each object model surface, and the generation of never-simultaneously-visible relations between surfaces which do not appear together in any view. The visibility predicate is satisfied only if, for all non-NULL association pairs, the model entities are not connected by a never-simultaneously-visible relation.
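As an illustration of how these pruning tests fit together, the following Python sketch applies simplified versions of the unary predicates and the binary orientation predicate to the newest association in an interpretation. It is only a sketch: the patch record, its field names, and the helper names are assumptions, not the system's actual implementation, and the rotation and visibility predicates are omitted for brevity.

import math
from collections import namedtuple

# Minimal stand-in for a surface-patch record; field names are assumptions.
# Scene patches use 'area', model patches use 'max_area'; unused fields may be None.
Patch = namedtuple('Patch', 'kind area max_area radius orientation')

RADIUS_TOL = 0.35                 # inches (intrinsic parameter predicate)
ANGLE_TOL = math.radians(6.0)     # orientation predicate threshold

def vec_angle(u, v):
    """Angle (radians) between two 3D orientation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(a * a for a in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def unary_ok(scene, model):
    """Area, type, and intrinsic-parameter predicates on the newest association."""
    if model is None:                      # NULL association is always consistent
        return True
    if scene.area > model.max_area:        # area predicate
        return False
    if scene.kind != model.kind:           # type predicate (plane/cylinder/sphere)
        return False
    if scene.kind in ('cylinder', 'sphere') and abs(scene.radius - model.radius) > RADIUS_TOL:
        return False                       # intrinsic parameter predicate
    return True

def orientation_ok(newest, other):
    """Binary orientation predicate: inter-patch angles must agree."""
    (s1, m1), (s2, m2) = newest, other
    if m1 is None or m2 is None or 'sphere' in (s1.kind, s2.kind):
        return True                        # no constraint available
    return abs(vec_angle(s1.orientation, s2.orientation)
               - vec_angle(m1.orientation, m2.orientation)) <= ANGLE_TOL

def interpretation_ok(associations):
    """Apply the unary test to the newest association and the binary test to every
    pair involving it; any failure prunes the interpretation's subtree."""
    newest = associations[-1]
    if not unary_ok(*newest):
        return False
    return all(orientation_ok(newest, pair) for pair in associations[:-1])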
5.4 Pose Estimation

Part of BONSAI's output is a pose estimate for each object model that it decides is present in the scene. This section gives our techniques for pose estimation. The input to the pose estimation portion of BONSAI is an interpretation, assumed to have enough non-NULL correspondences to make the pose estimate unique. The output is a 3 x 3 rotation matrix R and a 3D translation t. The system uses two different techniques to estimate pose: one is suitable for 'general' objects whose orientation parameters are not all parallel, and the other was developed to obtain unique rotation and translation estimates for objects with rotational symmetry.

5.4.1 Nonsymmetric Model Pose Estimation

We adopt Grimson and Lozano-Pérez' technique [47] for estimation of the model rotation. Their presentation was developed for planar surfaces, but it generalizes easily to any surface with a unique orientation parameter. Chen et al. [30] extended Grimson and Lozano-Pérez' method to include line segments in addition to planar patches. They describe algorithms for the cases of (i) two nonparallel scene lines and two corresponding model lines, (ii) two noncoplanar scene planes and a scene line and corresponding model planes and line, and (iii) three non-coplanar scene planes and corresponding model planes. Our technique obtains a rotation estimate for each pair of non-NULL associations not involving spherical patches. The ensemble of estimates can be checked for validity and averaged to produce the final estimate. We now outline the procedure for a single pair of scene and model surfaces.

Let these entities be associated as (S_{i1}, M_{j1}) and (S_{i2}, M_{j2}), and let the entities have orientation parameters p_s1, p_m1, p_s2, and p_m2, respectively. We are interested in a 3 x 3 rotation matrix which, when applied to p_m1, aligns it with p_s1, and aligns p_m2 with p_s2. An equivalent representation of this rigid 3D rotation is composed of an axis of rotation (constrained to pass through the origin) r and an angle θ [97, pp. 121-127]. We determine r and θ as follows [47]. We wish to rotate p_m1 into p_s1. Rigid rotations about an axis r must preserve the angles between r and p_s1 and between r and p_m1:

    r . p_s1 = r . p_m1,   or   r . (p_m1 - p_s1) = 0.

Hence, given two non-parallel model vectors and their corresponding scene vectors (belonging to associated scene surfaces), the axis of the rotation which aligns the model vectors with the corresponding scene vectors lies in the direction

    r = (p_m1 - p_s1) x (p_m2 - p_s2).

The amount of rotation θ is given by

    θ = tan^{-1} [ ((r x p_s1) . p_m1) / ((r x p_s1) . (r x p_m1)) ].

The ranges of values of the components of r and θ are examined for each pair of associations in order to verify that the rotation is valid. The estimated rotation axes and angles for an incorrect interpretation will typically vary widely, and we detect and reject interpretations for which this happens. If the rotation axis and angle estimates vary only slightly (by less than 14 degrees in our experiments), we average those estimates to obtain an average axis r = (r_x, r_y, r_z) and an average angle θ, which are converted to the 3 x 3 rotation matrix using the following formula:

    R = cos θ . I + (1 - cos θ) . [ r_x^2     r_x r_y   r_x r_z
                                    r_x r_y   r_y^2     r_y r_z
                                    r_x r_z   r_y r_z   r_z^2   ]
                  + sin θ . [  0     -r_z    r_y
                               r_z    0     -r_x
                              -r_y    r_x    0   ],

where I is the 3 x 3 identity matrix.

After the rotation is estimated, the model and scene surfaces will have similar orientations, but they will not in general be coincident in 3D space. Estimation of the translation necessary to align them is the next step. Our approach to translation estimation is (to our knowledge) unique. Its two primary advantages are the following.

• Linearity: The translation estimate is obtained from the solution of a linear least-squares problem.
• Unification: The least-squares problem contains information from all the surfaces involved in the interpretation under consideration.

The primary difficulty in this estimation procedure is accommodating three different surface types, each of which provides a different number of constraints on the translation's three components. Spheres constrain all three coordinates of the translation estimate. Cylinders constrain two of the coordinates (leaving translation along the axis unconstrained). Planes constrain only one coordinate (in the direction normal to the plane), leaving two free translations. How do we put these different constraints together into a linear least-squares framework? For an interpretation which contains associations involving j spherical surfaces, k planes, and l cylinders, we can set up a linear system of 3j + k + 3l equations in 3 + l unknowns (Δx, Δy, Δz, t_1, ..., t_l), as follows.

Spheres

Let (x'_m, y'_m, z'_m) be the rotated location parameter of the spherical model surface (i.e. the center of the model sphere after transformation by the matrix R estimated above). Let (x_s, y_s, z_s) be the location parameter of the corresponding scene sphere. Clearly, then, we can obtain three equations in the three unknowns (Δx, Δy, Δz):

    x_s = x'_m + Δx
    y_s = y'_m + Δy
    z_s = z'_m + Δz.     (5.1)

Planes

If the rotation is correct, the rotated model plane and the corresponding scene plane will have identical orientation parameters, which we denote (a, b, c). Their location parameters will in general be different; we denote them d_m and d_s for the transformed model location parameter and the scene location parameter, respectively. The rotated model plane has the implicit equation

    a x'_m + b y'_m + c z'_m + d_m = 0,     (5.2)

and the scene plane has the implicit equation

    a x_s + b y_s + c z_s + d_s = 0,     (5.3)

where (x_s, y_s, z_s) and (x'_m, y'_m, z'_m) are the scene and rotated model coordinates, as in Equation (5.1). But we have assumed that the rotation is correct and that only a translation is required to bring the two planes into alignment. So, by substituting Equation (5.1) into Equation (5.3) and subtracting the result from Equation (5.2), we obtain a single linear constraint on the translation (Δx, Δy, Δz):

    a Δx + b Δy + c Δz + (d_s - d_m) = 0.

Cylinders

The case of cylindrical surfaces is slightly more complicated.
Recall that the orientation parameter for a cylinder is a vector parallel to its axis, and the location parameter is a point on the axis. Assuming again that the estimated rotation is correct, application of R to the model orientation parameter will orient the model cylinder properly but will not (in general) make it coincident with the corresponding scene cylinder. Let (x'_m, y'_m, z'_m) represent the rotated location parameter of the model cylinder, and let (v_x, v_y, v_z) be the coordinates of the orientation parameter of the scene cylinder (parallel to the rotated model cylinder). We wish to find the translation (Δx, Δy, Δz) which moves (x'_m, y'_m, z'_m) to the corresponding point on the axis of the scene primitive. In effect, we have transformed the cylinder-matching problem into a line-matching problem. Figure 5.4 illustrates the idea behind our solution. We can decompose the overall translation into a component parallel to (v_x, v_y, v_z) and a component perpendicular to (v_x, v_y, v_z):

    (Δx, Δy, Δz) = (Δx_∥ + Δx_⊥, Δy_∥ + Δy_⊥, Δz_∥ + Δz_⊥).     (5.4)

We also know that for some number (an auxiliary parameter) t,

    (Δx_∥, Δy_∥, Δz_∥) = t (v_x, v_y, v_z).     (5.5)

We can derive (Δx_⊥, Δy_⊥, Δz_⊥) from (x'_m, y'_m, z'_m), (v_x, v_y, v_z), and (x_s, y_s, z_s). Putting together Equations (5.4) and (5.5), we obtain three equations in (Δx, Δy, Δz) and the auxiliary parameter t:

    [ 1  0  0  -v_x ]   [ Δx ]   [ Δx_⊥ ]
    [ 0  1  0  -v_y ] . [ Δy ] = [ Δy_⊥ ]
    [ 0  0  1  -v_z ]   [ Δz ]   [ Δz_⊥ ]
                        [ t  ]

Note that each cylinder requires its own auxiliary parameter t.

Figure 5.4: Translation estimation: aligning cylinder axes (decomposition of the displacement into components parallel and perpendicular to the axis).

5.4.2 Rotationally-Symmetric Objects

The technique for rotation estimation given above does not work when all of the model surfaces' orientation parameters are parallel (or antiparallel). Objects with rotational symmetry are common among manufactured parts, so we developed an alternative approach to rotation and translation estimation which works with such objects. The estimation of rotation actually becomes simpler in such cases, as one of the three components of rotation is immaterial (rotation about the symmetry axis); there is an infinity of 3 x 3 rotation matrices which can be used as the 'correct' orientation transform. BONSAI assumes that models with rotational symmetry are 'flagged' as such; the direction of the axis of symmetry (in model coordinates) must also be known. Since a mechanical designer typically knows when he or she is designing a symmetric part, and also knows the symmetry axis, these quantities need not be inferred by a computer program.

Consider an interpretation I containing several associations [(S_1, M_{i1}), (S_2, M_{i2}), ..., (S_k, M_{ik})]. Ignoring NULL associations and associations involving spherical patches, let the orientation parameters of the entities in the remaining associations be given by (p_s1, p_m1), ..., (p_sk, p_mk). The condition of rotational symmetry requires that all the p_mj be parallel (or antiparallel). If the interpretation I is correct, the orientation parameters of the scene surfaces will also be parallel. Let

    p̄ = ( Σ_{i=1}^{k} p_si ) / || Σ_{i=1}^{k} p_si ||

be the average of the associations' scene orientation vectors. The rotation axis r and angle θ needed to orient the model so that its orientation parameters are parallel to the observed scene vectors are given by

    r = p_m1 x p̄,     θ = cos^{-1} ( p_m1 . p̄ ).

Once the rotation has been found, we can easily estimate the translation. An estimate of the three components (Δx, Δy, Δz) of the translation vector is produced from each association between planes or spheres. For the planar case, the intersection of the axis of symmetry and the plane under consideration is uniquely determined, so the displacement between the (rotated) model plane's intersection point and the scene's intersection point provides an estimate. For spheres, the displacement between the (rotated) model sphere's center and the scene sphere's center is an estimate of the translation. The average of these values is taken as the final translation estimate.
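To make the unified least-squares formulation of Section 5.4.1 concrete, the following Python sketch (using numpy) assembles the sphere, plane, and cylinder constraints described above into one overdetermined linear system and solves it. The data layout and function name are assumptions made for illustration; the rotation R is assumed to have been estimated already and applied to the model location parameters.

import numpy as np

def estimate_translation(spheres, planes, cylinders):
    """Solve for (dx, dy, dz) plus one auxiliary parameter per cylinder.
    spheres:   list of (rotated_model_center, scene_center), each a 3-vector
    planes:    list of (normal, d_model_rotated, d_scene), with normal = (a, b, c)
    cylinders: list of (rotated_model_axis_point, scene_axis_point, scene_axis_dir)
    Returns the 3-vector translation estimate."""
    n_cyl = len(cylinders)
    rows, rhs = [], []

    # Spheres: three equations each, scene_center = rotated_model_center + delta.
    for m_c, s_c in spheres:
        for k in range(3):
            row = np.zeros(3 + n_cyl)
            row[k] = 1.0
            rows.append(row)
            rhs.append(s_c[k] - m_c[k])

    # Planes: one equation each, a*dx + b*dy + c*dz = d_model - d_scene.
    for normal, d_m, d_s in planes:
        row = np.zeros(3 + n_cyl)
        row[:3] = normal
        rows.append(row)
        rhs.append(d_m - d_s)     # from a*dx + b*dy + c*dz + (d_s - d_m) = 0

    # Cylinders: three equations each, delta - t_j * axis_dir = perpendicular offset.
    for j, (m_p, s_p, v) in enumerate(cylinders):
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)
        offset = np.asarray(s_p, dtype=float) - np.asarray(m_p, dtype=float)
        perp = offset - np.dot(offset, v) * v   # component perpendicular to the axis
        for k in range(3):
            row = np.zeros(3 + n_cyl)
            row[k] = 1.0
            row[3 + j] = -v[k]
            rows.append(row)
            rhs.append(perp[k])

    A = np.vstack(rows)
    b = np.asarray(rhs)
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return solution[:3]           # (dx, dy, dz); solution[3:] are the t_j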
5.5 Hypothesis Verification and Extension

Ideally, we would like BONSAI to discard all incorrect interpretations that it generates, with at most one (the correct interpretation) surviving. Our use of tolerances on pruning predicates, combined with sensor noise, segmentation errors, and inherent ambiguities among object models, makes a hypothesis verification step necessary. Each hypothesis surviving the IT search is verified by synthesizing a range image and a segmentation of the object from the faceted approximation to the geometric model described in Chapter 2; an area-based matching score is then calculated by pixel-to-pixel comparison with the input range image and segmentation. Correct interpretations will have the highest matching score.

The matching score for a verifiable interpretation I is given by

    s(I) = s_l(I) s_r(I),

where s_l(I) is a score computed from the synthetic segmentation and the input scene segmentation, and s_r(I) is a score computed from the synthetic and input range images.

The range image score s_r(I) is the fraction

    s_r(I) = N_+ / (N_+ + N_-),

where N_+ is the positive evidence count, defined simply as the number of times that corresponding range pixels in the synthetic and real images have z-coordinates 'close' to one another ('close' is defined as 0.1 inch in our experiments), and N_- is the negative evidence count, defined as the number of times the synthetic image pixel is 'higher' (i.e. has a greater z-coordinate) than the corresponding pixel in the observed range image (similar measures were used by Hansen and Henderson [51] and by Bolles and Horaud in 3DPO [15]). Although it does not enter into the calculation of s_r(I), a third situation is possible: the observed range pixel could be 'higher' than the corresponding pixel in the synthetic range image. Occurrences of this sort should not (by themselves) be interpreted as negative evidence for the interpretation, because the input image could be reflecting scene occlusion. However, if we assumed that scene objects were isolated, any deviation in either direction between the input and synthetic image pixels would tend to indicate that the interpretation is incorrect.
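The range-image factor s_r(I) can be computed in a single pass over the two registered images. The following Python sketch is illustrative only; the array layout (NaN marking background pixels) and the function name are assumptions.

import numpy as np

Z_CLOSE = 0.1   # inches; 'close' tolerance on z-coordinates

def range_score(synthetic_z, observed_z, tol=Z_CLOSE):
    """s_r(I) = N+ / (N+ + N-): N+ counts pixels whose synthetic and observed
    z-values agree within tol; N- counts pixels where the synthetic surface is
    higher than the observed one. Pixels where the observed surface is higher
    are ignored, since they may reflect occlusion by another object."""
    valid = ~np.isnan(synthetic_z) & ~np.isnan(observed_z)
    diff = synthetic_z[valid] - observed_z[valid]
    n_pos = int(np.sum(np.abs(diff) <= tol))
    n_neg = int(np.sum(diff > tol))
    if n_pos + n_neg == 0:
        return 0.0
    return n_pos / (n_pos + n_neg)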
The labeling score s_l(I) is slightly more complicated, but the information produced by comparing the scene and synthetic labelings has a number of benefits in terms of interpretation refinement. Given the two label images, we compute the following quantities.

• N_{S_i,M_j}, the number of places where corresponding pixels in the two images belong to scene entity i and model entity j, for all i, j. This could be called an 'overlap list'; ideally, we want N_{S_i,M_j} to be large when (S_i, M_j) is an element of the associations of I, or is a correct association that is not present in I because of search termination.
• N_good, the sum of those N_{S_i,M_j} which reflect correct associations in I.
• N_i, the number of labeled pixels in the input image.
• N_s, the number of labeled pixels in the synthetic label image.
• N_bad,i, the number of image locations where the input label was valid but the synthetic label was not.
• N_bad,s, the number of image locations where the synthetic label was valid but the input label was not.

Note that the 'bad' quantities reflect only segmentation errors involving unlabeled (background) pixels. We ignore incorrect labelings involving non-background pixels, because they could be due to occlusion of the model under consideration by other objects. As above, if we could guarantee that objects would be isolated in the scene, we could incorporate such incorrect labelings. The labeling score is given by

    s_l(I) = (N_good / N_i) [1 - N_bad,i / N_i] [1 - 0.5 N_bad,s / N_s].

By inspection we can see that 0 <= s_l(I) <= 1. To attain values close to 1, the following conditions are necessary.

• The overlap for correct associations in I must be high.
• The 'background' portions of the synthetic and input labelings must substantially overlap.
• Most of the correct associations must be present in I.

The third condition is of greatest importance. As mentioned earlier, one of the advantages of BONSAI is its ability to stop searching when a verifiable interpretation I_v = <O, A_v, (R_v, t_v)> is obtained (typically after three non-NULL associations are present). If the maximal correct interpretation I_m = <O, A_m, (R_m, t_m)> is such that A_v is a subset of A_m, then we should give I_v a large score. However, the absence of the associations in A_m - A_v will depress the label-based matching score s_l(I_v). Experiments show that the scores will still be larger than those of incorrect interpretations (because the effects of early search termination are also present in incorrect interpretations), but the adjustment of matching scores for these cases is an interesting idea to pursue in the future.

The presence of the overlap populations N_{S_i,M_j} can suggest additional associations to be added to verifiable interpretations. We add such associations as follows. For each un-associated model surface that has a nonzero population in the synthetic labeling, we find all scene entities which overlap it. If one of those scene entities is also unassociated or associated with NULL, and the amount of overlap is greater than half the population of the model entity in the synthetic labeling, we add the new association containing the scene and model entities to the interpretation. If this process produces any change in the interpretation's associations, we immediately recalculate the matching score, which will typically increase for correct interpretations.

5.6 Analysis of Interpretation Tree Search

In this section, we introduce an error model for the probabilistic analysis of interpretation tree search. Our model is a generalization of the model introduced by Grimson [45], who employed it to obtain closed-form bounds for the number of nodes expanded during the search process. This section explicitly describes the difference between Grimson's interpretation trees and those used in this thesis. Our IT model, while probably more appropriate for recognition using complex scene entities, does not admit closed-form bounds for the number of nodes visited during search. In the absence of formal techniques, we describe a simulation procedure for obtaining these bounds empirically, and test these bounds against actual numbers obtained from a few recognition experiments.

5.6.1 IT Search as Sequential Hypothesis Testing

We can view the unary and binary predicates developed in Section 5.3.8 in a more formal way by drawing an analogy with traditional statistical inference.
In applying a predicate to an interpretation, we are in effect deciding on the validity of a model for the scene. Arbitrarily, we choose the model corresponding to correctness (of the interpretation) as a ‘null hypothesis’ Ho, which we test with the geometric predi- cates. If the predicate’s constraints are violated, we are in effect ‘rejecting’ this null hypothesis in favor of a general alternative Ha. Once we have put the application of predicates in such a framework, we can immediately discuss the possible errors in decision-making, and impose a simple model for these errors. Single tests in traditional inference admit two errors: 0 A Type I error (or false reject) occurs if we reject the null hypothesis when in fact it is correct, i.e. our predicates reject an interpretation whose associations are all correct. 0 A Type II error (or false accept) occurs if we accept Ho when in fact H, applies, i.e. our predicates fail to detect an incorrect association and the subtree of the interpretation under question is searched instead of being pruned. We are adopting terms from statistical inference because of their intuitive simi- larity to the particular geometric testing situation at hand, not because we believe 140 that a simple statistical ‘model’ applies to any of the test quantities examined when a predicate is evaluated. Continuing the analogy, IT search can be viewed as sequential hypothesis testing. The tests involved examine continuous (e. g. radii), and nominal- valued (e.g. patch type) data describing the scene and the models. In the traditional situation, thresholds on appropriate test statistics and confidence intervals can be obtained from theoretically-determined distributions; in our situation, however, the mix of data types, the difficulty of modeling the error process, and the complexity of parameter estimation make theoretical thresholds prohibitive (or impossible) to obtain. Thresholds are chosen empirically, appealing to intuition where possible. In the interests of simplicity, we will model the ensemble of geometric predicates (unary and binary) with two hypothesis tests: one unary and one binary. We assume that these predicates are applied sequentially to each interpretation generated during search, and generalize an error model developed by Grimson [45, 46] to allow both Type I and Type II errors. We will assume constant error probabilities and examine (via simulation) growth in the interpretation tree as a function of model and scene size (in primitives), and the number of correct associations made. A more realistic exploration would use families of error probabilities specific to s the particular test involved, 0 the object model involved, and possibly 0 the level in the IT being explored. While this generalization is amenable to simulation, formal bounds on the amount of search have proved resistant to analysis. 5.6.2 Unary Test In applying our ‘generic’ unary predicate to an interpretation I, we encounter four alternatives in terms of the model entity under consideration and the correctness of the association involving it. Our unary predicates were designed (see Section 5.3.8) to (in current terms) accept the null hypothesis if the newest association is NULL. If the newest association is non-NULL, our test is applied and produces a decision to accept or reject the null hypothesis of correctness. 
Assume that the Type II error probability is p1, and that the probability of a Type I error is 1 - p3 (we use p3 so that our notation agrees as much as possible with Grimson's [46]). In summary, the probability of acceptance (or consistency [46]) in terms of the newest association (S_i, M_j) in the interpretation under consideration is given by

    q(S_i, M_j) = 1    if M_j is NULL;
                  p1   if (S_i, M_j) is an incorrect association (H_a applies) and M_j is non-NULL;
                  p3   if (S_i, M_j) is a correct association (H_0 applies) and M_j is non-NULL.

In Grimson's work, p3 = 1, in effect stating that false rejections are impossible. By allowing p3 ≠ 1, we can no longer guarantee that any particular correct interpretation will survive and be verified. The value of p3 depends on the noise present in the data, the segmentation procedure employed (particularly the sorts of errors it is prone to make), and scene effects such as accidental alignment or occlusion. In practice, we would like p3 to be as close to 1 as possible while avoiding a large number of false acceptances.

5.6.3 Binary Test

We also need to model the consistency of an arbitrary pair of associations. Let q_{(S_i,M_j);(S_h,M_l)} denote the consistency probability of a pair of associations (S_i, M_j) and (S_h, M_l). As above, this probability should have a discrete distribution depending on the types of surfaces involved and the binary pruning predicates. For simplicity, we define a similar two-type error model, so that

    q_{(S_i,M_j);(S_h,M_l)} = 1    if M_j or M_l (or both) is NULL;
                              p2   if (S_i, M_j) and (S_h, M_l) are not both correct and M_j and M_l are both non-NULL;
                              p4   if (S_i, M_j) and (S_h, M_l) are both correct and M_j and M_l are both non-NULL.

In Grimson's analysis, p4 = 1, implying that the binary predicate cannot make a Type I error. As with p3, p4 depends on noise, segmentation quality, and the types of scenes expected; we would want p4 to be close to 1.

5.6.4 Learning Error Probabilities

In Section 5.6.6, we specify the error probabilities using intuition. However, it is possible to estimate these quantities by examining the object models and the behavior of the tests when applied to scenes. Consider the binary visibility predicate defined on two patches, which is TRUE if they are simultaneously visible in one of the synthetic views generated during model-building, and FALSE if they are never simultaneously visible. The false-acceptance probability p2 could be estimated as

    p2 ≈ N_true / C(n,2) = 2 N_true / (n(n-1)),

where n is the number of surfaces in the model, and N_true is the number of surface pairs (out of the C(n,2) possible pairs) whose members are simultaneously visible in some view. Similar estimators for other binary predicates could be developed.

Learning the error probabilities p1 and p3 for unary tests requires, in addition to the model information, an error mechanism for the segmentation process. If we consider the area predicate (in which we reject hypotheses for which the area of the scene surface in the newest association is greater than the maximum derived area of the corresponding model surface), the false-reject probability depends on our mechanism for estimating areas, which in turn depends on the segmentation process. Due to occlusion, the area predicate becomes less powerful for rejecting false hypotheses as we descend in the IT. Therefore, the probabilities would depend on level as well as on the specific model involved. Instead of estimating the p_i by examination of the model database, we could simply perform Monte Carlo trials on known scenes and count the number and types of errors for a variety of scene and model types and noise levels.
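The four-probability consistency model above can be stated compactly in code, which is convenient when the model is exercised in the simulations of Section 5.6.6. The Python sketch below is an illustrative restatement of the definitions; the function names and the random-number interface are assumptions.

import random

def unary_consistent(assoc_is_null, assoc_is_correct, p1, p3, rng=random):
    """True if the modeled unary test accepts the newest association.
    NULL associations are always consistent; correct ones pass with
    probability p3 (1 - p3 is the Type I rate); incorrect ones pass with
    probability p1 (the Type II rate)."""
    if assoc_is_null:
        return True
    p_accept = p3 if assoc_is_correct else p1
    return rng.random() < p_accept

def binary_consistent(either_null, pair_is_correct, p2, p4, rng=random):
    """Analogous model for a pair of associations."""
    if either_null:
        return True
    p_accept = p4 if pair_is_correct else p2
    return rng.random() < p_accept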
5.6.5 Search Tree Size

In its traditional incarnation [45], model entities are 're-usable' during IT search. When a given node in the tree is expanded, we can form new associations involving any of the m model entities, or form a NULL association. In the fully-explored tree, without pruning, and without heuristic termination after t non-NULL associations, there are (m + 1)^k nodes at the kth level, for a total tree size (assuming s scene entities) of

    [(m + 1)^{s+1} - 1] / m.

Figure 5.5 shows the interpretation tree constructed with s = 3, m = 4, allowing reuse of model entities, with no heuristic search termination. This tree has 156 nodes. The thick branches in the tree indicate NULL associations.

Figure 5.5: The full interpretation tree for s = 3, m = 4, allowing reuse of model entities, without heuristic search termination. Thick branches indicate NULL associations.

Suppose that we allow downward search to stop after t non-NULL associations have been obtained. In this case, we do not have a simple closed form for the number of nodes in the tree. We do have a recurrence formula, which can be tabulated for any desired values of m, s, and t:

    N'_full(s, m, t) = 1 + m N'_full(s-1, m, t-1) + N'_full(s-1, m, t)
    N'_full(s, m, 0) = 1
    N'_full(1, m, t) = m + 2.

Figure 5.6 shows the interpretation tree generated for s = 3, m = 4, t = 2. Using the formula above, this tree has N'_full(3, 4, 2) = 76 nodes.

Figure 5.6: The full interpretation tree (with heuristic termination) for s = 3, m = 4, t = 2, allowing reuse of model entities.

In BONSAI, we design the search so that model surfaces may not be re-used. After an association involving model entity M has been created, any associations created in a subtree of that hypothesis cannot involve M. We refer to this variant of the IT as a 'restricted interpretation tree.' Intuition should convince us that this modification can significantly reduce the size of the full IT, with or without heuristic search termination. The recurrence for the restricted tree size, without heuristic search termination, is given by

    N_restricted(s, m) = 1 + m N_restricted(s-1, m-1) + N_restricted(s-1, m)
    N_restricted(1, m) = m + 2
    N_restricted(s, 0) = s + 1.

Figure 5.7 shows the restricted tree without termination for s = 3, m = 4. The number of nodes is N_restricted(3, 4) = 100.

Figure 5.7: The restricted interpretation tree (without heuristic termination) for s = 3, m = 4.

When we add heuristic termination after t non-NULL associations, we have the following recurrence for the number of nodes in the IT:

    N'_restricted(s, m, t) = 1 + m N'_restricted(s-1, m-1, t-1) + N'_restricted(s-1, m, t)
    N'_restricted(s, m, 0) = 1
    N'_restricted(1, m, t) = m + 2
    N'_restricted(s, 0, t) = s + 1.

Figure 5.8 shows the IT of this form for s = 3, m = 4, t = 2; it has 64 nodes.

Figure 5.8: The restricted interpretation tree (with heuristic termination) for s = 3, m = 4, t = 2.
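The recurrences above are straightforward to tabulate. The following Python sketch computes the tree sizes with memoization and reproduces the node counts quoted in the text (156, 76, 100, and 64 for s = 3, m = 4, t = 2); the function names are introduced here for illustration only.

from functools import lru_cache

def full_tree_size(s, m):
    """Closed form for the full IT with model-entity reuse and no termination."""
    return ((m + 1) ** (s + 1) - 1) // m

@lru_cache(maxsize=None)
def full_tree_size_t(s, m, t):
    """Full IT with reuse, terminating after t non-NULL associations."""
    if t == 0:
        return 1
    if s == 1:
        return m + 2
    return 1 + m * full_tree_size_t(s - 1, m, t - 1) + full_tree_size_t(s - 1, m, t)

@lru_cache(maxsize=None)
def restricted_tree_size(s, m):
    """Restricted IT (no reuse of model entities), no termination."""
    if m == 0:
        return s + 1
    if s == 1:
        return m + 2
    return 1 + m * restricted_tree_size(s - 1, m - 1) + restricted_tree_size(s - 1, m)

@lru_cache(maxsize=None)
def restricted_tree_size_t(s, m, t):
    """Restricted IT with termination after t non-NULL associations."""
    if t == 0:
        return 1
    if m == 0:
        return s + 1
    if s == 1:
        return m + 2
    return (1 + m * restricted_tree_size_t(s - 1, m - 1, t - 1)
            + restricted_tree_size_t(s - 1, m, t))

# The s = 3, m = 4 examples discussed in the text.
assert full_tree_size(3, 4) == 156
assert full_tree_size_t(3, 4, 2) == 76
assert restricted_tree_size(3, 4) == 100
assert restricted_tree_size_t(3, 4, 2) == 64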
5.6.6 A Simulation Study

Our specification of Algorithm SEARCH and the probabilistic model of unary and binary consistency above allow us to simulate IT search under the model, and to obtain rough bounds on the amount of search performed (the number of IT nodes visited) for a given combination of the probabilities p_i, the total number of scene entities s, the number of entities m in the model being searched, and the number of scene entities c (c <= s) which actually come from the model under consideration. We limit our attention to search incorporating heuristic termination, in which IT subtrees rooted at nodes with t non-NULL associations are not searched.

Figure 5.9 shows curves corresponding to the 5th and 95th percentiles of an empirical distribution of search simulation results for p1 = p2 = 0.5, p3 = p4 = 0.99, m = 8, with the horizontal axis being the value of s and c. Since s = c, we are implicitly simulating a search for an 8-sided isolated object under the assumption that segmentation is perfect and has produced s segments, all of which match one of the 8 model faces. One hundred simulations of the search were performed for each value of s. Each simulation proceeded as follows. The correct interpretation was specified by randomly choosing c of the s potential associations to be non-NULL and indexing them by an appropriate model entity. The search was then initiated at the root node. When a node was expanded, the unary and binary consistency tests were performed via Bernoulli trials. The total number of nodes expanded during search was tracked and reported at the end of the simulation. The 100 numbers produced by the simulations were then sorted, the 5th and 95th in order were selected, and these appear on the graph. In addition, there are isolated points on the graph corresponding to the amount of search performed for synthetic images of one of our database models, which happens to have eight faces. If the probabilistic model of consistency is correct, we would hope that the points on the plot would have a 90% chance of lying within the empirically-determined bounds. Our sample size is too low in this experiment to make broad statements. We simply note that three of the points were well outside the empirical bounds. Raising p1 and p2 would push the empirical bounds upward, and perhaps include the outliers. However, our previous remarks on the simplicity of the model definitely apply here.

Figure 5.9: 5th and 95th percentile bounds on search (from simulation). Points are actual experiments with image data. (The plot shows the number of nodes explored versus s = c.)

Table 5.2 shows experimental results and empirical bounds for recognition of an object with 10 faces, several of which are curved. In these experiments, we set p1 = p2 = 0.1, p3 = p4 = 0.99, because experience has shown that both unary and binary constraints are more powerful with curved objects than with polyhedra.

Table 5.2: Experimental results and empirical bounds for recognition of a curved object.

    Experiment    s    m    c    result    lower bound    upper bound
        1         7   10    5      287        487            918
        2        14   10    3     2405       1057           2669
        3         5   10    5      126        266            453
        4         8   10    5      583        483           1082
        5         5   10    5      187        275            483

As above, the lower and upper bounds are obtained from the 5th and 95th percentiles of a histogram of tree sizes determined from 100 simulations with the given parameters. The sizes from the actual experiments fall within the bounds in two of the five experiments, and lie below the lower bound in the other three.
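The simulation procedure can be sketched as follows. This Python code simulates restricted IT search with heuristic termination under the four-probability consistency model and reports empirical 5th and 95th percentile node counts. It is an illustration of the procedure described above, with its own assumed function names and parameter defaults, not the program used to produce Figure 5.9 or Table 5.2.

import random

def simulate_search(s, m, c, t, p1, p2, p3, p4, rng):
    """Count IT nodes expanded for one randomly generated 'scene'.
    The correct interpretation assigns c of the s scene surfaces to distinct
    model surfaces; the rest are NULL. Unary and binary consistency are decided
    by Bernoulli trials under the four-probability model."""
    correct = dict(zip(rng.sample(range(s), c), rng.sample(range(m), c)))
    nodes = 0

    def consistent(assoc, partial):
        scene, model = assoc
        if model is None:
            return True
        ok = correct.get(scene) == model
        if rng.random() >= (p3 if ok else p1):        # unary test
            return False
        for s2, m2 in partial:                        # binary tests vs. earlier pairs
            if m2 is None:
                continue
            pair_ok = ok and correct.get(s2) == m2
            if rng.random() >= (p4 if pair_ok else p2):
                return False
        return True

    def expand(depth, used, partial, non_null):
        nonlocal nodes
        if non_null >= t or depth == s:               # heuristic termination / leaf
            return
        for model in [mm for mm in range(m) if mm not in used] + [None]:
            nodes += 1
            assoc = (depth, model)
            if not consistent(assoc, partial):
                continue
            expand(depth + 1,
                   used | {model} if model is not None else used,
                   partial + [assoc],
                   non_null + (model is not None))

    expand(0, frozenset(), [], 0)
    return nodes

def empirical_bounds(s, m, c, t=3, p1=0.1, p2=0.1, p3=0.99, p4=0.99,
                     trials=100, seed=0):
    rng = random.Random(seed)
    counts = sorted(simulate_search(s, m, c, t, p1, p2, p3, p4, rng)
                    for _ in range(trials))
    lo = counts[max(0, trials // 20 - 1)]             # 5th smallest for 100 trials
    hi = counts[(trials * 19) // 20 - 1]              # 95th smallest for 100 trials
    return lo, hi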
Our conclusion regarding the probabilistic model of predicate behavior is that it can only be used as a rough guide for determining tree size. As mentioned earlier, the four-probability model is too simplistic to reflect the complexity of models, scenes, and noise sources. Extensions of the simulation-based approach to more realistic error models (with their parameters learned from object models) might offer a clearer look 148 at search behavior under our set of constraints. 149 5.7 Summary This chapter has described the BONSAI recognition module, which takes segmented scene descriptions, selects a candidate model subset from the object model database, and performs interpretation tree search, verifying hypotheses with adequately con- strained pose estimates. Experiments with the system appear in Chapter 6. Our extension of IT search to work with curved surface primitives is one of the first re- ported in the literature. We have also developed the concept of restricted IT search, in which we allow model entities to be employed in hypothesis associations at most once. Restricted search trees are smaller and require less effort to search, and the underlying assumption of accurate segmentation for the scene can be relaxed through hypothesis extension. Unlike some recent systems [27], BONSAI was not implemented with speed in mind; rather BONSAI serves as a powerful testbed for exploration of search-based scene interpretations involving geometric constraints. Once reliable segmentation procedures are developed for surface primitives such as general surfaces of revolution, or any curve or surface primitive described in terms of orientation and location pa- rameters, BONSAI’s constraints can be extended to accommodate those primitives. Paradoxically, BONSAI’s worst performance is associated with polyhedral models, because the planar faces of polyhedra do not provide as many constraints on identity and pose as curved primitives. Redundant features (such as inter-patch angles) can lower the power of pruning constraints, causing many hypotheses (hundreds or thou- sands) to survive the search process. Since each hypothesis is expensive to verify, we would like as few as possible to survive the search. One advantage of the hypothesis-verification procedure is large-grain parallelism. Since each verification is independent of all others, we could distribute the queue of surviving hypotheses to multiple computational platforms, have them perform the verification concurrently, and simply collect the hypothesis scores when they have finished. Alternatively, we could examine parallel implementations of the IT search itself. Kumar and Kanal [74] developed two parallel variants of the SSS* (state-space search) technique originally proposed by Stockman [107]. In one variant, multiple processors search the problem tree concurrently, ‘competing’ to find the optimal solution. In the other, the search space is partitioned into disjoint subsets, and a processor is dedicated to each subset. Application of parallel search methods 150 to the IT search problem requires that a merit (or objective) function be available in order to compute the ‘goodness’ of a node in the tree. While the matching score developed in Section 5.5 might be usable as a merit function, it is only defined at terminal nodes in the search tree, and an extended definition would be required in order to prioritize the search at nonterminal nodes. Recent efforts [45, 46] have focused on formal bounds on the effort expended during search. 
Tractable analysis requires a simple model of the error probabilities of pruning tests. We have conducted some preliminary simulations of an extended search model to obtain empirical bounds on recognition. Chapter 6 Experiments This chapter describes three series of experiments conducted with the BONSAI object recognition system. Twenty object models were used. Synthetic images of isolated objects allow us to estimate the accuracy of pose estimation. Real images of isolated objects are also examined, giving estimates of the error rate in recognition. Several images of occluded scenes were also processed; they illustrate BONSAI’s ability to resolve multiple objects in a single scene. 3D object recognition is a popular research topic, and several systems have been described at length in the computer vision literature. However, many (if not most) of these systems are not thoroughly tested. Vision researchers have recently begun to address this shortcoming [52]. At a minimum, an object recognition system should be tested on ‘many’ images, to obtain statistically-reliable estimates of the various error rates (e.g. the incorrect-identification and false-reject error rates). It would be desirable for all images used in testing to be obtained from sensors rather than synthetically-generated. These goals obviously conflict with the expense often in- volved in taking images. As an attempt to address this problem, we have tested our system on over two hundred images, half being ‘real’ images (obtained from our 100X sensor), and half being synthetically generated. Thorough testing is not a substitute for good performance. The thoroughness of our tests means that we have a better grasp of the error rates, and not that those rates are lower than others. Comparison of BONSAI’s accuracy with that of other systems must wait until those systems also undergo similar tests. 151 152 6.1 Pose Estimation Accuracy: Synthetic Imagery Five images of each of our twenty models1 were synthesized using the polygon scan- conversion approach described in Section 2.4.5. Representative views appear in Fig- ure 2.20. These images were then segmented using the techniques outlined in Chap- ter 4. The resulting scene descriptions were presented to BONSAI, which was asked to estimate the pose of the object. No selection of candidate models was performed by the system; only the correct model was present in the database. Tables 6.1 and 6.2 show the mean and standard deviations of errors in estimates of the rotation and translation for nonsymmetric and rotationally-symmetric objects, respectively. Note that for nonsymmetric objects, the pose rotation can be expressed uniquely as a 3D vector v through the object, an angle of rotation 9 about v, and a translation A = (Ax,Ay, Az). If we denote the corresponding estimates as 0,5, and A, respectively, the three columns of numbers in Table 6.1 are the mean and standard deviation of cos"(v - 0) (in degrees), I 0 -d | (in degrees), and [IA - All (in inches), for five views of the model. For rotationally-symmetric objects, one of the three components of the rigid rotation is undefined, and we report the mean and standard deviation of the angle (in degrees) between the known axis of symmetry and the estimated axis of symmetry. For the two spherical models (BALL and BALL2), only the mean and standard deviation of the translation error is reported, as the orientation is not uniquely determined. 
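The error measures reported in Tables 6.1 and 6.2, namely the angle between the true and estimated rotation axes, the absolute difference of the rotation angles, and the Euclidean norm of the translation error, can be computed as in the short Python sketch below; the argument names and units are assumptions made for illustration.

import math

def pose_errors(axis_true, angle_true_deg, trans_true,
                axis_est, angle_est_deg, trans_est):
    """Errors for a nonsymmetric object: angle between true and estimated rotation
    axes (degrees), absolute rotation-angle difference (degrees), and Euclidean
    translation error (same units as the inputs, inches in Chapter 6)."""
    dot = sum(a * b for a, b in zip(axis_true, axis_est))
    norm = (math.sqrt(sum(a * a for a in axis_true))
            * math.sqrt(sum(a * a for a in axis_est)))
    axis_err = math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))
    angle_err = abs(angle_true_deg - angle_est_deg)
    trans_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(trans_true, trans_est)))
    return axis_err, angle_err, trans_err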
For most of our objects, the accuracy of pose estimates is very good (within 1° for angles and 0.1 inch for translations). In our experiments, those models producing large average pose errors actually were misidentified in some of the views. Therefore, the pose estimate could not be expected (in general) to be correct anyway. If higher accuracy were desired, these estimates could serve as input to an iterative pose- improvement procedure which would attempt small adjustments in the angles and translations to optimize a matching criterion. Alternatively, another sensor (perhaps a tactile sensor) could be used to locate the object to the desired precision. 1See Section 2.5 for descriptions of the object models. 153 Table 6.1: Nonsymmetric objects: mean and standard deviation of errors in axis orientation, axis rotation, and translation. Object Axis orientation Rotation amt axis 'II’anslaFton (inches) l (are Idesrees) Mean Std. dev. Mean Std. dev. Mean Std. dev. BALL - ~ - - 0.020 0.001 BALL2 - - - - 0.018 0.002 BIGWYE 5.614 6.098 0.762 1.159 0.488 0.505 BLOCK] 0.162 0.165 0.087 0.128 0.014 0.007 BLOCK2 0.332 0.486 0.299 0.300 0.022 0.017 BLOCK4 0.570 0.871 0.644 1.212 0.029 0.041 BOX2IN CH 1.479 0.842 1.602 0.934 0.086 0.007 COLUMNI 0.337 0.353 0.584 0.508 0.026 0.025 COLUMN 2 0.334 0.232 0.333 0.263 0.027 0.007 CURVBLOCK 0.216 0.221 0.279 0.303 0.158 0.028 GRNBLK3 0.668 0.0294 0.231 0.251 0.014 0.007 HUMP 0.237 0.492 0.822 0.772 0.172 0.008 Table 6.2: Symmetric objects: mean and standard deviation of errors in axis of symmetry and translation. Object Axis of symmetry It'anslation (incheE) orientation (degrees) Mean I Std. dev. Mean I Std. dev. ADAPTER 0.537 0.588 0.022 0.009 AGPART2 2.892 4.158 0.059 0.031 CAP 0.081 0.045 0.417 0.611 HALFSPH 0.059 0.050 0.008 0.002 HARRISCUP 0.103 0.031 0.028 0.010 PISTON 0.086 0.038 0.039 0.024 PROPANE 0.078 0.040 0.066 0.0534 TAPEROLL 0.079 0.064 0.133 0.100 154 6.2 Recognition Accuracy: Real Imagery Five images of each of our twenty objects were obtained from the 100X scanner and segmented using the techniques in Chapter 4. During the image acquisition process, we took care to position the objects so that enough faces were visible to provide us with an adequately constrained pose estimate. When scanning polyhedra, for example, we tried to ensure that at least three non-coplanar faces were visible. This restriction on image acquisition could be relaxed by incorporating knowledge about the bounds (in the image plane) on the object surfaces. This is a topic for future research, however; at present, we employ only the geometric parameters in the pose estimation procedure. The resulting scene descriptions were input to BONSAI, which selected a subset of models to search and produced a ranked list of recognition hypotheses. The confusion matrix in Table 6.3 displays the correct and incorrect identities of the top-ranked hypotheses for each experiment. A nonzero number m", in the matrix at row r and column c means that m" number of views of model r were identified by BONSAI as model c. Correct interpretations appear on the diagonal of the matrix, while off- diagonal numbers indicate misinterpretations. In addition, there is an extra column in the matrix, corresponding to the reject option, where no hypotheses were generated. Summarizing the results, we have an overall recognition accuracy of 88%, a false reject rate of 2%, and an incorrect recognition rate of 10%. These error rates are comparable to those obtained by Lu et al. 
[79] from their CAD-based vision system (using intensity data). Hoffman et al. [55] obtained perfect error rates on a twelve-object database, but used only synthetic data. Instead of accepting only the top-ranked recognition hypothesis, we could select the top k hypotheses. In 92% of our test views, the correct hypothesis was top-ranked or second-ranked. In 95% of our test views, the correct hypothesis was among the top three hypotheses ranked by the matching score. Most of the errors made by our system were plausible interpretations of the scene. For example, many of the polyhedral or partially polyhedral objects in our database exhibit views which are qualitatively similar, and we were not surprised to see our system have difficulty recognizing the correct model in such ambiguous cases. Many of our polyhedral models have 90° dihedral edges and corners; in some views containing only three planes (i.e. with only one visible corner), the system produces many candidate hypotheses, which match that corner equally well, and will not be rejected 155 Table 6.3: Confusion matrix for recognition accuracy experiments. Classification 9 10 11 13 1 2 3 G 3 3 7 I 13 1O 15 13 17 13 1’ 20 REJECT 1 3 U 0 O 0 O 0 U U 0 0 0 0 0 0 O O 0 0 0 0 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 O 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 C 0 0 O I 0 0 O 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 O 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 O O I 0 0 0 0 0 3 0 0 0 0 0 O 0 O 0 0 O 0 0 0 0 7 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 G 0 0 0 0 0 0 0 0 1 0 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 O 3 0 0 0 0 0 0 0 0 0 O 0 Object 11 0 0 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 13 0 1 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 14 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 13 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 1 17 0 O 0 0 0 1 0 0 0 0 0 0 O 0 0 0 G O 0 0 0 13 0 0 0 0 0 0 0 O O 0 0 0 0 0 0 1 0 3 0 0 1 19 0 0 O 0 0 0 O 0 0 O 0 0 0 0 O 0 2 0 3 0 0 20 0 0 0 O 0 0 0 0 O O 0 0 0 0 0 0 0 1 0 G 0 Row and Column Indices: 1: ADAPTER, 2: AGPARTZ, 3: BALL, 4: BALL2, 5: BIGWYE, 6: BLOCKl 7: BLOCK2, 8: BLOCK4, 9: BOX2INCH, 10: CAP, 11: COLUMNl 12: COLUMN2, 13: CURVBLOCK,14: GRNBLK3, 15: HALFSPH 16: HARRISCUP, 17: HUMP, 18: PISTON, 19: PROPANE, 20: TAPEROLL. by the geometric constraints. Alternative area-based or edge-based matching scores could be used in the place of the technique given in Section 5.5, which compares the range data and segmentations but not surfaces or edges. Figure 6.1 illustrates one of the misinterpretations for the TAPEROLL object. Figures 6.1(a), (b), and (c) show the input range image, pseudo-intensity image, and segmentation image. Figures 6.1(d), (e), and (f) show the range, pseudo-intensity, and segmentation images for the highest-ranked hypothesis, involving the PISTON object (which has a similar radius to the outside surface of the TAPEROLL object). The score for this hypothesis was 0.52. Figures 6.1(g), (h), and (i) illustrate one of the correct hypotheses generated by BONSAI; its score was 0.28. 6.3 Cluttered Scene Processing Ten images, each containing two model objects, were obtained from the 100X sensor, processed, and input to BONSAI, which produced a ranked list of hypotheses re- garding the image content. We saved the first hypothesis and removed its associated 156 (G) (h) (i) Figure 6.1: A misinterpretation of the TAPEROLL object. 157 Table 6.4: Recognition results for ten cluttered scenes. 
Image name Figure number Number of correct Number of subjectively interpretations good pose estimates ADAPTER CURVBLOCK 6.2 2 2 BALL2/PROPANE . 6.3 2 2 BIGWYE/TAPEROLL 6.4 2 2 . BLOCK2/HARRISCUP 6.5 2 2 BLOCK4/HALFSPH 6.6 2 2 CAP/BOX2INCH 6.7 0 0 COLUMNI /BLOCK1 6.8 0 0 COLUMN2/BALL 6.9 1 1 GRNBLK3/PISTON 6.10 2 1 HUMP/AGPART2 6.11 2 2 scene entities from the system’s scene description. BONSAI then produced another list of hypotheses, and we saved the highest-ranked of these. In cases where we do not know the number of objects present in the scene, a different strategy, such as selecting maximally-disjoint interpretations from the output of a single run of BONSAI, can be used. Since we knew a priori the number of objects in each test image, we chose to use this information during interpretation. Table 6.4 summarizes the results of these ten recognition experiments. Figures 6.2 through 6.11 show the input range images, pseudo-intensity images, segmentations, and top and second-ranked hypotheses for the experiments. In the ADAPTER/CURVBLOCK image, the original segmentation contained twelve patches, and the pruning constraints proved too weak to effectively trim the interpretation tree. Consequently, several thousand hypotheses of identity and po- sition for the CURVBLOCK object were produced. In order to obtain results in a reasonable amount of time, we removed six of the planar patches from the original segmentation and ran BONSAI on the remainder. In the CAP / BOX2IN CH image, neither hypothesis of identity is correct. In the case of the cube, a corner of the CURVBLOCK object was matched to the visible comer of the cube and produced a higher hypothesis score (0.213), although the correct hypothesis of identity had a score which was only slightly lower (0.184). More a priori knowledge about the sensing environment could allow us to reject the CURVBLOCK hypothesis; for ex- ample, we could specify that the images contain the entire model. In that case, we would reject CURVBLOCK as a candidate model since the field of view is smaller than the hypothesized model. The CAP object in the scene was identified as the ._. on m (d) (e) Figure 6.2: Recognition results for ADAI’TER/CI'RVBLOCK scene. (a): range image. (b): pseudo-intensity image. (c): segmentation (modified; see text). (d): Top-ranked hypothesis. (1‘): Seminil-ranked hypothesis. 1:39 (d) (e) Figure 6.3: Recognition results for BALL2/PROPANE scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (<1): Top-ranked hypothesis. (e): Second- ranked hypothesis. 160 (d) (e) Figure 6.4: Recognition results for BIGWYE/TAPEROLL scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis. 161 (d) (e) Figure 0.5: Recognition results for BLOCKZ/HARRISC‘UP scene. (a): range image. (It); pseudo—intensity image. (e): segmentation. (t1): Top-ranked hypothesis. (e): Second—ranked hypothesis. (d) (e) Figure 6.6: Recognition results for BLOCK-l/IIALFSPH scene. (a): range image. (b): pseudo-intensity image. (e): segmentation. (d): Top-ranked hypothesis. (6): Second-ranked hypothesis. 163 ADAPTER model; the ADAPTER hypothesis score was 0.122, while the correct hy- pothesis’ score was 0.106. Similar radii and identical angle relations between. the planar and cylindrical patches caused both models to be considered. 
The COLUMNI /BLOCK1 image in Figure 6.8 produced hypotheses which, while incorrect, are plausible and highlight additional ways to extend BONSAI’s use of con- straints. The two highest-ranked hypotheses match the COLUMNl objects against the COLUMN2 model, and the BLOCKl hypothesis has a reflected orientation es- timate. The COLUMN2 interpretation is plausible since such an image could be produced by occlusion of part of COLUMN2 by BLOCKl. The correct hypothesis involving COLUMN 1 had a score of 0.082, versus a score of 0.085 for COLUMN2. One possible way to prune this sort of interpretation would involve configuration un- derstanding, in which the system, using interference tests and an a priori rule which states that objects cannot interpenetrate, would reject the COLUMN2 hypothesis because the scene content indicates that the other scene patches cannot be in their observed positions under that hypothesis. The three largest patches from the im- age of BLOCKI were incorrectly assigned to the BLOCKI model. Since the patches involved in this interpretation are planar and mutually orthogonal, this interpreta- tion is plausible, and the absence of the slanted patch in the input image eliminated a constraint that otherwise would have pruned the interpretation. The correct interpre- tation had a score of 0.135, as opposed to the incorrect interpretation’s score of 0.147. The triangular-shaped patch in the input image has been matched to a rectangular model surface. Assuming that 2D shape could be reliably extracted from the seg- mentation images, an additional (unary) predicate (namely, a patch-shape predicate) could increase the power of BONSAI in these situations. In the COLUMN 2/ BALL image, the ball was correctly recognized, but the COL— UMN2 object was misidentified as an instance of CURVBLOCK, based on an in- correct assignment between the three planar faces forming the upper right corner of the object. The correct hypothesis of identity had a score of 0.030 as opposed to the incorrect hypothesis’ score of 0.053. The addition of the cylindrical patch from COLUMN2 to a correct hypothesis of identity would provide an additional constraint, which could raise the correct hypothesis’ score. In this image (as in the previous im- age), the right-angle corner feature involved in the interpretation is not particularly salient (discriminatory) for many of the models. Improvements of the search proce- dure would concentrate on more salient features, and use the nonsalient artifacts only 161 (d) (e) Figure 6.7: Recognition results for CAP/BOXQINCH scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. ((1): Top-ranked hypothesis. (6): Second- ranked hypothesis. 165 (d) Figure 6.8: Recognition results for COLI'MNl/BLOC‘K] scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis. 166 during verification. In the GRNBLK3/ PISTON image, the PISTON object was identified correctly, but the GRNBLK3 hypothesis, while correct in terms of object identity, also ex- hibits the problem with a nonsalient corner feature matching against the wrong pla— nar patches. Only one patch for this object has discriminatory angle relations with neighboring patches, and its relatively small area caused it to be left out of the initial IT search. 167 Figure 6.9: Recognition results for COLI'MN'Z/BALL scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top—ranked hypothesis. (e): Second- ranked hypothesis. 
Figure 6.10: Recognition results for GRNBLK3/PISTON scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.

Figure 6.11: Recognition results for HUMP/AGPART2 scene. (a): range image. (b): pseudo-intensity image. (c): segmentation. (d): Top-ranked hypothesis. (e): Second-ranked hypothesis.

6.4 Discussion

Experiments with synthetic and real data indicate that BONSAI can produce correct scene interpretations in situations where views are not underconstrained (in terms of pose) and models have a significant number of salient features. Models or scenes with structural similarities or nonsalient features can occasionally yield incorrect interpretations, but those interpretations are usually plausible nonetheless. As a testbed for constraint-based IT search, BONSAI is a success, in our opinion. However, BONSAI is not particularly fast. Scenes can take from seconds to hours (even days!) to interpret, depending on the number of hypotheses generated, the scene complexity, the saliency of unary and binary relations, and the size of the model database. Since curved surfaces provide more constraints on interpretation, BONSAI (unlike most other systems we are aware of) typically resolves the identity of nonpolyhedral objects more quickly than polyhedra. This suggests that BONSAI could be extended through the addition of constraints designed specifically for polyhedral models, which would improve the system's overall speed. Our present set of experiments yields a recognition error rate for isolated objects which is comparable with that of other systems described in the literature, pose estimates of good quality, and marginal interpretation quality for cluttered scenes, although additional constraints (specifically, the computation of feature saliency) can be employed to improve the results.

Chapter 7
Conclusions and Recommendations for Future Research

This thesis has explored the problem of 3D object recognition using CAD models as a source of part geometry. Given a depth map of a scene containing (in general) several parts, the BONSAI system derives a symbolic description of the image and forms hypotheses of part identity and pose (location and orientation) using heuristic search constrained by unary and binary geometric predicates, followed by hypothesis verification using synthetic imagery. In experiments on over two hundred images (both real and synthetic) of twenty object models, the system exhibited 88% accuracy in identification of isolated objects, and produced reasonably accurate pose estimates (translations within a few tenths of an inch and rotations within a few degrees). Incorrect scene interpretations are usually plausible interpretations of the scene. In isolated-object recognition experiments, 95% of the correct hypotheses were among the three highest-ranked hypotheses for each image.

The primary contributions of this research are a methodology for automatically converting solid models from CAD systems to relational graphs which are used by the object recognition module, and the recognition system itself, which advances the interpretation tree paradigm for 3D recognition into the domain of curved surfaces. BONSAI's major shortcomings are:

• a brittle, data-driven segmentation procedure;
• a time-consuming verification process;
• weakness of search predicates for certain object types; and
• the inherent exponential time complexity of IT search.
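The constrained interpretation-tree search summarized above can be pictured with a short sketch. The C fragment below is purely illustrative: the unary and binary predicates are stand-in stubs (BONSAI's actual predicates test surface type, area, visibility, proximity, and relative orientation), and the patch and face counts are hypothetical.

    /* A minimal sketch of interpretation-tree (IT) search with unary and
     * binary pruning.  The predicates are stubs and the data hypothetical. */
    #include <stdio.h>

    #define N_SCENE  3     /* scene surface patches            */
    #define N_MODEL  4     /* faces of one candidate model     */
    #define NIL     -1     /* "patch matches no model face"    */

    static int assignment[N_SCENE];  /* assignment[i] = model face for patch i */

    /* Unary test: may scene patch s be matched to model face m at all?     */
    static int unary_ok(int s, int m)  { return (s + m) % 2 == 0; }  /* stub */

    /* Binary test: is (s1,m1) geometrically consistent with (s2,m2)?
     * In BONSAI this compares, e.g., inter-surface angles and distances.   */
    static int binary_ok(int s1, int m1, int s2, int m2)
    {
        return (s1 + m1 + s2 + m2) % 3 != 0;                         /* stub */
    }

    static void report(void)
    {
        printf("hypothesis:");
        for (int i = 0; i < N_SCENE; i++) printf(" %d->%d", i, assignment[i]);
        printf("\n");                    /* -1 means the patch is unmatched */
    }

    /* Depth-first expansion: at depth `level` we try every model face (and
     * NIL) for scene patch `level`, pruning any branch that violates a
     * unary or binary predicate.                                           */
    static void search(int level)
    {
        if (level == N_SCENE) { report(); return; }  /* surviving hypothesis */

        for (int m = NIL; m < N_MODEL; m++) {
            if (m != NIL && !unary_ok(level, m)) continue;
            int consistent = 1;
            for (int prev = 0; prev < level && consistent; prev++)
                if (m != NIL && assignment[prev] != NIL &&
                    !binary_ok(prev, assignment[prev], level, m))
                    consistent = 0;
            if (!consistent) continue;
            assignment[level] = m;
            search(level + 1);
        }
    }

    int main(void)
    {
        search(0);
        return 0;
    }

The binary test applied against every earlier assignment is what keeps the exponential tree manageable; when the predicates are weak, as noted in the list of shortcomings above, far more branches survive to become hypotheses.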
Clearly, the domain of automated manufacturing constrains the types of images we would expect to obtain from a range sensor. One area we will address in the future is the integration of knowledge about part geometry into the segmentation procedure. For example, if the parts in the current model database contained only a few cylindrical patches with specified radii, we could tailor a model-driven segmentation system with specialized 'cylinder-finding' modules appropriate to the database. Customized segmentation procedures offer increased robustness to noise in many cases, since the dimension of the parameter spaces involved in surface fitting will often be lower for curved surfaces. In addition, the successful extraction of a certain surface imposes constraints on interpretation. If our model database contains only one cylindrical patch with a radius of 1 inch, the successful extraction of that patch immediately suggests that the corresponding model is present in the scene. In this way, we can constrain interpretation tree search before the fact in addition to the pruning done within the search module.

A large number of hypotheses can be generated from some scenes. At present, we verify hypotheses by generating a synthetic range and segmentation image from the CAD model and computing an area-based matching score. Other verification approaches are possible, however. Instead of matching on patch areas (which is time-consuming), we could match on region boundaries (edges) by accumulating the distances between observed edge pixels for each pair of patches and their predicted locations (obtainable from the pose estimate and coordinates of edge points obtained from the CAD model). In Chapter 5, we stated that verifications can be performed in parallel, since each hypothesis is independent of the others. We could also save time by reducing the resolution of the synthetic imagery generated during verification (this might be termed coarse verification). Only hypotheses with high scores from the coarse verification module would be verified at the same resolution as the input image. Alternative approaches to verification can be explored as well. Our difficulties in extracting crease edges from range data precluded exploration of edge-based verification; however, if a model-driven segmentation procedure were developed, we might be able to extract and accurately characterize creases and jump edges, and use them instead of patches as verification entities.

Our object database contained several polyhedral (or mostly polyhedral) objects with 90 degree angles between adjacent faces. BONSAI typically produces a large number of hypotheses (hundreds or thousands) for such objects. The reasons for this behavior are twofold. First, planar surfaces inherently provide fewer opportunities for search tree pruning than curved surfaces. Planes do not possess intrinsic parameters other than their planarity, so only type, area, visibility, and orientation pruning can be performed. BONSAI depends heavily on orientation relations to prune the search tree, and for the boxlike objects with similar orientation relations between faces, the orientation predicate is much less powerful.
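The edge-based verification score proposed earlier in this chapter reduces to accumulating point-to-edge distances. The C sketch below illustrates that computation under assumed data structures; the names, the sampling of predicted edge points, and the acceptance threshold are hypothetical rather than BONSAI's.

    /* Sketch of the edge-based verification score: predicted edge points
     * (obtained by projecting model edges with the hypothesized pose) are
     * compared with observed edge pixels, and the average distance to the
     * nearest observed edge pixel becomes a lower-is-better score.
     * Data structures and thresholds are hypothetical.                    */
    #include <math.h>
    #include <stdio.h>

    typedef struct { double r, c; } Pixel;   /* row, column coordinates */

    static double nearest_distance(Pixel p, const Pixel *edges, int n_edges)
    {
        double best = 1e30;
        for (int i = 0; i < n_edges; i++) {
            double dr = p.r - edges[i].r, dc = p.c - edges[i].c;
            double d = sqrt(dr * dr + dc * dc);
            if (d < best) best = d;
        }
        return best;
    }

    /* Mean distance from each predicted edge point to the observed edge map. */
    static double edge_score(const Pixel *predicted, int n_pred,
                             const Pixel *observed, int n_obs)
    {
        double sum = 0.0;
        for (int i = 0; i < n_pred; i++)
            sum += nearest_distance(predicted[i], observed, n_obs);
        return n_pred > 0 ? sum / n_pred : 1e30;
    }

    int main(void)
    {
        Pixel predicted[] = { {10, 10}, {10, 11}, {10, 12} };  /* hypothetical */
        Pixel observed[]  = { {10, 10}, {11, 11}, {12, 12} };  /* hypothetical */
        double score = edge_score(predicted, 3, observed, 3);
        printf("mean edge distance = %.2f (accept if below a threshold)\n", score);
        return 0;
    }

Because only boundary points are touched, a score of this form should be considerably cheaper than rendering and comparing full synthetic range images, at the cost of depending on reliable edge extraction.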
To address the weakness of the orientation predicate for such objects, we might wish to relax our heuristic termination requirement. Instead of terminating a search once a consistent hypothesis with three associations has been found, we can let search proceed one or more levels below the termination point, in the hope that more pruning would take place, producing a smaller number of net hypotheses. For the polyhedral cases, this enlargement of the search tree might significantly reduce the number of surviving hypotheses.

Both theoretical studies [45] and practical experience show that recognition via search has exponential time complexity in cluttered scenes. Instead of making modifications to the IT search procedure, we might wish to abandon a search-based approach entirely. Geometric hashing [75, 53, 27] is a relatively new approach to recognition, in which a hash table indexed by invariants computed from subsets of model entities is created off-line. The hash table entries consist of the model identity and a pose estimate. Recognition is performed by computing similar invariants from the input scene, and extracting the model identity and pose from the table. Although previous work has focused primarily on 2D models and scenes, extension of the work to 3D might produce systems which can recognize objects in polynomial time. The time and space consumed during construction of the hash table might be a barrier to practical implementation of a hashing-based recognition system. It is important to note that IT search is not the only way to recognize 3D objects in 3D scenes.

Appendix A
Program Input and Output

A.1 Model-builder

Figure A.1 shows a portion of the IGES description of the GRNBLK3 model. The comments in square brackets are not part of the file; they explain the meaning of some of the entities. Selected lines from the LISP code generated by the geometric inference engine appear in Figure A.2. In both figures, deleted sections are marked with ellipsis (...).

A.2 Segmentation Output

Figure A.3 shows the LISP code generated by the segmentation and segment classification modules of the vision system; the corresponding input range images appear in Figures 4.2 and 4.8.

Figure A.1: A portion of the IGES file for the polyhedron.
Figure A.2: S-expressions generated automatically from the polyhedral model.

Figure A.3: LISP description for the scene of Figure 4.8.

Bibliography

[1] G. Agin and T. Binford. Computer Description of Curved Objects. In Proc. 3rd International Joint Conference on Artificial Intelligence, pages 1113-1115, 1973.
[2] P. Allen and P. Michaelman. Acquisition and Interpretation of 3-D Sensor Data from Touch. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 33-40, 1989.
[3] D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, New Jersey, 1982.
[4] A. H. Barr. Global and Local Deformations of Solid Primitives. In ACM Computer Graphics SIGGRAPH '84, pages 21-30, July 1984.
[5] H. R. Berenji and B. Khoshnevis. Use of Artificial Intelligence in Automated Process Planning. Computers in Mechanical Engineering, pages 47-55, September 1986.
[6] R. Bergevin and M. D. Levine. Generic Object Recognition: Building Coarse 3D Descriptions from Line Drawings. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 68-74, November 1989.
[7] P. J. Besl. Active Optical Range Imaging Sensors. In J. L. C. Sanz, editor, Advances in Machine Vision: Architectures and Applications. Springer-Verlag, 1988.
[8] P. J. Besl. Geometric Modeling and Computer Vision. Proc. IEEE, 76(8):936-958, August 1988.
[9] P. J. Besl and R. C. Jain. Three-Dimensional Object Recognition. Computing Surveys, 17(1):75-145, March 1985.
[10] P. J. Besl and R. C. Jain. Segmentation Through Variable-Order Surface Fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(2):167-192, 1988.
[11] B. Bhanu (ed.). Special Issue on CAD-Based Robot Vision. COMPUTER, August 1987.
[12] I. Biederman. Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 94(2):115-147, 1987.
[13] D. Blostein and N. Ahuja. Shape From Texture: Integrating Texture-Element Extraction and Surface Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(12):1233-1251, December 1989.
[14] R. M. Bolle and D. Sabbah. Differential Geometry Applied to Least-Square Error Surface Approximations. In Optical and Digital Pattern Recognition, volume 754, pages 117-127. SPIE, 1987.
[15] R. C. Bolles and P. Horaud. 3DPO: A Three-Dimensional Part Orientation System. International Journal of Robotics Research, 5(3):3-26, Fall 1986.
[16] R. C. Bolles, P. Horaud, and M. J. Hannah. 3DPO: A Three-Dimensional Part Orientation System. Proc. 8th Int. Joint Conf. on Artificial Intelligence, pages 1116-1120, August 1983.
[17] T. E. Boult and A. D. Gross. On the Recovery of Superellipsoids. In Proc. DARPA Image Understanding Workshop, pages 1052-1063, 1988.
[18] J. Brady, N. Nandhakumar, and J. Aggarwal. Recent Progress in the Recognition of Objects from Range Data. In Proc. 9th Int. Conf. on Pattern Recognition, pages 85-92, November 1988.
[19] H. Brody. CAD Meets CAM. High Technology, pages 12-16, 1987.
[20] R. A. Brooks. Symbolic Reasoning Among 3-D Models and 2-D Images. Artificial Intelligence, 17(1-3):285-348, 1981.
[21] R. A. Brooks. Model-Based Three-Dimensional Interpretations of Two-Dimensional Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):140-150, March 1983.
[22] R. A. Brooks. Model-Based Computer Vision. UMI Research Press, 1984.
[23] C. M. Brown. Some Mathematical and Representational Aspects of Solid Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-3(4):444-453, July 1981.
[24] C. M. Brown. PADL-2: A Technical Summary. IEEE Computer Graphics and Applications, 2:69-84, March 1982.
[25] M. S. Casale. Free-Form Solid Modeling with Trimmed Surface Patches. IEEE Computer Graphics and Applications, pages 33-43, January 1987.
[26] C. Chen and A. Kak. 3D-POLY: A Robot Vision System for Recognizing Objects in Occluded Environments. Technical Report TR-EE 88-48, School of Electrical Engineering, Purdue University, December 1988.
[27] C. Chen and A. Kak. A Robot Vision System for Recognizing 3-D Objects in Low-Order Polynomial Time. IEEE Trans. on Systems, Man, and Cybernetics, 19(6):1535-1563, November/December 1989.
[28] S.-W. Chen. 3D Representation and Recognition Using Object Wings. PhD thesis, Department of Computer Science, Michigan State University, 1989.
[29] S.-W. Chen and G. Stockman. Object Wings - 2 1/2-D Primitives for 3-D Recognition. In Proc. IEEE 1989 Conf. on Computer Vision and Pattern Recognition, pages 535-540, June 1989.
[30] S.-W. Chen, G. Stockman, and N. Shrikhande. Computing a Pose Hypothesis from a Small Set of 3-D Object Features. Technical Report MSU-ENGR-87-001, Department of Computer Science, Michigan State University, 1987.
[31] R. T. Chin and C. R. Dyer. Model-Based Recognition in Robot Vision. Computing Surveys, 18(1):67-108, March 1986.
[32] J. Coggins and A. Jain. A Spatial Filtering Approach to Texture Analysis. Pattern Recognition Letters, 3:195-203, 1985.
[33] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK Users' Guide. Society for Industrial and Applied Mathematics, 1979.
[34] S. A. Dudani, K. J. Breeding, and R. B. McGhee. Aircraft Identification by Moment Invariants. IEEE Transactions on Computers, C-26(1):39-46, 1977.
[35] D. Eggert and K. Bowyer. Computing the Orthographic Projection Aspect Graph of Solids of Revolution. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 102-108, November 1989.
[36] T.-J. Fan, G. Medioni, and R. Nevatia. Segmented Descriptions of 3-D Surfaces. IEEE Journal of Robotics and Automation, pages 527-538, December 1987.
[37] T.-J. Fan, G. Medioni, and R. Nevatia. Recognizing 3-D Objects Using Surface Descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1140-1157, 1989.
[38] F. Ferrie, J. Lagarde, and R. Bajcsy. Darboux Frames, Snakes, and Super-Quadrics: Geometry from the Bottom-Up. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 170-176, 1989.
[39] P. J. Flynn and A. K. Jain. Surface Classification: Hypothesis Testing and Parameter Estimation. In Proc. IEEE 1988 Conf. on Computer Vision and Pattern Recognition, pages 261-267, June 1988.
[40] P. J. Flynn and A. K. Jain. CAD-Based Computer Vision: From CAD Models to Relational Graphs. In Proc. IEEE 1989 Int. Conf. on Systems, Man, and Cybernetics, pages 162-167, November 1989.
[41] P. J. Flynn and A. K. Jain. On Reliable Curvature Estimation. In Proc. IEEE 1989 Conf. on Computer Vision and Pattern Recognition, pages 110-116, June 1989.
[42] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw., 3(3):209-226, September 1977.
[43] Z. Gigus and J. Malik. Computing the Aspect Graph for Line Drawings of Polyhedral Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):113-122, February 1990.
[44] R. N. Goldman. Two Approaches to a Quadric Model for Surfaces. IEEE Computer Graphics and Applications, 3(1):21-24, September 1983.
[45] W. E. L. Grimson. The Combinatorics of Object Recognition in Cluttered Environments using Constrained Search. In Proc. 1988 Int. Conf. on Computer Vision (ICCV '88), pages 218-227, 1988.
[46] W. E. L. Grimson. The Combinatorics of Heuristic Search Termination for Object Recognition in Cluttered Environments. Technical Report 1111, AI Laboratory, Massachusetts Institute of Technology, 1989.
[47] W. E. L. Grimson and T. Lozano-Pérez. Model-Based Recognition and Localization from Sparse Range or Tactile Data. International Journal of Robotics Research, 3(3):3-35, Fall 1984.
[48] A. Gupta, L. Bogoni, and R. Bajcsy. Quantitative and Qualitative Measures for the Evaluation of the Superquadric Models. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 162-169, 1989.
[49] E. L. Hall, J. K. B. Tio, C. A. McPherson, and F. A. Sadjadi. Measuring Curved Surfaces for Robot Vision. COMPUTER, 15(12):42-54, December 1982.
[50] C. Hansen, N. Ayache, and F. Lustman. Towards Real-Time Trinocular Stereo. In Proc. 2nd Int. Conf. on Computer Vision, pages 129-133, December 1988.
[51] C. Hansen and T. Henderson. CAGD-Based Computer Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1181-1193, November 1989.
[52] R. M. Haralick. Methodology for Experimental Computer Vision. In Proc. 1989 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 437-438, June 1989.
[53] Y. C. Hecker and R. M. Bolle. Is Geometric Hashing a Hough Transform? Technical Report RC 15202 (#67767), IBM Research Division, T. J. Watson Research Center, 1989.
[54] T. C. Henderson and C. Hansen. Multisensor Knowledge Systems. In A. K. Jain, editor, Real-Time Object Measurement and Classification, volume 42 of NATO ASI Series, pages 375-390. Springer-Verlag, Berlin, 1988.
[55] R. Hoffman, H. Keshavan, and F. Towfiq. CAD-Driven Machine Vision. IEEE Trans. on Systems, Man, and Cybernetics, pages 1477-1488, November/December 1989.
[56] R. L. Hoffman. Object Recognition from Range Images. PhD thesis, Department of Computer Science, Michigan State University, 1986.
[57] R. L. Hoffman and A. K. Jain. Segmentation and Classification of Range Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):608-620, September 1987.
[58] C. M. Hoffmann. Geometric and Solid Modeling: An Introduction. Morgan Kaufmann, 1989.
[59] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, Massachusetts, 1986.
[60] G. Hu and G. C. Stockman. 3-D Surface Solution Using Structured Light and Constraint Propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:390-402, 1989.
[61] D. P. Huttenlocher and S. Ullman. Object Recognition Using Alignment. In Proc. First International Conference on Computer Vision, pages 102-111, 1987.
[62] K. Ikeuchi and T. Kanade. Automatic Generation of Object Recognition Programs. Proc. IEEE, 76(8):1016-1035, August 1988.
[63] S. Inokuchi, T. Nita, F. Matsuda, and Y. Sakurai. A Three Dimensional Edge-Region Operator for Range Pictures. In Proc. 6th Int. Conf. on Pattern Recognition, pages 918-920, 1982.
[64] D. J. Ittner and A. K. Jain. 3-D Surface Discrimination from Local Curvature Measures. In Proc. IEEE 1985 Conf. on Computer Vision and Pattern Recognition, pages 119-123, June 1985.
[65] A. Jain, T. Newman, and M. Goulish. Range-Intensity Histogram for Segmenting LADAR Images. Submitted to Pattern Recognition Letters, 1990.
[66] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, 1988.
[67] A. K. Jain and R. L. Hoffman. Evidence-Based Recognition of 3-D Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):783-802, November 1988.
[68] R. C. Jain and A. K. Jain. Report: 1988 NSF Range Image Understanding Workshop. In R. C. Jain and A. K. Jain, editors, Analysis and Interpretation of Range Images, pages 1-31. Springer-Verlag, 1990.
[69] R. A. Jarvis. A Perspective on Range Finding Techniques for Computer Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):122-139, 1983.
[70] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, New York, 1986.
[71] A. Kak, A. Vayda, R. Cromwell, W. Kim, and C. Chen. Knowledge-Based Robotics. Int. J. Prod. Res., 26(5):707-734, 1988.
[72] M. R. Korn and C. R. Dyer. 3-D Multiview Object Representations for Model-Based Object Recognition. Pattern Recognition, 20(1):91-103, 1987.
[73] E. Krotkov and J. Martin. Range From Focus. In Proc. 1986 IEEE Int. Conf. on Robotics and Automation, pages 1093-1098, April 1986.
[74] V. Kumar and L. N. Kanal. Parallel Branch-and-Bound Formulation for AND/OR Tree Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):768-788, 1984.
[75] Y. Lamdan and H. Wolfson. Geometric Hashing: A General and Efficient Model-Based Recognition Scheme. In Proc. Second International Conference on Computer Vision, pages 238-249, 1988.
[76] J. Z. Levin. Mathematical Models for Determining the Intersections of Quadric Surfaces. Computer Graphics and Image Processing, 11:73-87, 1979.
[77] W.-C. Lin and T.-W. Chen. CSG-Based Object Recognition Using Range Images. In Proc. 9th Int. Conf. on Pattern Recognition, pages 99-103, November 1988.
[78] D. G. Lowe. Three-Dimensional Object Recognition from Single Two-Dimensional Images. Artificial Intelligence, 31:355-395, 1987.
[79] H. Lu, L. G. Shapiro, and O. I. Camps. A Relational Pyramid Approach to View Class Determination. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 177-183, November 1989.
[80] A. Mitiche and J. Aggarwal. Detection of Edges Using Range Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):174-178, March 1983.
[81] D. G. Morgenthaler, K. D. Gremban, M. Nathan, and J. D. Bradstreet. The ITA Range Image Processing System. In Proc. DARPA Image Understanding Workshop, pages 127-142, 1987.
[82] M. E. Mortenson. Geometric Modeling. Wiley, New York, 1985.
[83] S. G. Nadabar and A. K. Jain. MRF Model-Based Segmentation of Range Images. Submitted to Second International Conference on Computer Vision, 1990.
[84] V. S. Nalwa. Line-Drawing Interpretation: Bilateral Symmetry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(10):1117-1120, October 1989.
[85] National Institute of Standards and Technology. Initial Graphics Exchange Specification, Version 4.0. Technical report, U.S. National Institute of Standards and Technology, 1988.
[86] R. Nevatia and T. Binford. Description and Recognition of Curved Objects. Artificial Intelligence, 8(1):77-98, 1977.
[87] D. Nitzan. Three-Dimensional Vision Structure for Robot Applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3):291-309, May 1988.
[88] J. O'Rourke and N. Badler. Decomposition of Three-Dimensional Objects into Spheres. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1, July 1979.
[89] H. Park and O. Mitchell. CAD-Based Planning and Execution of Inspection. In Proc. 1988 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 858-863, June 1988.
[90] A. Pentland, T. Darrell, M. Turk, and W. Huang. A Simple, Real-Time Range Camera. In Proc. 1989 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 256-261, 1989.
[91] M. Potmesil. Generating Models of Solid Objects by Matching 3D Surface Segments. In Proc. 1983 Int. Jt. Conf. on Artificial Intelligence, pages 1089-1093, August 1983.
[92] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. H. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, England, 1988.
[93] A. Requicha and H. Voelcker. Solid Modeling: A Historical Summary and Contemporary Assessment. IEEE Computer Graphics and Applications, 2(2):9-24, March 1982.
[94] A. Requicha and H. Voelcker. Solid Modeling: Current Status and Research Directions. IEEE Computer Graphics and Applications, 3(7):25-37, October 1983.
[95] A. A. Requicha. Representations for Rigid Solids: Theory, Methods, and Systems. Computing Surveys, 12(4):437-464, December 1980.
[96] D. F. Rogers. Procedural Elements for Computer Graphics. McGraw-Hill, 1985.
[97] D. F. Rogers and J. A. Adams. Mathematical Elements for Computer Graphics. McGraw-Hill, second edition, 1990.
[98] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.
[99] R. Sampson, C. Swonger, and P. VanAtta. Real Time 3 Dimensional Image Processing for Robot Applications. Technical report, Environmental Research Institute of Michigan, 1984.
[100] P. T. Sander and S. W. Zucker. Tracing Surfaces for Surfacing Traces. In Proc. First Int. Conf. on Computer Vision (ICCV '87), pages 241-249, 1987.
[101] L. E. Scales. Introduction to Non-Linear Optimization. Springer-Verlag, New York, New York, 1985.
[102] B. Smith, J. Boyle, J. Dongarra, B. Garbow, Y. Ikebe, V. Klema, and C. Moler. Matrix Eigensystem Routines: EISPACK Guide. Springer-Verlag, 1976.
[103] F. Solina and R. Bajcsy. Recovery of Parametric Models from Range Images: The Case for Superquadrics with Global Deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):131-147, February 1990.
[104] T. Sripradisvarakul and R. Jain. Generating Aspect Graphs for Curved Objects. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 109-115, November 1989.
[105] M. Stephens, R. Blissett, D. Charnley, E. Sparks, and J. Pike. Outdoor Vehicle Navigation Using Passive 3D Vision. In Proc. IEEE 1989 Conf. on Computer Vision and Pattern Recognition, pages 556-562, 1989.
[106] C. V. Stewart and C. R. Dyer. The Trinocular General Support Algorithm: A Three-Camera Stereo Algorithm for Overcoming Binocular Matching Errors. In Proc. 2nd Int. Conf. on Computer Vision, pages 134-138, December 1988.
[107] G. C. Stockman. A Minimax Algorithm Better than Alpha-Beta? Artificial Intelligence, 12:179-196, 1979.
[108] G. C. Stockman. Object Recognition and Localization via Pose Clustering. Computer Vision, Graphics, and Image Processing, 40:361-387, 1987.
[109] G. C. Stockman, S. Kopstein, and S. Benett. Matching Images to Models for Registration and Object Detection via Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-4(3):229-241, 1982.
[110] Structural Dynamics Research Corporation. Geomod 4.0 User Guide, 1988.
[111] M. Subbarao. Direct Recovery of Depth-Map. In Proc. IEEE Workshop on Computer Vision, pages 58-65, 1987.
[112] S. Tanaka and A. C. Kak. A Rule-Based Approach to Binocular Stereopsis. In R. C. Jain and A. K. Jain, editors, Analysis and Interpretation of Range Images, pages 33-139. Springer-Verlag, 1990.
[113] Technical Arts Corporation. 100X 3-Dimensional Scanner: User's Manual and Application Programming Guide. Redmond, Washington.
[114] W. Tiller. Rational B-Splines for Curve and Surface Representation. IEEE Computer Graphics and Applications, 3(6):61-69, September 1983.
[115] R. Vaillant and O. Faugeras. Using Occluding Contours for Recovering Shape Properties of Objects. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 26-32, November 1989.
[116] J. Veenstra and N. Ahuja. Efficient Octree Generation from Silhouettes. In Proc. IEEE 1986 Conf. on Computer Vision and Pattern Recognition, pages 537-542, June 1986.
[117] B. Vemuri, A. Mitiche, and J. Aggarwal. Curvature-Based Representation of Objects from Range Data. Image and Vision Computing, 4(2):107-114, May 1986.
[118] B. C. Vemuri and J. K. Aggarwal. Resolving the Orientation and Identity of an Object From Range Data. In Proc. 1987 Workshop on Spatial Reasoning and Multisensor Fusion, pages 178-187, October 1987.
[119] J. Verly, R. Delanoy, and D. E. Dudgeon. Machine Intelligence Technology for Automatic Target Recognition. The Lincoln Laboratory Journal, 2(2):277-310, 1989.
[120] P. R. Wilson. A Short History of CAD Data Transfer Standards. IEEE Computer Graphics and Applications, pages 64-67, June 1987.
[121] A. Witkin. Recovering Surface Shape and Orientation from Texture. Artificial Intelligence, 17(1-3):17-45, August 1981.
[122] N. Yokoya and M. D. Levine. Range Image Segmentation Based on Differential Geometry: A Hybrid Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(6):643-649, June 1989.