dlb‘

ulna-”.4

 

 

 

 

 

 

 

44'|.1.r-.!

'r-

x
J
’

)

v aiud’ur

 

. n a... ..
i;¢»3«i h: .

 

THESlS

3

llllllllllllllllllllllllllllllll(Illllllllllllllllllllilll

3 1293 01417 272

This is to certify that the

dissertation entitled

0n Integrating Vision Modules

presented by

Sharathchandra Pankanti

has been accepted towards fulﬁllment
of the requirements for

Bh.D . degree in Jmnputex—Science

 

w WW

Anil K. Jain

 

Major professor

Date July 31, 1995

 

MS U is an Affirmative Action/Equal Opportunity Institution 0- 12771

 

 

Liamnv
Michigan mate
Unlverslty

 

 

 

PLACE N REI'URN BOX to roman thin Mom from your record.
TO AVOID FINES Mum on or baton duo duo.

DATE DUE DATE DUE DATE DUE

_ -
I’ll I I

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU I. An mum Action/Ema! Opponunlty Inﬂation
Wm!

 

ON INTEGRATING VISION MODULES
By

Sharathchandm Pankanti

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Computer Science Department

1995

ABSTRACT
ON INTEGRATING VISION MODULES
By

Shamthchandm Pankanti

Individual visual cues are often unreliable and ambiguous. Therefore, integrated
vision systems are necessary to obtain a reliable interpretation of complex scenes.
Design of such systems is challenging Since each vision module works under a different
and possibly conﬂicting sets of assumptions; an effective integration scheme must not
only deal with noisy input images but must also overcome the artifacts and restrictive
assumptions of the individual modules.

We propose a uniﬁed Bayesian integration framework for interaction among the vi—
sion modules to obtain a complete 3D reconstruction from a pair of intensity (stereo)
images. The proposed integration architecture allows a parsimonious modeling of
various interactions. Novel features of the proposed scheme include, (i) interaction
of each module with intrinsic map, (ii) multi-resolution representations and hierar-
chical coarse-to-ﬁne control, (iii) ﬁne—grained feedback mechanisms, and (iv) robust
estimation procedures based on the principle of coherence.

We have integrated perceptual grouping, stereo, Shape from shading, and Shape

from texture modules under the proposed Bayesian framework. We demonstrate the
efficacy Of our approach using real images of several different scenes and observe
improvements in the quality of recovered 3D structure as a result of integration. The
output of the integrated system is Shown to be insensitive to violations of individual
module assumptions. The numerical accuracy of the recovered depth is assessed
for photo-realistically rendered images from several scenes containing a variety of
generic surfaces. Average improvement due to integration in depths estimated from
the synthetic textured images of surface primitives was about 20%. The average
improvement in the Shape estimates due to integration was 16%. For non-textured
synthetic images, the corresponding improvements in depth and shape estimates were
25% and 23%.

Integrating vision modules is a difﬁcult problem primarily due to our lack of
understanding of two underlying issues: (i) an accurate assessment of the strengths
and limitations of individual modules; (ii) the representations and control structures
which can exploit complementary constraints provided by the imperfect modules to
recover the true structure in the data. We have attempted to systematize the design
procedure for an integrated system which takes into account these research issues and
demonstrated that an integrated system thus designed leads to improved results in a
limited scene domain. Much more research is needed to obtain a deﬁnitive solution

to the integration problem.

To my family

iv

ACKNOWLEDGMENTS

Obtaining Ph.D. was a pleasant journey and I had plenty of Opportunities to learn
from many wonderful people throughout the course of this journey.

My thesis director, Dr. Jain has remained a constant source of inspiration. His
insights in the areas of pattern recognition, image processing, and computer vision
have a strong influence on my way analyzing and solving problems. Dr. Jain’s gentle
but ﬁrm reminders about the schedule have helped ﬁnish this work in a reasonable
amount of time. I also greatly appreciate his taking interest in students’ overall
growth and I am grateful for his advise and encouragement.

I am thankful to the members Ph.D. Committee for agreeing to serve on my Ph.D.
committee. Dr. Ruud Bolle has been very supportive and Spent considerable time
with me (in Cafe Venezia, for instance) in discussing crucial integration issues as well
as some of the formulations. Dr. John Weng’s critical as well as constructive com-
ments have helped improve the quality of my work. Dr. William Punch’s expertise in
Artiﬁcial Intelligence and Dr. James Zack’s expertise in the human visual perception
have also enriched my work considerably.

Pattern Recognition and Image Processing Laboratory (PRIP) has been a won-

derful place to work, Sleep, and live. Dr. Richard Dubes, Dr. George Stockman, and
Dr. Mihran Tuceryan and PRIPieS have played a vital role in formation of my ideas.
They have been always helpful and without them working on Ph.D. would have been
much less interesting. I was supported by National Science Foundation research grants
IRI-9103143 and CDA-8806599 during my Ph.D. research during 1992—1995. Special
thanks to Lisa Lees for promptly responding to my system needs and last minute
panic. She also has been (lately) little kind to my disk extravaganza (impending 4GB
disk?)

Integrating vision modules would have been even more difficult if implementations
of some of the modules were not made available to uS. Dr. John Oliensis provided
us with his shape from Shading software; DrS. Deb Trytten and Mihran Tuceryan
provided the initial version of the line labeling module; DrS. Mubarak Shah and P.
S. Tsai Shared their Shape from Shading algorithm as well as some of the images;
Dr. John Weng provided the stereo module; and Dr. Super provided assistance in
re-implementing his Shape from texture algorithm. I am grateful to all of them for
patiently listening to my gripes and providing timely help in porting/reimplementing
their software.

I need to thank my roomie Satya Kudapa for running countless errands for me
(especially during the completion of my dissertation work) and for overall support.
Last but not least, support, love, and patience of my family (including the latest

addition, Praneetha) are also gratefully acknowledged.

vi

TABLE OF CONTENTS

LIST OF TABLES ix
LIST OF FIGURES x
1 Introduction 1
1.1 Motivation ................................... 2
1.2 Bottom-up View ............................... 12
1.3 Top-down View ................................ 15
1.4 Interaction between Top-down and Bottom-up Processes ......... 17
1.5 Horizontal Interaction ............................ 20
1.6 Connectionist Approaches .......................... 25
1.7 Utilitarian Theory .............................. 25
1.8 Interactions between Constraints and Data ................ 26
1.9 Difficulties in Integration ........................... 29
1.10 Design Issues ................................. 34
1.11 Problem Statement .............................. 36
1.12 Contributions of This Thesis ......................... 38
1.13 Organization of the Thesis .......................... 39
2 Integration Methods 40
2.1 Complexity of Integration .......................... 40
2.2 Bayesian and Non-Bayesian Fusion ..................... 41
2.3 Relaxation ................................... 45
2.4 Markov Random Fields ........................... 48
2.5 Bayesian Networks .............................. 52
2.6 Information-theoretic Methods ........................ 53
2.7 Mechanistic Models ............................. 56
2.8 Game-theoretic Methods ........................... 60
2.9 Lattice-theoretic Methods .......................... 63
2.10 Active Vision ................................. 70
2.11 Knowledge-based Methods .......................... 72
2.12 Summary ................................... 74
3 Vision Modules 78
3.1 Perceptual Organization (grouping) ..................... 78
3.2 Shape From Shading ............................. 86
3.3 Stereo ..................................... 92
3.4 Line Labeling ................................. 98

vii

viii

3.5 Shape From Texture ............................. 104
3.6 Summary ................................... 109
4 Non-Uniform Integration 114
4.1 Interactions .................................. 1 16
4.1.1 Interaction between Shape from Shading and Stereo Modules ..... 117
4.1.2 Interactions between the depth modules and the Perceptual
Organization Module ....................... 124
4.1.3 Interactions with Line Labeling Module ................. 125
4.2 Integration Algorithm ............................ 134
4.3 Experiments ................................. 136
4.4 Summary ................................... 143
5 A Uniform Bayesian Framework for Integration 158
5.1 Bayesian Estimation ............................. 159
5.2 Modules for Integration ........................... 173
5.2.1 Perceptual Organization Module ..................... 174
5.2.2 Stereo Module ............................... 179
5.2.3 Shape From Shading ............................ 184
5.2.4 Shape from Texture ............................ 193
5.3 Computation of Reliabilities ......................... 196
5.4 Integration Algorithm ............................ 207
5.5 Experimental Results ............................. 209
5.6 Summary ................................... 214
6 Conclusions and Future Work 226
6.1 Overview ................................... 226
6.2 Eiture Directions ............................... 228
6.3 Conclusions .................................. 230
APPENDICES 232
A Notation and Conventions 232

B Glossary of Assumptions in Computer Vision 233

1.1
1.2
1.2
1.3

2.1
4.1

4.2

4.3

4.4

4.5
4.6

4.7
4.8

5.1
5.2
5.3

5.4

LIST OF TABLES

Natural Assumptions in Computer Vision. ................. 5
Synthetic Assumptions in Computer Vision. ................ 6
Synthetic Assumptions in Computer Vision (contd) ............. 7
A list of vision modules. ........................... 38
Integration Methodologies ........................... 77
Vision modules in the proposed integrated system. ............ 116
Arc and vertex features. The features 9, (151, ll relate to an arc; (152, d, and

I2 are degree-2 vertex features; (p3 and Z3 are degree-3 vertex features

(see Fig. 4.6). ............................... 132
Prior probabilities for arc labels. The term Z = ZiELA I (z) iS the normal-

ization factor. Note that Tk speciﬁes threshold for parameter k. . . . 132
Prior probabilities for junctions. Note that T), Speciﬁes threshold for pa-

rameter k and Sgn(.) is the conventional Sign function. ....... 133
CPU times for Mushroom and Vase Image on Sun Sparc 20. ....... 136
Improvement in surface reconstruction due to integration: Lambertian

surfaces. .................................. 140

Improvement in surface reconstruction due to integration: Specular surfaces. 141
Improvement in surface reconstruction due to integration: intensity images

synthesized from real range images. ................... 141
Vision modules for the proposed integration ................. 173
CPU times for a typical image on Sun Sparc 20 ............... 209
Improvement in surface reconstruction due to integration: Lambertian

surfaces. .................................. 211

Improvement in surface reconstruction in texture-mapped Mozart images. 213

ix

1.1
1.2

1.3

1.4
1.5
1.6
1.7
1.8
1.9
1.10
1.11
1.12
1.13

2.1

2.2

2.3

2.4

2.5

2.6
2.7

2.8
2.9

LIST OF FIGURES

Ambiguity in shading: concave or convex? .................
Ambiguous line diagrams can be easily resolved by shading information.

Line diagrams in (a) and (b) are easier to interpret in conjunction with

their shaded counterparts in (c) and (d) (Adapted from [13]). . . . .
Imaging is a many—to—one mapping. The general viewpoint assumption

states that the interpretation (a) is more likely than (b) [109]. . . . .
Marr paradigm [109] ..............................
Dalmatian dog [116]. .............................
Lowe’s paradigm [109] .............................
Class I: weakly coupled system. .......................
Class II: weakly coupled system ........................
Class III: weakly coupled system. ......................
A feedforward system. ............................
A recurrent system ...............................
A generic integration scheme. ........................
System input/output: (a) and (b) synthetic left and right input intensity

images; (0) ground truth depth (pixels closer to us are brighter), (d) a

wireframe representation of the depths ..................

Registration: (a) and (b) are two images of a given scene; (c) Shows an im-
age which makes the spatial relationship between (a) and (b) explicit.
The dotted lines demarcate positions of (a) and (b) with respect to
each other ..................................

Fusion: (a) stereo, (b) sonar, and (c) fused data. The density of the
dots represent the likelihood of occupancy of that Space by a physical
object [129]. ................................

Collation. ...................................

Integration of line diagram and shading data to obtain a correct 3D inter-
pretation. .................................

Example neighborhoods of pixel and line (edge) sites for (a) a pixel site;
(b) a horizontal edge Site; and (c) a vertical edge site (from [134]). . .

An example of a Bayesian network. .....................

First-order and second-order MRF models produce unintuitive results
(from [20]). ................................

A single player. ................................

A 2-player game. ...............................

9

10
14
16
19
28
28
29

37

42

42
42

64

2.10 Level sets of the payoff functions for player 1 (left) and player 2 (right) [26]. 64

X

xi

2.11 Payoff level sets for both the players are superimposed. Also Shown are
the reaction maps for both the players [26] ................
2.12 Successive moves made by the players to reach Nash equilibrium. R1 and
R2 denote the reaction maps of player 1 and player 2. Note that player
1 always moves horizontally and player 2 always moves vertically in
this graph. Also, note that in any given move, the payoff is maximized
at the corresponding intercept on the reaction map [26] .........
2.13 A Simple fault—lattice based upon a rigid motion premise and a coplanarity
premise (From [89]). ...........................
2.14 Three methods of enforcing a unique solution: (a) Voting; (b) Priors; and
(c) Generation of a new hypothesis ....................

3.1 Perceptual organization helps complete the obscure boundaries [116]. . .
3.2 Relations Signiﬁcant in grouping [109]. ...................
3.3 Perceptual Organization Module. ......................
3.4 A grouping example for Mushroom and Vase image (512 x 512): (a) input
intensity image; (b) output from Canny edge detector; (c) output of
region-based segmentation; ((1) Significant closed regions after integrat-
ing segmentations in (b) and (c) using Gestalt rules ...........
3.5 Recovery of saddle—Shaped surfaces is especially difﬁcult: (a) An image of
Lambertian hyperboloid surface with constant albedo; (b) True surface
shape; (c) A convex shape recovered from (a) by Oliensis and Dupuis’
algorithm using default convex surface assumption. A default concave
surface assumption would have recovered an entirely concave surface.
3.6 A simple image formation geometry. Rays AB, AC, and AD are in the
direction of source (5), surface normal (11), and sensor (k), respectively.
The angles BAC, BAD, CAD will be referred to as incidence angle (2'),
emittance angle (e), and phase angle (9), respectively [80]. ......
3.7 Parallel axis stereo geometry [50]. .....................
3.8 Stereogram of ambiguously perceivable center square ﬂanked with unam-
biguous areas in front of and behind the surround [91]. (a),(b) are left
and right random dot stereograms; (c), (d) are two perceived depth
maps. ...................................
3.9 Mushroom and Vase image (Size 512 x 512): (a), (b) Left and right stereo
images; (c) Recovered depth from stereo alone; (d) A wire-frame rep-
resentation of (c). ............................
3.10 A line drawing [13] ...............................
3.11 Malik’s junction catalog. The double arrows indicate limb edges. The
object boundaries are shown by Single arrows. Symbols ‘+’ and ‘—’
denote convex and concave (internal) edges. Symbol ‘?’ denotes a
don’t care line label [112]. ........................
3.12 Obtaining line diagrams is a difﬁcult problem even in case of simple scenes:
(a) An image from the blocks world; (b) A typical edge map using
Canny edge detector. Note that the edge detector failed to detect
many junctions and introduced several extraneous edges. .......

65

66

67

68

79
80
82

85

86

88
93

94

99

100

103

104

3.13

3.14

3.15
3.16

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8

4.9

4.10

4.11

4.12

4.13

4.14

xii

Junction labeling example for Mushroom and Vase image (Fig. 3.9(b)):

(a) correct input line diagram; (b) labeling using line diagram alone;

L, C, T, Y, A, and P denote L, curvature-L, tangent, Y, arrow, and

phantom junctions, respectively. .................... 111
Shape from texture example: (a) image of an object with uniform surface

texture; (b) Edge map (scaled) extracted from the image shown in (a).

Both pictures adapted from [106]. ................... 112
Imaging Geometry (adapted from [61]) .................... 112
Shape from texture example: (a) a synthetic texture-mapped image of an

object with uniform surface texture (10% i.i.d. noise and Lambertian

shading); (b) recovered Slant ﬁeld (here we Show 90° complement of
slant for better visibility) from (a) using a Shape from texture algo.

rithm [178]; (c) ground truth slant ﬁeld for the object in (a); and (d)

normalized error in the recovered Slant. ................ 113
Overall Block Diagram. ........................... 117
Shape From Shading Module. ........................ 118
Stereo Module. ................................ 119
Line Labeling Module. ............................ 126
Limb Detection ................................. 129
Line diagram features. ............................ 131
Image synthesis using ray tracing ....................... 139

Mushroom and Vase image (size 512 x 512): (a) and (b) Left and right
stereo images; (c) Recovered depth from stereo alone; (d) Recovered
depth from the integrated system; (c) Recovered surface normals from
stereo alone; (f) Recovered surface normals from the integrated system. 146

Synthetic surface primitives (Lambertian surfaces): (a) Parallelopiped; (b)
Sphere; (0) Cylinder; (d) Paraboloid; (e) Hyperboloid; and (f) Torus. 147

Synthetic surface primitives (Specular surfaces): (a) Parallelopiped; (b)
Ellipsoid; (c) Cylinder; (d) Paraboloid; (e) Hyperboloid; and (f) Torus.

All surfaces are Shown with Phong Shading. .............. 148

Synthetic stereo images: (a), ((1): Range Images of Tomato and Pipe ob-
tained from White scanner [110]; (b), (c): Left and right stereo images
generated from (a); (e), (f): Left and right stereo images generated
from (d). ................................. 149

Mushroom and Pipe image (Size 512 x 512): (a) and (b) Left and right
stereo images; (c) Recovered depth from stereo alone; ((1) Recovered
depth from the integrated system; (e) Recovered surface normals from
stereo alone; (f) Recovered surface normals from the integrated system. 150

Apple and Pepper image (Size 512 x 512): (a) and (b) Left and right stereo
images; (0) Recovered depth from stereo alone; (d) Recovered depth
from the integrated system; (e) Recovered surface normals from stereo
alone; (f) Recovered surface normals from the integrated system. . . 151

Segmentation results: (a) Mushroom and Vase 4.8(b); (b) Mushroom and
Pipe 4.12(b); (c) Apple and Pepper 4.13(b); (d) Egg and Cup 4.15(b). 152

xiii

4.15 Egg and Cup image (size 512 x 512): (a) and (b) Left and right stereo
images; (c) Recovered depth from stereo alone; (d) Recovered depth
from the integrated system; (e) Recovered surface normals from stereo
alone; (f) Recovered surface normals from the integrated system.

4.16 Recovery of 3D shapes from Egg and Cup image (Fig. 4.15): (a) Recovered
superquadrics from stereo alone; (b) Recovered superquadrics from the
integrated system. ............................

4.17 Limb edges detected in Mushroom and Pipe image (Figure 4.12). Limb
boundary pixels are rendered as thick boundaries and non-limb bound-
ary pixels are depicted as thinner edges. ................

4.18 Junction labeling results for Mushroom and Pipe image (Figure 4.12):
(a) using line diagram alone; (b) by the integrated system using the
information provided by the depth modules and line diagram; L, C, T,
Y, A, and P denote L, curvature-L, tangent, Y, arrow, and phantom
junctions, respectively. ..........................

4.19 Junction labeling results for Mushroom and Vase image (Figure 4.8): (a)
using line diagram alone; (b) by the integrated system using the in-
formation provided by the depth modules and line diagram; L, C, T,
Y, A, and P denote L, curvature-L, tangent, Y, arrow, and phantom
junctions, respectively. ..........................

5.1 Generic Integration Problem. ........................
5.2 Modular Integration. .............................
5.3 Modular Integration (restricted interactions). ...............
5.4 Interaction Model (for one module at one level of resolution) ........
5.5 Sequence of Module Operation (at one level of resolution). ........
5.6 System Integration ...............................
5.7 4—neighborhood system. ...........................
5.8 An Image of a cylinder. ...........................
5.9 Image in an image. ..............................
5.10 Synthetic stereo pair image of a cylinder (Size 512 x 512) ..........

5.11 Error analysis of synthetic stereo image pair in Figure 5.10. This ﬁgure
shows the results for the 256‘“ row only: (a) Intensity proﬁles of the
left and right stereo image in the 256‘” row. (b) Error between the
observed and the true depths; (c) Intensity residuals; ((1) Sensitivity of
the intensity residuals. ..........................

5.12 Empirical method for stereo evaluation (only the right stereo image is
Shown): (a) intensity image of Sphere; (b) output from Canny edge
applied to (a); (c) depth output from stereo module; (d) ground truth
depths; (e) error in (c) ...........................

5.13 Uniform Bayesian Integration Algorithm ...................

153

154

155

156

157

163
164
166
170
171
172
183
198
199
203

204

206
208

xiv

5.14 Synthetic texture-mapped surface primitives (Only the right stereo image
is shown): (a) parallelopiped (31%); (b) Sphere (20%); (c) cylinder
(22%); (g)—(i) and (m)—(o) depict the depth reconstruction for these
primitives from stereo module and from the integrated system, respec-
tively. Figures in the parentheses Show improvements in the depth
estimate due to integration (see Table 5.3) ................

5.15 Synthetic texture-mapped surface primitives (Only the right stereo image
is shown): ((1) paraboloid (17%); (e) hyperboloid (8%); and (f) torus
(5%); (j)—(l) and (p)—(S) depict the depth reconstruction for these prim—
itives from stereo module and from the integrated system, respectively.
Figures in the parentheses Show improvements in the depth estimate
due to integration (see Table 5.3) .....................

5.16 A Lambertian surface with no texture: (a) and (b) Mozart stereo images
(Size 512x512) synthesized with no surface texture; (c) reconstruction
of (a) and (b) using stereo module only; (d) reconstruction of (a) and
(b) using the integrated system. Depth estimation was improved by
8% due to integration (see Table 5.4). .................

5.17 A surface with partial texture: (a) and (b) Mozart stereo images (Size
512x512) with texture-mapped surface juxtaposed with a surface with
no texture; (c) reconstruction of (a) and (b) using stereo module only;
((1) reconstruction of (a) and (b) using the integrated system. Depth
estimation was improved by 27% due to integration (see Table 5.4). .

5.18 A surface with full texture: (a) and (b) Mozart stereo images (Size
512x512) texture-mapped with a homogeneous texture; (c) reconstruc-
tion of (a) and (b) using stereo module only; (d) reconstruction of (a)
and (b) using the integrated system. Depth estimation was improved
by 32% due to integration (see Table 5.4). ...............

5.19 A surface with two textures: (a) and (b) Mozart stereo images (Size
512x512) texture-mapped with two distinct textures; (c) reconstruc-
tion of (a) and (b) using stereo module only; ((1) reconstruction of (a)
and (b) using the integrated system. Depth estimation was improved
by 37% due to integration (see Table 5.4). ...............

5.20 Specular surface: (a) and (b) Mozart stereo texture-mapped images (Size
512x512) for a Specular surface; (c) reconstruction of (a) and (b) us—
ing stereo module only; ((1) reconstruction of (a) and (b) using the
integrated system. Depth estimation was improved by 16% due to
integration (see Table 5.4). .......................

5.21 Fruit image (Size 512x512): (a) and (b) Left and right stereo images;
(c) Recovered depth from stereo alone; (d) Recovered depth from the
integrated system; (e) Recovered surface normals from stereo alone;
(f) Recovered surface normals from the integrated system; (g) and (h)
Show (e) and (h) as Shaded perspective views [76]. Fruit image was
provided by Prof. Narendra Ahuja. ...................

216

217

218

219

220

221

223

XV

5.22 Egg and Cup image (Size 512x512): (a) and (b) Left and right stereo
images; (c) Recovered depth from stereo alone; (d) Recovered depth
from the integrated system; (e) Recovered surface normals from stereo
alone; (f) Recovered surface normals from the integrated system; (g)
and (h) Show (e) and (h) as Shaded perspective views [76]. ......

5.23 Segmentation results for (a) Fruit image (Fig. 5.21(b)) and for (b) Egg
and Cup image (Fig. 5.22(b)). .....................

5.24 Recovery of 3D Shape of cantaloupe in the Fruit image (Fig. 5.21): (a) Re-
covered superquadrics from stereo alone; (b) Recovered superquadrics
from the integrated system. .......................

224

Chapter 1

Introduction

One of the central objectives in computer vision is to understand the input scene
and Objects therein from the image(s) of the scene for the purpose of recognition,
manipulation, navigation, or other visual tasks. Since the initial efforts of Roberts
[163], the ﬁeld of computer vision has developed both in rigor as well as in the
repertoire of the methodologies. Yet, the goal of building vision systems capable of
undertaking visual tasks in a human-like way has remained elusive. Most computer
vision systems have remained restrictive in their domain assumptions and brittle
in their performance. A system making use of limited information will be fallible;
robust vision systems would need to necessarily take into account many independent
visual cues. How Should one develop vision systems that are capable of effectively
representing the world knowledge and information in the visual cues? What are the
control structures which allow efficient use of these representations? How do these
representations and control structures result in robust systems? The research reported
here is an attempt to understand these issues by developing a system which integrates

1

2

visual cues to achieve a 3D reconstruction of the input scene based on sensed (stereo)
intensity images.

This chapter is organized as follows. Section 1.1 discusses the motivation behind
the integrated systems for visual reconstruction in some detail. Sections 1.2-1.7 dis-
cuss literature (mostly from psychology and physiology) which provides some clues
on how the integration is actually effected in the human visual system. Section 1.8
reviews a computational basis for the classiﬁcation Of integration methodologies. Sec-
tion 1.9 discusses the difﬁculties in integration of visual cues, both methodological as
well as implementational. Section 1.10 describes critical research issues in the design
of an integrated system. We also present a statement of the problem that we have
solved (Section 1.11), a list of our contributions (Section 1.12), and organization of

this thesis (Section 1.13). The chapter concludes with a summary.

1 .1 Motivation

Many vision related tasks like grasping, navigation, and exploration require extraction
of 3D information (e. g., depth, surface normals) about the input scene using a variety
of sensors. The generic 3D reconstruction problem in itself is an important research
problem since its study in the past has increased our understanding of the systematic
constraints and limitations in designing a vision system [7]. These constraints and
limitations could then be gainfully employed to disambiguate 3D structure of a scene
or to help design more robust vision systems.

A number of approaches have been used for recovering 3D characteristics of a

3

given scene. A comprehensive summary of these techniques could be found in Jain
and Flynn [88]. One of the approaches for obtaining 3D information has been through
direct depth measurement using time-of-flight sensors or structured light. Given pre-
cise imaging conditions and favorable surface characteristics of all the sensed objects,
a very accurate 3D reconstruction of the scene could be obtained with a high ac-
curacy. However, imaging conditions required by the direct range sensing strategies
can not always be improvised; large outdoor structures and distant objects can not
be effectively scanned by these Special-purpose sensors (e. g., laser, radar); and the
object surfaces may not be suitable (e. g., Specular/Shining surfaces) for the sensors
used in such systems.

An important research issue is whether the 3D structure of a scene could be
reliably recovered in less assuming and more realistic Situations? For instance, can
we precisely recover the 2.5D Sketch1 from a pair of intensity images of a scene?
To obtain a reasonable depth map, it would seem that with a careful management
and integration of the information available in the intensity images, we need not
have to resort to precisely engineered solutions. Besides, in many computer vision
applications, the imaging conditions and the surface characteristics of the objects to
be imaged are well beyond our control. In such Situations, a pair of stereo images
of a scene illuminated with a ﬁnite number of point sources with known positions
and possibly with (unknown) ambient light may be the only reasonable assumptions

which can be made about the input. The primary objective of this thesis is to

 

lA 2.5D sketch represents depth and surface orientation at each visible point in the scene. Ac—
cording to Marr[116], it is an explicit representation of depth, surface orientation, discontinuities in
depths, and discontinuities in surface orientations in a scene.

4

explore and evaluate methods for a complete and accurate 3D reconstruction under
these conditions.

The projections of the 3D world onto 2D image planes suffer from a lOSS of explicit
depth information. The resulting intensity images convey 3D information in several
different and indirect ways. In the last couple of decades, a number of Shape-from-X
modules have been identiﬁed and shown to be capable of conveying Shape of the object
in constrained environments. Several other modules such as perceptual grouping have
been demonstrated to be helpful in the depth recovery process [72]. Problems posed
by many of these modules are ill-posed 2 and all these modules make assumptions
restricting the scope of their application. Tables 1.1 and 1.2 summarizes various
assumptions made by the (computer) vision modules. The primary considerations in

making these assumptions might be classiﬁed into three categories:

0 Ambiguity: A vision module (inherently) can not determine certain component

of the 3D structure of the scene.

0 Complexity: Although a vision module could potentially extract all the required
3D information, it might be easier to recover the information under Simplifying

assumptions.

0 Reliability: In order for the recovered 3D structure of an input scene to be

stable, it might be necessary to impose certain assumptions.

Violations of these assumptions result in making the module output less reliable. Only

 

2A problem is a well-posed problem when its solution exists, is unique, and continuously depends
on the data. A problem which is not well-posed is an ill-posed problem [187].

Table 1.1: Natural Assumptions in Computer Vision.

 

 

 

 

 

 

 

 

 

 

Module Assumption Example
Grouping General Lowe
viewpoint [109]
Stereo Coherence Prazdny
[154]
Stereo Cohesiveness Marr &
Poggio
[119]
Shading Smoothness Horn &
Brooks
[30]
Stereo Continuity Mayhew
& Frisby
[123]
Stereo, Sufﬁcient Malik &
texture context Rosen-
(aper- holtz
ture [114]
problem)
Structure Smooth Nagel &
from motion intensities Enkel-
mann
[137]
Line labeling Piecewise Malik
continuity [1 12]

 

 

 

 

Table 1.2: Synthetic Assumptions in Computer Vision.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Module Assumption [ Example
Edge Local Geman
detection Spatial & Ge-

interaction man [63]
Stereo Opacity Grimson
[68]
Shading, Lambertian Ikeuchi
stereo surfaces & Horn
[85]
Stereo Gaussian Belhumeur
Noise & Mum-
ford [15]
Texture Homogeneity Super &
Bovik
[178]
Texture Isotropy Garding
I62]
Texture Perfect Ohta et
segmentation al. [143]
Shading Single Lee &
source Rosen-
of feld
illumination [104]
Structure Small Horn &
from motion motion Schunck
[81]
Line labeling Origami Kanade
world [92]
Line labeling Polyhedral Sugihara
world [177]
Shape from Symmetry Gross
Symmetry [71]

 

 

 

 

Table 1.2: Synthetic Assumptions in Computer Vision (contd).

 

 

 

 

 

 

 

 

 

 

 

 

[ Module Assumption Example
Line Labeling Orthographic Malik
Projection [112]
Texture Para- Aloimonos
perspective [4]
projection
Structure Rigidity Weng
from motion et al. [197]
Shading Constant Tsai
albedo & Shah
[192]
Shading Locally Pentland
Spheri- [152]
cal
surface
Focus Gaussian Ens &
blurring Brown
[55]
Contour Isotropy Horaud
& Brady
[78]
Shading No Horn
interreﬂection [79]
Specularity Dichromatic Healey
reflectance & Bin-
ford [74]
Shading & Fractal Pentland
Texture surface [151]
model
Shape using Separable Christenser
color colors 85
Shapiro

 

 

 

l

 

I40]

 

8

in an integrated environment could one (i) avoid making very restrictive assumptions,
(ii) verify the accuracy of a module output using another independent output, and
(iii) assess applicability of the assumptions from the scene properties.

For instance, shape from shading module can not disambiguate “concave or con-
vex” ambiguity (Fig. 1.1). A line labeling module can not deﬁnitively resolve whether
the top surface of the block shown in Fig. 1.2(b) is curved or planar. Conventionally,
such ambiguities have been resolved either (i) by imposing arbitrary assumptions,
e. g., many Shape from shading algorithms are biased towards ‘convex surface’, or (ii)
by invoking synthetic constraints (see Section 1.8), e.g., arguments based on sym-
metry to decide the curvature of the top surface of the block in Fig. 1.2(b). The
imposition of arbitrary assumptions as well as a premature invocation of synthetic
assumptions can be avoided by making use of the other visual cues in the input im-
age(s). For instance, a stereo module could resolve the ambiguity in Fig. 1.1, or the
shading information (Fig. 1.2(d)) could facilitate a more reliable interpretation of line

diagram in Fig. 1.2(b).

 

Figure 1.1: Ambiguity in Shading: concave or convex?

 

((1)

Figure 1.2: Ambiguous line diagrams can be easily resolved by shading information.
Line diagrams in (a) and (b) are easier to interpret in conjunction with their Shaded
counterparts in (c) and (d) (Adapted from [13]).

10

Image Plane ] , '

 

 

 

 

 

a
I
I
, I
I
I
’__

 

 

Image Plane

 

 

 

 

 

Figure 1.3: Imaging is a many—to—one mapping. The general viewpoint assumption
states that the interpretation (a) is more likely than (b) [109].

11

In an integrated environment, it is also possible to assess the violation of a weak
assumption3. General viewpoint assumption is a case in point. Due to a lOSS of the
explicit depth information, the input images can not be uniquely interpreted based on
the projective geometry alone (Fig. 1.3); the assumption of general viewpoint posits
that Signiﬁcant structural relationships in an image are unlikely to have resulted due
to accidental location or orientation of the observer. However, in a large image (espe-
cially, of a man-made environment), a few violations of general viewpoint assumption
might have occurred. In an integrated system, the individual modules can afford
to operate outside the scope of their nominal assumptions Since other independent
sources of information can correct its mistakes. For instance, a violation of general
viewpoint assumption can be ascertained in an integrated environment using several

independent sources of information.

The central role of constraints exerted by the world knowledge and visual cues in
image interpretation is widely agreed upon. The details of how these constraints are
effectively embedded into a working system are, for the most part, not known. In the
hope of ﬁnding answers to these problems, we often turn to the human vision liter-
ature. Several psychophysical and neurophysiological experiments have been carried
out to ﬁnd the general structure of the perceptual processing in the last few decades.

The following sections summarize integration research in human visual processing.

 

3A weak assumption (also referred to as natural constraint) is an assumption about the world
which is frequently true. See Section 1.8.

12
1.2 Bottom-up View

This view of solving vision problem hOldS that different sources of information dis-
ambiguate image interpretation by building increasingly abstract representations in
a domain-independent fashion. The system obtains higher-level descriptions from
image- (lower-) level descriptions by recursively grouping the features at each level of
description. This model enjoys a good support from conventional vision researchers.

Marr’s work is an excellent example of this approach [116].

Marr Paradigm
Marr suggested that it is useful to pose vision problems at a computational level
before dealing with representations, algorithms, or implementations. What is being
computed? Why is it being computed? While he was clearly aware of the top-
down influences, he felt that their role in early vision processes was marginal. He
proposed building the following three levels of representations during the course of

3D reconstruction:

Raw Primal Sketch: This representation is responsible for making the local struc-
ture in the image explicit. It consists of descriptions of physically meaningful events
(changes in albedo, illumination, depth, or surface orientation) by a combination of
Spatially coincident zero-crossings extracted from consecutive levels of representations
(Spatial coincidence)". Spatially localized conﬁgurations of (zero-crossings) are iden-

tiﬁed into units of representation: tokens. Tokens are categorized into edges, bars,

 

4These principles of combination of information across different spatial channels can be quite
involved and constitute the ﬁrst systematic efforts in integration of information across the spatial
channels [117].

13

blobs, terminations, etc. and have typically half a dozen features like orientation,
contrast, etc. Each token is a very localized descriptor. For instance, a long thin line
would be described by several small oriented segments. Spatial relations among the
tokens are also made explicit in the raw primal Sketch.

Full Primal Sketch: The tokens in the raw primal Sketch are recursively grouped
based on their Similarity. Such grouping procedures are local mechanisms for or—
ganizing perceptually Signiﬁcant events such as texture boundaries and subjective
contours. The full primal Sketch makes organization of the perceptually Signiﬁcant
2D structures explicit.

2.5D Sketch: The features in this representation are surface orientation, depth
discontinuities, orientation discontinuities, and coarSe depth estimates for visible sur-
faces in the scene. This representation is viewer-centered and iS essentially the ﬁrst
description to make the 3D structure of the scene explicit. This representation is
the precursor to the ﬁnal object-centered, segmented, and volumetric representation
[118]. Marr argued that a 2.5D Sketch description is a convenient common substrate
for the visual modules to combine their individual outputs.

Marr’s proposal has a strong argument in favor of the modularity of the human vi-
sion system. Several psychophysical experiments have demonstrated that the human
visual system processes different cues like Shape from shading, Shape from texture,
stereo, Shape from motion in a relatively independent way (Figure 1.4). For instance,
subjects could fuse random dot stereograms which did not have any other cues [91].
Neurophysiological data available at that time demonstrated the existence of different

channels of processing for different types of visual data [108]. Marr coined the term

14

vision module for these autonomous processes, each handling one type of visual cue.
While there was no deﬁnitive evidence for the existence of vision modules, he argued
that the modular development of vision system would Simplify the complexity in its
design.

Recognition and
Localization of objects

 

Object Models

 

Matcher

 

 

2.5D Sketch
Intrinsic Images

I I

 

9

Stereo
Motion
Shape from Shading
Shape from Texture
Shape from X

 

 

 

 

 

 

Image Features

 

1 Input Image(s)

Figure 1.4: Marr paradigm [109].

One of the advantages of the bottom-up approach is its generality. It is usually
characterized by lack of any restrictive domain-speciﬁc assumptions. Hence the use
of the bottom-up, data-driven approach has been recommended in exploratory and

generic computer vision tasks, where the details of the environment are not known a

priori.

15
1.3 Top-down View

The world we live in is not an abstract space, but a three—dimensional world obeying
laws of physics. Most of the objects we perceive in our everyday life are opaque.
Object surfaces are smooth almost everywhere. The evolutionary process has built
mechanisms into our perceptual apparatus to exploit these features for easier inter-
pretation of our sensory stimuli. These default assumptions serve to disambiguate
and stabilize our perception in Situations where visual stimuli in themselves are not

sufﬁcient.

A bottom-up, hierarchical model of perception has influenced much of both clas-
sical philosophical thought and psychological theories, thus top-down forces were
initially ignored. Marr’s paradigm described in Section 1.2 is a case in point. It
undermines our ability to infer 3D structures and recognize them in the absence of

explicit and conclusive 3D cues.

The human visual system assumes that its viewpoint precludes any accidental
alignment of non-causal object features. This assumption is called general viewpoint
assumption. Several researchers [199, 109, 13, 18] have emphasized that perceptual
organization imposes Signiﬁcant 3D constraints in building the 2.5D Sketch as a result
of general viewpoint assumption; the presence of Signiﬁcant structures in the full
primal sketch can be related to the presence of relevant causal 3D structures, making
our perception and recognition of objects from line drawings and from Single 2D
images effortless. Nakayama and Shimojo [138] have demonstrated several situations

where interpretation of images even in the presence of conﬂicting explicit 3D cues

16

is inﬂuenced by the general viewpoint assumption. Although the general viewpoint
assumption inﬂuences 3D organization of image features in the early human visual
processing (and without domain-speciﬁc information), most of these inferences are
respected in the ﬁnal image interpretation [199]. The Signiﬁcance of our ability to infer
3D structure directly from a 2D organization of image features has led to the inclusion
of a 3D inference module into Marr’s bottom-up paradigm [109]. One interesting
example of the top-down inﬂuence is shown in Figure 1.5, where the Dalmatian dog
is not ‘visible’ until we are told to look for it. Here, the model-driven processes
appear to help the integration of the noisy visual cues, resulting in a relatively vivid
perception of the Dalmatian dog [116]. These and several other pieces of evidence
have prompted an augmentation of ‘classical’ bottom-up model of perception. Such
cognition-based theories involve interaction of knowledge and expectations with the

perceptual process in a more top—down manner [141].

 

Figure 1.5: Dalmatian dog [116].

The top-down information integration approach is often related to the goal-

17

directed processing. Goals are decomposed into subgoals till each subgoal is sufﬁ-
ciently Simple to be solved directly. A common top-down technique is “hypothesize-
and-test” paradigm; here an internal modeling process makes predictions about the
way information from each visual module iS being combined. Perception becomes an

act of verifying such predictions or hypotheses that ﬂow from the model [25, 69].

It is generally agreed that our (human) low-level vision system processes prodi-
gious amounts of information in several cascaded parallel layers [108, 204]. With
serial computational hardware, it is very expensive to duplicate the power of our
low-level visual system. The desire to circumvent unnecessary low-level processing to
reconstruct a huge amount of 3D data is understandable and has given rise to the

purposive vision paradigm which emphasizes goal—oriented visual processing [8].

How could we incorporate top-down processes in a vision system that reconstructs
the input scene? Traditionally, the top-down constraints have been (i) explicitly
invoked by restricting the object domain (e.g., polyhedral objects, smooth objects)
or (ii) implicitly embedded in the representational framework ( e.g., polynomials,

Fourier descriptors).

1.4 Interaction between Top-down and Bottom-up

Processes

In computer vision practice, a judicious mixture of the data-driven analysis and

model-driven prediction often seems to perform better than either process in iso-

18

lation. This hybrid control iS often implemented using hierarchical representations
with a Simple pass-oriented control structure [10, 195]. The uncertainties and ambi-
guities in the sensed data make it difﬁcult for the model-driven processes to effectively
hypothesize locations of the features. In order to obtain a salient, noise-insensitive,
and useful description over which a model-driven process can ‘hypothesize-and-test’
efﬁciently, a few bottom-up passes are usually deemed to be necessary for deriving a
better representation. The low-level processes also seem to offer a certain degree of
reliability in the performance of the overall system. A number of purely tOp-down
approaches had to be eventually complemented by bottom-up groupings to offer an
overall robust performance. A recent example of such an augmentation is provided
by Jacob [86, 87].

Among others, Lowe [109] felt that Marr’s principle of least commitment was too
conservative in terms of making the information explicit. For instance, it had no ex-
pressive mechanism for exploiting probabilistic information. Further, it deemphasized
the role of perceptual organization in a topdown recognition and reconstruction of
the scene. Lowe’s proposal for an object recognition system is Shown in Figure 1.6.
The human object recognition remains easy even in situations where the explicit 3D
information contained in the image is minimal. He also argued that, in many cases,
it may not be expedient/necessary to reconstruct the 3D structure of the scene for
recognition of the objects therein.

What mechanisms does the human visual system use for combining top-down and
bottom-up processes? Guided search theory [200] hypothesizes an objective func-

tion which linearly combines bottom-up and top-down inﬂuences for directing visual

19

Recognition and
Localization of objects

 

Object Models

   
 
  

 

Perceptual
Organization

Recognition
Matching

 

 

 

3D

 

 

  

Perceptual Groupings 2-59 Sketch
Inference Intrinsic Images
fa" E
E 5 ><
o s "’ 5
E 8 E E ~‘=
(I) 2 - u— §
.. 1:
Recognition g- E- w
m m

 

 

 

 

 

 

Image Features

 

T Input Image(s)

Figure 1.6: Lowe’s paradigm [109].

 

Veriﬁcation

20

attention. The bottom-up mechanism is represented by a metric measuring the Simi-
larity between the given object and neighboring Objects (grouping) and the top-down
mechanism is represented by a metric measuring the Similarity between an ‘adaptive’
template (expectations) and the given Object. Duncan and Humphrey [53] propose
a theory which is similar in spirit except that they conjecture a non-linear two-way
interaction between the top-down and bottom-up processes. Grossberg’s adaptive
resonance theory also contains similar mechanisms in the connectionist paradigm
(Section 1.6).

In most of the computer vision literature, the interaction of top-down and bottom-
up inﬂuences is implicit and is usually posed in the form of a constrained Optimization
involving parametric [174] or non-parametric representations [94, 186]5. These meth-
ods are sensitive to the initial conditions and the parameter values. For instance,
given undesirable initial conditions, the optimization scheme used by Solina and Ba-
jcsy [174] can produce strange and unintuitive superquadric ﬁts to a given set of 3D

points.

1.5 Horizontal Interaction

Not only do the top-down and bottom-up processes interact, but the processes oper-
ating at the same level of the representation can also laterally exchange information

to disambiguate image interpretation. One common Situation in computer vision is

 

5A notable exception to the implicit tOp-down bottom-up interaction model is the distributed
scheme developed by Bozma and Duncan [27]. However, the modeling assumptions have not been
fully exploited in these studies due to the inability of the objective function to represent all the
physical constraints.

21

the interaction among the processes in a Spatial neighborhood. The importance of
this Spatial interaction becomes obvious from the fact that human observers have
difficulty in interpreting isolated subimages (small portions of a scene) which can
otherwise be effortlessly interpreted in their proper Spatial context. There is also an
evidence for lateral inhibitory inﬂuence of neurons on their Spatial neighbors [168].
Another type of horizontal interaction is the exchange of information across the
vision modules. These interactions have been studied by Bulthoff and Mallot [113]
using psychophysical experiments. They hypothesize the following four types of such

interactions:

1. Accumulation: The evidence produced by each module could be accumulated
as probabilities or conﬁdence indices. More generally, individual support for hy-
potheses could be linearly combined. Many of the cost functions in Optimization

formulations represent this kind of interaction.

2. Cooperation: This non-linear interaction could be used for synergistic interpre-
tation of noisy data. Competition between the modules can be considered as
an opposite of cooperation. Markov Random Fields (MRF) and game-theoretic

models can be considered as examples of this interaction.

3. Disambiguation: One module can help obtain a unique solution from an un-
derconstrained problem posed by another module. Many of the algebraic ap-
proaches proposed by Aloimonos [8] can be considered as examples of this type

of interaction.

4. Veto: This interaction provides for an overriding role played by a module which

22

can not be challenged by the other modules. For instance, in several conﬂicting
Situations, stereo cue is considered more reliable than shape from shading and
the stereo module vetoes alternative interpretations provided by the Shading

module.

Bulthoff and Mallot [34] studied the integration of stereo disparities, edge information,
and Shading in the 3D perception of synthetically generated images. Subjects were
Shown images of end-on views of ﬂat and smoothly-Shaded ellipsoids and they were
asked to judge the perceived depth. Based on the experimental data, they concluded

that

1. The human visual system underestimates depth if the stimulus consists of Shad-

ing cues alone;

2. When both Shading and stereo cues are present, the information provided by

the stereo module dominates the ﬁnal interpretation of the scene;

3. Disparate Shading6 yields a vivid stereoscopic depth (even in the absence of

disparate edges); and

4. The human visual system interpolates the depth provided by the stereo system

using the information provided by the Shading cues.

A linear combination of individual module outputs is meaningful when the individual
module outputs agree [113]. Maloney and Landy [113] have described the following

results about the interaction between the cues based on their linear combination.

 

6Shading differences in the left and right stereo images.

23

1. The weight assigned to each cue depends upon the reliability of the cue. The
measure of reliability in itself might be computed from some “ancillary” (not

necessarily a depth) cue.

2. The concept of promotion states that information provided by one cue makes

up for the deﬁciency in another cue.

3. The consistency among the cues can be used to adaptively weight the cues. If the
majority of the cues concur on a depth value, say, then the weights associated

with the dissenting cues are automatically reduced.

Studies related to the interaction between stereo and texture cues have been con-
ducted by Johnston et al. [90] and Buckley et al. [32, 33, 31].

Johnston et al. [90] studied. the interaction among stereo and texture cues. Their
experimental results suggest that stereopsis and Shape from texture are independent
processes in their early stages. They conclude that stereo and texture cues interact
Simply by means of a weighted linear combination, irrespective of whether the per-
ceptions resulting from the individual cues were in conﬂict or in agreement. Further,
the information from stereopsis was weighted much more heavily than that provided
by the texture cues. This indicates that stereopsis is considered to be a more reliable
source of information for the human visual system.

Buckley et al. [32, 33] studied interactions between stereo and shape from texture.

Their results can be summarized as follows:

1. Stereo and texture cues are pooled about the 3D surfaces only if evidence (infor-

mation) provided by the individual cues is Similar. The threshold on Similarity

24

depends upon the expected noise in each channel [32].

2. Buckley et al. [33] studied the perception of discontinuities in a region of an im-
age devoid of any binocular cues (monocular region). The test images consisted
of a textured monocular region between the regions displaying strong binocular
depth cues (binocular regions). The perceived location of a 3D edge in a monoc-
ular region was found to be consistent with the Obvious texture boundaries in
that region and the explicit 3D edges perceived in the neighboring binocular

regions.

More recently, Buckley et al. [31] have studied interaction among stereo, texture,
and outline cueS7. The experimental designs are based on the cue conﬂict paradigm:
experiments are designed such that each cue might produce conﬂicting evidence about
the underlying 3D surface. Their experimental data consists of synthetic stereo pairs

of either horizontally or vertically oriented ridges. Their ﬁndings suggest that:

1. If the synthetic ridge is horizontal, then stereo cues strongly dominate the ﬁnal

interpretation of the input scene.

2. If the synthetic ridge is vertical, then texture/ outline cues dominate.

3. Stereo cues dominated in all real ridge stimuli. This result is in agreement with

that of Johnston et al. [90].

 

7Authors refer to contours as outlines.

25
1.6 Connectionist Approaches

The most general approach of interaction would be where every intrinsic variable (e. g.,
reﬂectance, depth, surface orientation) in some way interacts with every other intrin-
sic variable as well as with the observations (pixel values). These types of interactions
can naturally be accommodated into a connectionist framework. Connectionist ap-
proaches are known for their learning capacity, graceful degradation in performance,
and Spontaneous generalization [125]. On the other hand, these approaches have a
tendency to be prodigal in terms of the computational and communication resources.
Due to the large number of parameters involved in the connectionist approaches, a
good insight into the solution space is needed before a connectionist approach can
be used to solve a non-trivial integration problem. Many studies have utilized con—
nectionist models for integration of image features for recognition and reconstruction

tasks [72, 203, 125, 115, 119].

1 .7 Utilitarian Theory

The most iconoclastic theory explaining how human visual system integrates all pieces
of information is proposed by Ramachandran [156]. He argues that the current theo-
ries of the human visual perception are “overarching” and impose unnecessary com-
plex mechanisms. According to him, the human visual mechanism was molded into
its present form by the evolutionary processes which had to exploit the available
neural hardware to effect an opportunistic perception. He cites several examples to

support his case. For instance, the human visual system has an “in-built” assumption

26

of an overhead source of illumination. This can be directly attributed to the upright
posture of the human beings and Single illumination source in nature (sun). He has
devised a series of elegant and novel experiments to prove his case. But, how could
one believe that the entire reconstruction problem can be solved by a collection of
ad hoc heuristics? He counters that, in most cases, the information required by a
visual task iS qualitative in nature and could easily be extracted with the help of a
few heuristics. Some of these arguments can be traced back to Gibson’s theory of
affordances [64]. The utilitarian theory is Similar in Spirit to the purposive vision
paradigm [5] which has resulted in a few practical navigation and tracking systems

[19] but not in any recognition systems [180].

1.8 Interactions between Constraints and Data

A popular computational framework for studying various integration frameworks has
been adopted from Clark and Yuille [42]. They argue that all the integration methods
can be best understood in terms of the types of constraints they employ and the
methods of embedding constraints to achieve stable and unique solutions.

Clark and Yuille deﬁne the following three types of constraints:

1. Physical constraints: The constraints that can not be violated by our physical
3D world. These constraints often include the image formation models and

other laws of physics.

2. Natural constraints: The constraints that hold good in most of the situations

but fail occasionally. General viewpoint constraint or smoothness constraint are

27

examples of this type of constraint.

3. Synthetic constraints: These are the constraints imposed by the designer of a

vision system by restricting the domain of the application of the vision system.

One classiﬁcation scheme of integration methods is based on how the individual mod-
ules interact with the data. Accordingly, Clark and Yuille classify an integration
system as either weakly coupled or strongly coupled.

In weakly coupled systems, each module independently processes the sensed data.
In strongly coupled systems, the operation of a module may be affected by the oper-
ation of another module.

The weakly coupled systems are further classiﬁed into the following three subcat-

egories:

1. Class 1: Each module can produce a unique and stable solution. These in-
dividual outcomes are combined to reduce the uncertainties in the ﬁnal result
(Figure 1.7). The weight assigned to the information provided by each module
is dependent on the reliability of that module. Accumulation can be considered

as a Class I type of interaction [90, 42].

2. Class II: In order to obtain a unique solution from the outputs of the individual
modules, it is necessary to apply a set of a priori constraints (Figure 1.8). These
methods rely on the algebraic solutions of the analytically modeled problems.
When the given Situation can not be modeled exactly, this method can not

be applied. These methods are sensitive to the modeling assumption and the

28

Sense Data

Module 1 a' o

 

 

6 Output

 

 

:12
Module 2

 

Figure 1.7: Class I: weakly coupled system.

resulting solutions are brittle. Disambiguation [34] or promotion [90] can be

thought of as Class II type of interactions.

Sense Data

a priori constraints
—9[ Module l al i
Fusion
Module
—9[ Module 2 J

L a2

 

 

 

 

 

Figure 1.8: Class II: weakly coupled system.

3. Class III: This method can be considered as a combination of the Class I and
Class II methods. An overdetermined system is achieved by a set of modules

and constraints (Figure 1.9).

The strongly coupled systems are further classiﬁed into feedforward and recurrent
systems (Figures 1.10 and 1.11). In recurrent systems, feedback paths are allowed.
In feedforward systems, the feedback paths are not allowed. The way in which a

module can affect another module is either by controlling its a priori constraints or

29

Sense Data

F—

   
  
 
 
  
   
   
    

. . n v
3 Module l a pnon constraints

g_____a
r————-\

[—)' Module 2

t_____J

  

Fusion
Module

 

 

 

Figure 1.9: Class III: weakly coupled system.

by directly modifying their inputs. These models of interaction are the most general.
Cooperation [34] and competition can be thought of as examples of strongly coupled

systems.

1.9 Difﬁculties in Integration

It is generally agreed that vision researchers have not devoted sufﬁcient efforts in
designing and building complete vision systems [8] Consequently, there is a dearth
of expertise in the context of building complete vision systems.

The sources Of complexity for the integration problem can be broadly classiﬁed
into three categories: (i) the inherent difficulty in designing a complete vision system,
(ii) theoretical issues related to integration, and (iii) the implementational problems.

First, we list the difﬁculties arising due to the characteristics of the domain [164, 51]:

o Sheer quantity of data: The volume of input data to a computer vision system
is overwhelming. An image with a reasonable. resolution (512 x 512) and frame

rate (20 per sec.) would involve about 20 Mb data per second. A simple

30

Sense Data

{ .

Module 1

 

 

 

- J

I

 

 

 

a priori
constraints
I 1
Output
"'9 Module 2 "—9
L J

 

 

 

Figure 1.10: A feedforward system.

31

 

 

 

 

 

 

 

 

 

 

 

 

Sense Data
Outputl
Module 1 J >
T It
a priori
constraints
‘3 V
Output2
Module 2 J >

 

Figure 1.11: A recurrent system.

convolution Operator over this data would require almost half a billion arithmetic
instructions per second. Even massively parallel architectures will need to be

very selective in their choice of processing strategies.

0 Uncertainty in data: As mentioned earlier, visual data are often locally ambigu-

ous. In addition, image noise further deteriorates the information content.

0 Lack of introspection: Although every human being is an ‘expert’ in performing
visual tasks, we do not know how we acquire such a capability. What constraints
actually work? Why do the constraints that actually work are so unintuitive?
The only process of introspection of our visual processing is through data on

visual illusions and on psychophysical experiments.

We now discuss inherent theoretical and implementational difﬁculties in integration.

32

0 One of the most difﬁcult problems in integration is speciﬁcation of the con-
straints and the role played by each of the constraints. The last few decades
have identiﬁed several cues that are useful in 3D reconstruction. However, the
exact role played by each of these cues in the context of a vision system recov-
ering the 3D structure of an input scene and its relative Signiﬁcance is still an

active research issue.

0 Incorporation of these constraints into an integrated system appears to be the
most difﬁcult problem. Quite often, the constraints exerted by the cues are
not obviously commensurable. For instance, the geometric constraints imposed
by line labelling algorithm are qualitatively different from the information Of-
fered by stereo or perceptual organization module. Is it possible to establish a
common framework/ representation for interaction? The most general represen-
tation would be the multivariate joint probability distributions involving the 3D
properties of the world and the response (features) of each module. How can we
Obtain this distribution? Even if such a distribution could be estimated, we feel
that it would be too general to constrain the solution to the 3D geometry of the
world. IS there a better choice? Marr proposed that we could use 2.5D Sketch
as a common interaction ground for all the vision modules. This appears to be
a reasonable solution for all the modules providing depth information. What
about the cues which are not directly involved in estimating depth? Even when
we accept the depth map as an interaction ground for all the modules, it is not

clear as to what representations and what levels of representations will be most

33

suitable. One could argue that coarse qualitative features would form a more
stable basis for interaction among the modules. But, what is a desirable choice
of qualitative features? How can these qualitative representations be made to
work in a domain where constraints of many vision modules are expressed in

quantitative terms?

0 A Simultaneous recovery of a large number of intrinsic variables (e.g., reﬂectance
and surface orientation at each pixel) is often intractable. If a limited auton-
omy is to be granted to each visual module, then such recovery of intrinsic
variables and reconstruction inherently needs Simplifying initial conditions to
be incorporated into each module. Sometimes, these assumptions are necessary
because 3D structure of the scene is not known a priori. For instance, if we de-
sire to include a mutual illumination model into a Shape from Shading module,
it needs an a priori knowledge of the geometry of the scene to be recovered!
How do these assumptions affect the performance of the other modules? What
approaches permit a gradual reﬁnement of the initial assumptions made by each

module to obtain the optimal reconstruction?

0 Traditionally, researchers have used a linear combination of the individual ob—
jective functions to derive a global object function on the basis of simplicity
arguments. Often, these cost functions do not represent the dynamics of the
interaction among the constraints. The resultant systems often require a ‘good’

set of parameter values to deliver a reasonable performance.

0 Another difﬁculty in the integration problem is due to the diverse sets of as-

34

sumptions made by the individual modules. Often the domains of applicability

of the individual modules do not overlap or overlap only slightly.

0 When Should a module actually be employed? AS Simple as this sounds, often
this issue can not be dealt with in a theoretical fashion. For instance, take the
case of a Shape-from-texture module. What is texture? Is texture isotropic?
Are the surfaces of objects in the scene piecewise continuous? These and other
questions need to be answered before we know when and where to ‘apply’ the

module.

0 Often, algorithms involved in each module are chosen off-the—Shelf for the visual
integration. The designer needs to take into account the artifacts and idiosyn-

crasies of the individual modules over and above the theoretical difﬁculties.

1.10 Design Issues

Design issues involved in developing a machine vision system are not very different
from those involved in developing any other complex system. Any non-trivial system
design will involve conﬂicting requirements and good judgment will be needed in

making the correct trade—offs. We discuss here some of the important issues.

0 Correctness and Reliability: The purpose of integration is to increase the relia-
bility of the ﬁnal output. Hence, in a formal sense, the system should correctly
use all the available information. Unreasonable assumptions about the data and

the constraints often introduce artifacts in the ﬁnal solution [20]. The overall

35

solution Should continuously depend on the data [187].

Completeness and Generality: The integrated system Should include a reason-
able set of modules and constraints to provide an accurate depth information
in commonly occurring Situations where the individual modules fail. Yet, the
proposed framework Should be able to accommodate any additional information

and cues.

Optimality and Tractability: Systems employing a minimal number of con-
straints tend to be brittle; overdetermined systems deliver stable perfor-
mance [42]. Although, computationally effective models would be attractive,

the issues of optimality and tractability are currently of secondary interest.

Modularity and Extensibility: System design, development, and maintenance are
considerably streamlined if the system has a modular structure. It is also easy
to extend such a system. However, modularity may often result in suboptimal

solutions.

Defaults, Partial Information, and Ambiguity: Perhaps, this is one of the most
important issues related to integration. The individual modules need reasonable
mechanisms to represent their intermediate inferences which can be used by
other modules. In the absence of any external information, the modules need

to be aware of the context-dependent defaults.

36
1.11 Problem Statement

We are trying to solve the following problem:

Given a pair of stereo images of a scene containing objects illuminated with a
ﬁnite number of point sources with known positions and possibly with ambient light,
our goals are (i) to recover a complete and accurate 3D structure ( depth and surface
normals ) of the input scene, and (ii) to evaluate the efficacy of the proposed integrated
reconstruction results.

A generic block diagram of modular integration is shown in Fig. 1.12. Given a
stereo pair of intensity images, the integrated system recovers the 3D structure in
the scene with the help of a given set of vision modules. Sample input images of a
synthetic scene along with the desired outputs from the integration system are shown

in Fig. 1.13.

 

V ﬂ

 

 

am : Depth Map
0 —"
Input Stereo 1' o . = '3
Images 0
' E r—>
[ l' o c '1 Surface
4 Normals

 

Figure 1.12: A generic integration scheme.

The choice of which modules to incorporate into our system was not easy, given the
long (and extending) list of vision modules which have been used in the vision litera-
ture (Table 1.3). Stereo, Shading, texture, grouping, and line labeling information are
among the most extensively researched cues and form a representative sample from

the entire gamut of information that could be extracted from the intensity image(s).

37

 

(C) (d)

Figure 1.13: System input/output: (a) and (b) synthetic left and right input intensity
images; (c) ground truth depth (pixels closer to us are brighter), (d) a wireframe
representation of the depths.

38

Table 1.3: A list of vision modules.

 

Shading [80] Texture [61] Stereo [68]
Contour [78] Symmetry [186] Grouping [199]
Photometric stereo [202] Focus [55] Line labeling [139]
Color [40] Motion [197] Specularity [74]
Occlusion [97] Vergence [1] Perspective [131]
Darkness and Shadows [95]

 

 

 

Further, principles of 3D reconstruction from several other cues closely resemble those
of the modules constituting the integrated system. For instance, solutions to struc-
ture from motion and stereo problems are based on establishing correspondence and
shape from structured light, shape from contour, and shape from texture modules
attempt at inverting the projective imaging geometry. Consequently, the integration
of the other modules into the existing systems should be fairly Similar. The proposed
uniﬁed Bayesian framework (Chapter 5) facilitates inclusion of a new module into the
integrated system with a relative ease. It should be noted that the proposed systems

do not impose any restrictions on extending them to include new modules.

1.12 Contributions of This Thesis

The following is a list of the contributions of our work.

1. We propose and implement a uniﬁed Bayesian framework for integrating vision

modules.

2. We have built two working integrated vision systems: one integrates perceptual
organization, stereo, shape from shading, line labeling and the other integrates

perceptual organization, stereo, Shape from shading, and Shape from texture.

39

3. Design of a perceptual organization module which integrates region- and edge-

based segmentations using Gestalt and intensity gradient cues.
4. A limb detection module for reliably detecting occluding limb boundaries.

5. An improved implementation of the line labeling module which exploits strong

constraints exerted by limb boundaries.

1.13 Organization Of the Thesis

Rest of this thesis is organized as follows. The details of computational mechanisms
needed for the interactions described in this chapter are the topics of Chapter 2.
The individual vision modules and their limitations are studied in Chapter 3. A
non-uniform integration scheme is proposed and evaluated in Chapter 4. Chapter 5
provides a Bayesian integration framework for integrating vision modules. The con-
cluding chapter describes the contributions of this thesis and directions for future

research.

Chapter 2

Integration Methods

Computer vision researchers have studied several models of information integration.
In this chapter, we will ﬁrst describe different models of integration methods. This
will be followed by a summary of the efforts on information fusion and integration
available in the vision literature. Each section will brieﬂy describe an integration
model, followed by the relevant vision applications. We will conclude with a compar-

ison of the strengths and shortcomings of various approaches.

2.1 Complexity Of Integration

Complexity of integrating information can vary Signiﬁcantly. In this section we make
a distinction between different types of integration based on their complexity and in-

troduce a terminology for different information integration strategies in the literature.

0 Type A: Registration. In this type of information integration, 3 common
frame of reference is established between two (or more) representations to make

40

41

the Spatidtemporal relationships between the components Of each represen-
tation explicit with virtually no interpretation of the information (Fig. 2.1).

Example: Ton and Jain [188].

0 Type B: Fusion. The registered representations can locally be combined
to obtain a more reliable information. Fusion usually takes into account a
limited number of constraints like spatial dependency and continuity (Fig. 2.2).

Example: Nadabar and Jain [133].

0 Type C: Collation. Given a set of explicit constraints, two (or more) rep-
resentations are related and combined to obtain a more complete and reliable
information about the data (Fig. 2.3). Note that a collation problem might

involve global constraints. Example: Chang and Aggarwal [38].

0 Type D: Integration (of cues). In this scenario, the constraints are not
explicit and often, the interactions among the constraints are not clearly un—
derstood. Further, the integration problem might often be underdetermined.

(Fig. 2.4). Example: Moerdler and Boult [128].

The work described in this thesis primarily deals with the integration of cues.

2.2 Bayesian and Non-Bayesian Fusion

Consider a multisensor system with M sensors at each of the N locations producing

a set of M x N observations, {X,-,-}. Let {Yij} denote the true (unknown) world

42

 

 

CI)

ill
LI)

O

3303

 

 

 

 

 

 

(b)

 

 

 

 

 

 

 

 

Figure 2.1: Registration: (a) and (b) are two images of a given scene; (c) Shows an
image which makes the Spatial relationship between (a) and (b) explicit. The dotted
lines demarcate positions of (a) and (b) with respect to each other.

 

Figure 2.2: Fusion: (a) stereo, (b) sonar, and (c) fused data. The density of the dots
represent the likelihood of occupancy of that Space by a physical object [129].

 

X view

 

 

 

 

Y View

 

 

 

A
O //

Figure 2.3: Collation.

43

 

 

x‘xxx
‘ o

Shading

 

 

 

 

 

>.
9.:‘
t:‘
cg:
1:.
o
o
e
a :‘e v :
7"6
4' v, v
7 a.

~5-

3D Interpretation

 

Line Diagram

Figure 2.4: Integration of line diagram and Shading data to obtain a correct 3D
interpretation.

variablesl denoting the state of nature at each of these locations. Given a set of
possible labels of objects at these locations, the system attempts to determine which
assignment of labels to the variables {16,-} best describes the observations.

Bayes’ formula states that the optimal prediction about the state of nature can
be made according to

{Xijllyij = y)P(YiJ’ = ye)
Pllxull 3

 

1304:“ = yz‘jliXijll = P(

where

o P(Y,-j = y,,-|{X,-j}) is the posterior probability of an object with label y,j, given

the set of observations, {Xij}.
o p({X,-,-}|Y,-J- = yij) is the likelihood of a set of observations {X0}, given 1”,,- = y,,-.

o P(Y,-j = y,,-) is the prior probability of the label Yij = y,,-.

 

1Also referred to as real world variables [8].

44

o P({X,-j}) is the normalizing factor.

Each sensor i will be able to supply its opinion of the observations representing a
true label. Evaluation of p(Y,~j = y,,-[{X,~,~}) by the system involves integration of
the information received from all the sensors. Often, a simplifying assumption of
the statistical independence of the likelihood function of each sensor is required in a
Bayesian framework to yield tractable schemes of integration? Such an integration
results in an increased certainty of information [42].
Bayesian methods have been used by several researchers for information integration.
Bayes’ rule is the basis for determining ‘occupancy’ of the Space in mobile robot work
of Moravec [130] and of Mathies and Elfes [122]. They both fuse data originating from
different sensors to reliably avoid obstacles in the robot’s path. Chou and Brown [39]
have used Bayesian methods for labelling image regions. Chu and Aggarwal [41] have
used the maximum likelihood estimate for obtaining an edgesbased segmentation from
multiple sources of information. Bolle and Cooper [23, 24] have used Bayesian meth-
ods for integration of evidence offered by different parts of an object for recognition
tasks. Rao and Whyte [159] used a decentralized hierarchical algorithm for integrated
decision making in multisensor systems. Rigoutsos and Hummel [161] have used a dis-
tributed Bayesian formulation for objection recognition. One of the major difficulties
in implementing a Bayesian integration is in Specifying the a priori probabilities.
Several researchers have explored the possibility of making robust decisions instead

of taking the Bayesian approach [99]. Robust statistics Show a graceful degradation in

 

2More involved treatment of dependencies is deferred until Sections 2.4 and 2.5.

45

performance in the presence of noisy observations. For instance, the median of a data
set is a more robust statistic than its arithmetic mean. Use of robust statistics usually
results in more practical systems at the expense of tractability of rigorous analyses.
Explicit voting schemes have also been used in the literature to reliably integrate
different sources of evidence. A statistical multi-source classiﬁer iS a general method
for classifying multispectral data [167]. A linear opinion pool classiﬁer uses consensus-
based classiﬁcation [16] and is Shown to perform better than the maximum likelihood
classiﬁer for certain simple data sets. Quite often, in the absence of quantitative
data, subjective methods of fusion are pursued. Two such approaches which Simplify
the integration of evidence are: Dempster-Shafer theory [169] and Possibility theory
[52]. They both are characterized by very simple rules of combination of evidence.
The price one often pays for such simplicity is the erroneous results due to implicit
independence assumptions underlying such subjective integration models. In order
to overcome these limitations, researchers have attempted to use more sophisticated

models [83]. However, it turns out that these models are not fool-proof, either [8].

2.3 Relaxation

A relaxation algorithm is an iterative method of constrained optimization. Given a set
of labels, a set of objects3, neighborhood relations among the objects, and constraints
among the labels of the neighboring objects, a relaxation algorithm ﬁnds the most

consistent conﬁguration of the object labels. The idea is to initially assign all possible

 

3We have used the terms objects, sites, and pixels interchangeably in next few sections.

46

labels to all the objects and then remove a label from the label set of an object if it
is found to be incompatible with any of the labels of the neighboring objects. This
removal may, in turn, make some labels of some other objects inconsistent. The
process continues until either (a) the label set of an object is empty indicating that
no solution iS possible; or (b) no label of any object can be removed indicating the
desired conﬁguration(s). In the latter case, we may need to impose more constraints

to avoid ambiguity.

This concept can be extended to probabilistic relaxation. Here, instead of a def-
inite association of certain labels with an object, the likelihoods of an object being
associated with each label are computed. Note that the term probabilistic is merely
a misnomer; the derivations for the expressions used for ranking the likelihoods are
justiﬁed on the basis of subjective arguments. However, these (pseudo) probabilistic

approaches have been found to be useful in building practical vision systems.

Let Pt(y) E [0, 1] be the ‘probability’ or weight associated with site t (a pixel (i, j ))
and label y. Constraints are provided in terms of compatibility functions, rt,(y, y’),
quantifying the compatibility between label y at Site t and label y’ at Site 3. The

weights are updated in parallel according to the following iterative equation [107]:

 

n+1 _ Pt(y)‘"’[1 + APl(y)("’l
my” ) — 2,, Pl(y)‘"’[1 + APl(y)‘"’l’

where

APl(i/)("’ = Z dial; rots, y')Ps(y')‘"’l, (2-3)

47

n is the number of iterations, and dt, are the weights Specifying the relative contri-
butions from the neighbors of y. The algorithm terminates when weights for some
label of most of the sites is close to 1. Given the values of visual cues at each pixel
(say), a set of labels to be associated with each pixel, and the constraints among the
labels of neighboring pixels, a relaxation algorithm is an iterative method of ﬁnding
the most consistent set of (label, pixel) association. However, as the integration prob-
lem becomes more complex (more number of measurements, intrinsic variables) and
the visual cues interact in a more complex way, it becomes harder to interpret the
parameters (rts, for instance) of the relaxation process. Secondly, the true measure-
ments are ignored after they have been used for constructing the initial estimates of
the labels. Due to these two reasons, relaxation methods are considered to be weaker
formulations for integration of evidence than the classical probabilistic models [132].

There is an extensive body of research on relaxation methods and their applica-
tions to vision problems [82, 165, 184, 67, 49, 96]. Marr and Poggio [119] were among
the ﬁrst to suggest the utility of relaxation techniques for cooperative computing in
low-level vision modules. We will brieﬂy mention a few studies that formulate visual
integration in terms of relaxation.

Barrow and Tenenbaum [12] used relaxation for computing intrinsic properties like
reﬂectance, surface orientation, and illumination of a given image. Given two images
of a scene taken at different times and from different positions, a relaxation approach
was used to match the images in [119, 11]. A multigrid technique in relaxation was
introduced by Terzopoulos [185] to accelerate prOpagation of constraints in relaxation

applications involving large high resolution images. Its efficacy was demonstrated in

48

various applications, including a) ﬁnding lightness (reﬂectance) of Mondrian surfaces;
b) estimating surface orientation from Shading cues; and c) computing optical ﬂow

from an image sequence.

2.4 Markov Random Fields

The Bayesian approach discussed in Section 2.2 gives us a method to ﬁnd an ‘op-
timum’ estimate of the state of nature without exploiting any dependencies (e.g.,
Spatial dependency) among the neighboring pixels. In order to take advantage of the
Spatial constraints among the pixels (objects) in the neighborhood, we need a formu-
lation which can couple probabilistic decision making with Spatial constraints. The
use of Spatial constraints is often necessary to make the image interpretation problem

well-posed [121].

 

, r '2' Horizontal Line Sit

u . [1 Vertical Line Site

[I . [ [I Pixel Site
u

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(b) (c)

Figure 2.5: Example neighborhoods of pixel and line (edge) sites for (a) a pixel Site;
(b) a horizontal edge site; and (c) a vertical edge Site (from [134]).

49

One method to realize this is to consider a multivariate probability distribution
involving real world variables (intrinsic variables) at each pixel and observations of
visual cues at each pixel. Given measurements about the observable visual cues,
the prior constraints among the observed visual cues and the true label, one could

formulate a Bayesian solution according to:

P(ylx) = —f“f:((:')’”,

where
x = { x,,-; x,,- is the observation made by the j‘” visual cue at the ith site},

and

y = {yi; y,- is the state of nature at the ith site}.

Note that y,- is a vector representing the intrinsic variables at site Q,. This large sys-
tem of variables makes the problem not only intractable but often underconstrained.
We can, however, impose a reasonable assumption that only pixels within a local
neighborhood can interact directly (locality assumption). This simpliﬁcation is the
essence of the Markov Random Fields (MRF) formulation. Let N(Q,) denote the

5

neighborhood of a Site Qi”. Let Y“, represent one of the intrinsic variables at Site 62,.

 

“For the sake of brevity, a linear ordering of the two-dimensional sites is assumed.
5Also referred to as real-world variables.

50

The Markovian property states that6

P(Y.-k|Z\Y.-,.) = P(Yik|ZN‘),

where Z = {Yij} U {Xij};

ZNi = Y1"i UXNi;

YN‘ = {YnlIQn E N(Qi)};

and

XN*={X,,1|Q,, E N(Qi)};

The constraints between ‘neighbors’ are deﬁned by assigning energies to all realiz-
able conﬁgurations; lower energies indicate more likely conﬁgurations. These energies
contribute to the ‘internal’ energies (clique energies) at each Site. The external energy
at each Site depends upon P(Y,-,~|Z~"). The energy of the entire system will be the
sum of the internal and external energies of all the sites. Given X, deﬁnition of a
neighborhood N(.), and likelihoods of different conﬁgurations of neighbors (in terms
of clique energies), the problem of most consistent interpretation of the observations
is equated with the problem of ﬁnding a realization of Y which will result in a mini-
mum energy of the system. Notice that MRF formulations not only permit modeling
the constraints among the observed visual cues, X,,-, but they can also handle the
constraints among Y,J-S and the constraints between st and Xijs.

Although MRF formulations can elegantly model many real-world constraints like

transparency and continuity [111], there are certain difﬁculties in employing them for

 

6See appendix for notation.

51

practical applications. First of all, MRF models are computationally expensive. Many
alternative formulations are being devised to alleviate this problem [17, 121, 120]. A
second problem plaguing implementation of an MRF formulation is the difﬁculty in
estimating the prior probabilities which represent the contextual information in the
given application. A few techniques have been devised to partially alleviate this
problem [134].

MRF models used for integration of different information sources require that all
the information from individual sources be available Simultaneously. Marroquin [120]
modiﬁed the MRF methods to ease these restrictions by decomposing the overall
processing into several stages. Each stage can be designed to handle partial data
(estimates). Marroquin claims that these new models, random measure ﬁelds, are
computationally less expensive than other regularization methods (see Section 2.7).
He also argues that these models are more appropriate for visual reconstruction prob-
lems due to their ability to handle phenomenon like transparency, occlusion, etc.

The pioneering research in MRF has been the work of Geman and Geman [63] who
formulated the problem of segmentation of intensity images using a tightly coupled
system of ‘line’ and ‘pixel’ processes. The line processes prevent smoothing across
a potential discontinuity. The maximum a posteriori (MAP) solution is achieved by
the Simulated annealing process.

Gamble et al. [60] have used MRF models to integrate the outputs of four vision
modules: color, edge detection, motion, and stereo. The integration is based on
the assumption that Shape and color discontinuities are usually (Spatially) related

to the brightness (intensity) discontinuities. And that motion boundaries coincide

52

with the depth discontinuities. By selecting the coupling parameters empirically,
they have obtained a reasonable improvement in the localization of boundaries over
those estimated by the edge detection module alone. Modestino and Zhang [127]
have used MRF models for consistent labeling of presegmented regions. Daily [48]
has used MRF models to integrate information in color channels with the process of
segmentation. Nadabar and Jain [133] used the MRF models for fusing range and

intensity images.

2.5 Bayesian Networks

A more general modeling of the Spatial and other constraints can be carried out
by means of Bayesian networks. Bayesian networks, which are also called belief
networks, or causal networks, are directed acyclic graphs with nodes representing
propositions (or random variables) and arcs signifying direct dependencies quantiﬁed
by the conditional probabilities. An example of the Bayesian network is shown in
Figure 2.6. If the events Xl-Xg were independent, their joint density would have
been simply a product of the individual probabilities. The events in Figure 2.6 are
not independent (perhaps, due to common underlying cause). For instance, the nodes
X8 and X9 are related 7. This dependency is represented by node X7 and the joint

density of events X7, X8, and X9 is given by

P(X7,X8,X9) = P(X7)P(X8[X7)P(X9[X7).

 

7They represent two parallel boundaries, for instance.

 

HXI ..... X9) 3 HX‘XXD HXRIX'I) HX‘IIXJ) [’(XIIXIXJ) WXHXSWIXS‘XIWIXMX l )PIXZIXI )HXI)

Figure 2.6: An example of a Bayesian network.

The Bayesian network formalism becomes intractable for any non-trivial applica-
tion. Cooper has Shown that the problem of constraint propagation in such networks
is NP—hard [44]. Usually, there is a need to use additional domain-dependent heuris-
tics to make these networks feasible for any practical application.

Such a domain-dependent extension of the Bayesian network called perceptual
inference network was used by Sarkar and Boyer [166] to represent the statistical
dependencies among various features (e. g., curvilinearity, and symmetry) in images of
man-made environments. They used this network to integrate the information about
various Spatial features to form composite hypotheses and a prediction of structures

(e. g., rectangles and ellipses).

2.6 Information-theoretic Methods

Gestalt laws of good form (Prc'ignanz) have been proposed as the key to understand-
ing perceptual organization. This eventually led to the ‘minimum’ principle which

states that “other things being equal, that perceptual response to a stimulus will be

54

obtained which requires the least amount of information to Specify” [75]. The mini-
mum description length (MDL) approach developed by Risannen [162] states that we
Should interpret the data in terms of minimal description with respect to a prespec-
iﬁed model of the domain (language). Given a reasonable description language, the
Simplicity criteria provides an accurate and computable determination of perceptions
which are most likely. The Bayesian method of combining information is related to
the MDL method with a priori probabilities P(Y) corresponding to the length in the
description and the P(X IY) term related to the error of description [42].

In the MDL approach, the integration problem can be stated as follows: Given
a reasonable description language for the object domain, a bound on the acceptable
error in describing the observations, and a set of observations X,,-, i = 1,. . .,N,
j = 1, . . . , M, from N vision modules at M Sites, determine the shortest description
(most likely) of the observations within the prescribed error bound.

Leclerc [101] applied MDL approach for obtaining a reasonable segmentation of
monocular intensity images using piecewise constant and piecewise continuous models
for description of surfaces and boundaries, respectively.

MDL is an intuitive approach, but it has several implementational problems. First,
' what is the best language for a description of the domain? Researchers have used
piecewise polynomial models (descriptions) in their MDL formulations without any
objective justiﬁcation. Since the power of MDL approach is derived from how well the
language captures the object descriptions in the domain [8], it is necessary to match
the constructs of the language with the peculiarities of the domain. But, how to

design such a language in an objective manner is an open research issue. The second

55

issue regards the choice of the metric for the information content. The conventional
approach has been to use the number of bits required to represent a description as a
measure of the information content. While this is a good approximation of the infor-
mation content, it is not obvious as to why all the descriptors of a language Should
be weighted equally. For instance, if we choose Fourier descriptors as our choice of
the language, then not all descriptors are equally important. Given the domain of
polyhedral objects, lower frequencies are more reliable sources of information than the
higher frequencies. Some non-zero components may be more common (and, there-
fore, less meaningful) than the others. A particular combination of parameters may
be more meaningful than the sum of the information content of the individual param-
eters. It is also not known how to relate the information content in one vision module
with that of the others. Optimization formulations using the MDL approach require
relating the cost of inaccuracy due to a description and the information content of
the description into a single objective measure. Since the error in description and
the information content are two conceptually different entities, it is often not possible
to objectively relate the error in the approximation with the information measure
of the description. Finally, the solution surface of the MDL formulations has been
found to exhibit several local minima. Obtaining the correct solution usually requires

computationally demanding techniques [101].

56
2.7 Mechanistic Models

Despite severe degradations and distortions undergone by sensed data, we are still
able to perceive the world in a stable manner. For this reason, we restrict our solution
space only to the stable solutions [20]: the solutions which continuously depend on

the input data.

Most early vision problems are inherently ill-posed. Often, we introduce ‘synthetic’
constraints to obtain a unique and stable solution from an ill—posed problem. This
method of converting an ill-posed problem to a well-posed problem by restricting the

space of acceptable solutions is called regularization.

Regularization usually involves enforcing synthetic constraints like smoothness of
the solution (ﬁrst- and second- order differentiability). One approach to formulating
regularization is based on physical modeling. Physical objects resist bending, stretch-
ing, twisting, and breaking under the inﬂuence of external force(s). Their tendency to
maintain a characteristic smooth proﬁle is used as the synthetic constraint in energy
minimization approaches involving mechanistic modeling. In this formulation, the in-
terpretation of the sensed observations is related to the behavior of a physical ‘object’
under the inﬂuence of a set of external forces and the sensed data are related to the
set of external forces. The object is usually in the form of a thin membrane, a plate,
or a rod. The stiffness, stretchability, and shear strength of the object are used to
model the relative desirability of the ﬁt of the object to the external forces (data) and
the smoothness of its conﬁguration. Each conﬁguration of the object deﬁnes certain

energy depending upon the physical parameters of the object and its conformation

57

to the external forces. It is hypothesized that the most likely solution will emerge as

the minimum energy conﬁguration of the equivalent physical system.

Regularization alone can not guarantee physically meaningful solutions. If all
the ‘natural’ constraints are not well represented in the formulation, the ‘Synthetic’
constraints may produce strange results. For instance, in the regularization formu-
lation of shape from shading problem [30], the integrability constraint is not taken
into consideration. As a result, the algorithm often produces abstract (as opposed to
physical) surfaces. Subsequent enforcement of the integrability constraint corrected
this problem [59]. Many approaches, therefore, couple the regularization techniques

with the Speciﬁc geometrical and physical constraints pertaining to 3D world.

The integration problem can be formulated in the mechanistic setting in the fol-
lowing manner: Given the order of interaction among the objects, a set of parameters
governing the relative strengths (energies) of interaction among the neighboring ob-
jects, and the sensed data from different modules, the problem of integration is to
ﬁnd the minimum energy (most likely) conﬁguration as an interpretation of the sensed
stimuli. Note that the power of the model lies in correctly choosing the parameters
governing the interactions. These parameters assume the most likely “surfaces” in

the given domain.

An example will clarify these concepts. Consider ﬁtting an energy minimizing
curve to the parameterized set of data points x(3) = (x(s), y(s)). The energy of the

ﬁt u(s), E, consists of two components.

E = Eext + Eint°

58

The external energy, Em, is deﬁned as:

Eext : [AIIIX(S) _ 11(3)“2

and the internal energy is deﬁned as:

E,,,= / A21 ((1-1 <s+>> as()l(s)l+/\3[u”(s)2(1-k(s))+ﬁ(s)k(s)lds,

where l(s) and k(s) are functions denoting the ﬁrst-order and second-order derivative
discontinuities, and u(s) and 5(3) are the penalties associated with these disconti-
nuities. The parameters A1, A2, and /\3 control the stretchability and stiffness of
the curve. The external energy term enforces closeness to the data and the internal
energy term enforces the smoothness properties of the curve. The resulting energy

minimization could be solved using Euler’s method [190].

We now mention a few examples of mechanistic and physical modeling formulation
in computer vision literature. Depth from stereo and surface interpolation algorithms
developed by Grimson [68] and Terzopoulos [183] are examples of mechanistic formu-
lations. Blostein and Ahuja [22] integrate the extraction of texture elements (texels)
with surface Shape extraction by modeling the Size change of the texel Shape due to
the projection process. Hoff and Ahuja [76] describe integration of depth from stereo
and surface interpolation process by assuming that the surfaces in the real world are
smooth. Their approach uses this constraint to mutually guide both processes in an

integrated manner. Sugihara [177] describes the integration of boundary extraction

59

with 3D Shape extraction for polyhedral Objects. He uses various methods Of shape
extraction including Shape from Shading, Shape from texture, and line labeling. The
domain is sufﬁciently restricted to allow him to use surface models to complete the
line drawings and subsequently use this information to construct the 3D shapes of
objects. Unfortunately, his methods can not be easily generalized to more complex
domains. Malik and Maydan [113] use their line labeling scheme to derive boundary
conditions for object surfaces to compute Shape from Shading. Their method assumes
perfect line drawing as an input to their line labeling module. Stockman et al. [176]
integrate surface Shape from range data and line labeling. Aloimonos and Shulman
[8] describe several pairwise module integrations for obtaining depth, Shape, or struc-
ture. They Show that the direction of light and Shape can be uniquely computed using
motion and Shading information. However, results are presented for synthetic images
and the authors admit to stability problems related with their approach. They also
describe a method of integrating texture and motion to derive shape if the motion
parameters are known. Further, shape and 3D motion can be computed from contour
and stereo information if the correspondences in the image pair are known.

It has been theoretically shown that under certain conditions both mechanistic
and probabilistic models are equivalent [20, 179]. However, the efﬁcacy in modeling
the physical surfaces and the well-behaved resultant descriptions are the strengths of
mechanistic modeling. Figure 2.7 is a case in point. Usually, ﬁrst-order and second-
order MRF models produce less intuitive results than the corresponding mechanistic
models. However, mechanistic modeling has several Shortcomings also. It is difﬁ-

cult to model phenomenon like transparency and statistical dependence using these

:2" ":5

Figure 2.7: First-order and second-order MRF models produce unintuitive results
(from [20]).

models. The strong assumptions underlying the mechanistic modeling produce per-
ceptible artifacts in the resultant interpretation of the data. How will the models
behave when the assumptions do not hold? Since statistical models can inherently

model random variations, they are more robust than their mechanistic counterparts.

2.8 Game-theoretic Methods

Bozma and Duncan [28] have used a game-theoretic method to obtain reasonable
parametric descriptions of objects in medical images. In this formulation, each player
represents a vision module who can independently pursue its own objective. The
interests of the players can potentially conﬂict (non-cooperative game [9]). We brieﬂy

describe some of the terminology below before presenting the formulation.

Formally, in a game with N players, each player is associated with a decision

vector, Pi, and a payoff function, E, as shown in Figure 2.8.

61

 

 

 

 

( W
i th Player Pi
Fi(P1,...,PN) ——e>
L 4
P.
J

Figure 2.8: A Single player.

1. At each decision epoch t of the game, player i chooses a Speciﬁc decision vector
pf. This is referred to as the move of the player i. The initial decision vector is

assumed to be p?.

2. A payoﬁfunction, F,(p1, . . . , p;_1, pi, pi“, . . . , pN), evaluates the performance
of player i based on the set of decisions made by all the players. Player i strives

to maximize its payoff by modifying the decision vector, p,:

Player i 313?: argmpaxF,(p1, . . .,pi, . . .,pN).

The equilibrium point, p*, of the game is deﬁned in terms of the equilibrium

points pf of the individual players: p* = (p1,. . .,pf, . .. ,pfq).

Nash Equilibrium, p* = (p1,. . .,p‘i", . . .,p’ﬁ), is a set of decisions such that for
player i,

pf : “1;?deva ° °ipf-lipfipf+li ' ' ' a DIN)

P“:

r».
ac.)

62

From the viewpoint of the it” player, Nash equilibrium point Pf is locally optimum.
Decision Model: Given an operating point p, player i modiﬁes its decision vector
to maximize its payoff. A reaction map of a player is the collection of all of its optimal
responses for each operating point in (p1, . . . ,pN).

A Simple example of a 2-player game will clarify these concepts [26]. The output
of player 1 is pl 6 R, and the output of player 2 is 192 E R. The objective8 of player

1(F1ER) IS

F1(P1,P2) = (P1 —1)2 + (P2 — 3)2 +(P1—1)(P2 — 3).

The objective of the player 2 (F2 6 R) is

F2(p1,pz) = 075002 — 1)2 +1-5(pl — 3)2 + 075(191— 3)(p2 — 1)-

The operating points for which an objective function takes a given value can be seen
as a level curve of the inverse objective function. The level curves of F1‘1(c) and
Ff‘(c) are shown in Figure 2.10. The X—axis represents the decision, p1, of player 1
and the Y—axis represents the decision, p2, of player 2. Each closed contour represents
a set of points for which the corresponding payoff function takes a constant value.
In Figure 2.11, a superposition of these two level sets and the individual reaction
maps are shown. In Figure 2.12, we have shown a set of successive moves made by

each player to eventually reach Nash equilibrium. It has been shown that even if

 

8Note that the use of cost function instead of payoff function converts a maximization problem
into the corresponding minimization problem.

63

the decision of each player was updated in a parallel fashion, the Nash equilibrium
is reached. The success of this formulation depends on the validity of the following

three critical assumptions:

0 Each objective function is locally convex in the neighborhood of each of the

Operating points during the evolution of the solution.

0 Each objective function is C2 with respect to the local variables.

0 Each objective function is bounded from below by a ﬁnite cost.

Bozma and Duncan [28] have used non-cooperative game-theoretic methods to
integrate two vision modules. Their strategy of integration allows for distributed,
modular implementation. However, designing appropriate (locally convex) objective
cost functions for the individual modules appears to be difficult. The bottom-up
module is responsible for detecting edges; the top-down module ﬁts parametric Shapes
to the detected edges. The two modules independently optimize their individual
objective functions. Eﬂicacy of their approach was demonstrated for detecting organs

in fairly complex medical images.

2.9 Lattice-theoretic Methods

Jepson and Richards [89] proposed a framework for assimilation of information pre-
sented by several modules. In many situations, use of knowledge about the world for
passively regularizing the sensory data is not justiﬁed. For instance, a cost function

is often obtained by superposing several individual functions — each favoring a certain

64

 

 

 

 

 

Second Player P2
P , P
Pg 1 2)
P2 l>l
First Player [ p:

F,(P,,l>2) J——‘

Figure 2.9: A 2—player game.

 

 

 

 

 

 

 

 

 

 

i ll
1» 3P:
p p
l l
8 I
y y
C O
l t
2 2

1(6)

0+ 0 I

o s o
9: (rural) ‘ n (pl-ya!)

Figure 2.10: Level sets of the payoﬂ functions for player 1 (left) and player 2 (right)
[26].

65

 

 

1a ........... a a .
I??? 3'3
2‘ In"

 

 

Pl (ulnar. 1)

Figure 2.11: Payoff level sets for both the players are superimposed. Also shown are
the reaction maps for both the players [26].

66

 

 

 

 

8
,Ff‘tc
PI
8 I
P t 0
I
I
i
2 /
pa Ra
’(¢)l
£1 "
o x -
o if
P! (pl-m1)

Figure 2.12: Successive moves made by the players to reach Nash equilibrium. R1
and R2 denote the reaction maps of player 1 and player 2. Note that player 1 always
moves horizontally and player 2 always moves vertically in this graph. Also, note
that in any given move, the payoff is maximized at the corresponding intercept on
the reaction map [26].

desired property of the ﬁnal solution. The solution obtained by minimizing such a
cost function does not necessarily respect the physical reality of the world. In such
Situations, it is important to take into account consistency of an interpretation within
the chosen world models. Use of vetoes, accumulation and votes for integration of
the modular information without taking into account the underlying constraints of
the modules could produce grossly erroneous results. They contend that the modules
do not resolve their conﬂicts at the level of their outputs, but at the level of premises
used to arrive at any conﬂicting individual interpretations.

Given a set of observations, a set of premises (assumptions made by each module),
and a partial order of the assumptions, they represent the partial ordering of the
premises in the form of a lattice (fault-lattice). Each vertex of the lattice represents

a unique assertion about the validity (or the rejection) of each premise.

67

Their theory can be illustrated with the help of a concrete example [89]. Suppose
an observer is watching a sequence of images of a moving 3D object being screened
on a TV. The stereo module will conclude that the object is planar and the motion
module will conclude that the object is three-dimensional. The resolution of this
conﬂict can not be effected at the level of the outputs of the two modules.

Figure 2.13 Shows a lattice formed by two premises D and G which are deﬁned

below.

 

Figure 2.13: A Simple fault—lattice based upon a rigid motion premise and a copla—
narity premise (From [89]).

o Coplanarity Premise (D) is the assumption made by the stereo module that
disparity represents the depth of the object. Since the television screen is ﬂat,

stereo module believes that Objects depicted on a television screen are coplanar.

o Rigidity Premise (G) is the assumption made by the motion module that

objects in the scene are rigid.

   
    

PICTURE

MOTION

(C)

(b)

Figure 2.14: Three methods of enforcing a unique solution: (a) Voting; (b) Priors;
and (0) Generation of a new hypothesis.

The vertex GD represents a Situation when the premise G holds, but D does not
hold. Similarly, GD means that both D and G do not hold for a given scene. Given
the data and the two premises, there is no world structure which can be both rigid
and coplanar. Hence the interpretation of an object being both rigid and planar is
ruled out. This incompatibility of the node is depicted by cross—hatching it. At the
opposite end is the node GT. There are several possible non-coplanar and non-rigid
3D structures which can give rise to the given sensed data. Hence this node is valid.
The remaining two nodes are the individual interpretations reached by the stereo and
motion modules. Everything else being equal, we would prefer the interpretations for
which the premises are satisﬁed over the interpretations for which they are violated.

Hence, the nodes GD (stereo) and GD (motion) are the preferred interpretations

69

over the non-rigid and non-coplanar interpretations. This is pictorially depicted by
drawing arcs from GD to GD and GDQ. Thus, two ‘maximal’ interpretations are
possible. The issue of enforcing a unique interpretation can now be dealt in the

following ways:

1. Votes: Here, the node which has defaulted the least number of premises is cho-
sen as the most preferred perception. For instance, if there was a third premise
F which was satisﬁed by motion module but violated by the stereo module,

then the most preferred node would be GDF (motion) (Figure 2.14(a)).

2. Priors: In the absence of any hard evidence, we know that the stereo evidence is
less infallible than the rigidity premise. In this case, GD (stereo) interpretation

would be the most preferred interpretation (Figure 2.14(b)).

3. Generation of a new premise: We could generate a new premise correspond-
ing to the Situations in which conﬂicting evidences can be presented.
Picture Premise (T): “given a stereo disparity consistent with COplanar sur-
face and given other evidence about non-planar objects, the image depicts a
picture of a 3D object. This would add a new node PGD, which is supported
by both stereo and motion modules. This node will, then, become the most

preferred interpretation (Figure 2.14(c)).

Jepson and Richards conclude that, under this framework, integration of vision

modules emphasizes three main issues:

 

9An arc from node A to node B indicates that the interpretation offered by node B is the preferred
interpretation.

70

1. What is the general set of constraints? How can we ﬁnd a basis set for all the

constraints/ assumptions made by the vision modules?

2. What constitutes consistency?

3. What are the rules of processing which reason about consistency among the

chosen premises?

Implicit in their work is the assumption that the outputs of individual modules
are reliable and an accurate depiction of the real world. They assume that each
module necessarily outputs ‘correct’ opinion about the world. This is often not true,

especially in the computer implementations of vision modules.

2.10 Active Vision

Marr and Nishihara’s thesis [118] hypothesized that the “higher level” processes do
not themselves bear upon the formation of a 2.50 sketch in the human visual sys-
tem and that different low-level visual modules arrive at a “complete and accurate”
reconstruction of the entire scene, independent of the task to be performed by the
higher level processes. This hypothesis resulted in a clear dichotomy between the
study of higher level processes (recognition and navigation) and lower level processes.
A further side effect of this hypothesis has been the treatment of sensed data used for
reconstruction (a low-level task) in isolation from the observers’ actions and intentions
(higher level tasks).

The proponents of the “active vision” school believe that both the complexity

71

and the brittleness of the solution to problems faced by the individual modules can
be considerably reduced by making the observer sensing the world active. An active
observer is “one who is capable of engaging in some kind of activity whose purpose is
to control the parameters of sensors” [7]. By constantly interacting with the sensed
data, the observer has more opportunities to dynamically adapt its sensing parameters
to obtain favorable, Signiﬁcant, and unambiguous data. The activity of the Observer
may be used for (but not limited to) selecting the visual orientation and location of the
sensor, and number and resolution of the frames of sensed data. Some results illustrate
that the ill-posed and non-linear optimization problems can become well-posed and
linear with the active observer [6]. An active observer also has the opportunity to
refer to the context (of sensing) to further remove ambiguities.

The following example might illustrate the difference between the approaches
adopted by the conventional and active vision groups. Many vision modules begin
with the assumption of “general viewpoint”. In any non-trivial scene, this assump-
tion may be true most of the time but not all the time. Therefore, a conventional
approach has to incorporate this assumption as a weak constraint and the resulting
energy formulations (say) provide an energy proﬁle with many local minima — each
usually corresponding to the hypothesis of likely localizations of the violation of the
‘general viewpoint’ assumption. The exploration of this energy proﬁle to detect the
global minima is a difﬁcult non-linear optimization problem. In the active vision
paradigm, on the other hand, veriﬁcation of the validity of the general viewpoint
assumption and localization of its possible violations can be made possible by per-

turbations of the viewing positions or orientations. This strategy usually will result

72

in a considerable Simpliﬁcation of the subsequent processing.

Thus, the paradigm Of active vision emphasizes dynamic integration of the visual
cues. The information that will most effectively eliminate the ambiguity in the data
will be sought and dynamically integrated. An active vision system needs a meta-
level knowledge about the set of constraints, the utility of each constraint, a selection

method, and an ability to manipulate the sensor parameters.

2.11 Knowledge-based Methods

It is commonly agreed that there is a power-generality10 trade-off in any automation
task [57, 37]. This issue concerns types of knowledge utilized in the system (including
control knowledge). Strong methods apply task-Speciﬁc inference mechanisms using
domain-Speciﬁc knowledge. They allow easy and feasible solutions for the problems
that are too difﬁcult to be solved by less speciﬁc approaches. Weak methods are
general in nature and can be applied across many domains. The price paid for the
generality is the cost in terms of (search) time and increased computational complex-
ity.

Special—purpose vision systems using strong methods have Shown considerable
success within their limited task domains [136, 51, 84, 56]. To date, however, there
have been no general-purpose vision systems that work across a variety of vision
domains. Thus, the strong methods, being better able to deﬁne, structure, and apply

the knowledge, realize effective and practical systems. Visual knowledge includes

 

10Powerful methods are less general and vice versa.

73

domain-independent knowledge about occlusion, perspective, physical support, etc.,
as well as the domain-Speciﬁc knowledge about the objects the system is expected to
encounter, their 3D structure and appearance, their expected appearance in the 2D
image, and their relationship to the other objects. Control knowledge addresses how
the information in the image can be efﬁciently extracted, organized, and matched
against stored models, and when the object knowledge can be used to prune the
goal-directed search tree.

Vision systems working in restricted domains can bring very Speciﬁc recognition
and control knowledge to bear on their assigned tasks. This allows them to perform
sophisticated inferencing with relatively little computational effort.

Several knowledge-based architectures have been proposed for the integration
problem, including a logical framework [160], production systems [135] and black-
board architectures [54, 189].

Several researchers have used knowledge-based methods for solving image under-
standing problems. Lakin et al. have used a blackboard architecture for data fusion in
strategic naval control problems [98]. Nagao et al. analyzed aerial photographs using
this approach [136]. Shafer et al. have used a distributed architecture for real-time
navigation [170]. For a broader overview of the blackboard applications is vision, the
reader is referred to Engelmore and Morgan [54], for instance.

Although knowledge-based approaches work effectively for high-level vision, it is
our opinion that low-level visual integration tasks are leSS intensive in knowledge
requirements. The success of the low-level integration primarily depends upon ob-

taining general consensus (and compatibility) among a ﬁne-grained ensemble of unre-

74

liable information pivoted on a few important high-level cues. The knowledge-based
approaches also need an enumeration Of all the possible Situations and a meticulous
programming for each such anticipated Situation. The integration of huge amounts
of uncertain data (with their Spatial dependencies) with a few high-level knowledge
sources is an open issue in vision research. However, the blackboard systems have
remainedan attractive choice of integration architecture owing to the ease of system

development and ﬂexibility of control.

2.12 Summary

Table 2.1 presents a summary of various integration frameworks. We will now compare
two of the most inﬂuential models for integration in computer vision: probabilistic and
mechanistic models. There is a continuing debate on the suitability of probabilistic
and mechanistic models for different computer vision applications. Here, we will

discuss only a few signiﬁcant issues:

1. Which approach is more suitable for modeling the world constraints? There
is no deﬁnitive answer to this question. The appropriate approach to model
constraints depends upon how naturally a signiﬁcant set of constraints offered
by the given object domain can be captured. For instance, transparency and
statistical dependence of the data can be captured quite well by the probabilis-
tic models. Probabilistic models also appear to be better equipped to handle
imaging noise. The constraints of smoothness and continuity can be well cap-

tured by the mechanistic models. Nadabar and Jain [133] have found an elegant

75

way of combining both the approaches: they use MRF models for represent-
ing the prior constraints and mechanistic models to generate a large number of

synthetic images to estimate these constraints“.

2. How are the model-Speciﬁc representations useful for further processing? For
instance, the statistical models offer an unintuitive and diffuse ﬁnal represen-
tations. The mechanical models offer more intuitive and compact ﬁnal repre-
sentations which make useful information explicit for further processing, e.g.,

object recognition.

3. The statistical models are inherently better suited for handling noise and outliers
in the data. Hence they Offer stability to the ﬁnal representations. Both the

models appear to be deﬁcient in handling geometric and qualitative constraints.

4. Both the approaches entail an ad hoc selection of parameters. The parameter
values represent relative signﬁcance assigned to each constraint. Such a prior
weighting of different information sources precludes the possibility of handling
all the possible scene conﬁgurations equally well. In some applications, it is not
possible to ‘optimally’ rank the information sources. For instance, McCafferty
[124] found that different perceptual organization cues can not be ranked based

on different Gestalt cues.

5. These models are computationally demanding, both in terms of convergence as

well as in terms of propagation of the non-local constraints. Many statistical

 

11They observe that it is easier to generate images with edge continuity and smoothness using
mechanistic models.

76

models are inherently parallelizable.

The other integration models discussed in this chapter are relatively new. While
they introduce novel ways of embedding certain Speciﬁc constraints, we feel that they
are not sufﬁciently general to offer a central role in the integration architecture.

In summary, the existing models of integration appear to be incapable of ex-
tracting stable, compact, and intuitive representations from the large quantities of
visual data. The limited implementations and evaluations of the existing integration
frameworks appear to be lacking in their ability to capture the essence of the sensed
data. The existing frameworks are brittle. We believe that effective vision systems
need to incorporate hierarchical feedback control structures to design robust systems

integrating vision modules. This issue is further motivated in Chapters 4 and 5.

77

 

 

 

 

 

 

 

 

 

 

 

 

 

3:: .3 8 5:839: 2588: 88: ::::n
40E .3 8 ::::m ::: ::::::a::-:_::8Q 83:88? 8m::_3o:v~
38:8 ::::uwﬁﬁ
E 8888?. 5 3:87:80: 3::E::< 888:: ::: Began 885 :>E:<
£8885
FR: :22 ::: :wEm 3:83:28 3:288:800 E3853: ::::Empmlmf
:w::w::~ 82:98: 88882:: 8.: 388:5
To: 988:4 -8: a: 8388: 5 3:8me w: 838:: 838 8::de -:o$::b.£:~
325.88 w:m::.::_
Ru ”888:0 ::::w8>:o: 33m ::: 88:38.8: 830m 83:58:80
a& 83:8: 80:58:84: £8885
888$ ::: :88: 83:08 8.88:: 8:::8< Lo 8588: 88535 8:584
Ea ::::sQ 3.88:5
::: :::Nom 323:3:8 :::qu 8:38:28: ::::£:.85 88:0
smﬁﬁooﬁm 8:52:08:
7i .3 8 8x ::::8: o: :::? ::: 885$ ::: 33:58: 8.53:0 ::SSEBme
:3 52
::: ::::::Z .33 .3 ::::w8>::: 3853::
8 83:80 33: ::: 828.885 3:04 ::: 3:82 853:0 :2
Wm: 830:0: 35:88:: :5
-88 48$ 8x::N :::... 88588 :8_ mm :::Q EMSEEE 838028
6:: 82.3 32:83:: 889m
8.3m ::: ::xSm 83:88 E 8:855 ::: 3:58:15: 38:3:D :::8»:m
8.9::xm ::emugmﬁa magmas—am :058—2

 

 

 

 

88:38:82 :28835 ”mm 83:8

 

Chapter 3

Vision Modules

In Chapter 1 we described the origins of the modular processing in computer vision
and illustrations about the nature of limitations of a vision system which relies solely
on the output produced by a single module. In this chapter we describe in detail
several modules that will be incorporated into our integrated system. Sections 3.1-
3.5, describe the individual modules, their objectives, limitations, and the underlying

assumptions.

3.1 Perceptual Organization (grouping)

Segmentation is one of the central problems in computer vision. Even deﬁning the
problem of segmentation is difﬁcult since it is tightly coupled with the semantics of the
image content and the visual task under consideration. It is, therefore, expedient to
set up a working definition for the segmentation based on the photometric attributes
of the intensity image [2]. By this “segmentation” we mean a grouping of the pixels

78

79

based on homogeneity of their photometric attributes [2]. Given this deﬁnition, a
region-based or an edge—based operator would be sufﬁcient to effect a perfect seg-

mentation in an idealized situation. However, in practice, poor imaging conditions,

 

Figure 3.1: Perceptual organization helps complete the obscure boundaries [116].

insufﬁcient contrast, noise sensitivity of the selected attributes, and artifacts inherent
in segmentation operators all conspire to produce an imperfect segmentation (See
Fig. 3.12). Image properties alone are not adequate in segmenting the input scene.
For instance, two perceptually signiﬁcant regions cannot be identiﬁed in Fig. 3.1 us-
ing intensity gradients. In such situations, the principles of perceptual organization
could be invoked to obtain a reasonable segmentation.

The human visual system recognizes statistically signiﬁcant relationships in a
given image and uses them to infer causal structures in the scene without using any
higher level domain-dependent knowledge [199]. These perceptual phenomenon were
closely studied by psychologists and were accounted for in very subjective terms such
as Gestalt and Prd'gnanz. Some of these statistically signiﬁcant relations (Fig. 3.2) are

collinearity, parallelism, symmetry, and connectivity. Human perceptual mechanisms

80

000 o’- oo 00 09 K m

b)000000..°’..‘° m

[III ......
a, ~>(\ QC comm-

Symmetry

9.;

9.

Figure 3.2: Relations signiﬁcant in grouping [109].

help us organize the image features, complete the features obliterated by noise or
occlusion, permit 3D inferences from the 2D image relations, and facilitate efﬁcient
indexing of the world knowledge [109]. This phenomenon of grouping is referred
to as perceptual organization. Many believe that perceptual organization plays an

important role in simplifying the computer vision problems [72, 155, 109].

The potential of the perceptual grouping module in computer vision systems was
relatively unexplored until recently. Witkin and Tenenbaum [199] were among the ﬁrst
to recognize the importance of this module. Lowe [109] established a computational
framework for selecting perceptually signiﬁcant relations. Several other researchers
have used perceptual organization, both, for 2D and 3D image features. These efforts
can be broadly classiﬁed into region-based and contour-based methods. The region-

based methods have used the similarity of attributes of spatially adjacent pixels to

81

categorize them into meaningful regions. Often, relaxation mechanisms are then
used to overcome noise and to impose a reasonable consistent relationship among
the neighboring pixels [207]. Hoffman and Jain’s region segmentation algorithm [77]
and 2D region segmentation prOposed by Zucker et al. [207] are typical examples of
this strategy. In the contour-based algorithms, edge-like features are grouped into
boundaries. These boundaries are segmented into perceptually signiﬁcant segments
[109]. The segments may be ﬁnally grouped into objects and assemblies [166]. Kass
et al.’s [94] 2D snakes, and Lowe’s [109] methods of boundary segmentation can be
considered as typical examples of contour-based groupings. Ferrie and Whaite [58]
and Pankanti et al. [146] have used principles of Gestalt for grouping 3D boundaries.
Only a few researchers have attempted to integrate the region-based and contour-
based groupings [150, 124, 41].

We will now describe the perceptual organization module (See Fig. 3.3) that will
be used in the integrated system (Chapters 4 and 5). The novelty of this module is
that it not only takes into account the statistically signiﬁcant image relationships but
also the intensity gradient across a potential boundary.

Recent research in the human vision has shown that both region-based and edge-
based mechanisms are disjoint at a very low-level and serve complementary functions
[108, 157, 72]. These observations can be empirically corroborated by our experience
with several edge-based and region-based operators. The segmentation produced by
each operator has certain desirable prOperties, but the integration of these segmen-
tations often leads to better results than provided by either one of them [150]. Our

approach to integration (of region-based and edge—based segmentation) is carried out

82

Intensity Image

 

 

 

Canny
Edge
Detector

Selection

And
ExtenSIon

Gestalt-Based
Grouping

[ Selection

Of Closed J
Regions

 

 

 
     
   

 

 

 

 

 

 

 

 

I
I
I
I
I
I
I
I
I
I
I
I
I
I
I
e—
I
I
I
I
I
I
I
I
I
I
I
I
L

Closed Contours

Figure 3.3: Perceptual Organization Module.

in two steps as described below. We have used outputs of Canny edge operator [35]
and split and merge segmentation [149] as our edge-based and region-based segmen-
tations, respectively. But, in general, other edge-based or region-based operators can

also be used.

0 Selection and Extension: In the ﬁrst phase, we select a subset of boundaries
which are supported by both the segmentations. This step is performed by
searching for the boundaries of the region segmentation (RB) in a ﬁxed rectan-
gular neighborhood of the edges detected by the edge-based segmentation (EB).
The presence of RES in a given neighborhood of the E83 is taken as a strong

evidence for the presence of a boundary (as opposed to a spurious edge) and we

83

select all such boundaries. These boundaries are localized at the EB positions].
The boundaries thus obtained are tangentially extended based on the presence

of boundaries in region-based or edge-based segmentations.

o Voronoi-based Grouping: In the second phase, the boundary terminations and
corners in the representation obtained from the previous stage are linked based
on a set of Gestalt criteria (proximity, collinearity, cotermination) and the
presence of a signiﬁcant gradient across the linking edge in the region-based
segmentation. Instead of considering all possible linking edges, only the edges
connecting Voronoi neighbors are considered [3, 147]. Ideally, it would be desir-
able to consider the smooth snake-like extensions between the edge terminations
selected for linking, but the present implementation links these terminations us-
ing a straight line segment. Following components are superimposed to obtain

an objective function estimating the signiﬁcance of a Voronoi edge:

1. Proximity. Voronoi edges which are short will be considered perceptually
more signiﬁcant than those which are longer. The contribution of a Voronoi
edge of length d connecting the edges of lengths D1 and D2 is deﬁned as E,

= 212de / D1 D2, where 11),, is the relative signiﬁcance of proximity attribute.

2. Curvilinearity. Voronoi edges which are in the tangential direction of the
edge terminations are more signiﬁcant. A Voronoi edge which subtends

angles 01 and 02 with the terminations of the edges at its either end will

 

1We have found that our edge-based detector has a better localization than the region-based
segmentation. Some researchers have used the maximum likelihood estimate provided by both the
segmentations for this integration [41].

84

contribute E, = 1115(61 + 62) /47r to the cost function, where w, is the relative

signiﬁcance of the curvilinearity.

3. Cotermination. The coterminant is the common point shared by ter-
minations of two (or more) smooth boundaries. Coterminations are per-
ceptually signiﬁcant. We estimate the cost of a cotermination by EC 2
wc2/(n1 + n2), where ”.1 and 77.2 are the numbers of Voronoi neighbors of
each termination and wc is the weight indicating the perceptual signiﬁcance

of cotermination.

4. Gradient. Signiﬁcance of a Voronoi edge is proportional to the average
intensity difference of the regions it abuts. Instead of computing a raw
intensity gradient, we consider difference in means of intensity values of
the adjoining regions as a reliable indicator of this criteria. The cost of the
gradient is measured by E9 = M / G, where M is the maximum number of
gray levels in the image and G is the average difference of intensity across

the length of the edge.

Among the several Voronoi edge alternatives, only the minimum cost edge is
chosen, provided that its total cost is below a threshold T. Finally, we choose
closed boundaries to obtain regions with reasonably uniform photometric at-
tributes. There is no systematic way of estimating the “correct” weights of
each Gestalt component [124]. At present, we set the values of relative signiﬁ-
cance of each component of the cost function and the threshold, T, empirically.

Fig. 3.4 shows results of our grouping algorithm for Mushroom and Vase image.

85

 

((0

Figure 3.4: A grouping example for Mushroom and Vase image (512 x 512): (a)
input intensity image; (b) output from Canny edge detector; (c) output of region-
based segmentation; (d) signiﬁcant closed regions after integrating segmentations in
(b) and (c) using Gestalt rules.

86
3.2 Shape From Shading

Shading is an important source of information about the 3D structure of the input
scene, especially when binocular disparity and motion cues are absent. These situa-
tions commonly occur in pictures and scenes consisting of smoothly sculpted surfaces.
Artists have known the effectiveness of shading in visualization and have been using
this one to convey the 3D structure in their paintings. However, inferring shape from

shading using a computer algorithm has proved to be a difﬁcult and underdetermined

 

   
    
   
    

    
   
    
    

I:
'I']
{W

I’ll] IN],
MI HIIIIIIH'

III] W
IIHII'
II

]II]I]
I I I]
II] I

II]I]
Figure 3.5: Recovery of saddle—shaped surfaces is especially difﬁcult: (a) An image of
Lambertian hyperboloid surface with constant albedo; (b) True surface shape; (c) A
convex shape recovered from (a) by Oliensis and Dupuis’ algorithm using default con—
vex surface assumption. A default concave surface assumption would have recovered
an entirely concave surface.

   
 

 
 

 

II
III]

III]

       

 

 

 

 

 

 

problem (see Fig. 3.5) and it has been generally regarded as an unreliable source of
information [205].

We will now brieﬂy describe the physics of imaging. This description will be
interspersed with the imaging assumptions made by a typical shape from shading
algorithm. This will be followed by a summary of the common approaches to solve

shape from shading problem.

87

Light reﬂected from the surface of an object depends on the following factors:

1.

Incident light: The amount of incident light on an object depends upon the

 

illumination source(s), its distance from the object surface, and geometry of
the object. Light energy incident on an object surface is related to the source
ﬂux by the law of inverse squares. This often gives rise to an intensity gradient
across the image illuminated by a single point source. However, all the previ-
ous researchers have made the simplifying assumption that the light source is
located at a sufficiently large distance (orthographic imaging geometry) and inci-
dent light energy does not depend upon the distance from the source. The light
reﬂecting from object surfaces often illuminates surrounding surfaces, a phe-
nomenon called the mutual illumination. Mutual illumination depends upon
the geometry of the object surfaces. Modeling mutual illumination is a difﬁcult
problem which requires information about the object shape — which is usually
not known. Almost all the shape from shading algorithms hypothesize that the
incident light energy mainly depends upon the primary illumination and the

effects of secondary (mutual) illumination can be disregarded.

. Surface characteristics: A simple scene geometry is depicted in Figure 3.6.

When the light energy incidents on the surface of an opaque object, a part of
it is absorbed. The reﬂected part consists of two components: specular and

diffuse.

The specular component models reﬂection from mirror-like surfaces. In the

case of an ideal specular surface, light rays incident on the surface are reﬂected

88

Source {1

  
 

Surface Normal
C II

   

 

Figure 3.6: A simple image formation geometry. Rays AB, AC, and AD are in the
direction of source (8), surface normal (n), and sensor (k), respectively. The angles
BAC, BAD, CAD will be referred to as incidence angle (i), emittance angle (e), and
phase angle (g), respectively [80].
such that the angle of reﬂection equals the angle of incidence. In a typical
surface, however, specular reﬂections are restricted to a compact lobe around
the ideal specular reﬂection direction. Thus, if the sensor direction changes, the

amount of irradiance from a specular surface changes considerably and specular

highlights shift their positions.

A matte surface, on the other hand, models Lambertian or diffuse reﬂections.
Such reﬂections are a result of multiple bendings and bouncings undergone
by the light rays below the surface microstructure. The reﬂected light from a
perfect matte surface does not depend upon the sensor direction and depends

only upon the source direction and the object surface orientation.

The specular and diffuse components of the surface reﬂectance can be modeled

by the following expression relating image irradiance to the scene luminance

89
[42]:

E(x, y) = 1($,y)(Rs(n) + RICE, 31)),

where R, is specular reﬂectance map, R, is Lambertian reﬂectance map, E(:c, y)

is image irradiance, and I (.27, y) is the incident irradiance.

The specular reﬂectance can be modeled as:

Rs(n) = (k-(2(n(w, y)-S)n($, y) - 8))“,

where m is a parameter related to the sharpness of Specularity of a given surface

and ‘.’ denotes the “dot” product.

The Lambertian component is modeled by:

R105: 31) = (11(13, y)..S)

Typically, the shape from shading research has been restricted to the pure
Lambertian surface models. Incorporating specular reﬂections in the shape from
shading formulation is a relatively recent phenomenon [42]. Researchers have
used a priori decomposition of the surface image irradiance into specular and
diffuse reﬂectances using dichromatic reﬂectance models [73] or using empirical
methods devised by Wolff and Boult [201]. Others have attempted to estimate

these components from the images using optimization methods [42].

3. Sensor characteristics: It is generally assumed that the lens is isotropic and

90

responds equally well to the objects in peripheral as well as central regions
in the image. These situations are, in most cases, exceptions rather than a
rule. Everything else remaining the same, two locations in the image may have
different intensities due to optical distortions of the lens system or non-uniform
sensitivity of the sensor array. In addition, the inverse square law may be
effective if the width of the sensor array is comparable to the average distance

to the object surfaces being imaged (wide-angled lens).

Given an imaging model and a single intensity image, a shape from shading algorithm
needs to reliably estimate the geometry of the sensed surfaces. First of all, due
to the ubiquitous assumption of an orthographic projection, we can not determine
absolute depths from shading cues alone (even when the reﬂectance map is completely
speciﬁed). Hence, all the shape from shading problems should be interpreted in
terms of recovery of surface normal information. The difﬁculty in recovering surface
orientation from shading (intensity) cues is due to a lack of a sufficient number of
constraints: we have only one (intensity) value at each location (pixel) while we are
expected to recover two parameters (tilt and slant) per location. In principle, it is
possible to obtain an inﬁnite number of solutions if we hypothesize that the surface
normals at the neighboring locations are independent. In order to recover physically

meaningful and unique surfaces, additional constraints are necessary.

One approach to obtaining a unique solution to the shape from shading problem
is by imposing a smoothness constraint and incorporating it into a regularization

framework. If the boundary conditions are known, Blake et al. [21] have derived

91

some theoretical results stating uniqueness of the solution under restricted situations.
In practice, these algorithms begin with known occluding boundaries and formulate
shape from shading as a variational problem with known boundary conditions [30, 85];
the solution typically involves regularization. There are several difﬁculties with this
formulation. For instance, correct recovery of spherical surfaces is difﬁcult [205]. The
commonly used smoothness constraint might result in unintuitive results. Frankot
and Chellappa [59] have solved this problem by projecting each intermediate solution
on to the frequency domain representation to enforce the integrability condition.
The regularization approach is computationally intensive; it typically requires several
thousand iterations before obtaining a reasonable solution.

Another approach to obtaining shape from shading restricts the surface geometry
to provide local solutions. For instance, Pentland [152] assumes the surfaces to be
locally spherical. This requires recovering only a single parameter from each surface
location. Tsai and Shah [192] offer an iterative solution by assuming a linear ap-
proximation of the reﬂectance function. While these strategies are computationally
adequate, the restrictive assumptions make it difficult to improve upon the initial
results incrementally. Thus, this approach is a poor choice for an integration envi-
ronment.

Oliensis and Dupuis [144] have shown several counterexamples to the widely be-
lieved claim that surfaces are uniquely constrained by limb (occluding contours) edges.
They, instead, show that shape from shading problem is well-constrained only in the
presence of singular points — pixels with maximal brightness. They propose a novel,

computationally attractive algorithm based on the method of characteristic strips

92

[79]. We have adapted Oliensis and Dupuis’ algorithm for our module integration re-
search. The characteristic strips are the curves of the steepest ascent in the direction
away from the light source. Given the depths at singular points, Oliensis and Dupuis
[145, 144] developed a noise—resistant method for reconstructing the characteristic
strips. This idea is based on an elegant method of propagating the depth values at
singular points to the rest of the locations in the image. If relative albedo at each
point on the image surface is known, then this concept can be easily extended to

arbitrary images with Lambertian surfaces.

3.3 Stereo

In humans, binocular stereo is a robust estimator of depth at an acute visual angle,
particularly in the image regions with a signiﬁcant variation in intensity. A stereo
module is known to produce an erroneous depth map in regions of the image with no
texture or with very little texture [195]. It is also difﬁcult to compute depth at those

parts of the scene which are visible only from a single camera.

A simplifying assumption which is frequently made in the stereo matching algo-
rithms is called the parallel axis geometry (Fig. 3.7). Consider a point P in 3D space
(Figure 3.7). Let us image this point from two known camera positions deﬁned by
their optical centers, 0; and 0,. If we can identify the locations A and B, of the
point P in the two images taken from two known positions, then the 3D location of
P can be recovered from the disparity of the corresponding image points using the

principle of triangulation. Establishing the correspondence of all the points in the

93

 

 

 

 

 

 

// 7/

 

 

 

 

II

 

II

 

Figure 3.7: Parallel axis stereo geometry [50].

image pair is called the correspondence problem. Given the imaging geometry and the
correspondence of physical points in the stereo image pair, computation of the depth
from the disparity is relatively straightforward. There are two primary categories
of stereo algorithms: (i) Pixel-based: algorithms which establish correspondence for
every pixel in the stereo images; and (ii) Feature-based: algorithms which establish
correspondence for only selective points, called feature points. The depth for the
remaining points is determined by smooth interpolation.
Pixel-based Stereo:

There are two strong assumptions underlying the pixel-based stereo algorithms.

0 Lambertian Surfaces: It is hypothesized that the intensities at the corresponding
pixels are identical under the camera transformation. If we assume that the
optical axes of the two cameras are parallel, then this assumption is equivalent

to assuming that the object surfaces are Lambertian.

o Sufﬁcient Saliency: It is also assumed that each pixel has suﬂiciently distinctive

 

94

 

Figure 3.8: Stereogram of ambiguously perceivable center square ﬂanked with un-
ambiguous areas in front of and behind the surround [91]. (a),(b) are left and right
random dot stereograms; (c), (d) are two perceived depth maps.

95

spatial context to eliminate potential false matches in establishing the correct

correspondence.

In practice, due to noise in the imaging process, the gray level values at the corre-
sponding pixels are not identical; such a prOperty is valid only in a statistical sense.
To counter this problem, the pixel-based methods often use correlation between sets
of neighboring pixels in the left and right images to ﬁnd the correct correspondence
[129]. The window used for computing the correlation depends upon the noise char-
acteristics of the sensing conditions. However, this strategy is not effective in dealing
with non-Lambertian scene surfaces. The suﬁicient saliency assumption is violated
if the image has only slowly varying intensities — a lack of distinctive features makes
correspondence based on local, window-based properties difficult. The search region
for these pixel-based methods is restricted to alleviate computational burden and,
thereby, imposing a small relative motion assumption. One of the advantages of
using the pixel-based methods is that they provide a dense disparity (depth) map
without resorting to interpolation. Also, these methods avoid the problems associ-
ated with the feature-extractors and their artifacts.

Feature-based Stereo:

The feature-based algorithms require a selection of viewpoint invariant and noise-
resistant feature points. Since feature points are sufﬁciently distinctive, ﬁnding cor-
respondence in a feature-based stereo algorithm is relatively easy. However, these
algorithms result in a sparse disparity map and estimating the entire depth map for

sparse disparity map is a difficult problem (interpolation problem).

96

What feature points are desirable for establishing correspondence? Several can-
didates, including zero-crossings of LoG operator [117], oriented step-edges [126],
peaks of LoG Operator [123], and frequency domain-based descriptors [195] have been
proposed in the literature. Many of them are justiﬁed on the basis of either psy-
chological studies or computational arguments. Some less general features like line
segments have also been used to simplify the correspondence problem [126].

Both pixel—based and feature-based approaches implicitly assume that the more
similar the attributes of a given pair of pixels (features) are from the left and right
stereo images, the more likely they represent the correct correspondence. At limb
boundaries or due to specular highlights and occlusion, this hypothesis is not true
(and can be misleading). Although pixels near occluding limb edges in stereo images
have similar intensity proﬁles, they are images of different physical parts of the scene.
The specular highlights depend upon the viewpoint and the “corresponding” high-
lights might mislead matching based on similarity measures. Finally, it is meaningless
to obtain the correspondence based on a given similarity measure for the parts of a
scene which are visible in only one of the stereo image pair. In addition, the stereo
constraints alone are not sufﬁcient in constraining the false correspondence problem.
Use of relaxation or other optimization [195] schemes is often necessary to impose
additional constraints to alleviate this problem. Marr’s computational theory pre-
scribed that the assumption of cohesive matter and unique correspondence should
be used for disambiguation. If the objects in the scene are cohesive, then the re-
sultant disparities should vary smoothly almost everywhere. The principle of unique

correspondence states that each feature may have only one correspondence. Mayhew

97

and Frisby [123] observed that smoothing the disparities across the depth bound-
aries can be avoided by use of ‘ﬁgural continuity’; the disparities along a boundary
should vary smoothly. Prazdny argued that ‘coherence’ is a more general principle
for disambiguating matches [154].

There are several difficulties in interpolating depth between feature points. Many
interpolation methods use an implicit smoothing function for estimating the depth
at pixels between feature points. The rationale of these methods is based on the
argument that absence of feature points at these intermediate pixels indicates absence
of sharp depth discontinuities (no-news principle [68]). Naive implementations of
these methods might smooth the sharp depth discontinuities, or produce noisy depth
map. Blake et al. [20] have suggested a discontinuity-preserving smoothing method
based on regularization. Hoff and Ahuja [76] have proposed an integration scheme
which combines matching and integration.

The local matching schemes can not take into account large disparity ranges.
Hierarchical schemes have been suggested in the literature for handling large disparity
range. Here, the disparities obtained from matching feature points at a coarse level
guide the matching process at the next ﬁner resolution [116, 195].

In our integrated systems (Chapters 4 and 5), we use the multi-resolution stereo
matcher proposed by Weng et al. [195]. This algorithm matches four attributes of
the intensity images which are reasonably insensitive to the relative camera motion.
These attributes are the (smoothed) image intensity, magnitude of the image gradi—
ent, and “positive” and “negative” curvatures. These four features grossly correspond

to functionals of zeroth-, ﬁrst-, and second- order derivatives of the original image

98

function tailored for the stability in terms of relative camera motion. The matcher
also imposes intra— and inter— regional smoothness constraints on the disparities. The
main idea underlying the matching algorithm is as follows: If the attributes of input
stereo images are insensitive to the relative motion of the camera, then the correct cor-
respondence implies that the dissimilarity (residuals) in the corresponding attribute
values should be minimum. Given an estimate of the correspondence vector ﬁeld,
the stereo matcher at a given level of resolution obtains a reﬁnement of the corre-
spondence vector ﬁeld by minimizing a weighted sum of squared residuals using the

gradient descent method.
. . . ‘2
mdm Eu 2; w,[R,(u,d)] , (3.1)

where d is the correspondence vector ﬁeld, Ri(u,d) is the residual contributed by
the it” attribute image at location 11 due to correspondence vector (1, and w,- is a

prespeciﬁed weight associated with the residual R].

The stereo system starts at the coarsest level of resolution and the ﬁnal estimate
of the disparity obtained at any level guides the matcher at the next ﬁner level. In
regions with small changes in intensity, there is not sufﬁcient information available for
a gradient-descent technique to drive the disparities in the correct direction. Fig. 3.9

illustrates this limitation of the stereo algorithm.

3.4 Line Labeling

99

 

(c) ' (d)

Figure 3.9: Mushroom and Vase image (size 512 x 512): (a), (b) Left and right stereo
images; (c) Recovered depth from stereo alone; (d) A wire-frame representation of
(C)-

100

 

Figure 3.10: A line drawing [13].

Given the boundaries in an image, we can infer a number of signiﬁcant properties
of the surfaces giving rise to these boundaries. For example, take the case of the binary
line drawing shown in Figure 3.10. This ﬁgure conveys valuable information about
the surfaces of leaves even in the absence of other cues like stereo, shading, or texture.
Several experiments involving human infants have shown that our skill in interpreting
3D surfaces from the line drawings is innate [109]. It is now agreed that line labeling
is a fundamental cue for 3D inference from 2D images [13, 14]. Like other human
perceptual skills, the ability of interpreting line drawings is based on assumptions
of general viewpoint and detection of statistically signiﬁcant properties, independent
of any higher level and domain-dependent knowledge. The line labeling module is
unique in that it exploits local geometric constraints and conveys its inferences in a
qualitative way. The labeled line diagram can be used for scene interpretation, object

recognition, and matching [142, 45, 13, 100, 171].

In a given image, a boundary detected by an edge detector could be a result of

101

a number of different physical events such as discontinuity in illumination (shadow),
change in surface albedo, surface markings, discontinuity in surface orientation, dis-
continuity in depth, and self occlusion. Before a line drawing can be interpreted, it
needs to be parsed into a graph representation. The vertices of this graph represent
intersections (junctions) of the boundaries and arcs of this graph represent the indi-
vidual boundary segments between two junctions. Given a graph representation of a
correct line drawing, the objective of the line interpretation module is to obtain an
approximate and qualitative 2.5D sketch of the input scene by appropriately catego-
rizing each arc in the graph. The architecture of line drawing interpretation module

consists of the following three components:

1. Line Labeling: The line labeling module labels the boundaries into several dif-

 

ferent categories. The correct line labels of only a few boundaries can be initially

determined.

2. Junction Labeling: Boundary intersections in an image result in junctions which

 

are 2D depictions of the 3D surface intersections. The junctions can often be
categorized into a few qualitative categories based on the degree of the inter-
section and the angles between the boundaries involved. The correct labels of

only a few junctions can be initially determined.

3. Relaxation: The 3D world exerts strong constraints on the compatibility of the
neighboring junction labels and line labels. However, these constraints alone are
typically inadequate to obtain a unique solution; a unique line interpretation

is usually imposed on a line drawing using constraints based on perceptual

102
mechanisms [103, 191].

Several efforts in interpreting line drawings have been made. The initial line
drawing interpretation for polyhedral objects was presented by Huffman and Clowes
[43], which was further augmented by Waltz [194]. Sugihara [177] derived a linear
programming type algorithm for labeling line diagrams of polyhedral objects. Kanade
[92] extended the object domain to include origami objects — objects composed of very
thin surfaces. Chakravarty [36] extended the original polyhedral object domain to
include curved objects. A mathematically rigorous and minimal junction catalog was
provided by Malik [112] for orthographic views of 3D scenes containing C3 surfaces.
This was later extended to perspective projection by Nalwa [139]. Leclerc and Fischler
[103] have formulated the problem of line drawing interpretation of wireframe objects

using minimum description length (MDL) approach.

Malik [112] classiﬁed lines into depth and orientation discontinuities. The depth
discontinuities are further classiﬁed into limb and non-limb categories. The orienta-
tion discontinuities are further classiﬁed into convex and concave boundaries. Malik’s
algorithm does not allow surface markings, shadow edges, and albedo edges. His
junction catalog is shown in Figure 3.11. When a unique solution is not possible, Ma-
lik’s algorithm imposes a uniqueness constraint by preferring an interpretation with
a minimum number of faces. ’Ifytten’s algorithm [189] is a reﬁnement of the Malik’s
algorithm in that it imposes uniqueness by preferring “ﬂoating object interpretation”.
It also has a different control structure to facilitate integration with other modules.

Since the assumptions of orthographic projection geometry and C3 surfaces are not

103

too restrictive in practice, we prefer to use letten’s reﬁned method of line labeling

for our non-uniform integration scheme.

Curvature-L:_ / A /—» {K (/

Three Tangent: —°'>' \ —»
II) II).

“YYY'YY

¢

 

 

 

 

 

 

 

 

Figure 3.11: Malik’s junction catalog. The double arrows indicate limb edges. The
object boundaries are shown by single arrows. Symbols ‘+’ and ‘—’ denote convex
and concave (internal) edges. Symbol ‘?’ denotes a don’t care line label [112].
Although the line labeling problem has been rigorously studied for a limited object
domain, it is plagued with innumerable implementation problems. Most reported

work on the line labeling problem assume availability of a perfect line diagram at the

outset. Obtaining a good line diagram appears to be a difﬁcult problem (Fig. 3.12).

104

 

 

 

 

 

 

(a)

Figure 3.12: Obtaining line diagrams is a difﬁcult problem even in case of simple
scenes: (a) An image from the blocks world; (b) A typical edge map using Canny edge
detector. Note that the edge detector failed to detect many junctions and introduced
several extraneous edges.

Further, the parsing of a line diagram into vertices and arcs has been shown to be
extraordinarily tricky [139]. Many researchers wonder whether an ideal line diagram
and its accurate parsing could ever be produced in a practical situation. Even when
the line diagram is correctly parsed, the line labeling module may label it erroneously.
For instance, it often mistakes an L-junction for a curvature-L junction and T-junction
for a 3-tangent junction. Fig. 3.13 illustrates the difficulty faced by a line labeling

module in labeling a curvature-L junction.

3.5 Shape From Texture

The term texture deﬁes a formal deﬁnition [106]. The simplest types of texture
can be characterized by the ‘regularity’ in placement of a texture element (texel). In

an image of a 3D surface, the spatial distribution of surface markings is distorted.

105

This distortion systematically depends upon the shape and orientation of the surface
with respect to the imaging geometry (Figure 3.14). For instance, consider a plane
covered with a uniform surface texture; the spatial relationship among texels on this
surface is uniform. As the planar surface is tilted, the number of texels per unit image
area increases the most in the direction of slant (texture gradient). The magnitude
of the compression depends upon the amount of tilt. Shape from texture algorithms
attempt to estimate the orientation of a scene surface from the measurement(s) on
the spatial distribution of the surface markings.

Notation and Imaging Geometry: Following Garding [61], we will a assume
perspective spherical projection (Fig. 3.15). The optical imaging projects the events
present on the 3D surface S on to the unit viewsphere 23 around the focal point. The
angle subtended by the surface normal N to the visual ray p will be called slant (0)
and the angle subtended by the projection of the surface normal (on to the tangent
plane) to the reference X -axis is referred to as tilt direction, t. Together, slant and tilt
uniquely determine the orientation of the surface normal n. The tilt direction t and
direction orthogonal to tilt direction, b (= n x t) in the tangent plane form a natural
local coordinate frame for description of distortion due to the imaging geometry.

The linear backprojection map, F,” projecting from tangent plane of 2 to tan-
gent plane of S is assumed to be sufﬁciently well-behaved. This map can be tersely
expressed in the bases (t, b) and (T, B) [61] as an afﬁne transform, A, composed of

three generic components [61], scale, rotation, and shear:

A = 5(8)R(9)D(a, It),

106

where S speciﬁes scaling parameterized by a scaling factor:

12(9) denotes rotation by an angle 9:

c056 —sin9

13(9) = ;

sind c036

and D(a, It) represents an area-preserving shear distortion whose axis and magnitude

are determined by u and a:

0(0, II) = R(II)M(a)RT(II)

O

]cosu ~sinu] a cosy sinu

DII-

[sin/I cosu I 0 —sinu cosu

isinzu + acosgu (a — %)sinucos,u ]

I

l
a

l
a

I (or — )sinucosu sinzu + acoszu J

If we exclude isotropic rotation component R, then A can be expressed as

r/coso 0 l/m O
F*=

, (3.2)
0 r 0 I/M

where r is the radial distance of the point on the surface under consideration. M and

m can be visualized as the half major and minor axes of the ellipse in 2 resulting

107

from imaging a unit circle in S.

Having realized that the absolute measurement of the distortion F* is difﬁcult in
reality, vision researchers have focused on the measurement of spatial variation in the
distortion map as a basis for recovering shape from texture. These different measures
are collectively referred to as texture gradients. The following four texture gradients
have been commonly used: density, area, perspective, and scale. The density and area
gradients refer to the spatial variation in the texel densities and areas, respectively.
Perspective gradient refers to the variation in the linear dimension of the texels in
the direction of the surface tilt and scale gradient is the similar measurement taken
in the direction orthogonal to the direction of the surface tilt.

The exact nature of the surface markings on the scene surfaces is not generally
known. Therefore, constraints imposed by the projective geometry alone are not
sufﬁcient. Additional constraints in the form of assumptions about the texture are
necessary. The most commonly used assumptions are: (i) uniform texture over all the
surfaces in the input scene, and (ii) orthographic imaging geometry. If the texels can
be reliably detected in the image, then it is possible to compute the gradient of texture
density. However, the texel identiﬁcation problem is difﬁcult due to imaging noise,
edge detection artifacts, and occlusion. Ahuja and Blostein [22] have integrated texel
identiﬁcation with the estimation of the surface orientation to alleviate the problem of
erroneous preprocessing. Witkin [198] posed the shape from texture problem under
regular projective geometry, isotropy, and independence assumptions. Under these
assumptions, the maximal variation in the data is accounted for by the projective

geometry. He plotted the histogram of the orientation of the edge elements and

108

showed that under orthographic imaging geometry, the amplitude and ‘phase’ of this

histogram are related to the orientation of the 3D surface. One signiﬁcant advantage

of this scheme is that it does not require extraction and identiﬁcation of the texels.
We use the shape from texture algorithm proposed by Super and Bovik [178].

Their algorithm consists of the following three steps:

1. Estimation of instantaneous frequency: In this step, a set of Gabor ﬁlters
(and their derivatives) are used to estimate the local frequency content at each

pixel.

2. Computation of invariant moments: A second—order moment matrix is
computed at each pixel using the local spatial frequency spectrograms. A pair
of coordinate frame invariant moments (M, m) can then be computed in the

directions (9, 9’) of the ﬁrst two eigenvectors of the second-order moment matrix.

3. Estimation of surface normals: Given invariant moments (Mmmp) at a
pixel p(a:1,y1) and invariant moments (Mq, mg) and slant 0,, at pixel q(a:q, yq),

the slant 0,, at p can be computed using

 

COS 0' —— C08 0' .
P q M m
P P

The tilt rp can be computed using the following equations:

9 —9 ilarccosA ,
Tp = Q P 2 p (3.4)

9,, — 9,, :l: %arccos /\,, + 7r,

109

where 9,, 9,, are the orientations associated with Mp, m,,, respectively, and AP

is deﬁned by:

cos2 0,, + 1)(M,, + m,,) — 2(Mq + 771,)
sin2 o,,(M,, -— mp)

 

AP = ( I (35)

provided that 0,,(M,, — m,,) is not zero.

Figure 3.16 illustrates a typical quantitative problems in estimating surface orien-

tation near limb edges using texture information. Note that these systemic inaccura-

cies were observed even after the perfect texture segmentation and scale information

was made available to the algorithm.

3.6 Summary

Vision researchers continue to propose, develop, and demonstrate novel algorithms

for shape-from—X modules. There are several limitations in the performance of the

individual modules:

1. Parameters Like most computer vision algorithms, a shape-from-X module

involves various parameters and its performance is often critically sensitive to its
operating region in the parameter space. The individual modules are unreliable
since the appropriate values of the parameters (i) can not be easily selected;
(ii) remain unstable over the domain of the images; and (iii) can not be, even
qualitatively, related to the tangible characteristics of the image domain. Utility

of most of the shape-from-X algorithms in practical situations is not obvious due

110

to a lack of studies on the relationship between performance of the algorithm

and its operating region.

2. Ambiguities Often the individual modules can not completely constrain the

object surfaces from the input image(s).

3. Assumptions Different assumptions are often built into vision modules. The
synthetic constraints prove to be too restrictive for real images. The violations

of natural constraints can not be reliably detected.

All these factors make it difﬁcult to rely on the individual modules for obtaining
robust performance. How these individually fragile modules can be integrated to

obtain better performance is the topic of the next two chapters.

111

 

Figure 3.13: Junction labeling example for Mushroom and Vase image (Fig. 3.9(b)):
(a) correct input line diagram; (b) labeling using line diagram alone; L, C, T, Y, A,
and P denote L, curvature—L, tangent, Y, arrow, and phantom junctions, respectively.

112

 

UDD
ODD

DC]
000
-‘O

 

IDUD
IUD

 

 

 

 

 

Figure 3.14: Shape from texture example: (a) image of an object with uniform surface
texture; (b) Edge map (scaled) extracted from the image shown in (a). Both pictures
adapted from [106].

Surface, S
viewsphere, 2

 

Figure 3.15: Imaging Geometry (adapted from [61]).

113

I99'IC“

., fill! 99999!!!
III 999
VHWIIIIOdO'II

’4’: till: 04
III’ Off!
”saunas:

 

    

 

(d)

          

ground truth slant ﬁeld for

a synthetic texture-mapped image of
i.i.d. noise and Lambertian shading);

)

complement of slant for better visibility)

a

(

10%

(

 

 

from (a) using a shape from texture algorithm [178]; (c)
the object in (a); and (d) normalized error in the recovered slant.

Figure 3.16: Shape from texture example:
an object with uniform surface texture
(b) recovered slant ﬁeld (here we show 90°

Chapter 4

N on—Uniform Integration

The shape-from-X modules are motivated by Marr’s paradigm of modular design
of a vision system [116]. We have seen in Chapter 1 that the vision modules have
several limitations and one can not depend on the individual modules to extract the
true 3D structure in all situations. Ideally, the vision processes could be modeled as
a connection matrix with each state variable interacting with every other variable,
thereby eliminating the concept of a vision module. To design such a system would
be an ambitious goal; to maintain and extend it would be even more difﬁcult.

A reasonable implementation of an integrated system would involve models of
each vision module and models of their interactions. However, past computer vision
research in integration of vision modules has not emphasized explicit information
exchange among the modules in the context of a complete system. We believe that
designing integration strategies based on explicit information exchange between the
modules is a ﬁrst step towards building more robust and tractable vision systems. This
approach also emphasizes cooperation and resonance between the individual modules

114

115

which many researchers believe to be the key to the effectiveness of the human vision
system [72, 46, 193, 172]. Earlier efforts in integration assumed that the estimates
obtained by the individual modules are reasonably accurate and hence adopted a
feedforward strategy. However, in reality, each module is imperfect and, therefore,
those integrated systems based on feedforward strategy alone are critically dependent
on the performance of individual modules. Our effort here is to demonstrate the
efﬁcacy of a feedback-feedforward strategy for integration.

Integration models proposed in the literature are based on Bayesian [166], MRF
[133], lattice-theoretic [89], game-theoretic [29], regularization [94, 153], and energy
minimization [20] formulations (Table 2.1). It is reasonable to assume that each
strategy is best suited for a certain type of interaction and hence can not be used
as the integration framework for the entire vision system without sacriﬁcing either
performance or efﬁciency. Thus, each interaction could be modeled by the strategy
best suited for that particular interaction. The success of these individual interac-
tion models depends on a number of user-speciﬁed parameters. It is hoped that the
dynamics of the system with interacting modules obviates these elaborate models of
interaction among the modules and replaces them with simpler interaction schemes,
facilitating the implementation of large integrated systems.

The general integration problem is formidable. Illustration of the advantages of
integration in a general setting is both abstract as well as ambitious. Instead, we
integrate a few speciﬁc vision modules with an objective of obtaining accurate 3D re-
construction. We demonstrate that the integrated system can withstand the violation

of the nominal assumptions underlying the design of the component modules. In this

116

Table 4.1: Vision modules in the proposed integrated system.

 

 

 

 

 

 

Module | Strengths J Problems ]

Stereo Reliable short-range Correspondence and
depth information Occlusion

Shape from Shading Orientation estimation ir- Mutual illumination and
respective of distance ﬁne textures

Line Labeling Geometric constraints Extraction of line

diagram

Perceptual Organization Boundary completion in Oversegmentation

noisy images

 

 

 

 

 

chapter, we describe an implementation of the integration of perceptual organization,
stereo, shape from shading, and line labeling modules [148]. These modules were cho-
sen primarily because of their importance in low-level vision. Also, they are known
to interact with each other [72, 158] and are complementary in their strengths (Ta-
ble 4.1). Our strategy for integration can be extended to include additional modules.

The overall block diagram of the proposed system is shown in Figure 4.1.

Rest of this chapter is organized as follows. In the next section, we describe the
interactions among the modules. Section 4.2 presents the integration algorithm. In
Section 4.3 we will discuss our experiments and results. We will conclude with a
summary of our experimental results and research issues that need to be addressed

in the design of robust integrated vision systems.

4. 1 Interactions

In this section we describe our formulation of interactions among the modules and

their implementation. The relevant literature will also be mentioned.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 4.1: Overall Block Diagram.

4.1.1 Interaction between Shape from Shading and Stereo

Modules

The algorithm proposed by Oliensis and Dupuis [145] has several shortcomings typ-
ical to the traditional shape from shading and early vision problems. It assumes
Lambertian surfaces and a ﬁnite number of point sources. It is extremely difﬁcult
to take into account the effects of interreﬂections, Specularity, and innumerable other
nonlinear effects in a general situation — since that requires an a priori knowledge

of the object geometry itself! Finally, every shape from shading approach is inher-

118

ently incapable of determining concavity/convexity of the surface. It is also not easy
to model the photometric variations due to variable distance of scene surfaces from
the illumination source. These shortcomings can be overcome with the help of the

information provided by the other modules.

Intensity Image Segmented Regions
rom Perceptual Org.)

+-

Detection Of
Singularities

' Initialization

 

    

 

 

Barners

l Depth. I
Propagation]

"""""""""" I“i5.1.1]'t.1t2..""""

Pnterregion ]
Propagation

 

 

--—---—-----—--———--—---
.--------_—---—-——-_---—

Figure 4.2: Shape From Shading Module.

We have extended the algorithm by Oliensis and Dupuis to work on piecewise
constant albedo surfaces. The extension is based on segmenting the image into re-
gions with constant albedo and treating them independently. The singular points
are now detected in the individual regions and the propagation of the depth values
is prevented across the region boundaries. Such a treatment, when implemented in
an isolated module, is plagued with erroneous depth recovery. The resulting recon-
struction often is only qualitatively reasonable and may not have desired numerical

accuracy primarily due to the following reasons: (i) the stereo module, needed for

119

initial depth values at singular points, itself might provide erroneous depths; (ii) the
Lambertian model may not be an accurate model for the image surfaces; and (iii)

interreﬂections and specularities may also cause inaccurate reconstruction.

Dis arities Estimated
t Level (I -1)

 

Detection Of
Singularities

 

Next I Intermediate

Disparities

Iteration

 

 

 

Corrected
]e—I-De th From
] D‘spimies Iéi-ISha mg Module
' At Levell

---------------------------

Disparities Estimated
At Level I

Figure 4.3: Stereo Module.

According to Bulthoff et al. [34], the depth conveyed by disparity overrides the
information conveyed by shading. When both cues actually agree, the perception
of the surface curvature is reinforced. They also have accumulated evidence that in
scenes with scarce zero crossings, shape from shading interpolates depth in the area
between two sparse zero crossings.

Some attempts have been made in the past to integrate the shape from shading
and stereo modules. Grimson [70] has used shading to constrain the image irradiance

equation to obtain the surface normal directions. Blake et al. [21] have established

120

that given the depth at the boundary conditions, the surface reconstruction by shape
from shading is unique. Leclerc and Bobick [102] have used shading to interpolate
the depth obtained from the stereo. Cryer et al. [47] have used frequency domain
methods based on the psychophysical theories to integrate the shape from shading and
stereo modules. Most of these earlier integration schemes assumed a single constant
albedo over the surfaces comprising the entire scene. Further, they do not include
any strategy for two-way interaction between the stereo and shape from shading nor
does the shading guide the computation of stereo disparities.

In our approach, we have a more general model of surface reﬂectance of the scene
(piecewise constant albedo) and a more reliable and robust strategy for treating scenes
deviating from our model assumptions. Speciﬁcally, the feedback loop between the
shape from shading and stereo modules is designed to counter the limitations of both
the modules. When there is sparse texture in the scene, the shape from shading mod-
ule will interpolate the surface depths between the (sparse) zero crossings. On the
other hand, when the reconstruction offered by shape from shading is in error, the
gradient descent iteration of the stereo module will attempt to force the resultant re-
construction towards the true depth value. This synergistic cooperation compensates
for the errors in individual modules.

Usually, albedo of the scene surfaces is not known. In such a situation, the depth
map obtained from the shape from shading module is only qualitatively correct and
can not be directly related to the depth map obtained from stereo module. Then, how
should the information obtained from the shape from shading module be meaningfully

used to improve the depth map obtained from stereo? Given an initial estimate of

121

the depth from the stereo module, the proposed integration method uses a variational
approach to make the surface orientation at each point on the depth map ‘consistent’
with the surface orientation at the corresponding point obtained from the shape from
shading module. In addition, we impose a smoothness constraint in driving the ﬁnal

depth reconstruction. A detailed formulation of our approach is described below:

1. Smoothness Constraint: The surface depth varies smoothly over each segmented
region. The measure of departure from smoothness of the surface can be ex-
pressed by (Dg. + D5), where D,C and Dy represent partial derivatives of depth

map, D, in :1: and y direction, respectively. We minimize:
e, = f/(DE; +D§)dxdy. (4.1)
x y

2. Shading Constraint: The resultant shape obtained by the integrated system
should conform with the shape obtained by the shading module. More specif-
ically, the surface orientation at each point (2:, y) of the reconstructed surface,
D, should be consistent with the orientation of the surface estimated at that
point by the shading module alone. If D“ is the estimate of the reconstructed
surface by the shape from shading module, then the aforementioned consistency

can be measured by the term:
e. = f/ [(D, — 1);")2 + (D, — 1);”)2] d1: dy. (4.2)
x :1

Let 5, and D represent the discretized representations of e, and D, respectively. Let éc

122

and D3” represent the corresponding discretized versions of ec and B“, respectively.

Given an initial estimate of the reconstructed (discretized) depths by the stereo
module, D“, and that by the shape from shading module, D”, the integrated output,

D, is the surface that minimizes
e = éc + Aé,, (4.3)

where )I is the regularization parameter; (E, éc, and 63', are the discretized versions of

e, co, and 6,, respectively.

Differentiating é with respect to D(i, j ), we obtain

‘12:.— = —~a—e.C—. + Aﬁa—ef—J (4-4)
0190.1) 500,1) 600.1)

After differentiation‘ and a rearrangement of term, we have

£5 =2(D(I.j) —’D‘(z',j)), (4.5)

where D(i, j) denotes the local 4—neighbor arithmetic mean of D(i, j ), and

ac". _ ~ . . 4‘]-
m —2(D(Z,]) D(,]))

+2 (15%, j) — 75%, 3)). (4.6)

 

1We use ﬁnite differences (on the 4-neighborhood system) to approximate differentiation.

123

Setting 51%;?)- to zero gives us an iterative solution of the depth estimate:

~ A ~ 1:
Dn+l - - :5" ~ - _ __ sh . . .
(m) (2.2) 1+ ,IIAD (m) I, (47)
where

“sowed-r] = f(D"‘(i+1,j)+ D8"<I.j+1)+
DWI — 1,2“) + Dray] — 1))

—Ds”(i, j), (4.8)

and D0 = D". In order to avoid the instability in the reconstruction process, Eq. (4.7)

is applied only when the sign of [IADS"(i,j)n]| differs from the sign of ||AD(i,j)n|],

where
IIAD‘I'JYII =f(15(i+1,j)+ D(i,j+1)
D(i—1,j)+ D(m — 1)) — Duo. (4.9)
Parameter a (= iii—A) is the coupling coefficient between the shape from shading

module and the stereo module. Notice that correction term in Eq. (4.7) is not based
on any precise calibration, but is set to an arbitrary monotonic function of the depth
depending on the value of a. In practice, we have seen that the performance of
the integrated system does not critically depend on the value of a as long as it is
suﬂiciently small (a g 0.05).

The ﬂow of information from the stereo module to the shading module is relatively

124

straightforward. The depth values at the singular points are initialized to the corre-
sponding depth values predicted by the stereo module. The concavity or convexity
of the surface at the singular points is also estimated from the depth map obtained

from the stereo module.

4.1.2 Interactions between the depth modules and the Per-

ceptual Organization Module

Both the general viewpoint and the relationship of signiﬁcant 2D features (See Fig-
ure 3.2) with 3D causality are based on fallible assumptions. The advantage of an
integrated environment is the capacity to oversegment the image using these fallible
assumptions and then, at a later stage, recover the “correct” segmentation with the

help of other information cues.

The human visual system interprets gradual changes in irradiance as a change
in surface orientation of a single surface while abrupt changes in the irradiance are
interpreted as the presence of boundaries between two distinct surfaces (or change
in albedo). The process of demarcating the image plane into regions is shown to be
a precursor to the 3D interpretation of the scenes. However, in the real world, this
delineation is not decided by the image characteristics alone, but also by perceptually
signiﬁcant relations among the image features. Regions formed by perceptual bound-
aries are ﬁlled-in with the appropriate features (like color, brightness, texture, etc.)
and the perceptual boundaries act as feature barriers [72]. Many vision researchers

have been using intensity gradient as a deterrent to smoothing across uniform regions

125

[195], but perceptual boundaries have not been used frequently for this purpose.

In shape from shading module used in our integrated system we have prevented
the pr0pagation of depth values across the perceptual boundary. Similarly, in stereo
module, the smoothness constraint is not enforced across perceptual boundaries. This
has signiﬁcantly reduced the blurring of the sharp depth boundaries in the recovered
depth map. We also set the depth at perceptual boundaries to the values predicted
by the stereo module since the reliability of the stereo module is at its best in these

regions.

4.1.3 Interactions with Line Labeling Module

Although the line labeling problem has been rigorously studied for a limited object
domain, it is plagued with innumerable implementation problems. Most reported
work on the line labeling problem assumes availability of a perfect line diagram at the
outset. Obtaining a good line diagram appears to be a difﬁcult problem. Further, the
parsing of a line diagram into vertices and arcs has been shown to be extraordinarily
tricky [139]. Many researchers wonder whether ideal line diagram and its accurate

parsing could ever be produced in a practical situation.

Integrating line labeling module with other modules is a challenging task since the
constraints involved in it are quite different from those in stereo, shape from shading,
or shape from texture modules. The line drawing interpretation module uses geo—
metrical and qualitative constraints whereas the other modules rely on quantitative

constraints.

126

Boundaries from Perceptual
Organization Module

Vertex/Arc

Parser

 

   

 

       

Assign
Initial
Probabilities

  

Depth Values fro
Depth Modules.

    

 

Probabilistic

Relaxation

 

Final Line Labels

Figure 4.4: Line Labeling Module.

Attempts to integrate shape from shading and line labeling for curved objects
have been reported by Malik and Maydan [113]. They formulated the integration
problem as an optimization problem to simultaneously recover surface orientation
and line labels. Their cost function also includes a regularization term. However,
their problem has been formulated for scenes composed of the single constant albedo
and demonstrated only for synthetic images. Malik’s line labeling algorithm has been
used by Stockman et al. [176] for constraining the ﬁt of the object models to range
data. 'IIytten has attempted to label the line drawings obtained from perceptually
grouped edgels [191]. Our approach is an improvement over the strategy proposed by
Trytten [191] and exploits the powerful constraints exerted by the limb boundaries

to disambiguate line drawing interpretation.

The proposed integration mechanism described here investigates whether the er-

rors in extraction of the line diagram and its parsing can be compensated for by the

127

information provided by the depth modules described earlier. We use the line labeling
module for curved objects proposed by Malik [112] who provided an explicit catalog
of legal line labels for objects with C3 surfaces. Our method of extracting and parsing
the line diagram and favors an interpretation of the scene in which the surfaces of one
object do not touch surfaces of any other object (ﬂoating object hypothesis [189]).

Traditionally, deterministic discrete relaxation methods have been used for label-
ing line diagrams. However, the errors in the measurement of various features (e. g.,
boundary curvatures, tangent directions) from the line diagram warrant a better mod-
eling of the uncertainties involved in the process of extraction of the line diagram and
its parsing. One obvious choice is probabilistic relaxation formulation.

Let .C = £5 U L4 be the set of labels, where £3 is the set of junction labels
and .CA is the set of arc labels. Deﬁne £3 = {L, C, T, Y,A,R, P}, where L, C, T,
Y, A, R, and P represent L, curvature-L, T, Y, arrow, three-tangent, and phantom
junctions, respectively. Deﬁne EA 2 {:17, +, —, —>, ——>—>}, where +, —, —>, —>——) denote
convex, concave, (outside) boundary, and limb labels for arcs of the line diagram,
respectively. Note that the arc label a: could represent any non-depth arc including
surface markings, albedo, or shadow edges. Function Pk(l,,) denotes the ‘probability’
of vertex (arc) I: having a label l,,.

Junction labels determined from tangent-based and curvature-based features of
an are are unreliable [189]. We, therefore, supplement this information by using three
depth-based features extracted from the depth modules: dihedral angles, depth dis-
continuity, and limbness of an arc (See Table 4.2). A dihedral angle ($1) is estimated

using the 3D information extracted from several patches from regions abutting an

128

arc. The dihedral angles are measured in such a way that for convex arcs the angle
measurements are positive. The present implementation of estimating dihedral angles
does not fully address the issues related to invisible junction [112]. We have grouped
the invisible junctions and the spurious degree-2 junctions formed due to noise into
phantom junctions. The presence of the depth discontinuity is measured in a similar

manner. The measurement of limbness of an arc is described below.

Estimation of “Limbness” of a Boundary

To reliably determine “limbness” of a boundary, we use the information offered by
an ensemble of surface normals in its vicinity. Let us call the angle deﬁned by the
surface normal (ni), the viewing direction, and the boundary normal (n) as 9,- (see
Fig. 4.5). Usually, these angles decrease monotonically as we move in the direction of
boundary normal pointing inside the region enclosed by the limb edge. For non-limb
edges, this trend should be considerably less signiﬁcant.

Suppose we are investigating “limbness” of a point b on the boundary B and
the two estimated normals of the boundary are in the directions 11 and —n. Let
n1, n2, 113, ~ - - , nm be the m surface normals sampled in the direction 11 at distances
d1,d2,d3,---,dm from the boundary. We assume that the sampling is guided by
segmentation to avoid observations coming from different surfaces of the scene. Con-
struct a separate rank ordering of the angles 9, and distances (breaking any existing
ties randomly) to obtain two rank sequences r,- and 3,. We compute the Spearman’s

rank-order correlation statistic, S [105]. The signiﬁcance of a negative value of S is

 

tested by computing R = S\/ (m — 2) / (1 — S?) and comparing it to a threshold with

Z Axis

Y Axis

I

129

Viewing Direction ]

 

 

 

X Axis

Boundary

Normal

 

X Axis

Figure 4.5: Limb Detection.

II

130

a size of 0.05. This test is supplemented by the average starting angle test. This
test evaluates whether the arithmetic mean of the ﬁrst few angles (9,) is larger than
a threshold, t. Both of these tests are repeated for the samples of surface normals
taken in the direction of the other boundary normal (—n). A point on the boundary
is assessed to be part of a limb edge if at least one set of surface normals passes
both the trend test as well as the average starting angle test. The fraction of points

belonging to an arc passing the limbness test determines the limbness of that arc.

Estimation of Relaxation Parameters

We compute the initial probabilities using Tables 4.3—4.4 ((T,,,1 = 7r / 20, T9 = 4 pixels,
61 = 0.5, 62 = 0.7, T,,,3 = 7r/3, Td = 5). We have used the generic sigmoid function

for specifying the a priori probabilities:

0(13; 6, It) 2 1/(1 + exp(——6 * (a: — It))).

For notational convenience, we denote 0'(—.’13;5,K.) by 0’ (:13; 6, It). The compatibility
functions, rm: .6, x C, —-> [—1,1], model the ‘likelihood’ of labels t and 3 being
labels of the neighboring arcs (or vertices). A compatibility function is negative,
positive, or zero depending upon whether the labels are incompatible, compatible, or
independent, respectively. The compatibility functions are directly determined from
Malik’s junction catalog.

The relaxation algorithm alternates between two epochs: vertex and arc. In the

vertex epoch, it updates the probabilities of the vertex labels. At the end of the

131

 

(1)3 = min(A,B,C)

Figure 4.6: Line diagram features.

132

Table 4.2: Arc and vertex features. The features g, (151, ll relate to an arc; d2, d, and
lg are degree—2 vertex features; (153 and l3 are degree-3 vertex features (see Fig. 4.6).

 

 

 

 

 

 

 

 

 

 

 

Feature Deﬁnition 1

g Average depth gradient across the arc.

(151 Average dihedral angle between the
surface normals across the arc.

l1 Limbness of the arc.

(t2 Angle between tangents to the two arcs.

d Length of the shortest of the two arcs.

12 Maximum of limbness of the two arcs.

(153 Maximum of the angles between tangents
of each pair of arcs.

[3 Max. of the limbness of all arcs.

 

 

 

Table 4.3: Prior probabilities for arc labels. The term Z = 2,65,, 1(2) is the normal-
ization factor. Note that T k speciﬁes threshold for parameter k.

 

 

 

 

 

 

 

 

Label Indicator Prior
Function Probability

+ II”): 0(¢1;51aIT¢1I) II’Ul/Z

- I(C%=0 U’WnAHQM Ikﬂz

-+ “()= wWﬂhWQDIl-h)ZUVZ
-++ N<2TﬂWMMEDHb) ﬂaw

I a: = 2 1— max Ii
X ] i E {u,c,b,l} I(1:)/Z

 

 

 

 

epoch, it assigns a tentative label to each vertex according to the ranking offered by
the computed probabilities. Thus, in each vertex epoch, the relative probability of
label y assigned to vertex v is updated according to

va( )"”[1+Ava( )"”l

2%ch (y )(")[1+AP,,y( )<n>]I (4-10)

 

Pv(y)‘"+"=

133

Table 4.4: Prior probabilities for junctions. Note that Tk speciﬁes threshold for
parameter k and Sgn(.) is the conventional sign function.

 

rLabel I Prior Probability ]
0W2; d/Td, g)

(1 - 0’(¢2;d/Td,§) * (1 —12)

(1- 0'(¢2; (ii/71mg) * 12

MG) = 0(Na - 7r; 5, |T¢3l)

My) = 0’(¢3 - 7r; <5, IT¢3I))

Sgn(¢3 - 7r) * (1 - p(d))+

(1 - 59w. — 7r)) * (1 - p(y)) * (1 - ls)
Sgn(¢3 — 7r) * (l — p(a))+

(1 - Sgn(¢3 - 70) * (1 - PM) *13

 

 

 

 

 

 

 

 

i=0 H’K‘ILQB’TJ

 

 

 

 

where

APv(y)‘"’= Z Z rvu(y,y’)Pu(y’)("’, (4-11)

uENa(v) y’ECJ

n is iteration number and Na(v) is the set of all arcs incident on vertex 2). Considering
these tentative vertex labels, the label probabilities of the arcs are updated in the arc
epoch using an updating strategy similar to that in Eq (4.10). Since the boundary arcs
(vertices) exert stronger constraints, the order in which each arc (vertex) is visited
favors peripheral arcs (vertices). The relative probabilities of each label of each arc
(vertex) are restored to their original values after entering each epoch. The algorithm
terminates when the weights for some label of most of the sites is close to 1 or there

are no changes in the successive epochs.

134
4.2 Integration Algorithm

In this section we present a high-level description of the overall integration algorithm.
Given a stereo pair of intensity images, I1 and 12, direction of the illumination source,
S, weight vector w (Eq. (3.1)), a coupling coefﬁcient (oz) (Eq. (4.7)), various threshold
parameters (Tables 4.3 and 4.4), the depth values and the line labels are reconstructed

using the following steps.

1. Compute the four attribute (smoothed, gradient, positive curvature, and nega-
tive curvature) images. Initialize the disparity at each pixel in the coarsest level

(I = 6) to zero.

2. Compute the closed regions of uniform intensity using perceptual organization

module (Section 3.1). This will be referred to as the label image.
3. Starting with the coarsest level (I = 6), do at each level I:

A. Obtain the four attribute images at level I by blurring the attribute images

at level 0.

B. Obtain label image at level I by median ﬁltering the label image at level
0. This is essentially a map of the regions of uniform intensity in the im-
age. Applying an edge detector to this label image will provide perceptual

boundaries at level I (boundary image, 8;).
C. Identify singularities in each region from the intensity image at level I (8;).

Do steps (i) to (v) for N (= 20) times.

135

(i) Update the disparities at level 1, (d,), by one iteration of Weng et al.’s
algorithm (Eq. (3.1)). The resulting depth map computed from these

disparities will be denoted by Di".

(ii) Initialize the depth at each singularity position in S; by appropriate
depth values from 15,“.
(iii) Normalize the intensities in each region with respect to the average

intensities in that region. Assess from Di" the convexity/ concavity of

each region.
(iv) Apply shape from shading algorithm to obtain a depth map, Df".

(v) Obtain the corrected depth map D; from th and D,“ using the shape
from shading and stereo integration (Eq. (4.7)). Update the disparity

map using D1.

. Parse the boundaries in B, into a line diagram graph 9 = {MA}, where

V is the set of vertices and A is the set of arcs.

. Compute the line diagram features from the depth map D1 and the bound-

ary image 8, (Table 4.2).
. Initialize the a priori probabilities of labels (Tables 4.3 and 4.4).

. Iterate through the probabilistic relaxation until the label probabilities

stabilize (Eq. (4.10) shows the updating scheme for vertex epoch).

. Project the disparities to level (I — 1) by replicating disparity 2d¢(2', j) to

locations (22', 23'), (22+ 1, 2]), (22', 23' + 1), and (22' + 1, 23' + 1).

136

Table 4.5: CPU times for Mushroom and Vase Image on Sun Sparc 20.

 

 

 

 

 

 

Computation Time
Perceptual Grouping 6ls
Shape from shading 810$
Stereo 5945
Line labeling 357s
Integrated System 22845

 

 

 

 

4. Output the ﬁnal depth map [3 = Do, and line labels. The surface orientation

(normals) at each point can be computed from D.

The major components of the integration themselves require local computations
(Eqs. (4.7) and (4.10)). The integration of the stereo and shape from shading modules
needs one pass over the entire image and its computational complexity is 0(P2), for
a total of P2 pixels. The computation of features from the line diagram is somewhat
image dependent and its overall complexity is 0(IVI2), where IV] is the cardinality of
V. The probabilistic relaxation does not usually take more than a few iterations over
the entire graph. Thus, the proposed integration mechanism is an efﬁcient and reliable
method of integrating vision modules. Table 4.2 presents actual timing statistics on

a Sun Sparc 20 for Mushroom Vase image (Fig. 4.8).

4.3 Experiments

All the real images were captured by an inexpensive CCD camera (Panasonic GP-
KR202, f = 25 cm, maximum aperture). The images were subsequently gamma

corrected with '7 = 2.0 and normalized to 256 gray levels. The stand off was

137

approximately 80 cm and to obtain a stereo pair of images, the camera was either
translated or translated and rotated. The translation was in the direction of :::-axis
and rotation was about the y-axis (the z-axis being approximately aligned with the
optical axis). The imaging setup was not calibrated; all alignments, translations, and
rotations were approximate and were not precisely measured/verified. The rotation
of the camera was effected to bring the disparity of the region of interest close to zero.
The scene was illuminated with ambient light and a single incandescent light source
was located (30 cm) behind the camera (approximately in :::-2 plane) pointing in the
direction (0,0,1). A polarizing setup similar to the one suggested in [140] was used
(when necessary) to reduce the specular component of the reﬂection.

In most of the experiments, we have chosen to compare the reconstruction results
obtained from the integrated system with corresponding results from stereo module
alone; the reconstruction results from the other individual modules were not as reli-
able as those of the stereo module. All the reconstruction results are presented for
the right image of the stereo pair and for all the experiments the parameter a was
set to 0.001. Further, the shape from shading constraints were exploited only for four
different spatial resolutions (64 x 64, 128 x 128, 256 x 256, and 512 x 512). Use of
shape from shading module for lower resolution representations did not signiﬁcantly
improve the results. Figures 4.8(a) and 4.8(b) show a stereo pair of a scene consisting
of two unglazed ceramic objects (image size 512 x 512) with near ideal Lambertian
surfaces. Figures 4.8(c) and 4.8(d) show the (relative) depth reconstruction obtained
by the stereo module and the integrated system, respectively. The diffusion of dis-

parities across the perceptual boundaries signiﬁcantly blurs the depth map by the

138

stereo module; this is prevented in the depth map obtained by the integrated system.
Figures 4.8(e) and 4.8(f) show the orientation of the object surfaces for stereo and
integrated system, respectively. In case of the stereo depth map, it can be observed
that as the size of the untextured region increases, the quality of reconstruction de-
teriorates.

The objective of the second set of experiments is to quantitatively demonstrate
that our integration mechanism can handle images of various types of surfaces, includ-
ing concave, convex, and saddle shaped objects. In particular, this set of experiments
is designed to evaluate our integrated system on a range of intensity images of syn-
thetic scenes consisting of Lambertian algebraic surfaces. The shaded intensity images
(Fig. (4.9)) were synthesized using ray tracing method [66] of photo—realistic render-
ing (with atmospheric turbulence modeling the imaging noise) with a parallel camera
geometry (standoff = 50 units, cameras at (—5, 0, 0) and (5,0, 0)) and a distant point
source illumination (at (0,0, —100)) (see Fig. 4.7). We then matched these stereo
pairs obtained from synthetic rendering using (a) stereo module alone and (b) the
integrated system. The resultant reconstructions were compared with the depth data
projected from the range images (ground truth) using a squared error function based
on differences in true and estimated depth and shape (surface normals) measurements
(Den. and Se", respectively). Percentage reduction in this squared error is used for
assessing the performance of the proposed integration strategy. Table 4.6 shows that
the surface reconstruction using the integrated method is superior to that obtained
by stereo module alone. Note that stereo module can almost always correctly ﬁnd

correspondence at the boundaries of the object. The accuracy of the disparity map in

139

Noise Camera
Model Position(s)

 

 

Stereo
——-> Pair of
Intensity Images

Imaging
(Raytracing)

 

 

 

Surface Texture-map Source
Texture Method Position

Figure 4.7: Image synthesis using ray tracing.

the regions between the boundaries is dependent on the changes in the corresponding
brightness patterns in the image. The fewer the changes in the brightness in this re-
gion, the greater the scope of improvement in the performance due to integration. If
the foreground regions in the image have smaller diameter (maximal width) then the
stereo module alone can produce reasonably accurate correspondences in the regions
between the boundaries. Consequently, it is harder for the integration to improve the

performance in this situation.

Figure 4.10 shows the images synthesized using imaging conditions identical to
those in Figure 4.9 except that Phong model of Specularity was used (17 = 0.8) in-
stead of the Lambertian assumption. Table 4.7 shows that the reconstruction results

are relatively stable when surfaces are non-Lambertian (specular) (Figure 4.10). The

140

improvement in the performance in the case of specular surfaces is smaller in magni-
tude than for the corresponding Lambertian surface (Table 4.6). This is largely due
to the improvement in performance of stereo module in the case of specular surfaces
— the sharper changes in intensity at specular highlights make the correspondence
problem easierz.

Figures 4.11(a) and 4.11(d) show the range images of two objects used for
quantitatively evaluating the accuracy of the depth map obtained by our integrated
system. Two pairs of synthetic stereo intensity images were generated from these
depth maps using Lambertian surface assumption. An additive 2'.sz Gaussian
noise (with standard deviation 2%) was then added to the left and right images
separately. A sample pair of these synthetically generated stereo pairs is shown in
Figures 4.11(b), 4.11(c), 4.11(e), and 4.11(f). A comparison of reconstructions is

summarized in Table 4.8.

Table 4.6: Improvement in surface reconstruction due to integration: Lambertian
surfaces.

 

| Surface primitive I Sm(%) | De" (‘76) I

 

 

 

 

 

 

 

Parallelopiped 30.0 41.3
Sphere 23.4 35.7
Cylinder 22.0 26.5
Paraboloid 18.1 20.0
Hyperboloid 24.1 20.2
Torus 15.8 17.6

 

 

 

 

 

Figures 4.12(a) and 4.12(b) show stereo images of an unglazed ceramic object

 

2Location of specularities in an image depends upon the viewing direction and then the corre-
spondence based on the features derived from Specularity are unreliable. However, when an entire
surface lacks any signiﬁcant change in the albedo (or the irradiance) and change in the viewing di-
rection is not very large, the correspondence is somewhat improved due to the presence of specular
features.

141

Table 4.7: Improvement in surface reconstruction due to integration: Specular sur-
faces.

 

I Surface primitive ISer,(%) ID", (%) I

 

 

 

 

 

 

 

Parallelopiped 30.0 41.3
Ellipsoid 10.0 13.9
Cylinder 12.5 16.9
Paraboloid 13.4 17.6
Hyperboloid 15.3 13.6
Torus 6.8 8.1

 

 

 

 

 

Table 4.8: Improvement in surface reconstruction due to integration: intensity images
synthesized from real range images.

 

 

 

 

I Object I Sm(%)I Den. (‘76) I
Tomato 25.2 22.7
Pipe 35.0 32.6

 

 

 

 

 

(mushroom) and an object made of acrylic plastic (Y—shaped pipe). These images
were captured without cross-polarized ﬁlters to allow the specular reﬂections to be
imaged. Notice the two specularities on the surface of the pipe. The quality of the
reconstructed depth map of the integrated system as shown in Figures 4.12(d) and
4.12(f) has largely remained insensitive to the specular reﬂections. Figures 4.12(c)

and 4.12(e) show the corresponding outputs of the stereo module.

Figures 4.13(a) and 4.13(b) show stereo images of a granny smith apple and a
yellow pepper. The surfaces of both the objects do not possess a very uniform albedo.
Again, as depicted in Figures 4.13(d) and 4.13(f) the integrated system performs
better than the stereo module alone.

The segmentations obtained in the earlier images (Figs. 4.14(a)-(c)) by the group—

ing module were near perfect. We now demonstrate the efﬁcacy of our approach in

case of an imperfect segmentation of an image consisting of piecewise constant albedo

142

surfaces. Note that the grouping module fails to obtain a correct segmentation of this
image (Fig. 4.14(d)). Figures 4.15(a) and 4.15(b) show stereo images of an unglazed
ceramic object (egg) and a foam cup. Notice the lack of secondary reﬂections on
the lower part of the egg. Figures 4.15(c) and 4.15(e), respectively, show the depth
reconstruction and surface needle map obtained from the isolated stereo system. Fig-
ures 4.15(d) and 4.15(f) show the corresponding representations for the integrated
system. Very bright (low depth values) regions between the objects and in the far
left of the image are due to occlusions. Notice that the quality of surface recon-
struction for the cup in case of the stereo module alone (Figures 4.15(c) and 4.15(e))
is comparable to that of the integrated system (Figures 4.15(d) and 4.15(f)). The
depth reconstruction provided by the integrated system is relatively more accurate
except for the slight deterioration on the lower part of the egg. The surface of the egg
has been incorrectly reconstructed by the isolated stereo module (Figures 4.16(a))
compared to the integrated system (4.16(b)). Note that stereo module (alone) could
reconstruct surface of glass correctly due to the presence of texture features.

Figure 4.17 shows the contours detected from the Mushroom and Pipe image
by the line labeling module. The objects in this image do not have any surface
markings and the image does not contain any shadows. The detected contours have
been classiﬁed into limb and non-limb edges. Recall that these representations form
an input to the junction labeling module. Figures 4.18(a) and 4.18(b) show the
outputs of the line labeling algorithm proposed by Trytten [189] and our algorithm,
respectively. Note that all the curvilinear-L junctions are correctly labeled by our

algorithm. In addition, the labeling of the T-junctz'ons is a gross approximation of

143

the physical reality [139], given the quality of output generated by the segmentation

module.

4.4 Summary

The visual world can be ambiguous only in relatively contrived situations. In the real
world a combination of cues (visual or other kinds) conveys a unique physical reality.
Whether to reconstruct the entire visual input or to extract its components relevant
to the given task, a reliable vision system is expected to integrate many visual cues
to obtain an unambiguous output. However, the information provided by each cue is
based on its own set of assumptions. This raises several important research issues in
solving the problem of integration. What is the most reliable information provided by
each visual cue? How to design an integrated system which can be easily maintained
and extended? How to integrate vision modules so that the system performance does
not critically depend on individual modules? Deﬁnitive answers to these questions do
not appear to be in sight. In this chapter we have made an attempt to explore some of
these issues in a somewhat limited context of an integrated system which reconstructs
3D information from a pair of (stereo) intensity images using the following four vision
modules: perceptual organization, shape from shading, stereo, and line labeling.
Several integration strategies have been reported in the literature for primarily
studying pairwise integration of vision modules. But only a few studies have compre-
hensively integrated more than two modules for a complete 3D reconstruction of real

images. We have proposed and implemented an integration framework emphasizing

144

interaction and information exchange among the four vision modules. We have shown
the reconstruction results using integrated system for both synthetic and real images.
We also demonstrated the consistent performance of the integrated system even in the
adverse situations where one or more assumptions made by the individual modules
are violated. The numerical accuracy of the recovered depth is assessed in case of
synthetically generated data. We have also quantitatively evaluated our approach by
reconstructing geons from the depth data obtained from the integrated system.

In general, the potential of a system relying on low-level modules has been grossly
underestimated because of the vulnerabilities of the individual modules. We have
demonstrated that these shortcomings in the individual modules can be overcome
in an integrated environment. In particular, the limitations of a stereo module in
dealing with images displaying no signiﬁcant intensity changes can be alleviated by
the shading module. Our perceptual organization module completes the boundaries
obscured due to low contrast and noise and facilitates in obtaining a reasonable line
diagram. The perceptually completed boundaries also serve as feature propagation
barriers and help in obtaining more reliable results for both shading and stereo mod-
ules. Difficulties in obtaining correct arc and junction labels can be overcome using
depth information obtained from the stereo and shading modules.

The proposed system is far from ideal. The perceptual organization system does
not get any feedback from the other modules. It is not obvious in what form the
other modules can detect and convey the errors made by the perceptual organization
module. Our design is largely inspired by psychophysical and neurological evidence. It

is, therefore, not easy to extend the current integrated system to include an arbitrary

145

vision module without understanding the strengths and limitations of the module.
A considerable amount of further research is needed to obtain better strategies of
integration and establish a formal and more uniﬁed framework to ease the design of

an integrated system. This constitutes the topic of the next chapter.

146

 

(C) (d)

      
 

LII/r
IlllFZ—Z

§

Figure 4.8: Mushroom and Vase image (size 512 x 512): (a) and (b) Left and right
stereo images; (c) Recovered depth from stereo alone; (d) Recovered depth from the
integrated system; (e) Recovered surface normals from stereo alone; (f) Recovered
surface normals from the integrated system.

147

 

(d) (0) (0

Figure 4.9: Synthetic surface primitives (Lambertian surfaces): (a) Parallelopiped;
(b) Sphere; (c) Cylinder; (d) Paraboloid; (e) Hyperboloid; and (f) Torus.

148

 

(e) (0

Figure 4.10: Synthetic surface primitives (Specular surfaces): (a) Parallelopiped; (b)
Ellipsoid; (c) Cylinder; (d) Paraboloid; (e) Hyperboloid; and (f) Torus. All surfaces
are shown with Phong shading.

149

 

((0

Figure 4.11: Synthetic stereo images: (a), ((1): Range Images of Tomato and Pipe
obtained from White scanner [110]; (b), (c): Left and right stereo images generated
from (a); (e), (f): Left and right stereo images generated from (d).

150

 

Figure 4.12: Mushroom and Pipe image (size 512 x 512): (a) and (b) Left and right
stereo images; (c) Recovered depth from stereo alone; (d) Recovered depth from the
integrated system; (e) Recovered surface normals from stereo alone; (f) Recovered
surface normals from the integrated system.

151

 

 
  
 

//
it

Ill|\l\\\

    

I mu

 

Figure 4.13: Apple and Pepper image (size 512 x 512): (a) and (b) Left and right
stereo images; (c) Recovered depth from stereo alone; (d) Recovered depth from the
integrated system; (e) Recovered surface normals from stereo alone; (f) Recovered
surface normals from the integrated system.

152

 

(C)

Figure 4.14: Segmentation results: (3) Mushroom and Vase 4.8(b); (b) Mushroom
and Pipe 4.12(b); (c) Apple and Pepper 4.13(b); (d) Egg and Cup 4.15(b).

153

 

Figure 4.15: Egg and Cup image (size 512 x 512): (a) and (b) Left and right stereo
images; (c) Recovered depth from stereo alone; (d) Recovered depth from the inte-
grated system; (e) Recovered surface normals from stereo alone; (f) Recovered surface
normals from the integrated system.

155

 

Figure 4.17: Limb edges detected in Mushroom and Pipe image (Figure 4.12). Limb
boundary pixels are rendered as thick boundaries and non-limb boundary pixels are
depicted as thinner edges.

156

 

 

 

 

 

 

 

 

 

 

 

Figure 4.18: Junction labeling results for Mushroom and Pipe image (Figure 4.12):
(a) using line diagram alone; (b) by the integrated system using the information
provided by the depth modules and line diagram; L, C, T, Y, A, and P denote L,
curvature-L, tangent, Y, arrow, and phantom junctions, respectively.

157

L L
L ' L
T Y
7T T
P
L .
L , L L ~ L L
(a)
L L
c --c
T T
T T
P
C
(b)

Figure 4.19: Junction labeling results for Mushroom and Vase image (Figure 4.8): (a)
using line diagram alone; (b) by the integrated system using the information provided
by the depth modules and line diagram; L, C, T, Y, A, and P denote L, curvature-L,
tangent, Y, arrow, and phantom junctions, respectively.

Chapter 5

A Uniform Bayesian Framework

for Integration

There is a substantial amount of literature on speciﬁc algorithms for pairwise in-
tegration of vision modules [8]. However, typically, these schemes lack a broader
perspective of the integration problem and do not systematically deal with the prob-
lems encountered with images of real and complex scenes. Hence, these integration
algorithms often can not be extended either to other domains or to different mod-
ules. A method of including additional vision modules in the non-uniform integration
schemes similar to the one pr0posed in Chapter 4 is also not obvious. Several generic
methodologies in the literature (Table 2.1) offer a diverse set of tools for the inte—
gration problem, but do not offer a ready-made solution. A considerable insight is
needed in order to implement these techniques for solving a particular integration
problem.

A uniform integration framework deals with the constraints provided by each

158

159

module in a systematic and regular fashion. One of the simplest examples of the
uniform integration framework is Moravec’s [130] work on fusion of data from multiple
depth sensors to derive space occupancy information using Bayes rule. It is desirable
to have a uniform framework for integrating information available from the vision
modules. Such a framework will help us identify the essential components of an
integration task and facilitate the incorporation of a given vision module into an
existing system. In this chapter, we pr0pose and evaluate a Bayesian framework for

the recovery of structures speciﬁc to the 3D world.

Rest of this chapter is organized as follows. Section 5.1 formulates the 3D recon-
struction problem as an estimation problem. Section 5.2 presents the four modules
which we have integrated and their interactions with the intrinsic map. In Section 5.4,
we describe the overall integration algorithm. Experimental results and imaging setup
are described in Section 5.5. Section 5.6 concludes with a discussion of various is-
sues pertaining to the visual integration, and accomplishments and limitations of the

proposed scheme.

5.1 Bayesian Estimation

Let us illustrate the concept of Bayesian integration using the following simple ex-
ample. Consider a Bayesian estimator integrating three depth observations, 01, 02,

and 03 at the same pixel site from three different vision modules, M1, M2, and M3,

 

 

 

 

160

 

respectively. Consider the following formulation]:

 

01 n1
,.
03 n3

where 6 is the (true) depth variable to be estimated based on the observed depths
01, 02, and 03 contaminated with additive noise n1 ~ N(0, 01"), n2 ~ N(0, 0%), 713 ~

N(0, 03).

The Bayesian estimate is given by maximizing the a posteriori probability of G:

6:) = argmax P(9I01,02,03). (5.2)

Using Bayes’ formula, the maximization problem in Eq. (5.2) can be restated as:

Q = arg max P(01,02,03I9)P(9)
P(01a02303)

 

(5.3)

Since the denominator P(01,02,03) is independent of 9, the maximization problem

is simpliﬁed to:

A

9 = arg max P(01,02,03|G)P(9). (5.4)

 

1In a more general case, the depth is a linear or non-linear function of the observed variables (see
Section 5.3 for the linear case).

161

Assuming 6 follows a uniform distribution, Eq.( 5.4) reduces to
(:3 = arg max P(01,02,03|(-)).
Assuming that observations are statistically independent leads to:

G) = arg max P(01|@)P(02|(-))P(03|9).

Let

"‘0
III

HM®HMWW%W)

. .1... (4+9? )1 (Lee-Mam (_._,.._..2,

”1 1 ‘72 ‘73

Differentiating ln(P) with respect to 9 and setting QI—g‘éﬂ to zero, we obtain

From Eq.(5.9), the MAP estimation of 6 turns out to be:

 

- a+3+§
G = 1 2 3
1 l 1 ’
?+?+?
l 2 3

an

as

am

am)

This Bayesian solution for integrating the three observations, 01, 02, and 03 is in-

tuitively appealing. The larger the variance in the module output, the lower the

inﬂuence of the module in the final outcome of the integration. The reader may also

162

notice that this solution also corresponds to the least squares solution.

We now present the integration problem in the context of 3D scene reconstruction
as a classical Bayesian estimation problem. Let I 2 (I,, I1) E D” represent the right
and left (stereo) images of size N (= n X n), where 5R” denotes an N —dimensional real
vector. Let S represent the set of all sites (pixels) (:::, y), S = {(23, y,);i = 1, . . . , n;j =
1,. . .,n}. Let D = {D1,...,DK}, D], E 9?”, be the kth component of the intrinsic
map. For example, D,- could be the boundary, depth, or surface reﬂectance map.
The generic problem of 3D reconstruction, given I, can then be stated as ﬁnding the

maximum a posteriori estimate of the intrinsic map:

A

D = argmgxP(D|I), (5.11)

where P (DII) denotes the probability of intrinsic map D, given the observed intensity

maps, I = {IbIr} (see Fig. 5.1).

In a modular integration, this basic 3D reconstruction problem (Eq. 5.11) is mod-
iﬁed as follows (see Fig. 5.2). Let us assume that the vision module M, can reli-
ably estimate a function F, : I —> ER” of (often, a projection of) the intrinsic map
component, D], j = 1, . . . ,K2. The corresponding imaging process is described by
R, : D ——> I. The modular integration problem can then be stated as follows: given

I and a set of modules {M,, r = 1, . . . , m}, ﬁnd an accurate and stable estimate of

 

2For instance, some modules do not directly determine depth but only depth derivatives, e. 9.,
surface normals.

163

 

Intensity Cues

 

 

 

Intrinsic Map

 

 

Figure 5.1: Generic Integration Problem.

the intrinsic map D:

A

D = argmgx P(R1(D),...,R,,,(D)II), (5.12)

where P(R1(D), . . . , R,,,(D)|I) denotes the joint probability of modules R1, . . . , Rm

producing images similar to the observed intensity images I.

The formulation in Eq. (5.12) is a very difﬁcult optimization problem for the
following reasons. First, the number of random variables involved is extremely large;
for a n x n image, there are n2 random variables in each D,. Secondly, the inverse
optical imaging problem (function F,) being solved by each vision module is often
underconstrained. Finally, it is not easy to separately model the interactions among
the modules; even if the joint density of random variables (outputs) associated with

different modules can be estimated, the resultant model will be extremely complex

164

Module C

 

Intrinsic Map

 

 

 

Module A Module B

Figure 5.2: Modular Integration.

and diﬂicult to optimize.

It is desirable to have a uniform interface mechanism to facilitate an extensible
and a ﬂexible overall system (even at the expense of accuracy). One pragmatic way
to incorporate all these (integrated) system features is to make the operations of each
module transparent to the operation of the other modules in the system. One of
the simplifying assumptions is that at any stage, the intrinsic map could be updated
by a given module by maximizing its (random variables associated with the intrinsic
map) a posteriori probability independent of the other modules; the interaction of the
modules is permitted only through a centralized intrinsic map. Each module observes
the current state of the intrinsic map and the input intensity information to arrive

at the best possible (MAP) reﬁnement of the intrinsic map. The resultant iterative

165

strategy then is to obtain (see Fig. 5.3)
75in) = arg mgx P(R,(t)(D)|D:(‘,1_1),I), (5-13)

where r(t) deﬁnes a module Mr(t) operating at time t and Dim is an intermediate
solution at time t proposed by module Mm). Involving Dial” in the solution of Dim
makes the approach iterative and stable. Further, due to the distributed nature of
this approach, it facilitates the implementation of each module independent of the
implementation of the other modules. However, the solution of this new formulation
is not guaranteed to coincide with that of Eq. (5.12) in all the situations. Under
certain conditions, the iterative strategy converges to the correct solution3:

1mb%=p. an)

t—ioo

Thus, in our Bayesian approach, the problem of module integration is formulated
as the maximization of P(R,(D)|Dﬁ(‘,l_1),I), the posteriori density of a module M,
producing intensity maps similar to the observed I at time t (given the current
solution D211) and I). Notice that since the operation of a typical vision module
itself is very complex, ﬁnding the solution in Eq. (5.13) will essentially involve a
(enumerative) search even if the module and its implementation were perfect. Our

approach to obtain a practical solution to this problem consists of decomposing the

 

3The conditions under which the distributed solution corresponds to the exact solution are de-
scribed in Chapter 2.

166

 

Intrinsic Map

 

 

M°d“'° A Module B

Figure 5.3: Modular Integration (restricted interactions).

original formulation into the three components described below. While each of these
components could potentially make a suboptimal choice, we hope that an integrated
system can tolerate these myopic decisions due to the multitude of constraints exerted
by several modules, each compensating for the mistakes committed by the other. Such
simplifications have been routinely (and successfully) incorporated into both natural

and man-made system designs [182, 181, 156].

(i) Intrinsic Map Estimation The estimation of F, by module M, is known to
be a difficult and an unstable problem. This can be considerably simpliﬁed by
restating the original problem of independently estimating F, : I —) 5R” as an

incremental (iterative) estimation problem:

11, : Dial—1) x I -> Dim. (5.15)

167

This function may determine a feature which completely constrains the depth
values (e.g., disparities in stereo module) or a depth-related feature which may
partially constrain the depth map (e. 9., surface normals estimated by shape
from shading). In some other modules (e. g., perceptual organization module),

this function determines the boundary segmentation which indirectly constrains

t—l

the depth values [72]. The dynamics of such a feedback (involvement of Dru—1)

in estimating Dim) often reduces sensitivity of the system to the violation of

individual module assumptions and the operating parameters [42].

Coherence Function: To account for the spatial and systematic variations in
the module performance, we associate a conﬁdence map with each component
of the intrinsic map and module outputs. These confidence maps are used to

validate the individual module outputs.

Most vision modules are based on a simple imaging model. For instance, shape
from shading and stereo modules often assume Lambertian surfaces. In some
situations, model parameters are simply not available since they depend on
(unknown) scene geometry. Many of the existing integration methodologies
are biased against either data or synthetic constraints and, consequently, result
in serious artifacts. It is desirable that erroneous outputs from the individual
modules do not signiﬁcantly degrade the output of the entire system and each
additional module should leave the overall system output no worse than it origi-
nally found. In essence, we would like to assess the quality of the module output

at each image site, and discard the noisy components of the feature map before

168

permitting this output to modify the intrinsic map. This problem can not be
solved in its total generality. However, knowing that the features have origi-
nated from a physical surface helps us design a strategy based on the principle
of coherence [154]. Simply stated, this principle hypothesizes that if there is
a statistical interaction among an ensemble of estimates then they might have

been derived from the same physical source(s).

Computation of coherence involves using the scene geometry and the existing
intrinsic map. In our scheme, we validate a feature value at a location if it
‘agrees’ with its independent estimates derived from its neighbors. Note that
we are not relating an estimate at a site with the estimates at its neighbors;
we are relating an estimate of a variable at a site with the estimates of the
same variable at that site derived from the information at the neighbors using

a intrinsic map estimation function.

Let us illustrate a typical coherence function model:

c. : ({P1,P2,...,P°},P°,p) -+ (,6, 5,15), an. 6 ER.

Given a set of c attribute vectors {P1,P2,...,P°} and an attribute vector
of the same type P”, a coherence function evaluates the likelihood that all
the (c + 1) objects have originated from a single source and this likelihood is
represented by the conﬁdence value 6. The parameter n represents the conﬁ-
dence value that all members in {P1,P2, . . .,Pc} are coherent. P is the at-

tribute vector which is considered to be a representative of the entire ensemble

169

{P1,P2, . . .,Pc} U{P°}, if B > p and K. > p; p is a user-speciﬁed parameter.
Given the intrinsic map, the updating mechanism seeks the opinion of each
neighbor of a particular site (as, y) to predict the value of F,(a', y). If there is
a consensus of opinions among the neighbors (according to the coherence func-
tion (3,), and it is consistent with the F,(x,y) computed from D (since F, is
a known function of D), then we increase our conﬁdence in D(x,y). If there
is a consensus of the opinions among the predictions of the neighbors, but it
does not agree with F,(3:,y) computed from D, then we can replace F,(:1:,y)
by the consensus value prescribed by C,. If there is no consensus among the
predictions of neighbors of (x, y), then the value of F,(x, y) is not updated and

we decrease the conﬁdence value of D,(x, y).

(iii) Consistency Mapping: The problem of module integration (Eq. (5.13)) then
reduces to obtaining the most likely and consistent (re)interpretation of the

current solution Di-(tl—l) in the light of new (validated) evidence u,(D:(-,‘_,,,;r).

13:“, = arg mgx P(Dlu,(Df.(‘,1_1),I), 193-,1”). (5.16)

The integration scheme can now be expressed in terms of three major steps: (i)
a method for module output estimation (Eq. (5.15)), (ii) a method for estimating
the reliability of the module output, and (iii) a method for deriving the validated4

module output using (Eq. (5.16))5.

 

4A module output is validated using the associated conﬁdence map.
5For conciseness, we will be referring to Dial— 1) as Do”.

170

 

' Intrinsic
Depth Map Segmentation Map

 

 

 

 

Depth Coherence Consistency Vision
Estimation Function Mapping Module

 

 

 

 

 

Intensity Information

 

 

; J

Figure 5.4: Interaction Model (for one module at one level of resolution).

Fig. 5.4 shows our interaction model for one module at one level of resolution.
In this model, each module interacts only with the intrinsic map. Given a pair
of intensity images and an initial estimate of intrinsic map, each module reﬁnes a
component of the intrinsic map. Coherence module assesses the reﬁned map by the
degree of consensus among the independent estimates. Incoherent and noisy estimates
are discarded and used to reﬁne the segmentation of the image. Finally, the intrinsic
map is updated using consistency mapping, so that it is consistent with the ‘coherent’
estimates of the module.

This basic model described above needs to be supplemented with the additional

mechanisms to deal with the following issues:

1. Scale of Integration The interaction model will typically also involve neigh-

171

borhood deﬁnition JV, : S —> 25, where 2S is the set of all the subsets of
S. Given a site (z,y) E S, N, deﬁnes a set of sites which are considered as

neighbors of (3:, y).

2. Order of Module Operation: The proposed integration scheme in itself does
not impose any ordering constraints on the module operations. In fact, it is pos-
sible to implement the operation of the module concurrently. In our sequential
implementation (see Fig. 5.5), we adopt the following sequence of module op-

erations: perceptual grouping, stereo, shading, and shape from texture.

Initialization
I

Perceptual

Shading

  

Figure 5.5: Sequence of Module Operation (at one level of resolution).

3. Initial Intrinsic Map How should one obtain the initial guess for the intrinsic
map? This problem can be considerably alleviated by (i) the multi—grid, multi-
resolution processing (see Fig. 5.6), (ii) default assumptions, and (iii) proper
choice of models. Our system starts out with a planar depth map. A coarse-

level solution guides the next ﬁner level solution; the data-driven mechanisms

172

 

 

 

 

 

 

 

 

In 'nsic Ma
at Elven (L+l’)
Stereo Shape from
I Texture
De th Ma Se entation Intrinsic Map
at Level L
Shape from Perceptual
Shading , . Organization
lntrmsnc Map
at Level (L-l)

 

Figure 5.6: System Integration.

will iteratively reﬁne the initial output into more accurate intrinsic maps at

ﬁner levels.

. Segmentation issues: Most shape-from-X algorithms are sensitive to the do-
main assumptions. Indiscriminate application of a module to an arbitrary region
in an image usually results in nonsensical results. While it is desirable to local-
ize and restrict the scope of each module to appropriate regions in the image,
it is difﬁcult in practice to prespecify these regions. The principle of coherence
permits not only the rejection of the unreliable results from a given module, but
also a gradual reﬁnement of the existing segmentation to eventually achieve an

‘emergent’ segmentation.

173

Table 5.1: Vision modules for the proposed integration.

 

 

 

 

 

 

 

 

I Module I Strengths Problems I

Stereo Reliable short—range Correspondence and
depth information Occlusion

Shape from Shading Orientation estimation ir- Mutual illumination and
respective of distance ﬁne textures

Shape from Texture Orientation estimation ir- Texture segmentation,
respective of distance uniform albedo

Perceptual Organization Boundary completion Oversegmentation

 

 

 

 

5.2 Modules for Integration

Four modules have been chosen for demonstrating the utility of the proposed inte-
gration framework: shape from shading, shape from texture, stereo, and perceptual
organization. Only the ﬁrst three of these modules directly determine the depth;
the importance of these low-level cues in determining depth has been recognized in
the literature [116] and they display complementary strengths and limitations (see
Table 5.1). Perceptual organization module has been known to interact with every
other vision module and it helps determine depth by providing additional constraints
in adverse situations. While these four modules form a good combination for illus-
trating the concept of the integration methodology, note that any other module can
be incorporated into the implementation with equal ease because of our “uniform”
integration methodology. In this section, we will describe the problem solved by each
individual module. We will then present our formulation of interaction between the

individual modules and the intrinsic map.

174

5.2.1 Perceptual Organization Module

The primary role of the perceptual organization module in our integrated system
is to estimate and reﬁne signiﬁcant 3D boundaries. This boundary segmentation is
difﬁcult due to several confounding factors, including imaging noise and artifacts of
the boundary detection operators. We have, therefore, collated different sources of
information to reliably estimate these boundaries. A secondary role of the perceptual
organization module is anisotropic diffusion of the depth features, where the estimated

boundaries act as diffusion barriers [72].

Rewriting the generic modular integration equation (Eq. (5.16)) for grouping6

module,

15;, = arg mgx P(D|ug,,(p°’”, I), 130‘”), (5.17)

where, for conciseness, we have replaced Dial—1) by Dabs. In our present implementa-
tion, grouping module does not affect the depth map and reﬁnes the boundary map

only. Eq. (5.17) can, therefore, be written as:
B3,, = arg max P(Blugp(B°"",I), BO‘”). (5.18)

B is the boundary component of the intrinsic map. For conciseness, we will refer to
Ligp(B°b’, I) as B”. Thus, the term P(BIB”, BO‘”)) denotes the posteriori probability
of boundary map B, given the boundary map produced by grouping module (B9?)

and the current state of boundary map (Bd’s).

 

6We will be using perceptual organization module and grouping module synonymously.

175

Boundary Estimation (119,, : D0“ x I —> B): We use (i) an edge segmentation
of the intensity image using Canny edge operator [35]), (ii) a region segmentation of
the intensity image using a split-and-merge algorithm (Pavlidis [149]), (ii) an edge
segmentation of current depth map (from Do”) using a Sobel operator, and (iv)
the current boundary map to obtain an initial estimate for the object boundaries.
Each terminal boundary pixel in the current boundary map B0“ is extended in the
tangential direction when supported by two of the three remaining segmentations.
Let us call this representation B’. We then detect corners and terminations in the
resultant representation. A Voronoi tessellation V of the corner and termination
pixels augments B’. Thus, B* = B’ U V is an initial estimate of the reﬁned boundary
map.

Coherence Function: The design of this component is based on the grouping
module described in Section 5.2.1. The coherence at a boundary site b,- E B’ is
estimated as:

'f_ b?
@- = #, (5.19)

where r(= 4) is the number of segmentations participating in the perceptual grouping
and bf is 1 if a boundary exists at the site i in the jth segmentation.
The estimation of coherence of Voronoi edges v, E V at site i uses a different

7

procedure . The signiﬁcance of each Voronoi edge is determined by the following

factors:

1. Proximity Voronoi edges which are short will be considered perceptually more

 

7Some of this description is already described in Section 5.2.1

176

signiﬁcant than those which are longer. The contribution of a Voronoi edge
of length d connecting the edges of length D1 and D2 is deﬁned as E, =

'w,,d2 /D1D2, where w, is the relative signiﬁcance of proximity attribute.

. Curvilinearity Voronoi edges which are in the tangential direction of the edge
terminations are more signiﬁcant. A Voronoi edge which subtends angles 61
and 62 with the terminations of the edges at its either end will contribute E; =
103(61 + 02)/47r to the objective function, where u), is the relative signiﬁcance of

the curvilinearity.

. Cotermination It is the common point shared by terminations of two (or
more) smooth boundaries. Coterminations are perceptually signiﬁcant. We
estimate the contribution of a cotermination to the objective function by E,
= uJC2/(n1+ n2), where n1 and n2 are the numbers of Voronoi neighbors of
each termination and "wC is the weight indicating the perceptual signiﬁcance of

cotermination.

. Depth Gradient Signiﬁcance of a Voronoi edge is proportional to the average
depth difference across its length: Ed 2 de, where AD is the difference of

depth across the Voronoi edge of length L.

. Brightness Gradient Signiﬁcance of a Voronoi edge is proportional to the
average intensity difference of the regions it abuts. Instead of computing a raw
intensity gradient, we consider difference in means of intensity values of the

adjoining regions as a reliable indicator of this criteria. The contribution of

177

the brightness gradient is measured by E, = mg %, where M is the maximum
number of gray levels in the image and G is the average difference of intensity

across the length of the edge.

Total perceptual signiﬁcance of a Voronoi edge is
E :2 Ep + E1+ EC + Eg + Ed. (5.20)
The coherence of a Voronoi edge 21,, then, is determined as follows:

5— E
‘_wp+w,+wc+wg+wd

 

(5.21)

From the set of c Voronoi edge alternatives at an edge termination (or a corner) with
the corresponding coherence s {51,52,...,ﬂc}, all the Voronoi edges except with
leg, 2 ﬁg, 2 maxf:1 6‘ are discarded, provided that its coherence is above a threshold

p912:

Consistency Mapping

P(ng, B°”5|B)P(B)

P(Blnga Bobs) : P(ng Bobs) ’

 

(5.22)

where B is a binary map of the reﬁned boundaries, and B” and 80’” are the out-
put of the perceptual grouping module and the current boundary map (from the
intrinsic map), respectively. As the denominator in Eq.(5.22) is independent of B,

the maximization of P(BIBW, BO’”) essentially requires maximization of the product

178
P(B9P,B°”"IB)P(B). The ﬁrst term in the product is the data term which depends

on the output of perceptual organization module. The second (model) term imposes
desirable characteristics of a boundary map. The present implementation assumes
a spatially uniform distribution of boundary elements; our problem then reduces to

maximization of P(ng, BObSIB).

The data term, P(BQ”, BObslB) is given by (Chu and Aggarwal [41]):

P(Bg"(i,j),B°"’(i,J')IB(i,j)) 0<

1
x —- 9p k,l k,l — i, ' , 5.23
me p 2 ﬁ( )||(( ) (1))” ( )

89P(k,l) U Bobs(k,l)eN(z’.j)

 

where N (i, j) denotes the neighborhood of site (i, j). We make the simplifying

assumption that the scale factor x/L— does not signiﬁcantly inﬂuence the form of
9?

 

the P(ng, BObSIB) and consequently discard its. By assuming spatial independence

among the boundary elements, we have

P(Bg". B°’”IB)) oc II P(Bg"(z'.j), B°’”(z',j)IB(z'.j))
(231')

cc exp —2 Z ﬁgs/m) (HUM) -— (231)“) (5.24)

(131') 39P(k,l) U B°b’(k,l)€N(i,j)

The overall optimization problem then reduces to maximizing individual terms at
each site, P(ng(i,j), Bab‘(i,j)IB(i,j)). Let S(i,j) be the set of all boundary coor-

dinates in a neighborhood of (i, j) in the set 39,, UBO’”. The required optimization

 

8This assumption about the scale factor is not often true and has negative impact on the perfor-
mance of the integrated system. However, use of this assumption makes the solution tractable.

179

problem reduces to the following closed-form solution:

 

1 3 (k,l) such that (i,j) : Zﬂllip(’ll:)zs)j(k”)
B(i,j) Z 9’ ’ (5.25)

0 otherwise

5.2.2 Stereo Module

Rewriting the generic modular integration equation (Eq. (5.16)) for stereo module,
15;, = arg mgx P(DlLl,,(D°’”, I), DO’”), (5.26)

where, for conciseness, we have replaced Dig; ” by DO’”. In our present implemen-
tation, stereo module does not affect the boundary map and reﬁnes the depth map

only. Eq. (5.26) can, therefore, be written as:
D5,, = arg max P(DIU,t(D°"S,I), Do‘”), (5.27)

where D is the depth component of the intrinsic map. For conciseness, we will refer to
U,t(D°”‘,I) as D". Thus, the term P(DID”, D°b3)) denotes the posteriori probability
of depth map D, given the depth map produced by stereo (DS’) and the current state
of intrinsic map (DOb’).
We use the stereo module proposed by Weng et al. [195] (see Section 3.3). The
components for integrating the stereo module into the system are described below.
Disparity Estimation (L1,; : DO‘” x I -—+ ’P): Let (1‘, d' be the current right

and left disparity maps derived from DO’”. The two-way matching results in two

180

correspondence vector ﬁelds, ’P’ and 73' (u E (i, j)):

’P‘(u) = arg min 2: Zw,[Rf(u,d)I, (5.28)

d€N(d‘(u)) “61‘ i

where N(.) denotes a neighborhood function. Similarly,

73’(u) = arg min 2 Zw,[Rf(u,d)], (5.29)

dEN(d’(U)) “6L ,-

where Ri(u,d) is the residual contributed by the 2'“ attribute image at location 11
due to correspondence vector d, and w,- is a prespeciﬁed weight associated with the
residual Ri.

Neighborhood Deﬁnition: A standard 4-pixel neighborhood N is used.
Coherence Deﬁnition: Two coherence functions are deﬁned, one each for left and
right stereo matching. Here, we describe the coherence function for the right stereo
matching. We hypothesize a correspondence at (:13, y) deﬁned by ’P, (as, y) to be coher-
ent if the back-projection of neighborhood of its correspondence, N((m, y) +P,(;r, y)),
prescribed by ”P; falls in the neighborhood N (:13, y). Farther the back-projected cor-

respondences fall, lower is the conﬁdence.

CStCT€O(F($) y), P(x, 3]), pstereo) : (ﬁstereoa ”stereo, 13(1’3 y)), (530)

~

where ’P(:I:,y) is the disparity at (2:,y) determined by the intrinsic map, F($,y) =
{(mm) 5 (k,l) + P,(k,l);(k,l) E (233') + Pr(i,j);V(i,j) 6 N($,y)} is the set of

disparity estimates at (x, y) by the stereo module in the neighborhood of (x, y). Like-

181

lihoods ﬁstemo and nstmo are measured using the variance estimates of F(a:, y) and

F(:r, y) U{’P(a:, y)}, respectively.

~

P(CL‘, y) if pstereo > 'Bstereo
73(559 y) = median(F(z,y) U{’P(:1:,y)}) if pmm, > Esta-co (5-31)
median(F(:r, y)) otherwise

We have used pstmo < 4 pixels.

Consistency Mapping Given the depth maps DI and D" corresponding to the
left and right (validated) disparity maps, P‘ and ’P' and their associated likelihoods
6' and S”, how to reﬁne the estimate of the depths D in the intrinsic map? We resort
to the Bayesian estimation. Let D3t be the depth map estimated by a (right or left)
stereo module and 53‘ be the associated estimate of its likelihood 9 (st 6 {l, r}):

Dst, ,8“, DobsID)P(D)
P(Dst, ﬁst, Dabs)

 

P(DIDSt,IBSt,DOb3) : P( (532)

Since the denominator in Eq. (5.32) is independent of D, the MAP estimate of D

involves maximizing the product of P(D“, ﬂ“, DObSID) and P(D).
Let B E {B(i,j)|i = 1,. . .,n;j = 1,. . .,n} represent the binary boundary map.
B(i, j ) = 1 indicates the presence of a boundary at location (i, j) and B deﬁnes closed

regions ”R", r = 1,. . .,p. The data dependent term P(D", 5”,D°”’|D) at location

 

9Eq. (5.32) is applied twice; once for the left stereo matching and once for the right stereo
matching.

182

(i, j) can be estimated as

P(Dstm), W231), D°’S(i,j)lD(i,j)) o< 7773—2'7 exp(-kdeAiJ», (5.33)

6.1033) = ﬁ"(i,j)(D”(i,j) - D(MW- (534)

We make the sim lif in assum tion that the effect of the scale factor 1 is

 

negligible and discard it. By assuming that the likelihoods ed(i, j ) are spatially inde-

pendent, we obtain,

P(D|D3’,ﬂs’,D°bs) o< exp( —=kded) H exp( -—(kded (i ,j)) (5.35)
(U)

ed = g ; B”(i,j)(D”(i,j) - D(Mllz- (5-36)
The prior term P(D) is chosen to favor smooth depth proﬁles:
P(D(i,j)) 0< exp(-kses(i,j)), (537)
where
6303]) = 03031) + 193,031), (538)

and D,(i, j) and Dy(i, j) denote partial derivatives of the depth map in a: and y

direction, respectively. These partial derivatives are estimated as follows.

183

O (i-l.j)

 

\J

O . .
(i,j) (LJ‘H)

 

(i+l,j)

Figure 5.7: 4-neighborhood system.

, , D(i+1.j)—D(z'.j) if<i+1.j).<i.j)evzp.
DAM) = (5.39)

0 otherwise

, , D(i,j+1)—D(i,j) z'f (i,j+1),(i,j) €73”,
Dy(z,]) = (5.40)

0 otherwise

By assuming that the likelihoods e,(i, j) are spatially independent, we obtain,

P(D) oc exp( —k 36,) 2H exp( —k ,e,( i j,)), (5.41)
(U)

where
= Z 2 [1930.0 + 03033)] . (5.42)
i 1'
Thus MAP estimation in Eq. (5.32) is identical to maximizing

P(DID".16",D°"’)0< exp(—kded)exp—(k.e.) (5.43)

184

o< exp(—(kded + 19363)) (5.44)

o< exp(ed + Ases). (5.45)

From Eq. (5.45), we observe that the problem of MAP estimation is identical to

minimizing

est = ed + A3th (546)

«2.. = :2 [ﬁ3’(D”(i,J’) - Danﬂ + A... Z Z [Dian + Bath] . (5.47)

If we assume a 4—neighborhood system, then the resultant stereo estimate is given by

the following equation [15]:

ﬁ”(i,j)D"(i,j) + Asmidl
ﬁ8t(iaj) + A3 ’

 

D(i, j) = (5.48)

where D(i, j) is the local (4—neighborhood) average at D(i, j). The true average

(D(i, j)) is not known and is approximated by D0b8(i, j).

5.2.3 Shape From Shading

Rewriting the generic modular integration equation (Eq. (5.16)) for the shading mod-
ule,

15:. = argmgx P(Dlush(73°”,1), 12°“). (5.49)

where, for conciseness, we have replaced Dial) by Do‘”. The shape from shading

module does not affect the boundary map and reﬁnes the depth map only. Eq. (5.49)

185

can, therefore, be written as:
D:,, = arg mgx P(DIU3h(D°”",I), Do’”), (5.50)

where D is the depth component of the intrinsic map. For conciseness, we will refer
to U,h(D°b3,I) as D“. Thus, the term P(DID‘h,D°”3)) denotes the a posteriori
probability of depth map D, given the depth map produced by shading (D’h) and

the current state of intrinsic map (DObs).

Given a Lambertian image surface with constant albedo 17, the image irradiance

equation relates the brightness (E) at a position (:::, y) with the surface normal,
11 E (n,, ny, 72.2), by E(:r, y) = nnsn, where n, is the unit vector pointing towards the
illumination source. The goal of shape from shading module is to recover the surface
orientation (normal) It at each pixel, given the intensity image.
Surface Normal Estimation (Us), : DO’” x I ——> (9): Given a surface orientation
n(:z:, y) at site (:23, y), usually two solutions exist for surface normals for the neighbors
in the directions (15 and —¢. These solutions correspond to the maxima of a constant
brightness equation [80]. Given the current estimates of the depth (from the intrin-
sic map) and intensity image, our shape from shading module directly estimates the
depth at each pixel using Oliensis’s algorithm [144]. However, the depth informa-
tion thus obtained is only qualitatively correct; the consistency mapping function is
designed to utilize the relative depth information from the depths obtained by the
shape from shading module.

Neighborhood deﬁnition: We use a standard 8-pixel neighborhood, N.

186

Coherence Function:

Cshading(B($a y), 15(3) y): pshading) = (ﬁshadinga ”shading, ﬁ($r y)), (551)

where ﬁ(:r, y) is the surface orientation at (:r, y) computed from intrinsic map,
B(x, y) = {n(i,j), (i,j) E N(m, y)} is the set of surface orientation estimates at (:r, y)
by shape from shading module in the neighborhood of (:r, y). Likelihoods [33,",de and
I‘Cshadmg are measured using the variance estimates of B(x, y) and B(x, y) U{ﬁ(:z:, y)},
respectively. We deﬁne the median of a set of vectors to be the nearest Euclidean
neighbor of median of individual components and variance of the set to be an arith-

metic mean of the individual component variances.

ﬁt”: y) if pshading > ﬁshading
ﬁ($a y) = median(B(a:, y) U{ﬁ(:1:,y)}) if pshading > rimming (5'52)
median(B(a:, y)) otherwise

Consistency Mapping Usually, albedo of the scene surfaces is not known. In
such a situation, the depth map obtained from the shape from shading module is
only qualitatively correct and can not be directly related to the depth map obtained
from stereo module. Then, how should the information obtained from the shape
from shading module be meaningfully used to improve the depth map obtained from
stereo? Given a current estimate of the depth, the boundaries from the intrinsic map
and an estimate of the shape from the shading module, we propose to use a Bayesian

strategy to arrive at a reﬁned estimate of the depth map.

187

Let D E {D(i,j)|i = 1,. . .,n;j = 1,. . .,n} represent depth map random
variables“). Let B E {B(i,j)|i = 1,. . .,n;j = 1,. . .,n} represent the binary bound-
ary map. B (i, j) = 1 indicates presence of a boundary at location (i, j) and B deﬁnes
closed regions R’, r =1,...,p. DS” E{D3h(i,j)Ii=1,...,n;j=1,...,n} denotes
the ‘depth’ map obtained using shape from shading module.

P(Dsh, D°b3|D)P(D)

P D Dsh Dobs =
( l 1 ) P(Dsh)

 

(5.53)

The denominator is independent of D, therefore, the estimation of reﬁned depths
reduces to computation of P(Ds”, DObSID) and P(D). The ﬁrst term is referred to as
the data term and the latter term is referred to as the model term or the prior term.
We model the data term as the consistency between the existing depth map and the
(relative) depth map obtained from the shading module; the prior term is modeled

as the piecewise smooth depth map. These terms are described below.

1. Prior density of D, P(D): The surface depth varies smoothly over each image
region of constant albedo. The measure of departure from smoothness of the
surface at pixel (i,j) can be expressed by (D;,(i,j)2 + Dy(i,j)2), where D,(i,j)
and Dy(i, j) represent partial derivatives of depth map at D(i, j), in :1: and y

direction, respectively.

P(D(i,j)) 0< exp (_kses(iaj))a (5-54)

830,3.) = 21' zj(Dz(zaj)2 + Dy(ivj)2)v (555)

 

10Recall that N (= n x n) is the image size.

188

where k, is a constant. The partial derivatives are estimated as follows.

Di 1,'—Di,' i ' 1,,i, RP,
DAM) = (+ J) (J) f(2+ J)( J)6 (556)

0 otherwise

, , D(i,j+1)—D(z‘,j) if(i,j+1),(i,j)e72".
Dyan = (5.57)

0 otherwise

By assuming that the likelihoods e,(i, j) are spatially independent, we obtain,

P(D) or exp( —k 363) 211 exp( —k se,( i ,j)), (5.58)
(1 .7)

= Z Z: [D§(i,j) + D§(i,j)I- (5.59)

2. Data Term, P(Ds”,D°"SID): The resultant shape obtained by the integrated
system should conform with the shape obtained by the shading module. More
speciﬁcally, the surface orientation at each point (i, j) of the reconstructed sur-
face, D, should be consistent with the orientation of the surface estimated at
that point by the shading module alone. The aforementioned consistency can

be measured using:

P(D‘“(i.j),D°“(z'.j)lD(z'.j)) o<

1 . . . .
_— exp ("kcﬁshadingav])6c(z’]))a (560)
V [Bshading

(5.61)

189

where

ec(i,j) = (0503]) - D§h(i,j))2+

(Dz/(233’) _ D;h(i,j))2, (562)

where k, is a constant and Di“ and D;” are deﬁned similar to D, and D3,

(Eqs.(5.56) and(5.57)), respectively. We make the simplifying assumption that

1 . u . . .
the effect of the scale factor ——m 1s negllgible and hence discard 1t.

By assuming that the likelihoods ec(i, j) are spatially independent, we obtain,

P(Dshv Dable) 0‘ 9X1) (—kcﬂshadingec) =

11 exp (‘kc/Bshadinguajlecuajlla (563)
(is)
(5.64)
where
ec = 22123,- [ (D3033) - Dal”(i,j))2
+(Dy(z',j) — D;"(i,j))2] . (5.65)
Combining Eqs. (5.53), (5.54), and (5.63), we have
P(DIDSh: Dabs) 0‘ exp (_kses) exp (—kcﬂshadingec) (566)

(X exp (—kcﬂshadingec) exp (—kses) (567)

190

OC eXp —(kc,63hadingec + kses)
OC exp "(kcﬁshadingec + 16385)

= Kexp —(ec + A63):

(5.68)
(5.69)
(5.70)

(5.71)

where K is a constant and /\ decides the signiﬁcance of the shading module output

at each site. Bayesian (optimal) estimate corresponds to D which maximizes the a

posteriori probability P(DIDS”). Eq.(5.71) implies that the problem of obtaining a

maximum a posteriori estimate of D is identical to minimizing

e =ec+Ae,.

Differentiating e with respect to D(i, j), we obtain

68 686 aes

606', j) = 60(2', j) + "”jlopa, j)’

 

 

where

6e,
512033)

 

= —0.5(D(i + 1,1) — 09.7)) — 05(061’ + 1) - D(i,”)

+0.5(D(i,j) — D(i,j - 1)) + 0.5(D(i,j) — D(i — 1,j)).

(5.72)

(5.73)

(5.74)

191

After differentiation11 and rearrangement, we have

('36,
30(2) j)

 

= 2 (D(i,j) — D(i,j)), (5.75)

where D(i, j) denotes the local 4-neighbor arithmetic mean of D(i, j), and

86

3190.1) = —0-5(D(i + 1,j)— D(i,j)) + 0.5(D3h(i+1,j) —— D3h(i,j))

 

—O.5(D(i,j + 1) — 09.7)) + 0.5(D8"(i.j + 1) — D’h(i.j))
+0.5(D(i,j) — D(i,j—1))— 0.5(D’h(i,j) — D3h(i,j — 1))

+0.5(D(i,j) — D(i—1,j))— 0.5(D3h(i,j) — 03"(2' — 1,j)). (5.76)

Again, rearrangement results in

Be
300,3)

 

= 2 (Day) — 59.7))

—2 (0%, j) — 53%, j)). (5.77)

Combining Eqs. (5.75) and (5.77), we have,

8e
300,1)

 

= (1+ Aw» (Dar) — 39.7))

—A(i,j) (Dre's) — 13%.»). (5.78)

 

11We use ﬁnite differences (on 4-neighborhood system) to approximate one-sided differentiation.

192

Setting 81')D( to zero gives us an iterative solution of the depth estimate:
.. ..—.. ..8..—sh..
(1+ /\(z,J))(D(z,J) - 00.1)) = 4(2,J)(D "(z,J) - D (2.1)) (5-79)
This results in an iterative solution of the depth estimate:

D"+1(i,j) =E"(z', j)— 1—_:-(::(—':,)]—,)-ADS”(i,j)n, (5.80)

where ADS”(i, j)" is the magnitude of the mean gradient at (i, j) in 03” at the n‘h

iteration and is given by

ADshm)" = 1(Dsh(z'+ 1.1) + Dsh<m+1>+
DS"(2‘ — 1.1) + D“"(z‘.j — 1))

—D3"(i,j), (5.81)

and D0 2 Do‘”. In practical situations, albedo of the scene surfaces is not known
a priori and hence shape of the surface obtained from shading information is only
qualitatively correct. In order to avoid the instability in the reconstruction pro-
cess, Eq. (5.80) is applied only when the sign of AD’h(i, j)" differs from the sign of

AD(i, j)", where

ADM)" = 31(D(z' + 1.1‘) + D(zyj +1)+

D(i—1,j)+ D(i,j-1)) — D(i,j). (5.82)

193
We will refer to % in Eq. (5.80) as ashadmg(i, j). Parameter ashading(ia j) is the
coupling coefﬁcient between the shape from shading module and the intrinsic map at
the site (i, j). Notice that the correction term in Eq. (5.80) is not based on any precise
calibration, but is set to an arbitrary monotonic function of the depth depending on
the value of ashading. In practice, we have seen that the performance of the system
does not critically depend on the value of ashadmg as long as it is sufﬁciently small

(ashading S 0.05). In the present implementation, we assume ashadmg to be a global

constant. We have used pshading g 0.2.

5.2.4 Shape from Texture

Rewriting the generic modular integration equation (Eq. (5.16)) for texture module,
D2,. = arg mgx P(Dlut,(D°‘”,I), DO’”), (5.83)

where, for conciseness, we have replaced [DR—1:1) by DO’”. In our present implementa-
tion, texture module does not affect the boundary map and reﬁnes the depth map.

Eq. (5.83) can, therefore, be written as:
15:, = arg mgx P(Dlu,,(v°b3,;r),1)°b8), (5.84)

where D is the depth component of the intrinsic map. For conciseness, we will refer to

Ut,(D°”’, I) as D”. Thus, the term P(DID", D"”")) denotes the posteriori probability

194

of depth map D, given the depth map produced by shape from texture module (D’x)

and the current state of intrinsic map (DOM).

The algorithm we adopt here is a variation of Super and Bovik’s [178] shape from
texture algorithm described in Section 3.5. This algorithm assumes that the surface

texture is uniform and can be correctly segmented.

Surface Normal Estimation (L7,, : Dab” x I —> 0): Let g(;1:,y) =
(g,(:1:, y), gy(a‘, y), gxy(:1:, y)) represent :13, y, and my squared image moments of a texture
3

feature12 at (2:, y). Given image moments g(:1:, y) and surface orientation n E (o, r)1

at a site (51:, y), surface moments can be estimated by:

3 _
g (13,31) — T*g($1y)v (5.85)
where matrix T is given by
cos2 0 cos2 7 cos2 0 cos r sin r cos2 a sin2 r
T = —2 cos 0 cos r sin r cos (r(cos2 r — sin2 r) 2 cos 0 cos r sin r (5-86)
. 2 - 2
sm r — cos r sm r cos r

 

 

Canonical image moments (M, m) and angle 6 are determined from image moments

g(:1:, y) by (after dropping the qualiﬁer (x, y) for brevity)

 

 

[M , mlT = % [(9. + 9y + do? + (9x — 91y)2),(gx + 937 — ([93 + (91 — 91y)2)I(5-87)

 

12Our implementation uses intensity gradients.
13For conciseness, we represent orientation as (slant, tilt) pair.

195

0 = %arctan 973’; (5.88)

Similarly, surface canonical moments can be computed. Let p(:1:1, yl) and q(x2, yg) be
two sites located on a uniform textured surface. Given surface orientation nq at a site
(162, yg), and image moments of a texture feature at site p and q, surface orientation

np = (n,, ny, n,) at p can be constrained by the following system of equations:

 

 

 

_ Mum . 1mm . 1

”z — Moiihmaliyi’i 1 (5'89)
n, = cos r, (5.90)
ny = \/1— (nap)2 - (nz)2, (5.91)

where M 3 (51:2,y2) and m3(:1:2, yg) are maximum and minimum canonical surface mo-
ments of the selected features at site (332,312); M(:1:1, yl) and m(:r1,y1) are the corre-

sponding image moments of the same feature at site (21:1, yl).

1
0 :1: éarccos A
., ___ (5.92)

6i§arccosA+7r

20 _ s 1. s :r .
,\ = (cos +’)[M(””y’);:’g(:II’Z”_,:E’I ( 2’y2)+”’ ‘ 2"”2”, Slno 75 0, MA ¢ ma, (593)

where 6 is the orientation determined by canonical image moments M ($1,311) and

m($11y1)14'

Since shape from texture module also estimates surface orientation (although from

 

1“Among the four estimates of r, only the one which is the closest to the current surface normal
estimate is considered.

196

different features), we have essentially used the same neighborhood deﬁnition and
coherence mapping described in Section 5.2.3. We will denote the corresponding
parameter set associated with shape from texture module by (amme, ptexme). The
values of these parameters are identical to the values of the corresponding shape from

shading parameters.

5.3 Computation of Reliabilities

Given input intensity images, a vision module outputs a map of intrinsic variables
or of variables which partially constrain the intrinsic map. The reliability of the
module output is not spatially uniform due to several reasons. First, not all parts
of the input images have the information desired by a module; the module output
from the corresponding parts of the image are erroneous. Secondly, the assumptions
made by a given module might be violated in some regions of an image invalidating
the module output in those regions. Finally, recovery of structural information from
intensity images inherently depends upon the imaging geometry and the reliability
of resultant module output might systematically depend upon both the scene and
imaging geometry.

Assessing the reliability of a module at each pixel is important since such infor-
mation directly helps in determining to what extent each portion of a module output
is to be believed. It would be useful to obtain a measure of reliability of a module
output for each part of the input images. Such an index will associate a conﬁdence

value R(:r, y) 6 ill with output of the module at each site (:12,y) in an image. This

197

reliability information might then be used for discarding highly unreliable portions
of a module output or for relating a module output with the other module outputs.
Unfortunately, consistent and accurate estimation of reliability of a module is an ex-
tremely difﬁcult task in its most general form. We brieﬂy discuss a few approaches

to solve this problem and the limitations associated with each of these approaches.

1. Intramodular approaches: Typically, these methods make assumptions
about the imaging process and estimation of reliability is based on operation
of the module itself, i.e., the method of extraction of the requisite information
by the module. The intramodule assessment of the output usually does neither
verify whether the required information is available nor question the module
assumptions. For instance, in Figure 5.8, we show an image of a cylindrical
surface with circularly shaped stripes along its circumference. a typical shape
from texture module will infer a ﬂat surface in the image with a high degree
of reliability. Similarly, a typical shape from shading algorithm would predict
a curved surface after analyzing the intensity proﬁles across the photo frame in
Figure 5.9. Therefore, the reliability of a module output (as estimated by an
intramodular approach) is not always an indicator of its correct behavior. We
now describe an intramodular method of assessing the reliability of the stereo

module.

In this section we will brieﬂy describe a method of assessing the reliability of a

module using analyses of the stereo module presented in Weng et. al [196]. Let

198

Figure 5.8: An Image of a cylinder.

 

us ﬁrst consider the following generic linear problem:
y = AD + 6y, (5.94)

where we need to estimate the parameter p. Note that 6,, is a random vector
contaminating the measurements y. Let the expectation of 6,. be zero (8(5) =
0) and its covariance matrix be Fy = £(eyey’). According to Gauss-Markov

theorem [175, 65], the unbiased linear minimum variance estimator of p is

p = (A‘I‘;1A)A‘I‘;1y (5.95)

199

 

Figure 5.9: Image in an image.

whose error covariance matrix is
. . _ '1
F6 E 5((1) — p)(p - p)’ = (A’Fy’A) - (5-96)

This estimator is equivalent to the least-squares estimator with weight matrix

1"; 1 minimizing objective function (Weng et. al [196]):

(y - AP)’ F371 (y - AP) - (5.97)

Let us now consider the estimation of disparities by the stereo module. This
algorithm matches four different attributes of the intensity images (see Sec-

tion 3.3). The stereo module assumes Lambertian scene surfaces and posits

200

that the image attributes of the corresponding pixels in a stereo image pair are
identical. The difference in these image attribute values at the corresponding
pixels (residuals) contribute to an error function which guides the matching

process.

Given an estimate of the correspondence (disparity) vector ﬁeld, the stereo
matcher at a given level of resolution obtains a reﬁnement of the correspon-
dence (disparity) vector ﬁeld by minimizing a weighted sum of squared residuals

(errors):

mdin Z ZWiIRi(u, d)]2, (5.98)

where d is the correspondence (disparity) vector ﬁeld, Ri(u, d) is the residual
contributed by the i‘” attribute image at location 11 due to correspondence
vector (1, and w,- is a pre-speciﬁed weight associated with the residual R) (with

W = diag({w;})). For instance, intensity residual is deﬁned as follows:

Rm(u, d) = I,(u + d) - I,(u). (5.99)

Expanding R at d = d’ + 6,, and assuming that higher order terms of ||d — d’II

are negligible, we have

R(u, d) = R(u, d’) + J (d — d’), (5.100)

201
where J = 21%?)-

Rewriting Eq. (5.100), we obtain

My 2 Md — R(u, d). (5.101)

—R(u,d)+ 6d 6d

This equation can be related to Eq. (5.94) where the term —R(u, d’) + (Egg-Lad
could be considered as y, A = anag’d , p = d, and 6y 2 —R(u, d). The
least squares solution (Eq. (5.98)) and the minimum variance estimator of p in

Eq. (5.94) are related through Eq. (5.97). If we make the simplifying assumption

that cy = 021 and W :2 1111, it can be observed from Eq. (5.96) that

I}, o< (JTJ)‘1. (5.102)

Thus, the conﬁdence in the stereo module output is directly proportional to the
magnitude of J (evaluated at the ﬁnal solution). Relating J to the reliability of
the module is intuitive: since steeper the R(u, (1) proﬁle at u with the variation
in d, the more likely that the salient image features (as opposed to noise ﬂuctu-
ations) are contributing to the matching process. In addition, a large residual
value R(u, d) could indicate a larger likelihood of the solution being spurious.
However, all these heuristics are merely gross indicators of the reliability of the
module and tend to be inconsistent and even misleading in rather simplistic

images.

Consider a simple synthetic stereo image pair of a cylinder shown in Figure 5.10.

202

Figure 5.11 illustrates some of results derived from matching the stereo pair in
Figure 5.10 using the stereo module. Figure 5.11(a) shows the intensity proﬁles
of the cylinders in the 256‘” row. Figure 5.11(b) shows the error in the depth
computed from the stereo module for the 256“ row. In this illustration, we
will only consider the intensity attribute of the image. Figure 5.11(c) shows
the intensity residuals computed in the 256‘” row at the ﬁnal solution. Fig-
ure 5.11(d) depicts 3,, = amgéxu’d , where dz. is the x-axis component of the
disparity. From these illustrations, it could be observed that (i) the large ﬁnal
values of the residuals are not directly related to the error in the disparity com-

putation and (ii) the higher magnitude of J?” does not necessarily equate to

mt

more accurate disparity values.

In our experience, one of the the primary sources of error in the stereo module
output is due to the violations of the implicit assumptions embedded in the
stereo module. For instance, the stereo module we use assumes that all the
pixels in each stereo image are visible to both the cameras and the corresponding
pixels possess similar image features. Image regions representing occlusion, limb
edges, etc., conspicuously violate these assumptions and result in an erroneous
disparity output which can not be assessed using this method of reliability

estimation.

Some other criteria for computing the conﬁdence measures have been suggested
in the literature based on a spatial variation in the magnitude of residuals with

respect to change in observed disparity. For instance, Singh and Allen [173]

[Q

203

   

(a) (b)
Figure 5.10: Synthetic stereo pair image of a cylinder (size 512 x 512).

propose a conﬁdence measure based on the eccentricity of this spatial distri-
bution. These and other measures suffer from similar limitations when the
correspondence estimation based on similarity of the image attributes is not
adequate. Hence, caution needs to be exercised in the use of conﬁdence mea-

sures based on intramodule approach alone.

. Empirical approaches: These approaches are based on the known facts about

a given vision module. These facts may have been either derived from psy-
chophysical experiments or from the human neurophysiological data. For in-
stance, there is some evidence that the weight assigned to each cue is computed
from an unrelated “ancillary” cue [72]. In some situations, the human visual
system might have empirically learnt to weight the stereo module output more
heavily than the output from shape from shading module. However, such heuris-

tics are not readily known and tend to display inconsistent performance in real

204

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10
5 .-
o
.5 .
-1o ,
is i 40 l
.20 . so
25 l 40 L 1
300 100 200 300 can 500 000 -5o° 100 200 am «so am one

(C) (d)

Figure 5.11: Error analysis of synthetic stereo image pair in Figure 5.10. This ﬁgure
shows the results for the 256‘” row only: (a) Intensity proﬁles of the left and right
stereo image in the 256“ row. (b) Error between the observed and the true depths;
(c) Intensity residuals; ((1) Sensitivity of the intensity residuals.

situations.

For instance, it is commonly known that the stereo modules perform better in
the regions of high contrast. Consider Figure 5.12. Figure 5.12(a) shows the
right image of a stereo pair of sphere. Figure 5.12(b) depicts the edge output
from Canny operator (only 10% of the strongest edges shown). Figures 5.12(c)
and (d) show the depth map obtained from the stereo module and ground truth

depths, respectively. Note that the regions of large error in estimates do not

205

directly correspond with the regions of higher edge density.

. Intermodular approaches: Here the communication among the modules
themselves is used for resolving the uncertainty about the overall system output.
As mentioned in Chapter 1, there are several methods of resolution: accumu-
lation, cooperation, competition, veto, disambiguation, promotion, consensus,
coherency, etc. Since the information from several independent modules are
juxtaposed in this approach, such an approach has better opportunities to cor-
rectly assess output of individual module. In this thesis we have used a novel
method of assessing the reliability of the module output based on coherency.
In the present implementation we do not weigh the module output at a site
by the corresponding conﬁdence value (except in the case of stereo integration
(Eq. (5.48)) and uniformly treat the outputs at all sites. This means that the

values of A and ﬂgp in Eqs. (5.80) and (5.25) are assumed to be globally constant.

The conﬁdence maps of the module output inﬂuence the weighing of the evi-
dence gathered by that module. The present approach does not estimate the
reliability of the intrinsic map itself and the information regarding the reliabil-
ity of a given solution is lost after each module performs its update. In a more
general solution, a conﬁdence map could be associated with each component
of the intrinsic map. Given a module output along with its conﬁdence map,
the current state of intrinsic map, and the current state of the conﬁdence map
associated with the intrinsic map, it is possible to use a Bayesian framework

to updating not only the intrinsic map but also the conﬁdence map associated

206

 

 

 

 

 

 

 

 

   
 
 
   
 
 
 

\
“)‘1 ,

  
 
  
   
    

 
  
 

‘r—r -

   

\

  

  
 

        
   
  
  
  

 

‘
\

\f\~, .
1'9

‘I‘

\{
\‘

  

 
  

  

   
  

. . . . ‘
a‘ill ./ /"

‘ i“) ’ :14 1') ; "l

I 1, -. h. I I.

‘liiaiii'r 'ltllt'ii‘ai‘l

: | 11'» " "i“l

  

      
 

 

_vi
I
Pz—A

Figure 5.12: Empirical method for stereo evaluation (only the right stereo image is
shown): (a) intensity image of sphere; (b) output from Canny edge applied to (a);

(c) depth output from stereo module; (d) ground truth depths; (e) error in (c).

207

with it. These representation and updation schemes are similar to the Kalman
ﬁlter approach. The advantage of such a scheme is that the conﬁdence map

associated with the intrinsic map indicates the reliability of the ﬁnal solution.

5.4 Integration Algorithm

Our representation of the input images and the extracted features uses seven levels
of resolution; each coarser level of resolution reduces the number of pixels by a factor
of 4 (half in each dimension, :1: and y).

The integration system starts with an initial planar depth map (and no bound-
aries) at the coarsest level of resolution (8 x 8 image size). Each module sequentially
updates the intrinsic map a ﬁxed number of times (20) and then the control is passed
to the next ﬁner level of resolution (Fig. 5.6).

We now present a high-level description of the overall integration algorithm.
Given a stereo pair of intensity images, I) and 1,, direction n, of the illumination
source, weight vector w (Eqs. (5.28)), thresholds related to the coherence func-
tions (pshading, patereo, Ptemture, and p9?) (Eq. 5.52), coupling coefficients (amading and
mature) (Eq. 5.80), and the weight vector associated with perceptual grouping module

(Section 5.2.1), the depth values are reconstructed using the algorithm in Figure 5.13.

Note that our uniform framework facilitates extension of the current system to
include additional modules. All components of the integration system require local

computations. The integration of the perceptual grouping and stereo modules at

208

 

 

1. Compute the four attribute (smoothed, gradient, positive curvature, and neg-
ative curvature) images. Initialize the disparity at each pixel at the coarsest
level (I = 6) to zero.

2. Starting with the coarsest level (I = 6), do at each level I:

A. Obtain the four attribute images at level I by blurring the attribute images
at level 0 (input image).
Do steps (2') to (iv) N (= 20) times.
(i) Apply perceptual organization module to reﬁne the boundary map. Do
a ﬁxed number (5) of anisotropic diffusion iterations of depth values.
(ii) Update the disparities at level l by one iteration of stereo module
(Eqs. (5.28)).
(iii) Apply shape from shading module to reﬁne the depth map.
(iv) Apply shape from texture module to reﬁne the depth map.

E. Project the disparities to level (l — 1) by (quadratically) interpolating dis-
parities at level I. Project the regions from level I to (l — 1).

3. Output the ﬁnal depth map and boundaries. The surface orientation (normals)
at each point can be computed from the depth map.

 

Figure 5.13: Uniform Bayesian Integration Algorithm.

each level of resolution for each iteration needs one pass over the entire image and
its computational complexity is 0(n2), for an n x 71 image. The complexities of inte-
grating shape from shading and texture modules are somewhat data-dependent and,
from our experience, require a few dozen iterations before convergence. The compu-
tations required for the integration constitute about 25% of the overall computations
of the system. The modular and distributed organization facilitates parallel imple-
mentation. Table 5.4 presents actual timing statistics for a typical pair of 512 x 512

intensity images on a Sun Sparc 20.

 

209

Table 5.2: CPU times for a typical image on Sun Sparc 20.

 

 

 

 

 

 

 

Computation Time
Perceptual Grouping 18308
Shape from shading 8108
Stereo 1188s
Shape from texture 33608
Integrated System 84398

 

 

 

5.5 Experimental Results

Our experimental results will be discussed in the context of quality of reconstruction
obtained from the integrated system. The experiments are primarily designed to
demonstrate the graceful deterioration in the performance of the system in situations
where assumptions made by the individual modules are violated. Many results are
presented for subjective evaluation by the reader. Objective evaluation of the quality
of surface reconstruction is presented in case of synthetic images. We will now brieﬂy

describe our imaging setup before presenting our results.

The objective of the experiments described in this section is to demonstrate the
improvements in the recovery of the 3D structure of the scene due to integration
as well as the graceful degradation in the performance of the system under adverse
situations. In most of the experiments, we have chosen to compare the reconstruction
results obtained from the integrated system with corresponding results from stereo
module alone; the reconstruction results from the other individual modules were not
as reliable as those of the stereo module. The experiments are organized as follows.

First, we describe the improvement in numerical accuracy of 3D reconstruction due

210

to integration for synthetic images rendered under a variety of controlled imaging
conditions. The second part of the section illustrates the qualitative improvement in
the recovery of 3D structure due to integration for several real images.

Synthetic Data: Six volumetric primitives (parallelopiped, cylinder, hyper-
boloid, paraboloid, ellipsoid, and torus) were used for generating photo-realistic stereo
images. This set of primitives was selected primarily due to its wide range of sur-
face proﬁles. This data set was supplemented by the range image of a Mozart bust.
Texture-mapped stereo images of these surfaces were obtained using a photo-realistic
renderer, pov-ray.

Imaging Conditions: In all the experiments involving synthetic images, the im-
age centers of the left and right cameras were located at (—9, 0, —90) and (9,0, ~90),
respectively. The optical axes of both the cameras pointed towards the origin and
were located in X —Z plane. A pin hole camera geometry determined the imaging
projections. A point illumination source was placed directly behind the camera at
(0,0, —1000). The sensing noise was simulated by 10% i.z'.d Gaussian jitter in the
projected intensity of pixels.

The ﬁrst set of experiments was designed to demonstrate the soundness of the
integration scheme to recover the 3D structure for the entire data set (which encom-
passes a wide range of surfaces) from their stereo images. Each object surface was
texture-mapped with a synthetic checkerboard texture and was modeled as a perfect
Lambertian surface. The right stereo image for each of the surface primitives and their
reconstructions obtained from stereo module alone and from the integrated system

are depicted in Figs. 5.14 and 5.15. The regularity of the ﬁne-grained texture often

211

presented a severe challenge to the stereo module in obtaining the correct correspon-
dence [76]. In an integrated system, some of the mistakes committed by the stereo
module were corrected by the information provided by the other modules. However,
the performance of the integrated system is far from perfect. The errors in the re-
construction obtained by the integrated system can be primarily attributed to the

late introduction of the other modules in the processing of data 15.

Consequently,
the errors committed by the stereo module at the coarser levels of resolution could
not always be reversed. The 3D reconstructions obtained from stereo alone and from
the integrated system were compared with the ground truth (given), using a squared
error function based on differences in true and estimated depth and shape (surface
normals) measurements. Percentage reductions in this squared error are used for as-
sessing the performance of the proposed integration strategy. Table 5.3 shows that

the surface reconstruction using the integrated method is superior to that obtained

by the stereo module alone.

Table 5.3: Improvement in surface reconstruction due to integration: Lambertian
surfaces.

 

 

 

 

 

 

 

Surface primitive % Reduction in shape % Reduction in depth
estimation error estimation error
Parallelopiped 31 28
Sphere 20 23
Cylinder 22 26
Paraboloid 17 22
Hyperboloid 8 14
Torus 5 12

 

 

 

 

 

Violation of nominal assumptions made by individual modules should not signif-

 

15At very coarse resolutions (8 x 8, 16 x 16, and 32 x 32), no meaningful features could be extracted
from the image pair for the modules other than stereo.

212

icantly deteriorate the overall system performance. The second experiment focussed
on the reconstruction of a surface of an arbitrary complexity (Mozart bust) under
various “adverse” conditions. Each of the confounding situations is shown in (a) and
(b) part8 Figs. 5.16—5.20. Fig. 5.16(a) is the right stereo image of a Lambertian surface
with no texture and hence the performance of shape from texture module was not ex-
pected to be reliable. The resultant integrated system performance remained stable.
The improvement in performance (Table 5.4) was primarily contributed by the shape
from shading module and was not signiﬁcant since considerable variation in brightness
across the image surface helped stereo module to obtain the correct correspondences.
Stereo images in Figs. 5.16-5.19 present challenging conditions for not only the shape
from shading module (due to violation of single surface albedo assumption), but also
to the shape from texture module. First, the texture is not uniform and the variation
of the texture features across a physical surface does not systematically relate to its
orientation. Further, in Fig. 5.19 the surface is mapped with two similar yet distinct
textures. All these confounding conditions did not signiﬁcantly interfere with the
performance of the integrated system. We primarily observed an improved shape es-
timation rather than improved depth estimation under these situations. Figs. 5.20(a)
and (b) depict a more complex situation. Here, the Specularity violates the “Lamber-
tian surface” assumption made by the shape from shading module in addition to the
unique surface albedo. The specular patches also interfere with the performance of
the stereo module (since the sites with similar intensity attributes do not necessarily
imply correspondence). Finally, the non-uniform texture violated the assumptions

made by the texture module. The violation of assumptions made by all the three

213

modules results in a very low improvement in the overall system performance (see

Table 5.4).

Table 5.4: Improvement in surface reconstruction in texture-mapped Mozart images.

 

 

 

 

 

 

Surface % Reduction in shape % Reduction in depth
estimation error estimation error
No texture (Fig. 5.16) 8 4
Partial texture (Fig. 5.17) 27 12
Full texture (Fig. (5.18) 32 14
Two textures (Fig. 5.19) 37 18
Specular texture (Fig. 5.20) 16 14

 

 

 

 

 

Finally, Figures 5.21 and 5.22 illustrate (visually) how the performance of the
integrated system improves the 3D reconstruction of real images. In the fruit im-
age (Figs. 5.21(a) and (b)), the integrated system improved the shape estimation
(improvement is conspicuous over the cantaloupe and table-cloth surfaces) and the
depth estimation (especially over the cantaloupe surface). Figures 5.24(a) and (b)
show superquadric ﬁts to the depths obtained from stereo module and the integrated
system, respectively.

The image of Egg and Cup (Figs. 5.22(a) and (b)) was captured by an inexpensive
CCD camera (Panasonic GP-KR202, f = 25 cm, maximum aperture). The images
were subsequently gamma corrected with 7 = 2.0 and normalized to 256 gray levels.
The stand off was approximately 80 cm. To obtain a stereo pair of images, the camera
was translated and rotated. The translation was in the direction of :::-axis and rotation
was about the y-axis (the z-axis being approximately aligned with the optical axis).
The rotation of the camera was effected to bring the disparity of the region of interest

close to zero. The scene was illuminated with ambient light and a single incandescent

are

N],
' v

3.7

3'1

.11

214

light source was located (30 cm) behind the camera (approximately in the :::-2 plane)
pointing in the direction (0,0,1). The integrated system improved the estimation of

the shape and depth features over the egg surface.

Fig. 5.23 shows segmentation results for Fruit image (Fig. 5.21(b)) and Egg and

Cup image (Fig. 5.22(b)).

5.6 Summary

The visual world can only be ambiguous in relatively contrived situations, but in
real world it is the combinations of the cues which convey a unique physical real-
ity. Whether in reconstructing the entire visual input or in extracting its component
relevant to the given task, the designer of a vision system confronts the challenging
problem of integrating all visual cues to obtain a reliable performance. Several inte-
gration strategies have been reported in the literature which primarily study pairwise
integration of vision modules. But, there is a dearth of results on integrating more
than two modules for real images. We have presented a uniﬁed framework for inte-
grating vision modules which facilitates a design of a ﬂexible and extensible integrated
system for 3D reconstruction from a pair of stereo images. Based upon this frame-
work, we have implemented a system for integrating four vision modules: perceptual
organization, shape from shading, stereo, and shape from texture. We show the re-
construction results using the integrated system for both synthetic and real images.
We also demonstrate the consistent performance of the integrated system even in the

adverse situations where one or more assumptions made by the individual modules

215

are violated. The numerical accuracy of the recovered depth is assessed in case of
synthetically generated data.

Finally, we note some of the limitations of our current system which are signiﬁcant
topics for further research: We have only provided some empirical evidence for the
convergence and stability of the integrated system. A rigorous analysis of these issues
needs to be undertaken which is a subject of our ongoing exploration. Further, the
present ﬂow of control is ﬁxed and is not suitable for partial and dynamic reconstruc-
tion. A more ﬂexible control might provide a congenial environment for many active

and purposive vision systems.

216

 

 

(11)

Figure 5.14: Synthetic texture-mapped surface primitives (Only the right stereo im-
age is shown): (a) parallelopiped (31%); (b) sphere (20%); (c) cylinder (22%); (g)—(i)
and (m)—(o) depict the depth reconstruction for these primitives from stereo mod-
ule and from the integrated system, respectively. Figures in the parentheses show
improvements in the depth estimate due to integration (see Table 5.3).

217

.. 1' '1".
2111'.)

”lull“

 

(<1)

Figure 5.15: Synthetic texture-mapped surface primitives (Only the right stereo im-
age is shown): (d) paraboloid (17%); (e) hyperboloid (8%); and (f) torus (5%); (j)—(l)
and (p)—(s) depict the depth reconstruction for these primitives from stereo module
and from the integrated system, respectively. Figures in the parentheses show im-
provements in the depth estimate due to integration (see Table 5.3).

218

= r
”a.
.9;

=
, 3:.
a7

9:;

z.
5
”V44”?

5
E a

,5?

, z/z/

: 4
r5 1 /II

awn/5'11

/

a
a a
., 454% i

25

r
v

r
41¢an!
VII/I

:-

'
"’
"Valli;

$th
I
r/II’//IIII

(III/rtIo/Ga

III;
toll

 

(Illa .

images

reconstruction of (a) and

(a) and (b) Mozart stereo

(size 512x512) synthesized with no surface texture

(b) using stereo module only
system. Depth estimat

ith no texture

ian surface w

: A Lambert

16

Figure 5.

C)
and (b) us

(
)

d by 8% due to integration (see Table 5.4).

1

integrated

the

ing

a

(

reconstruction of

(d)

i

1011 was Improve

219

ial texture

512x512) with texture-mapped surface juxtaposed

l

l)
l

M »

—
.-
.:

G:

, 95/15

1—

=
a. —

5: u:-
232. 4.

;

 

40 ,2
$592? r?

t!

........?..c.
g

Uzi/ll

’6 2:, $517!!

9!
r 5

>5

5

4.1/Illa
.

ﬁll/xi

I;

’[I/

Griﬂ/I/II
1

'ﬂ»
I!
#4.

t
it

 

images (size

(a) and (b) Mozart stereo

A surface with part

Figure 5.17

th a surface with no texture;

wi

(d) reconstruction of (a)
improved by 27%

)

tereo module only
Depth est

) usmg s

(c) reconstruction of (a) and (b
tegrated system

and (b)

due to

1011 was

t

lma

1n

the '

usmg

integration (see Table 5.4).

.

Slze

(

images

integrated

' the

usmg

)

a t!
a:- a:
52—5”?!
. 2’; 21:33:01!
, r; I
ﬁrérﬁrplllll

b

(

220

 

 

:4

z 5;
5 5

i;
.
”96¢

a
I”;

\ l5
3
.k

an?
r

of
a: O4
r ’40:.

sear/III!!!

 

:
\\,\‘

II

27¢

do!
a0;
a v

I

I
or.
t

.
\1 .

 

 

A surface with full texture: (a) and (b) Mozart stereo

18

Figure 5.

(c) reconstruction of (a) and

)

512x512) texture-mapped with a homogeneous texture

(b) using stereo module only

) and

a

(

reconstruction of

)

d

(

7

System. Depth estimation was improved by 32% due to integration (see Table 5.4).

221

I;
I
are,

,,
2?.

.
(r.
. .

41’

«ill
II/Il

,l/I’l/I‘

vwrtllﬁ

?
2, a
:5 .276

It,

7/
(4 d'/
5'
in?!

a.

 

1Z6

(a) and (b) Mozart stereo images (8

A surface with two textures
512x512) texture-mapped with two distinct textures

(b)

5.19

Figure

(c) reconstruction of (a) and

1

ion of (a) and

system. Depth estimation was improved by 37% due to integration (see Table 5.4).

ing the integrated

b) us

(

reconstruct

(d)

’

tereo module only

usmg s

)

8128

images (

lmation

Depth est'

(d

"
55’ till
a g: runaarnnnnllliz
5r; ﬁrﬁrﬁﬁolllllll

222

, /.
xii/4110’

3,4?1/11/4-

 

(a) and (b) Mozart stereo texture-mapped

Specular surface
512x512) for a specular surface

Figure 5.20

(c) reconstruction of (a) and (b) using stereo module

)

and

integrated system

ing the

b) us

(

)

a

(

(d) reconstruction of
d by 16% due to integration (see Table 5.4).

.
)

only

was improve

223

 

Figure 5.21: Fruit image (size 512x512): (a) and (b) Left and right stereo images; (c)
Recovered depth from stereo alone; (d) Recovered depth from the integrated system;
(e) Recovered surface normals from stereo alone; (f) Recovered surface normals from
the integrated system; (g) and (h) show (e) and (h) as shaded perspective views [76].
Fruit image was provided by Prof. Narendra Ahuja.

224

 

 

(
Figure 5.22: Egg and Cup image (size 512x512): (a) and (b) Left and right stereo im—
ages; (c) Recovered depth from stereo alone; (d) Recovered depth from the integrated
system; (e) Recovered surface normals from stereo alone; (f) Recovered surface nor-
mals from the integrated system; (g) and (h) show (e) and (h) as shaded perspective
views [76].

225

   

(a)

Figure 5.23: Segmentation results for (a) Fruit image (Fig. 5.21(b)) and for (b) Egg
and Cup image (Fig. 5.22(b)).

   

(a) (b)

Figure 5.24: Recovery of 3D shape of cantaloupe in the Fruit image (Fig. 5.21): (a)
Recovered superquadrics from stereo alone; (b) Recovered superquadrics from the
integrated system.

Chapter 6

Conclusions and Future Work

Information integration is an important research problem and is a constant theme of
exploration in many scientiﬁc ﬁelds, including computer vision. Integration of vision

modules for 3D surface reconstruction is the focus of research presented in this thesis.

In this chapter we present a brief overview of our work (Section 6.1), a list of di-

rections for future research (Section 6.2), and some concluding remarks (Section 6.3).

6.1 Overview

Individual visual cues are often unreliable and ambiguous. It is extremely difﬁcult
to overcome the limitations in implementations of the individual vision modules in
an isolated system; integrated vision systems are necessary to obtain a reliable inter-
pretation of complex scenes. Design of such systems is challenging since each vision
module works under a different and possibly conﬂicting sets of assumptions; an ef-
fective integration scheme must not only deal with noisy input images but must also

226

227

overcome the artifacts and restrictive assumptions of the individual modules.

Research reported in this thesis emphasizes modeling the interaction and infor-
mation exchange among the vision modules to overcome the limitations in their indi-
vidual performance in isolation. In a detailed case study presented in Chapter 4, we
demonstrated how the interactions among the vision modules (via, shape from shad-
ing, stereo, line labeling, and perceptual organization) can improve 3D reconstruction
from a pair of stereo intensity images.

In Chapter 5 we made an attempt at systematizing design of integration of vision
modules using a simple control structure. We proposed a uniﬁed Bayesian integra-
tion framework for interaction among the vision modules to obtain a complete 3D
reconstruction from a pair of intensity (stereo) images. The proposed integration
architecture allows a parsimonious modeling of various interactions. Novel features
of the proposed scheme include, (i) interaction of each module with intrinsic map,
(ii) multi-level, multi—resolution representations and hierarchical coarse-to-ﬁne con-
trol, (iii) ﬁne—grained feedback mechanisms, and (iv) robust estimation procedures
based on the principle of coherence. We have integrated perceptual grouping, stereo,
shape from shading, and shape from texture modules under the proposed framework.
We demonstrated the efﬁcacy of our approach using real images of several different
scenes and observed improvements in the quality of recovered 3D structure as a result
of integration. The output of the integrated system is shown to be insensitive to vi-
olations of individual module assumptions. The numerical accuracy of the recovered
depth is assessed for photo-realistically rendered images from several scenes contain-

ing a variety of generic surfaces. We have also qualitatively evaluated our approach

228

by reconstructing geons from the depth data obtained from the integrated system.

6.2 Future Directions

Whether to reconstruct the entire visual input or to extract its component relevant to
the given task, a reliable vision system is required to consider all the available visual
cues to obtain an unambiguous output. This raises several important research issues
in solving integration problem. What is the most reliable information provided by
each visual cue? How to design an integrated system which can be easily maintained
and extended? How to integrate vision modules so that the system performance does
not critically depend on the performance of individual modules? How much weight
should be assigned to the information provided by each module? We have made an
attempt to address some of these research issues in our work, but a deﬁnitive solution
to these problems will need more extensive research. Here we list some possible

enhancements to the present system:

1. Color is an important source of information in the human visual system. Until
now, we did not consider this cue in order to keep the magnitude of the image
data manageable. As seen in Section 3.2, the recovery of surface orientation from
shading information alone is an underconstrained problem. Inclusion of color
information will provide additional physical constraints (one irradiance equation
per channel). Such an augmentation is likely to improve the performance of the

integrated system.

229

. The current system does not include a systematic model for specular reﬂectance.
Our experiments indicate that a limited amount of Specularity does not critically
degrade the performance of the system. Inclusion of even a simple model for
Specularity could improve the performance of the system for scenes with highly

specular surfaces.

. The present implementation separately models interaction between each cue
and the intrinsic map to derive a simple control structure. How eﬂective is this
model, in practice, compared to the pairwise modeling of interactions among

the modules?

. Our system is relatively insensitive to the various module and interaction pa-
rameters due to the manner in which we have incorporated feedback into the
system. It would be desirable to automatically learn the parameters of the

system.

. The ﬂow of control in our system is ﬁxed and is not suitable for partial and
dynamic reconstruction. A more ﬂexible control might provide a congenial

environment for many active and purposive vision systems.

. The present approach, in a sense, egalitarian; it assumes that all modules per-
form equally well in all regions of an image. This situation can be vastly im-
proved by our knowledge about the performance of a module on the image
features. For instance, Karu, Jain, and Bolle have recently shown that it is pos-

sible to identify textured regions in an image. Such mechanisms are expected

230

to vastly improve the system performance [93]. The present approach does not

utilize the common knowledge about the performance of various modules.

7. The present system proposes an ambitious goal of generic 3D reconstruction. A
given application domain often offers opportunities to recover from the errors in
the sensing and intermediate processing by coupling goals of the system with the
object models in the domain. Such an augmentation of the current integrated
system in the context of a given application ( e. 9., 3D object recognition for a

limited class of objects) would be useful.

6.3 Conclusions

The visual world can only be ambiguous in relatively contrived situations (e. g. N ecker
cube) but the visual input from the scenes in real world abounds with a number
of visual cues. It is only the combination of several cues that permits a reliable
interpretation1 .

There is a general agreement among the computer vision researchers about the
need for information integration. However, the potential of a vision system relying
on low-level modules has been grossly underestimated because of individual vulner-
abilities of the modules. While much of the research is directed at improving the
performance of individual modules, a relatively few studies emphasize the impor-
tance of integration. Our results from a limited scene domain demonstrate that an

integrated system comprising of several low-level modules can provide a better 3D

 

1This line of thinking is associated with the theory of direct perception.

231

reconstruction than the individual modules without using any top-down knowledge.
However, attaining the performance of the human visual system using computers
appears to be a far cry.

Integrating vision modules for 3D reconstruction from a stereo pair of intensity
images is a diﬂicult problem primarily due to our lack of understanding of two under-
lying issues: (i) an accurate assessment of the strengths and limitations of individual
modules; (ii) the representations and control structures which can exploit comple-
mentary constraints provided by the imperfect modules to recover the true structure
in the data. We have attempted to systematize the design procedure for an inte-
grated system which takes into account these research issues and demonstrated that
an integrated system thus designed leads to improved results in a limited scene do-
main. Much more research is needed to obtain deﬁnitive and robust solutions to the

integration problem.

 

APPENDICES

Appendix A

Notation and Conventions

R: set of real numbers.

Rd: d-dimensional real space.

C": A function differentiable n times.

f () variable probability distributions.

P(.): speciﬁc named probabilities.

Xij: jt" observation (measurement) at the it” location (site).
Y”: 3"" intrinsic parameter of the it” location (site).
Pi: Decision vector variable.

p;: A decision vector.

P;: Optimal decision vector variable.

Sgn(:c): Sign function; S gn(:z:) = 1 if a: > 0 else zero.
Id.Sgn(a:,y): 1 if Sgn(:2:) = Sgn(y) else zero.

Z\y: All members of set Z except y.

232

Appendix B

Glossary of Assumptions in

Computer Vision

1. Transparency: Depth map could be a multi-valued function.
2. Opacity: Depth map is a single-valued function.

3. General VieWpoint: Statistically signiﬁcant structural relationships among im-
age features are unlikely to have resulted due to accidental viewpoint of the

observer.

4. Coherence: Features resulting from a single physical event display statistical

interaction.
5. Cohesiveness: Objects are usually compact and opaque.

6. Continuity: A given property is smoothly varying and is differentiable. Usually,
used in the context of the linear image (or scene) features, e. g, boundaries.

233

10.

11.

12.

13.

14.

15.

16.

17.

18.

234

IsotrOpy: all edge orientations are equally likely in the scene.

Homogeneity: The surface is covered with a single texture.

Independence: The edgel orientations extracted from the input image are inde-

pendent.

Regularity: The texture is periodic and not stochastic.

Regular projective geometry: The unprojected texture in the scene and the
imaging geometry are not conspiring to compensate for their individual contri-

butions.

Smoothness: Differentiable property. Typically used for surface properties.

Integrability: The reconstructed surface is required to be physically meaningful.

Local spatial interaction: Spatially distant image features are less likely to

display statistical interaction.

Lambertian Surface: Irradiance is independent of viewing direction and follows

Lambert’s cosine law.

Gaussian noise: Noise process follows Gaussian distribution.

Perfect segmentation: The objects of “interest” (foreground) can be accurately

identiﬁed in the given image(s).

Single source of illumination: Objects in the image are illuminated by a single

source of light.

 

19.

20.

21.

22.

23.

24.

25.

26.

27.

235

Small motion: Inﬁnitesimal motion of each image feature between successive

image frames.

Polyhedral world: Objects in the image are polyhedral and possess deﬁnite

volume.

Paraperspective projection: Imaging projection process is considered to be a
composition of two projections: (i) ﬁrst, all features on surface of an object
are orthographically projected onto a a plane parallel to the image plane and
passing through the center of mass of that object, (ii) the projected surface

features are then projected on to the image plane using perspective projection.

Sufficient context: The projective distortion of surface markings due to imaging

is not trivially degenerate.
Symmetry: Objects in the image are symmetric.

Rigidity: Objects in the image are rigid. No relative motion of any part of an

object with respect to any other part of the object.

Locally spherical surface: Maximum and minimum curvature at every point on

the object surface are identical.

No interreﬂection: The amount of illumination at each point on an object sur-

face is accounted for by the source of direct illumination.

Piecewise continuity: Image features (boundaries) are piecewise continuous.

28.

29.

30.

31.

32.

236

Dichromatic reﬂectance: Spectral distribution of the diffuse component is deter-
mined by the colorant in the surface whereas the specular component preserved

the spectral distribution of the incident light.

ﬁactal surface model: The surface shape can be adequately described by a

fractal.

Separable colors: Piecewise constant albedo.

Constant albedo: Albedo of all the surfaces in an image is identical and constant.

Gaussian blurring: Camera’s point spread function can be approximated by a

two dimensional Gaussian function.

Bibliography

[ll

[2]

l3]

[4]

[5]

A. L. Abbott and N. Ahuja. Surface reconstruction by dynamic integration of
focus, camera vergence, and stereo. In Proceedings of the Second International

Conference on Computer Vision, pages 532—543, Tarpon Springs, 1988.

N. Ahuja. A transform for detection of multiscale image structure. In Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition, pages

780-781, New York, 1993.

N. Ahuja and M. Tuceryan. Extraction of early perceptual structure in dot

patterns: integrating region, boundary, and component Gestalt. Computer

Vision, Graphics, and Image Processing, 48:304—356, 1989.

J. Aloimonos. Detection of surface orientation from texture i: The case of
planes. In Proceedings of IEEE Computer Society Conference on Computer

Vision and Pattern Recognition, pages 584—593, Miami Beach, FL, 1986.

J. Aloimonos. Purposive and qualitative vision. In Proceedings of the 10th
International Conference on Pattern Recognition, pages 346—360, Atlantic City,
New Jersey, 1990.

237

238

[6] J. Aloimonos and A. Bandopadhyay. Active vision. In Proc. First IEEE Inter-

national Conference on Computer Vision, pages 35—54, London, 1987.

I '7] Y. Aloimonos, editor. Active Perception. Lawrence Erlbaum Assoc, Hillsdale,

NJ, 1993.

[8] Y. Aloimonos and D. Shulman. Integration of Visual Modules: An Extension

of the Marr Paradigm. Academic Press, San Diego, CA, 1989.

[9] T. Basar and G. J. Olsder. Dynamic Noncooperative Game Theory. Academic

Press, New York, NY, 1982.

[10] D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall, New Jersey,

1982.

(1 1] S. T. Barnard and W.B. Thompson. Disparity analysis of images. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 2(4):333-340, June 1980.

[12] H -G. Barrow and J.M. Tenenbaum. Recovering intrinsic characteristics from
images. In A. Hanson and E. Riseman, editors, Computer Vision Systems,

pages 3—26. Academic Press, New York, NY, 1978.

[13] H- G. Barrow and J .M. Tenenbaum. Interpreting line drawings as three dimen-

sional surfaces. Artiﬁcial Intelligence, 17:47—75, August 1981.

l14l H~ G - Barrow and J .M. Tenenbaum. Retrospective on interpreting line drawings

as three dimensional surfaces. Artiﬁcial Intelligence, 59:71—80, 1993.

239

[15] RN. Belhumeur and D. Mumford. A Bayesian treatment of the stereo corre-
spondence problem using half-occluded regions. In Proc. IEEE Conference on
Computer Vision and Pattern Recognition, pages 506—512, Champaign, Illinois,

1992.

[16] J. A. Benediktsson and P. H. Swain. Consensus theoretic classiﬁcation meth-
ods. IEEE Transactions on Systems, Man, and Cybernetics, 22(4), July/ August

1992.

[17] J. Besag. On the statistical analysis of dirty pictures. Journal of Royal Statistical

Society Ser.B, 48:259-302, 1986.

[18] T. O. Binford. Inferring surfaces from images. Artiﬁcial Intelligence, 17:205—

244, August 1981.

[19] A. Blake and A. Yuille, editors. Active Vision. The MIT Press, Cambridge,

MA, 1992.

[20] A. Blake and A. Zisserman. Visual Reconstruction. The MIT Press, Cambridge,

MA, 1987.

[21] A. Blake, A. Zisserman, and G. Knowles. Surface descriptions from stereo and
shading. In B. K. P. Horn and M. J. Brooks, editors, Shape from Shading,

chapter 2, pages 29—52. The MIT Press, Cambridge, MA, 1989.

[22] D. Blostein and N. Ahuja. Shape from texture: Integrating texture-element
extraction and surface estimation. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 11(12):1233—1251, December 1989.

 

[23]

[24]

[25]

[26]

[27]

[28]

[29]

240

RM. Bolle and DB. C00per. Bayesian recognition of local 3-D shape by approx-
imating image intensity functions with quadric polynomials. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 62418—429, July 1984.

RM. Bolle and DB. Cooper. On optimally combining pieces of information,
with application to estimating 3-D complex-object position from range data.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 82619—638,

September 1986.

R. C. Bolles and P. Horaud. 3DPO: A three—dimensional part orientation sys-

tem. The International Journal of Robotics Research, 5(3), 1986.

HI. Bozma. A Decentralized Integration in Modular Systems Using a Game

Theoretic Framework. PhD thesis, Yale University, New Haven, CT, 1992.

HI. Bozma and J .8. Duncan. Integration of vision modules: A game-theoretic
framework. In Proceedings of IEEE Conference on Computer Vision and Pat-

tern Recognition, pages 501—507, Maui, Hawaii, 1991.

HI. Bozma and J .8. Duncan. A modular system for image analysis using a
game theoretic framework. Image and Vision Computing, 6(10):431—443, July—

August 1992.

HI. Bozma and J.S. Duncan. A game-theoretic approach to integration of
modules. IEEE Transactions on Pattern Analysis and Machine Intelligence,

16(11):1074—1086, November 1994.

 

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

241
J. Brolio, B. A. Draper, J. R. Beveridge, and A. R. Hansen. ISR: A database

for symbolic processing in computer vision. Computer, pages 22—30, Dec 1989.

D. Buckley and J. Frisby. Interaction of stereo, texture and outline cues in the
shape perception of three-dimensional ridges. Vision Research, 33(7):919—933,

May 1993.

D. Buckley, J. P. Frisby, and J. E. W. Mayhew. Interaction of texture and stereo
cues in perception of surface slant: Evidence for surface orientation anisotropy

in cue integration. Perception, 17:384, 1988.

D. Buckley, J. P. Frisby, and J. E. W. Mayhew. Integration of stereo and
texture cues in the formation of discontinuities during three-dimensional surface

interpolation. Perception, 18(5):563—588, 1989.

H. H. Bulthoff and H. A. Mallot. Integration of depth modules: Stereo and

shading. Journal of Optical Society of America, 5(10):1749—1758, October 1988.

J. Canny. A computational approach to edge detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-8(6):679—698, November

1986.

I. Chakravarty. A generalized line and junction labeling scheme with applica-
tions to scene analysis. IEEE Transactions on Pattern Analysis and Machine

Intelligence, PAMI-1(2), April 1979.

B. Chandrasekaran. What kind of information processing is intelligence? a

perspective on A1 paradigms and a proposal. In Patridge and Wilks, editors,

 

242

Source Book on the Foundations of AI. Cambridge University Press, Cambridge,

England, 1987.

[38] Y.L. Chang and J .K. Aggarwal. Reconstructing 3D lines from a sequence of
2D projections: Representation and estimation. In Third IEEE International

Conference on Computer Vision, pages 101—105, Osaka, Japan, 1990.

[39] P. Chou and C. Brown. The theory and practice of Bayesian image labelling.

International Journal of Computer Vision, 4:185—210, 1990.

[40] RH. Christensen and LG. Shapiro. Three-dimensional shape from color pho-

tometric stereo. International Journal of Computer Vision, 13:213—227, 1994.

[41] C.-C. Chu and J. K. Aggarwal. The integration of image segmentation maps
using region and edge information. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 15(12):1241—1252, December 1993.

[42] J. J. Clark and A. L. Yuille. Data Fusion for Sensory Information Processing

Systems. Kluwer Academic Publishers, Boston, MA, 1990.

[43] M. B. Clowes. On seeing things. Artiﬁcial Intelligence, 2279—116, 1971.

[44] G. F. Cooper. The computational complexity of probabilistic inference using

Bayesian belief networks. Artiﬁcial Intelligence, 41:393—405, 1990.

[45] R. Cowie and R. Perrott. From line drawings to impressions of 3D objects:
Developing a model to account for the shapes that people see. Image and

Vision Computing, 11:342—352, 1993.

[46]

[47]

[48]

[49]

[50]

[51]

[52]

243

F. Crick. Function of the thalamic reticular complex: The searchlight hypoth-
esis. In Proceedings of the National Academy of Sciences, pages 4586—4590,

1984.

J. Cryer, P.S. Tsai, and M. Shah. Integration of shape from X modules: Com—-
bining stereo and shading. In Proceedings of IEEE conference on Computer

Vision and Pattern Recognition, pages 720—721, New York, 1993.

M. Daily. Color segmentation using Markov random ﬁelds. In Proc. IEEE
Computer Society Conference on Computer Vision and Pattern Recognition,

pages 304—312, San Diego, June 1989.

L. S. Davis and A. Rosenfeld. Cooperating processes for low-level vision: A

survey. Artiﬁcial Intelligence, 17(1—3):245—263, August 1981.

U. R. Dhond and J .K. Aggarwal. Structure from stereo —- A review. IEEE

Transactions on Systems, Man, and Cybernetics, 19(6):1489—1510, June 1989.

B. A. Draper, R. T. Collins, J. Brolio, A. R. Hanson, and E. M. Riseman. Issues
in the development of a blackboard-based schema system for image understand-
ing. In R. Engelmore and T. Morgan, editors, Blackboard Systems, chapter 8,

pages 189—218. Addison Wesley, 1988.

D. Dubois and H. Prade. Possibility Theory: An Approach to Computerized

Processing of Uncertainty. Plenum Press, New York, 1988.

[53] J. Duncan and G. W. Humphreys. Visual search and stimulus similarity. Psy-

chological Review, 96:433—458, 1989.

 

[54]

[55]

[56]

[57]

[58]

[59]

[60]

[61]

244

R. Engelmore and T. Morgan. Blackboard Systems. Addison Wesley, New York,
1988.
J. Ens and P. Lawrence. An investigation of methods for determining depth

from focus. IEEE Transactions on Pattern Analysis and Machine Intelligence,

15(2):97—108, February 1993.

L. D. Erman, F. Hayes—Roth, V. R. Lesser, and D. R. Reddy. The Hearsay-
II speech-understanding system: Integrating knowledge to resolve uncertainty.

Computing Surveys, 12(2):213—253, June 1980.

E. A. Feigenbaum. The art of artiﬁcial intelligence: themes and case studies
in knowledge engineering. In Fifth Joint International Conference on Artiﬁcial

Intelligence, pages 1014—1029, Menlo Park, CA, 1977.

F. P. Ferrie and M. D. Levine. Where and why local shading works. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 17(1), 1989.

R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape
from shading algorithms. In B. K. P. Horn and M. J. Brooks, editors, Shape

from Shading, chapter 5, pages 89—122. The MIT Press, 1989.

E. B. Gamble, D. Geiger, T. Poggio, and D. Weinshall. Integration of vision
modules and labeling of surface discontinuities. IEEE Transactions on Systems,

Man, and Cybernetics, 19(6):1576—1581, November/ December 1989.

J. Garding. Shape from texture for smooth curved surfaces in perspective pro-

jection. Journal of Mathematical Imaging and Vision, 2:327-350, 1992.

 

245

[62] J. Garding. Shape from texture and contour by weak isotropy. Artiﬁcial Intel-

ligence, 64:243—297, 1993.

[63] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and

Machine Intelligence, PAMI-6z721—741, November 1984. R

[64] J. Gibson. The Ecological Approach to Visual Perception. Houghton-Mifﬂin,

 

1979.

[65] A. A. Giordano and F. M. Hsu, editors. Least squares estimation with applica-

tions to digital signal processing. Wiley, New York, 1985.

[66] AS. Glassner, editor. An introduction to Ray tracing. Academic Press, New

York, 1989.

[67] F. Glazer. Multilevel relaxation in low level computer vision. In A. Rosen-
feld, editor, Multiresolution Image Processing and Analysis, pages 312-330.

Springer—Verlag, New York, 1984.

[68] W. E. L. Grimson. From Images to Surfaces. MIT Press, Cambridge, MA,

1981.

[69] W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric

Constraints. MIT Press, Cambridge, MA, 1990.

[70] W.E.L. Grimson. Binocular shading and visual surface reconstruction. Com-

puter Vision, Graphics, and Image Processing, 28:19—43, 1984.

246

[71] A. D. Gross. Shape from a symmetric universe. Technical Report CUCS—065—

90, Columbia University, New York City, New York, 1990.

[72] Stephen Grossberg, editor. Neural Networks and Natural Intelligence. The MIT

Press, Cambridge, MA, 1988.

[73] G. Healey and T. Binford. A color metric for computer vision. In IEEE Con-

ference on Computer Vision and Pattern Recognition, pages 10—17, Ann Arbor,

MI, 1988.

[74] G. Healey and TO. Binford. Local shape from Specularity. In Proc. First IEEE
International Conference on Computer Vision, pages 151—160, London, UK,

1987.

[75] J. E. Hochberg. Perception. Prentice-Hall, Englewood Cliffs, N.J., 1964.

[76] W. Hoff and N. Ahuja. Extracting surface from stereo: An integrated approach.
In Proc. First IEEE International Conference on Computer Vision, pages 284—-

294, London, UK, 1987.

[77] R. L. Hoffman and A. K. Jain. Segmentation and Classiﬁcation of Range Im-
ages. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-

9(5):608—620, September 1987.

[78] R. Horaud and M. Brady. On the geometric interpretation of image contours. In
The 1987 IEEE International Conference on Computer Vision, pages 374—382,

1987.

 

[79]

[80]

[81]

[82]

[83]

[84]

[85]

[86]

247

B. K. P. Horn. Obtaining shape from shading information. In Shape from

Shading, pages 123—173. The MIT Press, Cambridge, MA, 1989.

B. K. P. Horn and M. J. Brooks, editors. Shape from Shading. The MIT Press,

Cambridge, MA, 1989.

B. K. P. Horn and B. G. Schunck. Determining optic ﬂow. Artiﬁcial Intelligence,

17, 1981.

R. A. Hummel and S. W. Zucker. On the foundation of relaxation labeling
processes. IEEE Transactions on Pattern Analysis and Machine Intelligence,

PAMI—5(3):267—287, May 1983.

T. L. Huntsberger and S. N. Jayaramamurthy. A framework for multi—sensor
fusion in the presence of uncertainty. In Proceedings of 1987 Workshop on Spa-
tial reasoning and Multi-sensor fusion, pages 345—350, St. Charles, IL, October

1987. Morgan Kaufman, Los Altos.

K. Ikeuchi. Generating an Interpretation Tree from a CAD Model for 3-D
Object Recognition in Bin-Picking Tasks. International Journal of Computer

Vision, 1(2):145—165, 1987.

K. Ikeuchi and B. K. P. Horn. Numerical shape from shading and occluding

boundaries. In Shape from Shading, pages 245—301. The MIT Press, 1989.

D.W. Jacobs. Space efficient 3D model indexing. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 439—444, Ur-

bana, Illinois, June 1992.

248

[87] D.W. Jacobs. Robust and efﬁcient detection of convex groups. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pages

770—771, New York, New York, June 1993.

[88] AK. Jain and P. Flynn, editors. 3D Object Recognition Systems. Elsevier, New

York, 1993.

[89] A. Jepson and W. Richards. A lattice framework for integrating vision modules.

IEEE Transactions on Systems, Man, and Cybernetics, 22:1087~1096, 1992.

[90] E. B. Johnston, B. G. Cumming, and A. J. Parker. Integration of depth mod-

ules: Stereopsis and texture. Vision Research, 33(5-6):813—826, 1993.

[91] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press,

Chicago, Ill., 1971.

[92] T. Kanade. A theory of origami world. Artiﬁcial Intelligence, 13:279—311, 1980.

[93] Kalle Karu, A. K. Jain, and R. M. Bolle. Is there a texture in the image? IEEE

Transactions on Pattern Analysis and Machine Intelligence, 1995. To appear.

[94] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models.

International Journal of Computer Vision, 1:321—331, 1988.

[95] J. Kender and EM. Smith. Shape from darkness: Deriving surface information
from dynamic shadows. In Proceedings of the ﬁfth National Conference on

Artiﬁcial Intelligence, pages 664—669, Philadelphia, PA, August 1986.

 

[96]

[97]

[98]

1991

[100]

[101]

249

J. Kittler and J. Illingworth. Relaxation labelling algorithms — a. review. Image

and Vision Computing, 3:206—216, 1985.

J. J. Koenderink. What does the occluding contour tell us about solid shape?

Perception, 13:321 — 330, 1984.

W. L. Lakin, J. A. H. Miles, and C. D. Byrne. Intelligent data fusion for naval
command and control. In R. Engelmore and T. Morgan, editors, Blackboard

Systems, chapter 22, pages 443—458. Addison Wesley, 1988.

S. Lakshmanan and K. C. Kluge. Lane detection for automotive sensors. In
IEEE International Conference on Acoustics Speech and Signal Processing, De-

troit, May 1995. To appear.

D. Lamb and A. Bandopadhay. Shape from line drawings: Beyond Huffman-

Clowes labeling. Pattern Recognition, 14:213—219, 1993.

Y.G. Leclerc. Constructing simple stable descriptions for image partitioning.

International Journal of Computer Vision, 3:73—102, 1989.

[102] Y.G. Leclerc and AF. Bobick. The direct computation of height from shading.

In Proceedings of IEEE Conference on Computer Vision and Pattern Recogni-

tion, pages 552—558, Maui, Hawaii, 1991.

[103] Y.G. Leclerc and M.A. Fischler. An optimization-based approach to the in-

terpretation of single line drawings 3D wire frames. International Journal of

Computer Vision, 9:113—136, 1992.

 

250

[104] C. Lee and A. Rosenfeld. Improved methods of estimating shape from shading
using the light source coordinate system. In B. K. P. Horn and M. J. Brooks,
editors, Shape from Shading, chapter 11, pages 323—348. Massachusetts Institute

of Technology, Cambridge, MA, 1989.

[105] EL. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-

Day, San Francisco, 1975.

[106] M. D. Levine. Vision in Man and Machine. McGraw-Hill, New York, New

York, 1985.

[107] M. D. Levine. Vision in Man and Machine. McGraw-Hill, New York, New

York, 1985.

[108] M. Livingstone and D. Hubel. Segregation of form, color, movement and depth:

Anatomy, physiology, and perception. Science, 240:740—749, 1988.

[109] D. G. Lowe. Perceptual Organization and Visual Recognition. Kluwer Academic

Publishers, Boston, MA, 1985.

[110] Tomas Lozano-Perez, W. Eric L. Grimson, and Steven J. White. Finding cylin-
ders in range data. In Proceedings of the 1987 IEEE International Conference

on Robotics 89' Automation, pages 202—207, Raleigh, NC, 1987.

[111] S. Madarasmi, D. Kersten, and TC. Pong. Multi-layer surface segmentation
using energy minimization. In Proceedings of IEEE conference on Computer

Vision and Pattern Recognition, pages 774—775, New York, 1993.

:J'

 

251

[112] J. Malik. Interpreting line drawings of curved objects. International Journal of

Computer Vision, 1:73—103, 1987.

[113] J. Malik and D. Maydan. Recovering three—dimensional shape from a single
image of curved objects. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 11(6):555—566, June 1989. I

[114] J. Malik and R. Rosenholtz. A differential method for computing local shape-

from-texture for planar and curved surfaces. In Proceedings of IEEE Conference

 

on Computer Vision and Pattern Recognition, pages 267—273, New York, 1993.

[115] ES. Manjunath and R. Chellappa. A uniﬁed approach to boundary perception:
Edges, textures, and illusory contours. IEEE Transactions on Neural Networks,

4296—108, 1993.

[116] D. Marr. Vision. W.H. Freeman and Co., San Francisco, 1982.

[117] D. Marr and E. Hildreth. Theory of edge detection. Proc. Royal Society London

(8207), pages 187—217, 1980.

[118] D. Marr and H. K. N ishihara. Representation and recognition of the spatial or—
ganization of three-dimensional shapes. Proc. Royal Society of London (8200),

pages 269—294, 1978.

[119] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science,

194:283—287, October 1976.

252

[120] J. Marroquin. Random measure ﬁelds and the integration of visual information.
IEEE Transactions on Systems, Man, and Cybernetics, 22(4):705—716, July

1992.

[121] J. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed prob-
lems in computational vision. Journal of the American Statistical Association,

82:76—89, 1987.

[122] L. Matthies and A. Elfes. Integration of sonar and stereo range data using grid
representation. In IEEE International Conference on Robotics and Automation,

pages 723—733, Philadelphia, PA, 1988.

[123] J. E. W. Mayhew and J. P. Frisby. Psychological and computational studies

towards a theory of human stereopsis. Artiﬁcial Intelligence, 17:349—385, 1981.

[124] J. D. McCafferty. Human and Machine Vision: Computing Perceptual Organi-

zation. Ellis Horwood, New York, 1990.

[125] J. McClelland and D. Rumelhart, editors. Parallel Distributed Processing, vol—

ume 1,2. MIT Press, 1986.

[126] G. Medioni and R. Nevatia. Segment-based stereo matching. Computer Vision,

Graphics, and Image Processing, 31:2—18, 1985.

[127] J. Modestino and J. Zhang. A Markov random ﬁeld model-based approach to
image interpretation. In Proceedings of IEEE Conference on Computer Vision

and Pattern Recognition, pages 458—465, San Diego, June 1989.

 

253

[128] ML. Moerdler and TE. Boult. The integration of information from stereo

[129]

[130]

[131]

[132]

[133]

[134]

and multiple shape-from texture cues. In Proc. 1988 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pages 524—529, Ann

Arbor, Michigan, June 1988.

H. Moravec. Towards automatic obstacle avoidance. In Proceedings 5th Joint
Conference on Artiﬁcial Intelligence, page 584, Los Angeles, LA, 1977. William

Kaufmann.

H. Moravec. Sensor fusion in certainty grids for mobile robots. AI Magazine,

pages 61—74, Summer 1988.

PG. Mulgaonkar, L.G. Shapiro, and RM. Haralick. Shape from perspective: A
rule-based approach. In Proceedings of IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 298—320, Miami Beach, FL,

1986.

S. Nadabar. MRF Model-Based Segmentation of Range Images. PhD thesis,

Michigan State University, 1992.

S. G. Nadabar and A. K. Jain. Edge Detection and Labeling by Fusion of In-
tensity and Range Images. Proc. of SPIE Conf. on Applications of AI: Machine

Vision and Robotics, 1708:108—119, 1992.

S. G. Nadabar and A. K. Jain. Parameter estimation in MRF line process. In
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,

~q 528—533, Champaign, Illinois, 1992.

 

254
[135] M. Nagao and T. Matsuyama. A Structural Analysis of Complex Aerial Pho-

tographs. Plenum Press, New York, 1980.

[136] M. Nagao, T. Matsuyama, and H. Mori. Structural analysis of complex aerial
photographs. In R. Engelmore and T. Morgan, editors, Blackboard Systems,

chapter 9, pages 219—230. Addison Wesley, 1988. .1

[137] H.-H. Nagel and W. Enkelmann. An investigation of smoothness constraints

 

for the estimation of displacement vector ﬁelds from image sequences. IEEE b .

Transactions on Pattern Analysis and Machine Intelligence, 8:565—593, 1986.

[138] K. Nakayama and S. Shimojo. Experiencing and perceiving visual surfaces.

Science, 257:1357—1363, 1992.

[139] V. S. Nalwa. Line—drawing interpretation: A mathematical framework. Inter-

national Journal of Computer Vision, 2:103—124, 1988.

[140] S. K. Nayar, X.—S. Fang, and T. Boult. Removal of specularities using color
and polarization. In Proceedings of IEEE Conference on Computer Vision and

Pattern Recognition, pages 583—590, New York, 1993.

[141] U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, New York, NY,

1967.

[142] Q.L. Nguyen and MD. Levine. 3-d object representation in range images using
geons. In Proc. the 11th [APR International Conference on Pattern Recognition,

volume ICPR A, pages 149—153, The Hague, 1982.

[143]

[144]

[145]

[140]

[147]

[148]

[149]

[150]

255

Y. Ohta, K. Mennobu, and T. Sakai. Obtaining surface orientation from texels
under perspective projection. In Proc. 7th. Int. Jt. Conf. Artiﬁcial Intell, pages

746—751, Oulu, Finland, 1981.

J. Oliensis. Shape from shading as a partially well-constrained problem. Com-

puter Vision, Graphics, and Image Processing, 54:163—183, September 1991.

J. Oliensis and P. Dupuis. Direct method for reconstructing shape from shading.
In Proc. of SPIE conference on Geometric Methods, volume 1570, pages 116—

128, San Diego, California, July 1991.

S. Pankanti, C. Dorai, and A. K. Jain. Robust feature detection for 3d object
recognition. In Proceedings of 2031 SPIE Conference on Geometric Methods In

Computer Vision II, pages 366—377, San Diego, CA, July 1993.

S. Pankanti and A. K. Jain. On integration of vision modules. Technical Report

TR—CS, Michigan State University, June 1994.

S. Pankanti, A. K. Jain, and M. Tuceryan. On integration of vision modules. In
Proceedings of IEEE conference on Computer Vision and Pattern Recognition,

pages 316—322, Seattle, WA, June 1994.

T. Pavlidis. Algorithms for Graphics and Image Processing. Computer Science

Press, Rockville, Maryland, 1982.

T. Pavlidis and C. Liow. Integrating region growing and edge detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI—12:225—233,

March 1990.

'1‘}.-

 

[151]

[152]

[153]

[154]

[155]

[150]

[157]

[158]

256
A. P. Pentland. Shading into texture. Artiﬁcial Intelligence, 29:147—170, 1986.

A. P. Pentland. Local shading analysis. In B. K. P. Horn and M. J. Brooks,
editors, Shape from Shading, chapter 15, pages 443—488. The MIT Press, Cam-

bridge, MA, 1989.

T. Poggio, V. Torre, and C. Koch. Computational vision and regularization

theory. Nature, 317 :638—643, 1985.

K. Prazdny. Detection of binocular disparities. Biological Cybernetics, 52:93—99,

1985.

V. S. Ramachandran. The utilitarian theory of perception. In Proceeedings of

APA Symposium: Theories of Perception, Washington, DC, 1986.

V. S. Ramachandran. Interaction between motion, depth, color and form: the
utilitarian theory. In A. Blake and T. Troscianko, editors, AI and the Eye,

chapter 31, pages 348-360. John Wiley and Co., New York, 1990.

V. S. Ramachandran. Visual perception in people and machines. In A. Blake
and T. Troscianko, editors, AI and the Eye, chapter 3, pages 21—77. John Wiley

and Co., New York, 1990.

V. S. Ramachandran. Biological perspective. In G. Carpenter and S. Grossberg,
editors, Neural networks for vision and image processing, chapter 3, pages 46—

91. The MIT Press, 1992.

 

[159]

[160]

[101]

[102]

[163]

[104]

[165]

[100]

257

B. S. Rao and H. Durrant-Whyte. A decentralized Bayesian algorithm for
identiﬁcation of tracked targets. IEEE Transactions on Systems, Man, and

Cybernetics, SMC-23(6):1683—1698, November/December 1993.

R. Reiter and A. K. Mackworth. A logical framework for depiction and image

interpretation. Artiﬁcial Intelligence, 41:125—155, 1990.

I. Rigoutsos and R. Hummel. Distributed Bayesian object recognition. In
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,

pages 180—186, New York, 1993.

J. Rissanen. A universal prior for integers and estimation by minimal description

length. Annals of Statistics, 11(2):416—431, 1983.

L. G. Roberts. Machine perception of three-dimensional solids. In J. Tippett
et al., editor, Optical and Electro-Optical Information Processing. The MIT

Press, Cambridge, MA, 1965.

A. Rosenfeld. ‘Expert’ Vision systems: Some issues. Computer Vision, Graph-

ics, and Image Processing, 34(2):99—102, 1986.

A. Rosenfeld and R. A. Hummel. Scene labeling by relaxation operations. IEEE

Transactions on Systems, Man, and Cybernetics, SMC—6(6):420—433, June 1976.

S. Sarkar and K.L. Boyer. Integration, inference, and management of spatial
information using Bayesian networks: Perceptual organization. IEEE Tinnsac-

tions on Pattern Analysis and Machine Intelligence, 15:256—274, 1993.

 

258

[167] A. H. Schistad-Solberg, A. K. Jain, and T. Taxt. Multisource classiﬁcation of

[168]

[109]

[170]

[171]

[172]

[173]

remotely sensed data: Fusion of Landsat TM and SAR images. IEEE Transac-

tions on Geoscience and Remote Sensing, 32(4):768—778, July 1994.

R. Sekuler and R. Blake. Perception. McGraw Hill, New York, 1990.

G. Shafer. A Mathematical Theory of Evidence. Princeton University Press,

Princeton, NJ, 1976.

S. A. Shafer, A. Stentz, and C. E. Thorpe. An architecture for sensor fusion on
mobile robot. In IEEE International Conference on Robotics and Automation,

pages 2002—2011, San Francisco, CA, 1986.

W.J. Shomar and TY. Young. Three-dimensional shape recovery from line
drawings. In Tzay Y. Young, editor, Handbook of Pattern Recognition and
Image Processing, volume 2 (Computer Vision), pages 53—100. Academic Press,

San Diego, CA, 1994.

W. Singer, C. Gray, P. Konig, A. Artola, and S. Brocher. Formation of cor-
tical assemblies. In Cold Spring Harbor Symposia on Quantitative Biology,

volume 50, pages 939—952, 1990.

Ajit Singh and Peter Allen. Image-ﬂow computation: An estimation-theoritic
framework and a uniﬁed perspective. Computer Vision, Graphics, and Image

Processing, 56(2):152—177, September 1992.

 

 

[174]

[175]

[176]

[177]

[178]

[179]

259

F. Solina and R. Bajcsy. Recovery of parametric models from range images:
The case for superquadrics with global deformations. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 12(2):131—147, February 1990.

H. W. Sorenson, editor. Parameter estimation: Principles and problems. Marcel

Dekker, New York, 1980.

G. Stockman, G. Lee, and S. W. Chen. Reconstructing line drawings from
wings: The polygonal case. In Proceedings of the Third IEEE International

Conference on Computer Vision, pages 526—529, Osaka, Japan, 1990.

K. Sugihara. Machine Interpretation of Line Drawings. The MIT Press, Cam-

bridge, MA, 1986.

B. J. Super and A. C. Bovik. Shape from texture using local spectral moments.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4):333—

343, 1995.

R. Szeliski. Regularization uses fractal priors. In National Conference on Arti-

ﬁcial Intelligence, pages 749—754, Seattle, WA, 1987.

[180] M. J. Tarr and M. J. Black. A computational and evolutionary perspective on

the role of representation in vision. In preparation, June 1992.

[181] R. R. Tenney and N. R. Sandell. Strategies for distributed decisionmaking.

IEEE Transactions on Systems, Man, and Cybernetics, 11(8):527—538, 1981.

 

[182]

[183]

[184]

[185]

[186]

[187]

[188]

[189]

260

R. R. Tenney and N. R. Sandell. Structures for distributed decisionmaking.

IEEE Transactions on Systems, Man, and Cybernetics, 11(8):517—526, 1981.

D. Terzopoulos. Multilevel computational processes for visual surface recon-

struction. Computer Vision, Graphics, and Image Processing, 24:211—221, 1983.

D. Terzopoulos. Multilevel reconstruction of visual surfaces: variational princi-
ples and ﬁnite-element representations. In A. Rosenfeld, editor, Multiresolution
Image Processing and Analysis, pages 237—310. Springer—Verlag, New York,

1984.

D. Terzopoulos. Image analysis using multigrid relaxation. IEEE Transactions

on Pattern Analysis and Machine Intelligence, 8(2), March 1986.

D. Terzopoulos, A. Witkin, and M. Kass. Symmetry—seeking models and 3D

object recognition. International Journal of Computer Vision, 1:211—221, 1987.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston

& Sons, Washington, DC, 1977.

J. Ton and Anil K. Jain. Registering Landsat image by point matching. IEEE

Transactions on Geoscience and Remote Sensing, 27(5):642-651, 1989.

D. A. Trytten. Integrating Diverse Perceptual Modules to Create a 2.5 Dimen-
sional Sketch. PhD thesis, Michigan State University, E. Lansing, Michigan,

1992.

 

261

[190] D. A. Trytten and M. Tuceryan. Segmentation and grouping of object bound-

[191]

[192]

[193]

[194]

[195]

[190]

aries using energy minimization. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 730—731, Maui, Hawaii, June

1991.

D. A. Trytten and Mihran Tuceryan. Construction of labeled line drawings

from intensity images. Pattern Recognition, 28(2):171—198, February 1995.

PS. Tsai and M. Shah. A fast linear shape from shading. In Proceedings of
IEEE Conference on Computer Vision and Pattern Recognition, pages 716—718,

Champaign, Illinois, 1992.

Ch. von der Malsburg and W. Schneider. A neural cocktail—party processor.

Biological Cybernetics, 54:29-40, 1986.

D. Waltz. Understanding line drawings of scenes with shadows. In P. H. Win-
ston, editor, Psychology of Computer Vision, pages 19—91. McGraw-Hill, New

York, NY, 1975.

J. Weng, N. Ahuja, and T. S. Huang. Matching two perspective views. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(8):806—825, Au-

gust 1992.

J. Weng, N. Ahuja, and T. S. Huang. Optimal motion and structure estimation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):864—

884, September 1993.

[197)

[198]

[199]

[200]

[201]

[202]

[203]

[204]

262

Juyang Weng, Thomas S. Huang, and Narendra Ahuja, editors. Motion and

structure from image sequences. Springer-Verlag, New York, 1993.

A. P. Witkin. Recovering surface shape from texture. Artiﬁcial Intelligence,

17:17—45, August 1981.

A. P. Witkin and J. M. Tenenbaum. On the role of structure in vision. In
J. Beck, B. Hope, and A. Rosenfeld, editors, Human and Machine Vision, pages

481—543. Academic Press, New York, 1983.

J. M. Wolfe and K. R. Cave. Deploying visual attention: The guided search
model. In A. Blake and T. Troscianko, editors, AI and the Eye, chapter 4, pages

79—103. John Wiley and Co., New York, 1990.

L.B. Wolff and TE. Boult. Constraining object features using a polariza—
tion model. IEEE Transactions on Pattern Analysis and Machine Intelligence,

13:635—657, August 1991.

Robert J. Woodham. Determining Surface Curvature with Photometric Stereo.
In Proc. 1989 IEEE International Conference on Robotics and Automation,

pages 36—42, May 1989.

G. Yang and TS. Huang. Human face detection in a scene. In Proceedings of
IEEE conference on Computer Vision and Pattern Recognition, pages 453—458,

New York, 1993.

Semir Zeki. A vision of the brain. Blackwell Scientiﬁc Publications, Boston,

1993.

263
[205] R. Zhang, P.S. Tsai, J. E. Cryer, and M. Shah. Analysis of shape from shading

techniques. In Proceedings of IEEE conference on Computer Vision and Pattern

Recognition, pages 377—384, Seattle, 1994.

[206] S. W. Zucker. Vertical and horizontal processes in low level vision. In A. Hanson
and E. Riseman, editors, Computer Vision Systems, pages 187—195. Academic

Press, New York, NY, 1978.

[207] S. W. Zucker and R. A. Hummel. Toward a low-level description of dot clus-
ters: Labeling edge, interior and noise points. Computer Graphics and Image

Processing, 9:213—233, 1979.