This is to certify that the dissertation entitled "The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for View-Based Object Recognition and Image Retrieval," presented by Daniel L. Swets, has been accepted towards fulfillment of the requirements for the PhD degree in Computer Science.

THE SELF-ORGANIZING HIERARCHICAL OPTIMAL SUBSPACE LEARNING AND INFERENCE FRAMEWORK FOR VIEW-BASED OBJECT RECOGNITION AND IMAGE RETRIEVAL

By Daniel L. Swets

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1996

ABSTRACT

THE SELF-ORGANIZING HIERARCHICAL OPTIMAL SUBSPACE LEARNING AND INFERENCE FRAMEWORK FOR VIEW-BASED OBJECT RECOGNITION AND IMAGE RETRIEVAL

By Daniel L. Swets

A self-organizing framework for object recognition is described. The SHOSLIF (Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework) system uses the theories of optimal linear projection for automatic optimal feature selection and a hierarchical structure to achieve a logarithmic retrieval complexity. A Space-Tessellation Tree is automatically generated using the Most Expressive Features (MEFs) and the Most Discriminating Features (MDFs) at each level of the tree. We use incremental learning for large-scale and open-ended learning problems, and we allow for perturbations in the size and position of objects in the images through learning. We demonstrate the technique on a large database of widely varying real-world objects in natural settings, and show the applicability of the approach even for large variability within a particular class, including 3D rotation.

To Debby, for making the sacrifices that let this happen.
To Joseph, who admonished me to work very hard.
To Kira, who always smiled away the troubles of the day.

COR MEUM TIBI OFFERO DOMINE PROMPTE ET SINCERE

ACKNOWLEDGMENTS

I would like to acknowledge those people and institutions that have assisted me through my PhD program at Michigan State University. First I would like to thank my advisors, Dr. John Weng and Dr. Anil K. Jain, for their patient and persistent guidance in my studies. Without their guidance, this thesis would never have been completed.
Their ideas, advice, and encouragement over the years have proved invaluable in my research activities. Dr. Weng was a constant fountain of ideas for alternative ways of approaching a problem. Dr. Jain, with the great time constraints that his stature in the vision community demands, always took the time to be genuinely interested in the progress of my studies. I would also like to thank my guidance committee members for the time and energies spent helping me work through my research. Dr. George Stockman provided practical ideas and insightful suggestions as to how the project could be improved. Dr. Jerry Schuur provided the keys to the mathematical basis and gave pointers for further analysis.

I would like to acknowledge the financial support provided by the National Science Foundation, the Office of Naval Research, the Army Research Labs, and Ameritech. Without this institutional support, my studies at Michigan State University would not have been possible.

I would like to thank my children Joseph and Kira, for putting up with Daddy being gone one more night. Finally, I would like to thank my wife Debby for her unwavering love, support, and encouragement over the years that this thesis developed. Her commitment to this degree was necessarily more formidable than my own, and without her backing, my studies would not have been successful.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 SHOSLIF-O: The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for Object Recognition
  1.1 Overview of the SHOSLIF
  1.2 Contributions of the Thesis
2 Object Recognition
  2.1 Sensing scheme
    2.1.1 Desktop Scanners
    2.1.2 Cameras
    2.1.3 Range scanners
  2.2 Model representation
    2.2.1 Feature-based representations
    2.2.2 Multiple view representations
    2.2.3 Hierarchical representations
  2.3 Model-based object recognition
    2.3.1 Interpretation tree
    2.3.2 Geometric Hashing
    2.3.3 Evidence-Based Recognition
    2.3.4 Geons
    2.3.5 Distance minimization
    2.3.6 Pose Clustering
    2.3.7 Graph Matching
    2.3.8 Alignment
    2.3.9 2D object recognition
  2.4 Face recognition
    2.4.1 The human capacity for face recognition
    2.4.2 Representations for automatic systems
    2.4.3 Face Localization
    2.4.4 Face Recognition
    2.4.5 Profile Recognition
  2.5 Self-organizing approaches
  2.6 Summary
3 The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF)
  3.1 System overview
  3.2 Visual attention
  3.3 The Most Expressive Features (MEF)
    3.3.1 Density of the space
    3.3.2 Principal component analysis
    3.3.3 Computational Considerations
  3.4 The Most Discriminating Features (MDF)
    3.4.1 Multivariate linear discriminant analysis
    3.4.2 Computational Considerations
    3.4.3 The DKL projection
    3.4.4 Explanation of the MDFs
  3.5 Space Tessellation
    3.5.1 The Hierarchy Decomposes the Problem
    3.5.2 Hierarchical Quasi-Voronoi Tessellation
    3.5.3 Incremental Learning
    3.5.4 The Adoption Problem
    3.5.5 Conditional Sterility Principle
    3.5.6 Properties of the Hierarchical Quasi-Voronoi Tessellation Algorithm
  3.6 Image Retrieval
  3.7 Summary
4 SHOSLIF for Object Recognition
  4.1 SHOSLIF-O using manually-defined hierarchical labels
  4.2 SHOSLIF-O Options for Automatic Space Tessellation
    4.2.1 Distance measures
    4.2.2 Use of Multiple Competing Paths
    4.2.3 Expected Radius
  4.3 Experimental Results
    4.3.1 Space comparison
    4.3.2 The Clustering Effect of the MDF subspace using the DKL projection
    4.3.3 Face database
    4.3.4 Combination database: Faces and Other Objects
    4.3.5 Handling 2D variation
    4.3.6 Handling different 3D views
    4.3.7 Neighborhood information
    4.3.8 Timing
  4.4 Summary
5 Summary and Future Research
  5.1 Summary
  5.2 Future research
APPENDICES
A Space comparison test database
B Face Database
BIBLIOGRAPHY

LIST OF TABLES

2.1 Classification of Major Approaches to Object Recognition
3.1 A characterization of a sample tree built using the Hierarchical Quasi-Voronoi Tessellation Algorithm. The table lists the number of nodes found on each level of the tree, because the tree is too large to show in detail.
4.1 An example of a set of hierarchical labels that could be used in the hierarchical labels mode.
4.2 Results of testing the tree built under the hierarchical labels learning mode.
4.3 Results of subspace comparison study. For the "flat" databases, only a single level of the tree was created. For the single projection databases, the MEF/MDF projection matrices were only computed at the root node, and the same projection matrices were used throughout the tree. For the multiple projection trees, a new projection was made at each node of the tree.
4.4 Results showing the superiority of the MDF subspace for subclass selection.
4.5 Summary of experiment on a face database of 1042 images (384 individuals). For re-substitution each training image was given as a test probe. For the disjoint test set, a list of 246 images not found in the training set was used for testing.
4.6 Summary of large natural scene experiment. The training images were drawn at random from the pool of available images, with the remaining images serving as a disjoint set of test images.
4.7 Summary of the scale generalization experiment on the large combination face and other object data set. A disjoint test set was used for testing the retrieval capability; each test probe was randomly scaled in the range of [-20, +20]% of the fovea size to test the ability of the system to generalize over various scales.
4.8 Summary of the learning-based parameter generalization experiment. A disjoint test set was used for testing the retrieval capability.
4.9 Summary of the learning-based view generalization experiment. A disjoint test set was used for testing the retrieval capability.
4.10 Summary of blurring experiment.
4.11 Average time required to perform a single probe in Tree mode and Non-tree mode. In contrast to the rapid retrieval of 9.1 seconds for images from the tree version using fifteen competing paths, when a "flat" database is used, i.e., one in which no hierarchy is utilized, a test probe into the same data requires an average of 400.7245 seconds to complete.
4.12 Recognition results using the flat database for the combination database.

LIST OF FIGURES

2.1 Pinhole camera model. The y-z projection is similar to the x-z projection shown in the figure.
2.2 Stereo camera imaging system. Based on Fig. 2.3 in Ballard and Brown, Computer Vision.
2.3 Smoothed local symmetry. A and B are points on the bounding contour of a shape. The angle formed by AB and the outward normal at A equals the angle formed by AB and the inward normal at B. The midpoint of AB defines the local symmetry. The loci of local symmetries which are maximal with respect to forming a smooth curve is called a spine.
3.1 A tree view of the SHOSLIF network. Each node corresponds to a processing element.
3.2 The network structure of the SHOSLIF. Each level of the network has a variable number of nodes. A node at level l0 has a sub-network at levels l > l0. A node consists of a processing module and an attention supplier. As indicated by the multiplexers in the figure, each of the processing modules can activate one of the attached nodes at the next level of the network. Each of the processing modules controls an attention module, which is responsible for supplying properly aligned fovea images to the activated node.
3.3 A top-level flow of the processing performed at each node in the Space-Tessellation Tree during the training phase. A set of training samples which enter the processing element are extended to allow for learning-based generalization for position, scale, and orientation. These extended samples are vectorized and used to produce the projection matrices to the MEF and MDF subspaces. The extended samples are projected to the MDF subspace using these matrices, and a tessellation of the space covered by the node being worked on is produced. The projection matrices and the space tessellation for each node are produced in the learning phase.
3.4 The vectorization of an image. SHOSLIF treats a p x q image as a point in pq-dimensional space. The rows of pixels in an image are concatenated together to form a vector.
3.5 A histogram of inter-point distances for the images in the database treated as vectors. The horizontal axis shows the distance between a pair of image vectors; the vertical axis shows the number of pairs that fell into each bin.
3.6 A histogram of inter-point distances for the images projected to the MEF space. The horizontal axis shows the distance between a pair of MEF vectors; the vertical axis shows the number of pairs that fell into each bin.
3.7 Example MEFs and their ability to approximate a training set. The training images were taken from the Weizmann Institute of Science in Israel. The images of a single individual are shown.
3.8 Example MEFs for a large training set. The top thirty are shown for a training set of 168 images. Six images of each of 28 individuals were used for training this set.
3.9 Problems with the MEF projection for class separation. The MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation.
3.10 A histogram of inter-point distances for the images projected to the MDF space. The independent axis shows the distance between a pair of MDF vectors; the dependent axis shows the number of pairs that fell into each bin.
3.11 A sample of MEF and MDF vectors treated as images. The MEF vectors in (a) show the tendency of the principal components to indicate major variations in the training set, such as lighting direction. The MDF vectors in (b) show the ability of the MDFs to discount those factors unrelated to classification. The training images used to produce these vectors are courtesy of the Weizmann Institute.
3.12 Distribution of some samples using the best two features in the MEF and the MDF spaces for the root node. In the MDF subspace, objects of the same class are clustered much more tightly than in the MEF space.
3.13 The SHOSLIF's hierarchical Quasi-Voronoi space tessellation. Digits and "+" symbols indicate centers of cells at levels l, l+1, and l+2. Dark lines indicate cell boundaries at a level of the Space-Tessellation Tree nearer the root.
3.14 An example tessellation of the space covered by node N.
3.15 An example showing the complexity of the class separation problem at two different levels of the tree. Samples from the same class are given for both graphs. Since the internal node contains many fewer samples than the root node, the MDF vectors can be optimized for the classes contained in the node.
3.16 An example tree built with both the MEF and the MDF subspaces. Shaded nodes indicate that the MDF space was used; square nodes indicate that the MEF space was used; bold nodes are leaf nodes.
3.17 An example tessellation where grandchild G is close to a tessellation boundary and outside the expected distance radius for its nearest parent C1.
3.18 An incorrect tessellation of node N that would result from adding X as a child of N. The subtree rooted at G would be adopted by X and therefore unreachable if this tessellation were to be established.
3.19 Lemma 1
3.20 An example tessellation where changing the tessellation would not result in the loss of a subtree.
3.21 A valid tessellation of node N resulting from the addition of X as a child of N. No subtrees of N are lost by this tessellation modification.
3.22 Algorithm 1, The Hierarchical Quasi-Voronoi Tessellation Algorithm
3.23 Theorem 1
3.24 Hyperspheres used in Theorem 1.
3.25 Lemma 2
3.26 The general flow of each SHOSLIF Processing Element.
3.27 A tessellation in which a single-path search may not find a nearest match. Node N at level l will shuttle test probe X to its nearest child C2. The nearest neighbor for X may well be found in the subtree of G, which is missed in a single-path search.
3.28 Algorithm 2, The Image Retrieval Algorithm
3.29 Theorem 2
4.1 Nodes used for training the MDFs.
4.2 The set of hierarchical labels used.
4.3 Performance of the Normalized Distance measure as a function of the number of nearest neighbors explored. Both the distance ranking method and the confidence ranking method are shown.
4.4 Performance of the feature space distance measure as a function of the number of nearest neighbors explored. When the history was maintained, the results traced the same lines as when history was not maintained.
4.5 Performance of the feature space distance measure using image space instead of MEF subspace once a single class has been found. The performance is given as a function of the number of nearest neighbors explored.
4.6 Distance from Subspace description for 3D. The subspace is shown as the plane, and is defined by the matrix V. Two classes are shown in this example, and the MDF vector that can separate them is shown.
4.7 Performance of the DFS distance measure, given as a function of the number of nearest neighbors explored.
4.8 A comparison of the performance for various k competing paths.
4.9 The performance of the system for different numbers of MEF/MDF features. The number of features from the subspace used was varied to show how the MDF subspace outperforms the MEF subspace. 95% of the variance for the MDF subspace was attained when fifteen features were used; 95% of variance for the MEF subspace did not occur until 37 features were used. Using 95% of the MEF variance resulted in an 89% recognition rate, and that rate was not improved using more features.
4.10 A sample of the Weizmann Institute face data. The frontal images for an individual contain two expressions; for each expression, a set of three images with differing lighting conditions forms the set of images available for an individual.
4.11 Example of how well within-class variation is handled. All search images were correctly classified into the class defined by the training images.
4.12 Representatives of training images from some of the various classes learned. Fovea images are shown without the pixel weighting; this pixel weighting will tend to suppress the background. Objects of interest are at the center of the fovea images. In the learning phase for this experiment, the fovea was generated using manual extraction of the areas of interest.
4.13 Example of a failed search probe. The retrieval failed to select the appropriate class due to a lack of 3D rotation in the set of training images.
4.14 Generalization for size and position of objects. The search probes were synthetically generated from images in a disjoint test set. Each search probe retrieved an image from the appropriate class. The images shown in the first line were the original images. The scaling shown is 50% of the fovea size (both positively and negatively); the position change shows a 12% negative change in the fixation point for both the row and the column coordinates.
4.15 Example probes and their retrieved images when scaling was enabled on the tree and when scaling was disabled on the tree during the training phase.
4.16 A sample of the Weizmann Institute face data. Each individual for this experiment contained five viewpoints under identical lighting conditions.
4.17 View-based generalization. When sufficient training images are given for a particular class, the system is able to accurately retrieve images from the correct class.
4.18 Example failure in the learning-based view generalization experiment. The failures occurred only when the viewing angle of the test probe did not fall between the viewing angles of two training images. These images are courtesy of the Weizmann Institute.
4.19 A stack of multi-level resolution images. Each level is a subsample of pixels from a group in the next lower level.
4.20 A stack of blurred images. The pixels in each level correspond to a linear combination of neighborhood pixels at the next lower level.
4.21 Examples of blurring applied to different images. At levels of the tree nearer the root, more blurring is applied to both the training and the test images. As the processing moves away from the root node of the tree, less and less blurring is applied to provide more detail for the finer classifications. Shown here are examples of images with progressive blurring applied to them.
4.22 Timing results obtained from running the system in the tree mode. This data shows the linear relationship between the time required to complete a test probe retrieval and the number of competing paths. This linear relationship demonstrates how the k competing paths explore only a band of the tree.
5.1 Major strengths and limitations of the described system.
A.1 Training images for space and hierarchy study.
A.2 Test images for space and hierarchy study. Each probe image is followed by the list of the top five responses using pure image space, MEF space, and MDF space. Beneath each retrieved image is the distance measure computed between the test probe and the retrieved image.
B.1 Database of face training images.
B.2 Database of face test images.

Chapter 1

SHOSLIF-O: The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for Object Recognition

In order for an intelligent machine to interact with its environment, it must be able to locate and recognize objects from its environment [67]. That is, the machine must sense its environment, represent this sensed data in some usable form, and match objects found in the environment with the set of objects known to the machine [110]. The machine invokes the computer vision module when the sensed data is visual input.
Researchers have worked diligently to have their computer vision modules emulate the capability of biological visual systems, but the generality, adaptability, and robustness of these biological systems have eluded researchers from a wide range of disciplines. A central task in a computer vision module is the recognition of objects from various images of the machine's environment [11] [47] [50] [82] [85]. Model-based object recognition is the domain where a model exists for every object in the recognition system's universe of discourse. The research emphasis in this paradigm has historically been on the design of efficient matching algorithms from a manually designed feature set with hand-crafted shape rules, as discussed in chapter 2.

Manually designing a feature set is appealing because such a feature set is very efficient. When designed properly, a very small number of parameters for each of the objects is sufficient to capture the distinguishing characteristics among the objects to be recognized. This premeditated efficiency is bittersweet, however, in that generalization of the features to objects other than those for which they were designed is usually impossible. For example, parameters painstakingly tuned to efficiently discriminate between good and bad cherries will probably be useless in differentiating grades of lumber. Automatic selection of features is thus a crucial step toward a general, fully automatic recognition system.

An alternative to hand-crafting features is the self-organizing approach, in which the machine will automatically determine what features to use and how to organize the knowledge structure [8] [9] [45] [3] [100] [104] [99] [120] [131] [130]. (In the neural network community, self-organization is often taken to be synonymous with unsupervised learning [76]. In this document, by self-organization we mean that the system is able to organize its structure autonomously; that is, we keep the learning modality separate from the structure of the network. Our learning mode can either be unsupervised, i.e., no class labels, or supervised, i.e., class labels are associated with each training sample; in either learning mode, the system organizes its hierarchical structure, so we call the system self-organizing.) In this framework, the recognition phase of the system is preceded by a learning phase. The learning phase focuses on the methods by which the system can automatically organize itself for the task of object recognition, giving it a wide range of generality [96], [97]. If designed properly, this approach can deal directly with complex, real-world images [160] because the system is able to learn the environment in which it is to function, thereby making it much more flexible than its static counterparts. Self-organizing object recognition systems are open-ended, allowing them to learn and improve continuously [1] [54] [125]. As such, these systems are not restricted by their human designer's limitations in understanding vision and its rules.

Allowing the system to organize itself, however, raises some important efficiency issues. The first one is the feature selection issue. A well-known problem in feature selection is called "the curse of dimensionality": more features do not necessarily imply a better classification success rate, and in fact, the performance can actually drop off.
For example, the principal component analysis, also known as the Karhunen-Loève projection and "eigenfeatures," has been used for face recognition [122] [154] and lip reading [17]. An eigenfeature, however, may represent aspects of the imaging process which are unrelated to recognition (for example, the illumination direction). An increase or decrease in the number of eigenfeatures that are used does not necessarily lead to an improved success rate.

The second issue is how the system should organize itself. Given a k-dimensional feature space with n objects, a linear search is impractical since each recognition probe requires O(n) computations (1 billion visits to models for a database of one billion objects). So how does a self-organizing system automatically find the most useful features in the images? How can this be done without restricting the domain to which the system will be applicable? How can the system learn and recognize a huge number of objects (say a few million) without slowing the process down to a crawl? The work described here addresses these crucial issues. We discuss how to automatically find the most discriminating features for view-based object recognition and image retrieval, and how a hierarchy can be utilized to achieve a very low computational complexity in both the training and the retrieval phases. Preliminary experiments utilize the largest number of diverse classes (48 plus faces) of natural objects reported in the literature so far. The work by Flynn and Jain [52] has 100 objects, using range images. The work by Weng et al. [160] uses intensity images and describes the use of 11 classes of objects plus faces.

1.1 Overview of the SHOSLIF

The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space defined by the training images. This space is generated using two projections: a Karhunen-Loève projection to produce a set of Most Expressive Features (MEFs), and a subsequent discriminant analysis projection to produce a set of Most Discriminating Features (MDFs). The MEFs are well-suited to image characterization, and can be used as a representation of images. The MDFs are better suited to class discrimination than the MEFs are.

The system builds a Space-Tessellation Tree that partitions these MEF/MDF spaces to provide an O(log n) complexity for retrieving images from a database of images. The children of a node N represent a tessellation of the space subtended by N, using incremental learning to build the tree. This tree could suffer from the Adoption Problem, which occurs when a new node being added to the tree "adopts" children from another node's subtree. The Hierarchical Quasi-Voronoi Tessellation algorithm shown in Algorithm 1 effectively and efficiently deals with this problem using the Conditional Sterility Principle. This tree-building routine will generate a bounded number of children for each node, and with a uniform distribution of input samples, will generate O(log n) levels.

1.2 Contributions of the Thesis

This work represents a significant advance towards recognizing a large number of diverse real-world objects in natural settings. Although the Cresceptron [160] has demonstrated a similar versatility, the SHOSLIF-O described here is unprecedented in its efficiency, optimality, and the number of classes learned and reported.
This work has produced good recognition results without imposing hand-crafted shape rules or human-specified features, which is a major departure from the traditional shape-rule-based recognition methods.

The detailed techniques that are used in this work are not entirely new. The techniques have been individually used in different fields for different topics, such as in statistics and pattern recognition. However, the basic approach and actual framework are both new in addressing the very complex problem of recognition. Although the Cresceptron attacked this task of recognition from views of general objects, a large number of objects was not possible until this new framework, due to the way that the Cresceptron organized its database. Two major contributions of the SHOSLIF work are its generality and scalability, i.e., the capability and potential for applying to a large number of problems of different natures and the capability to recall from a large number of learned views quickly.

The Most Expressive Features (MEFs), also known as principal component features and "eigenfeatures" [122], have been used in various capacities and applications, such as face recognition [154], image representation [94], and image processing planning [116]. We develop these features and describe their use for the characterization of a class of objects in our database. But we also show their limitations in terms of discriminating between classes. These MEFs can deftly represent images in a compact and efficient manner, but they are not necessarily good features for distinguishing between different classes of objects. The MEFs describe some major variations in the set of sample images; these variations, however, may be completely irrelevant to how the classes are divided. Those features tuned to discriminating between classes of sample objects can be found using the discriminant analysis procedure [84]; because they optimally discriminate among the training classes in the sense of a linear transform, we call them the Most Discriminating Features (MDFs). This thesis describes the Discriminant Karhunen-Loève (DKL) projection for dealing with discriminant analysis in high dimensional space relative to the number of training samples. The standard discriminant analysis procedure breaks down when the within-class scatter of the training set of samples becomes degenerate, such as when the number of samples is less than the dimensionality of the sample vectors. The DKL projection performs the discriminant analysis projection in the space of the Karhunen-Loève projection to avoid this problem.

The SHOSLIF produces a hierarchical space tessellation using the hyperplanes determined by these MEFs and MDFs. The feature space of all possible images is partitioned into cells of varying sizes to produce a hierarchical Quasi-Voronoi tessellation of the feature space. As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the database into increasingly simpler problems. This has a two-fold advantage: the Space-Tessellation Tree provides an efficient means for image retrieval, and the decomposition of the problem into simpler problems provides a mechanism for the linearization of the class separation problem. Though the classes may not be linearly separable at some level of the space partitioning, as the Space-Tessellation Tree recursively subdivides the space, they will become linearly separable at a deeper level of the tree.
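The following is a minimal sketch, in NumPy, of how such a two-stage DKL projection can be computed: a Karhunen-Loève (PCA) projection to the MEF subspace, followed by linear discriminant analysis carried out inside that subspace, where the within-class scatter is no longer degenerate. The function names, the dimensionalities, and the use of a pseudo-inverse are choices made for this illustration only, not the exact procedure developed in Chapter 3.

```python
import numpy as np

def dkl_projection(X, y, mef_dim=30, mdf_dim=10):
    """Sketch of a Discriminant Karhunen-Loeve (DKL) projection.

    X: (n_samples, n_pixels) image vectors; y: class labels.
    Step 1 (MEF): PCA / Karhunen-Loeve projection.
    Step 2 (MDF): discriminant analysis computed in the MEF subspace.
    """
    mean = X.mean(axis=0)
    Xc = X - mean

    # --- Step 1: Most Expressive Features (principal components) ---
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_mef = Vt[:mef_dim].T                  # (n_pixels, mef_dim)
    Z = Xc @ W_mef                          # samples in MEF space

    # --- Step 2: Most Discriminating Features (LDA in MEF space) ---
    classes = np.unique(y)
    d = Z.shape[1]
    Sw = np.zeros((d, d))                   # within-class scatter
    Sb = np.zeros((d, d))                   # between-class scatter
    gmean = Z.mean(axis=0)
    for c in classes:
        Zc = Z[y == c]
        cmean = Zc.mean(axis=0)
        Sw += (Zc - cmean).T @ (Zc - cmean)
        diff = (cmean - gmean)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Solve the generalized eigenvalue problem Sb v = lambda Sw v.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    W_mdf = evecs[:, order[:mdf_dim]].real  # (mef_dim, mdf_dim)

    # Composite projection: image vector -> MDF feature vector.
    def project(img_vec):
        return ((img_vec - mean) @ W_mef) @ W_mdf
    return project
```

In this sketch the intermediate PCA step is what keeps the within-class scatter matrix invertible when the number of training images is far smaller than the number of pixels, which is exactly the degeneracy the DKL projection is designed to avoid.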
The SHOSLIF assumes a well-framed image containing a single object of interest. We describe a mechanism that uses genetic algorithms for finding those objects in a complex scene. Genetic algorithms are a search technique for dealing with a very large search space, such as the one encountered in image segmentation. This work describes a technique for using genetic algorithms for object localization in a complex scene. A brief review of genetic algorithms is given, followed by the specifics of using a genetic algorithm for this image segmentation application. The results from several experiments show that this approach is a viable method for image segmentation. Another approach to producing a well-framed image is to use a coarse-to-fine search, similar to that used by the Cresceptron [160].

Even assuming that a good visual attention technique exists for extracting images of interest to process, this visual attention mechanism may not give identically registered images with which to work. The MEF and MDF subspaces rely for their good results on images that are close to perfectly registered. In order to allow for generalization to variations in the position and scale of the objects in the training samples, this work proposes a method for producing an extended training set from the training images already given. A small variation around the attention position and scale should be sufficient for the method to succeed, since the attention supplier is assumed to provide a rough estimate of these parameters. Though for a more robust system further image acquisition could be utilized, this work describes using the existing training images for extending the training set. A set of grids for the position and scale parameters is generated to overlay the fixation point and scale found in the visual attention phase. A new training image is extracted from the image at each of these grid points, generating an extended training image set that can handle the recognition generalization to position and scale variations.

This thesis describes the Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF) and demonstrates its utility for view-based object recognition and image retrieval. We show the technique on a large multifarious set of real-world objects found in natural scenes. We demonstrate the ability of the MDF space to tolerate within-class variations and to discount such imaging artifacts as lighting direction. We have a total of 500 classes of objects, including faces, and we show experiments on a training database of over 1300 images. We provide the results of many experiments that show different favorable aspects of the system and its application to object recognition.

What we describe is not a general object recognition system. The system is view-based: in order to retrieve a correct image from an image database, the SHOSLIF requires that the system has been trained on an image taken at a very similar viewpoint (i.e., within a few degrees). This limitation is true of any approach that is based on a pixel-level representation.

In summary, the main contributions of this thesis are:

- Use of the Most Discriminating Features for automatic optimal feature selection.
- The DKL projection technique for discriminant analysis in high dimensional space relative to the number of training samples.
- An automatically generated space-tessellation tree for fast access into a large model database, and decomposition of the problem into solvable linear discriminant analysis subproblems.
- Use of genetic algorithms for object localization in a complex scene.
- Learning-based object size and position generalization.
- Application for object recognition on a large set of diverse, real-world objects in natural settings.
- Low logarithmic computational complexity made possible by the framework.
- Generality of objects handled by the system.
- Applicability to general scenes and handling of variations in a unified framework. Such a capability has been severely lacking in computer vision.

The major restrictions of the current work include:

- Well-framed images in which the object of interest is roughly centered and has a standard size.
- Limitation in dealing with a significant amount of background in the well-framed images.

Chapter 2

Object Recognition

The general object recognition problem is a difficult one. The mapping is one-to-many, i.e., many 2D images can be interpreted as the same 3D object due to the imaging process. The popular object recognition paradigms must therefore have built-in biases. These biases allow the vision system to analyze the scene successfully most of the time, but they are application dependent. When the assumptions imposed by the object recognition system are not valid, the process fails.

The steps preceding object recognition include image formation from an input sensor, possible pre-processing to enhance/restore images, image feature detection, and representation of objects found in the scene. The sensor used could be some sort of a scanner, such as those popular with optical character recognition (OCR) schemes; a single image from a single camera or multiple images from one or more cameras in order to obtain sparse depth information; or a range scanner to obtain dense depth information.

An object recognition system designer faces several issues [67], including the selection of sensory cues to be used, the representation of these cues, the storage and retrieval of objects from large libraries, the matching strategy, the position and orientation estimation of objects found in the scenes, incremental addition of objects to the library (i.e., learning), and the role of attention, object function, and context in the recognition process. Some of these issues are addressed by the various approaches to object recognition developed over the past several years.

2.1 Sensing scheme

The first step in image processing is the acquisition of the image from the real world. This can be accomplished in a variety of ways suitable for a particular application. This "image" can be a set of images presented to the image processing system as a particular instance for analysis. Thus for a system using the binocular stereo sensing scheme, the image instance is a pair of intensity images that the image processing system uses.

Image acquisition systems can be either active or passive. Passive systems accept input from the world without feedback from the other levels of image processing; active systems actively seek out the areas of importance in their worlds using the instructions given by higher-level reasoning modules.
Thus the active systems may control their sensor's direction, focus, or resolution based on the instructions derived from previously obtained images; passive systems have no such automatic adjustments.

Several different types of input sensors can be used to acquire an image for further processing. These different sensors are explored in the following sections.

2.1.1 Desktop Scanners

Images acquired for document analysis are typically obtained from a desktop or hand-held scanner. The scanning is accomplished by moving a light source coupled with a photo cell relative to the page or film being scanned. The paper or film is typically held stationary and the sensing assembly is swept along the item being scanned. For desktop scanners, this movement is mechanically handled by the scanner; hand-held scanners require the operator to sweep the sensing assembly across the page. The sensing assembly consists of a light emitter and a light sensor (i.e., a photo cell). The amount of light transmitted or reflected (for transparent film scanners or opaque paper scanners, respectively) is provided to the computer using the scanner.

2.1.2 Cameras

Camera behavior is usually approximated using a pinhole camera model, shown in figure 2.1. Typical cameras in use today are either charge injection devices (CID cameras), where charges built up on capacitor arrays are read from the location, or charge-coupled devices (CCD cameras), where an array of capacitors forms a shift register which responds to the illumination present in the scene.

Figure 2.1: Pinhole camera model. The y-z projection is similar to the x-z projection shown in the figure.

Intensity Images

Cameras which are used for producing intensity images generate a set of images I_i(r, c), i = 1, 2, ..., n, that have a spectral value for each pixel located at row r, column c in the image set. For color cameras, n = 3, and I_1 corresponds to the red image, I_2 the green image, and I_3 the blue image. Gray scale cameras have n = 1, and I_1(r, c) represents the gray scale value of the pixel at (r, c).

Depth Maps

Often, the object recognition paradigm requires depth information in order to successfully recognize objects in the sensed scene. As such, intensity images have been replaced by depth maps, either directly using range scanners or indirectly using multiple views from a set of cameras [25]. These indirect depth maps can be obtained using multiple cameras, or from a single intensity camera under photometric stereo, depth from focus, or structure from motion techniques, among others. Direct depth maps can be obtained using a range scanner.

Photometric Stereo. Photometric stereo [165] is the sensing scheme in which multiple images are obtained using a single camera in a fixed position. For each image taken by the camera for a particular image instance, the direction or method of illumination is changed. This change provides a system of irradiance equations that can be used to recover shape (and depth) from the different shadings produced by the different lighting directions.

Depth from Focus. Shape from focus uses at least two images, each taken with a different focus, in order to obtain sparse depth information. When the focal length exactly coincides with the distance between the camera and the object, theoretically no blurring will be present at the pixels comprising the object. When this focal length is changed, the amount of blurring that results about a particular pixel is used to determine how far the point is from the camera [26].
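To make the blur-to-depth relationship concrete, one standard geometric-optics sketch (not necessarily the specific formulation used in [26]) applies the thin-lens law. Here u is the object distance, v the in-focus image distance, f the focal length, s the actual lens-to-sensor distance, and D the aperture diameter:

\[
\frac{1}{u} + \frac{1}{v} = \frac{1}{f},
\qquad
b(u) = D\,s\,\left|\frac{1}{u} + \frac{1}{s} - \frac{1}{f}\right| .
\]

The blur-circle diameter b(u) vanishes when the point is exactly in focus and grows as the focus setting moves away from the point's depth, so measuring the blur at two or more focus settings allows u to be recovered.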
Structure from Motion. Inter-frame motion between successive image frames can be used to ascertain the structure and therefore the depth of the scene. The motion parameters, such as translation and rotation, can be estimated from at least two views in which corresponding features are matched. The relative depth can be estimated from these feature correspondences as long as each feature is matched across three views [162].

Multiple cameras

Multiple cameras can also be used to obtain depth information. For binocular imaging, the disparity between the two images, i.e., the difference in position between the images, is used to decode the depth information. The baseline distance can be used in conjunction with matched points to calculate the z coordinate of the object points in the real world. With the cameras configured as in figure 2.2, where the two lens centers are separated by a baseline of 2d and both cameras have focal length f, a world point matched at image positions x' and x'' has depth z = 2df / (x' - x''), so depth is inversely proportional to the disparity between the matched points. This setup provides an inexpensive way to produce reliable depth maps.

Figure 2.2: Stereo camera imaging system. Based on Fig. 2.3 in Ballard and Brown, Computer Vision.

2.1.3 Range scanners

Three-dimensional range scanners, such as the Technical Arts 100 White Scanner, produce dense depth maps directly. Active illumination is typically used on range scanners, where either a stripe or grid is projected onto the scene being sensed. The third dimension can be obtained using triangulation techniques. Alternatively, lidar technology can be used to obtain range information. This technique uses light in the same way that radar uses radio waves and ultrasonic ranging uses high frequency sound waves. Lidar systems send a pulse of light toward the scene to be sensed and measure the time taken for a reflection to return. This elapsed time is used to deduce the range between the objects in the scene and the sensor. Although these lidar scanners are very accurate and versatile sensing systems, because of the speed at which light travels, lidar scanning can only be done with very sensitive equipment.

Only those points that are visible to both the sensing device and the emitter device can have depth information computed for them. Because of this, many images from differing views must be scanned to find the concavities and eliminate self-occlusion.

2.2 Model representation

The decisions regarding the way that a model is represented, i.e., the way that an object is abstracted, stored, and utilized, are among the most important phases in the development of an object recognition system. The choice of model representation has direct implications on the types of features to extract and select, the utility of various matching techniques, and the success of the overall object recognition endeavor.

2.2.1 Feature-based representations

Feature-based object recognition systems represent the models in the database in terms of a set of N features or properties. As such, an object can be viewed as a point in this N-dimensional space. When a good choice of features is used, the feature vectors which represent the objects in the database will occupy different regions in the feature space. The choice of features depends on the types of objects that need to be recognized. This choice of features is a fundamental step in obtaining good object recognition rates using feature-based representations.

Global features

Probably the easiest and most compact representation for a set of objects is through global features. Global features are those whose measurement depends on the entire object spatially.
These features are typically measurements such as Fourier descriptors, area, perimeter, moments, circularity, elongation, and the number of connected components.

Global features, though simple and intuitively appealing, do not necessarily produce the best results in an object recognition system. They are sensitive to noise (e.g., the number of connected components can vary wildly in the presence of noise), they are not necessarily sufficient for distinguishing all shapes (e.g., the area of a circle with radius r is the same as the area of a square whose sides are s = sqrt(pi) * r), and they do not work well under occlusion (e.g., the moments of objects change drastically when the object is occluded).

Local features

Local features can be used for representation of objects as well. Local features depend only on a small spatial part of the object. These features must be organized into some structure, such as a graph whose edges represent relationships among the features found at the vertices, in order to characterize the interrelationships among the local features. The same properties of objects used as global features can be used as local features, working simply on a subregion of the image. For example, the area of a hole detected in an input image can be useful for industrial part recognition and inspection. The difference is that only the local information is used to extract the features.

Points

Lists of points to describe various objects are the most direct set of features to use. Point features are essentially pixel coordinates of important landmarks on the object being represented and attributes associated with those coordinates. The pixels can represent corners under a 2D model representation scheme or vertices under a 3D modeling technique, vertices of cones and centers of spheres for quadric surfaces, or points of high curvature or other landmark points for free-form surfaces. The attributes can represent aspects such as the amount of convexity at an object's boundary point, the torsion of the surface at a point, or the angle formed by the corners at the point under consideration. These points and their attributes can be used to build an object model.

Edges

A local edge is a small area in the image where the local gray levels (for 2D images) or the depth values (for 3D images) are changing rapidly in a simple (e.g., monotonic) way [4]. These edges can be used to represent a boundary of an object or to construct a wire-frame polyhedral approximation to an object.

Object Silhouettes. The boundary representation of an object is the most common representation for 2D objects. The boundary is typically represented by a set of points, by a chain code, or by a sequence of line segments. The results from a border-following or edge tracking algorithm produce a list of points that represents the border of a 2D object. In most cases, some subset of this list, i.e., the landmark points of high curvature, is maintained to describe the object under consideration. In other cases, the whole silhouette is stored to describe the objects in the model database [32] [37].

Alternatively, a boundary can be encoded by a chain code. The chain code is computed by quantizing the object boundary on a grid of appropriate resolution. The edges between the points on the resulting quantization are given a code based on the angle formed with a right-pointing vector. This chain code provides a drastic reduction in boundary storage requirements and can even be used directly to find many other features of the object [70].
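The sketch below computes an 8-direction chain code from an ordered list of boundary points; it is a generic illustration (not the specific scheme of [70]) and assumes the boundary has already been quantized so that consecutive points are 8-connected grid neighbors.

```python
# 8-direction chain code: each move between consecutive boundary points is
# encoded by the index of its direction, measured counter-clockwise from the
# right-pointing vector (0 = east, 1 = north-east, ..., 7 = south-east).
DIRECTIONS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
              (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(boundary):
    """boundary: ordered list of (x, y) grid points around a closed contour."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        step = (x1 - x0, y1 - y0)           # unit move on the sampling grid
        codes.append(DIRECTIONS[step])
    return codes

# A small square traversed counter-clockwise is stored as 8 direction codes
# instead of 8 coordinate pairs.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(chain_code(square))   # [0, 0, 2, 2, 4, 4, 6, 6]
```

The code list itself already exposes shape properties: for example, perimeter length is the number of codes (weighting diagonal moves by sqrt(2)), and turning angles come from differences of successive codes.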
Finally, a 2D shape can be outlined by a sequence of line segments. Finding lines using a Hough transform or some other line-finding technique produces a list of line segments that describes the boundary of a 2D object. This list is similar to maintaining a list of points, but provides additional information such as the angles formed between adjacent line segments.

Wire-frame models. Three-dimensional images can also use edges as a primitive for model representation. A 3D model consisting of only the edges of the object is called a wire-frame model [70]. Attributes that are typically associated with these wire-frame model edges are such items as the sense of the edge (i.e., is it a jump edge, a crease edge, a blade edge, etc.), the orientation of the edge, or the length of the edge.

Parts

Rather than describing an object by its silhouette, another popular, and successful [77], representation scheme views an object as a collection of smaller parts. The parts-based representation is founded in the proposition that the visual system decomposes shapes into parts. These parts should be represented by rules defining part boundaries rather than part shapes.

Skeletons. Objects that are a collection of elongated, thin pieces may be well represented by a set of curves, each of which describes a piece of the object. Smoothed Local Symmetries [16], used to generate and generalize model databases [39], find the spine of an object by locating the midpoint of the line segment joining points A and B on the silhouette of an object, as shown in figure 2.3.

Figure 2.3: Smoothed local symmetry. A and B are points on the bounding contour of a shape. The angle formed by AB and the outward normal at A equals the angle formed by AB and the inward normal at B. The midpoint of AB defines the local symmetry. The loci of local symmetries which are maximal with respect to forming a smooth curve is called a spine.

Surfaces. Objects, specifically 3D objects, can be decomposed into a set of surfaces and their adjacency relationships. General surfaces can be approximated by a collection of planar or quadric surface patches, and the attributes of this patch set can yield definition to the attributes of the general surface. For example, the shape of a surface can be ascertained by the set of surface normals of the planar patches approximating the surface.

An object can be represented by three lists and a set of topological relationships among these lists: a list of surfaces, a list of edges, and a list of vertices. Information relating each edge in the edge list to the pair of surfaces surrounding it, and relating each edge to a pair of vertices, forms the topological relationships among the lists. This model is called the Surface-Edge-Vertex representation [70].

Volumes. Representing 3D objects as a collection of volumetric primitives is another way to decompose objects into parts. A generalized cylinder is defined by an axis, a cross section, and a sweeping rule [118]. By sweeping the cross section along the axis using the sweeping rule, a solid is created. For example, a standard cone uses a line segment axis, a circular cross section, and an increasing sweeping rule. Using this representation and a collection of connectivity relations, a wide variety of objects can be exactly defined.
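As a concrete illustration of this definition, the sketch below samples surface points of a simple generalized cylinder with a straight axis and a circular cross section. The axis is assumed to lie along the z direction, and the function and parameter names are invented for this example.

```python
import numpy as np

def generalized_cylinder(axis_start, axis_end, radius_rule, n_axis=20, n_theta=36):
    """Sample surface points of a simple generalized cylinder.

    The axis is a straight line segment, the cross section is a circle, and
    radius_rule(t) gives the cross-section radius at axis parameter t in [0, 1]
    (the "sweeping rule").  Assumes the axis is aligned with the z direction.
    """
    p0, p1 = np.asarray(axis_start, float), np.asarray(axis_end, float)
    points = []
    for t in np.linspace(0.0, 1.0, n_axis):
        center = (1 - t) * p0 + t * p1          # point on the axis
        r = radius_rule(t)                      # swept cross-section radius
        for theta in np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False):
            points.append(center + np.array([r * np.cos(theta),
                                              r * np.sin(theta), 0.0]))
    return np.array(points)

# A cone: circular cross section whose radius increases linearly along the axis.
cone = generalized_cylinder((0, 0, 0), (0, 0, 1), radius_rule=lambda t: t)
```

Changing the sweeping rule (constant radius for a cylinder, linear for a cone) or the cross-section shape yields the variety of solids mentioned above.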
Hierarchical generalized cylinder models [109] more closely approximate the actual object as the hierarchy is traversed. A superquadric [124] is another volumetric primitive that can describe the parts of a 3D object. This primitive deforms an ellipsoid into an arbitrary (symmetric) shape. Parameters define the size of the volume and the degree to which the ellipsoid is deformed into a rectangular shape in the latitude and longitude planes.

Space-occupying features

Spatial occupancy representations define a 2D or 3D grid where each cell is either occupied or vacant. Quad trees (for 2D images) [135] and oct trees [83] or kd-trees [10] (for 3D volumes) systematically decompose the space from large areas or volumes into smaller ones. At each level of the tree, a certain area or volume is covered by each node; the node can indicate full, where the covered portion of the space is fully occupied; empty, where the covered portion is completely empty; or partial, where a portion of the space is occupied. Partially full areas or volumes are further decomposed in like manner until the finest resolution of cells is attained.

Surface normals

Finally, a robust set of features for representing a 3D object is a set of points on the Gaussian sphere [79]. The object is viewed as a collection of surface normals; each of these normal vectors is translated to the center of the unit sphere, and the set of points formed by these vectors on the sphere is called the Gaussian image of the object.

2.2.2 Multiple view representations

Feature-based representations describe the three-dimensional properties of objects. Unless range images are used, however, objects must be recognized from projections of that object onto a 2D image plane. A good representation will allow the viewpoint from which the object is viewed to be arbitrary; 3D objects viewed from varying viewpoints appear completely different when projected to a 2D image plane. To combat this, many researchers have represented 3D objects by a set of 2D descriptions instead of full 3D descriptions.

Collections of features from 2D views and their relationships

Representative two-dimensional views of an object can be described using the standard features presented in the previous sections, with each feature set representing the 2D projection of the object for that view [67]. These features, coupled with relationships among the various views, can represent a three-dimensional object when the set of 2D views and the extracted features are sufficient [143].

Aspect graphs

Two-dimensional views that share some set of properties can be grouped together into a (usually) small set of view classes [95]. These view classes can be organized into a graph structure. Each node of the graph represents an aspect of the object, i.e., a representative from a set of views which are topologically distinct from other views. An edge between nodes represents the transition between two such representative views.

Typically, the entire aspect graph is not computed in systems that use this representation; instead, the Gaussian view sphere is tessellated in some fashion and the resulting views are combined into equivalence classes according to some set of features or shapes [82]. This approach does not require any assumptions regarding the underlying shape of the 3D objects being recorded (such as polyhedral objects), but only a sampling of the entire view sphere is made. As such, the resulting aspect graph is only an approximation of the true graph.
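A minimal way to store such an (approximate) aspect graph is as an adjacency structure over view classes, as in the sketch below; the particular aspects, features, and labels shown are made up purely for illustration.

```python
# Each node is a view class (an "aspect"); each edge is a visual event, i.e. a
# transition between two topologically distinct representative views.
aspect_graph = {
    "nodes": {
        # view-class label -> features extracted from its representative view
        "front":  {"faces_visible": 1, "silhouette": "square"},
        "edge":   {"faces_visible": 2, "silhouette": "rectangle"},
        "corner": {"faces_visible": 3, "silhouette": "hexagon"},
    },
    "edges": {
        # adjacency: which aspects can be reached by a small viewpoint change
        "front":  ["edge"],
        "edge":   ["front", "corner"],
        "corner": ["edge"],
    },
}

def neighboring_aspects(graph, aspect):
    """Aspects reachable from `aspect` by crossing one visual event."""
    return graph["edges"].get(aspect, [])
```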
Multiple 2D views

Three-dimensional objects can also be stored as a collection of full two-dimensional images. In this case, the object for which a model is being built is viewed from many different viewpoints, each rendering a two-dimensional image of the model. The viewpoints to use in the process are selected by tessellating the Gaussian view sphere in some way in order to obtain representative 2D views of the object [31].

2.2.3 Hierarchical representations

Representation of objects by their parts and the relationships among these parts breaks down in complexity when complex objects and scenes are to be recognized. In a hierarchical structure, objects can still be defined in terms of their parts, but these parts can either be another hierarchy or an indivisible part. This approach is much like the hierarchical directory structures on modern computer systems: directory entries can either be indivisible parts (files) or can themselves be directories that can be further explored. Hierarchical structures generally render a better efficiency in searching the object representation space, since large portions of the model database can be truncated from the search for possible matches with search probe images.

2.3 Model-based object recognition

Model-based object recognition is the domain where a model exists for every object in the recognition system's universe of discourse. The research emphasis in this paradigm has historically been on the design of efficient matching algorithms from a manually designed feature set with hand-crafted shape rules. Representative examples of the different approaches to object recognition are given in Table 2.1.

Approach               Author                                    Db Size  Sensing Scheme  Features
Interpretation tree    Grimson and Lozano-Perez [66]             1        Range           Surfaces and edges
                       Flynn and Jain [53]                       20       Range           Surfaces and volumes
                       Brooks [19]                               3        Intens.         Generalized cylinders
Geometric Hashing      Stein and Medioni [141]                   1        Intens.         Edges, supersegments
                       Flynn and Jain [52]                       23       Range           Surfaces and volumes
                       Stein and Medioni [142]                   12       Range           Splashes
Evidence-based         Jain and Hoffman [86]                     10       Range           Global, surfaces, and boundaries
                       Caelli and Drier [25]                     25       Range           Global, surfaces, curvatures, boundaries
                       Bischof and Caelli [13]                   5        Synth.          Corners, perimeter
Geons                  Pentland [124]                            1        -               Superquadrics
                       Dickinson, Pentland, and Rosenfeld [46]   1        Intens.         Volumes
Distance minimization  Lowe [106]                                1        Intens.         Points, edges
                       Chen and Stockman [31]                    4        Intens.         Silhouettes
Pose clustering        Stockman [143]                            1        Intens.         Edges
Graph matching         Bolles and Horaud [14]                    1        Range           Surfaces
Alignment              Huttenlocher and Ullman [81]              1        Intens.         Points, silhouettes

Table 2.1: Classification of Major Approaches to Object Recognition

2.3.1 Interpretation tree

The interpretation tree approach to object recognition constructs a set of candidate hypotheses that satisfy correspondences between a set of model feature and input image feature pairs. A node in this tree corresponds to a proposed match between a feature in a particular model and a feature in the test object. Each branch in the tree represents another feature in the model being matched to a feature in the test image. For a node N, the branches to the children of N are all mutually exclusive; only one such branch can be valid at a time. Impossible nodes of the tree are pruned from further search. This determination of whether a node is valid is accomplished by a pruning predicate.
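A minimal sketch of this hypothesis search, with the pruning predicate left abstract, might look as follows (illustrative Python; the predicate and feature lists are assumed to be supplied by the particular system, and the names are hypothetical):

```python
# Illustrative interpretation-tree search: model features are paired with image
# features one at a time, and a pruning predicate rejects partial
# interpretations that cannot be valid.  Descendants of a pruned node are
# never explored.

def interpretation_tree(model_features, image_features, consistent, partial=None):
    """Yield complete interpretations: lists of (model_feature, image_feature) pairs."""
    partial = partial or []
    if len(partial) == len(model_features):
        yield list(partial)              # a leaf node: hand off to verification
        return
    next_model = model_features[len(partial)]
    for img in image_features:
        pairing = (next_model, img)
        # Prune impossible nodes: the subtree below them is never expanded.
        if consistent(pairing, partial):
            partial.append(pairing)
            yield from interpretation_tree(model_features, image_features,
                                           consistent, partial)
            partial.pop()
```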
These predicates have been limited in practice to unary and binary rules, where either the last feature match is tested alone or in conjunction with one of the previously seen matches in the path to the root of the tree, respectively. A node is pruned when the predicate indicates an impossible match; the descendants of a pruned node are not explored, and the search is thus constrained. The order in which these features are expanded in the tree is important to keep the size of the search tree to a minimum. Features that can cause mostly impossible nodes can be considered those features which are the most discriminatory; these features will cause most of the generated children nodes to be impossible nodes and therefore pruned from the tree. When such features are propagated toward the root of the interpretation tree, a much smaller search tree is produced, and less computational complexity is achieved for the same results. A different interpretation tree is constructed for each model in the model database. Each leaf node of such a tree must go through a verification step that determines the validity of the hypothesis described by the leaf node. Grimson [67] [66] is the major preponent of this technique in which he constrains 29 the search using the geometry of the ob ject(S) under consideration. His approach uses a unary edge-length constraint, a binary angle constraint, and two binary constraints on the distances between two linear edge fragments. Chen and Kak [30] strictly limit the interpretation tree search by invoking their verification procedure as soon as the object pose can be estimated. The remaining features are used to assist in the verification of the minimal hypothesis. Flynn and Jain [53] use the constraints of area, surface type, the radius of non- planar surfaces, the rotation, orientation, visibility, and the degree of parallelism for planar surfaces in order to prune their interpretation tree. Brooks [19] used two-dimensional images to recognize 3D objects using the volu— metric primitives of the generalized cylinder. 2.3.2 Geometric Hashing Geometric hashing [102] attempts to minimize the computational effort during the recognition phase by extensively preprocessing the library of object models [52] and storing invariant features in a hash table. These features are invariant with respect to viewpoint, and tables of object models are indexed by these features and their values. Evidence is given that some implementations of geometric hashing can handle the uncertainty of segmentation, and it is fast [141]. When implemented in parallel [15] and combined with a constraint satisfaction module to perform the verification task, object recognition (of relatively small model databases) can be done in near real time [141]. 30 Stein and Medioni [141] base their implementation of geometric hashing on sev- eral polygonal approximations as a representation of a model or a scene. The authors believe that the curvature of a general curve is its most important feature; it is in- variant with respect to scale, rotation, translation, and weak perspective (when some redundancy is supplied). No explicit distinguishing points (such as corners or inflec- tion points) are required in this method, and though the polygonal approximation looses most of the curvature information, parts are preserved in the angles formed by the consecutive line segments in the polygonal approximations. 
Connected linear segments in the images form chains of adjacent segments; the segment chains are turned into super segments by grouping a fixed number of adjacent segments. These super segments consist of the line segments, the angles between them, and the mea- sure of eccentricity (second moment ratio) of the ellipse that can be computed from the covariance matrix of the vertices. All the super segments are encoded using a Gray code [69]; the resulting predicates are used as a key for the hash table, where the super segment is stored as an entry. Note that these predicates represent the intervals within which the features of the super segments lie. Flynn and Jain [52] propose a table—driven object recognition technique using pairs of surface patches as invariant features. These invariant features are such items as angles (those formed between the the normals of planar patches, between the axes of cylindrical parts, and between a plane’s normal and a cylinder’s axis) and distances (between a planar patch and the center of a spherical part, between the centers of two spherical parts, and the minimum distance between the center of a sphere and a cylinder’s axis). Triples of these patches are used to form two-dimensional 31 interpretation tables. Each pair of invariant features in a triple of surface patches combines so that a table has multiple entries. For example, a plane-plane-plane triple uses all three pairs of angles to index into the two—dimensional table. Standard hash table techniques are used to deal with entry collisions during the table-building stage. To match test images with models in the database, the defined invariant features are extracted from the test image during the segmentation phase. Interpretation table entries within some window of the obtained values are extracted, and the features from the model(s) given by the table entries are bound to the test image features. These sets of hypotheses are sent to the verification stage (interpretation tree in the work reported). Stein and Medioni [142] define “splashes” as representations of surface patches. The surface normal Np at a carefully selected point p on a surface of interest is computed. Points lying on a set of concentric circles, centered at p, are sampled at equal angular increments. The angles between N, and the normals computed at each of the sampled points, using a local coordinate system, are used to produce a set of piecewise-linear points representing a 3D contour of the surface. The circle radius, curvature and torsion of this representation are used as invariant features to use as an index into a hash table. 2.3.3 Evidence-Based Recognition Evidence—based object recognition derives a discriminatory evidence rulebase, capable of differentiating between a set of objects. This rulebase is a set of salient (evidence) 32 conditions with corresponding evidence weights for various objects in the database [87]. During recognition, weights for various features are accumulated to give evidence for a particular model in the database [25]. This technique can deal with real-valued data and can learn characteristics of classes which are not shared by every instance of the class, constructing a useful rulebase in polynomial time [86]. Psychophysical research of biological visual systems suggests that much analy- sis of the visual stimuli occurs at a low-level, i.e., pre—cognitively [110]. 
For the evidence-based recognition technique, this is taken to mean that knowledge should be integrated only after low-level processing has derived whatever it can from the data without the aid of high-level cognitive processes. These low-level data for Jain and Hoffman [86] are surface patches, boundaries, and corresponding classifications of these. This information is structured to form an object representation.

The evidence-based object recognition technique makes use of a discriminatory rulebase for object recognition. Juxtaposed with this type of rulebase is the generative rulebase, patterned after Winston's work on analogy [164], explored by Connell and Brady [39]. In a generative rulebase, the objects can be regenerated from the rules themselves. A discriminatory rulebase, on the other hand, is capable only of differentiating between a set of objects and is incapable of generating the objects from the rules themselves. This requires the system to retain additional information about the model database besides the information actually used by the system, in order to avoid losing a description of the models being matched.

The object models in Jain and Hoffman [86] consist of several views of various objects. A segmentation, classification, and merging process provides a segmentation of a range image into surface patches. These surface patches are assigned a sense, viz. planar, convex, or concave. The model representation is made up of the following classes of information derived from the range images and the surface patches:

- morphological information, consisting of the perimeter of the set of object pixels, the number of connected background components in the image, and the number of connected components within the convex hull constructed over the object pixels;

- local patch features, consisting of the sense of the patch (i.e., planar, convex, or concave), the 3D surface area of the patch, the span of the patch (i.e., the maximum 3D distance between any two points on the patch), and the maximum squared error in fitting a piecewise-linear curve to the boundary between the patch and the background pixels;

- relationships between all possible pairs of patches, consisting of the boundary type (adjacent, jump, or remote), the normal angle between patches, the minimum and maximum distances between patches, and the boundary angle and linear fit of pixels along the boundary (for adjacent patches) or jump gaps (for jump boundaries between patches).

Knowledge about the boundary angles of the models is used to perform a merging of patches to produce a new set of representations. Rather than using all of this information in the representation, however, only "remarkable" information (evidence) which strongly cues certain objects is used. An evidence rule consists of a list specifying the features involved in the rule and the corresponding bounds for these features, a specification of the range of possible distinct patches that must satisfy all of these feature bounds, and a list of evidence weights w_i (one for each database object i) indicating how strongly the presence of these features, within their associated bounds, supports the hypothesis that the observed object is object i. These evidence rules either refute or support hypotheses about an object in a scene. The technique used to build this rulebase was to construct a minimal spanning tree over a subspace of the feature space.
Connected components of these minimum spanning trees which resulted from removing weak edges in the tree were considered to be clusters; each cluster was a potential rule which may be added to the rulebase. Clusters which contained only a few vectors were considered “good” rule candidates, because the feature subspace under consideration showed good differentiation between the objects. Caelli and Drier [25] extended the pioneering evidence-based object recognition work done by Jain and Hoffman [86] to provide some new view-independent features, an optimized rulebase generation procedure, and a neural network to estimate optimal evidence weights and match objects in the later recognition phase. CAD images projected to various viewpoints were used for database objects. The features used by Jain and Hoffman were augmented to include principal curvatures, which are invariant to rigid motion. Bischof and Caelli [13] discuss optimally classifying patterns in terms of the bounds on features which constitute different types of patterns. The label-compatibilities which should occur between unary and binary rules are taken into account in the rulebase generation. 35 2.3.4 Geons Geometric ions (geons) [12] are volumetric primitives with properties describing their cross section’s edge, symmetry, size, and axis. The edges can be straight or curved; there can be no symmetry, reflective symmetry, or rotational symmetry; cross section sizes can be constant, expanding, or expanding and contracting; the axis can be straight or curved. These yield 36 geons by taking the Cartesian product of these four properties. The geons can be viewed as the alphabet of a formal language. If the volumetric primitive attachments, location, and orientation could be encoded in the production rules of this language, recognition would be reduced to parsing strings of the language. Pentland [124] proposed the superquadric shape model in this regard. Dickinson, et al. [46] use view-dependent features in intensity images to produce a geon-based object representation that is used for object recognition. 2.3.5 Distance minimization Newton’s method of nonlinear least-squares minimization can be applied to minimiz- ing the distance between 2D contours and 3D objects [106]. Projection from 3D to 2D is a smooth and well-behaved nonlinear operation. Newton’s method is based on the assumption that the function being minimized is locally linear; this requires a good starting point to avoid falling into local minima. Lowe [106] proposed this method for object matching. Chen and Stockman [32] build their object model database by obtaining the object silhouettes from various 36 viewpoints. The silhouette of a test image is extracted and matched to an object using this distance minimization technique. 2.3.6 Pose Clustering A transformation of the model and image features, i.e., the translation, rotation, and scaling, can be achieved using pose clustering [143]. This technique derives the transformation between each model and image feature pair; a Hough transform approach is used to accumulate evidence for each candidate transformation. When all such pairs are considered, peaks in the transformation bins produce hypotheses to be verified. 2.3.7 Graph Matching Finding the correspondence between object model features and test image features can be considered a graph matching problem. 
Using a graph structure to represent object features, attributes, and their relationships can be effected by viewing graph nodes as the features and graph edges as relationships among the features. Viewed as such, object recognition reduces to finding Similar structures among subgraphs, commonly called graph isomorphisms. Bolles and Horaud [14] find the maximal cliques in the graphs and use this infor- mation to define the best set of consistent model/ image pairs. 37 2.3.8 Alignment Pose estimation by alignment [81] tackles the problem of recognizing objects from a single two-dimensional image of a 3D scene. The methodology provides a closed-form solution for the transformation of a 3D model coordinate frame to the 2D image coordinate frame using three pair model and image points. Huttenlocher and Ullman [81] uses the alignment method for object recognition. The transformation method is used to produce a set of hypotheses for alignment of the model with an image. A set of features, such as the silhouettes [31], is used to verify the hypotheses proposed by the transformations. 2.3.9 2D object recognition The foregoing sections have focused on the problem of three-dimensional object recog- nition using various representations. Two-dimensional objects, such as characters for document processing applications, are also of interest to the object recognition field. Kim and Yuan [93] use Zernike moments in order to make their character recogni- tion system invariant to translation, scale, and rotation. Using a k-nearest neighbor classifier for character label identification gave very good results. Applications for map reading and lid inspection are addressed and show good recognition of char- acters from the images. Pan and Keane [121] describe a set of moment invariants recognizing characters with differing aspect ratios. A multilayer perceptron was used on the preprocessed images to achieve a good numeral recognition rate of 99.97% for handprinted characters. Buse and Liu [24] present a word recognition system that 38 is able to recognize cursive handwriting. No preprocessing is done on the grey-level images; Gabor filters are used to extract parts from word images. An evidence-based approach for word recognition is used to achieve a 81% rate for the correct word being in the top six choices. 2.4 Face recognition Face recognition is a sub-area of general object recognition of particular interest for a wide variety of applications. Applications to law enforcement for mugshot identification, verification of personal identification such as driver’s licenses and credit cards, gateways to limited access areas, surveillance of crowd behavior all are potential applications to a successful face recognition system. Several basic problems confront researchers in this area. The choice of represen- tation, as with general object recognition, is a critical step toward successful face recognition. Using observable features alone (such as a long nose, long hair, or a round face) pose difliculties for automatic extraction and qualitative quantization. For practical automatic recognition systems, however, a compact representation is essential. Once the choice of representation is fixed, systems to automatically recognize faces must detect faces in sensed images. A scene may or may not contain a set of faces; if it does, their location must be ascertained before recognition can take place. 
This is a complex problem: facial hair, make-up, turbans all obscure facial features which may be useful in localizing faces; large variation in scale, position, and orientation 39 should be handled to make the automatic system practical. When a representation is chosen and face images extracted, the recognition prob- lem remains. Several approaches have been used to find the identity of faces, from holistic approaches to matching based on features. In a practical system, changing facial features over time must be handled; making the association between faces and their labels must be accomplished; multiple instances of the same face under varying imaging conditions, sizes, and orientations must be classified as the same person. 2.4.1 The human capacity for face recognition Though we may be unable to endow machines with the capability of the human visual system, it is a good reference point from which to start. Much research has been done on face recognition, both by machine and biological system researchers. Research issues of interest to neuroscientists and psychophysicists include the human capacity for face recognition [111], the modeling of this capability [127] [140], and the apparent modularity of face recognition [129]; the human facility for learning to recognize faces [29] [65] [49] [114]; the role of distinctive or unusual features in faces for recognition [155]; the degradation of face recognition capability as humans age [7] [112] [156]; and conditions which result in the human inability to recognize faces, such as prosopagnosia [153]. There is evidence to suggest that the human capacity for face recognition is a dedicated process, not merely an application of the general recognition process [48]. Infants are attracted more to face-like patterns than those without patterns [114]. 40 People suffering from prosopagnosia retain the ability to recognize people from various other cues, such as voice, dress, and hair style [157]; they can recognize features of faces and determine whether an object is a face, but are unable to attach an identification [153]. The analysis of what features humans use for face recognition has been cause for much debate. Apparently both global and local feature information are used in an hierarchical manner, the local features providing a finer classification system for facial recognition [75]. The features tend to become less important as the extraction process moves down the face [20]. “Attractive” faces tend to be recognized most accurately and readily; “unattractive” faces are able to be recognized almost as well; faces having little aesthetic weight either positively or negatively are the most difficult to recognize. This gives support to the theories suggesting that distinctive faces are more easily recognized than typical ones [6]. Information contained in low Spatial frequency bands are used in order to make the determination of the sex of the individual, while the higher frequency components are used in recognition [137]. Young children typically recognize unfamiliar faces using unrelated cues, such as glasses, clothes, hats, and hair style. By age twelve, these paraphernalia are usually reliably ignored [28]. Psychosocial conditions also affect the ability to recognize faces. Humans may code a mean face with average attributes; these means may be different for different races, and recognition may suffer from prejudice or unfamiliarity with the class of faces from another race [59] or gender [60]. 
The emulation of this human capacity for face recognition is the goal espoused by most computer vision researchers in face recognition. Sample approaches taken by 41 various research groups is summarized in the following sections. 2.4.2 Representations for automatic systems 2D intensity images are usually used as input to automatic systems. Though it is not very compact, the images themselves are good for robustness in recognition. For detection of a face in an image, a resolution of 16 x 16 is sufficient; for recognition, 3 32 x 32 at 4 bits per pixel (sixteen grey level values) intensity image is sufficient [134]. Therefore, the use of this resolution of 2D images would require only 512 bytes for each image to be recognized. Alternatively, a feature vector could be extracted from the face. The features are derived from the intensity or edge map images, and usually include items such as the hair intensity, the distance between the eyes, or the size of the eyes. Features derived from face profiles typically include a set of characteristic points on the profile (such as the notch between the brow and the nose or the tip of the nose) and the angles between these points. Combinations of these representations have also been employed. Galton [58] defined a set of characteristic points and a collection of curves for the profile shapes at or adjacent to those points. The index of the shape curve for the characteristic point was used for a description of face profiles. Hong [78] worked on recognizing photographs. Each photograph was sampled using five different relative positions between the photograph and the camera. A Singular value decomposition of the image matrix extracted the representation for this work as an n-dimensional 42 single value feature vector. As a combination of features and 2D images for face representation, Campbell, et al. [27] utilized hair and cheek intensity values coupled with subimage patches for eye representation in order to recognize faces. The Karhunen-Loéve projection, though it has been studied for many years [88] and used in countless pattern recognition applications [84], has recently been revisited for face representation [94] [139], face recognition [154] [122], and for planning the illumination of objects for future recognition tasks [116]. Sirovich and Kirby [139] do not attempt to recognize faces, but their work has been at the heart of many other face recognition attempts. They use the Karhunen— Loeve projection in order to represent face images. Any image in the training set can be approximately reconstructed using a linear combination of the “eigenpictures” (see section 3.3 for a description of how these eigenvectors of the training set can be computed); arbitrary images not found in the training set can also be approximated using a weighted combination of eigenvectors obtained from the training set. They extend this work [94] by exploiting the near-symmetrical nature of faces in their representation. They augment their training set of images using the mirror images of these training images in an attempt to reduce the errors involved in reconstructing the faces. Though no representation approach has established itself as being far superior to others, the use of feature vectors, either implicitly derived or explicitly extracted from images, are by far the most popular. 43 2.4.3 Face Localization In order to automate the face recognition process, the system must be able to find a face in the input image(s). 
Though this is a typical problem in most object recognition systems, much research effort has been expended in the pursuit of face localization in particular to assist in the face recognition process. Sakai, et al. [132] seeks to determine the existence of a face in a set of digitized photographs. A Sobel-type edge operator found pixels with the largest magnitude of gradient values; neighboring pixels with Similar properties were joined in order to obtain a silhouette of the face. These outlines produced a similarity measure when matched with a pre-defined face template. Once an outline was identified as a face silhouette, a coarse-to-fine approach discovered individual features within the face outline. Nixon [119] locates the eyes using a Hough transform. The iris has a nearly cir- cular perimeter in most cases. Furthermore, the iris is expected to have gradient directions going out in each quadrant. A Sobel operator produces gradient direc- tions by quantizing the gradient magnitudes, and these data are used in the Hough transform technique. Conlin [38] detects faces from artist sketches using a knowledge based approach for task organization. The system utilizes a face template which contains both eyes, the nose and the mouth; various levels of image information are used. Line segments propose component parts; component parts propose components (e.g., nose), and components propose the layout of the face. The best candidates at each level survive 44 and are propagated to the next level; competing candidates are explored to provide the best possible detection results. Craw, Ellis, and Lishman [41] employ a multi-resolution scheme to locate the face in an input image. At a low resolution of the image, a Sobel edge detector in conjunction with a line follower algorithm finds the outline of the head. At each level of the resolution hierarchy, the template of the head position is propagated to the next finer resolution of the image for further refinement. Once the head outline has been reliably extracted, the process is repeated for internal facial features (such as eyes) using the head outline as a guide. For example, to find the lips and eyebrows, a vertical line is drawn in the lower quarter of the face outline (for finding the lips) or in the upper quarter of the face outline (for finding eyebrows); each pixel of this line is a seed point for line finding algorithms. A collection of these lines that fit in a thin box indicates that the desired facial features have been found. Yuille, Cohen, and Hallinan [167] define an energy function to be used with de- formable templates. These templates are altered to find the best representation of the shape present in the image. Preprocessing the intensity input images using mor- phological filters generates representations of peaks and valleys in the images. A collection of energy surfaces are defined, and the templates are deformed to Simulta- neously seek good minima in all of the surfaces. The parameter values of the template are modified to minimize the energy function. Govindaraju, Srihari, and Sher [64] extend the work by Yuille, et al. [167] by basing their template on three arcs of the head Silhouette which have curvature dis- continuity, viz. the right, left, and top sides of the head. The chord vector connecting 45 the ends of the arc is used to define an area and a region centroid that are used with the arc length and the chord itself to form a four-tuple for each of the three arcs to be used as a template to deform. 
Considering these three arcs as a supersegment [141] gives a face region whose center is the center of the face being located. Ballard and Stockman [5] locate the eye twinkles, i.e., the bright specular reflections from the cornea. The twinkles were found to be the brightest points on the grayscale images. A two-dimensional binary search of the histograms quickly finds the locations of these twinkles; shape heuristics are used to rule out spurious twinkles. Using the positions of the eye twinkles, a similar procedure was used to find a much weaker "nose twinkle" along the bisector of the image columns containing the eye twinkles. Hallinan [68] views eyes as having two regions: the iris and the white region of the eye. Each of these regions has a nearly uniform intensity. The non-uniform intensity values of the eye are modeled as noise in the ideal template (model) of the eye. For input images, a valley-detection algorithm is used to find the iris surrounded by the white of the eye. A deformable template is fit to the data using both the intensity surface and the edge surface. Craw, Tock, and Bennet [42] use a hierarchical coarse-to-fine search to find the location of a head. At each level, simulated annealing produces a polygonal approximation to the head silhouette; once the best such silhouette is achieved, the approximation is refined by adjusting each vector of the polygon representing the silhouette. Manjunath, Chellappa, and Malsburg [107] use Gabor wavelet decomposition to extract features at points of maximal curvature in the images. In a particular neighborhood, a point (x, y) is selected as a feature point if its response to the wavelet representation of the Gabor function is maximum. Reisfeld and Yeshurun [128] seek facial features using the almost symmetric nature of the face about the center of the nose. Paired points in the image about the axis of near-symmetry are given a symmetry measure value. Points with a high response to this measure correspond to facial features such as eyes or cheeks. Weng, Ahuja, and Huang [160] combine the localization and recognition steps in a general, self-organizing paradigm. Their system was demonstrated primarily for face segmentation and recognition, but the general learning technique was also applied to other, real-world objects. The Cresceptron is described in more detail in section 2.5.

2.4.4 Face Recognition

In any system that can be utilized for automatic recognition using faces, an identity must be assigned to the representation of the input images. This implies that the faces to be recognized must be suitably organized in a database for future retrieval. Using this database of faces, a label must be assigned to an input facial image. Furthermore, this label must be consistently and robustly assigned under large variances in the image acquisition parameters. Typically, during the system setup phase, a set of features that can be reliably extracted are identified, and the set of "known" faces are encoded using this set of features in the face database. Then at system run-time, novel face images undergo the feature extraction process to place them in the same feature space as the face database, and a matching algorithm finds the best candidate(s) for the face identity [133]. For the most part, the different approaches to face recognition differ only in the features selected and the matching procedures used.
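The setup-time/run-time split described above can be summarized in a short sketch (illustrative only; `extract_features` stands in for whatever representation a given system adopts), in which known faces are encoded once and a novel face is assigned the label of its nearest neighbor in feature space:

```python
# Illustrative sketch of the generic face-recognition pipeline: encode the
# gallery of known faces as feature vectors at setup time, then label a probe
# image by the nearest database vector (Euclidean distance).
import math

def build_face_database(labeled_images, extract_features):
    return [(label, extract_features(img)) for label, img in labeled_images]

def identify(probe_image, database, extract_features):
    probe = extract_features(probe_image)
    best_label, best_dist = None, float("inf")
    for label, vec in database:
        d = math.sqrt(sum((p - v) ** 2 for p, v in zip(probe, vec)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```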
Some of the pioneers in face recognition include Kaya and Kobayashi, who manu- ally extracted facial features in 1972; Kanade, who automated the feature extraction and recognition of faces in 1977; and Kohonen who applied the use of neural network technology for recalling a face from a noisy image in 1988. Kaya and Kobayashi [92] studied classifying a restricted domain of faces. The subjects used in the experiments looked directly into the camera, kept their mouths closed; they had neither glasses, nor facial hair, and all the images were taken under identical illumination conditions. Two basic types of noise were identified: the inher- ent noise, namely the variation of features of an individual from one image to another, and extraction noise, namely the errors involved in extracting parameters from the images. Euclidean distances between manually identified points in the images char- acterize the faces in this work, and the parameters thus obtained were normalized by dividing each length by the nose length. Kanade [90] identified several steps for face recognition: the selection of effective features to identify the face, automatic extraction of those features, and automatic classification based on the automatically extracted features. The subjects in the experiments had no glasses and no facial hair; two images of each subject were used, taken a month apart under different conditions. One of the images was used in the training set and the other in the test set. The method for characterizing the face is the parameterization of the facial geometry using the distances and angles between 48 eye corners, ends of the mouth, nostrils, and t0p of the chin. A two—stage process, a low- and a high-resolution scan, located the requisite features in the images. The low-resolution scan found regions for the two eyes, the nose and the mouth. Each of the regions is then scanned at the higher resolution, producing more precise features. Kohonen [98] demonstrated the use of neural network technology for face recol— lection applications. Even when the input images were very noisy or had portions missing, an accurate recall capability was achieved on a small set of face images. More recently, a renewed interest in the field of face recognition has appeared on the computer vision research scene. This is due in large part to technological improvements in the ability to perform large computational efforts and the application of new or the reorientation of previously seen techniques applied to face recognition. Stonham [144] distributed a set of single-layer adaptive neural networks to each person found in the database to perform face recognition, expression analysis, and face verification. Several hundred presentations were required to train the classifiers, providing changes in facial expression and translation of input images. The network that produced the highest response among all database networks was the identified person for a test image. The Dynamic Link Architecture [23] [101] can form sets of neurons grouped into structured graphs. Using Gabor-based wavelets as the features, object recognition invariant with respect to background, translation, scale, and distortion was achieved. An image is represented as a graph with attributes attached to the graph’s nodes; an object in the image is represented by a subgraph in the image graph. Recognition is thus reduced to the process of elastic graph matching. 49 Golomb and Sejnowski [61] classify the sex of the individual from their face image. 
Two three-layer neural networks are trained, but the hidden layer of the first is given as input to the second. The first network is trained as an image compression network, reducing a 30 x 30 = 900-node input layer to 40 hidden nodes. These 40 hidden nodes are given as input to the sex classification network, which is another three- layer network with a single output node indicating the sex of the person in facial input image. Turk and Pentland [154] use the Karhunen-Loeve projection in order to detect and recognize faces. The eigenvectors of the covariance matrix for the training set (“eigenpictures” in Kirby and Sirovich [94]) are called “eigenfaces” in this work. Every face in the database of “known” faces can be projected onto each of the eigenfaces to produce a weighting of how much that eigenface contributes to the reconstruction of the database face. When the database face has been projected to each of the eigenfaces, a weight vector is thus produced. When a novel face image which may or may not be in the database of known faces is presented, this same projection is done to produce another weight vector. The Euclidean distance between the test image’s weight vector and each of the face database weight vectors is computed; recognition is achieved by finding the individual in the face database with the smallest distance so obtained. They also observe that faces and non-faces produce very different weight vectors, and they uSe this information in order to detect whether a face exists in the image. Akamatsu, et al. [2] normalize the position and Size of the face in the images. Each target subimage is sent through an affine transformation so that the eyes and mouth 50 are placed in a pre-defined geometric arrangement within a standard face window. From this standard face window, two different experiments were performed. In the first experiment, the Karhunen-Loeve projection is done in a similar manner to that done by Turk and Pentland [154]. For the second experiment, the Fourier Transform is applied to this standardized image, and the Karhunen-Loéve projection is applied to the Fourier spectrum. The Fourier approach allowed for much more variation in Shifting the standardized image points, but both systems provided good recognition results. Nakamura, Mathur, and Minami [117] use curves of constant gray level (isodensity lines) for face recognition. The silhouettes of face images on a black background are discovered with a Sobel edge detector, and recognition is done with template matching on a small (ten pairs of images) data set. Rahardja, Sowmya, and Wilson [126] employ a pyramid of neural networks for expression recognition. The training images are partitioned into overlapping blocks, each block of which serves as input to the lowest level of networks in the pyramid. Training of the networks at the lowest level of the pyramid use backpropagation to extract features from the images. The hidden layer of these networks are combined to the networks at the next higher level in the pyramid. The training images were hand- drawn faces with varying expressions; the pyramid successfully recalled the trained data, but performed badly (approximately 50%) in recognizing distorted images. Seibert and Waxman [136] use a neural network to recognize faces from their parts. The eyes, nose, month are extracted from a facial image using symmetry detection. Considering the facial features as 2D views of a 3D object, ART 2 is utilized to produce 51 a clustering of the feature vectors. 
An aspect network integrates the accumulation of evidence for recognition. Gordon and Vincent [63] [62] use range data for face recognition. The image is segmented into planar and spherical regions and surfaces of revolution. From this segmented range image, the nose is located to aid in the location of the eyes. Faces are normalized to a standard position and orientation using the positions of the eyes and nose. The volume between the test image and the database image in this normalized Space is the criterion for matching. Manjunath, Chellappa, and Malsburg [107] [108] recognize faces by combining Spa- tial location in the image with the feature points obtained using the Gabor wavelet decomposition in a local neighborhood. These vectors are considered as the vertices of graphs, connected using the Spatial information from the image. The graph from an input image is matched with a database graph, first by finding feature pairings be- tween the graphs, followed by combining this information with the topology difference of the two graphs. Cheng, Liu, Yang, and Wang [36] use three training samples for each individual to produce a mean image for the individual. The eigensystem for this average image is computed using singular value decomposition, thresholding the eigenvalues near zero. The average eigenvectors obtained for all of the mean images are computed to produce the feature vectors used in the system. A novel image is projected to this average eigenvector space to produce a feature vector. Brunelli and Poggio [22] look at the problem of gender classification from face images. The positions of the faces in the images are standardized by automatically 52 detecting the eyes and placing them at a standard position. A HyperBF network [125] is trained for each of the two genders, and during the test phase, the gender network with the larger output is taken as the appropriate one. This same technique was extended to face recognition, training a network for each person. Synthetic data was generated from the small set of real images by placing points around the average of the feature vectors and using these vectors for training, using the real data as test images. Weng, Abuja, and Huang [160] demonstrate a general learning paradigm for object segmentation and recognition. Their system is a self-organizing recognition system for general object recognition; as such, it does not rely on any hand-crafted shape rules extracted from the objects to be recognized. Their system is described in more detail in section 2.5. Pentland, Moghaddam, and Starner [122] extend the work done in [154] regarding the Karhunen-Loeve projection for face recognition by taking a sample of a very large database of faces as the training set from which to compute the eigenvectors used for weight vector calculation. The database consisted of many different views of each individual and wide variation in the paraphernalia (such as glasses and hats) that the person was wearing. Using the eigenface techniques, the database could be considered as a whole or it could be divided into multiple sub-databases consisting of the various views available. Their work favors the use of multiple eigenspaces for view-based face recognition. The eigenface concept was also extended to include such features as “eigeneyes” and “eigenmouths.” These feature eigenspaces produced only a marginal improvement in the recognition rates. 
The work explores various applications to a 53 face database of this size and reports good results on each tested application. One of the applications is an interactive database search. Constraining an initial search using annotations associated with each individual in the database, a face can be selected for further search. Using the eigenface recognition technique, a list of faces is presented to the operator in decreasing similarity. 2.4.5 Profile Recognition Due to the requirements of law enforcement applications to face recognition, some research in the area of profile recognition has also been done. For profile recognition, the Size and shape of the nose is of primary importance, whereas it is of less importance for frontal view face recognition. Kaufman and Breeding [91] use binary images for profile recognition using the profile silhouettes. A set of normalized autocorrelations is utilized as a feature vector; the k-nearest neighbor rule is used for classification. Harmon et al. [71] [72] [73] [74] treated the profile as a waveform to achieve profile recognition. From a set of characteristic points on the profile, Six feature character- istics (protrusion of nose, wiggle, profile triangle angles, angle between characteristic points) were established. Eleven features from these were used as a feature vector, and Euclidean distance was used to find a good match. Set partitioning was used to reduce the number of candidate matches for the Euclidean distance calculation, thereby both improving the performance and reducing the computational complexity. Wu and Huang [166] use B-splines to extract points of interest. A feature vector 54 is generated using the distances between neighboring points of interest, the curvature joining adjacent points, etc. The feature vector thus extracted is compared with the database vectors for matching purposes. 2.5 Self-organizing approaches The previous sections have discussed various techniques for representing and rec— ognizing 3D objects in general and faces in particular. The research emphasis has historically been on the design of efficient matching algorithms from a manually de- signed feature set with hand-crafted shape rules. This manual design of a feature set is appealing because typically the system can be designed to be very efficient. When designed properly, a very small number of parameters for each of the objects or faces is sufficient to capture the distinguishing characteristics among the objects to be recognized. These hand-crafted features, however, can make the system brittle in that it is unable to Operate except for those domains for which it was specifically designed. An alternative to hand-crafting features is the self-organizing approach. Auto- matic selection of features is a crucial step toward a general, fully automatic recogni- tion system. In a self-organizing approach, the structure of the knowledge base is a function of the training samples given to the system. With such a system, the human designers must provide an environment for learning, but most of the design details are left to the learning phase of the system to complete. The Cresceptron [160] is such a self-organizing system. Similar to the Neocogni- 55 tron [57] [56] in that the Cresceptron uses multilevel retinotopic layers of neurons, the Cresceptron automatically determines the configuration of its network in the learning phase. 
The Cresceptron learns in an incremental and unsupervised manner, automat- ically creating an hierarchical image analysis structure to describe images. Since the Cresceptron forms a hierarchical structure, it is able to tolerate deviation at various levels. Near the root of the hierarchy, the tolerance is large; near the leaves, this tolerance is smaller in order to handle perceptually similar objects from a relatively small set of training samples. Unlike most neural networks, the Cresceptron does not use back prOpagation for learning. Instead, the Cresceptron analyzes the structure of an object in a bottom-up manner, as proported by Kohonen [98] to be the way biological learning takes place. No pre-segmentation is required by the Cresceptron, Since the system combines the segmentation and the recognition steps, and different features for different objects may be selected based on the training set. The network that the learning phase of the Cresceptron constructs consists of several levels; each level has several layers, and each layer has several neural planes. A neural plane is a square set of nodes that represents a concept; the response at a certain position in the neural plane is representative of the presence of the concept. At the lowest level, the nodes correspond to the pixels in the input image. The higher layers consist of pattern detection layers, node reduction layers, blurring layers, or combination of other types of layers. Though the available layers and their function have been pre—determined, the configuration of the network connections is automatically learned from the training samples. The network can continue to learn in an incremental fashion as new training 56 samples become available. The Cresceptron network grows very quickly, however [159]. Regardless of the input image, every node in the network must be visited. While in a parallel hardware implementation this may be acceptable, for the current state of software simulation, the system runs more and more slowly as more images are learned. A concept is regarded as new and memorized whenever it is significantly different from those that have been previously learned. Instead, an optimal feature selection procedure Should be employed. 2.6 Summary Object recognition is a difficult problem in general. The sensing scheme, the selection of features and their representation, the matching procedures and the model database indexing techniques all contribute to this difficult problem. A summary of popular sensing schemes to bring the images from the world into the computer is given, including desktop scanners, cameras for intensity images or depth maps, and range scanners are briefly described. Then the problem of model representation is addressed, surveying many popular representations being used in the computer vision community. A survey of object recognition techniques is given. As a sub-area to general objection, face recognition has received much attention in recent years. Both frontal face and profile recognition are important topics for many varied applications, but frontal face recognition has received more attention. A wide variety of representations have been employed, and features ranging from holistic to 57 individual local features or eigenfeatures have been utilized. Using these features, a collection of works has been summarized for locating the face in the image, if one exists. A survey of face recognition techniques has been provided for extracted face images. 
Using the human capacity for face recognition as an ideal for which to strive, the automatic recognition of faces still has a long way to go.

Chapter 3

The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF)

The SHOSLIF is a framework that aims to provide a unified methodology for comprehensive visual learning. Its objective is not just to attack a particular vision problem, but a wide variety of vision problems. It addresses critical problems such as how to automatically select the most useful features and how to automatically organize visual information using a coarse-to-fine space partition tree, which results in a very low, logarithmic time complexity for retrieving from a large visual knowledge base [158]. The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space defined by the training images. This space is generated using two projections: a Karhunen-Loeve projection to produce a set of Most Expressive Features (MEFs), and a subsequent discriminant analysis projection to produce a set of Most Discriminating Features (MDFs). The system builds a network that tessellates these MEF/MDF spaces to provide an O(log n) average time complexity for recognizing objects from images, where n is the number of learned (and stored) samples.

3.1 System overview

The network that is constructed takes the properties of an abstract tree structure. An example of such a network is shown in Figure 3.1. It is this Space-Tessellation Tree that provides the key to the efficient object recognition capability of the system described in this work.

Figure 3.1: A tree view of the SHOSLIF network. Each node corresponds to a processing element.

The shaded blocks in Figure 3.1 indicate nodes which may be visited during the recognition phase. The leaf nodes represent the smallest cells defined by the training set. As the processing moves down from the root node of the tree, the Space-Tessellation Tree recursively subdivides the training samples into smaller problems until a manageable problem size is achieved. When a test object is presented to a node, a distance measure from each of the node's children is computed to determine the most likely child to which the test object belongs. At each level of the tree, the node that best captures the features of the test object is used as the root of the subtree for further refinement, thereby greatly reducing the search space for object model matches.

Figure 3.2 provides a more detailed view of the processing elements found in the network. The Space-Tessellation Tree may be invoked multiple times, each invocation corresponding to an attention point and scale of the visual saccades learned in the training phase of the system. Each time the tree is invoked, it uses a fovea image extracted from the original image.

Figure 3.2: The network structure of the SHOSLIF. Each level of the network has a variable number of nodes. A node at level l0 has a sub-network at levels l > l0. A node consists of a processing module and an attention supplier. As indicated by the multiplexers in the figure, each of the processing modules can activate one of the attached nodes at the next level of the network. Each of the processing modules controls an attention module, which is responsible for supplying properly aligned fovea images to the activated node.

3.2 Visual attention

This fovea image is produced using a visual attention mechanism. The visual attention mechanism provides a well-framed portion of the image that contains a single object of interest in a standard position and scale. During the training phase, the attention mechanism finds areas of interest using a scanning technique. The scanning is accomplished using a manual selection, a pixel-by-pixel search [154], a simple raster scanning approach [160], or a more sophisticated technique such as described in [145]. In manual selection, the operator uses the mouse to select the area of interest from a displayed image. A simple raster scan is used in [160]: the input image is scanned with a fovea window, and at various predefined points in the raster scan, the subimage subtended by the fovea window is extracted as a fovea image to process.

Once an area of interest is selected, it is mapped to a standard-sized "fovea" by the foveation procedure. Because the fovea represents a fixation point in the visual saccades performed on an input image, we consider those pixels nearer the center of this fovea to be more important than those nearer the edges. Therefore, the pixels in the fovea are weighted appropriately. The center pixel receives a weight of 1, and the pixels on the edge of the fovea receive a weight of 0, with a linear interpolation between these weights for the pixels in between. This ensures that the fixation point that produced the fovea receives more attention than the background. The details of this visual attention mechanism are beyond the scope of this document and are given in [159]. Visual attention is utilized to limit the scope of images processed by this system to those images containing a single, well-framed object.

A top-level flow diagram for the processing done in each of the Space-Tessellation Tree's processing elements during the learning phase is given in Figure 3.3. In order to allow for generalization to variations in the position, scale, and orientation of the objects in the training samples, these parameters are varied slightly during the training phase.

Figure 3.3: A top-level flow of the processing performed at each node in the Space-Tessellation Tree during the training phase. A set of training samples which enter the processing element are extended to allow for learning-based generalization for position, scale, and orientation. These extended samples are vectorized and used to produce the projection matrices to the MEF and MDF subspaces. The extended samples are projected to the MDF subspace using these matrices, and a tessellation of the space covered by the node being worked on is produced. The projection matrices and the space tessellation for each node are produced in the learning phase.

Because the attention supplier provides a rough estimate of the position and scale of the area of interest, only a relatively small variation around this point need be expanded using the learning-based parameter generalization. The span of the variation for each of the parameters (e.g., row and column positions in the image, size of the extracted fovea) is divided into a uniform grid of points. A new fovea is extracted from the image at each of these grid points, generating an extended training set designed to handle the recognition generalization to position and size variations.
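A minimal sketch of this grid-based expansion (illustrative; the grid offsets, scale factors, and the `extract_fovea` routine are hypothetical stand-ins, not the values used in the thesis):

```python
# Illustrative sketch of the learning-based parameter generalization: the
# estimated row, column, and fovea size are each varied over a small uniform
# grid, and a fovea is extracted at every grid point to extend the training set.

def expand_training_sample(image, row, col, size, extract_fovea,
                           offsets=(-4, 0, 4), scales=(0.9, 1.0, 1.1)):
    extended = []
    for dr in offsets:
        for dc in offsets:
            for s in scales:
                fovea = extract_fovea(image, row + dr, col + dc, size * s)
                extended.append(fovea)
    return extended
```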
A new fovea is extracted from the image at each of these grid points, generating an extended training set designed to handle the recognition generalization to position and size variations.

3.3 The Most Expressive Features (MEF)

Each input subimage can be treated as a high-dimensional feature vector by concatenating the rows of the subimage together, using each pixel as a single feature, as shown in Figure 3.4. Thus each image is considered as a sample point in this high-dimensional space.

Figure 3.4: The vectorization of an image. SHOSLIF treats a p x q image as a point in pq-dimensional space. The rows of pixels in an image are concatenated together to form a vector.

3.3.1 Density of the space

This high-dimensional space is sparsely populated with interesting images. For a p x q image with g grey levels, there are a total of $g^{pq}$ possible points. A typical grayscale image has g = 256 pixel values. For the experimentation described in this work, the image dimensions were fixed at p = 64 and q = 88, so the number of possible points in the space that this work utilizes is $256^{5632}$, which is a very large number. Most of these points are not interesting images to consider in a vision processing system, however. For example, an all- (or near-) white image or an all- (or near-) black image is useful for testing how the system handles degenerate cases, but not interesting from a practical standpoint of vision processing in the real world. In addition, contrived images, such as images with alternating white and black pixels, though they exist in the space, do not occur naturally and are therefore not very interesting images to use except to test the handling of these outlying cases.

In fact, the interesting images tend to be grouped together in a small area. For the database of images used in this work, the maximum Euclidean distance between images viewed as points in this space is less than 16,000. Figure 3.5 shows a histogram of the distances between image vectors in the database.

Figure 3.5: A histogram of inter-point distances for the images in the database treated as vectors. The horizontal axis shows the distance between a pair of image vectors; the vertical axis shows the number of pairs that fell into each bin.

Image instances of a particular object can be represented by an n-dimensional random vector X. X can be expanded exactly by

$$X = VY,$$

where the columns of the n x n square matrix V are orthonormal basis vectors; this makes Y a random feature vector of the image X. Without loss of generality, Y can be considered to be centered about its mean $\mu_Y$, since we could always redefine $Y - \mu_Y$ as the new vector.

3.3.2 Principal component analysis

The dimension n of X is usually very large, on the order of several thousand even for small image sizes. Since we expect that a relatively small number of features is sufficient to characterize a class, it is efficient and reasonable to approximate X using m < n columns of V to give

$$X^{(m)} = \sum_{i=1}^{m} y_i v_i,$$

where the $v_i$'s are the column vectors of V.
Let the effectiveness of the approximation be defined as the mean-square error $\|X - X^{(m)}\|^2$. Then we can use the proven result [55] [89] [105] that the best vectors $v_1, v_2, \cdots, v_m$ to use are the unit eigenvectors associated with the m largest eigenvalues of the covariance matrix of X,

$$\Sigma_X = E\left[(X - M_X)(X - M_X)^t\right],$$

where $M_X$ is the mean (expected) vector of X. During the training phase, $\Sigma_X$ is replaced by the sample scatter matrix. Then the features $y_1, y_2, \cdots, y_m$ can be easily computed from

$$y_i = v_i^t (X - M_X), \quad i = 1, 2, \cdots, m.$$

Since these $y_i$'s give the minimum mean-square error, we call them the Most Expressive Features (MEF) because they best express the population in the sense of a linear transform.

This projection, also called the Karhunen-Loeve projection and principal component analysis [84], has been used to represent [94] and recognize [154] [122] face images, and for planning the illumination of objects for future recognition tasks [116].

To determine m, the number of features to use, we first rank the eigenvalues of $\Sigma_X$, $\lambda_1, \lambda_2, \cdots, \lambda_n$, in non-increasing order. The residual mean-square error in using m < n features is simply the sum of the eigenvalues not used, and is a natural criterion for determining how many features are needed to sufficiently represent a class. We can choose m such that the sum of these unused eigenvalues is less than some fixed percentage P of the sum of the entire set. So we let m satisfy

$$\sum_{i=m+1}^{n} \lambda_i < P \sum_{i=1}^{n} \lambda_i.$$

If P = 5%, a good reduction in the number of features is obtained while retaining a large proportion of the variance present in the original feature vector [84] [154], thereby losing very little of the original population-capturing power.

3.3.3 Computational Considerations

We can approximate the covariance matrix $\Sigma_X$ with the sample scatter matrix $S = UU^t$, where $U = [U_1 U_2 \cdots U_k]$ and $U_i = X_i - \bar{X}$, for k training samples. If k < n, as is typically the case when dealing with a small number of training samples relative to the image dimension, S is degenerate. When this happens, however, we can find the eigensystem of the k x k matrix $U^t U$. This means that

$$U^t U w_i = \lambda_i w_i,$$

with eigenvalue $\lambda_i$ and associated eigenvector $w_i$. Pre-multiplying by U gives

$$U U^t U w_i = \lambda_i U w_i.$$

Then $v_i = U w_i$ is the eigenvector of $S = UU^t$ with eigenvalue $\lambda_i$. If the number of samples available is more than the image dimension, then the eigensystem of $UU^t$ can be computed directly.

For the database of images used in this work, each image was projected to the MEF space to find the density of the vectors in this MEF space. Figure 3.6 shows a histogram of the distances between MEF vectors in the database.

Figure 3.6: A histogram of inter-point distances for the images projected to the MEF space. The horizontal axis shows the distance between a pair of MEF vectors; the vertical axis shows the number of pairs that fell into each bin.

The MEF vectors are well-suited to image approximation. Figure 3.7 demonstrates the capability of the MEFs to effectively approximate an image set. Images of a single individual were used for training; variations in lighting and expression were present in the training samples. We can view the MEF vectors as images by reversing the vectorization procedure shown in Figure 3.4.
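A minimal NumPy sketch of the MEF computation, including the small-sample shortcut of Section 3.3.3 (finding the eigensystem of the k x k matrix $U^t U$ when k < n), might look as follows; the function names and the 95% variance threshold are illustrative assumptions.

    import numpy as np

    def mef_projection(samples, var_fraction=0.95):
        # samples: (k, n) array, one vectorized image per row.
        k, n = samples.shape
        mean = samples.mean(axis=0)
        U = (samples - mean).T                    # n x k, columns U_i = X_i - mean
        lam, W = np.linalg.eigh(U.T @ U)          # eigensystem of the k x k matrix
        order = np.argsort(lam)[::-1]
        lam, W = lam[order], W[:, order]
        V = U @ W                                 # v_i = U w_i are eigenvectors of U U^t
        V /= np.linalg.norm(V, axis=0) + 1e-12    # normalize to unit eigenvectors
        # keep enough eigenvectors to retain var_fraction of the total variance
        keep = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_fraction)) + 1
        keep = min(keep, k)
        return mean, V[:, :keep], lam[:keep]

    def to_mef(x, mean, V):
        # y_i = v_i^t (X - M_X)
        return V.T @ (x - mean)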
Figure 3.7: Example MEFs and their ability to approximate a training set; the training images are shown alongside their approximations. The training images were taken from the Weizmann Institute of Science in Israel. The images of a single individual are shown.

Figure 3.8 shows samples of the MEF vectors for a large training set of images with variations in lighting direction and expression. Six images of each of 28 individuals were used as the training set; each individual had two expressions taken under three different lighting conditions. As can be seen in the figure, the most important features are due to the lighting direction and the expression changes.

Figure 3.8: Example MEFs for a large training set. The top thirty are shown for a training set of 168 images. Six images of each of 28 individuals were used for training this set.

3.4 The Most Discriminating Features (MDF)

Although the MEF projection is well-suited to object representation, the features produced are not necessarily good for discriminating among classes defined by the set of samples. The MEFs describe some major variations in the set of sample images, such as those due to lighting direction; these variations may well be irrelevant to how the classes are divided. Figure 3.9 shows an example of a two-dimensional case where the MEF projection cannot separate classes in the population. In this figure, the MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation. This clear separation is provided by the discriminant analysis procedure. The features used to effect this clear separation are called the Most Discriminating Features (MDFs) because, in a linear sense, they optimally discriminate among the classes represented in the training set.

Figure 3.9: Problems with the MEF projection for class separation. The MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation.

3.4.1 Multivariate linear discriminant analysis

Let W be a projection matrix that projects a vector into the MDF subspace. Vector $Z = W^t Y$ is a new feature vector from samples of c classes with class means $M_i$, $i = 1, 2, \cdots, c$. Then the within-class scatter matrix is defined as [84]

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (Y_j - M_i)(Y_j - M_i)^t$$

for the $n_i$ samples from class i. For a grand mean vector M over all samples from all classes, the between-class scatter matrix is defined as

$$S_b = \sum_{i=1}^{c} (M_i - M)(M_i - M)^t.$$

In discriminant analysis, we want to determine the projection matrix W that maximizes the ratio $\det\{W^t S_b W\} / \det\{W^t S_w W\}$. In other words, we want to maximize the between-class scatter while minimizing the within-class scatter. It has been proven [51] [163] that this ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of $S_w^{-1} S_b$ associated with the largest eigenvalues. Then the scalar components of Z are the feature values of the given samples, and the column vectors of W are the MDF feature vectors.

3.4.2 Computational Considerations

Because the matrix $S_w^{-1} S_b$ need not be symmetric, the eigensystem computation could be unstable.
To avoid this problem, the following method diagonalizes two symmetric matrices and produces a stable eigensystem computation. Compute H and $\Lambda$ such that

$$S_w = H \Lambda H^t, \qquad (3.1)$$

where H is orthogonal and $\Lambda$ is diagonal. Then

$$\left(H \Lambda^{-\frac{1}{2}}\right)^t S_w H \Lambda^{-\frac{1}{2}} = I. \qquad (3.2)$$

Now compute U and $\Sigma$ such that

$$\left(H \Lambda^{-\frac{1}{2}}\right)^t S_b H \Lambda^{-\frac{1}{2}} = U \Sigma^2 U^t, \qquad (3.3)$$

where U is orthogonal and $\Sigma$ is diagonal. Then

$$S_b = H \Lambda^{\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t, \qquad (3.4)$$

$$S_w = H \Lambda^{\frac{1}{2}} U I U^t \Lambda^{\frac{1}{2}} H^t. \qquad (3.5)$$

Defining $V = H \Lambda^{-\frac{1}{2}} U$, V diagonalizes $S_b$ and $S_w$ at the same time. Since

$$S_w^{-1} = H \Lambda^{-1} H^t, \qquad (3.6)$$

Equations 3.4 and 3.6 give

$$S_w^{-1} S_b = H \Lambda^{-1} H^t H \Lambda^{\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t = H \Lambda^{-\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t = V \Sigma^2 V^{-1}.$$

That is, V consists of the eigenvectors of $S_w^{-1} S_b$, and $\Sigma^2$ contains the eigenvalues of $S_w^{-1} S_b$. Thus, using the symmetric properties of the component scatter matrices, we have a stable method for finding the eigensystem of $S_w^{-1} S_b$.

3.4.3 The DKL projection

The discriminant analysis procedure breaks down, however, when the within-class scatter matrix $S_w$ becomes degenerate. This can happen when the number of samples is smaller than the dimension of the sample vectors. Using discriminant analysis directly on the images will generally render $S_w$ non-invertible due to the high dimension of the input image relative to the number of training samples. For example, even a very small image of 64 x 64, when vectorized, turns into a 4096-dimensional sample vector. The number of training samples available is usually much smaller than this large dimension.

The resolution to this problem is to perform two projections instead of one. The discriminant analysis projection is performed in the space of the Karhunen-Loeve projection (i.e., the MEF space), where the degeneracy does not occur. We first project the n-dimensional image space onto the m-dimensional MEF space. We choose m such that for s training samples from c classes, $m + c \leq s$. In fact, it is impossible for $m > s - 1$, since there are at most $s - 1$ non-zero eigenvalues in the Karhunen-Loeve projection. But we further constrain the final dimension of the MEF space to be less than the rank of $S_w$ in order to make $S_w$ non-degenerate. Since there are at most $c - 1$ non-zero eigenvalues of $S_w^{-1} S_b$, we choose $k \leq c - 1$ to be the final dimension of the MDF space. So we relate the dimensions of the MEF and MDF spaces as

$$k \leq c - 1 \leq m \leq s - c.$$

Thus, the new overall discriminant projection is decomposed into two projections: the Karhunen-Loeve projection followed by the discriminant projection. We call this new projection the Discriminant Karhunen-Loeve projection (DKL projection). We first project the n-dimensional image space (X-space) onto the m-dimensional MEF space (Y-space) using the Karhunen-Loeve projection. Then we project the m-dimensional MEF factors onto the MDF space using the discriminant analysis projection.

Definition 1 (DKL projection) The DKL projection to the Most Discriminating Feature (MDF) space is $Z = W^t V^t X$, where V is the projection matrix from the image space to the MEF space, and W is the projection matrix from the MEF space to the MDF space.

For the database of images used in this work, each image was projected to the MDF space using the DKL projection to find the density of the vectors in this space. Figure 3.10 shows a histogram of the distances between MDF vectors in the database.

Figure 3.10: A histogram of inter-point distances for the images projected to the MDF space. The horizontal axis shows the distance between a pair of MDF vectors; the vertical axis shows the number of pairs that fell into each bin.
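The discriminant (MDF) projection computed in the MEF space can be sketched along the same lines; the code below follows the two-step simultaneous diagonalization of Section 3.4.2 with the scatter-matrix definitions of Section 3.4.1. The names and the small numerical guard are assumptions of this sketch, not the implementation used in this work.

    import numpy as np

    def mdf_projection(Y, labels, num_features=None):
        # Y: (s, m) array of MEF vectors; labels: length-s class labels.
        labels = np.asarray(labels)
        classes = np.unique(labels)
        grand = Y.mean(axis=0)
        m = Y.shape[1]
        Sw = np.zeros((m, m))
        Sb = np.zeros((m, m))
        for c in classes:
            Yc = Y[labels == c]
            Mc = Yc.mean(axis=0)
            D = Yc - Mc
            Sw += D.T @ D                               # within-class scatter
            diff = (Mc - grand)[:, None]
            Sb += diff @ diff.T                         # between-class scatter
        lam, H = np.linalg.eigh(Sw)                     # S_w = H Lambda H^t
        whiten = H @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))
        sig2, U = np.linalg.eigh(whiten.T @ Sb @ whiten)
        order = np.argsort(sig2)[::-1]
        W = whiten @ U[:, order]                        # eigenvectors of S_w^{-1} S_b
        k = num_features if num_features is not None else len(classes) - 1
        return W[:, :k]

    # DKL projection of an image x: Z = W^t V^t (x - mean), with V and mean taken
    # from the MEF sketch given earlier.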
3.4.4 Explanation of the MDFs

The Most Discriminating Features are not directly tied to the absolute intensity values of the input images. Figure 3.11 shows some MEFs and MDFs obtained from a large training set of human faces. As can be seen from the figure, the features encapsulated in the MDF vectors show directional edges found in the training set and have discarded the imaging artifacts, such as lighting direction, to a large extent. Figure 3.11 thus demonstrates how the MDF vectors have divorced themselves from the artifacts of the imaging process, especially, in this instance, the lighting direction.

Figure 3.11: A sample of MEF and MDF vectors treated as images; panel (a) shows MEF vectors 1 through 8 and panel (b) shows MDF vectors 1 through 8. The MEF vectors in (a) show the tendency of the principal components to indicate major variations in the training set, such as lighting direction. The MDF vectors in (b) show the ability of the MDFs to discount those factors unrelated to classification. The training images used to produce these vectors are courtesy of the Weizmann Institute.

The Clustering Effect of the MDF subspace using the DKL projection

To show how the MDF subspace effectively discounts factors unrelated to classification, an experiment was performed to obtain the MEF and the MDF vectors for a collection of training images. Figure 3.12 shows samples of the data in the space of the best (in terms of the largest eigenvalues) two MEFs and MDFs for the experiment. From Figure 3.12, it is clear that the MDF subspace has a significantly higher capability than the MEF subspace to correctly classify an input image that is considerably different from those used in training, because classes in the MDF subspace have larger between-class distances and smaller within-class scatter.

Figure 3.12: Distribution of some samples using the best two features in the MEF space (MEF 1 versus MEF 2) and in the MDF space (MDF 1 versus MDF 2) for the root node. The classes shown include cars, chairs, computer mice, digits, human faces, and human bodies. In the MDF subspace, objects of the same class are clustered much more tightly than in the MEF space.

3.5 Space Tessellation

Given the strengths of these two feature sets, we want to provide an effective and efficient method for retrieving images from the database. The SHOSLIF produces a hierarchical space tessellation using the hyperplanes determined by the MEF and MDF features. The feature space of all possible images is partitioned into cells of differing sizes, as shown in Figure 3.13.
A cell at level l is subdivided into smaller cells at level l + 1. The network structure that effects this recursive tessellation is a Space-Tessellation Tree whose nodes represent cells at the corresponding levels, as shown in Figure 3.1. The tree is built automatically during the learning phase; it is then used in the recognition phase to find the model in the learned database that best matches an unknown image in O(log n) time on average, for n images in the database.

Figure 3.13: The SHOSLIF's hierarchical Quasi-Voronoi space tessellation, shown as (a) a 2D space view and (b) a tree view. Digits and "+" symbols indicate centers of cells at levels l, l + 1, and l + 2. Dark lines indicate cell boundaries at a level of the Space-Tessellation Tree nearer the root.

As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the model database into increasingly simpler problems. When a sample vector is presented to a node, a distance measure from the center of each of the node's children is computed to determine the most likely partition to which the sample vector belongs. At each level of the tree, the k nodes that best capture the features of the sample vector are used as roots for further refinement. The operator chooses a constant k for retrieval purposes, thereby ensuring that the retrieval complexity is O(log n) on average. The children of a node N represent a tessellation of the space subtended by N. A two-dimensional example is shown in Figure 3.14.

Figure 3.14: An example tessellation of the space covered by node N, showing the space covered by N at level l, the spaces covered by its children at level l + 1, and the tree description of this space decomposition.

3.5.1 The Hierarchy Decomposes the Problem

The tree structure is able to decompose the problem of linear discriminant analysis into smaller, tractable problems. At the root node of the tree, where the entire database of samples is found, the classes may not be separable using the linear discriminant analysis technique. But because we do not attempt to completely separate the classes at a single level, we can successfully solve the problem in stages by breaking it down into solvable pieces.

The DKL projection does well in dealing with large variation within a particular class. When the database contains many classes, however, they still may not be linearly separable. The Space-Tessellation Tree provides a mechanism for dealing with this problem. Children of a particular node decompose the difficult problem of separating many classes into several smaller problems. At each node of the tree, the set of features found by the discriminant analysis procedure is specifically tuned to the set of samples found in the node. So although the MDF space provides an optimal set of features for class selection in the sense of a linear transform, this optimal set may be insufficient to separate the classes. But even if a node cannot completely separate the classes, it can make a first attempt at separation, dividing the samples among its children. At the child level, since fewer samples exist, the linear discriminant analysis procedure is more likely to succeed. Because this is applied recursively, eventually a successful separation of classes is achieved.
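A rough sketch of how this recursive decomposition could be realized is given below. It reuses the hypothetical mef_projection and mdf_projection helpers sketched earlier and replaces the full bookkeeping of Algorithm 1 (Section 3.5.5) with a much simpler greedy partition, so it should be read as an illustration of the idea rather than the algorithm used in this work.

    import numpy as np

    def expected_radius(level):
        # Decreasing positive radius per level; base and exponent follow Section 4.2.3.
        return 1.3 ** (36 - level)

    def quasi_voronoi_partition(Z, radius):
        # Greedy one-pass partition: a sample farther than `radius` from every
        # existing center starts a new cell; otherwise it joins the nearest cell.
        centers, members = [], []
        for i, z in enumerate(Z):
            if centers:
                d = [np.linalg.norm(z - c) for c in centers]
                j = int(np.argmin(d))
                if d[j] <= radius:
                    members[j].append(i)
                    continue
            centers.append(z)
            members.append([i])
        return centers, members

    def build_node(samples, labels, level=0, max_leaf=1):
        # Each node fits projections tuned to its own samples, partitions them,
        # and recurses; multi-class nodes use the MDF space, single-class nodes
        # keep the MEF space, as described in Section 3.5.2.
        labels = np.asarray(labels)
        node = {"level": level, "size": len(samples)}
        if len(samples) <= max_leaf:
            node["samples"] = samples
            return node
        mean, V, _ = mef_projection(samples)
        Y = (samples - mean) @ V
        if len(np.unique(labels)) > 1:
            W = mdf_projection(Y, labels)
            Z = Y @ W
        else:
            W, Z = None, Y
        node.update(mean=mean, V=V, W=W)
        centers, members = quasi_voronoi_partition(Z, expected_radius(level + 1))
        if len(members) == 1 and len(members[0]) == len(samples):
            node["samples"] = samples                 # cannot split further
            return node
        node["centers"] = centers
        node["children"] = [build_node(samples[m], labels[m], level + 1, max_leaf)
                            for m in members]
        return node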
Figure 3.15 shows an example of the difference in the complexity of the class separation problem for the root node and an internal node of the tree. A child node contains fewer samples than its parent does, and the MDF vectors can therefore be optimized for the smaller set of samples in the child node.

Figure 3.15: An example showing the complexity of the class separation problem at two different levels of the tree: (a) the root node and (b) an internal node. Both plots show samples from four classes in the plane of the best two MDFs. Samples from the same classes are given for both graphs. Since the internal node contains many fewer samples than the root node, the MDF vectors can be optimized for the classes contained in the node.

3.5.2 Hierarchical Quasi-Voronoi Tessellation

We tessellate the space covered by a node N using a Hierarchical Quasi-Voronoi Tessellation, with the resulting tessellated space resembling, in two dimensions, that shown in Figure 3.13. First we define some terms and provide a few properties of the Hierarchical Quasi-Voronoi Tessellation algorithm.

Definition 2 Expected radius. Each level of the tree has an expected radius r(l) of the space it covers, where l is the level of the node and r(l) is a decreasing positive function of the level.

Definition 3 Node center. The center of a node N is the sample vector associated with the training sample that caused N to be created in the tree.

Definition 4 Distance measure. d(X, A) is the distance measure between node N, with center vector A, and a sample vector X.

A node N in the tree contains a set of labeled training images. Every node which contains more than a single training image computes a projection matrix V that is used to project the samples to the MEF space, as described in Section 3.3. If the training samples contained in N are drawn from multiple classes, as indicated by the labels associated with each training sample, then the MEF sample vectors are used to compute a projection matrix W to project the samples to the MDF space, as described in Section 3.4. Otherwise, the training samples in N are from a single class, and we organize them into a subtree using MEFs for efficient image retrieval.

The MEF subtree is used to find the nearest neighbor for a test probe. Once that nearest neighbor is found, the sample is projected back to the nearest ancestor node that utilizes an MDF space. This is done because the center of the nearest MDF ancestor node may not be very near the test probe in this MDF space; but it represents a class of objects that contains a vector that is near the test probe. Therefore, the MEF subtree is used to find that nearest neighbor in order to compare the test probe with the nearest-neighbor sample using the most specific MDF subspace. A flag is set for each node to indicate which sort of feature space it is working in. A simple example tree constructed with these spaces is shown in Figure 3.16.
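The leaf-level comparison just described, in which the nearest neighbor found in an MEF subtree is compared with the test probe in the most specific MDF ancestor's subspace, might be sketched as follows; the dictionary fields are the hypothetical ones used in the earlier sketches.

    import numpy as np

    def compare_in_nearest_mdf_ancestor(probe, leaf_sample, node_path):
        # node_path: root-to-leaf list of nodes visited for this probe.
        for node in reversed(node_path):
            if node.get("W") is not None:            # deepest MDF node on the path
                def project(x):
                    return ((x - node["mean"]) @ node["V"]) @ node["W"]
                return np.linalg.norm(project(probe) - project(leaf_sample))
        # no MDF ancestor on the path: fall back to an image-space comparison
        return np.linalg.norm(probe - leaf_sample)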
Figure 3.16: An example tree built with both the MEF and the MDF subspaces. Each node lists the number of samples it holds from each class. Shaded nodes indicate that the MDF space was used; square nodes indicate that the MEF space was used; bold nodes are leaf nodes.

3.5.3 Incremental Learning

Training samples are added to the tree in a series of batches. Larger batches produce more efficient image retrieval trees, but they take longer to develop. Smaller batches can be processed more quickly, but may yield a less efficient tree than one in which all the training images are given as a single batch. Thus the operator of the system must consider the tradeoff between learning time and retrieval time complexity when determining the batch sizes used for training the tree.

Suppose that we want to add training sample $X_i$ to node N, which is at level l. If the feature vector for $X_i$ is within the radius covered by one of the children of N, then $X_i$ will be added as a descendant of that child as we increase the depth of the tree. If the feature vector for $X_i$ is outside the expected radius for all the children of N, however, we would like to add $X_i$ as a new child of N. If N is an internal node, then a partial tessellation of N has already been established by previous training batches, and care must be taken in changing an existing tessellation, as discussed in the following section.

3.5.4 The Adoption Problem

Once node N becomes a grandparent node, the tessellation of the space at node N must be fixed so that the descendants of N will not be lost in a changing tessellation. For example, due to some previous processing, G may be a grandchild of N, as shown in Figure 3.17. G is close to a tessellation boundary, but it falls outside the expected cell radius for its nearest neighbor, node C1.

Figure 3.17: An example tessellation where grandchild G is close to a tessellation boundary and outside the expected distance radius for its nearest parent C1. The figure shows the expected distances from children C1 and C2, together with the tree description of this space decomposition.

Now suppose that a new image X falls into the space covered by N at the position shown in Figure 3.17. The nearest neighbor to X is node C2, but X falls outside the cell radius for C2. So we wish to establish X as a child of N in order to contribute to the tessellation of N. If we were to do this, however, the new tessellation of node N would be as shown in Figure 3.18.

Figure 3.18: An incorrect tessellation of node N that would result from adding X as a child of N. The subtree rooted at G would be adopted by X, and therefore unreachable, if this tessellation were to be established.

The modified tessellation, though seemingly appropriate for node N, renders the entire subtree rooted at G unreachable because it is adopted by X. Node N will pass samples that should be classified using the subtree rooted at G to node X; since X knows nothing about subtree G, the tree rooted at G cannot be accessed.
Due to the possibly complex subtree rooted at G, a simple attachment of G as a child of X may cause problems for G's descendants. Of course, we could always restructure the tree whenever a new sample is added to the model database. Such a restructuring, however, is very computationally expensive to perform. The system could be set up to do this type of restructuring during slow times, but we do not want to impose such a large computational requirement for every incrementally learned sample or batch of samples. Because of this problem of losing subtrees that are rooted near a tessellation boundary, we say that node N becomes a sterile node whenever a grandchild is added to N's subtree.

Definition 5 Sterile Node. A node N is a sterile node when its tessellation can no longer be changed.

Lemma 1 Node N at level l is a sterile node when it becomes a grandparent.

Proof:

1. When a node N produces its first child node, a copy of N is made at the level of the child (see Algorithm 1), and this copy contributes to the tessellation of N. So N can have either 0 or more than 1 children contributing to its tessellation.

2. When one of the children of N produces a child of its own, G, then N becomes a grandparent node.

3. It is possible for this new grandchild to be near a tessellation boundary of N, but outside the expected radius for nodes at level l + 1.

4. If a new node X now comes into N to (possibly) contribute to N's tessellation, it is possible for it to be positioned as shown in Figure 3.17. If this is the case, then the tessellation of N cannot be changed by the new sample X, since the grandchild G may be adopted by X.

Therefore, node N at level l is a sterile node when it becomes a grandparent. □

Figure 3.19: Lemma 1.

For sterile nodes, such as node N in Figure 3.17, we do not want to change the tessellation (i.e., the children of N). X is added as a descendant of C2, even though it is beyond the expected radius of C2.

3.5.5 Conditional Sterility Principle

This sterile condition of a node need not be absolute, however. We could have previously built a tree as shown in Figure 3.20, where $C_{1,l+2}$ and G are the only grandchildren of N. Adding X as a child of N in this case will alter the tessellation of N, but it will not render any subtree of N unreachable.

Figure 3.20: An example tessellation where changing the tessellation would not result in the loss of a subtree.

Figure 3.21 shows the tessellation resulting from adding X as a child of N. This example can be generalized and encapsulated in the Conditional Sterility Principle for dealing with the Adoption Problem.

Figure 3.21: A valid tessellation of node N resulting from the addition of X as a child of N. No subtrees of N are lost by this tessellation modification.

Principle 1 Conditional Sterility Principle. It is safe to add a new node X as a child of sterile node N if none of the current descendants of N will be covered by the new node X.

If none of the descendants of N fall into the new cell that will be covered by X (i.e., the shaded region in Figure 3.21), then X can be added as a child of N even if N is a sterile node. To do this, we keep track of D(C), the maximum distance between node C and all of its descendants, while we are building the tree.
Then we can add a new training sample X as a child of sterile node N if $d(X, C_j) > D(C_j) + r(l + 1)$ for all children $C_j$ of N at level l + 1.

In summary, the tree is built using a sequence of training sample batches. Within each batch, the tree is updated at level l using all the batch samples before level l + 1 is updated. Essentially, the algorithm goes through the following steps:

- Given c classes and k training images for a particular node N (for example, ten individuals and 50 fovea training images, five from each individual, for the root node of the tree).
- Obtain a vector representation of each image.
- Compute the Most Expressive Features; the maximum number possible is k.
- Choose m ≤ k features of the MEFs to retain.
- Compute the Most Discriminating Features in the MEF space; the maximum number possible is c − 1.
- Choose p ≤ c − 1 features to retain.
- Perform a space tessellation in this MDF space for node N. We have k points in this p-dimensional space to use to partition the space.

Algorithm 1 summarizes and incorporates the solutions to the concerns in the preceding discussion. This algorithm is called the Hierarchical Quasi-Voronoi Tessellation Algorithm because the space of each node is partitioned into a Quasi-Voronoi tessellation in a hierarchical manner. A characterization of a sample tree built using this algorithm is shown in Table 3.1; the generated tree is too large to display in detail, so only the number of nodes produced on each level is given there.

Algorithm 1 The Hierarchical Quasi-Voronoi Tessellation Algorithm
Input: Node N at level l − 1, list of samples X to add.
Output: A (possibly) changed tessellation of N based on the new samples.

1. If this is not a sterile node:
   (a) Add the samples already covered by node N to the list of new samples.
   (b) If there are multiple images in this node:
       i. Compute the projection matrix V to the MEF space for this node.
       ii. If there are multiple classes in this node, compute the projection matrix W to the MDF space for this node.
       iii. For each child $C_i$ of this node:
            A. Project the center of $C_i$ to the new MEF/MDF space of N, whichever is in use.
            B. Set the radius covered by each child $C_i$ to the default for the level. Since we added the existing samples to the new ones, the samples will be re-partitioned.
2. For each sample X:
   (a) If V is defined, project X to the MEF space to get Y.
   (b) If W is defined, project Y to the MDF space to get Z.
   (c) Compute an array of distances between the children of N and Z (or Y, depending on which space is in use for node N). Keep an array R of the radii covered by the children of N, in case Z (or Y) belongs to none of the children of N.
   (d) If Z (or Y) is external to all of the spaces covered by the children of N:
       i. If N is not sterile, then add Z (or Y) as a new child of N.
       ii. Otherwise, if $d(Z \;(\text{or } Y), C_i) > R_i + r(l + 1)$ for all children $C_i$ of N, then add Z (or Y) as a new child of N.
       iii. Otherwise, put Z (or Y) into the nearest child of N and update that child's radius of space covered, if necessary.

Figure 3.22: Algorithm 1, The Hierarchical Quasi-Voronoi Tessellation Algorithm.
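The insertion step of Algorithm 1 for a single projected sample, including the sterile-node and Conditional Sterility checks, can be sketched as follows. The node fields (centers, radii, maximum descendant distances, sterility flag) are assumptions of this sketch, and the safety test uses the D(C) form of Principle 1; it also reuses the expected_radius helper sketched earlier.

    import numpy as np

    def try_add_as_child(node, z, level):
        # z: the sample already projected into node's MEF or MDF space.
        r_next = expected_radius(level + 1)
        dists = [np.linalg.norm(z - c) for c in node["centers"]]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] <= node["radii"][j]:
            return ("descend", j)                    # covered by an existing child
        if not node.get("sterile", False):
            return ("new_child", None)               # tessellation may still change
        # Sterile node: add a new child only if no descendant could be adopted by it,
        # i.e., d(z, C_i) > D(C_i) + r(l + 1) for every child C_i.
        safe = all(d > node["max_descendant_dist"][i] + r_next
                   for i, d in enumerate(dists))
        if safe:
            return ("new_child", None)
        node["radii"][j] = max(node["radii"][j], dists[j])
        return ("descend", j)                        # nearest child absorbs the sample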
Level    Node count
0        1
1        9
2        32
3        93
4        256
5        471
6        491
7        310
8        113
9        25
10       2
Total    1803

Table 3.1: A characterization of a sample tree built using the Hierarchical Quasi-Voronoi Tessellation Algorithm. The table lists the number of nodes found on each level of the tree, because the tree is too large to show in detail.

Other trees have been utilized for classification purposes [18] [115]. These trees typically perform the most obvious class separations first, postponing finer distinctions to a later stage. In these trees, the classification problem is broken down into smaller, simpler problems [103]. However, these trees typically have a fixed set of hand-crafted features [138], and they are not necessarily tuned specifically for the samples found in the nodes of the tree. In contrast, our Space-Tessellation Tree locates those features that can best describe the samples contained in each node, and best discriminate among the different classes in these samples.

3.5.6 Properties of the Hierarchical Quasi-Voronoi Tessellation Algorithm

The Hierarchical Quasi-Voronoi Tessellation Algorithm has several favorable properties. Among them are a bounded number of children for each node and, for n samples, O(log n) levels in the average case.

Theorem 1 Given a fixed dimensionality d for the samples and a decreasing positive expected radius function r(l) based on the level of the tree, the number of children that a node N at level l − 1 can have is bounded above by a constant κ, regardless of the training set.

Proof: Let N be the center vector for a node and n, n' be children of node N. Let P(N) be the hypersphere centered at N with radius $r_{l-1} + (r_l/2)$. Let q(n) be the hypersphere centered at n with radius $r_l/2$. Then $q(n) \cap q(n') = \emptyset$ if $n \neq n'$; otherwise, if these hyperspheres overlapped, one of {n, n'} would be the child of the other by Algorithm 1. See Figure 3.24 for a 2D example of these hyperspheres. P(N) is the extended space of N, i.e., the space that N might have to cover. q(n) is the smallest hypersphere that produces non-overlapping children of N, since $d(n, n') \geq r_l$ by Algorithm 1. Note that P(N) contains every q(n), for all children n of N. Since the volume of a d-dimensional hypersphere of radius R is proportional to $R^d$, the volume of P(N) satisfies $V_P \propto (r_{l-1} + r_l/2)^d$, and likewise the volume of q(n) satisfies $V_q \propto (r_l/2)^d$. Since q(n) and q(n') do not overlap, and P(N) contains all q(n), the number of q(n)'s, which is also the number of children of N, is bounded above by the constant

$$\kappa = \frac{V_P}{V_q} = \left[\frac{r_{l-1} + (r_l/2)}{r_l/2}\right]^d = \left[1 + \frac{2 r_{l-1}}{r_l}\right]^d. \qquad (3.7)$$

□

Corollary 1 If $r_{l-1} = \alpha r_l$, $\alpha > 1$, then the number of children of any node is not larger than $(1 + 2\alpha)^d$.

Figure 3.23: Theorem 1.

Theorem 1 uses the fact that both the dimensionality of the sample vectors and the expected radius used to create node N's children are constants to assert that the maximum number of children that N can produce is bounded above by a constant κ, irrespective of the training samples used. Lemma 2 asserts that Algorithm 1 produces O(log n) levels.

Figure 3.24: Hyperspheres used in Theorem 1: q(n), of radius $r_l/2$, and P(N), of radius $r_{l-1} + (r_l/2)$.

3.6 Image Retrieval

When an unknown image X is presented to the recognition tree, the general flow shown in Figure 3.26 is followed. When a node N of the tree is activated, X is projected to $X_Y$, a vector in the MEF subspace of node N. $X_Y$ is then projected to $X_Z$, a vector in the MDF subspace of node N. $X_Z$ is compared with each of N's children. The child with the best response is selected as the path of the tree to be explored.
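In code, this per-node activation, generalized to the k competing paths introduced below and summarized later in Algorithm 2, might look like the following sketch; it assumes the node dictionaries of the earlier sketches.

    import numpy as np

    def node_distances(node, x):
        # Project probe x into the node's subspace and measure it against each child.
        y = (x - node["mean"]) @ node["V"]
        z = y @ node["W"] if node.get("W") is not None else y
        return [np.linalg.norm(z - c) for c in node["centers"]]

    def retrieve(root, x, k=10):
        # Explore at most k competing nodes per level; collect the leaves reached.
        frontier, leaves = [root], []
        while frontier:
            candidates = []
            for node in frontier:
                if not node.get("children"):
                    leaves.append(node)
                    continue
                d = node_distances(node, x)
                for i in np.argsort(d)[:k]:
                    candidates.append((d[i], node["children"][i]))
            candidates.sort(key=lambda t: t[0])
            frontier = [n for _, n in candidates[:k]]
        return leaves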
Such a single-path search will guarantee that an exact match can always be found, but it cannot guarantee that a nearest match will always be found. As shown in Figure 3.27, the best match for X may be found in the subtree rooted at G, but a single-path search will explore only C2 and its subtree. If the nearest neighbor lies in the subtree rooted at G, it will be missed in a single-path search. Therefore, we define a constant number k of nodes to be explored at every level of the tree. Using a constant k at every level ensures that only a band of the Space-Tessellation Tree is explored, the band surrounding the nearest leaf-node match of a test probe.

Definition 6 Given n samples, a Bounded Unbalanced Tree with Unbalance Bound $0 \leq \alpha \leq 1$ (a constant) is a tree such that for any node N containing $n_1 + n_2 + \cdots + n_k$ samples, where N has k children with $n_i$ samples assigned to child i and $n_1 \geq n_2 \geq \cdots \geq n_k$, we have $n_1 \leq \alpha (n_1 + n_2 + \cdots + n_k)$.

Lemma 2 The number of levels in a Bounded Unbalanced Tree with n samples is bounded above by $\log_{1/\alpha} n$, where α is the Unbalance Bound of the tree.

Proof: Each node N of the tree is assigned $n_1 + n_2 + \cdots + n_k$ samples, where $n_i$ is the number of samples assigned to the ith child of N. Rank these $n_i$'s so that $n_1 \geq n_2 \geq \cdots \geq n_k$. Because the tree is a Bounded Unbalanced Tree, we know that $n_1 \leq \alpha(n_1 + n_2 + \cdots + n_k)$, and this holds for all nodes N of the tree. Here α is a constant; it may be large (e.g., 99%), but it is still a constant. This means that no node will assign a huge portion of its samples to a single child. An extreme example would be a node where child 1 receives n − 1 samples and child 2 receives 1 sample. Since the tree is a Bounded Unbalanced Tree, this is not allowed; in the worst case, child 1 receives αn samples and child 2 receives (1 − α)n samples. This is a significant constraint. Each deeper level of the tree reduces the number of samples by a factor of at least α, so the lth level down the tree receives at most $n\alpha^l$ samples. At the tree height h, we have just a single sample by Algorithm 1. Then $n\alpha^h = 1$, so $\alpha^h = 1/n$, or $(1/\alpha)^h = n$. Hence the height of the tree is $h = \log_{1/\alpha} n = \log n / \log(1/\alpha)$. □

Figure 3.25: Lemma 2.

The retrieval algorithm is given in more detail by Algorithm 2. Theorem 2 uses the results from Theorem 1 and Lemma 2 to show that this Image Retrieval algorithm runs in O(log n) time on average.

Figure 3.26: The general flow of each SHOSLIF Processing Element.

3.7 Summary

This chapter introduces the Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework. The SHOSLIF paradigm is an attempt at a unified methodology for comprehensive visual learning. It attacks problems such as how to automatically select the most salient features and how to organize the visual database. The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space obtained from the training images.
Two feature spaces are described: a Most Expressive Features space, or principal component feature space, obtained using a Karhunen-Loeve projection, and a Most Discriminating Features space, or discriminant analysis feature space, obtained using the multivariate linear discriminant analysis procedure. We assume a visual attention mechanism to provide a well-framed set of images, each of which contains a single object of interest.

We develop the Most Expressive Features and the Most Discriminating Features spaces, and describe the advantages of each of these feature spaces. We discuss how the MEF space is well-suited to object class characterization, because the space captures the major variations present in the training set of images. But we also discuss how these major variations are not necessarily good for classification. The MDF subspace, however, is designed specifically to provide optimal performance for object classification in the sense of a linear transform. Though these features are not necessarily good at capturing the features needed for class characterization, they are good at selecting the best match from among a set of classes. For both of these feature spaces, we describe some computational issues.

Figure 3.27: A tessellation in which a single-path search may not find a nearest match. Node N at level l will shuttle test probe X to its nearest child C2. The nearest neighbor for X may well be found in the subtree of G, which is missed in a single-path search.

Given the strengths of these feature sets, we provide an effective and efficient method for retrieval by incorporating these subspaces into a hierarchical space decomposition. The feature space for a particular node N in the tree is partitioned into regions based on the training data contained in that node. Each of these regions forms a child of N in the tree, and all the samples belonging to a region are given to its associated child node.

Algorithm 2 The Image Retrieval Algorithm
Input: Probe X, level l, and a list of at most a constant k nodes which were explored at level l.
Output: A list of nodes to explore at level l + 1.

1. For each node $N_i$ in the list explored at level l:
   (a) If $N_i$ is not a leaf node:
       i. Project X to the MDF subspace of node $N_i$, producing Z.
       ii. Compute $d(C_j, Z)$ for all children j of $N_i$ with center vectors $C_j$.
       iii. Transfer at most a constant k of the children of $N_i$ to the output list, such that those transferred are the k nearest neighbors of Z.
2. Truncate the output list to hold at most a constant k nodes to explore at the next level.

This algorithm is repeated for all levels of the Space-Tessellation Tree.

Figure 3.28: Algorithm 2, The Image Retrieval Algorithm.

As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the model database into increasingly simpler problems. At each node of the tree, the set of features found by the discriminant analysis procedure is specifically tuned to the set of samples found in the node.

Theorem 2 For the k-competing-path case, the number of nodes visited in a Bounded Unbalanced Tree is bounded above by $k \kappa \log_{1/\alpha} n$, where k is the number of competing paths explored, κ is the upper bound on the number of children that any node can have, α is the Unbalance Bound of the tree, and n is the number of samples in the database.
Proof: The proof follows directly from Theorem 1, Lemma 2, and the fact that a constant k nodes are explored at every level of the Bounded Unbalanced Tree. □

Figure 3.29: Theorem 2.

So although the MDF space provides an optimal set of features for class selection in the sense of a linear transform, this optimal set may be insufficient to separate the classes. But even if a node cannot completely separate the classes, it can make a first attempt at separation, dividing the samples among its children. At the child level, since fewer samples exist, the linear discriminant analysis procedure is more likely to succeed. Because this is applied recursively, eventually a successful separation of classes is achieved.

We provide an algorithm to effect this hierarchical Quasi-Voronoi tessellation in an incremental-update fashion. We discuss problems attendant on this incremental-update paradigm, such as the adoption problem, where descendants of a particular node N are lost due to a changing tessellation of node N, and we describe solutions to these problems. We provide some properties of this incremental tree-building algorithm, viz., that each node has a bounded number of children and that the tree produces O(log n) levels in the average case with a uniform grid of training points.

We discuss image retrieval under the SHOSLIF paradigm, and give an algorithm to effect O(log n) complexity for image retrieval. We describe the need for multiple competing search paths in the tree, and discuss how only a band of these competing nodes is explored as we progress down the tree.

Chapter 4

SHOSLIF for Object Recognition

The SHOSLIF paradigm described in Chapter 3 can be used for a wide variety of applications, including object recognition [146] [147] [148] [149] [150] [151] [152], motion understanding [43] [44], vision-guided robot manipulation [80], and autonomous navigation [33] [34] [35]. This work describes the SHOSLIF setup applied to object recognition, SHOSLIF-O.

4.1 SHOSLIF-O using manually-defined hierarchical labels

One mode of operation for the SHOSLIF-O system is a mode where the operator supplies a hierarchical set of labels for each training image. For example, for training images $X_i$ and $X_j$, the hierarchical set of labels could be as shown in Table 4.1.

Sample   General Label   Species Label   Name Label   Particular Sample Label
$X_i$    everything      humans          Joe          image-1
$X_j$    everything      dogs            Fido         image-4

Table 4.1: An example of a set of hierarchical labels that could be used in the hierarchical-labels mode.

Thus we have a hierarchical set of classifications that we can use to build the tree. In this mode, while constructing the tree, we view at most three levels at a time, as shown in Figure 4.1. For a given node N, we compute the MDFs using the means of N's grandchildren as sample points, and the children of N as the classes among which we want to discriminate. This gives each level of the tree maximal discrimination among the various subtrees defined by the training labels.

Figure 4.1: Nodes used for training the MDFs: a node N, its children (the classes, e.g., cars, humans, and chairs), and the training samples beneath them.

To demonstrate the operation of the SHOSLIF-O in this mode, the tree was trained using eight hundred images. The labels were assigned to the training samples to produce a tree designed to have four levels, as shown in Figure 4.2. The resultant tree was tested both using the re-substitution method (i.e., where the test image is in the training set) and using a disjoint test set (i.e., where the test image is not in the training set, so the system has not learned the test image).
The results from these tests are shown in Table 4.2.

Table 4.2: Results of testing the tree built under the hierarchical-labels learning mode, reporting the training and test set sizes and the rates at which the correct sample was the top choice or was among the top matches.

Several problems present themselves with this approach to constructing the tree.

Figure 4.2: The set of hierarchical labels used.

When the operator assigns a hierarchical set of labels to each training sample, the structure of the tree is essentially defined under this mode of operation. Though this may result in a more efficient (i.e., compact) tree, some level in the hierarchy of classes may not be linearly separable, and the performance of the system suffers from this. It would be much better for the system itself to decide the structure of the Space-Tessellation Tree rather than the operator, because the system is able to construct the space tessellation based on the data itself rather than on some semantic meaning imposed by the operator, as described in Chapter 3.

In order to discuss the automatic space tessellation mode of the SHOSLIF-O, some of the parameters of the framework must be explored. Following the discussion of these system parameters, results from various experiments are reported.

4.2 SHOSLIF-O Options for Automatic Space Tessellation

4.2.1 Distance measures

As in any recognition system, the similarity measure used is of critical importance. For the SHOSLIF-O case it is especially important, because the distance measure affects how the database of training images is partitioned and how the tree is built. As such, the distance measure has a direct effect on the recognition rate and the retrieval capability of the system.

Several experiments were done to determine a good distance measure to use. A normalized distance, a simple feature space distance, and a metric that included the distance from the subspace were all tested. Each distance measure was used to train on a set of 1151 images taken from 432 classes. The images used are shown in Figures A.1 and B.1. At the time that these experiments were run, the full set of faces was not available. The same disjoint set of test images was used to test the recognition tree for each of the different distance measures.

In addition to the different distance measures, an analysis was performed of the significance of maintaining a history of how well the ancestors of a particular node had matched the test probe, used to determine which subtrees to explore. Upon reaching the leaf nodes, this tally can be viewed as the confidence of reaching the leaf. The confidence measure used was the sum of the confidences for each node on the path from the root to the leaf, divided by the path length. The confidence for each node was computed differently for each of the different distance measures. For the normalized distance, which was already in the range [0, 1], the confidence for the node was simply 1.0 minus the normalized distance. For the other distance measures, the maximum distance among the children of a node was computed, and the confidence was taken to be 1.0 minus the distance from a child divided by this maximum distance.

In the following experiments, the list of nodes to explore was determined either by the distance alone or by the confidence values obtained. When the distance was used to determine the set of nodes to explore for a particular level of the tree, the nodes were ranked by increasing distance.
The top k of these, that is, the nodes with the smallest distances, were chosen to explore further. When the confidence measure was used, the nodes visited in a particular level were ranked by decreasing confidence; the k nodes from a level with the largest confidence were chosen to explore further.

Normalized distance: a bad measure

The distance measure used in the normalized distance experiment was

$$d(X, A) = \exp\left(-\frac{1}{2}\sum_{i=1}^{d} \frac{(X_i - A_i)^2}{\sigma_i}\right),$$

where d is the dimension of the vectors, X is the test probe, A is the center of the node, and $\sigma_i$ is the variance of the ith dimension for the space being used. To determine $\sigma_i$, the training samples for the node being computed were projected to the subspace in use (i.e., the MEF or the MDF subspace, depending on whether the node has a single class or multiple classes). Then, for dimension i and subspace feature vectors $Y_j$, $j = 1, 2, \cdots, s$, for the s samples in the node, we have

$$\sigma_i = \frac{1}{s - 1} \sum_{j=1}^{s} (Y_{ji} - \bar{Y}_i)^2.$$

A performance graph for exploring k concurrent neighbors of the tree, for various k, is given in Figure 4.3. Results obtained using both the distance ranking method and the confidence ranking method are shown in the graph.

Figure 4.3: Performance of the Normalized Distance measure as a function of the number of nearest neighbors explored. Both the distance ranking method and the confidence ranking method are shown.

The results of this distance measure were quite poor, only roughly 44% accuracy under the best circumstances. The reason for this poor performance is that by trying to normalize the distance through division by the variance of each dimension, we confound the separation of the classes in the MDF space that we worked so hard to achieve. Clearly, another measure is required, and the next metric considered was the feature space distance measure.

Feature space distance

In the feature space distance, the displacement of a test probe's projection from a node's center vector in that node's subspace was used as the distance measure. For MEF nodes, the eigenvectors corresponding to the top 95% of the eigenvalues found in the training set were maintained. For MDF nodes, the eigenvectors corresponding to the top 30% of the eigenvalues were used. The distance measure for the feature space distance experiment was

$$d(X, A) = \sqrt{\sum_{i=1}^{d} (X_i - A_i)^2},$$

where d is the dimensionality of the vectors, X is the test probe in the MDF space, and A is the center of the node being tested in the MDF space.

The performance graph showing the results from this distance measure is given in Figure 4.4.

Figure 4.4: Performance of the feature space distance measure as a function of the number of nearest neighbors explored. When the history was maintained, the results traced the same lines as when history was not maintained.

Though an improved recognition rate was achieved, the results were still only roughly 80% accuracy in the best cases. An analysis of the reasons for this is given in Section 4.2.1.
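For reference, the two measures compared above can be written compactly as follows; the exponential form of the normalized distance is one plausible reading of the partly illegible formula in the scan, so treat it as an approximation rather than the exact expression used in this work.

    import numpy as np

    def normalized_distance(x, a, var):
        # Squared differences scaled by the per-dimension variance of the node's samples.
        return np.exp(-0.5 * np.sum((x - a) ** 2 / var))

    def feature_space_distance(x, a):
        # Plain Euclidean distance in the node's MEF or MDF subspace.
        return np.linalg.norm(x - a)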
Image space subtrees instead of MEF subspace subtrees

Because the results were still not good, we decided to do an experiment in which we replaced the MEF subspace subtrees in the SHOSLIF-O recognition tree with image space subtrees. Algorithm 1 calls for us to construct a subtree in the MEF subspace for those nodes in which all the samples come from a single class. In this experiment, that was changed so that once a node contained a single class, the space tessellation tree for that single-class node was generated in the original image space instead of in the MEF subspace. That is, once a node contained images all from a particular class, rather than building an MEF space subtree from that node, we built an image space subtree from that node. The distance measure used in this experiment is given by

$$d(X, A) = \sqrt{\sum_{i=1}^{d} (X_i - A_i)^2},$$

where d is the dimensionality of the vectors, X is the test probe, and A is the center of the node being tested.

Once a partitioning of the training set has produced nodes containing only single classes, rather than generating a hierarchy in the MEF subspace of these nodes, this experiment generated a hierarchy in the image space directly. The assumption is that the search in the MDF tree finds the node(s) associated with the appropriate class for the test probe. Given this assumption, the MEF subtrees simply provide a means for finding those images from the class most similar to the test probe. In this experiment, we measure the effectiveness of using the MEF feature vectors instead of the image space directly: instead of using the MEF space to generate a hierarchy of the training samples in the MDF leaf nodes, we use the image space directly. The performance graph obtained using the image space directly is given in Figure 4.5. In this experiment, we achieved a better recognition rate, which gave us pause to consider why such an event would occur.

Figure 4.5: Performance of the feature space distance measure using the image space instead of the MEF subspace once a single class has been found. The performance is given as a function of the number of nearest neighbors explored.

Distance from Subspace

The experiment where the image space replaced the MEF space for single-class subtree construction made us analyze the comparison of distances obtained using the centers of different nodes. Each node in the SHOSLIF-O recognition tree constructs both an MEF and an MDF subspace for the samples contained in the node. Each node constructs a different set of subspaces based on the samples contained in that node. The test probes that came into a node were being compared in the local MDF or MEF subspace for that node. It was entirely possible that a test probe vector that was miles away from a particular node's subspace would project into that subspace very near to the node's center vector. This would cause poor recognition results, because test probes that were not at all similar to the node centers would erroneously be considered close. The Distance from Subspace (DFS) distance measure takes into account the distance from the projection space in addition to the distance of the projection to the node centers.
The DFS distance measure is given by

$d(X, A) = \sqrt{\|X - VV^tX\|^2 + \|VWW^tV^tX - VWW^tV^tA\|^2}$

where $X$ is the test probe, $A$ is the center vector, $V$ is the projection matrix to the MEF space, and $W$ is the projection matrix to the MDF space. So the product $Y = V^tX$ is the projection of the test probe $X$ onto the MEF subspace; $VY = VV^tX$ expands this MEF projection into the original image space. Likewise, $Z = W^tV^tX$ is the projection of $X$ onto the MDF subspace, and $VWZ = VWW^tV^tX$ expands this MDF projection back into the original image space.

Note that computationally, for $Z = W^tV^tX$, $A_Z = W^tV^tA$, and $VW = M$,

$\|MZ - MA_Z\|^2 = \|M(Z - A_Z)\|^2 = (Z - A_Z)^t M^t M (Z - A_Z).$

Thus only a small matrix multiplication need be performed to effect this distance measure. This $M^tM$ matrix can be pre-computed during the learning phase and stored in each node so that in the testing phase, the computational complexity of this operation is minimized.

Intuitively, what is being measured can be seen in Figure 4.6. The first term under the radical indicates the distance of the original vector from the population (i.e., the subspace). The second term indicates the distance in the subspace from the center vector of a class.

Figure 4.6: Distance from Subspace description for 3D. The subspace is shown as the plane, and is defined by the matrix V. Two classes are shown in this example, and the MDF vector that can separate them is shown.

The performance of the DFS measure is shown in Figure 4.7. Because we project back to the original image space in using this DFS distance measure, we are assured that all the distance comparisons are in the same frame, and can therefore be logically compared. This DFS measurement takes into account not only how far a test probe vector is from the node's center vector in the node's subspace, but also the distance that the test probe is from the node's subspace itself. The DFS distance measure gives us the good results shown in Figure 4.7, and it is this DFS distance measure that is used for the experiments reported here.

Figure 4.7: Performance of the DFS distance measure, given as a function of the number of nearest neighbors explored.

The consideration for maintaining a historical deviation

In the foregoing experiments, we performed an analysis of various distance measures. In addition to the different distance measures, an analysis of the significance of maintaining a historical deviation from the ultimate leaf node's ancestors was performed. When a bad distance measure was used, such as the normalized distance or the feature space distance alone, this history provided an attempt at overcoming the shortcomings of the distance measure. This history is not needed, however, when a proper distance measure that includes the distance from the projection subspace is utilized. Because the history does not prove useful when utilizing the distance from subspace distance measure, it is not used in the remaining experiments described in this work.
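A compact sketch of the DFS computation is given below. It is illustrative only: the matrices V and W here are random orthonormal stand-ins rather than learned MEF/MDF bases, and the dimensions are arbitrary. It follows the two-term form above and shows the M^tM pre-computation that lets the second term be evaluated with a small matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_mef, m_mdf = 100, 20, 5            # image, MEF, and MDF dimensionalities (illustrative)
V = np.linalg.qr(rng.normal(size=(d, m_mef)))[0]       # stand-in MEF basis (orthonormal columns)
W = np.linalg.qr(rng.normal(size=(m_mef, m_mdf)))[0]   # stand-in MDF basis in MEF coordinates

M = V @ W                                # maps MDF coordinates back to image space
MtM = M.T @ M                            # pre-computed once per node during learning

def dfs_distance(x, a):
    """Distance-from-subspace measure: distance of x from the MEF subspace
    plus the (image-space) distance between the MDF projections of x and a."""
    off_subspace = x - V @ (V.T @ x)             # component of x outside the MEF subspace
    z, a_z = W.T @ (V.T @ x), W.T @ (V.T @ a)    # MDF coordinates of probe and center
    diff = z - a_z
    return np.sqrt(off_subspace @ off_subspace + diff @ MtM @ diff)

x = rng.normal(size=d)                   # test probe (illustrative)
a = rng.normal(size=d)                   # node center (illustrative)
print(dfs_distance(x, a))
```

Because V and W here have orthonormal columns, the second term reduces to the plain Euclidean distance between the MDF coordinate vectors; in general the pre-computed M^tM matrix carries the weighting needed to express that distance in the original image space.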
4.2.2 Use of Multiple Competing Paths

As described in section 3.6, in order to find a nearest neighbor match in the recognition tree, it may be necessary to explore more than a single path of the tree. This constant number k of parallel paths to explore is a parameter that the system operator must determine. For the data sets used in this work, unless stated otherwise, we chose k = 10 based on the performance graph given in Figure 4.8. Based on the data shown in this graph, the performance of the system levels out at k = 10. We chose this value to try to achieve the best performance at the lowest computational cost.

Figure 4.8: A comparison of the performance for various k competing paths.

4.2.3 Expected Radius

Node N uses the expected radius r(l) to determine when it needs to create a new child to accommodate a training sample. This expected radius is a decreasing positive function based on the level of the tree. For this work, we have chosen $r(l) = 1.3^{36-l}$ for node N on level l of the Space-Tessellation Tree. This number was chosen because at the root node of the tree, i.e., at level l = 0, the expected radius is 12646.219. This radius is large enough to cover all of the samples in the data sets utilized in this work. The value 1.3 was used as a base over a more traditional value because it gave favorable results in the shape of the tree. A tree that is too wide and fat produces inefficiencies because many children must be explored at a high level of the tree; a tree that is too thin and tall produces inefficiencies because many projections must be done in order to find the nearest neighbor in the database.

4.3 Experimental Results

In order to verify the proper functionality of the system, we experimented with a large multifarious set of real-world objects found in natural scenes. In this section, we demonstrate the ability of the MDF space to tolerate within-class variations and to discount such imaging artifacts as lighting direction, and show how the tree structure provides the ability for view-based recognition.

4.3.1 Space comparison

Since the analysis showed that the MDF space should perform better than the MEF space (or the image space directly), a study was performed to demonstrate this fact. Furthermore, though the hierarchy described provides efficiency to both the learning and retrieval phases, the recognition performance may suffer somewhat from not examining all possibilities in the database. Therefore, a study was also performed to examine the performance difference between using a "flat" database, in which all images in the database are examined for possible matches, and the described hierarchical database. Finally, the recognition performance will be dependent on whether a single MDF projection is used, performing the space tessellation in this single space, or whether a new MDF projection is performed at each node of the tree. A study examining the performance between these two modes was also done.

The training images come from a set of real-world objects in natural settings. At least two training images from each of 36 object classes were provided; a disjoint set of test images was used in all of the tests. For each of the tests performed, the identical set of training images and test images were used. The training images used are given in Figure A.1.
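For reference in the experiments that follow, which all report a number of competing nodes explored at each level, the sketch below illustrates the two retrieval parameters just described: the k competing paths of section 4.2.2 and the expected radius of section 4.2.3. It is a simplified illustration only; the node objects, their attributes, and the distance callback are hypothetical, not the actual SHOSLIF-O data structures.

```python
import heapq

def expected_radius(level):
    """Expected radius r(l) = 1.3 ** (36 - l); about 12646 at the root (l = 0),
    large enough to cover all samples in the data sets used here."""
    return 1.3 ** (36 - level)

def k_path_search(root, probe, distance, k=10):
    """Descend the space-tessellation tree keeping the k closest nodes per level.
    `root` is assumed to expose .children (list) and .is_leaf; `distance(probe, node)`
    can be any of the measures of section 4.2.1 (DFS in the final system)."""
    frontier, leaves = [root], []
    while frontier:
        candidates = []
        for node in frontier:
            for child in node.children:
                candidates.append((distance(probe, child), child))
        best = heapq.nsmallest(k, candidates, key=lambda t: t[0])  # k best at this level
        leaves.extend(c for _, c in best if c.is_leaf)
        frontier = [c for _, c in best if not c.is_leaf]
    return sorted(leaves, key=lambda c: distance(probe, c))
```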
Pixel weighting was engaged, and the images provided for the training set show the results of the pixel weighting. For those tests where subspaces are used, the Distance From Subspace (DFS) distance metric was used, and fifteen nodes were explored at each level when using the tree structure.

Effects of the Feature Spaces and the Tree Hierarchy

The results of the studies are summarized in Table 4.3.

                                             Flat        Single subspace   Multiple subspaces
                                             database    with tree         with tree
Image space   Correct class is top choice     90%         86%               86%
              Correct class is in top 15     100%         90%               90%
MEF space     Correct class is top choice     90%         76%               86%
              Correct class is in top 15     100%         83%               90%
MDF space     Correct class is top choice     90%         89%               97%
              Correct class is in top 15     100%        100%              100%

Table 4.3: Results of subspace comparison study. For the "flat" databases, only a single level of the tree was created. For the single projection databases, the MEF/MDF projection matrices were only computed at the root node, and the same projection matrices were used throughout the tree. For the multiple projection trees, a new projection was made at each node of the tree.

The test probes and their top five responses for the multiple-projection experiment are given in Figure A.2. Note that a band of fifteen competing paths was explored, but only the top five responses are given in this figure.

As expected, the data in Table 4.3 shows the MEF subspace tracing the same sorts of responses that the original image space produces. This is expected because the MEF subspace is a good representation of the original image space, and can therefore be used as a compact means for storing images. The data also shows, however, that the MDF subspace can outperform the MEF subspace. This is also expected, since the MDF subspace utilizes more information supplied by the user in the form of the image labels.

The MDF performance for a flat database shows the same results as the MEF performance. This could be due to the fact that there was not enough within-class variation in the training samples to capture the variation present in each class. But it may be more likely that the classes in a flat database were not linearly separable. This is supported by the data, because when a single projection is done and a space tessellation tree is produced based on that single projection, the rate drops off a little; when multiple projections are done throughout the tree, the rate improves significantly. The data lends credence to the claim that the multiple projections in the tree do indeed reduce the problem to a smaller, more manageable size, and can therefore successfully separate the classes more easily as the processing moves down the tree. The MDF subspaces generated at the various nodes of the tree are adaptive in that they optimally separate the classes for the samples contained in the node in the sense of a linear transform. As the processing moves along the nodes of the tree, a different set of features tuned specifically for those samples contained in the node is utilized for subclass selection.

A Comparison Between the MEF and the MDF feature spaces

A further experiment was performed to demonstrate the relatively small number of MDF vectors required to produce good recognition results versus the number of MEF feature vectors. In this experiment, we compare the performance of the system using the MEF space alone versus using the MDF space.
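As background for this comparison, the sketch below shows one standard way to compute the two kinds of features from a labeled training set: MEF vectors as the principal components of the total scatter, and MDF vectors from a within-class versus between-class scatter eigenproblem solved in the MEF coordinates. This is a generic illustration, not the exact DKL procedure of chapter 3; the dimensions, the random data, and the 95% variance cutoff are only examples.

```python
import numpy as np

def mef_basis(X, var_keep=0.95):
    """MEF (principal component) basis retaining var_keep of the total
    variance. X is (n_samples, n_pixels); columns of the result are MEFs."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(cum, var_keep)) + 1
    return Vt[:k].T

def mdf_basis(Y, labels, n_mdf):
    """MDF (discriminant) basis in MEF coordinates Y (n_samples, k): directions
    that maximize between-class scatter relative to within-class scatter."""
    mean_all = Y.mean(axis=0)
    dim = Y.shape[1]
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)
        Sb += len(Yc) * np.outer(mc - mean_all, mc - mean_all)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_mdf]].real

# Illustrative use on random data standing in for vectorized fovea images.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5632))          # 60 images of 88 x 64 = 5632 pixels
labels = np.repeat(np.arange(30), 2)     # two images per class (illustrative)
V = mef_basis(X)                         # MEF projection matrix
W = mdf_basis((X - X.mean(axis=0)) @ V, labels, n_mdf=5)
```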
The number of features from each subspace was varied to show how the system performed using up to 95% of the variance for the training images used. Figure 4.9 shows the response rates for the system using these two subspaces.

The images for this experiment came from the Weizmann Institute and contained well-framed faces with different expressions taken under different lighting conditions. Each individual in the set of images used had two expressions; each expression image was taken with three different lighting conditions. This set of images seemed particularly suited to testing the claim that the MDF feature space could outperform the MEF feature space because of its ability to discount expression changes and imaging artifacts such as lighting direction. The major difference between the MEF and the MDF performances is due to the fact that for this experiment, we had enough training images with sufficient within-class variation. With fewer training images, or training images that do not sufficiently capture the desired variations, the performance of the MEF and the MDF spaces will come closer together. An example of the pool of available images for a sample of individuals is given in Figure 4.10. The training for this experiment utilized a flat database to demonstrate the superiority of the MDF subspace over the MEF subspace without the (possibly) confounding factor of the tree.

Figure 4.9: The performance of the system for different numbers of MEF/MDF features. The number of features from the subspace used was varied to show how the MDF subspace outperforms the MEF subspace. 95% of the variance for the MDF subspace was attained when fifteen features were used; 95% of the variance for the MEF subspace did not occur until 37 features were used. Using 95% of the MEF variance resulted in an 89% recognition rate, and that rate was not improved using more features.

For this experiment, a disjoint test set was utilized. This test set was formed by randomly choosing one image from the set of images available for each individual. Therefore, each individual was trained with all of the different lighting conditions for one expression, but not for the expression image in the test set. The results in Figure 4.9 show that the MDF subspace performs better than the MEF subspace. With eight MDF feature vectors being used, the system retrieved the correct result 100% of the time; the MEF performance when eight features were used retrieved the correct result in only 71% of the cases. At fourteen feature vectors, the MDF feature space retrieval rate remained at 100%, and the MEF retrieval rate was at the 82% level.

Figure 4.10: A sample of the Weizmann Institute face data. The frontal images for an individual contain two expressions; for each expression, a set of three images with differing lighting conditions forms the set of images available for an individual.
The MEF retrieval rate went up to 89% when 24 feature vectors were active, and did not improve when more features were used.

4.3.2 The Clustering Effect of the MDF subspace using the DKL projection

To show how the MDF subspace effectively discounts factors unrelated to classification, Figure 3.12 shows samples of the data in the space of the best (in terms of the largest eigenvalues) two MEFs and MDFs for the experiment summarized in Table 4.6 for the root node of the tree. From Figure 3.12, it is clear that the MDF subspace has a significantly higher capability than the MEF subspace to correctly classify an input image that is considerably different from those used in training, because classes in the MDF subspace have larger between-class distance and smaller within-class scatter.

The results shown in Table 4.4 demonstrate a better result using the MDF subspace than the MEF subspace. For this experiment, a total of 38 images were used. These images came from 11 face classes (i.e., individuals) taken from the Michigan State University Pattern Recognition and Image Processing face database, and eight non-face objects. There was a significant amount of contrast change in the face images. The non-face images came from outdoor scenes.

                  Training sample size   Test sample size    Correct classification   Correct classification
                                                             using MDF                using MEF
Faces             22 (2 samples each     11 (one from        100%                     91%
                  from 11 classes)       each class)
Other objects     16 (2 samples each     8 (one from         100%                     88%
                  from 8 classes)        each class)

Table 4.4: Results showing the superiority of the MDF subspace for subclass selection.

The power for handling within-class variation using the MDF subspace

The capability of the system for handling large within-class variation is demonstrated in Figure 4.11. Each of these search probes retrieved samples from the correct class defined by the training exemplars. This example supports the claim that those features unimportant for subclass selection are weighted appropriately by the Most Discriminating Feature selection process. For example, the change in the shape of the mouth among the different expressions is unimportant for determining to which subclass these images belong. As long as the variation in a class is present in the training set for MDF feature computation, the approach performs quite well.

Figure 4.11: Example of how well within-class variation is handled. All search images were correctly classified into the class defined by the training images.

4.3.3 Face database

In order to compare with the results of others in the community utilizing eigenfeature methods, a test was performed on a database comprised of only faces. In this experiment, when an image that was used for training was also used as a test probe, such as is done with Photobook [123] [122] [113] (i.e., the re-substitution method), the SHOSLIF always retrieved the correct image as its first choice 100% of the time. In re-substitution, we have an exact match in the database, so the first choice will always retrieve the correct image. The second best match from the database using this re-substitution method also came from the correct class for 98% of the test probes in a database of 1042 face images. Table 4.5 shows a summary of the results obtained both by re-substituting the training samples as test probes and by using a disjoint set of images for testing.
Number of training images                       1042 of 384 individuals
Number of nodes in generated tree               1761
Number of nodes expanded at each level          10
Average number of nodes explored per probe      102
Re-substitution:
  Correct retrieval is top choice               100.0%
  Correct retrieval is second choice            98%
Disjoint test set:
  Number of test images                         246 of 246 individuals
  Correct retrieval is top choice               95.5%
  Correct retrieval is in top 10                97.6%

Table 4.5: Summary of experiment on a face database of 1042 images (384 individuals). For re-substitution, each training image was given as a test probe. For the disjoint test set, a list of 246 images not found in the training set was used for testing.

The 98% correct match for the second response in the re-substitution method is comparable to Photobook's response rates for face images alone [113]. The images in the face database are shown in Figure B.1. The face database was organized by individual; each individual had a pool of images from which to draw training and test data sets. Each individual had at least two images for training with a change of expression. The images of 38 individuals (182 images) came from the Michigan State University Pattern Recognition and Image Processing laboratory. Images of individuals in this set were taken under uncontrolled conditions, over several days, and under different lighting conditions. 303 classes (654 images) came from George Mason University under an Army Research Labs contract. All of these classes had at least two images of an individual taken under controlled lighting, with a change of expression. 24 of these classes had additional images taken of the subjects on a different day with very poor contrast. Sixteen classes (144 images) came from the MIT Media Lab under identical lighting conditions (ambient laboratory light). Twenty-nine classes (174 images) came from the Weizmann Institute, and are images with three very controlled lighting conditions for each of two different expressions.

4.3.4 Combination database: Faces and Other Objects

To show the general applicability of the method, we have trained the system on a diverse set of objects from natural scenes, ranging from human faces to street signs to aerial photographs. The database is a combination of those images shown in Figures A.1 and B.1. A list of some examples from the various classes learned is given in Figure 4.12. Most classes in the database were represented by two images, and 13% of the classes had three or more images, up to twelve for some objects (e.g., fire hydrant). Each image consisted of a well-framed object of interest. The different images from each class were taken either in a different setting or from a different angle; where possible a change in the lighting arrangement was used to provide variation in the training images.

Figure 4.12: Representatives of training images from some of the various classes learned (car, csign, chair, mouse, cup, dog, handle, egg, isign, monitor, phone, renault, seated, sharpener, sidewalk, dbody). Fovea images are shown without the pixel weighting; this pixel weighting will tend to suppress the background. Objects of interest are at the center of the fovea images. In the learning phase for this experiment, the fovea was generated using manual extraction of the areas of interest.

Following training, the system was tested using a test set completely disjoint from the training set of images. A summary of the makeup of the test and the results are shown in Table 4.6.

Table 4.6: Summary of large natural scene experiment. The training images were drawn at random from the pool of available images, with the remaining images serving as a disjoint set of test images.

The instances where the retrieval failed were due in large part to significant differences in object shape and three-dimensional (3D) rotation. Figure 4.13 shows an example of a face image which the system failed to properly retrieve. This search probe failed because the training data for this class did not include any 3D object rotation and the search probe did. In fact, when the test image was swapped with one of the training images, a correct retrieval was attained. The recognition results for this system rely heavily on the assumption that the training images are representative of images which will be seen during the recognition phase. Since the SHOSLIF network is open-ended, it allows the system to continuously learn and improve while the low logarithmic complexity makes learning a huge number of samples practical.

Figure 4.13: Example of a failed search probe ((a) search probe; (b) training images). The retrieval failed to select the appropriate class due to a lack of 3D rotation in the set of training images.

4.3.5 Handling 2D variation

Since the attention mechanism described in [159] provides a well-framed object of interest in a fovea image, the problem of two-dimensional (2D) transformation computation becomes less significant. However, because the attention point is only a rough estimate of the position, scale, and orientation of the object of interest, a better estimate of the 2D transformation to align the test probe with the appropriate training image needs to be done. As described in section 3.2, this can be accomplished by greatly increased image acquisition for points surrounding the fixation point from the visual attention mechanism. The increase in the image acquisition is expensive in terms of time and storage space, however, and could instead be accomplished by extending the training set for a grid of points surrounding the extracted attention image in terms of the position, scale, and 2D orientation foveation parameters.

Each fovea image comes from an attention mechanism and contains a fixation point and a scale that the attention mechanism provides to the recognition module. This fixation point can be overlaid with a grid in both the position and the scale parameters, and a new fovea image extracted from the original image at each of these grid points. Thus the variation due to size and position can be handled by extracting a set of fovea images to add to the Space-Tessellation tree for each grid point surrounding the attention point and scale instead of just extracting a single image. Figure 4.14 demonstrates the variability that the system can handle by extending the training set in this manner. When the training set is thus extended, the Space-Tessellation tree grows in size, but not by the factor of the increased number of samples.

Figure 4.14: Generalization for size and position of objects. The search probes were synthetically generated from images in a disjoint test set. Each search probe retrieved an image from the appropriate class. The images shown in the first line were the original images. The scaling shown is 50% of the fovea size (both positively and negatively); the position change shows a 12% negative change in the fixation point for both the row and the column coordinates.
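The grid-based training-set expansion just described can be sketched as follows. This is only an illustration of the idea: the crop helper, the grid spacings, and the image sizes are assumptions, not the parameters of the actual attention mechanism. For each original fovea, additional foveas are extracted on a grid of positions and scales around the attention point.

```python
import numpy as np

def expand_training_set(image, fix_row, fix_col, fovea_size,
                        pos_offsets=(-0.12, 0.0, 0.12),      # fraction of fovea size
                        scale_factors=(0.85, 1.0, 1.15)):    # 15% scale steps
    """Extract one fovea per (position, scale) grid point around the attention
    point; P = len(pos_offsets)**2 and S = len(scale_factors)."""
    foveas = []
    for s in scale_factors:
        half = int(round(fovea_size * s / 2))
        for dr in pos_offsets:
            for dc in pos_offsets:
                r = int(round(fix_row + dr * fovea_size))
                c = int(round(fix_col + dc * fovea_size))
                patch = image[max(r - half, 0):r + half, max(c - half, 0):c + half]
                foveas.append(patch)      # in practice, each patch is rescaled to 88 x 64
    return foveas

# Illustrative use: a 3 x 3 position grid and 3 scales give 27 foveas per image.
img = np.zeros((480, 512))
crops = expand_training_set(img, fix_row=240, fix_col=256, fovea_size=100)
print(len(crops))   # 27
```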
If we assume that the search tree provides O(log s) levels for s training samples, then the tree has $k \log s$ levels for some constant $k$. Now if the number of training samples is expanded by P to handle the positional variation, and by S to handle the scale variation, then we have a total of PSs samples to put into a tree. Then there will be

$k \log(PSs) = k(\log P + \log S + \log s)$   (4.1)
$= k' \log s$   (4.2)

levels in the Space-Tessellation Tree, for some constant $k'$, which is still O(log s) levels in the tree. Typical values for P will be P = 9 or 25 for a 3 x 3 or a 5 x 5 grid surrounding the attention point, respectively; S will typically be S = 3 or 5 to deal with the various scales.

Although the number of levels in the tree remains O(log n), the number of nodes in the tree will increase nearly linearly with the extension of the training set. For example, the number of nodes in a complete binary tree with n leaf nodes is 2n - 1 [40], or roughly n internal nodes. Our trees will have fewer total nodes because the fan-out at each node is not restricted to a binary fan-out, but the order will still be linear. We have been using 1316 training images for our large combination data set, each with dimension 88 x 64 = 5632 pixels. Expanding this to P = 3 x 3 = 9 position grid points and S = 3 scale grid points gives us n = 35,532 training images to use. If we retain only k MDF vectors at each node, then in order to project to the MDF space, we need to retain k + 1 vectors of dimension 5632, one for each MDF vector and one mean vector, each of these a floating point vector (i.e., 4 bytes per entry). Since only the internal nodes require projection matrices, we would need to store roughly 35,532 nodes capable of projecting test probes to the local MDF space. Using k = 5 MDF vectors, we would require roughly 4.8 GB of storage, which we do not have in one place in Michigan State University's Pattern Recognition and Image Processing laboratory at the time of this writing.

Therefore, in order to test this generalization to size and position that the SHOSLIF paradigm can produce, we have tested the system on two different data sets. The first data set is the full combination data set described in section 4.3.4 with more than 1300 original training images. But the positional variation was not included in the test due to the space considerations described above. Instead, we ran the experiment using P = 1 (i.e., on the attention mechanism's attention point alone) and S = 3 (i.e., 3 grid points surrounding the attention mechanism's scale). For training, we set the scale span to be 30% of the fovea size. That is, each grid point represented a 15% change in the fovea size: one training image at a 15% smaller scale than the attention mechanism dictated, one at the attention mechanism's specified scale, and one at 15% larger than the attention mechanism's specified scale. For testing purposes, we took the disjoint test set described in 4.3.4 and generated a set of test images from this set with a random scale change in the range of [-20, +20]% of the fovea size. The characterization and results of this test are shown in Table 4.7. The table shows the difference between the performance when the scaling was built into the training phase and when it was not.
As can be seen from the data, for the test probes, either the attention supplier must correspond to the scaling done in the training phase, or the training set must be expanded to include those images in the range of scales that need to be properly retrieved.

Number of training images                   1316 (expanded to 3948)
Positional enhancement                      0% of fovea size
Scale enhancement                           15% of fovea size
P                                           1
S                                           3 (3 grid points)
Number of image probes                      461
Number of nodes in generated tree           6324
Number of nodes expanded at each level      ≤ 15
Tree built with no scaling in training phase:
  Top choice correct                        47.9%
  In top 15                                 64.6%
Tree built with 15% scaling in training phase:
  Top choice correct                        93.3%
  In top 15                                 98.7%

Table 4.7: Summary of the scale generalization experiment on the large combination face and other object data set. A disjoint test set was used for testing the retrieval capability; each test probe was randomly scaled in the range of [-20, +20]% of the fovea size to test the ability of the system to generalize over various scales.

Figure 4.15 shows example test probes and the images retrieved when scaling was enabled for the training phase and when it was not built into the tree. The data contained in the tree structure for this experiment required roughly 430 MB of storage. Utilizing the system on the large combination data set shows the ability of the system both to operate on large image database sizes and to handle a specified variation in the scale of the extracted area of interest for an attention point and scale.

Figure 4.15: Example probes and their retrieved images when scaling was enabled on the tree and when scaling was disabled on the tree during the training phase. (a) Sample test probe and retrieved images from the database when scaling was disabled during the training phase. (b) Sample test probe and retrieved images from the database when scaling was enabled during the training phase.

To show the positional extension of the training data, a smaller original training set was used. For the second experiment in the demonstration of the system's ability to handle 2D parameter changes, the data set described in section 4.3.1 was used. Table 4.8 shows the data pursuant to this experiment.

Number of training images                   86 (expanded to 2322)
Positional enhancement                      20% of fovea size
Scale enhancement                           20% of fovea size
P                                           9 (3 x 3 grid)
S                                           3 (3 grid points)
Number of image probes                      29
Number of nodes in generated tree           4634
Number of nodes expanded at each level      ≤ 10
Top choice correct                          93.1%
In top 10                                   96.6%

Table 4.8: Summary of the learning-based parameter generalization experiment. A disjoint test set was used for testing the retrieval capability.

When no training set expansion was performed, a total of 222 nodes were generated from the 86 training images. We know from Algorithm 1 that there is a leaf node for every training sample. Therefore, we know that there are 136 internal nodes in the tree. If we assume that we have a complete k-ary tree, we know that the number of internal nodes is $\frac{k^h - 1}{k - 1}$ for tree height h [40]. If we know that there are exactly s leaf nodes for s training samples, we have that the number of internal nodes is $I = \frac{s - 1}{k - 1}$. So we have, for s = 86 and I = 136,

$I = \frac{s - 1}{k - 1} \quad\Rightarrow\quad k = \frac{s - 1}{I} + 1 = \frac{85}{136} + 1 = 1.625.$

Using this base, we would expect that the expanded training set of 2322 images would produce a tree with $\frac{2322 - 1}{1.625 - 1} \approx 3714$ internal nodes.
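This estimate can be checked with a few lines of arithmetic; the sketch below simply reproduces the calculation above and assumes, as the text does, a complete k-ary tree with one leaf per training sample.

```python
def effective_branching(n_leaves, n_internal):
    """Solve I = (s - 1) / (k - 1) for k, given s leaves and I internal nodes."""
    return (n_leaves - 1) / n_internal + 1

def estimated_internal_nodes(n_leaves, k):
    """Expected internal node count (s - 1) / (k - 1) for a complete k-ary tree."""
    return (n_leaves - 1) / (k - 1)

k = effective_branching(86, 136)                 # 85/136 + 1 = 1.625
print(k, estimated_internal_nodes(2322, k))      # 1.625, about 3714
```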
The data in Table 4.8 shows that 4634 nodes were generated from the expanded set of 2322 training images in the tree, or 2312 internal nodes, even less than our estimate. Since the tree with the expanded training set was built separately from the original tree, a slightly different tree shape is expected.

4.3.6 Handling different 3D views

We want to determine whether the system can handle variation in 3D orientation. For this experiment, we used the Weizmann Institute face database. This database was well-suited to test the handling of 3D rotation because for each of 29 individuals, five camera viewpoints were available under identical lighting conditions. A sample of the available images is shown in Figure 4.16. When the training set contains a representative set of views of the objects under consideration, the system is able to successfully find objects from a novel view, as shown in Figure 4.17. The image pool for a particular individual consisted of images taken from five viewpoints under identical lighting and expression conditions. A total of four views from each individual were used for training, with the remaining view left out to form a disjoint test set. We used a disjoint test set for determining the accuracy of the learning-based view generalization. The results of this experiment are summarized in Table 4.9.

Figure 4.16: A sample of the Weizmann Institute face data. Each individual for this experiment contained five viewpoints under identical lighting conditions.

Figure 4.17: View-based generalization. When sufficient training images are given for a particular class, the system is able to accurately retrieve images from the correct class.

Number of training images                       112
Number of image probes                          28
Number of internal nodes in generated tree      177
Number of nodes expanded at each level          ≤ 10
Average number of nodes explored per probe      28.5
Correct class is top choice                     78.9%
Correct class is in top 10 choices              89.4%

Table 4.9: Summary of the learning-based view generalization experiment. A disjoint test set was used for testing the retrieval capability.

Though Table 4.9 shows favorable results, 100% accuracy was not achieved. The failures occurred where the test probe viewing angle did not fall between two training sample viewing angles, as shown in Figure 4.18. Because of the open-ended nature of the SHOSLIF system, however, mistakes such as these can be identified by a human operator, and the system can incorporate this new knowledge. The system is able to easily and efficiently incorporate new knowledge because of the logarithmic complexity of adding new samples to an existing tree, provided by the incremental update mechanism.

Figure 4.18: Example failure in the learning-based view generalization experiment. The failures occurred only when the viewing angle of the test probe did not fall between the viewing angles of two training images. These images are courtesy of the Weizmann Institute.

4.3.7 Neighborhood information

In the experimentation described so far, much image information is thrown away in the form of a pixel's location and its neighborhood. The distances between pixels are unused, since each pixel is treated as another dimension in the high-dimensional space when the image is viewed as a vector. Two approaches could be considered to utilize the neighborhood information: multi-level resolution and blurring. Multi-level resolution involves a stack of images, each level of which is a subsample of pixels at the next lower level, with the lowest level being the original image, as shown in Figure 4.19.

Figure 4.19: A stack of multi-level resolution images. Each level is a subsample of pixels from a group in the next lower level.

The problem with this multi-level resolution stack is that only the neighborhood within the defined group is utilized. Neighboring pixels which are outside the group are not utilized for higher-level computations.

An alternative to this multi-level resolution stack is the use of blurring to maintain neighborhood information. With blurring, every level of the image stack can consist of the same number of pixels. A higher level in the image stack has more blurring; it is created by replacing every pixel in the image with a linear combination of its neighborhood pixels from the lower level, as shown in Figure 4.20. So in some ways, blurring is superior to the multi-resolution image stack because neighborhood information is not discarded at any level.

Figure 4.20: A stack of blurred images. The pixels in each level correspond to a linear combination of neighborhood pixels at the next lower level.

At levels near the root of the tree, a coarser classification is being done, and the level of detail necessary for the leaf node classifications is not necessary. We experimented using a stack of blurred images to traverse the Space-Tessellation tree in both the learning and the recognition phases. At levels of the tree near the root, we used the most blurred images in order to provide the most global information to this coarse classification. As we move away from the root node, we used images with less and less blurring applied to them to provide more details for the lower levels of the tree. An example of the blurring applied to a face image and a non-face image at the various levels is shown in Figure 4.21.

Figure 4.21: Examples of blurring applied to different images. At levels of the tree nearer the root, more blurring is applied to both the training and the test images. As the processing moves away from the root node of the tree, less and less blurring is applied to provide more detail for the finer classifications. Shown here are examples of images with progressive blurring applied to them.

Using this approach on the large data set described in section 4.3.4 gives the results shown in Table 4.10. These results, though not as good in performance as those achieved without blurring at all, show that a smaller tree was generated. Furthermore, at each node of the tree, a smaller number of MEF and MDF vectors were required to produce 95% of the variance for the samples contained in the node.

Table 4.10: Summary of blurring experiment.
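A minimal sketch of the blurred image stack described above is given below. It is illustrative only; the 3 x 3 averaging weights and the number of levels are assumptions, not the values used in the experiments. Every level keeps the full pixel count, and each level is produced from the one below it by replacing every pixel with a linear combination of its neighbors.

```python
import numpy as np

def blur_once(img):
    """Replace each pixel by the average of its 3x3 neighborhood (edges are
    handled by padding); this is one linear-combination step."""
    padded = np.pad(img, 1, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out += padded[1 + dr:1 + dr + img.shape[0], 1 + dc:1 + dc + img.shape[1]]
    return out / 9.0

def blurred_stack(img, n_levels=5):
    """Level 0 is the original image; each higher level is a further-blurred
    copy used nearer the root of the space-tessellation tree."""
    stack = [img.astype(float)]
    for _ in range(n_levels - 1):
        stack.append(blur_once(stack[-1]))
    return stack

fovea = np.random.default_rng(2).normal(size=(88, 64))   # stand-in fovea image
levels = blurred_stack(fovea)
print(len(levels), levels[-1].shape)    # 5 (88, 64): same size at every level
```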
For the first ten MDF nodes created (i.e., those closest to the root of the tree, with the most variation) without blurring, an average of 19.4 MDF vectors was produced at each node. For the first ten MDF nodes created when blurring was engaged for just five levels of the tree, an average of 12.6 MDF vectors was produced. The reason for the reduced performance is the fact that in some cases, the class separation was done before five levels of the tree were formed. In that case, blurred images were used for projection into the subspaces, and the distances were based on the blurred images rather than the original resolution ones.

4.3.8 Timing

The hierarchical organization of the database inherent in the SHOSLIF-O paradigm provides the means for efficient retrieval of images from the database. The tree structure provides O(log n) access time for n samples in the database. To demonstrate this on some real data, we provide the timing results shown in Table 4.11. The timing data for the tree version shows that the average time required for a test probe to obtain retrieval results is linear in the number of competing nodes. The data in this table is plotted in Figure 4.22. The figures compare the average time required for a single test probe in a "flat" database (i.e., one with no hierarchy; all the samples are in a single pool) with the time required using the same samples in the SHOSLIF hierarchy.

Average timing data per test probe

Tree mode:
  Number of competing nodes    Average seconds required per probe
  2                            2.1865772 sec
  3                            2.8681208 sec
  4                            3.5721477 sec
  5                            4.2000000 sec
  6                            4.8077181 sec
  7                            5.3697987 sec
  8                            6.0080537 sec
  9                            6.5325503 sec
  10                           7.1597315 sec
  11                           7.5885906 sec
  12                           8.0248322 sec
  13                           8.3714765 sec
  14                           8.7355705 sec
  15                           9.1271812 sec

Non-tree mode (flat database):  400.7245 sec

Table 4.11: Average time required to perform a single probe in Tree mode and Non-tree mode. In contrast to the rapid retrieval of 9.1 seconds for images from the tree version using fifteen competing paths, when a "flat" database is used, i.e., one in which no hierarchy is utilized, a test probe into the same data requires an average of 400.7245 seconds to complete.

The training and test images for both organizational schemes came from the large combination database described in section 4.3.4. The tree structure can have multiple competing paths, whereas the flat database structure will examine the same number of possibilities in all cases. So the tree database runs show the difference in timing over the different numbers of competing paths. Recognition results from the tree structure using these different numbers of competing paths are shown in Figure 4.8. For the flat database, the retrieval rate was as shown in Table 4.12.

Figure 4.22: Timing results obtained from running the system in the tree mode. This data shows the linear relationship between the time required to complete a test probe retrieval and the number of competing paths. This linear relationship demonstrates how the k competing paths explore only a band of the tree.
Number of training images              1316 from 526 classes
Number of test images                  298 from 298 classes
Correct retrieval is top choice        88.9%
Correct retrieval is in top 10         96.2%

Table 4.12: Recognition results using the flat database for the combination database.

So the hierarchical structure in the SHOSLIF paradigm not only provides a means for improving the correct retrieval rate from a database of objects, but it also provides the low computational complexity to allow a large database of objects to be used as a query database.

4.4 Summary

In this chapter we describe our experimentation using the SHOSLIF paradigm applied to object recognition (SHOSLIF-O). We describe two basic modes of operation: the manually-defined set of hierarchical labels for characterizing the tree, and the automatic space tessellation mode, where only a single (lowest-level) label is given to a training sample image.

In the manually defined set of hierarchical labels, the operator supplies a hierarchical set of labels to define the structure of the tree (and what samples go into which nodes). The MDF computation is done using the grandchildren of a particular node as sample points. The results show that allowing the system to determine its own tree structure is a better approach.

In the automatic space tessellation mode of operation, there are several options available to the user. We investigated several distance measures and provide retrieval results obtained using these different measures. We have found that the Distance from Subspace (DFS) measure is theoretically and practically the measure that we should use. In this same vein, we also experimented with maintaining a historical deviation from the node centers in the path traversed to the leaf nodes of the tree. We discovered that when a proper distance measure was utilized, this historical deviation was not needed.

In our experimental results, we first compare the recognition results obtained using a "flat" database (i.e., one with no hierarchical structure imposed), a tree structure in which a single projection was done and the space tessellation obtained using this single projection, and the case where multiple projections were done in the tree, one for each node. The data supports our claim that the multiple projection hierarchy can divide the problem into smaller, solvable pieces and can therefore improve the recognition rate. We also compared the performance of the image space directly and the MEF and MDF subspaces under these different modes of database organization. Our data obtained for these tests also supports our claim that the MDF space can provide better results than the MEF space does. We show the clustering effect that the MDF subspace has, and demonstrate the power that the DKL projection that produces the MDF space has for handling within-class variation in a particular class.

In order to compare with the results of others in the community utilizing eigenfeature methods, we performed a test on a set of data comprising only faces. We produced results consonant with those obtained with Photobook [123] (100% same-image retrieval; 98% accuracy in retrieving the correct class as the second image) with the re-substitution paradigm (as Photobook uses), and produced results nearly as good using a disjoint test set (96% accuracy as the top choice; 98% for the correct choice being in the top 10 responses).
To show the utility of the system on a wide range of objects, we tested the system on a combination database, comprised of both faces and non-face objects taken from natural scenes. These objects ranged from human faces to street signs to dogs to aerial photographs. The recognition results obtained using this database were very good: 95% correct top choice response; 99% correct class in the top 10 responses.

In order to handle 2D variations, such as size and position, we experimented with simulating more image acquisition by extending the training set. We impose a grid over the fixation point and scale and extract multiple fovea images from the original image. We show results that demonstrate the correct retrieval of images from different positions and scales. We demonstrate how the system is able to successfully find objects from a novel 3D view, provided that the training set of images contains a representative set of views of the objects under consideration.

We incorporated neighborhood information by blurring the images at higher levels of the tree, producing less and less blurring as we moved down the tree. Though this did not improve our recognition rate (in fact, it actually brought it down somewhat), it did make for a more efficient tree and a faster retrieval. And finally, we provided some timing data for the system to demonstrate the order of magnitude difference between utilizing the hierarchical structure and using a simple flat database.

Chapter 5 Summary and Future Research

This chapter summarizes the results of the research described in this document, and provides an outline for future research directions.

5.1 Summary

We have developed an object recognition system that performs automatic feature selection and extraction and utilizes a hierarchical database organization to provide efficient retrieval of images. The system generalizes an image training set to handle size and position variations, handles a variety of objects from natural scenes, and identifies a visual attention point in an image for further exploration.

The automatic hierarchical discriminant analysis method used in this work recursively decomposes a huge, high-dimensional nonlinear problem into smaller, simpler, tractable problems. We study the Most Expressive Features, or principal component features, for class characterization. These features are demonstrated by other researchers and this study to be good at efficiently describing a class and can be used for image reconstruction. We discuss the Most Discriminating Features, or discriminant analysis features, for subclass selection, and demonstrate their superiority over the Most Expressive Features for subclass selection. We discuss the new DKL projection for dealing with discriminant analysis in high dimensions. The MDF subspace is an effective way of automatically selecting features while discounting unrelated factors present in the training data, such as illumination variation, face expressions, etc.

The hierarchical Quasi-Voronoi tessellation provides a means for dividing the space into smaller pieces for further analysis. This allows the DKL projection to produce better local features for more efficient and tractable subclass separation. We discuss problems with the Space-Tessellation tree, such as the Adoption Problem, and describe effective ways to circumvent these problems before they arise.
The Space-Tessellation Tree introduced in this work provides an average time complexity of O(log n) for a search for the best match in a model database of n objects. This low complexity opens the door towards learning a huge database of objects. Our experiments with a large number of classes of objects are made possible by the efficient Space-Tessellation structure. In order to maintain a given retrieval rate, however, it may be necessary to increase the number of competing paths k if the number of images increases. It is not clear whether this k must be a function of n to keep a given retrieval rate; this issue should be investigated further.

We have demonstrated the described system for a manually-defined set of hierarchical labels associated with each search probe. We discuss the parameters pursuant to the automatic space tessellation using the techniques of chapter 3. We discovered that the use of the "Distance from Subspace" metric produced not only theoretically but experimentally good results. We experimented with various numbers of competing paths to determine the benefits of maintaining multiple paths through the tree juxtaposed with the computational expense of pursuing them, and arrived at the conclusion that ten competing paths produced the best recognition performance.

The experimentation compares using the different spaces, i.e., the image space itself, the MEF subspace, and the MDF subspace. A comparison between a flat and a hierarchical database organization was made to show how the hierarchical decomposition of the problems produces better results than can be achieved using a single level. We compare our method to the results of others in the community who are using eigenfeature methods, such as Photobook [123], and show a comparable result to their work. To show the general applicability of the method, we trained the system on a diverse set of objects from mostly outdoor natural scenes. Using a disjoint test set, we found that the best response from the system was from the correct class in 95% of the cases, and was in the top ten for 99.0% of the 300 test cases.

We have demonstrated how the MDF subspace can effectively discount those factors unrelated to subclass selection, significantly better than using the principal component features alone. Furthermore, although the MDF subspace does well in dealing with large variation within a particular class, the Space-Tessellation Tree provides a mechanism for dealing with the problem of many classes in the database by reducing the complexity of a nonlinear class separation problem into smaller, manageable problems that are linearly separable.
The system described has several inherent strengths and limitations, which are summarized in Figure 5.1. Although the system might be extended to other sensing modalities, the current system is view-based: in order to retrieve a correct image from an image database, the SHOSLIF requires that the system has been trained on an image taken at a very similar viewpoint (i.e., within a few degrees), as is demonstrated by Figure 4.18. This limitation is typical of approaches that are based on a pixel-level representation. However, the generality of the representation, effected by selecting the features directly from the intensity pixel values, is a fundamental strength of this approach. The only representation used is the digital image. This representation itself does not incorporate knowledge, which may be seen by some as a weakness. The incorporation of knowledge in the representation scheme, however, is also a limiting factor. Typically knowledge that is incorporated into the system design is not generally applicable. The resulting system tends to be brittle under general settings. The view-based method can deal with intensity images directly, rather than using other, limiting modalities such as range scanners. This requires many views for nontrivial problems, however. The generality of the real-world problems that can be handled by this system is another of its major strengths; the variation handling, however, has not been tested extensively. The system requires well-framed images with non-occluded objects. Occlusion and segmentation must still be investigated. And although the system does not cover high-level reasoning, the hierarchical structure provides for both a logarithmic computational complexity and a problem decomposition made possible by the tree structure. Object recognition systems typically have a reject option to indicate that an object was not retrieved from the database. The SHOSLIF-O could learn a reject distance threshold value to provide this functionality. In this work, however, all of the top k candidates were provided to the user.

Figure 5.1: Major strengths and limitations of the described system.
  - Strength: The generality of the representation obtainable by selecting features directly from intensity image pixel values.
    Limitation: The only imposed representation is the digital image.
  - Strength: The exclusion of specific domain knowledge in the system design and implementation phases; the capability of the system to handle multifarious variations without system modification.
    Limitation: Variations must be covered in the training samples.
  - Strength: The method is view-based to deal with images directly rather than using other sensing modalities of limited applicability (such as range scanners).
    Limitation: The system requires many views as training input for nontrivial problems.
  - Strength: The generality of real-world problems that can be handled by the system.
    Limitation: The system requires well-framed images.
  - Strength: The logarithmic complexity and the class-separation problem decomposition made possible by the tree structure.
    Limitation: Occlusion has not been investigated, but will not be handled well without additional attention mechanisms.
  - Strength: Labeled and unlabeled samples (i.e., supervised and unsupervised learning) can be incorporated into the same tree.
    Limitation: The system does not cover high-level reasoning at the current stage.

5.2 Future research

There are a number of issues that should be researched further.
The algorithm for determining the number of features to use in both the MEF and the MDF subspaces should be analyzed further. In our system, the operator selects a percentage P of variance to discard. Typically P = 5%, keeping a large portion of the variance in the sample set while providing a substantial reduction in the number of features used. In our experiments, retaining 95% of the variance allowed us to retain approximately half of the features on the average. However, this is simply a parameter chosen by the operator as something that works well. An investigation into the optimum number of features to retain at each node should be started.

Because of the computational complexity involved in computing the eigensystem decomposition for a large set of training images, our system takes a random sample from the training set to use to compute the eigenvalues and eigenvectors; the full training set is projected into the MEF and the MDF subspaces for the tree construction. This random sample of training images may or may not produce a good selection of features to use in the decomposition of the spaces. Because of the hierarchical nature of the system, even if a bad set of features is used at one level, as the training set is partitioned to different parts of the tree, the system will be able to find better features for classification and class characterization. However, this can lead to an inefficient tree structure, and more work must be done by the system to produce the kinds of results that we reported. We should study alternative ways for finding the optimum set of images from the training set to use in the eigensystem computation. Since this search for an optimum set of samples at each node involves an enormous search space, it seems like a good candidate for a genetic algorithm solution.

The SHOSLIF-O uses a relatively large number of features at each node, and produces a tree for which a different number of children can appear for every node. Weng and Chen [161] have explored using a much simpler binary tree structure for autonomous navigation. The technique finds a single eigenvector associated with the largest eigenvalue, and projects the mean of the training set onto that eigenvector. All of the training samples are then also projected onto the single eigenvector and compared with the mean projection. Those projections which are greater than the mean projection are assigned to the right subtree, while those less than the mean projection are relegated to the left subtree. This means that a very coarse classification is done at each level of the tree, but the single-vector projection is very fast. Furthermore, to find the single eigenvalue/eigenvector pair, a much less computationally expensive method can be utilized than is required when the complete eigensystem must be found. Therefore, the training phase becomes near-real-time as well as the recognition phase. We need to study this paradigm applied to object recognition.

For our 2D transformation handling, we simulate fixation points in our visual attention mechanism with our imposed grid in the scale and position of the object of interest. The system would perform better if instead a collection of images were taken of the object at different fixation points. Images taken at novel views of the object of interest would provide a host of further information, such as different lighting conditions and shadows.
We have done a limited number of experiments exploring how the system performs on 3D rotation of objects, predominantly faces. We should study how the system performs on a larger class of objects with 3D rotation, and we need to explore the system in detail using much larger data sets.

The SHOSLIF-O as described works on 2D grayscale intensity images. There is no inherent limitation in the SHOSLIF paradigm that restricts its use to grayscale images; a multi-spectral image, such as a color image, could be utilized in the same framework. In addition, 2.5D and 3D images could be cast into this SHOSLIF paradigm. In a 2D grayscale image, the rows are concatenated together to form a long vector. If each pixel in this 2D grayscale image were instead a 3D vector of (Red, Green, Blue), the same setup could be utilized. A range image, such as that obtained from the Technical Arts 100 White Scanner, has a z value in place of an intensity value in the image; this could be used directly in the SHOSLIF framework. Finally, stereo cameras provide a depth map, and this information could also be included in the input to the SHOSLIF-O system.

In [159], there is a description of how the SHOSLIF could perform image segmentation during the test phase at the same time that the recognition is being done. During the learning phase, the human teacher specifies the areas of interest on the training images. During the recognition phase, a size and position estimate is improved as the test probe traverses the Space-Tessellation tree. While that size and position estimate is improving, the background in the attention images is gradually masked off using the feature vector projections computed through the levels of the tree. This is done because the components corresponding to the background in each of the MEF and MDF vectors are zeros in the learning phase. Therefore, the coarse-to-fine classification masks the background pixels that may be present in the original fovea image for recognition. Finally, by backtracking through the tree, the nonzero components in the MEF space can be recorded back into the input image. This is clearly something that we should pursue further.

Finally, the visual attention mechanism should be explored further. This visual attention mechanism provides a well-framed portion of the image that contains a single object of interest in a standard position and scale. This mechanism is fundamental for the SHOSLIF to operate successfully.

To summarize, we have noted the following research topics for further study:

- Investigate the optimum number of features to retain.
- Explore an optimum sample of the training set to use for the eigensystem computation.
- Construct a binary tree based on a single eigenvector projection.
- Study the SHOSLIF-O system for 2D transformation handling and 3D rotation of objects in more detail and with larger data sets.
- Extend the work from 2D grayscale intensity images to multi-spectral images, such as color images.
- Utilize the basis vectors to produce a segmentation of the input test images.
- Develop the visual attention mechanism.

The SHOSLIF paradigm can be used for a host of applications. Some application areas being explored include object recognition [146], motion understanding [43], vision-guided robot manipulation [80], and autonomous navigation [33]. All of these application areas should be further explored to provide a unified framework for all sorts of computer vision-based applications.
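As a small illustration of the image-to-vector setup described earlier in this section (a hypothetical NumPy sketch, not part of the original system), grayscale, color, and range images all reduce to the same kind of long sample vector, so the rest of the SHOSLIF machinery would be unchanged.

```python
import numpy as np

def vectorize(image):
    """Flatten an image into the single long input vector used by the
    framework: rows are concatenated, and any per-pixel channels (e.g.
    R, G, B values, or a z value from a range image) are simply appended,
    so grayscale, color, and range images all become 1-D sample vectors."""
    return np.asarray(image, dtype=float).reshape(-1)

gray = np.random.rand(64, 64)        # 2D grayscale image
color = np.random.rand(64, 64, 3)    # each pixel is an (R, G, B) vector
depth = np.random.rand(64, 64)       # z values from a range scanner

print(vectorize(gray).shape)   # (4096,)
print(vectorize(color).shape)  # (12288,)
print(vectorize(depth).shape)  # (4096,)
```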
APPENDICES

Appendix A

Space comparison test database

This appendix contains the images used in the space comparison experiment (section 4.3.1). Pixel weighting was engaged for this experiment, and this weighting is reflected in the images displayed here. The test probes and their top five responses for the multiple-projection experiment are given in Figure A.2. Note that a band of fifteen competing paths was explored, but only the top five responses are given here.

Figure A.1: Training images for space and hierarchy study (the image classes include bike, bushes, centersign, mouse, currency, desk, door handle, egg, hand, house, lab, monitor, parkinglot, pedestrian, pentagon, phone, renault, rome, runner, ruts, seated, sharpener, sidewalk, standingdog, truck, chair, dog, and hydrant).

Figure A.2: Test images for space and hierarchy study. Each probe image is followed by the list of the top five responses using pure image space, MEF space, and MDF space. Beneath each retrieved image is the distance measure computed between the test probe and the retrieved image.
Appendix B

Face Database

This appendix contains the database of face images used for various experiments. Some of the data is from Michigan State University's Pattern Recognition and Image Processing laboratory; some come from the FERET database maintained at George Mason University under the direction of the Army Research Labs (Jonathan Phillips, program director); some come from the Weizmann Institute in Israel; some from the MIT Media Laboratory. A selection of the images for training the face database is given in Figure B.1. A selection of the test probe images used for the various experiments is given in Figure B.2.
Figure B.1: Database of face training images.

Figure B.2: Database of face test images.

BIBLIOGRAPHY

Bibliography

[1] H. M. Abbas and M. M. Fahmy, “A neural model for adaptive Karhunen-Loeve Transform (KLT),” in Proc. of IEEE Intl. Joint Conf. on Neural Networks, vol. 2, (Baltimore, Maryland), pp. 975–980, June 1992.
[2] S. Akamatsu, T. Sasaki, H. Fukamachi, and Y. Suenaga, “A robust face identification scheme - KL expansion of an invariant feature space,” in SPIE Proceedings Vol. 1607: Intelligent Robots and Computer Vision X: Algorithms and Techniques, pp. 71–84, 1991.
[3] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, pp. 53–58, 1989.
[4] D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs: Prentice-Hall, 1982.
[5] P. Ballard and G. Stockman, “Twinkle,” Master's thesis, Michigan State University, August 1991.
[6] R. Baron, “Mechanisms of human facial recognition,” International Journal of Man-Machine Studies, pp. 137–178, 1981.
[7] J. C. Bartlett and A. Fulton, “Familiarity and recognition of faces in old age,” Memory and Cognition, vol. 19, pp. 229–238, May 1991.
[8] E. B. Baum, “On the capabilities of multilayer perceptrons,” Journal of Complexity, vol. 4, pp. 193–215, 1988.
[9] E. B. Baum, “When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples?,” in Proc. EURASIP Workshop, Sezimbra, Portugal (L. B. Almeida and C. J. Wellekens, Eds.), pp. 2–25, New York: Springer-Verlag, 1990.
[10] J. L. Bentley, “Multidimensional search trees used for associative searching,” Comm. ACM, vol. 18, pp. 509–517, September 1975.
[11] M. Bichsel, Strategies of Robust Object Recognition for the Automatic Identification of Human Faces. PhD thesis, Eidgenössischen Technischen Hochschule Zürich, 1991. Diss. ETH Number 9467.
[12] I. Biederman, “Human image understanding: Recent research and a theory,” Computer Vision, Graphics, Image Process., pp. 29–73, 1985.
[13] W. F. Bischof and T. Caelli, “Learning structural descriptions of patterns: A new technique for conditional clustering and rule generation,” Pattern Recognition, vol. 27, no. 5, pp. 689–697, 1994.
[14] R. C. Bolles and P. Horaud, “3DPO: A three-dimensional part orientation system,” International Journal of Robotics Research, vol. 5, pp. 3–26, Fall 1986.
[15] O. Bourdon and G. Medioni, “Object recognition using geometric hashing on the connection machine,” in Proceedings International Conference on Pattern Recognition, (Atlantic City), pp. 596–600, 1990.
[16] M. Brady and H. Asada, “Smoothed local symmetries and their implementation,” International Journal of Robotics Research, vol. 3, no. 3, pp. 36–61, 1984.
[17] C. Bregler and S. M. Omohundro, “Nonlinear manifold learning for visual speech recognition,” in Proc. International Conference on Computer Vision, pp. 494–499, 1995.
[18] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall, 1993.
[19] R. Brooks, “Model-based three-dimensional interpretations of two-dimensional images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 5, no. 2, pp. 140–150, 1983.
[20] V. Bruce, “Perceiving and recognizing faces,” Mind and Language, pp. 342–364, 1990.
[21] R. Brunelli and T. Poggio, “HyperBF networks for gender classification,” in Proceedings DARPA Image Understanding Workshop, pp. 311–314, 1992.
[22] R. Brunelli and T. Poggio, “Face recognition: features versus templates,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1042–1052, October 1993.
[23] J. Buhmann, M. Lades, and C. V. D. Malsburg, “Size and distortion invariant object recognition by hierarchical graph matching,” in Proceedings International Joint Conference on Neural Networks, pp. 411–416, 1990.
[24] R. Buse and Z.-Q. Lie, “Feature extraction and analysis of handwritten words in grey-scale images using Gabor filters,” in Proceedings IEEE International Conference on Image Processing, vol. I, pp. 164–168, 1994.
[25] T. Caelli and A. Dreier, “Variations on the evidence-based object recognition theme,” Pattern Recognition, vol. 27, no. 2, pp. 185–204, 1994.
[26] T. Caelli and S. Xu, “A new method for shape-from-focus,” in AI '90 (C. P. Tsang, Ed.), World Scientific, 1990.
[27] R. A. Campbell, S. Cannon, G. Jones, and N. Morgan, “Individual face classification by computer vision,” in Proc. Conf. Modeling Simulation Microcomput., pp. 62–63, 1987.
[28] S. Carey, R. Diamond, and B. Woods, “The development of face recognition - a maturational component?,” Developmental Psychology, vol. 16, pp. 257–269, 1980.
[29] S. Carey, “A case study: Face recognition,” in Explorations in the Biological Language, New York: Bradford Books, 1987.
[30] C. Chen and A. Kak, “A robot vision system for recognizing 3-D objects in low-order polynomial time,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1535–1563, 1989.
[31] J. L. Chen and G. C. Stockman, “The curvature method for modeling and matching smooth objects,” tech. rep., Michigan State University, Dept. of Computer Science, 1992.
[32] J.-L. Chen, G. C. Stockman, and K. Rao, “Recovering and tracking pose of curved 3D objects from 2D images,” in Proceedings Computer Vision and Pattern Recognition, (New York), pp. 233–239, 1993.
[33] S. Chen and J. Weng, “Vision-based navigation using self-organizing learning,” in Proc. 2nd Asian Conf. on Computer Vision, (Singapore), December 1995.
[34] S. Chen and J. Weng, “Autonomous navigation through case-based learning,” in Proc. IEEE Int'l Symposium on Computer Vision, (Coral Gables), November 1995.
[35] S. Chen and J. Weng, “Autonomous navigation using recursive partition tree,” in Proc. Workshop on Vision for Robots, (Pittsburgh), August 1995.
[36] Y. Cheng, K. Liu, J. Yang, and H. Wang, “A robust algebraic method for human face recognition,” in Proc. Int. Conf. on Pattern Recognition, pp. 221–224, 1992.
[37] C. H. Chien and J. K. Aggarwal, “Shape recognition from a single silhouette,” in Proc. ICCV, (London), 1987.
[38] M. J. Conlin, “A rule based high level vision system,” in SPIE Proceedings Volume 726: Intelligent Robots and Computer Vision, pp. 314–320, 1986.
[39] J. H. Connell and M. Brady, “Generating and generalizing models of visual objects,” Artificial Intelligence, vol. 31, pp. 159–183, 1987.
[40] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. The MIT Press, 1991.
[41] I. Craw, H. Ellis, and J. Lishman, “Automatic extraction of face features,” Pattern Recognition, vol. 5, pp. 183–187, 1987.
[42] I. Craw, D. Tock, and A. Bennett, “Finding face features,” in Proceedings 2nd European Conference on Computer Vision, pp. 92–96, 1992.
[43] Y. Cui, D. L. Swets, and J. J. Weng, “Learning-based hand sign recognition using SHOSLIF-M,” in Proc. International Conference on Computer Vision, (Cambridge, Massachusetts), pp. 631–636, June 1995.
[44] Y. Cui and J. Weng, “Learning-based hand sign recognition,” in Proceedings, International Workshop on Automatic Face- and Gesture-Recognition, (Zurich), pp. 201–206, June 1995.
[45] L. Devroye, “Automatic pattern recognition: A study of the probability of error,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 530–543, July 1988.
[46] S. J. Dickinson, A. P. Pentland, and A. Rosenfeld, “3-D shape recovery using distributed aspect matching,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 174–198, 1992.
[47] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[48] H. D. Ellis, “Introduction to aspects of face processing: Ten questions in need of answers,” in Aspects of Face Processing (H. Ellis, M. Jeeves, F. Newcombe, and A. Young, Eds.), pp. 3–13, Nijhoff, 1986.
[49] H. D. Ellis, D. M. Ellis, and J. A. Hosie, “Priming effects in children's face recognition,” British Journal of Psychology, vol. 84, pp. 101–110, February 1993.
[50] M. Fang and S. Singh, “Private communication.” Siemens Corporate Research, Princeton, New Jersey, 1993.
[51] R. A. Fisher, “The statistical utilization of multiple measurements,” Annals of Eugenics, vol. 8, pp. 376–386, 1938.
[52] P. J. Flynn and A. K. Jain, “3D object recognition using invariant feature indexing of interpretation tables,” Computer Vision, Graphics, Image Process., vol. 55, no. 2, pp. 119–129, 1992.
[53] P. J. Flynn and A. K. Jain, “BONSAI: 3-D object recognition using constrained search,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 1066–1075, October 1991.
[54] P. Foldiak, “Adaptive network for optimal linear feature extraction,” in Proc. of IEEE Intl. Joint Conf. on Neural Networks, vol. 1, (Washington DC), pp. 401–405, 1989.
[55] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd ed., 1990.
[56] K. Fukushima, S. Miyake, and T. Ito, “Neocognitron: A neural network model for a mechanism of visual pattern recognition,” IEEE Trans. Systems, Man, Cybernetics, vol. 13, no. 5, pp. 826–834, 1983.
[57] K. Fukushima, “Cognitron: A self-organizing multilayered neural network,” Biological Cybernetics, vol. 20, pp. 121–136, 1975.
[58] S. F. Galton, “Personal identification and description-I,” Nature, pp. 173–177, 1888.
[59] A. G. Goldstein, “Race-related variation of facial features: Anthropometric data I,” Bulletin of the Psychonomic Society, vol. 13, pp. 187–190, 1979.
[60] A. G. Goldstein, “Facial feature variation: Anthropometric data II,” Bulletin of the Psychonomic Society, vol. 13, pp. 191–193, 1979.
[61] B. A. Golomb and T. J. Sejnowski, “SEXNET: A neural network identifies sex from human faces,” in Advances in Neural Information Processing Systems III, pp. 572–577, San Mateo: Morgan Kaufmann, 1991.
[62] G. Gordon and I. Vincent, “Applications of morphology to feature extraction for face recognition,” in SPIE Proceedings, Vol. 1658: Nonlinear Image Processing, 1992.
[63] G. Gordon, “Face recognition based on depth maps and surface curvature,” in SPIE Proceedings, Vol. 1570: Geometric Methods in Computer Vision, pp. 234–247, 1991.
[64] V. Govindaraju, S. N. Srihari, and D. B. Sher, “A computational model for face location,” in Proceedings 3rd International Conference on Computer Vision, pp. 718–721, 1990.
[65] P. Green, “Biology and cognitive development: The case of face recognition,” Animal Behaviour, vol. 43, pp. 526–527, March 1992.
[66] W. E. L. Grimson and T. Lozano-Pérez, “Localizing overlapping parts by searching the interpretation tree,” IEEE Trans. Pattern Anal. Machine Intell., vol. 9, no. 4, pp. 469–482, 1987.
[67] W. E. L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints. The MIT Press, 1990.
[68] P. W. Hallinan, “Recognizing human eyes,” in SPIE Proceedings Vol. 1570: Geometric Methods in Computer Vision, pp. 214–226, 1991.
[69] R. W. Hamming, Coding and Information Theory. Englewood Cliffs: Prentice-Hall, 1980.
[70] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, vol. II. Addison-Wesley, 1993.
[71] L. Harmon and W. Hunt, “Automatic recognition of human face profiles,” Computer Graphics and Image Processing, vol. 6, pp. 135–156, 1977.
[72] L. Harmon, M. Khan, R. Lasch, and P. Ramig, “Machine identification of human faces,” Pattern Recognition, vol. 13, pp. 97–110, 1981.
[73] L. Harmon, S. Kuo, P. Ramig, and U. Raudkivi, “Identification of human face profiles by computer,” Pattern Recognition, vol. 10, pp. 301–312, 1978.
[74] L. Harmon, “The recognition of faces,” Scientific American, vol. 229, no. 5, pp. 71–82, 1973.
[75] D. C. Hay and A. W. Young, “The human face,” in Normality and Pathology in Cognitive Function (H. D. Ellis, Ed.), pp. 173–202, London: Academic Press, 1982.
[76] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, 1991.
[77] D. D. Hoffman and W. A. Richards, “Parts of recognition,” Cognition, vol. 18, pp. 65–96, 1985.
[78] Z. Hong, “Algebraic feature extraction of images for recognition,” Pattern Recognition, vol. 24, pp. 211–219, 1991.
[79] B. K. P. Horn, Robot Vision. Cambridge, MA: MIT Press, 1986.
[80] S. J. Howden and J. Weng, “Hand-eye coordinated learning using hierarchical space tessellation,” Tech. Rep. CPS-94-29, Michigan State University, 1994.
[81] D. P. Huttenlocher and S. Ullman, “Object recognition using alignment,” in Proc. Int'l Conf. Computer Vision, pp. 102–111, 1987. London, England.
[82] K. Ikeuchi and T. Kanade, “Automatic generation of object recognition programs,” Proceedings of the IEEE, vol. 76, no. 8, pp. 1016–1035, 1988.
[83] C. L. Jackins and S. L. Tanimoto, “Oct-trees and their use in representing three-dimensional objects,” Computer Vision, Graphics, Image Process., vol. 14, pp. 249–270, November 1980.
[84] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, New Jersey, 1988.
[85] A. K. Jain and P. J. Flynn, 3D Object Recognition Systems. Amsterdam: Elsevier, 1993.
[86] A. K. Jain and R. Hoffman, “Evidence-based recognition of 3-D objects,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 783–802, November 1988.
[87] A. K. Jain and M. D. Ramaswami, “Rule-based 3D object recognition from range images,” tech. rep., Michigan State University, 1988.
[88] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs: Prentice Hall, 1989.
[89] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, New York, 1986.
[90] T. Kanade, Computer Recognition of Human Faces. Basel and Stuttgart: Birkhauser, 1977.
[91] G. J. Kaufman and K. J. Breeding, “The automatic recognition of human faces from profile silhouettes,” IEEE Trans. Systems, Man, and Cybernetics, vol. 6, pp. 113–121, 1976.
[92] Y. Kaya and K. Kobayashi, “A basic study on human face recognition,” in Frontiers of Pattern Recognition (S. Watanabe, Ed.), pp. 265–289, New York: Academic Press, 1972.
[93] W. Kim and P. Yuan, “A practical pattern recognition system for translation, scale, and rotation invariance,” in Proceedings Conf. on Computer Vision and Pattern Recognition, pp. 391–396, 1994.
[94] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve procedure for the characterization of human faces,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 103–108, January 1990.
[95] J. J. Koenderink and A. J. van Doorn, “The internal representation of solid shape with respect to vision,” Biological Cybernetics, vol. 32, no. 4, pp. 211–216, 1979.
[96] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biol. Cybern., vol. 43, pp. 59–69, 1982.
[97] T. Kohonen, “Self-organized network,” Proc. IEEE, vol. 43, pp. 59–69, 1990.
[98] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1988.
[99] H. Kuhnel and P. Tavan, “A network for discriminant analysis,” in Artificial Neural Networks (T. Kohonen, K. Makisara, O. Simula, and J. Kangas, Eds.), Elsevier Science, 1991.
[100] S. Kung and K. Diamantaras, “A neural network learning algorithm for adaptive principal component extraction (APEX),” in IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 861–864, 1990.
[101] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. V. D. Malsburg, and R. Wurtz, “Distortion invariant object recognition in the dynamic link architecture,” IEEE Trans. Computers, vol. 42, pp. 300–311, 1993.
[102] Y. Lamdan and H. J. Wolfson, “Geometric hashing: A general and efficient model-based recognition scheme,” in Proc. 2nd Int'l Conf. Computer Vision, pp. 238–249, 1988.
[103] X. Li and R. C. Dubes, “Tree classifier design with a permutation statistic,” Pattern Recognition, vol. 19, pp. 229–235, 1986.
[104] R. Linsker, “Self-organization in a perceptual network,” Computer, pp. 105–117, March 1988.
[105] M. M. Loeve, Probability Theory. Princeton, NJ: Van Nostrand, 1955.
[106] D. G. Lowe, “Fitting parameterized three-dimensional models to images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 441–450, May 1991.
[107] B. Manjunath, R. Chellappa, and C. von der Malsburg, “A feature based approach to face recognition,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 373–378, 1992.
[108] B. Manjunath, C. Shekhar, R. Chellappa, and C. von der Malsburg, “A robust method for detecting image features with application to face recognition and motion correspondence,” in Proc. Int. Conf. on Pattern Recognition, pp. 208–212, 1992.
[109] D. Marr and K. Nishihara, “Representation and recognition of the spatial organization of three-dimensional shapes,” in Proceedings of the Royal Society of London, pp. 269–294, 1977.
[110] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: W. H. Freeman and Company, 1982.
[111] R. Mauro and M. Kubovy, “Caricature and face recognition,” Memory and Cognition, vol. 20, pp. 433–441, July 1992.
[112] E. A. Maylor, “Recognizing and naming faces: aging, memory retrieval, and the tip of the tongue state,” Journal of Gerontology, vol. 45, pp. P215–P226, November 1990.
[113] B. Moghaddam and A. Pentland, “Maximum likelihood detection of faces and hands,” in International Workshop on Automatic Face- and Gesture-Recognition (M. Bichsel, Ed.), pp. 122–128, 1995.
[114] J. Morton and M. H. Johnson, “Conspec and conlern: a two-process theory of infant face recognition,” Psychological Review, vol. 98, pp. 164–181, April 1991.
[115] J. K. Mui and K. S. Fu, “Automated classification of nucleated blood cells using a binary tree classifier,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, pp. 429–443, September 1980.
[116] H. Murase and S. K. Nayar, “Illumination planning for object recognition in structured environments,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 31–38, June 1994. Seattle, Washington.
[117] O. Nakamura, S. Mathur, and T. Minami, “Identification of human faces by identity maps,” Pattern Recognition, vol. 26, no. 3, pp. 263–272, 1991.
[118] R. Nevatia and T. Binford, “Description and recognition of complex-curved objects,” Artificial Intelligence, vol. 8, pp. 77–98, 1977.
[119] M. Nixon, “Eye spacing measurements for facial recognition,” in SPIE Proc. Vol. 575: Applications of Digital Image Processing VIII, pp. 279–285, 1985.
[120] E. Oja, “Neural networks, principal components, and subspaces,” Intl. Journal of Neural Systems, vol. 1, pp. 61–68, 1989.
[121] F. Pan and M. Keane, “A new set of moment invariants for handwritten numeral recognition,” in Proceedings IEEE International Conference on Image Processing, vol. I, pp. 154–158, 1994.
[122] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 84–91, June 1994. Seattle, Washington.
[123] A. Pentland, R. W. Picard, and S. Sclaroff, “Photobook: Tools for content-based manipulation of image databases,” in SPIE Storage and Retrieval for Image and Video Databases II, no. 2185, (San Jose), February 1994.
[124] A. P. Pentland, “Perceptual organization and the representation of natural form,” Artificial Intelligence, vol. 28, pp. 29–73, 1986.
[125] T. Poggio and F. Girosi, “Networks for approximation and learning,” Proc. IEEE, vol. 78, pp. 1481–1497, 1990.
[126] A. Rahardja, A. Sowmya, and W. Wilson, “A neural network approach to component versus holistic recognition of facial expressions in images,” in SPIE Proceedings Vol. 1607: Intelligent Robots and Computer Vision X: Algorithms and Techniques, pp. 62–70, 1991.
[127] S. S. Rakover and B. Cahlon, “To catch a thief with a recognition test: the model and some empirical results,” Cognitive Psychology, vol. 21, pp. 423–468, October 1989.
[128] D. Reisfeld and Y. Yeshurun, “Robust detection of facial features by generalized symmetry,” in Proceedings 11th International Conference on Pattern Recognition, pp. 117–120, 1992.
[129] G. Rhodes and T. Tremewan, “The Simon then Garfunkel effect: semantic priming, sensitivity, and the modularity of face recognition,” Cognitive Psychology, vol. 25, pp. 147–187, April 1993.
[130] J. Rubner and K. Schulten, “Development of feature detectors by self-organization,” Biol. Cybern., vol. 62, pp. 193–199, 1990.
[131] J. Rubner and P. Tavan, “A self-organizing network for principal component analysis,” Europhysics Letters, vol. 10, pp. 693–698, 1989.
[132] T. Sakai, M. Nagao, and S. Fujibayashi, “Line extraction and pattern recognition in a photograph,” Pattern Recognition, vol. 1, pp. 233–248, 1969.
[133] A. Samal and P. A. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey,” Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[134] A. Samal, “Minimum resolution for human face detection and identification,” in Proc. SPIE/SPSE Symposium on Electronic Imaging, 1991.
[135] H. Samet, “Region representation: Quadtrees from boundary codes,” Comm. ACM, vol. 23, pp. 163–170, March 1980.
[136] M. Seibert and A. M. Waxman, “Recognizing faces from their parts,” in SPIE Proceedings Vol. 1611: Sensor Fusion IV: Control Paradigms and Data Structures, pp. 129–140, 1991.
[137] J. Sergent, “Microgenesis of face perception,” in Aspects of Face Processing, Dordrecht: Nijhoff, 1986.
[138] I. K. Sethi and G. P. R. Savarayudu, “Hierarchical classifier design using mutual information,” IEEE Trans. Pattern Anal. Machine Intell., vol. 1, pp. 194–201, April 1979.
[139] L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human faces,” Journal of the Optical Society of America, vol. 4, pp. 519–524, 1987.
[140] W. Sjoberg, B. Gruber, and C. Swatloski, “Visual-field asymmetry in face recognition as a function of face discriminability and interstimulus interval,” Perceptual and Motor Skills, vol. 72, pp. 1267–1271, June 1991.
[141] F. Stein and G. Medioni, “Efficient two dimensional object recognition,” in 10th International Conference on Pattern Recognition, (Atlantic City), 1990.
[142] F. Stein and G. Medioni, “Structural indexing: Efficient three-dimensional object recognition,” IEEE Trans. Pattern Anal. Machine Intell., pp. 125–145, 1992.
[143] G. C. Stockman, “Object recognition and localization via pose clustering,” Computer Vision, Graphics and Image Processing, vol. 40, pp. 361–387, 1987.
[144] T. J. Stonham, “Practical face recognition and verification with WISARD,” in Aspects of Face Processing (H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds.), pp. 426–441, Dordrecht: Nijhoff, 1984.
[145] D. L. Swets, B. Punch, and J. J. Weng, “Genetic algorithms for object recognition in a complex scene,” in Proceedings, International Conference on Image Processing, (Washington, DC), pp. 595–598, October 1995.
[146] D. L. Swets and J. J. Weng, “Image-based recognition using learning for generalizing parameters,” in Proceedings, Asian Conference on Computer Vision, (Singapore), December 1995. Accepted and presented at the conference; not in the Proceedings due to an oversight between authors.
[147] D. L. Swets and J. J. Weng, “Efficient image retrieval using a network with complex neurons,” in Proceedings, International Conference on Neural Networks, (Perth, Western Australia), November 1995. Invited Paper.
[148] D. L. Swets and J. J. Weng, “View-based recognition using SHOSLIF,” in Proceedings, International Conference on Neural Networks and Signal Processing, (Nanjing, China), December 1995. Most Excellent Paper award.
[149] D. L. Swets and J. J. Weng, “Efficient content-based image retrieval using automatic feature selection,” in Proceedings, International Symposium on Computer Vision, (Coral Gables, Florida), pp. 85–90, November 1995.
[150] D. L. Swets and J. J. Weng, “Using discriminant eigenfeatures for image retrieval,” IEEE Trans. Pattern Anal. Machine Intell., 1996. In preparation.
[151] D. L. Swets and J. J. Weng, “SHOSLIF-O: SHOSLIF for object recognition and image retrieval (phase II),” Tech. Rep. CPS 9539, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, October 1995.
[152] D. L. Swets and J. J. Weng, “SHOSLIF-O: SHOSLIF for object recognition (phase I),” Tech. Rep. CPS 94-64, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, December 1994.
[153] M. Szpir, “Accustomed to your face,” American Scientist, vol. 80, pp. 537–540, November-December 1992.
[154] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[155] T. Valentine and A. Ferrara, “Typicality in categorization, recognition and identification: evidence from face recognition,” British Journal of Psychology, vol. 82, pp. 87–102, February 1991.
[156] A. Wahlin, L. Backman, T. Mantyla, A. Herlitz, M. Viitanen, and B. Winblad, “Prior knowledge and face recognition in a community-based sample of healthy, very old adults,” Journals of Gerontology, vol. 48, pp. P54–P61, March 1993.
[157] M. A. Wallace and M. J. Farah, “Savings in relearning face-name associations as evidence for ‘covert recognition’ in prosopagnosia,” Journal of Cognitive Neuroscience, vol. 4, pp. 150–154, Spring 1992.
[158] J. J. Weng, “Cresceptron and SHOSLIF: Toward comprehensive visual learning,” in Early Visual Learning (S. K. Nayar and T. Poggio, Eds.), Oxford University Press, 1996.
[159] J. J. Weng, “SHOSLIF: The hierarchical optimal subspace learning and inference framework,” Tech. Rep. CPS 94-15, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, March 1994.
[160] J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and segmentation using the Cresceptron,” in Proc. International Conference on Computer Vision, pp. 121–128, May 1993. Berlin, Germany.
[161] J. Weng and S. Chen, “SHOSLIF-N: SHOSLIF for autonomous navigation (phase II),” Tech. Rep. CPS-95-22, Michigan State University, Department of Computer Science, May 1995.
[162] J. Weng, T. S. Huang, and N. Ahuja, Motion and Structure from Image Sequences. Springer-Verlag, 1993.
[163] S. S. Wilks, Mathematical Statistics. Wiley, New York, 1963.
[164] P. H. Winston, Artificial Intelligence. Addison-Wesley, third ed., 1984.
[165] R. J. Woodham, “Photometric method for determining surface orientation from multiple images,” in Shape from Shading (B. K. P. Horn and M. J. Brooks, Eds.), Cambridge, MA: MIT Press, 1989.
[166] C. J. Wu and T. S. Huang, “Human face profile recognition by computer,” Pattern Recognition, vol. 23, no. 3/6, pp. 255–259, 1990.
[167] A. Yuille, D. Cohen, and P. Hallinan, “Feature extraction from faces using deformable templates,” in Proceedings Computer Vision and Pattern Recognition, pp. 104–109, 1989.