This is to certify that the dissertation entitled "The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for View-Based Object Recognition and Image Retrieval," presented by Daniel L. Swets, has been accepted towards fulfillment of the requirements for the PhD degree in Computer Science.

THE SELF-ORGANIZING HIERARCHICAL OPTIMAL SUBSPACE LEARNING AND INFERENCE FRAMEWORK FOR VIEW-BASED OBJECT RECOGNITION AND IMAGE RETRIEVAL

By Daniel L. Swets

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1996

ABSTRACT

THE SELF-ORGANIZING HIERARCHICAL OPTIMAL SUBSPACE LEARNING AND INFERENCE FRAMEWORK FOR VIEW-BASED OBJECT RECOGNITION AND IMAGE RETRIEVAL

By Daniel L. Swets

A self-organizing framework for object recognition is described. The SHOSLIF (Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework) system uses the theories of optimal linear projection for automatic optimal feature selection and a hierarchical structure to achieve a logarithmic retrieval complexity. A Space-Tessellation Tree is automatically generated using the Most Expressive Features (MEFs) and the Most Discriminating Features (MDFs) at each level of the tree. We use incremental learning for large-scale and open-ended learning problems, and we allow for perturbations in the size and position of objects in the images through learning. We demonstrate the technique on a large database of widely varying real-world objects in natural settings, and show the applicability of the approach even for large variability within a particular class, including 3D rotation.

To Debby, for making the sacrifices that let this happen.
To Joseph, who admonished me to work very hard.
To Kira, who always smiled away the troubles of the day.

COR MEUM TIBI OFFERO DOMINE PROMPTE ET SINCERE

ACKNOWLEDGMENTS

I would like to acknowledge those people and institutions that have assisted me through my PhD program at Michigan State University. First I would like to thank my advisors, Dr. John Weng and Dr. Anil K. Jain, for their patient and persistent guidance in my studies. Without their guidance, this thesis would never have been completed.
Their ideas, advice, and encouragement over the years have proved invaluable in my research activities. Dr. Weng was a constant fountain of ideas for alternative ways of approaching a problem. Dr. Jain, with the great time constraints that his stature in the vision community demands, always took the time to be genuinely interested in the progress of my studies. I would also like to thank my guidance committee members for the time and energies spent helping me work through my research. Dr. George Stockman provided practical ideas and insightful suggestions as to how the project could be improved. Dr. Jerry Schuur provided the keys to the mathematical basis and gave pointers for further analysis.

I would like to acknowledge the financial support provided by the National Science Foundation, the Office of Naval Research, the Army Research Labs, and Ameritech. Without this institutional support, my studies at Michigan State University would not have been possible.

I would like to thank my children Joseph and Kira, for putting up with Daddy being gone one more night. Finally, I would like to thank my wife Debby for her unwavering love, support, and encouragement over the years that this thesis developed. Her commitment to this degree was necessarily more formidable than my own, and without her backing, my studies would not have been successful.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 SHOSLIF-O: The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for Object Recognition
  1.1 Overview of the SHOSLIF
  1.2 Contributions of the Thesis
2 Object Recognition
  2.1 Sensing scheme
    2.1.1 Desktop Scanners
    2.1.2 Cameras
    2.1.3 Range scanners
  2.2 Model representation
    2.2.1 Feature-based representations
    2.2.2 Multiple view representations
    2.2.3 Hierarchical representations
  2.3 Model-based object recognition
    2.3.1 Interpretation tree
    2.3.2 Geometric Hashing
    2.3.3 Evidence-Based Recognition
    2.3.4 Geons
    2.3.5 Distance minimization
    2.3.6 Pose Clustering
    2.3.7 Graph Matching
    2.3.8 Alignment
    2.3.9 2D object recognition
  2.4 Face recognition
    2.4.1 The human capacity for face recognition
    2.4.2 Representations for automatic systems
    2.4.3 Face Localization
    2.4.4 Face Recognition
    2.4.5 Profile Recognition
  2.5 Self-organizing approaches
  2.6 Summary
3 The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF)
  3.1 System overview
  3.2 Visual attention
  3.3 The Most Expressive Features (MEF)
    3.3.1 Density of the space
    3.3.2 Principal component analysis
    3.3.3 Computational Considerations
  3.4 The Most Discriminating Features (MDF)
    3.4.1 Multivariate linear discriminant analysis
    3.4.2 Computational Considerations
    3.4.3 The DKL projection
    3.4.4 Explanation of the MDFs
  3.5 Space Tessellation
    3.5.1 The Hierarchy Decomposes the Problem
    3.5.2 Hierarchical Quasi-Voronoi Tessellation
    3.5.3 Incremental Learning
    3.5.4 The Adoption Problem
    3.5.5 Conditional Sterility Principle
    3.5.6 Properties of the Hierarchical Quasi-Voronoi Tessellation Algorithm
  3.6 Image Retrieval
  3.7 Summary
4 SHOSLIF for Object Recognition
  4.1 SHOSLIF-O using manually-defined hierarchical labels
  4.2 SHOSLIF-O Options for Automatic Space Tessellation
    4.2.1 Distance measures
    4.2.2 Use of Multiple Competing Paths
    4.2.3 Expected Radius
  4.3 Experimental Results
    4.3.1 Space comparison
    4.3.2 The Clustering Effect of the MDF subspace using the DKL projection
    4.3.3 Face database
    4.3.4 Combination database: Faces and Other Objects
    4.3.5 Handling 2D variation
    4.3.6 Handling different 3D views
    4.3.7 Neighborhood information
    4.3.8 Timing
  4.4 Summary
5 Summary and Future Research
  5.1 Summary
  5.2 Future research
APPENDICES
A Space comparison test database
B Face Database
BIBLIOGRAPHY

LIST OF TABLES

2.1 Classification of Major Approaches to Object Recognition
3.1 A characterization of a sample tree built using the Hierarchical Quasi-Voronoi Tessellation Algorithm. The table lists the number of nodes found on each level of the tree, because the tree is too large to show in detail.
4.1 An example of a set of hierarchical labels that could be used in the hierarchical labels mode.
4.2 Results of testing the tree built under the hierarchical labels learning mode.
4.3 Results of subspace comparison study. For the "flat" databases, only a single level of the tree was created. For the single projection databases, the MEF/MDF projection matrices were only computed at the root node, and the same projection matrices were used throughout the tree. For the multiple projection trees, a new projection was made at each node of the tree.
4.4 Results showing the superiority of the MDF subspace for subclass selection.
4.5 Summary of experiment on a face database of 1042 images (384 individuals). For re-substitution each training image was given as a test probe. For the disjoint test set, a list of 246 images not found in the training set was used for testing.
4.6 Summary of large natural scene experiment. The training images were drawn at random from the pool of available images, with the remaining images serving as a disjoint set of test images.
4.7 Summary of the scale generalization experiment on the large combination face and other object data set. A disjoint test set was used for testing the retrieval capability; each test probe was randomly scaled in the range of [-20, +20]% of the fovea size to test the ability of the system to generalize over various scales.
4.8 Summary of the learning-based parameter generalization experiment. A disjoint test set was used for testing the retrieval capability.
4.9 Summary of the learning-based view generalization experiment. A disjoint test set was used for testing the retrieval capability.
4.10 Summary of blurring experiment.
4.11 Average time required to perform a single probe in Tree mode and Non-tree mode. In contrast to the rapid retrieval of 9.1 seconds for images from the tree version using fifteen competing paths, when a "flat" database is used, i.e., one in which no hierarchy is utilized, a test probe into the same data requires an average of 400.7245 seconds to complete.
4.12 Recognition results using the flat database for the combination database.

LIST OF FIGURES

2.1 Pinhole camera model. The y-z projection is similar to the x-z projection shown in the figure.
2.2 Stereo camera imaging system. Based on Fig. 2.3 in Ballard and Brown, Computer Vision.
2.3 Smoothed local symmetry. A and B are points on the bounding contour of a shape. The angle formed by AB and the outward normal at A equals the angle formed by AB and the inward normal at B. The midpoint of AB defines the local symmetry. The loci of local symmetries which are maximal with respect to forming a smooth curve is called a spine.
3.1 A tree view of the SHOSLIF network. Each node corresponds to a processing element.
3.2 The network structure of the SHOSLIF. Each level of the network has a variable number of nodes. A node at level l0 has a sub-network at levels l > l0. A node consists of a processing module and an attention supplier. As indicated by the multiplexers in the figure, each of the processing modules can activate one of the attached nodes at the next level of the network. Each of the processing modules controls an attention module, which is responsible for supplying properly aligned fovea images to the activated node.
3.3 A top-level flow of the processing performed at each node in the Space-Tessellation Tree during the training phase. A set of training samples which enter the processing element are extended to allow for learning-based generalization for position, scale, and orientation. These extended samples are vectorized and used to produce the projection matrices to the MEF and MDF subspaces. The extended samples are projected to the MDF subspace using these matrices, and a tessellation of the space covered by the node being worked on is produced. The projection matrices and the space tessellation for each node are produced in the learning phase.
3.4 The vectorization of an image. SHOSLIF treats a p x q image as a point in pq-dimensional space. The rows of pixels in an image are concatenated together to form a vector.
3.5 A histogram of inter-point distances for the images in the database treated as vectors. The horizontal axis shows the distance between a pair of image vectors; the vertical axis shows the number of pairs that fell into each bin.
3.6 A histogram of inter-point distances for the images projected to the MEF space. The horizontal axis shows the distance between a pair of MEF vectors; the vertical axis shows the number of pairs that fell into each bin.
3.7 Example MEFs and their ability to approximate a training set. The training images were taken from the Weizmann Institute of Science in Israel. The images of a single individual are shown.
3.8 Example MEFs for a large training set. The top thirty are shown for a training set of 168 images. Six images of each of 28 individuals were used for training this set.
3.9 Problems with the MEF projection for class separation. The MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation.
3.10 A histogram of inter-point distances for the images projected to the MDF space. The independent axis shows the distance between a pair of MDF vectors; the dependent axis shows the number of pairs that fell into each bin.
3.11 A sample of MEF and MDF vectors treated as images. The MEF vectors in (a) show the tendency of the principal components to indicate major variations in the training set, such as lighting direction. The MDF vectors in (b) show the ability of the MDFs to discount those factors unrelated to classification. The training images used to produce these vectors are courtesy of the Weizmann Institute.
3.12 Distribution of some samples using the best two features in the MEF and the MDF spaces for the root node. In the MDF subspace, objects of the same class are clustered much more tightly than in the MEF space.
3.13 The SHOSLIF's hierarchical Quasi-Voronoi space tessellation. Digits and "+" symbols indicate centers of cells at levels l, l+1, and l+2. Dark lines indicate cell boundaries at a level of the Space-Tessellation Tree nearer the root.
3.14 An example tessellation of the space covered by node N.
3.15 An example showing the complexity of the class separation problem at two different levels of the tree. Samples from the same class are given for both graphs. Since the internal node contains many fewer samples than the root node, the MDF vectors can be optimized for the classes contained in the node.
3.16 An example tree built with both the MEF and the MDF subspaces. Shaded nodes indicate that the MDF space was used; square nodes indicate that the MEF space was used; bold nodes are leaf nodes.
3.17 An example tessellation where grandchild G is close to a tessellation boundary and outside the expected distance radius for its nearest parent C1.
3.18 An incorrect tessellation of node N that would result from adding X as a child of N. The subtree rooted at G would be adopted by X and therefore unreachable if this tessellation were to be established.
3.19 Lemma 1
3.20 An example tessellation where changing the tessellation would not result in the loss of a subtree.
3.21 A valid tessellation of node N resulting from the addition of X as a child of N. No subtrees of N are lost by this tessellation modification.
3.22 Algorithm 1, The Hierarchical Quasi-Voronoi Tessellation Algorithm
3.23 Theorem 1
3.24 Hyperspheres used in Theorem 1.
3.25 Lemma 2
3.26 The general flow of each SHOSLIF Processing Element.
3.27 A tessellation in which a single-path search may not find a nearest match. Node N at level l will shuttle test probe X to its nearest child C2. The nearest neighbor for X may well be found in the subtree of G, which is missed in a single-path search.
3.28 Algorithm 2, The Image Retrieval Algorithm
3.29 Theorem 2
4.1 Nodes used for training the MDFs.
4.2 The set of hierarchical labels used.
4.3 Performance of the Normalized Distance measure as a function of the number of nearest neighbors explored. Both the distance ranking method and the confidence ranking method are shown.
4.4 Performance of the feature space distance measure as a function of the number of nearest neighbors explored. When the history was maintained, the results traced the same lines as when history was not maintained.
4.5 Performance of the feature space distance measure using image space instead of MEF subspace once a single class has been found. The performance is given as a function of the number of nearest neighbors explored.
4.6 Distance from Subspace description for 3D. The subspace is shown as the plane, and is defined by the matrix V. Two classes are shown in this example, and the MDF vector that can separate them is shown.
4.7 Performance of the DFS distance measure, given as a function of the number of nearest neighbors explored.
4.8 A comparison of the performance for various k competing paths.
4.9 The performance of the system for different numbers of MEF/MDF features. The number of features from the subspace used was varied to show how the MDF subspace outperforms the MEF subspace. 95% of the variance for the MDF subspace was attained when fifteen features were used; 95% of variance for the MEF subspace did not occur until 37 features were used. Using 95% of the MEF variance resulted in an 89% recognition rate, and that rate was not improved using more features.
4.10 A sample of the Weizmann Institute face data. The frontal images for an individual contain two expressions; for each expression, a set of three images with differing lighting conditions forms the set of images available for an individual.
4.11 Example of how well within-class variation is handled. All search images were correctly classified into the class defined by the training images.
4.12 Representatives of training images from some of the various classes learned. Fovea images are shown without the pixel weighting; this pixel weighting will tend to suppress the background. Objects of interest are at the center of the fovea images. In the learning phase for this experiment, the fovea was generated using manual extraction of the areas of interest.
4.13 Example of a failed search probe. The retrieval failed to select the appropriate class due to a lack of 3D rotation in the set of training images.
4.14 Generalization for size and position of objects. The search probes were synthetically generated from images in a disjoint test set. Each search probe retrieved an image from the appropriate class. The images shown in the first line were the original images. The scaling shown is 50% of the fovea size (both positively and negatively); the position change shows a 12% negative change in the fixation point for both the row and the column coordinates.
4.15 Example probes and their retrieved images when scaling was enabled on the tree and when scaling was disabled on the tree during the training phase.
4.16 A sample of the Weizmann Institute face data. Each individual for this experiment contained five viewpoints under identical lighting conditions.
4.17 View-based generalization. When sufficient training images are given for a particular class, the system is able to accurately retrieve images from the correct class.
4.18 Example failure in the learning-based view generalization experiment. The failures occurred only when the viewing angle of the test probe did not fall between the viewing angles of two training images. These images are courtesy of the Weizmann Institute.
4.19 A stack of multi-level resolution images. Each level is a subsample of pixels from a group in the next lower level.
4.20 A stack of blurred images. The pixels in each level correspond to a linear combination of neighborhood pixels at the next lower level.
4.21 Examples of blurring applied to different images. At levels of the tree nearer the root, more blurring is applied to both the training and the test images. As the processing moves away from the root node of the tree, less and less blurring is applied to provide more detail for the finer classifications. Shown here are examples of images with progressive blurring applied to them.
4.22 Timing results obtained from running the system in the tree mode. This data shows the linear relationship between the time required to complete a test probe retrieval and the number of competing paths. This linear relationship demonstrates how the k competing paths explore only a band of the tree.
5.1 Major strengths and limitations of the described system.
A.1 Training images for space and hierarchy study.
A.2 Test images for space and hierarchy study. Each probe image is followed by the list of the top five responses using pure image space, MEF space, and MDF space. Beneath each retrieved image is the distance measure computed between the test probe and the retrieved image.
B.1 Database of face training images.
B.2 Database of face test images.

Chapter 1

SHOSLIF-O: The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework for Object Recognition

In order for an intelligent machine to interact with its environment, it must be able to locate and recognize objects from its environment [67]. That is, the machine must sense its environment, represent this sensed data in some usable form, and match objects found in the environment with the set of objects known to the machine [110]. The machine invokes the computer vision module when the sensed data is visual input.
Researchers have worked diligently to have their computer vision modules emulate the capability of biological visual systems, but the generality, adaptability, and robustness of these biological systems have eluded researchers from a wide range of disciplines. A central task in a computer vision module is the recognition of objects from various images of the machine's environment [11] [47] [50] [82] [85]. Model-based object recognition is the domain where a model exists for every object in the recognition system's universe of discourse. The research emphasis in this paradigm has historically been on the design of efficient matching algorithms from a manually designed feature set with hand-crafted shape rules, as discussed in chapter 2.

Manually designing a feature set is appealing because such a feature set is very efficient. When designed properly, a very small number of parameters for each of the objects is sufficient to capture the distinguishing characteristics among the objects to be recognized. This premeditated efficiency is bittersweet, however, in that generalization of the features to objects other than those for which they were designed is usually impossible. For example, parameters painstakingly tuned to efficiently discriminate between good and bad cherries will probably be useless in differentiating grades of lumber. Automatic selection of features is thus a crucial step toward a general, fully automatic recognition system.

An alternative to hand-crafting features is the self-organizing approach, in which the machine will automatically determine what features to use and how to organize the knowledge structure [8] [9] [45] [3] [100] [104] [99] [120] [131] [130]. (In the neural network community, self-organization is often taken to be synonymous with unsupervised learning [76]. In this document, by self-organization we mean that the system is able to organize its structure autonomously; that is, we keep the learning modality separate from the structure of the network. Our learning mode can either be unsupervised, i.e., no class labels, or supervised, i.e., class labels are associated with each training sample; in either learning mode, the system organizes its hierarchical structure, so we call the system self-organizing.) In this framework, the recognition phase of the system is preceded by a learning phase. The learning phase focuses on the methods by which the system can automatically organize itself for the task of object recognition, giving it a wide range of generality [96], [97]. If designed properly, this approach can deal directly with complex, real-world images [160] because the system is able to learn the environment in which it is to function, thereby making it much more flexible than its static counterparts. Self-organizing object recognition systems are open-ended, allowing them to learn and improve continuously [1] [54] [125]. As such, these systems are not restricted by their human designer's limitations in understanding vision and its rules.

Allowing the system to organize itself, however, raises some important efficiency issues. The first one is the feature selection issue. A well-known problem in feature selection is called "the curse of dimensionality": more features do not necessarily imply a better classification success rate, and in fact, the performance can actually drop off.
For example, the principal component analysis, also known as the Karhunen-Loève projection and "eigenfeatures," has been used for face recognition [122] [154] and lip reading [17]. An eigenfeature, however, may represent aspects of the imaging process which are unrelated to recognition (for example, the illumination direction). An increase or decrease in the number of eigenfeatures that are used does not necessarily lead to an improved success rate.

The second issue is how the system should organize itself. Given a k-dimensional feature space with n objects, a linear search is impractical since each recognition probe requires O(n) computations (1 billion visits to models for a database of one billion objects). So how does a self-organizing system automatically find the most useful features in the images? How can this be done without restricting the domain to which the system will be applicable? How can the system learn and recognize a huge number of objects (say a few million) without slowing the process down to a crawl? The work described here addresses these crucial issues. We discuss how to automatically find the most discriminating features for view-based object recognition and image retrieval, and how a hierarchy can be utilized to achieve a very low computational complexity in both the training and the retrieval phases. Preliminary experiments utilize the largest number of diverse classes (48 plus faces) of natural objects reported in the literature so far. The work by Flynn and Jain [52] has 100 objects, using range images. The work by Weng et al. [160] uses intensity images and describes the use of 11 classes of objects plus faces.

1.1 Overview of the SHOSLIF

The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space defined by the training images. This space is generated using two projections: a Karhunen-Loève projection to produce a set of Most Expressive Features (MEFs), and a subsequent discriminant analysis projection to produce a set of Most Discriminating Features (MDFs). The MEFs are well-suited to image characterization, and can be used as a representation of images. The MDFs are better suited to class discrimination than the MEFs are.

The system builds a Space-Tessellation Tree that partitions these MEF/MDF spaces to provide an O(log n) complexity for retrieving images from a database of images. The children of a node N represent a tessellation of the space subtended by N, using incremental learning to build the tree. This tree could suffer from the Adoption Problem, which occurs when a new node being added to the tree "adopts" children from another node's subtree. The Hierarchical Quasi-Voronoi Tessellation algorithm shown in Algorithm 1 effectively and efficiently deals with this problem using the Conditional Sterility Principle. This tree-building routine will generate a bounded number of children for each node, and with a uniform distribution of input samples, will generate O(log n) levels.

1.2 Contributions of the Thesis

This work represents a significant advance towards recognizing a large number of diverse real-world objects in natural settings. Although the Cresceptron [160] has demonstrated a similar versatility, the SHOSLIF-O described here is unprecedented in its efficiency, optimality, and the number of classes learned and reported.
This work has produced good recognition results without imposing hand-crafted shape rules or human-specified features, which is a major departure from the traditional shape-rule-based recognition methods.

The detailed techniques that are used in this work are not entirely new. The techniques have been individually used in different fields for different topics, such as in statistics and pattern recognition. However, the basic approach and actual framework are both new in addressing the very complex problem of recognition. Although the Cresceptron attacked this task of recognition from views of general objects, a large number of objects was not possible until this new framework, due to the way that the Cresceptron organized its database. Two major contributions of the SHOSLIF work are its generality and scalability, i.e., the capability and potential for applying to a large number of problems of different natures and the capability to recall from a large number of learned views quickly.

The Most Expressive Features (MEFs), also known as principal component features and "eigenfeatures" [122], have been used in various capacities and applications, such as face recognition [154], image representation [94], and image processing planning [116]. We develop these features and describe their use for the characterization of a class of objects in our database. But we also show their limitations in terms of discriminating between classes. These MEFs can deftly represent images in a compact and efficient manner, but they are not necessarily good features for distinguishing between different classes of objects. The MEFs describe some major variations in the set of sample images; these variations, however, may be completely irrelevant to how the classes are divided. Those features tuned to discriminating between classes of sample objects can be found using the discriminant analysis procedure [84]; because they optimally discriminate among the training classes in the sense of a linear transform, we call them the Most Discriminating Features (MDFs). This thesis describes the Discriminant Karhunen-Loève (DKL) projection for dealing with discriminant analysis in high dimensional space relative to the number of training samples. The standard discriminant analysis procedure breaks down when the within-class scatter of the training set of samples becomes degenerate, such as when the number of samples is less than the dimensionality of the sample vectors. The DKL projection performs the discriminant analysis projection in the space of the Karhunen-Loève projection to avoid this problem.

The SHOSLIF produces a hierarchical space tessellation using the hyperplanes determined by these MEFs and MDFs. The feature space of all possible images is partitioned into cells of varying sizes to produce a hierarchical Quasi-Voronoi tessellation of the feature space. As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the database into increasingly simpler problems. This has a two-fold advantage: the Space-Tessellation Tree provides an efficient means for image retrieval, and the decomposition of the problem into simpler problems provides a mechanism for the linearization of the class separation problem. Though the classes may not be linearly separable at some level of the space partitioning, as the Space-Tessellation Tree recursively subdivides the space, they will become linearly separable at a deeper level of the tree.
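The following is a minimal sketch, in NumPy, of how such a two-stage DKL projection can be computed: a Karhunen-Loève (PCA) projection to the MEF subspace, followed by linear discriminant analysis carried out inside that subspace, where the within-class scatter is no longer degenerate. The function names, the dimensionalities, and the use of a pseudo-inverse are choices made for this illustration only, not the exact procedure developed in Chapter 3.

```python
import numpy as np

def dkl_projection(X, y, mef_dim=30, mdf_dim=10):
    """Sketch of a Discriminant Karhunen-Loeve (DKL) projection.

    X: (n_samples, n_pixels) image vectors; y: class labels.
    Step 1 (MEF): PCA / Karhunen-Loeve projection.
    Step 2 (MDF): discriminant analysis computed in the MEF subspace.
    """
    mean = X.mean(axis=0)
    Xc = X - mean

    # --- Step 1: Most Expressive Features (principal components) ---
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_mef = Vt[:mef_dim].T                  # (n_pixels, mef_dim)
    Z = Xc @ W_mef                          # samples in MEF space

    # --- Step 2: Most Discriminating Features (LDA in MEF space) ---
    classes = np.unique(y)
    d = Z.shape[1]
    Sw = np.zeros((d, d))                   # within-class scatter
    Sb = np.zeros((d, d))                   # between-class scatter
    gmean = Z.mean(axis=0)
    for c in classes:
        Zc = Z[y == c]
        cmean = Zc.mean(axis=0)
        Sw += (Zc - cmean).T @ (Zc - cmean)
        diff = (cmean - gmean)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Solve the generalized eigenvalue problem Sb v = lambda Sw v.
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    W_mdf = evecs[:, order[:mdf_dim]].real  # (mef_dim, mdf_dim)

    # Composite projection: image vector -> MDF feature vector.
    def project(img_vec):
        return ((img_vec - mean) @ W_mef) @ W_mdf
    return project
```

In this sketch the intermediate PCA step is what keeps the within-class scatter matrix invertible when the number of training images is far smaller than the number of pixels, which is exactly the degeneracy the DKL projection is designed to avoid.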
The SHOSLIF assumes a well-framed image containing a single object of interest. We describe a mechanism that uses genetic algorithms for finding those objects in a complex scene. Genetic algorithms are a search technique for dealing with a very large search space, such as the one encountered in image segmentation. This work describes a technique for using genetic algorithms for object localization in a complex scene. A brief review of genetic algorithms is given, followed by the specifics of using a genetic algorithm for this image segmentation application. The results from several experiments show that this approach is a viable method for image segmentation. Another approach to producing a well-framed image is to use a coarse-to-fine search, similar to that used by the Cresceptron [160].

Even assuming that a good visual attention technique exists for extracting images of interest to process, this visual attention mechanism may not give identically registered images with which to work. The MEF and MDF subspaces rely for their good results on images that are close to perfectly registered. In order to allow for generalization to variations in the position and scale of the objects in the training samples, this work proposes a method for producing an extended training set from the training images already given. A small variation around the attention position and scale should be sufficient for the method to succeed, since the attention supplier is assumed to provide a rough estimate of these parameters. Though for a more robust system further image acquisition could be utilized, this work describes using the existing training images for extending the training set. A set of grids for the position and scale parameters is generated to overlay the fixation point and scale found in the visual attention phase. A new training image is extracted from the image at each of these grid points, generating an extended training image set that can handle the recognition generalization to position and scale variations.

This thesis describes the Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF) and demonstrates its utility for view-based object recognition and image retrieval. We show the technique on a large multifarious set of real-world objects found in natural scenes. We demonstrate the ability of the MDF space to tolerate within-class variations and to discount such imaging artifacts as lighting direction. We have a total of 500 classes of objects, including faces, and we show experiments on a training database of over 1300 images. We provide the results of many experiments that show different favorable aspects of the system and its application to object recognition.

What we describe is not a general object recognition system. The system is view-based: in order to retrieve a correct image from an image database, the SHOSLIF requires that the system has been trained on an image taken at a very similar viewpoint (i.e., within a few degrees). This limitation is true of any approach that is based on a pixel-level representation.

In summary, the main contributions of this thesis are:

- Use of the Most Discriminating Features for automatic optimal feature selection.
- The DKL projection technique for discriminant analysis in high dimensional space relative to the number of training samples.
- An automatically generated space-tessellation tree for fast access into a large model database, and decomposition of the problem into solvable linear discriminant analysis subproblems.
- Use of genetic algorithms for object localization in a complex scene.
- Learning-based object size and position generalization.
- Application for object recognition on a large set of diverse, real-world objects in natural settings.
- Low logarithmic computational complexity made possible by the framework.
- Generality of objects handled by the system.
- Applicability to general scenes and handling of variations in a unified framework. Such a capability has been severely lacking in computer vision.

The major restrictions of the current work include:

- Well-framed images in which the object of interest is roughly centered and has a standard size.
- Limitation in dealing with a significant amount of background in the well-framed images.

Chapter 2

Object Recognition

The general object recognition problem is a difficult one. The mapping is one-to-many, i.e., many 2D images can be interpreted as the same 3D object due to the imaging process. The popular object recognition paradigms must therefore have built-in biases. These biases allow the vision system to analyze the scene successfully most of the time, but they are application dependent. When the assumptions imposed by the object recognition system are not valid, the process fails.

The steps preceding object recognition include image formation from an input sensor, possible pre-processing to enhance/restore images, image feature detection, and representation of objects found in the scene. The sensor used could be some sort of a scanner, such as those popular with optical character recognition (OCR) schemes; a single image from a single camera or multiple images from one or more cameras in order to obtain sparse depth information; or a range scanner to obtain dense depth information.

An object recognition system designer faces several issues [67], including the selection of sensory cues to be used, the representation of these cues, the storage and retrieval of objects from large libraries, the matching strategy, the position and orientation estimation of objects found in the scenes, incremental addition of objects to the library (i.e., learning), and the role of attention, object function, and context in the recognition process. Some of these issues are addressed by the various approaches to object recognition developed over the past several years.

2.1 Sensing scheme

The first step in image processing is the acquisition of the image from the real world. This can be accomplished in a variety of ways suitable for a particular application. This "image" can be a set of images presented to the image processing system as a particular instance for analysis. Thus for a system using the binocular stereo sensing scheme, the image instance is a pair of intensity images that the image processing system uses.

Image acquisition systems can be either active or passive. Passive systems accept input from the world without feedback from the other levels of image processing; active systems actively seek out the areas of importance in their worlds using the instructions given by higher-level reasoning modules.
Thus the active systems may control their sensor's direction, focus, or resolution based on the instructions derived from previously obtained images; passive systems have no such automatic adjustments.

Several different types of input sensors can be used to acquire an image for further processing. These different sensors are explored in the following sections.

2.1.1 Desktop Scanners

Images acquired for document analysis are typically obtained from a desktop or hand-held scanner. The scanning is accomplished by moving a light source coupled with a photo cell relative to the page or film being scanned. The paper or film is typically held stationary and the sensing assembly is swept along the item being scanned. For desktop scanners, this movement is mechanically handled by the scanner; hand-held scanners require the operator to sweep the sensing assembly across the page. The sensing assembly consists of a light emitter and a light sensor (i.e., a photo cell). The amount of light transmitted or reflected (for transparent film scanners or opaque paper scanners, respectively) is provided to the computer using the scanner.

2.1.2 Cameras

Camera behavior is usually approximated using a pinhole camera model, shown in figure 2.1. Typical cameras in use today are either charge injection devices (CID cameras), where charges built up on capacitor arrays are read from the location, or charge-coupled devices (CCD cameras), where an array of capacitors forms a shift register which responds to the illumination present in the scene.

Figure 2.1: Pinhole camera model. The y-z projection is similar to the x-z projection shown in the figure.

Intensity Images

Cameras which are used for producing intensity images generate a set of images I_i(r, c), i = 1, 2, ..., n, that have a spectral value for each pixel located at row r, column c in the image set. For color cameras, n = 3, and I_1 corresponds to the red image, I_2 the green image, and I_3 the blue image. Gray scale cameras have n = 1, and I_1(r, c) represents the gray scale value of the pixel at (r, c).

Depth Maps

Often, the object recognition paradigm requires depth information in order to successfully recognize objects in the sensed scene. As such, intensity images have been replaced by depth maps, either directly using range scanners or indirectly using multiple views from a set of cameras [25]. These indirect depth maps can be obtained using multiple cameras, or from a single intensity camera under photometric stereo, depth from focus, or structure from motion techniques, among others. Direct depth maps can be obtained using a range scanner.

Photometric Stereo. Photometric stereo [165] is the sensing scheme in which multiple images are obtained using a single camera in a fixed position. For each image taken by the camera for a particular image instance, the direction or method of illumination is changed. This change provides a system of irradiance equations that can be used to recover shape (and depth) from the different shadings produced by the different lighting directions.

Depth from Focus. Shape from focus uses at least two images, each taken with a different focus, in order to obtain sparse depth information. When the focal length exactly coincides with the distance between the camera and the object, theoretically no blurring will be present at the pixels comprising the object. When this focal length is changed, the amount of blurring that results about a particular pixel is used to determine how far the point is from the camera [26].
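To make the blur-to-depth relationship concrete, one standard geometric-optics sketch (not necessarily the specific formulation used in [26]) applies the thin-lens law. Here u is the object distance, v the in-focus image distance, f the focal length, s the actual lens-to-sensor distance, and D the aperture diameter:

\[
\frac{1}{u} + \frac{1}{v} = \frac{1}{f},
\qquad
b(u) = D\,s\,\left|\frac{1}{u} + \frac{1}{s} - \frac{1}{f}\right| .
\]

The blur-circle diameter b(u) vanishes when the point is exactly in focus and grows as the focus setting moves away from the point's depth, so measuring the blur at two or more focus settings allows u to be recovered.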
Structure from Motion. Inter-frame motion between successive image frames can be used to ascertain the structure and therefore the depth of the scene. The motion parameters, such as translation and rotation, can be estimated from at least two views in which corresponding features are matched. The relative depth can be estimated from these feature correspondences as long as each feature is matched across three views [162].

Multiple cameras

Multiple cameras can also be used to obtain depth information. For binocular imaging, the disparity between the two images, i.e., the difference in position between the images, is used to decode the depth information. The baseline distance can be used in conjunction with matched points to calculate the z coordinate of the object points in the real world. With the cameras configured as in figure 2.2, where the two lens centers are separated by a baseline of 2d and both cameras have focal length f, a world point matched at image positions x' and x'' has depth z = 2df / (x' - x''), so depth is inversely proportional to the disparity between the matched points. This setup provides an inexpensive way to produce reliable depth maps.

Figure 2.2: Stereo camera imaging system. Based on Fig. 2.3 in Ballard and Brown, Computer Vision.

2.1.3 Range scanners

Three-dimensional range scanners, such as the Technical Arts 100 White Scanner, produce dense depth maps directly. Active illumination is typically used on range scanners, where either a stripe or grid is projected onto the scene being sensed. The third dimension can be obtained using triangulation techniques. Alternatively, lidar technology can be used to obtain range information. This technique uses light in the same way that radar uses radio waves and ultrasonic ranging uses high frequency sound waves. Lidar systems send a pulse of light toward the scene to be sensed and measure the time taken for a reflection to return. This elapsed time is used to deduce the range between the objects in the scene and the sensor. Although these lidar scanners are very accurate and versatile sensing systems, because of the speed at which light travels, lidar scanning can only be done with very sensitive equipment.

Only those points that are visible to both the sensing device and the emitter device can have depth information computed for them. Because of this, many images from differing views must be scanned to find the concavities and eliminate self-occlusion.

2.2 Model representation

The decisions regarding the way that a model is represented, i.e., the way that an object is abstracted, stored, and utilized, are among the most important phases in the development of an object recognition system. The choice of model representation has direct implications on the types of features to extract and select, the utility of various matching techniques, and the success of the overall object recognition endeavor.

2.2.1 Feature-based representations

Feature-based object recognition systems represent the models in the database in terms of a set of N features or properties. As such, an object can be viewed as a point in this N-dimensional space. When a good choice of features is used, the feature vectors which represent the objects in the database will occupy different regions in the feature space. The choice of features depends on the types of objects that need to be recognized. This choice of features is a fundamental step in obtaining good object recognition rates using feature-based representations.

Global features

Probably the easiest and most compact representation for a set of objects is through global features. Global features are those whose measurement depends on the entire object spatially.
These features are typically measurements such as Fourier descriptors, area, perimeter, moments, circularity, elongation, and the number of connected components.

Global features, though simple and intuitively appealing, do not necessarily produce the best results in an object recognition system. They are sensitive to noise (e.g., the number of connected components can vary wildly in the presence of noise), they are not necessarily sufficient for distinguishing all shapes (e.g., the area of a circle with radius r is the same as the area of a square whose sides are s = sqrt(pi) * r), and they do not work well under occlusion (e.g., the moments of objects change drastically when the object is occluded).

Local features

Local features can be used for representation of objects as well. Local features depend only on a small spatial part of the object. These features must be organized into some structure, such as a graph whose edges represent relationships among the features found at the vertices, in order to characterize the interrelationships among the local features. The same properties of objects used as global features can be used as local features, working simply on a subregion of the image. For example, the area of a hole detected in an input image can be useful for industrial part recognition and inspection. The difference is that only the local information is used to extract the features.

Points

Lists of points to describe various objects are the most direct set of features to use. Point features are essentially pixel coordinates of important landmarks on the object being represented and attributes associated with those coordinates. The pixels can represent corners under a 2D model representation scheme or vertices under a 3D modeling technique, vertices of cones and centers of spheres for quadric surfaces, or points of high curvature or other landmark points for free-form surfaces. The attributes can represent aspects such as the amount of convexity at an object's boundary point, the torsion of the surface at a point, or the angle formed by the corners at the point under consideration. These points and their attributes can be used to build an object model.

Edges

A local edge is a small area in the image where the local gray levels (for 2D images) or the depth values (for 3D images) are changing rapidly in a simple (e.g., monotonic) way [4]. These edges can be used to represent a boundary of an object or to construct a wire-frame polyhedral approximation to an object.

Object Silhouettes. The boundary representation of an object is the most common representation for 2D objects. The boundary is typically represented by a set of points, by a chain code, or by a sequence of line segments. The results from a border-following or edge tracking algorithm produce a list of points that represents the border of a 2D object. In most cases, some subset of this list, i.e., the landmark points of high curvature, is maintained to describe the object under consideration. In other cases, the whole silhouette is stored to describe the objects in the model database [32] [37].

Alternatively, a boundary can be encoded by a chain code. The chain code is computed by quantizing the object boundary on a grid of appropriate resolution. The edges between the points on the resulting quantization are given a code based on the angle formed with a right-pointing vector. This chain code provides a drastic reduction in boundary storage requirements and can even be used directly to find many other features of the object [70].
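The sketch below computes an 8-direction chain code from an ordered list of boundary points; it is a generic illustration (not the specific scheme of [70]) and assumes the boundary has already been quantized so that consecutive points are 8-connected grid neighbors.

```python
# 8-direction chain code: each move between consecutive boundary points is
# encoded by the index of its direction, measured counter-clockwise from the
# right-pointing vector (0 = east, 1 = north-east, ..., 7 = south-east).
DIRECTIONS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
              (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def chain_code(boundary):
    """boundary: ordered list of (x, y) grid points around a closed contour."""
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        step = (x1 - x0, y1 - y0)           # unit move on the sampling grid
        codes.append(DIRECTIONS[step])
    return codes

# A small square traversed counter-clockwise is stored as 8 direction codes
# instead of 8 coordinate pairs.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
print(chain_code(square))   # [0, 0, 2, 2, 4, 4, 6, 6]
```

The code list itself already exposes shape properties: for example, perimeter length is the number of codes (weighting diagonal moves by sqrt(2)), and turning angles come from differences of successive codes.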
Finally, a 2D shape can be outlined by a sequence of line segments. Finding lines using a Hough transform or some other line-finding technique produces a list of line segments that describes the boundary of a 2D object. This list is similar to maintaining a list of points, but provides additional information such as the angles formed between adjacent line segments.

Wire-frame models. Three-dimensional images can also use edges as a primitive for model representation. A 3D model consisting of only the edges of the object is called a wire-frame model [70]. Attributes that are typically associated with these wire-frame model edges are such items as the sense of the edge (i.e., is it a jump edge, a crease edge, a blade edge, etc.), the orientation of the edge, or the length of the edge.

Parts

Rather than describing an object by its silhouette, another popular, and successful [77], representation scheme views an object as a collection of smaller parts. The parts-based representation is founded in the proposition that the visual system decomposes shapes into parts. These parts should be represented by rules defining part boundaries rather than part shapes.

Skeletons. Objects that are a collection of elongated, thin pieces may be well represented by a set of curves, each of which describes a piece of the object. Smoothed Local Symmetries [16], used to generate and generalize model databases [39], find the spine of an object by locating the midpoint of the line segment joining points A and B on the silhouette of an object, as shown in figure 2.3.

Figure 2.3: Smoothed local symmetry. A and B are points on the bounding contour of a shape. The angle formed by AB and the outward normal at A equals the angle formed by AB and the inward normal at B. The midpoint of AB defines the local symmetry. The loci of local symmetries which are maximal with respect to forming a smooth curve is called a spine.

Surfaces. Objects, specifically 3D objects, can be decomposed into a set of surfaces and their adjacency relationships. General surfaces can be approximated by a collection of planar or quadric surface patches, and the attributes of this patch set can yield definition to the attributes of the general surface. For example, the shape of a surface can be ascertained by the set of surface normals of the planar patches approximating the surface.

An object can be represented by three lists and a set of topological relationships among these lists: a list of surfaces, a list of edges, and a list of vertices. Information relating each edge in the edge list to the pair of surfaces surrounding it, and relating each edge to a pair of vertices, forms the topological relationships among the lists. This model is called the Surface-Edge-Vertex representation [70].

Volumes. Representing 3D objects as a collection of volumetric primitives is another way to decompose objects into parts. A generalized cylinder is defined by an axis, a cross section, and a sweeping rule [118]. By sweeping the cross section along the axis using the sweeping rule, a solid is created. For example, a standard cone uses a line segment axis, a circular cross section, and an increasing sweeping rule. Using this representation and a collection of connectivity relations, a wide variety of objects can be exactly defined.
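As a concrete illustration of this definition, the sketch below samples surface points of a simple generalized cylinder with a straight axis and a circular cross section. The axis is assumed to lie along the z direction, and the function and parameter names are invented for this example.

```python
import numpy as np

def generalized_cylinder(axis_start, axis_end, radius_rule, n_axis=20, n_theta=36):
    """Sample surface points of a simple generalized cylinder.

    The axis is a straight line segment, the cross section is a circle, and
    radius_rule(t) gives the cross-section radius at axis parameter t in [0, 1]
    (the "sweeping rule").  Assumes the axis is aligned with the z direction.
    """
    p0, p1 = np.asarray(axis_start, float), np.asarray(axis_end, float)
    points = []
    for t in np.linspace(0.0, 1.0, n_axis):
        center = (1 - t) * p0 + t * p1          # point on the axis
        r = radius_rule(t)                      # swept cross-section radius
        for theta in np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False):
            points.append(center + np.array([r * np.cos(theta),
                                              r * np.sin(theta), 0.0]))
    return np.array(points)

# A cone: circular cross section whose radius increases linearly along the axis.
cone = generalized_cylinder((0, 0, 0), (0, 0, 1), radius_rule=lambda t: t)
```

Changing the sweeping rule (constant radius for a cylinder, linear for a cone) or the cross-section shape yields the variety of solids mentioned above.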
Hierarchical generalized cylinder models [109] more closely approximate the actual object as the hierarchy is traversed. A superquadric [124] is another volumetric primitive that can describe the parts of a 3D object. This primitive deforms an ellipsoid into an arbitrary (symmetric) shape. Parameters define the size of the volume and the degree to which the ellipsoid is deformed into a rectangular shape in the latitude and longitude planes.

Space-occupying features

Spatial occupancy representations define a 2D or 3D grid where each cell is either occupied or vacant. Quad trees (for 2D images) [135] and oct trees [83] or kd-trees [10] (for 3D volumes) systematically decompose the space from large areas or volumes into smaller ones. At each level of the tree, a certain area or volume is covered by each node; the node can indicate full, where the covered portion of the space is fully occupied; empty, where the covered portion is completely empty; or partial, where a portion of the space is occupied. Partially full areas or volumes are further decomposed in like manner until the finest resolution of cells is attained.

Surface normals

Finally, a robust set of features for representing a 3D object is a set of points on the Gaussian sphere [79]. The object is viewed as a collection of surface normals; each of these normal vectors is translated to the center of the unit sphere, and the set of points formed by these vectors on the sphere is called the Gaussian image of the object.

2.2.2 Multiple view representations

Feature-based representations describe the three-dimensional properties of objects. Unless range images are used, however, objects must be recognized from projections of that object onto a 2D image plane. A good representation will allow the viewpoint from which the object is viewed to be arbitrary; 3D objects viewed from varying viewpoints appear completely different when projected to a 2D image plane. To combat this, many researchers have represented 3D objects by a set of 2D descriptions instead of full 3D descriptions.

Collections of features from 2D views and their relationships

Representative two-dimensional views of an object can be described using the standard features presented in the previous sections, with each feature set representing the 2D projection of the object for that view [67]. These features, coupled with relationships among the various views, can represent a three-dimensional object when the set of 2D views and the extracted features are sufficient [143].

Aspect graphs

Two-dimensional views that share some set of properties can be grouped together into a (usually) small set of view classes [95]. These view classes can be organized into a graph structure. Each node of the graph represents an aspect of the object, i.e., a representative from a set of views which are topologically distinct from other views. An edge between nodes represents the transition between two such representative views.

Typically, the entire aspect graph is not computed in systems that use this representation; instead, the Gaussian view sphere is tessellated in some fashion and the resulting views are combined into equivalence classes according to some set of features or shapes [82]. This approach does not require any assumptions regarding the underlying shape of the 3D objects being recorded (such as polyhedral objects), but only a sampling of the entire view sphere is made. As such, the resulting aspect graph is only an approximation of the true graph.
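A minimal way to store such an (approximate) aspect graph is as an adjacency structure over view classes, as in the sketch below; the particular aspects, features, and labels shown are made up purely for illustration.

```python
# Each node is a view class (an "aspect"); each edge is a visual event, i.e. a
# transition between two topologically distinct representative views.
aspect_graph = {
    "nodes": {
        # view-class label -> features extracted from its representative view
        "front":  {"faces_visible": 1, "silhouette": "square"},
        "edge":   {"faces_visible": 2, "silhouette": "rectangle"},
        "corner": {"faces_visible": 3, "silhouette": "hexagon"},
    },
    "edges": {
        # adjacency: which aspects can be reached by a small viewpoint change
        "front":  ["edge"],
        "edge":   ["front", "corner"],
        "corner": ["edge"],
    },
}

def neighboring_aspects(graph, aspect):
    """Aspects reachable from `aspect` by crossing one visual event."""
    return graph["edges"].get(aspect, [])
```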
Multiple 2D views

Three-dimensional objects can also be stored as a collection of full two-dimensional images. In this case, the object for which a model is being built is viewed from many different viewpoints, each rendering a two-dimensional image of the model. The viewpoints to use in the process are selected by tessellating the Gaussian view sphere in some way in order to obtain representative 2D views of the object [31].

2.2.3 Hierarchical representations

Representation of objects by their parts and the relationships among these parts breaks down in complexity when complex objects and scenes are to be recognized. In a hierarchical structure, objects can still be defined in terms of their parts, but these parts can either be another hierarchy or an indivisible part. This approach is much like the hierarchical directory structures on modern computer systems: directory entries can either be indivisible parts (files) or can themselves be directories that can be further explored. Hierarchical structures generally render a better efficiency in searching the object representation space, since large portions of the model database can be truncated from the search for possible matches with search probe images.

2.3 Model-based object recognition

Model-based object recognition is the domain where a model exists for every object in the recognition system's universe of discourse. The research emphasis in this paradigm has historically been on the design of efficient matching algorithms from a manually designed feature set with hand-crafted shape rules. Representative examples of the different approaches to object recognition are given in Table 2.1.

Approach               Author                                    Db Size  Sensing Scheme  Features
Interpretation tree    Grimson and Lozano-Perez [66]             1        Range           Surfaces and edges
                       Flynn and Jain [53]                       20       Range           Surfaces and volumes
                       Brooks [19]                               3        Intens.         Generalized cylinders
Geometric Hashing      Stein and Medioni [141]                   1        Intens.         Edges, supersegments
                       Flynn and Jain [52]                       23       Range           Surfaces and volumes
                       Stein and Medioni [142]                   12       Range           Splashes
Evidence-based         Jain and Hoffman [86]                     10       Range           Global, surfaces, and boundaries
                       Caelli and Drier [25]                     25       Range           Global, surfaces, curvatures, boundaries
                       Bischof and Caelli [13]                   5        Synth.          Corners, perimeter
Geons                  Pentland [124]                            1        -               Superquadrics
                       Dickinson, Pentland, and Rosenfeld [46]   1        Intens.         Volumes
Distance minimization  Lowe [106]                                1        Intens.         Points, edges
                       Chen and Stockman [31]                    4        Intens.         Silhouettes
Pose clustering        Stockman [143]                            1        Intens.         Edges
Graph matching         Bolles and Horaud [14]                    1        Range           Surfaces
Alignment              Huttenlocher and Ullman [81]              1        Intens.         Points, silhouettes

Table 2.1: Classification of Major Approaches to Object Recognition

2.3.1 Interpretation tree

The interpretation tree approach to object recognition constructs a set of candidate hypotheses that satisfy correspondences between a set of model feature and input image feature pairs. A node in this tree corresponds to a proposed match between a feature in a particular model and a feature in the test object. Each branch in the tree represents another feature in the model being matched to a feature in the test image. For a node N, the branches to the children of N are all mutually exclusive; only one such branch can be valid at a time. Impossible nodes of the tree are pruned from further search. This determination of whether a node is valid is accomplished by a pruning predicate.
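A minimal sketch of this hypothesis search, with the pruning predicate left abstract, might look as follows (illustrative Python; the predicate and feature lists are assumed to be supplied by the particular system, and the names are hypothetical):

```python
# Illustrative interpretation-tree search: model features are paired with image
# features one at a time, and a pruning predicate rejects partial
# interpretations that cannot be valid.  Descendants of a pruned node are
# never explored.

def interpretation_tree(model_features, image_features, consistent, partial=None):
    """Yield complete interpretations: lists of (model_feature, image_feature) pairs."""
    partial = partial or []
    if len(partial) == len(model_features):
        yield list(partial)              # a leaf node: hand off to verification
        return
    next_model = model_features[len(partial)]
    for img in image_features:
        pairing = (next_model, img)
        # Prune impossible nodes: the subtree below them is never expanded.
        if consistent(pairing, partial):
            partial.append(pairing)
            yield from interpretation_tree(model_features, image_features,
                                           consistent, partial)
            partial.pop()
```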
These predicates have been limited in practice to unary and binary rules, where either the last feature match is tested alone or in conjunction with one of the previously seen matches in the path to the root of the tree, respectively. A node is pruned when the predicate indicates an impossible match; the descendants of a pruned node are not explored, and the search is thus constrained. The order in which these features are expanded in the tree is important to keep the size of the search tree to a minimum. Features that can cause mostly impossible nodes can be considered those features which are the most discriminatory; these features will cause most of the generated children nodes to be impossible nodes and therefore pruned from the tree. When such features are propagated toward the root of the interpretation tree, a much smaller search tree is produced, and less computational complexity is achieved for the same results. A different interpretation tree is constructed for each model in the model database. Each leaf node of such a tree must go through a verification step that determines the validity of the hypothesis described by the leaf node. Grimson [67] [66] is the major preponent of this technique in which he constrains 29 the search using the geometry of the ob ject(S) under consideration. His approach uses a unary edge-length constraint, a binary angle constraint, and two binary constraints on the distances between two linear edge fragments. Chen and Kak [30] strictly limit the interpretation tree search by invoking their verification procedure as soon as the object pose can be estimated. The remaining features are used to assist in the verification of the minimal hypothesis. Flynn and Jain [53] use the constraints of area, surface type, the radius of non- planar surfaces, the rotation, orientation, visibility, and the degree of parallelism for planar surfaces in order to prune their interpretation tree. Brooks [19] used two-dimensional images to recognize 3D objects using the volu— metric primitives of the generalized cylinder. 2.3.2 Geometric Hashing Geometric hashing [102] attempts to minimize the computational effort during the recognition phase by extensively preprocessing the library of object models [52] and storing invariant features in a hash table. These features are invariant with respect to viewpoint, and tables of object models are indexed by these features and their values. Evidence is given that some implementations of geometric hashing can handle the uncertainty of segmentation, and it is fast [141]. When implemented in parallel [15] and combined with a constraint satisfaction module to perform the verification task, object recognition (of relatively small model databases) can be done in near real time [141]. 30 Stein and Medioni [141] base their implementation of geometric hashing on sev- eral polygonal approximations as a representation of a model or a scene. The authors believe that the curvature of a general curve is its most important feature; it is in- variant with respect to scale, rotation, translation, and weak perspective (when some redundancy is supplied). No explicit distinguishing points (such as corners or inflec- tion points) are required in this method, and though the polygonal approximation looses most of the curvature information, parts are preserved in the angles formed by the consecutive line segments in the polygonal approximations. 
Connected linear segments in the images form chains of adjacent segments; the segment chains are turned into super segments by grouping a fixed number of adjacent segments. These super segments consist of the line segments, the angles between them, and the mea- sure of eccentricity (second moment ratio) of the ellipse that can be computed from the covariance matrix of the vertices. All the super segments are encoded using a Gray code [69]; the resulting predicates are used as a key for the hash table, where the super segment is stored as an entry. Note that these predicates represent the intervals within which the features of the super segments lie. Flynn and Jain [52] propose a table—driven object recognition technique using pairs of surface patches as invariant features. These invariant features are such items as angles (those formed between the the normals of planar patches, between the axes of cylindrical parts, and between a plane’s normal and a cylinder’s axis) and distances (between a planar patch and the center of a spherical part, between the centers of two spherical parts, and the minimum distance between the center of a sphere and a cylinder’s axis). Triples of these patches are used to form two-dimensional 31 interpretation tables. Each pair of invariant features in a triple of surface patches combines so that a table has multiple entries. For example, a plane-plane-plane triple uses all three pairs of angles to index into the two—dimensional table. Standard hash table techniques are used to deal with entry collisions during the table-building stage. To match test images with models in the database, the defined invariant features are extracted from the test image during the segmentation phase. Interpretation table entries within some window of the obtained values are extracted, and the features from the model(s) given by the table entries are bound to the test image features. These sets of hypotheses are sent to the verification stage (interpretation tree in the work reported). Stein and Medioni [142] define “splashes” as representations of surface patches. The surface normal Np at a carefully selected point p on a surface of interest is computed. Points lying on a set of concentric circles, centered at p, are sampled at equal angular increments. The angles between N, and the normals computed at each of the sampled points, using a local coordinate system, are used to produce a set of piecewise-linear points representing a 3D contour of the surface. The circle radius, curvature and torsion of this representation are used as invariant features to use as an index into a hash table. 2.3.3 Evidence-Based Recognition Evidence—based object recognition derives a discriminatory evidence rulebase, capable of differentiating between a set of objects. This rulebase is a set of salient (evidence) 32 conditions with corresponding evidence weights for various objects in the database [87]. During recognition, weights for various features are accumulated to give evidence for a particular model in the database [25]. This technique can deal with real-valued data and can learn characteristics of classes which are not shared by every instance of the class, constructing a useful rulebase in polynomial time [86]. Psychophysical research of biological visual systems suggests that much analy- sis of the visual stimuli occurs at a low-level, i.e., pre—cognitively [110]. 
For the evidence-based recognition technique, this is taken to mean that knowledge should be integrated only after low-level processing has derived whatever it can from the data without the aid of high-level cognitive processes. These low-level data for Jain and Hoffman [86] are surface patches, boundaries, and corresponding classifications of these. This information is structured to form an object representation.

The evidence-based object recognition technique makes use of a discriminatory rulebase for object recognition. Juxtaposed with this type of rulebase is the generative rulebase, patterned after Winston's work on analogy [164], explored by Connell and Brady [39]. In a generative rulebase, the objects can be regenerated from the rules themselves. A discriminatory rulebase, on the other hand, is capable only of differentiating between a set of objects and is incapable of generating the objects from the rules themselves. This requires the system to retain additional information about the model database besides the information actually used by the system, in order to avoid losing a description of the models being matched.

The object models in Jain and Hoffman [86] consist of several views of various objects. A segmentation, classification, and merging process provides a segmentation of a range image into surface patches. These surface patches are assigned a sense, viz. planar, convex, or concave. The model representation is made up of the following classes of information derived from the range images and the surface patches:

- morphological information, consisting of the perimeter of the set of object pixels, the number of connected background components in the image, and the number of connected components within the convex hull constructed over the object pixels;

- local patch features, consisting of the sense of the patch (i.e., planar, convex, or concave), the 3D surface area of the patch, the span of the patch (i.e., the maximum 3D distance between any two points on the patch), and the maximum squared error in fitting a piecewise-linear curve to the boundary between the patch and the background pixels;

- relationships between all possible pairs of patches, consisting of the boundary type (adjacent, jump, or remote), the normal angle between patches, the minimum and maximum distances between patches, and the boundary angle and linear fit of pixels along the boundary (for adjacent patches) or jump gaps (for jump boundaries between patches).

Knowledge about the boundary angles of the models is used to perform a merging of patches to produce a new set of representations. Rather than using all of this information in the representation, however, only "remarkable" information (evidence) which strongly cues certain objects is used. An evidence rule consists of a list specifying the features involved in the rule and the corresponding bounds for these features, a specification of the range of possible distinct patches that must satisfy all of these feature bounds, and a list of evidence weights w_i (one for each database object i) indicating how strongly the presence of these features, within their associated bounds, supports the hypothesis that the observed object is object i. These evidence rules either refute or support hypotheses about an object in a scene. The technique used to build this rulebase was to construct a minimal spanning tree over a subspace of the feature space.
Connected components of these minimum spanning trees which resulted from removing weak edges in the tree were considered to be clusters; each cluster was a potential rule which may be added to the rulebase. Clusters which contained only a few vectors were considered “good” rule candidates, because the feature subspace under consideration showed good differentiation between the objects. Caelli and Drier [25] extended the pioneering evidence-based object recognition work done by Jain and Hoffman [86] to provide some new view-independent features, an optimized rulebase generation procedure, and a neural network to estimate optimal evidence weights and match objects in the later recognition phase. CAD images projected to various viewpoints were used for database objects. The features used by Jain and Hoffman were augmented to include principal curvatures, which are invariant to rigid motion. Bischof and Caelli [13] discuss optimally classifying patterns in terms of the bounds on features which constitute different types of patterns. The label-compatibilities which should occur between unary and binary rules are taken into account in the rulebase generation. 35 2.3.4 Geons Geometric ions (geons) [12] are volumetric primitives with properties describing their cross section’s edge, symmetry, size, and axis. The edges can be straight or curved; there can be no symmetry, reflective symmetry, or rotational symmetry; cross section sizes can be constant, expanding, or expanding and contracting; the axis can be straight or curved. These yield 36 geons by taking the Cartesian product of these four properties. The geons can be viewed as the alphabet of a formal language. If the volumetric primitive attachments, location, and orientation could be encoded in the production rules of this language, recognition would be reduced to parsing strings of the language. Pentland [124] proposed the superquadric shape model in this regard. Dickinson, et al. [46] use view-dependent features in intensity images to produce a geon-based object representation that is used for object recognition. 2.3.5 Distance minimization Newton’s method of nonlinear least-squares minimization can be applied to minimiz- ing the distance between 2D contours and 3D objects [106]. Projection from 3D to 2D is a smooth and well-behaved nonlinear operation. Newton’s method is based on the assumption that the function being minimized is locally linear; this requires a good starting point to avoid falling into local minima. Lowe [106] proposed this method for object matching. Chen and Stockman [32] build their object model database by obtaining the object silhouettes from various 36 viewpoints. The silhouette of a test image is extracted and matched to an object using this distance minimization technique. 2.3.6 Pose Clustering A transformation of the model and image features, i.e., the translation, rotation, and scaling, can be achieved using pose clustering [143]. This technique derives the transformation between each model and image feature pair; a Hough transform approach is used to accumulate evidence for each candidate transformation. When all such pairs are considered, peaks in the transformation bins produce hypotheses to be verified. 2.3.7 Graph Matching Finding the correspondence between object model features and test image features can be considered a graph matching problem. 
Using a graph structure to represent object features, attributes, and their relationships can be effected by viewing graph nodes as the features and graph edges as relationships among the features. Viewed as such, object recognition reduces to finding Similar structures among subgraphs, commonly called graph isomorphisms. Bolles and Horaud [14] find the maximal cliques in the graphs and use this infor- mation to define the best set of consistent model/ image pairs. 37 2.3.8 Alignment Pose estimation by alignment [81] tackles the problem of recognizing objects from a single two-dimensional image of a 3D scene. The methodology provides a closed-form solution for the transformation of a 3D model coordinate frame to the 2D image coordinate frame using three pair model and image points. Huttenlocher and Ullman [81] uses the alignment method for object recognition. The transformation method is used to produce a set of hypotheses for alignment of the model with an image. A set of features, such as the silhouettes [31], is used to verify the hypotheses proposed by the transformations. 2.3.9 2D object recognition The foregoing sections have focused on the problem of three-dimensional object recog- nition using various representations. Two-dimensional objects, such as characters for document processing applications, are also of interest to the object recognition field. Kim and Yuan [93] use Zernike moments in order to make their character recogni- tion system invariant to translation, scale, and rotation. Using a k-nearest neighbor classifier for character label identification gave very good results. Applications for map reading and lid inspection are addressed and show good recognition of char- acters from the images. Pan and Keane [121] describe a set of moment invariants recognizing characters with differing aspect ratios. A multilayer perceptron was used on the preprocessed images to achieve a good numeral recognition rate of 99.97% for handprinted characters. Buse and Liu [24] present a word recognition system that 38 is able to recognize cursive handwriting. No preprocessing is done on the grey-level images; Gabor filters are used to extract parts from word images. An evidence-based approach for word recognition is used to achieve a 81% rate for the correct word being in the top six choices. 2.4 Face recognition Face recognition is a sub-area of general object recognition of particular interest for a wide variety of applications. Applications to law enforcement for mugshot identification, verification of personal identification such as driver’s licenses and credit cards, gateways to limited access areas, surveillance of crowd behavior all are potential applications to a successful face recognition system. Several basic problems confront researchers in this area. The choice of represen- tation, as with general object recognition, is a critical step toward successful face recognition. Using observable features alone (such as a long nose, long hair, or a round face) pose difliculties for automatic extraction and qualitative quantization. For practical automatic recognition systems, however, a compact representation is essential. Once the choice of representation is fixed, systems to automatically recognize faces must detect faces in sensed images. A scene may or may not contain a set of faces; if it does, their location must be ascertained before recognition can take place. 
This is a complex problem: facial hair, make-up, turbans all obscure facial features which may be useful in localizing faces; large variation in scale, position, and orientation 39 should be handled to make the automatic system practical. When a representation is chosen and face images extracted, the recognition prob- lem remains. Several approaches have been used to find the identity of faces, from holistic approaches to matching based on features. In a practical system, changing facial features over time must be handled; making the association between faces and their labels must be accomplished; multiple instances of the same face under varying imaging conditions, sizes, and orientations must be classified as the same person. 2.4.1 The human capacity for face recognition Though we may be unable to endow machines with the capability of the human visual system, it is a good reference point from which to start. Much research has been done on face recognition, both by machine and biological system researchers. Research issues of interest to neuroscientists and psychophysicists include the human capacity for face recognition [111], the modeling of this capability [127] [140], and the apparent modularity of face recognition [129]; the human facility for learning to recognize faces [29] [65] [49] [114]; the role of distinctive or unusual features in faces for recognition [155]; the degradation of face recognition capability as humans age [7] [112] [156]; and conditions which result in the human inability to recognize faces, such as prosopagnosia [153]. There is evidence to suggest that the human capacity for face recognition is a dedicated process, not merely an application of the general recognition process [48]. Infants are attracted more to face-like patterns than those without patterns [114]. 40 People suffering from prosopagnosia retain the ability to recognize people from various other cues, such as voice, dress, and hair style [157]; they can recognize features of faces and determine whether an object is a face, but are unable to attach an identification [153]. The analysis of what features humans use for face recognition has been cause for much debate. Apparently both global and local feature information are used in an hierarchical manner, the local features providing a finer classification system for facial recognition [75]. The features tend to become less important as the extraction process moves down the face [20]. “Attractive” faces tend to be recognized most accurately and readily; “unattractive” faces are able to be recognized almost as well; faces having little aesthetic weight either positively or negatively are the most difficult to recognize. This gives support to the theories suggesting that distinctive faces are more easily recognized than typical ones [6]. Information contained in low Spatial frequency bands are used in order to make the determination of the sex of the individual, while the higher frequency components are used in recognition [137]. Young children typically recognize unfamiliar faces using unrelated cues, such as glasses, clothes, hats, and hair style. By age twelve, these paraphernalia are usually reliably ignored [28]. Psychosocial conditions also affect the ability to recognize faces. Humans may code a mean face with average attributes; these means may be different for different races, and recognition may suffer from prejudice or unfamiliarity with the class of faces from another race [59] or gender [60]. 
The emulation of this human capacity for face recognition is the goal espoused by most computer vision researchers in face recognition. Sample approaches taken by 41 various research groups is summarized in the following sections. 2.4.2 Representations for automatic systems 2D intensity images are usually used as input to automatic systems. Though it is not very compact, the images themselves are good for robustness in recognition. For detection of a face in an image, a resolution of 16 x 16 is sufficient; for recognition, 3 32 x 32 at 4 bits per pixel (sixteen grey level values) intensity image is sufficient [134]. Therefore, the use of this resolution of 2D images would require only 512 bytes for each image to be recognized. Alternatively, a feature vector could be extracted from the face. The features are derived from the intensity or edge map images, and usually include items such as the hair intensity, the distance between the eyes, or the size of the eyes. Features derived from face profiles typically include a set of characteristic points on the profile (such as the notch between the brow and the nose or the tip of the nose) and the angles between these points. Combinations of these representations have also been employed. Galton [58] defined a set of characteristic points and a collection of curves for the profile shapes at or adjacent to those points. The index of the shape curve for the characteristic point was used for a description of face profiles. Hong [78] worked on recognizing photographs. Each photograph was sampled using five different relative positions between the photograph and the camera. A Singular value decomposition of the image matrix extracted the representation for this work as an n-dimensional 42 single value feature vector. As a combination of features and 2D images for face representation, Campbell, et al. [27] utilized hair and cheek intensity values coupled with subimage patches for eye representation in order to recognize faces. The Karhunen-Loéve projection, though it has been studied for many years [88] and used in countless pattern recognition applications [84], has recently been revisited for face representation [94] [139], face recognition [154] [122], and for planning the illumination of objects for future recognition tasks [116]. Sirovich and Kirby [139] do not attempt to recognize faces, but their work has been at the heart of many other face recognition attempts. They use the Karhunen— Loeve projection in order to represent face images. Any image in the training set can be approximately reconstructed using a linear combination of the “eigenpictures” (see section 3.3 for a description of how these eigenvectors of the training set can be computed); arbitrary images not found in the training set can also be approximated using a weighted combination of eigenvectors obtained from the training set. They extend this work [94] by exploiting the near-symmetrical nature of faces in their representation. They augment their training set of images using the mirror images of these training images in an attempt to reduce the errors involved in reconstructing the faces. Though no representation approach has established itself as being far superior to others, the use of feature vectors, either implicitly derived or explicitly extracted from images, are by far the most popular. 43 2.4.3 Face Localization In order to automate the face recognition process, the system must be able to find a face in the input image(s). 
Though this is a typical problem in most object recognition systems, much research effort has been expended in the pursuit of face localization in particular to assist in the face recognition process. Sakai, et al. [132] seeks to determine the existence of a face in a set of digitized photographs. A Sobel-type edge operator found pixels with the largest magnitude of gradient values; neighboring pixels with Similar properties were joined in order to obtain a silhouette of the face. These outlines produced a similarity measure when matched with a pre-defined face template. Once an outline was identified as a face silhouette, a coarse-to-fine approach discovered individual features within the face outline. Nixon [119] locates the eyes using a Hough transform. The iris has a nearly cir- cular perimeter in most cases. Furthermore, the iris is expected to have gradient directions going out in each quadrant. A Sobel operator produces gradient direc- tions by quantizing the gradient magnitudes, and these data are used in the Hough transform technique. Conlin [38] detects faces from artist sketches using a knowledge based approach for task organization. The system utilizes a face template which contains both eyes, the nose and the mouth; various levels of image information are used. Line segments propose component parts; component parts propose components (e.g., nose), and components propose the layout of the face. The best candidates at each level survive 44 and are propagated to the next level; competing candidates are explored to provide the best possible detection results. Craw, Ellis, and Lishman [41] employ a multi-resolution scheme to locate the face in an input image. At a low resolution of the image, a Sobel edge detector in conjunction with a line follower algorithm finds the outline of the head. At each level of the resolution hierarchy, the template of the head position is propagated to the next finer resolution of the image for further refinement. Once the head outline has been reliably extracted, the process is repeated for internal facial features (such as eyes) using the head outline as a guide. For example, to find the lips and eyebrows, a vertical line is drawn in the lower quarter of the face outline (for finding the lips) or in the upper quarter of the face outline (for finding eyebrows); each pixel of this line is a seed point for line finding algorithms. A collection of these lines that fit in a thin box indicates that the desired facial features have been found. Yuille, Cohen, and Hallinan [167] define an energy function to be used with de- formable templates. These templates are altered to find the best representation of the shape present in the image. Preprocessing the intensity input images using mor- phological filters generates representations of peaks and valleys in the images. A collection of energy surfaces are defined, and the templates are deformed to Simulta- neously seek good minima in all of the surfaces. The parameter values of the template are modified to minimize the energy function. Govindaraju, Srihari, and Sher [64] extend the work by Yuille, et al. [167] by basing their template on three arcs of the head Silhouette which have curvature dis- continuity, viz. the right, left, and top sides of the head. The chord vector connecting 45 the ends of the arc is used to define an area and a region centroid that are used with the arc length and the chord itself to form a four-tuple for each of the three arcs to be used as a template to deform. 
Considering these three arcs as a supersegment [141] gives a face region whose center is the center of the face being located. Ballard and Stockman [5] locate the eye twinkles, i.e., the bright specular reflections from the cornea. The twinkles were found to be the brightest points on the grayscale images. A two-dimensional binary search of the histograms quickly finds the locations of these twinkles; shape heuristics are used to rule out spurious twinkles. Using the positions of the eye twinkles, a similar procedure was used to find a much weaker "nose twinkle" along the bisector of the image columns containing the eye twinkles. Hallinan [68] views eyes as having two regions: the iris and the white region of the eye. Each of these regions has a nearly uniform intensity. The non-uniform intensity values of the eye are modeled as noise in the ideal template (model) of the eye. For input images, a valley-detection algorithm is used to find the iris surrounded by the white of the eye. A deformable template is fit to the data using both the intensity surface and the edge surface. Craw, Tock, and Bennet [42] use a hierarchical coarse-to-fine search to find the location of a head. At each level, simulated annealing produces a polygonal approximation to the head silhouette; once the best such silhouette is achieved, the approximation is refined by adjusting each vector of the polygon representing the silhouette. Manjunath, Chellappa, and Malsburg [107] use Gabor wavelet decomposition to extract features at points of maximal curvature in the images. In a particular neighborhood, a point (x, y) is selected as a feature point if its response to the wavelet representation of the Gabor function is maximum. Reisfeld and Yeshurun [128] seek facial features using the almost symmetric nature of the face about the center of the nose. Paired points in the image about the axis of near-symmetry are given a symmetry measure value. Points with a high response to this measure correspond to facial features such as eyes or cheeks. Weng, Ahuja, and Huang [160] combine the localization and recognition steps in a general, self-organizing paradigm. Their system was demonstrated primarily for face segmentation and recognition, but the general learning technique was also applied to other, real-world objects. The Cresceptron is described in more detail in section 2.5.

2.4.4 Face Recognition

In any system that can be utilized for automatic recognition using faces, an identity must be assigned to the representation of the input images. This implies that the faces to be recognized must be suitably organized in a database for future retrieval. Using this database of faces, a label must be assigned to an input facial image. Furthermore, this label must be consistently and robustly assigned under large variances in the image acquisition parameters. Typically, during the system setup phase, a set of features that can be reliably extracted are identified, and the set of "known" faces are encoded using this set of features in the face database. Then at system run-time, novel face images undergo the feature extraction process to place them in the same feature space as the face database, and a matching algorithm finds the best candidate(s) for the face identity [133]. For the most part, the different approaches to face recognition differ only in the features selected and the matching procedures used.
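The setup-time/run-time split described above can be summarized in a short sketch (illustrative only; `extract_features` stands in for whatever representation a given system adopts), in which known faces are encoded once and a novel face is assigned the label of its nearest neighbor in feature space:

```python
# Illustrative sketch of the generic face-recognition pipeline: encode the
# gallery of known faces as feature vectors at setup time, then label a probe
# image by the nearest database vector (Euclidean distance).
import math

def build_face_database(labeled_images, extract_features):
    return [(label, extract_features(img)) for label, img in labeled_images]

def identify(probe_image, database, extract_features):
    probe = extract_features(probe_image)
    best_label, best_dist = None, float("inf")
    for label, vec in database:
        d = math.sqrt(sum((p - v) ** 2 for p, v in zip(probe, vec)))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```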
Some of the pioneers in face recognition include Kaya and Kobayashi, who manu- ally extracted facial features in 1972; Kanade, who automated the feature extraction and recognition of faces in 1977; and Kohonen who applied the use of neural network technology for recalling a face from a noisy image in 1988. Kaya and Kobayashi [92] studied classifying a restricted domain of faces. The subjects used in the experiments looked directly into the camera, kept their mouths closed; they had neither glasses, nor facial hair, and all the images were taken under identical illumination conditions. Two basic types of noise were identified: the inher- ent noise, namely the variation of features of an individual from one image to another, and extraction noise, namely the errors involved in extracting parameters from the images. Euclidean distances between manually identified points in the images char- acterize the faces in this work, and the parameters thus obtained were normalized by dividing each length by the nose length. Kanade [90] identified several steps for face recognition: the selection of effective features to identify the face, automatic extraction of those features, and automatic classification based on the automatically extracted features. The subjects in the experiments had no glasses and no facial hair; two images of each subject were used, taken a month apart under different conditions. One of the images was used in the training set and the other in the test set. The method for characterizing the face is the parameterization of the facial geometry using the distances and angles between 48 eye corners, ends of the mouth, nostrils, and t0p of the chin. A two—stage process, a low- and a high-resolution scan, located the requisite features in the images. The low-resolution scan found regions for the two eyes, the nose and the mouth. Each of the regions is then scanned at the higher resolution, producing more precise features. Kohonen [98] demonstrated the use of neural network technology for face recol— lection applications. Even when the input images were very noisy or had portions missing, an accurate recall capability was achieved on a small set of face images. More recently, a renewed interest in the field of face recognition has appeared on the computer vision research scene. This is due in large part to technological improvements in the ability to perform large computational efforts and the application of new or the reorientation of previously seen techniques applied to face recognition. Stonham [144] distributed a set of single-layer adaptive neural networks to each person found in the database to perform face recognition, expression analysis, and face verification. Several hundred presentations were required to train the classifiers, providing changes in facial expression and translation of input images. The network that produced the highest response among all database networks was the identified person for a test image. The Dynamic Link Architecture [23] [101] can form sets of neurons grouped into structured graphs. Using Gabor-based wavelets as the features, object recognition invariant with respect to background, translation, scale, and distortion was achieved. An image is represented as a graph with attributes attached to the graph’s nodes; an object in the image is represented by a subgraph in the image graph. Recognition is thus reduced to the process of elastic graph matching. 49 Golomb and Sejnowski [61] classify the sex of the individual from their face image. 
Two three-layer neural networks are trained, but the hidden layer of the first is given as input to the second. The first network is trained as an image compression network, reducing a 30 x 30 = 900-node input layer to 40 hidden nodes. These 40 hidden nodes are given as input to the sex classification network, which is another three- layer network with a single output node indicating the sex of the person in facial input image. Turk and Pentland [154] use the Karhunen-Loeve projection in order to detect and recognize faces. The eigenvectors of the covariance matrix for the training set (“eigenpictures” in Kirby and Sirovich [94]) are called “eigenfaces” in this work. Every face in the database of “known” faces can be projected onto each of the eigenfaces to produce a weighting of how much that eigenface contributes to the reconstruction of the database face. When the database face has been projected to each of the eigenfaces, a weight vector is thus produced. When a novel face image which may or may not be in the database of known faces is presented, this same projection is done to produce another weight vector. The Euclidean distance between the test image’s weight vector and each of the face database weight vectors is computed; recognition is achieved by finding the individual in the face database with the smallest distance so obtained. They also observe that faces and non-faces produce very different weight vectors, and they uSe this information in order to detect whether a face exists in the image. Akamatsu, et al. [2] normalize the position and Size of the face in the images. Each target subimage is sent through an affine transformation so that the eyes and mouth 50 are placed in a pre-defined geometric arrangement within a standard face window. From this standard face window, two different experiments were performed. In the first experiment, the Karhunen-Loeve projection is done in a similar manner to that done by Turk and Pentland [154]. For the second experiment, the Fourier Transform is applied to this standardized image, and the Karhunen-Loéve projection is applied to the Fourier spectrum. The Fourier approach allowed for much more variation in Shifting the standardized image points, but both systems provided good recognition results. Nakamura, Mathur, and Minami [117] use curves of constant gray level (isodensity lines) for face recognition. The silhouettes of face images on a black background are discovered with a Sobel edge detector, and recognition is done with template matching on a small (ten pairs of images) data set. Rahardja, Sowmya, and Wilson [126] employ a pyramid of neural networks for expression recognition. The training images are partitioned into overlapping blocks, each block of which serves as input to the lowest level of networks in the pyramid. Training of the networks at the lowest level of the pyramid use backpropagation to extract features from the images. The hidden layer of these networks are combined to the networks at the next higher level in the pyramid. The training images were hand- drawn faces with varying expressions; the pyramid successfully recalled the trained data, but performed badly (approximately 50%) in recognizing distorted images. Seibert and Waxman [136] use a neural network to recognize faces from their parts. The eyes, nose, month are extracted from a facial image using symmetry detection. Considering the facial features as 2D views of a 3D object, ART 2 is utilized to produce 51 a clustering of the feature vectors. 
An aspect network integrates the accumulation of evidence for recognition. Gordon and Vincent [63] [62] use range data for face recognition. The image is segmented into planar and spherical regions and surfaces of revolution. From this segmented range image, the nose is located to aid in the location of the eyes. Faces are normalized to a standard position and orientation using the positions of the eyes and nose. The volume between the test image and the database image in this normalized Space is the criterion for matching. Manjunath, Chellappa, and Malsburg [107] [108] recognize faces by combining Spa- tial location in the image with the feature points obtained using the Gabor wavelet decomposition in a local neighborhood. These vectors are considered as the vertices of graphs, connected using the Spatial information from the image. The graph from an input image is matched with a database graph, first by finding feature pairings be- tween the graphs, followed by combining this information with the topology difference of the two graphs. Cheng, Liu, Yang, and Wang [36] use three training samples for each individual to produce a mean image for the individual. The eigensystem for this average image is computed using singular value decomposition, thresholding the eigenvalues near zero. The average eigenvectors obtained for all of the mean images are computed to produce the feature vectors used in the system. A novel image is projected to this average eigenvector space to produce a feature vector. Brunelli and Poggio [22] look at the problem of gender classification from face images. The positions of the faces in the images are standardized by automatically 52 detecting the eyes and placing them at a standard position. A HyperBF network [125] is trained for each of the two genders, and during the test phase, the gender network with the larger output is taken as the appropriate one. This same technique was extended to face recognition, training a network for each person. Synthetic data was generated from the small set of real images by placing points around the average of the feature vectors and using these vectors for training, using the real data as test images. Weng, Abuja, and Huang [160] demonstrate a general learning paradigm for object segmentation and recognition. Their system is a self-organizing recognition system for general object recognition; as such, it does not rely on any hand-crafted shape rules extracted from the objects to be recognized. Their system is described in more detail in section 2.5. Pentland, Moghaddam, and Starner [122] extend the work done in [154] regarding the Karhunen-Loeve projection for face recognition by taking a sample of a very large database of faces as the training set from which to compute the eigenvectors used for weight vector calculation. The database consisted of many different views of each individual and wide variation in the paraphernalia (such as glasses and hats) that the person was wearing. Using the eigenface techniques, the database could be considered as a whole or it could be divided into multiple sub-databases consisting of the various views available. Their work favors the use of multiple eigenspaces for view-based face recognition. The eigenface concept was also extended to include such features as “eigeneyes” and “eigenmouths.” These feature eigenspaces produced only a marginal improvement in the recognition rates. 
The work explores various applications to a 53 face database of this size and reports good results on each tested application. One of the applications is an interactive database search. Constraining an initial search using annotations associated with each individual in the database, a face can be selected for further search. Using the eigenface recognition technique, a list of faces is presented to the operator in decreasing similarity. 2.4.5 Profile Recognition Due to the requirements of law enforcement applications to face recognition, some research in the area of profile recognition has also been done. For profile recognition, the Size and shape of the nose is of primary importance, whereas it is of less importance for frontal view face recognition. Kaufman and Breeding [91] use binary images for profile recognition using the profile silhouettes. A set of normalized autocorrelations is utilized as a feature vector; the k-nearest neighbor rule is used for classification. Harmon et al. [71] [72] [73] [74] treated the profile as a waveform to achieve profile recognition. From a set of characteristic points on the profile, Six feature character- istics (protrusion of nose, wiggle, profile triangle angles, angle between characteristic points) were established. Eleven features from these were used as a feature vector, and Euclidean distance was used to find a good match. Set partitioning was used to reduce the number of candidate matches for the Euclidean distance calculation, thereby both improving the performance and reducing the computational complexity. Wu and Huang [166] use B-splines to extract points of interest. A feature vector 54 is generated using the distances between neighboring points of interest, the curvature joining adjacent points, etc. The feature vector thus extracted is compared with the database vectors for matching purposes. 2.5 Self-organizing approaches The previous sections have discussed various techniques for representing and rec— ognizing 3D objects in general and faces in particular. The research emphasis has historically been on the design of efficient matching algorithms from a manually de- signed feature set with hand-crafted shape rules. This manual design of a feature set is appealing because typically the system can be designed to be very efficient. When designed properly, a very small number of parameters for each of the objects or faces is sufficient to capture the distinguishing characteristics among the objects to be recognized. These hand-crafted features, however, can make the system brittle in that it is unable to Operate except for those domains for which it was specifically designed. An alternative to hand-crafting features is the self-organizing approach. Auto- matic selection of features is a crucial step toward a general, fully automatic recogni- tion system. In a self-organizing approach, the structure of the knowledge base is a function of the training samples given to the system. With such a system, the human designers must provide an environment for learning, but most of the design details are left to the learning phase of the system to complete. The Cresceptron [160] is such a self-organizing system. Similar to the Neocogni- 55 tron [57] [56] in that the Cresceptron uses multilevel retinotopic layers of neurons, the Cresceptron automatically determines the configuration of its network in the learning phase. 
The Cresceptron learns in an incremental and unsupervised manner, automat- ically creating an hierarchical image analysis structure to describe images. Since the Cresceptron forms a hierarchical structure, it is able to tolerate deviation at various levels. Near the root of the hierarchy, the tolerance is large; near the leaves, this tolerance is smaller in order to handle perceptually similar objects from a relatively small set of training samples. Unlike most neural networks, the Cresceptron does not use back prOpagation for learning. Instead, the Cresceptron analyzes the structure of an object in a bottom-up manner, as proported by Kohonen [98] to be the way biological learning takes place. No pre-segmentation is required by the Cresceptron, Since the system combines the segmentation and the recognition steps, and different features for different objects may be selected based on the training set. The network that the learning phase of the Cresceptron constructs consists of several levels; each level has several layers, and each layer has several neural planes. A neural plane is a square set of nodes that represents a concept; the response at a certain position in the neural plane is representative of the presence of the concept. At the lowest level, the nodes correspond to the pixels in the input image. The higher layers consist of pattern detection layers, node reduction layers, blurring layers, or combination of other types of layers. Though the available layers and their function have been pre—determined, the configuration of the network connections is automatically learned from the training samples. The network can continue to learn in an incremental fashion as new training 56 samples become available. The Cresceptron network grows very quickly, however [159]. Regardless of the input image, every node in the network must be visited. While in a parallel hardware implementation this may be acceptable, for the current state of software simulation, the system runs more and more slowly as more images are learned. A concept is regarded as new and memorized whenever it is significantly different from those that have been previously learned. Instead, an optimal feature selection procedure Should be employed. 2.6 Summary Object recognition is a difficult problem in general. The sensing scheme, the selection of features and their representation, the matching procedures and the model database indexing techniques all contribute to this difficult problem. A summary of popular sensing schemes to bring the images from the world into the computer is given, including desktop scanners, cameras for intensity images or depth maps, and range scanners are briefly described. Then the problem of model representation is addressed, surveying many popular representations being used in the computer vision community. A survey of object recognition techniques is given. As a sub-area to general objection, face recognition has received much attention in recent years. Both frontal face and profile recognition are important topics for many varied applications, but frontal face recognition has received more attention. A wide variety of representations have been employed, and features ranging from holistic to 57 individual local features or eigenfeatures have been utilized. Using these features, a collection of works has been summarized for locating the face in the image, if one exists. A survey of face recognition techniques has been provided for extracted face images. 
Using the human capacity for face recognition as an ideal for which to strive, the automatic recognition of faces still has a long way to go.

Chapter 3

The Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework (SHOSLIF)

The SHOSLIF is a framework that aims to provide a unified methodology for comprehensive visual learning. Its objective is not just to attack a particular vision problem, but a wide variety of vision problems. It addresses critical problems such as how to automatically select the most useful features and how to automatically organize visual information using a coarse-to-fine space partition tree, which results in a very low, logarithmic time complexity for retrieving from a large visual knowledge base [158]. The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space defined by the training images. This space is generated using two projections: a Karhunen-Loeve projection to produce a set of Most Expressive Features (MEFs), and a subsequent discriminant analysis projection to produce a set of Most Discriminating Features (MDFs). The system builds a network that tessellates these MEF/MDF spaces to provide an O(log n) average time complexity for recognizing objects from images, where n is the number of learned (and stored) samples.

3.1 System overview

The network that is constructed takes the properties of an abstract tree structure. An example of such a network is shown in Figure 3.1. It is this Space-Tessellation Tree that provides the key to the efficient object recognition capability of the system described in this work.

Figure 3.1: A tree view of the SHOSLIF network. Each node corresponds to a processing element.

The shaded blocks in Figure 3.1 indicate nodes which may be visited during the recognition phase. The leaf nodes represent the smallest cells defined by the training set. As the processing moves down from the root node of the tree, the Space-Tessellation Tree recursively subdivides the training samples into smaller problems until a manageable problem size is achieved. When a test object is presented to a node, a distance measure from each of the node's children is computed to determine the most likely child to which the test object belongs. At each level of the tree, the node that best captures the features of the test object is used as the root of the subtree for further refinement, thereby greatly reducing the search space for object model matches.

Figure 3.2 provides a more detailed view of the processing elements found in the network. The Space-Tessellation Tree may be invoked multiple times, each invocation corresponding to an attention point and scale of the visual saccades learned in the training phase of the system. Each time the tree is invoked, it uses a fovea image extracted from the original image.

Figure 3.2: The network structure of the SHOSLIF. Each level of the network has a variable number of nodes. A node at level l0 has a sub-network at levels l > l0. A node consists of a processing module and an attention supplier. As indicated by the multiplexers in the figure, each of the processing modules can activate one of the attached nodes at the next level of the network. Each of the processing modules controls an attention module, which is responsible for supplying properly aligned fovea images to the activated node.

3.2 Visual attention

This fovea image is produced using a visual attention mechanism. The visual attention mechanism provides a well-framed portion of the image that contains a single object of interest in a standard position and scale. During the training phase, the attention mechanism finds areas of interest using a scanning technique. The scanning is accomplished using a manual selection, a pixel-by-pixel search [154], a simple raster scanning approach [160], or a more sophisticated technique such as described in [145]. In manual selection, the operator uses the mouse to select the area of interest from a displayed image. A simple raster scan is used in [160]: the input image is scanned with a fovea window, and at various predefined points in the raster scan, the subimage subtended by the fovea window is extracted as a fovea image to process.

Once an area of interest is selected, it is mapped to a standard-sized "fovea" by the foveation procedure. Because the fovea represents a fixation point in the visual saccades performed on an input image, we consider those pixels nearer the center of this fovea to be more important than those nearer the edges. Therefore, the pixels in the fovea are weighted appropriately. The center pixel receives a weight of 1, and the pixels on the edge of the fovea receive a weight of 0, with a linear interpolation between these weights for the pixels in between. This ensures that the fixation point that produced the fovea receives more attention than the background. The details of this visual attention mechanism are beyond the scope of this document and are given in [159]. Visual attention is utilized to limit the scope of images processed by this system to those images containing a single, well-framed object.

A top-level flow diagram for the processing done in each of the Space-Tessellation Tree's processing elements during the learning phase is given in Figure 3.3. In order to allow for generalization to variations in the position, scale, and orientation of the objects in the training samples, these parameters are varied slightly during the training phase.

Figure 3.3: A top-level flow of the processing performed at each node in the Space-Tessellation Tree during the training phase. A set of training samples which enter the processing element are extended to allow for learning-based generalization for position, scale, and orientation. These extended samples are vectorized and used to produce the projection matrices to the MEF and MDF subspaces. The extended samples are projected to the MDF subspace using these matrices, and a tessellation of the space covered by the node being worked on is produced. The projection matrices and the space tessellation for each node are produced in the learning phase.

Because the attention supplier provides a rough estimate of the position and scale of the area of interest, only a relatively small variation around this point need be expanded using the learning-based parameter generalization. The span of the variation for each of the parameters (e.g., row and column positions in the image, size of the extracted fovea) is divided into a uniform grid of points. A new fovea is extracted from the image at each of these grid points, generating an extended training set designed to handle the recognition generalization to position and size variations.
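A minimal sketch of this grid-based expansion (illustrative; the grid offsets, scale factors, and the `extract_fovea` routine are hypothetical stand-ins, not the values used in the thesis):

```python
# Illustrative sketch of the learning-based parameter generalization: the
# estimated row, column, and fovea size are each varied over a small uniform
# grid, and a fovea is extracted at every grid point to extend the training set.

def expand_training_sample(image, row, col, size, extract_fovea,
                           offsets=(-4, 0, 4), scales=(0.9, 1.0, 1.1)):
    extended = []
    for dr in offsets:
        for dc in offsets:
            for s in scales:
                fovea = extract_fovea(image, row + dr, col + dc, size * s)
                extended.append(fovea)
    return extended
```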
A new fovea is extracted from the image at each of these grid points, generating an extended training set designed to handle the recognition generalization to position and size variations.

3.3 The Most Expressive Features (MEF)

Each input subimage can be treated as a high-dimensional feature vector by concatenating the rows of the subimage together, using each pixel as a single feature, as shown in Figure 3.4. Thus each image is considered as a sample point in this high-dimensional space.

Figure 3.4: The vectorization of an image. SHOSLIF treats a p x q image as a point in pq-dimensional space. The rows of pixels in an image are concatenated together to form a vector.

3.3.1 Density of the space

This high-dimensional space is sparsely populated with interesting images. For a p x q image with g grey levels, there are a total of $g^{pq}$ possible points. A typical grayscale image has g = 256 pixel values. For the experimentation described in this work, the image dimensions were fixed at p = 64 and q = 88, so the number of possible points in the space that this work utilizes is $256^{5632}$, which is a very large number. Most of these points are not interesting images to consider in a vision processing system, however. For example, an all- (or near-) white image or an all- (or near-) black image is useful for testing how the system handles degenerate cases, but not interesting from a practical standpoint of vision processing in the real world. In addition, contrived images, such as images with alternating white and black pixels, though they exist in the space, do not occur naturally and are therefore not very interesting images to use except to test the handling of these outlying cases.

In fact, the interesting images tend to be grouped together in a small area. For the database of images used in this work, the maximum Euclidean distance between images viewed as points in this space is less than 16,000. Figure 3.5 shows a histogram of the distances between image vectors in the database.

Figure 3.5: A histogram of inter-point distances for the images in the database treated as vectors. The horizontal axis shows the distance between a pair of image vectors; the vertical axis shows the number of pairs that fell into each bin.

Image instances of a particular object can be represented by an n-dimensional random vector X. X can be expanded exactly by

$$X = VY,$$

where the columns of the n x n square matrix V are orthonormal basis vectors; this makes Y a random feature vector of the image X. Without loss of generality, Y can be considered to be centered about its mean $\mu_Y$, since we could always redefine $Y - \mu_Y$ as the new vector.

3.3.2 Principal component analysis

The dimension n of X is usually very large, on the order of several thousand even for small image sizes. Since we expect that a relatively small number of features is sufficient to characterize a class, it is efficient and reasonable to approximate X using m < n columns of V to give

$$X^{(m)} = \sum_{i=1}^{m} y_i v_i,$$

where the $v_i$'s are the column vectors of V.
Let the effectiveness of the approximation be defined as the mean-square error $\|X - X^{(m)}\|^2$. Then we can use the proven result [55] [89] [105] that the best vectors $v_1, v_2, \cdots, v_m$ to use are the unit eigenvectors associated with the m largest eigenvalues of the covariance matrix of X,

$$\Sigma_X = E\left[(X - M_X)(X - M_X)^t\right],$$

where $M_X$ is the mean (expected) vector of X. During the training phase, $\Sigma_X$ is replaced by the sample scatter matrix. Then the features $y_1, y_2, \cdots, y_m$ can be easily computed from

$$y_i = v_i^t (X - M_X), \quad i = 1, 2, \cdots, m.$$

Since these $y_i$'s give the minimum mean-square error, we call them the Most Expressive Features (MEF) because they best express the population in the sense of a linear transform.

This projection, also called the Karhunen-Loeve projection and principal component analysis [84], has been used to represent [94] and recognize [154] [122] face images, and for planning the illumination of objects for future recognition tasks [116].

To determine m, the number of features to use, we first rank the eigenvalues of $\Sigma_X$, $\lambda_1, \lambda_2, \cdots, \lambda_n$, in non-increasing order. The residual mean-square error in using m < n features is simply the sum of the eigenvalues not used, and is a natural criterion for determining how many features are needed to sufficiently represent a class. We can choose m such that the sum of these unused eigenvalues is less than some fixed percentage P of the sum of the entire set. So we let m satisfy

$$\sum_{i=m+1}^{n} \lambda_i < P \sum_{i=1}^{n} \lambda_i.$$

If P = 5%, a good reduction in the number of features is obtained while retaining a large proportion of the variance present in the original feature vector [84] [154], thereby losing very little of the original population-capturing power.

3.3.3 Computational Considerations

We can approximate the covariance matrix $\Sigma_X$ with the sample scatter matrix $S = UU^t$, where $U = [U_1 U_2 \cdots U_k]$ and $U_i = X_i - \bar{X}$, for k training samples. If k < n, as is typically the case when dealing with a small number of training samples relative to the image dimension, S is degenerate. When this happens, however, we can find the eigensystem of the k x k matrix $U^t U$. This means that

$$U^t U w_i = \lambda_i w_i,$$

with eigenvalue $\lambda_i$ and associated eigenvector $w_i$. Pre-multiplying by U gives

$$U U^t U w_i = \lambda_i U w_i.$$

Then $v_i = U w_i$ is the eigenvector of $S = UU^t$ with eigenvalue $\lambda_i$. If the number of samples available is more than the image dimension, then the eigensystem of $UU^t$ can be computed directly.

For the database of images used in this work, each image was projected to the MEF space to find the density of the vectors in this MEF space. Figure 3.6 shows a histogram of the distances between MEF vectors in the database.

Figure 3.6: A histogram of inter-point distances for the images projected to the MEF space. The horizontal axis shows the distance between a pair of MEF vectors; the vertical axis shows the number of pairs that fell into each bin.

The MEF vectors are well-suited to image approximation. Figure 3.7 demonstrates the capability of the MEFs to effectively approximate an image set. Images of a single individual were used for training; variations in lighting and expression were present in the training samples. We can view the MEF vectors as images by reversing the vectorization procedure shown in Figure 3.4.
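A minimal NumPy sketch of the MEF computation, including the small-sample shortcut of Section 3.3.3 (finding the eigensystem of the k x k matrix $U^t U$ when k < n), might look as follows; the function names and the 95% variance threshold are illustrative assumptions.

    import numpy as np

    def mef_projection(samples, var_fraction=0.95):
        # samples: (k, n) array, one vectorized image per row.
        k, n = samples.shape
        mean = samples.mean(axis=0)
        U = (samples - mean).T                    # n x k, columns U_i = X_i - mean
        lam, W = np.linalg.eigh(U.T @ U)          # eigensystem of the k x k matrix
        order = np.argsort(lam)[::-1]
        lam, W = lam[order], W[:, order]
        V = U @ W                                 # v_i = U w_i are eigenvectors of U U^t
        V /= np.linalg.norm(V, axis=0) + 1e-12    # normalize to unit eigenvectors
        # keep enough eigenvectors to retain var_fraction of the total variance
        keep = int(np.searchsorted(np.cumsum(lam) / lam.sum(), var_fraction)) + 1
        keep = min(keep, k)
        return mean, V[:, :keep], lam[:keep]

    def to_mef(x, mean, V):
        # y_i = v_i^t (X - M_X)
        return V.T @ (x - mean)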
Figure 3.7: Example MEFs and their ability to approximate a training set; the training images are shown alongside their approximations. The training images were taken from the Weizmann Institute of Science in Israel. The images of a single individual are shown.

Figure 3.8 shows samples of the MEF vectors for a large training set of images with variations in lighting direction and expression. Six images of each of 28 individuals were used as the training set; each individual had two expressions taken under three different lighting conditions. As can be seen in the figure, the most important features are due to the lighting direction and the expression changes.

Figure 3.8: Example MEFs for a large training set. The top thirty are shown for a training set of 168 images. Six images of each of 28 individuals were used for training this set.

3.4 The Most Discriminating Features (MDF)

Although the MEF projection is well-suited to object representation, the features produced are not necessarily good for discriminating among classes defined by the set of samples. The MEFs describe some major variations in the set of sample images, such as those due to lighting direction; these variations may well be irrelevant to how the classes are divided. Figure 3.9 shows an example of a two-dimensional case where the MEF projection cannot separate classes in the population. In this figure, the MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation. This clear separation is provided by the discriminant analysis procedure. The features used to effect this clear separation are called the Most Discriminating Features (MDFs) because, in a linear sense, they optimally discriminate among the classes represented in the training set.

Figure 3.9: Problems with the MEF projection for class separation. The MEF projection onto the principal component Y1 is unable to separate the two obvious classes. A projection onto Z1, however, gives a clear separation.

3.4.1 Multivariate linear discriminant analysis

Let W be a projection matrix that projects a vector into the MDF subspace. Vector $Z = W^t Y$ is a new feature vector from samples of c classes with class means $M_i$, $i = 1, 2, \cdots, c$. Then the within-class scatter matrix is defined as [84]

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (Y_j - M_i)(Y_j - M_i)^t$$

for the $n_i$ samples from class i. For a grand mean vector M over all samples from all classes, the between-class scatter matrix is defined as

$$S_b = \sum_{i=1}^{c} (M_i - M)(M_i - M)^t.$$

In discriminant analysis, we want to determine the projection matrix W that maximizes the ratio $\det\{W^t S_b W\} / \det\{W^t S_w W\}$. In other words, we want to maximize the between-class scatter while minimizing the within-class scatter. It has been proven [51] [163] that this ratio is maximized when the column vectors of the projection matrix W are the eigenvectors of $S_w^{-1} S_b$ associated with the largest eigenvalues. Then the scalar components of Z are the feature values of the given samples, and the column vectors of W are the MDF feature vectors.

3.4.2 Computational Considerations

Because the matrix $S_w^{-1} S_b$ need not be symmetric, the eigensystem computation could be unstable.
To avoid this problem, the following method diagonalizes two symmetric matrices and produces a stable eigensystem computation. Compute H and $\Lambda$ such that

$$S_w = H \Lambda H^t, \qquad (3.1)$$

where H is orthogonal and $\Lambda$ is diagonal. Then

$$\left(H \Lambda^{-\frac{1}{2}}\right)^t S_w H \Lambda^{-\frac{1}{2}} = I. \qquad (3.2)$$

Now compute U and $\Sigma$ such that

$$\left(H \Lambda^{-\frac{1}{2}}\right)^t S_b H \Lambda^{-\frac{1}{2}} = U \Sigma^2 U^t, \qquad (3.3)$$

where U is orthogonal and $\Sigma$ is diagonal. Then

$$S_b = H \Lambda^{\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t, \qquad (3.4)$$

$$S_w = H \Lambda^{\frac{1}{2}} U I U^t \Lambda^{\frac{1}{2}} H^t. \qquad (3.5)$$

Defining $V = H \Lambda^{-\frac{1}{2}} U$, V diagonalizes $S_b$ and $S_w$ at the same time. Since

$$S_w^{-1} = H \Lambda^{-1} H^t, \qquad (3.6)$$

Equations 3.4 and 3.6 give

$$S_w^{-1} S_b = H \Lambda^{-1} H^t H \Lambda^{\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t = H \Lambda^{-\frac{1}{2}} U \Sigma^2 U^t \Lambda^{\frac{1}{2}} H^t = V \Sigma^2 V^{-1}.$$

That is, V consists of the eigenvectors of $S_w^{-1} S_b$, and $\Sigma^2$ contains the eigenvalues of $S_w^{-1} S_b$. Thus, using the symmetric properties of the component scatter matrices, we have a stable method for finding the eigensystem of $S_w^{-1} S_b$.

3.4.3 The DKL projection

The discriminant analysis procedure breaks down, however, when the within-class scatter matrix $S_w$ becomes degenerate. This can happen when the number of samples is smaller than the dimension of the sample vectors. Using discriminant analysis directly on the images will generally render $S_w$ non-invertible due to the high dimension of the input image relative to the number of training samples. For example, even a very small image of 64 x 64, when vectorized, turns into a 4096-dimensional sample vector. The number of training samples available is usually much smaller than this large dimension.

The resolution to this problem is to perform two projections instead of one. The discriminant analysis projection is performed in the space of the Karhunen-Loeve projection (i.e., the MEF space), where the degeneracy does not occur. We first project the n-dimensional image space onto the m-dimensional MEF space. We choose m such that for s training samples from c classes, $m + c \leq s$. In fact, it is impossible for $m > s - 1$, since there are at most $s - 1$ non-zero eigenvalues in the Karhunen-Loeve projection. But we further constrain the final dimension of the MEF space to be less than the rank of $S_w$ in order to make $S_w$ non-degenerate. Since there are at most $c - 1$ non-zero eigenvalues of $S_w^{-1} S_b$, we choose $k \leq c - 1$ to be the final dimension of the MDF space. So we relate the dimensions of the MEF and MDF spaces as

$$k \leq c - 1 \leq m \leq s - c.$$

Thus, the new overall discriminant projection is decomposed into two projections: the Karhunen-Loeve projection followed by the discriminant projection. We call this new projection the Discriminant Karhunen-Loeve projection (DKL projection). We first project the n-dimensional image space (X-space) onto the m-dimensional MEF space (Y-space) using the Karhunen-Loeve projection. Then we project the m-dimensional MEF factors onto the MDF space using the discriminant analysis projection.

Definition 1 (DKL projection) The DKL projection to the Most Discriminating Feature (MDF) space is $Z = W^t V^t X$, where V is the projection matrix from the image space to the MEF space, and W is the projection matrix from the MEF space to the MDF space.

For the database of images used in this work, each image was projected to the MDF space using the DKL projection to find the density of the vectors in this space. Figure 3.10 shows a histogram of the distances between MDF vectors in the database.

Figure 3.10: A histogram of inter-point distances for the images projected to the MDF space. The horizontal axis shows the distance between a pair of MDF vectors; the vertical axis shows the number of pairs that fell into each bin.
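The discriminant (MDF) projection computed in the MEF space can be sketched along the same lines; the code below follows the two-step simultaneous diagonalization of Section 3.4.2 with the scatter-matrix definitions of Section 3.4.1. The names and the small numerical guard are assumptions of this sketch, not the implementation used in this work.

    import numpy as np

    def mdf_projection(Y, labels, num_features=None):
        # Y: (s, m) array of MEF vectors; labels: length-s class labels.
        labels = np.asarray(labels)
        classes = np.unique(labels)
        grand = Y.mean(axis=0)
        m = Y.shape[1]
        Sw = np.zeros((m, m))
        Sb = np.zeros((m, m))
        for c in classes:
            Yc = Y[labels == c]
            Mc = Yc.mean(axis=0)
            D = Yc - Mc
            Sw += D.T @ D                               # within-class scatter
            diff = (Mc - grand)[:, None]
            Sb += diff @ diff.T                         # between-class scatter
        lam, H = np.linalg.eigh(Sw)                     # S_w = H Lambda H^t
        whiten = H @ np.diag(1.0 / np.sqrt(np.maximum(lam, 1e-12)))
        sig2, U = np.linalg.eigh(whiten.T @ Sb @ whiten)
        order = np.argsort(sig2)[::-1]
        W = whiten @ U[:, order]                        # eigenvectors of S_w^{-1} S_b
        k = num_features if num_features is not None else len(classes) - 1
        return W[:, :k]

    # DKL projection of an image x: Z = W^t V^t (x - mean), with V and mean taken
    # from the MEF sketch given earlier.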
3.4.4 Explanation of the MDFs

The Most Discriminating Features are not directly tied to the absolute intensity values of the input images. Figure 3.11 shows some MEFs and MDFs obtained from a large training set of human faces. As can be seen from the figure, the features encapsulated in the MDF vectors show directional edges found in the training set and have discarded the imaging artifacts, such as lighting direction, to a large extent. Figure 3.11 thus demonstrates how the MDF vectors have divorced themselves from the artifacts of the imaging process, especially, in this instance, the lighting direction.

Figure 3.11: A sample of MEF and MDF vectors treated as images; panel (a) shows MEF vectors 1 through 8 and panel (b) shows MDF vectors 1 through 8. The MEF vectors in (a) show the tendency of the principal components to indicate major variations in the training set, such as lighting direction. The MDF vectors in (b) show the ability of the MDFs to discount those factors unrelated to classification. The training images used to produce these vectors are courtesy of the Weizmann Institute.

The Clustering Effect of the MDF subspace using the DKL projection

To show how the MDF subspace effectively discounts factors unrelated to classification, an experiment was performed to obtain the MEF and the MDF vectors for a collection of training images. Figure 3.12 shows samples of the data in the space of the best (in terms of the largest eigenvalues) two MEFs and MDFs for the experiment. From Figure 3.12, it is clear that the MDF subspace has a significantly higher capability than the MEF subspace to correctly classify an input image that is considerably different from those used in training, because classes in the MDF subspace have larger between-class distances and smaller within-class scatter.

Figure 3.12: Distribution of some samples using the best two features in the MEF space (MEF 1 versus MEF 2) and in the MDF space (MDF 1 versus MDF 2) for the root node. The classes shown include cars, chairs, computer mice, digits, human faces, and human bodies. In the MDF subspace, objects of the same class are clustered much more tightly than in the MEF space.

3.5 Space Tessellation

Given the strengths of these two feature sets, we want to provide an effective and efficient method for retrieving images from the database. The SHOSLIF produces a hierarchical space tessellation using the hyperplanes determined by the MEF and MDF features. The feature space of all possible images is partitioned into cells of differing sizes, as shown in Figure 3.13.
A cell at level l is subdivided into smaller cells at level l + 1. The network structure that effects this recursive tessellation is a Space-Tessellation Tree whose nodes represent cells at the corresponding levels, as shown in Figure 3.1. The tree is built automatically during the learning phase; it is then used in the recognition phase to find the model in the learned database that best matches an unknown image in O(log n) time on average, for n images in the database.

Figure 3.13: The SHOSLIF's hierarchical Quasi-Voronoi space tessellation, shown as (a) a 2D space view and (b) a tree view. Digits and "+" symbols indicate centers of cells at levels l, l + 1, and l + 2. Dark lines indicate cell boundaries at a level of the Space-Tessellation Tree nearer the root.

As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the model database into increasingly simpler problems. When a sample vector is presented to a node, a distance measure from the center of each of the node's children is computed to determine the most likely partition to which the sample vector belongs. At each level of the tree, the k nodes that best capture the features of the sample vector are used as roots for further refinement. The operator chooses a constant k for retrieval purposes, thereby ensuring that the retrieval complexity is O(log n) on average. The children of a node N represent a tessellation of the space subtended by N. A two-dimensional example is shown in Figure 3.14.

Figure 3.14: An example tessellation of the space covered by node N, showing the space covered by N at level l, the spaces covered by its children at level l + 1, and the tree description of this space decomposition.

3.5.1 The Hierarchy Decomposes the Problem

The tree structure is able to decompose the problem of linear discriminant analysis into smaller, tractable problems. At the root node of the tree, where the entire database of samples is found, the classes may not be separable using the linear discriminant analysis technique. But because we do not attempt to completely separate the classes at a single level, we can successfully solve the problem in stages by breaking it down into solvable pieces.

The DKL projection does well in dealing with large variation within a particular class. When the database contains many classes, however, they still may not be linearly separable. The Space-Tessellation Tree provides a mechanism for dealing with this problem. Children of a particular node decompose the difficult problem of separating many classes into several smaller problems. At each node of the tree, the set of features found by the discriminant analysis procedure is specifically tuned to the set of samples found in the node. So although the MDF space provides an optimal set of features for class selection in the sense of a linear transform, this optimal set may be insufficient to separate the classes. But even if a node cannot completely separate the classes, it can make a first attempt at separation, dividing the samples among its children. At the child level, since fewer samples exist, the linear discriminant analysis procedure is more likely to succeed. Because this is applied recursively, eventually a successful separation of classes is achieved.
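A rough sketch of how this recursive decomposition could be realized is given below. It reuses the hypothetical mef_projection and mdf_projection helpers sketched earlier and replaces the full bookkeeping of Algorithm 1 (Section 3.5.5) with a much simpler greedy partition, so it should be read as an illustration of the idea rather than the algorithm used in this work.

    import numpy as np

    def expected_radius(level):
        # Decreasing positive radius per level; base and exponent follow Section 4.2.3.
        return 1.3 ** (36 - level)

    def quasi_voronoi_partition(Z, radius):
        # Greedy one-pass partition: a sample farther than `radius` from every
        # existing center starts a new cell; otherwise it joins the nearest cell.
        centers, members = [], []
        for i, z in enumerate(Z):
            if centers:
                d = [np.linalg.norm(z - c) for c in centers]
                j = int(np.argmin(d))
                if d[j] <= radius:
                    members[j].append(i)
                    continue
            centers.append(z)
            members.append([i])
        return centers, members

    def build_node(samples, labels, level=0, max_leaf=1):
        # Each node fits projections tuned to its own samples, partitions them,
        # and recurses; multi-class nodes use the MDF space, single-class nodes
        # keep the MEF space, as described in Section 3.5.2.
        labels = np.asarray(labels)
        node = {"level": level, "size": len(samples)}
        if len(samples) <= max_leaf:
            node["samples"] = samples
            return node
        mean, V, _ = mef_projection(samples)
        Y = (samples - mean) @ V
        if len(np.unique(labels)) > 1:
            W = mdf_projection(Y, labels)
            Z = Y @ W
        else:
            W, Z = None, Y
        node.update(mean=mean, V=V, W=W)
        centers, members = quasi_voronoi_partition(Z, expected_radius(level + 1))
        if len(members) == 1 and len(members[0]) == len(samples):
            node["samples"] = samples                 # cannot split further
            return node
        node["centers"] = centers
        node["children"] = [build_node(samples[m], labels[m], level + 1, max_leaf)
                            for m in members]
        return node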
Figure 3.15 shows an example of the difference in the complexity of the class separation problem for the root node and an internal node of the tree. A child node contains fewer samples than its parent does, and the MDF vectors can therefore be optimized for the smaller set of samples in the child node.

Figure 3.15: An example showing the complexity of the class separation problem at two different levels of the tree: (a) the root node and (b) an internal node. Both plots show samples from four classes in the plane of the best two MDFs. Samples from the same classes are given for both graphs. Since the internal node contains many fewer samples than the root node, the MDF vectors can be optimized for the classes contained in the node.

3.5.2 Hierarchical Quasi-Voronoi Tessellation

We tessellate the space covered by a node N using a Hierarchical Quasi-Voronoi Tessellation, with the resulting tessellated space resembling, in two dimensions, that shown in Figure 3.13. First we define some terms and provide a few properties of the Hierarchical Quasi-Voronoi Tessellation algorithm.

Definition 2 Expected radius. Each level of the tree has an expected radius r(l) of the space it covers, where l is the level of the node and r(l) is a decreasing positive function of the level.

Definition 3 Node center. The center of a node N is the sample vector associated with the training sample that caused N to be created in the tree.

Definition 4 Distance measure. d(X, A) is the distance measure between node N, with center vector A, and a sample vector X.

A node N in the tree contains a set of labeled training images. Every node which contains more than a single training image computes a projection matrix V that is used to project the samples to the MEF space, as described in Section 3.3. If the training samples contained in N are drawn from multiple classes, as indicated by the labels associated with each training sample, then the MEF sample vectors are used to compute a projection matrix W to project the samples to the MDF space, as described in Section 3.4. Otherwise, the training samples in N are from a single class, and we organize them into a subtree using MEFs for efficient image retrieval.

The MEF subtree is used to find the nearest neighbor for a test probe. Once that nearest neighbor is found, the sample is projected back to the nearest ancestor node that utilizes an MDF space. This is done because the center of the nearest MDF ancestor node may not be very near the test probe in this MDF space; but it represents a class of objects that contains a vector that is near the test probe. Therefore, the MEF subtree is used to find that nearest neighbor in order to compare the test probe with the nearest-neighbor sample using the most specific MDF subspace. A flag is set for each node to indicate which sort of feature space it is working in. A simple example tree constructed with these spaces is shown in Figure 3.16.
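The leaf-level comparison just described, in which the nearest neighbor found in an MEF subtree is compared with the test probe in the most specific MDF ancestor's subspace, might be sketched as follows; the dictionary fields are the hypothetical ones used in the earlier sketches.

    import numpy as np

    def compare_in_nearest_mdf_ancestor(probe, leaf_sample, node_path):
        # node_path: root-to-leaf list of nodes visited for this probe.
        for node in reversed(node_path):
            if node.get("W") is not None:            # deepest MDF node on the path
                def project(x):
                    return ((x - node["mean"]) @ node["V"]) @ node["W"]
                return np.linalg.norm(project(probe) - project(leaf_sample))
        # no MDF ancestor on the path: fall back to an image-space comparison
        return np.linalg.norm(probe - leaf_sample)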
Figure 3.16: An example tree built with both the MEF and the MDF subspaces. Each node lists the number of samples it holds from each class. Shaded nodes indicate that the MDF space was used; square nodes indicate that the MEF space was used; bold nodes are leaf nodes.

3.5.3 Incremental Learning

Training samples are added to the tree in a series of batches. Larger batches produce more efficient image retrieval trees, but they take longer to develop. Smaller batches can be processed more quickly, but may yield a less efficient tree than one in which all the training images are given as a single batch. Thus the operator of the system must consider the tradeoff between learning time and retrieval time complexity when determining the batch sizes used for training the tree.

Suppose that we want to add training sample $X_i$ to node N, which is at level l. If the feature vector for $X_i$ is within the radius covered by one of the children of N, then $X_i$ will be added as a descendant of that child as we increase the depth of the tree. If the feature vector for $X_i$ is outside the expected radius for all the children of N, however, we would like to add $X_i$ as a new child of N. If N is an internal node, then a partial tessellation of N has already been established by previous training batches, and care must be taken in changing an existing tessellation, as discussed in the following section.

3.5.4 The Adoption Problem

Once node N becomes a grandparent node, the tessellation of the space at node N must be fixed so that the descendants of N will not be lost in a changing tessellation. For example, due to some previous processing, G may be a grandchild of N, as shown in Figure 3.17. G is close to a tessellation boundary, but it falls outside the expected cell radius for its nearest neighbor, node C1.

Figure 3.17: An example tessellation where grandchild G is close to a tessellation boundary and outside the expected distance radius for its nearest parent C1. The figure shows the expected distances from children C1 and C2, together with the tree description of this space decomposition.

Now suppose that a new image X falls into the space covered by N at the position shown in Figure 3.17. The nearest neighbor to X is node C2, but X falls outside the cell radius for C2. So we wish to establish X as a child of N in order to contribute to the tessellation of N. If we were to do this, however, the new tessellation of node N would be as shown in Figure 3.18.

Figure 3.18: An incorrect tessellation of node N that would result from adding X as a child of N. The subtree rooted at G would be adopted by X, and therefore unreachable, if this tessellation were to be established.

The modified tessellation, though seemingly appropriate for node N, renders the entire subtree rooted at G unreachable because it is adopted by X. Node N will pass samples that should be classified using the subtree rooted at G to node X; since X knows nothing about subtree G, the tree rooted at G cannot be accessed.
Due to the possibly complex subtree rooted at G, a simple attachment of G as a child of X may cause problems for G's descendants. Of course, we could always restructure the tree whenever a new sample is added to the model database. Such a restructuring, however, is very computationally expensive to perform. The system could be set up to do this type of restructuring during slow times, but we do not want to impose such a large computational requirement for every incrementally learned sample or batch of samples. Because of this problem of losing subtrees that are rooted near a tessellation boundary, we say that node N becomes a sterile node whenever a grandchild is added to N's subtree.

Definition 5 Sterile Node. A node N is a sterile node when its tessellation can no longer be changed.

Lemma 1 Node N at level l is a sterile node when it becomes a grandparent.

Proof:

1. When a node N produces its first child node, a copy of N is made at the level of the child (see Algorithm 1), and this copy contributes to the tessellation of N. So N can have either 0 or more than 1 children contributing to its tessellation.

2. When one of the children of N produces a child of its own, G, then N becomes a grandparent node.

3. It is possible for this new grandchild to be near a tessellation boundary of N, but outside the expected radius for nodes at level l + 1.

4. If a new node X now comes into N to (possibly) contribute to N's tessellation, it is possible for it to be positioned as shown in Figure 3.17. If this is the case, then the tessellation of N cannot be changed by the new sample X, since the grandchild G may be adopted by X.

Therefore, node N at level l is a sterile node when it becomes a grandparent. □

Figure 3.19: Lemma 1.

For sterile nodes, such as node N in Figure 3.17, we do not want to change the tessellation (i.e., the children of N). X is added as a descendant of C2, even though it is beyond the expected radius of C2.

3.5.5 Conditional Sterility Principle

This sterile condition of a node need not be absolute, however. We could have previously built a tree as shown in Figure 3.20, where $C_{1,l+2}$ and G are the only grandchildren of N. Adding X as a child of N in this case will alter the tessellation of N, but it will not render any subtree of N unreachable.

Figure 3.20: An example tessellation where changing the tessellation would not result in the loss of a subtree.

Figure 3.21 shows the tessellation resulting from adding X as a child of N. This example can be generalized and encapsulated in the Conditional Sterility Principle for dealing with the Adoption Problem.

Figure 3.21: A valid tessellation of node N resulting from the addition of X as a child of N. No subtrees of N are lost by this tessellation modification.

Principle 1 Conditional Sterility Principle. It is safe to add a new node X as a child of sterile node N if none of the current descendants of N will be covered by the new node X.

If none of the descendants of N fall into the new cell that will be covered by X (i.e., the shaded region in Figure 3.21), then X can be added as a child of N even if N is a sterile node. To do this, we keep track of D(C), the maximum distance between node C and all of its descendants, while we are building the tree.
Then we can add a new training sample X as a child of sterile node N if $d(X, C_j) > D(C_j) + r(l + 1)$ for all children $C_j$ of N at level l + 1.

In summary, the tree is built using a sequence of training sample batches. Within each batch, the tree is updated at level l using all the batch samples before level l + 1 is updated. Essentially, the algorithm goes through the following steps:

- Given c classes and k training images for a particular node N (for example, ten individuals and 50 fovea training images, five from each individual, for the root node of the tree).
- Obtain a vector representation of each image.
- Compute the Most Expressive Features; the maximum number possible is k.
- Choose m ≤ k features of the MEFs to retain.
- Compute the Most Discriminating Features in the MEF space; the maximum number possible is c − 1.
- Choose p ≤ c − 1 features to retain.
- Perform a space tessellation in this MDF space for node N. We have k points in this p-dimensional space to use to partition the space.

Algorithm 1 summarizes and incorporates the solutions to the concerns in the preceding discussion. This algorithm is called the Hierarchical Quasi-Voronoi Tessellation Algorithm because the space of each node is partitioned into a Quasi-Voronoi tessellation in a hierarchical manner. A characterization of a sample tree built using this algorithm is shown in Table 3.1; the generated tree is too large to display in detail, so only the number of nodes produced on each level is given there.

Algorithm 1 The Hierarchical Quasi-Voronoi Tessellation Algorithm
Input: Node N at level l − 1, list of samples X to add.
Output: A (possibly) changed tessellation of N based on the new samples.

1. If this is not a sterile node:
   (a) Add the samples already covered by node N to the list of new samples.
   (b) If there are multiple images in this node:
       i. Compute the projection matrix V to the MEF space for this node.
       ii. If there are multiple classes in this node, compute the projection matrix W to the MDF space for this node.
       iii. For each child $C_i$ of this node:
            A. Project the center of $C_i$ to the new MEF/MDF space of N, whichever is in use.
            B. Set the radius covered by each child $C_i$ to the default for the level. Since we added the existing samples to the new ones, the samples will be re-partitioned.
2. For each sample X:
   (a) If V is defined, project X to the MEF space to get Y.
   (b) If W is defined, project Y to the MDF space to get Z.
   (c) Compute an array of distances between the children of N and Z (or Y, depending on which space is in use for node N). Keep an array R of the radii covered by the children of N, in case Z (or Y) belongs to none of the children of N.
   (d) If Z (or Y) is external to all of the spaces covered by the children of N:
       i. If N is not sterile, then add Z (or Y) as a new child of N.
       ii. Otherwise, if $d(Z \;(\text{or } Y), C_i) > R_i + r(l + 1)$ for all children $C_i$ of N, then add Z (or Y) as a new child of N.
       iii. Otherwise, put Z (or Y) into the nearest child of N and update that child's radius of space covered, if necessary.

Figure 3.22: Algorithm 1, The Hierarchical Quasi-Voronoi Tessellation Algorithm.
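The insertion step of Algorithm 1 for a single projected sample, including the sterile-node and Conditional Sterility checks, can be sketched as follows. The node fields (centers, radii, maximum descendant distances, sterility flag) are assumptions of this sketch, and the safety test uses the D(C) form of Principle 1; it also reuses the expected_radius helper sketched earlier.

    import numpy as np

    def try_add_as_child(node, z, level):
        # z: the sample already projected into node's MEF or MDF space.
        r_next = expected_radius(level + 1)
        dists = [np.linalg.norm(z - c) for c in node["centers"]]
        j = int(np.argmin(dists)) if dists else -1
        if j >= 0 and dists[j] <= node["radii"][j]:
            return ("descend", j)                    # covered by an existing child
        if not node.get("sterile", False):
            return ("new_child", None)               # tessellation may still change
        # Sterile node: add a new child only if no descendant could be adopted by it,
        # i.e., d(z, C_i) > D(C_i) + r(l + 1) for every child C_i.
        safe = all(d > node["max_descendant_dist"][i] + r_next
                   for i, d in enumerate(dists))
        if safe:
            return ("new_child", None)
        node["radii"][j] = max(node["radii"][j], dists[j])
        return ("descend", j)                        # nearest child absorbs the sample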
Level    Node count
0        1
1        9
2        32
3        93
4        256
5        471
6        491
7        310
8        113
9        25
10       2
Total    1803

Table 3.1: A characterization of a sample tree built using the Hierarchical Quasi-Voronoi Tessellation Algorithm. The table lists the number of nodes found on each level of the tree, because the tree is too large to show in detail.

Other trees have been utilized for classification purposes [18] [115]. These trees typically perform the most obvious class separations first, postponing finer distinctions to a later stage. In these trees, the classification problem is broken down into smaller, simpler problems [103]. However, these trees typically have a fixed set of hand-crafted features [138], and they are not necessarily tuned specifically for the samples found in the nodes of the tree. In contrast, our Space-Tessellation Tree locates those features that can best describe the samples contained in each node, and best discriminate among the different classes in these samples.

3.5.6 Properties of the Hierarchical Quasi-Voronoi Tessellation Algorithm

The Hierarchical Quasi-Voronoi Tessellation Algorithm has several favorable properties. Among them are a bounded number of children for each node and, for n samples, O(log n) levels in the average case.

Theorem 1 Given a fixed dimensionality d for the samples and a decreasing positive expected radius function r(l) based on the level of the tree, the number of children that a node N at level l − 1 can have is bounded above by a constant κ, regardless of the training set.

Proof: Let N be the center vector for a node and n, n' be children of node N. Let P(N) be the hypersphere centered at N with radius $r_{l-1} + (r_l/2)$. Let q(n) be the hypersphere centered at n with radius $r_l/2$. Then $q(n) \cap q(n') = \emptyset$ if $n \neq n'$; otherwise, if these hyperspheres overlapped, one of {n, n'} would be the child of the other by Algorithm 1. See Figure 3.24 for a 2D example of these hyperspheres. P(N) is the extended space of N, i.e., the space that N might have to cover. q(n) is the smallest hypersphere that produces non-overlapping children of N, since $d(n, n') \geq r_l$ by Algorithm 1. Note that P(N) contains every q(n), for all children n of N. Since the volume of a d-dimensional hypersphere of radius R is proportional to $R^d$, the volume of P(N) satisfies $V_P \propto (r_{l-1} + r_l/2)^d$, and likewise the volume of q(n) satisfies $V_q \propto (r_l/2)^d$. Since q(n) and q(n') do not overlap, and P(N) contains all q(n), the number of q(n)'s, which is also the number of children of N, is bounded above by the constant

$$\kappa = \frac{V_P}{V_q} = \left[\frac{r_{l-1} + (r_l/2)}{r_l/2}\right]^d = \left[1 + \frac{2 r_{l-1}}{r_l}\right]^d. \qquad (3.7)$$

□

Corollary 1 If $r_{l-1} = \alpha r_l$, $\alpha > 1$, then the number of children of any node is not larger than $(1 + 2\alpha)^d$.

Figure 3.23: Theorem 1.

Theorem 1 uses the fact that both the dimensionality of the sample vectors and the expected radius used to create node N's children are constants to assert that the maximum number of children that N can produce is bounded above by a constant κ, irrespective of the training samples used. Lemma 2 asserts that Algorithm 1 produces O(log n) levels.

Figure 3.24: Hyperspheres used in Theorem 1: q(n), of radius $r_l/2$, and P(N), of radius $r_{l-1} + (r_l/2)$.

3.6 Image Retrieval

When an unknown image X is presented to the recognition tree, the general flow shown in Figure 3.26 is followed. When a node N of the tree is activated, X is projected to $X_Y$, a vector in the MEF subspace of node N. $X_Y$ is then projected to $X_Z$, a vector in the MDF subspace of node N. $X_Z$ is compared with each of N's children. The child with the best response is selected as the path of the tree to be explored.
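In code, this per-node activation, generalized to the k competing paths introduced below and summarized later in Algorithm 2, might look like the following sketch; it assumes the node dictionaries of the earlier sketches.

    import numpy as np

    def node_distances(node, x):
        # Project probe x into the node's subspace and measure it against each child.
        y = (x - node["mean"]) @ node["V"]
        z = y @ node["W"] if node.get("W") is not None else y
        return [np.linalg.norm(z - c) for c in node["centers"]]

    def retrieve(root, x, k=10):
        # Explore at most k competing nodes per level; collect the leaves reached.
        frontier, leaves = [root], []
        while frontier:
            candidates = []
            for node in frontier:
                if not node.get("children"):
                    leaves.append(node)
                    continue
                d = node_distances(node, x)
                for i in np.argsort(d)[:k]:
                    candidates.append((d[i], node["children"][i]))
            candidates.sort(key=lambda t: t[0])
            frontier = [n for _, n in candidates[:k]]
        return leaves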
Such a single-path search will guarantee that an exact match can always be found, but it cannot guarantee that a nearest match will always be found. As shown in Figure 3.27, the best match for X may be found in the subtree rooted at G, but a single-path search will explore only C2 and its subtree. If the nearest neighbor lies in the subtree rooted at G, it will be missed in a single-path search. Therefore, we define a constant number k of nodes to be explored at every level of the tree. Using a constant k at every level ensures that only a band of the Space-Tessellation Tree is explored, the band surrounding the nearest leaf-node match of a test probe.

Definition 6 Given n samples, a Bounded Unbalanced Tree with Unbalance Bound $0 \leq \alpha \leq 1$ (a constant) is a tree such that for any node N containing $n_1 + n_2 + \cdots + n_k$ samples, where N has k children with $n_i$ samples assigned to child i and $n_1 \geq n_2 \geq \cdots \geq n_k$, we have $n_1 \leq \alpha (n_1 + n_2 + \cdots + n_k)$.

Lemma 2 The number of levels in a Bounded Unbalanced Tree with n samples is bounded above by $\log_{1/\alpha} n$, where α is the Unbalance Bound of the tree.

Proof: Each node N of the tree is assigned $n_1 + n_2 + \cdots + n_k$ samples, where $n_i$ is the number of samples assigned to the ith child of N. Rank these $n_i$'s so that $n_1 \geq n_2 \geq \cdots \geq n_k$. Because the tree is a Bounded Unbalanced Tree, we know that $n_1 \leq \alpha(n_1 + n_2 + \cdots + n_k)$, and this holds for all nodes N of the tree. Here α is a constant; it may be large (e.g., 99%), but it is still a constant. This means that no node will assign a huge portion of its samples to a single child. An extreme example would be a node where child 1 receives n − 1 samples and child 2 receives 1 sample. Since the tree is a Bounded Unbalanced Tree, this is not allowed; in the worst case, child 1 receives αn samples and child 2 receives (1 − α)n samples. This is a significant constraint. Each deeper level of the tree reduces the number of samples by a factor of at least α, so the lth level down the tree receives at most $n\alpha^l$ samples. At the tree height h, we have just a single sample by Algorithm 1. Then $n\alpha^h = 1$, so $\alpha^h = 1/n$, or $(1/\alpha)^h = n$. Hence the height of the tree is $h = \log_{1/\alpha} n = \log n / \log(1/\alpha)$. □

Figure 3.25: Lemma 2.

The retrieval algorithm is given in more detail by Algorithm 2. Theorem 2 uses the results from Theorem 1 and Lemma 2 to show that this Image Retrieval algorithm runs in O(log n) time on average.

Figure 3.26: The general flow of each SHOSLIF Processing Element.

3.7 Summary

This chapter introduces the Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework. The SHOSLIF paradigm is an attempt at a unified methodology for comprehensive visual learning. It attacks problems such as how to automatically select the most salient features and how to organize the visual database. The SHOSLIF uses the theories of optimal linear projection to generate a hierarchical tessellation of a space obtained from the training images.
Two feature spaces are described: a Most Expressive Features space, or principal component feature space, obtained using a Karhunen-Loeve projection, and a Most Discriminating Features space, or discriminant analysis feature space, obtained using the multivariate linear discriminant analysis procedure. We assume a visual attention mechanism to provide a well-framed set of images, each of which contains a single object of interest.

We develop the Most Expressive Features and the Most Discriminating Features spaces, and describe the advantages of each of these feature spaces. We discuss how the MEF space is well-suited to object class characterization, because the space captures the major variations present in the training set of images. But we also discuss how these major variations are not necessarily good for classification. The MDF subspace, however, is designed specifically to provide optimal performance for object classification in the sense of a linear transform. Though these features are not necessarily good at capturing the features needed for class characterization, they are good at selecting the best match from among a set of classes. For both of these feature spaces, we describe some computational issues.

Figure 3.27: A tessellation in which a single-path search may not find a nearest match. Node N at level l will shuttle test probe X to its nearest child C2. The nearest neighbor for X may well be found in the subtree of G, which is missed in a single-path search.

Given the strengths of these feature sets, we provide an effective and efficient method for retrieval by incorporating these subspaces into a hierarchical space decomposition. The feature space for a particular node N in the tree is partitioned into regions based on the training data contained in that node. Each of these regions forms a child of N in the tree, and all the samples belonging to a region are given to its associated child node.

Algorithm 2 The Image Retrieval Algorithm
Input: Probe X, level l, and a list of at most a constant k nodes which were explored at level l.
Output: A list of nodes to explore at level l + 1.

1. For each node $N_i$ in the list explored at level l:
   (a) If $N_i$ is not a leaf node:
       i. Project X to the MDF subspace of node $N_i$, producing Z.
       ii. Compute $d(C_j, Z)$ for all children j of $N_i$ with center vectors $C_j$.
       iii. Transfer at most a constant k of the children of $N_i$ to the output list, such that those transferred are the k nearest neighbors of Z.
2. Truncate the output list to hold at most a constant k nodes to explore at the next level.

This algorithm is repeated for all levels of the Space-Tessellation Tree.

Figure 3.28: Algorithm 2, The Image Retrieval Algorithm.

As the processing moves down from the root node of the tree, the hierarchy recursively subdivides the objects of the model database into increasingly simpler problems. At each node of the tree, the set of features found by the discriminant analysis procedure is specifically tuned to the set of samples found in the node.

Theorem 2 For the k-competing-path case, the number of nodes visited in a Bounded Unbalanced Tree is bounded above by $k \kappa \log_{1/\alpha} n$, where k is the number of competing paths explored, κ is the upper bound on the number of children that any node can have, α is the Unbalance Bound of the tree, and n is the number of samples in the database.
Proof: The proof follows directly from Theorem 1, Lemma 2, and the fact that a constant k nodes are explored at every level of the Bounded Unbalanced Tree. □

Figure 3.29: Theorem 2.

So although the MDF space provides an optimal set of features for class selection in the sense of a linear transform, this optimal set may be insufficient to separate the classes. But even if a node cannot completely separate the classes, it can make a first attempt at separation, dividing the samples among its children. At the child level, since fewer samples exist, the linear discriminant analysis procedure is more likely to succeed. Because this is applied recursively, eventually a successful separation of classes is achieved.

We provide an algorithm to effect this hierarchical Quasi-Voronoi tessellation in an incremental-update fashion. We discuss problems attendant on this incremental-update paradigm, such as the adoption problem, where descendants of a particular node N are lost due to a changing tessellation of node N, and we describe solutions to these problems. We provide some properties of this incremental tree-building algorithm, viz., that each node has a bounded number of children and that the tree produces O(log n) levels in the average case with a uniform grid of training points.

We discuss image retrieval under the SHOSLIF paradigm, and give an algorithm to effect O(log n) complexity for image retrieval. We describe the need for multiple competing search paths in the tree, and discuss how only a band of these competing nodes is explored as we progress down the tree.

Chapter 4

SHOSLIF for Object Recognition

The SHOSLIF paradigm described in Chapter 3 can be used for a wide variety of applications, including object recognition [146] [147] [148] [149] [150] [151] [152], motion understanding [43] [44], vision-guided robot manipulation [80], and autonomous navigation [33] [34] [35]. This work describes the SHOSLIF setup applied to object recognition, SHOSLIF-O.

4.1 SHOSLIF-O using manually-defined hierarchical labels

One mode of operation for the SHOSLIF-O system is a mode where the operator supplies a hierarchical set of labels for each training image. For example, for training images $X_i$ and $X_j$, the hierarchical set of labels could be as shown in Table 4.1.

Sample   General Label   Species Label   Name Label   Particular Sample Label
$X_i$    everything      humans          Joe          image-1
$X_j$    everything      dogs            Fido         image-4

Table 4.1: An example of a set of hierarchical labels that could be used in the hierarchical-labels mode.

Thus we have a hierarchical set of classifications that we can use to build the tree. In this mode, while constructing the tree, we view at most three levels at a time, as shown in Figure 4.1. For a given node N, we compute the MDFs using the means of N's grandchildren as sample points, and the children of N as the classes among which we want to discriminate. This gives each level of the tree maximal discrimination among the various subtrees defined by the training labels.

Figure 4.1: Nodes used for training the MDFs: a node N, its children (the classes, e.g., cars, humans, and chairs), and the training samples beneath them.

To demonstrate the operation of the SHOSLIF-O in this mode, the tree was trained using eight hundred images. The labels were assigned to the training samples to produce a tree designed to have four levels, as shown in Figure 4.2. The resultant tree was tested both using the re-substitution method (i.e., where the test image is in the training set) and using a disjoint test set (i.e., where the test image is not in the training set, so the system has not learned the test image).
The results from these tests are shown in Table 4.2.

Table 4.2: Results of testing the tree built under the hierarchical-labels learning mode, reporting the training and test set sizes and the rates at which the correct sample was the top choice or was among the top matches.

Several problems present themselves with this approach to constructing the tree.

Figure 4.2: The set of hierarchical labels used.

When the operator assigns a hierarchical set of labels to each training sample, the structure of the tree is essentially defined under this mode of operation. Though this may result in a more efficient (i.e., compact) tree, some level in the hierarchy of classes may not be linearly separable, and the performance of the system suffers from this. It would be much better for the system itself to decide the structure of the Space-Tessellation Tree rather than the operator, because the system is able to construct the space tessellation based on the data itself rather than on some semantic meaning imposed by the operator, as described in Chapter 3.

In order to discuss the automatic space tessellation mode of the SHOSLIF-O, some of the parameters of the framework must be explored. Following the discussion of these system parameters, results from various experiments are reported.

4.2 SHOSLIF-O Options for Automatic Space Tessellation

4.2.1 Distance measures

As in any recognition system, the similarity measure used is of critical importance. For the SHOSLIF-O case it is especially important, because the distance measure affects how the database of training images is partitioned and how the tree is built. As such, the distance measure has a direct effect on the recognition rate and the retrieval capability of the system.

Several experiments were done to determine a good distance measure to use. A normalized distance, a simple feature space distance, and a metric that included the distance from the subspace were all tested. Each distance measure was used to train on a set of 1151 images taken from 432 classes. The images used are shown in Figures A.1 and B.1. At the time that these experiments were run, the full set of faces was not available. The same disjoint set of test images was used to test the recognition tree for each of the different distance measures.

In addition to the different distance measures, an analysis was performed of the significance of maintaining a history of how well the ancestors of a particular node had matched the test probe, used to determine which subtrees to explore. Upon reaching the leaf nodes, this tally can be viewed as the confidence of reaching the leaf. The confidence measure used was the sum of the confidences for each node on the path from the root to the leaf, divided by the path length. The confidence for each node was computed differently for each of the different distance measures. For the normalized distance, which was already in the range [0, 1], the confidence for the node was simply 1.0 minus the normalized distance. For the other distance measures, the maximum distance among the children of a node was computed, and the confidence was taken to be 1.0 minus the distance from a child divided by this maximum distance.

In the following experiments, the list of nodes to explore was determined either by the distance alone or by the confidence values obtained. When the distance was used to determine the set of nodes to explore for a particular level of the tree, the nodes were ranked by increasing distance.
The top k of these, that is, the nodes with the smallest distances, were chosen to explore further. When the confidence measure was used, the nodes visited in a particular level were ranked by decreasing confidence; the k nodes from a level with the largest confidence were chosen to explore further.

Normalized distance: a bad measure

The distance measure used in the normalized distance experiment was

$$d(X, A) = \exp\left(-\frac{1}{2}\sum_{i=1}^{d} \frac{(X_i - A_i)^2}{\sigma_i}\right),$$

where d is the dimension of the vectors, X is the test probe, A is the center of the node, and $\sigma_i$ is the variance of the ith dimension for the space being used. To determine $\sigma_i$, the training samples for the node being computed were projected to the subspace in use (i.e., the MEF or the MDF subspace, depending on whether the node has a single class or multiple classes). Then, for dimension i and subspace feature vectors $Y_j$, $j = 1, 2, \cdots, s$, for the s samples in the node, we have

$$\sigma_i = \frac{1}{s - 1} \sum_{j=1}^{s} (Y_{ji} - \bar{Y}_i)^2.$$

A performance graph for exploring k concurrent neighbors of the tree, for various k, is given in Figure 4.3. Results obtained using both the distance ranking method and the confidence ranking method are shown in the graph.

Figure 4.3: Performance of the Normalized Distance measure as a function of the number of nearest neighbors explored. Both the distance ranking method and the confidence ranking method are shown.

The results of this distance measure were quite poor, only roughly 44% accuracy under the best circumstances. The reason for this poor performance is that by trying to normalize the distance through division by the variance of each dimension, we confound the separation of the classes in the MDF space that we worked so hard to achieve. Clearly, another measure is required, and the next metric considered was the feature space distance measure.

Feature space distance

In the feature space distance, the displacement of a test probe's projection from a node's center vector in that node's subspace was used as the distance measure. For MEF nodes, the eigenvectors corresponding to the top 95% of the eigenvalues found in the training set were maintained. For MDF nodes, the eigenvectors corresponding to the top 30% of the eigenvalues were used. The distance measure for the feature space distance experiment was

$$d(X, A) = \sqrt{\sum_{i=1}^{d} (X_i - A_i)^2},$$

where d is the dimensionality of the vectors, X is the test probe in the MDF space, and A is the center of the node being tested in the MDF space.

The performance graph showing the results from this distance measure is given in Figure 4.4.

Figure 4.4: Performance of the feature space distance measure as a function of the number of nearest neighbors explored. When the history was maintained, the results traced the same lines as when history was not maintained.

Though an improved recognition rate was achieved, the results were still only roughly 80% accuracy in the best cases. An analysis of the reasons for this is given in Section 4.2.1.
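For reference, the two measures compared above can be written compactly as follows; the exponential form of the normalized distance is one plausible reading of the partly illegible formula in the scan, so treat it as an approximation rather than the exact expression used in this work.

    import numpy as np

    def normalized_distance(x, a, var):
        # Squared differences scaled by the per-dimension variance of the node's samples.
        return np.exp(-0.5 * np.sum((x - a) ** 2 / var))

    def feature_space_distance(x, a):
        # Plain Euclidean distance in the node's MEF or MDF subspace.
        return np.linalg.norm(x - a)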
Image space subtrees instead of MEF subspace subtrees

Because the results were still not good, we decided to do an experiment in which we replaced the MEF subspace subtrees in the SHOSLIF-O recognition tree with image space subtrees. Algorithm 1 calls for us to construct a subtree in the MEF subspace for those nodes in which all the samples come from a single class. In this experiment, that was changed so that once a node contained a single class, the space tessellation tree for that single-class node was generated in the original image space instead of in the MEF subspace. That is, once a node contained images all from a particular class, rather than building an MEF space subtree from that node, we built an image space subtree from that node. The distance measure used in this experiment is given by

$$d(X, A) = \sqrt{\sum_{i=1}^{d} (X_i - A_i)^2},$$

where d is the dimensionality of the vectors, X is the test probe, and A is the center of the node being tested.

Once a partitioning of the training set has produced nodes containing only single classes, rather than generating a hierarchy in the MEF subspace of these nodes, this experiment generated a hierarchy in the image space directly. The assumption is that the search in the MDF tree finds the node(s) associated with the appropriate class for the test probe. Given this assumption, the MEF subtrees simply provide a means for finding those images from the class most similar to the test probe. In this experiment, we measure the effectiveness of using the MEF feature vectors instead of the image space directly: instead of using the MEF space to generate a hierarchy of the training samples in the MDF leaf nodes, we use the image space directly. The performance graph obtained using the image space directly is given in Figure 4.5. In this experiment, we achieved a better recognition rate, which gave us pause to consider why such an event would occur.

Figure 4.5: Performance of the feature space distance measure using the image space instead of the MEF subspace once a single class has been found. The performance is given as a function of the number of nearest neighbors explored.

Distance from Subspace

The experiment where the image space replaced the MEF space for single-class subtree construction made us analyze the comparison of distances obtained using the centers of different nodes. Each node in the SHOSLIF-O recognition tree constructs both an MEF and an MDF subspace for the samples contained in the node. Each node constructs a different set of subspaces based on the samples contained in that node. The test probes that came into a node were being compared in the local MDF or MEF subspace for that node. It was entirely possible that a test probe vector that was miles away from a particular node's subspace would project into that subspace very near to the node's center vector. This would cause poor recognition results, because test probes that were not at all similar to the node centers would erroneously be considered close. The Distance from Subspace (DFS) distance measure takes into account the distance from the projection space in addition to the distance of the projection to the node centers.
The DFS distance measure is given by

$d(X, A) = \sqrt{\|X - VV^tX\|^2 + \|VWW^tV^tX - VWW^tV^tA\|^2}$

where $X$ is the test probe, $A$ is the center vector, $V$ is the projection matrix to the MEF space, and $W$ is the projection matrix to the MDF space. So the product $Y = V^tX$ is the projection of the test probe $X$ onto the MEF subspace; $VY = VV^tX$ expands this MEF projection into the original image space. Likewise, $Z = W^tV^tX$ is the projection of $X$ onto the MDF subspace, and $VWZ = VWW^tV^tX$ expands this MDF projection back into the original image space.

Note that computationally, for $Z = W^tV^tX$, $A_Z = W^tV^tA$, and $VW = M$,

$\|MZ - MA_Z\|^2 = \|M(Z - A_Z)\|^2 = (Z - A_Z)^t M^t M (Z - A_Z).$

Thus only a small matrix multiplication need be performed to effect this distance measure. This $M^tM$ matrix can be pre-computed during the learning phase and stored in each node so that in the testing phase, the computational complexity of this operation is minimized.

Intuitively, what is being measured can be seen in Figure 4.6. The first term under the radical indicates the distance of the original vector from the population (i.e., the subspace). The second term indicates the distance in the subspace from the center vector of a class.

Figure 4.6: Distance from Subspace description for 3D. The subspace is shown as the plane, and is defined by the matrix V. Two classes are shown in this example, and the MDF vector that can separate them is shown.

The performance of the DFS measure is shown in Figure 4.7. Because we project back to the original image space in using this DFS distance measure, we are assured that all the distance comparisons are in the same frame, and can therefore be logically compared. This DFS measurement takes into account not only how far a test probe vector is from the node's center vector in the node's subspace, but also the distance that the test probe is from the node's subspace itself. The DFS distance measure gives us the good results shown in Figure 4.7, and it is this DFS distance measure that is used for the experiments reported here.

Figure 4.7: Performance of the DFS distance measure, given as a function of the number of nearest neighbors explored.

The consideration for maintaining a historical deviation

In the foregoing experiments, we performed an analysis of various distance measures. In addition to the different distance measures, an analysis of the significance of maintaining a historical deviation from the ultimate leaf node's ancestors was performed. When a bad distance measure was used, such as the normalized distance or the feature space distance alone, this history provided an attempt at overcoming the shortcomings of the distance measure. This history is not needed, however, when a proper distance measure that includes the distance from the projection subspace is utilized. Because the history does not prove useful when utilizing the distance from subspace distance measure, it is not used in the remaining experiments described in this work.
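A compact sketch of the DFS computation is given below. It is illustrative only: the matrices V and W here are random orthonormal stand-ins rather than learned MEF/MDF bases, and the dimensions are arbitrary. It follows the two-term form above and shows the M^tM pre-computation that lets the second term be evaluated with a small matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_mef, m_mdf = 100, 20, 5            # image, MEF, and MDF dimensionalities (illustrative)
V = np.linalg.qr(rng.normal(size=(d, m_mef)))[0]       # stand-in MEF basis (orthonormal columns)
W = np.linalg.qr(rng.normal(size=(m_mef, m_mdf)))[0]   # stand-in MDF basis in MEF coordinates

M = V @ W                                # maps MDF coordinates back to image space
MtM = M.T @ M                            # pre-computed once per node during learning

def dfs_distance(x, a):
    """Distance-from-subspace measure: distance of x from the MEF subspace
    plus the (image-space) distance between the MDF projections of x and a."""
    off_subspace = x - V @ (V.T @ x)             # component of x outside the MEF subspace
    z, a_z = W.T @ (V.T @ x), W.T @ (V.T @ a)    # MDF coordinates of probe and center
    diff = z - a_z
    return np.sqrt(off_subspace @ off_subspace + diff @ MtM @ diff)

x = rng.normal(size=d)                   # test probe (illustrative)
a = rng.normal(size=d)                   # node center (illustrative)
print(dfs_distance(x, a))
```

Because V and W here have orthonormal columns, the second term reduces to the plain Euclidean distance between the MDF coordinate vectors; in general the pre-computed M^tM matrix carries the weighting needed to express that distance in the original image space.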
4.2.2 Use of Multiple Competing Paths

As described in section 3.6, in order to find a nearest neighbor match in the recognition tree, it may be necessary to explore more than a single path of the tree. This constant number k of parallel paths to explore is a parameter that the system operator must determine. For the data sets used in this work, unless stated otherwise, we chose k = 10 based on the performance graph given in Figure 4.8. Based on the data shown in this graph, the performance of the system levels out at k = 10. We chose this value to try to achieve the best performance at the lowest computational cost.

Figure 4.8: A comparison of the performance for various k competing paths.

4.2.3 Expected Radius

Node N uses the expected radius r(l) to determine when it needs to create a new child to accommodate a training sample. This expected radius is a decreasing positive function based on the level of the tree. For this work, we have chosen $r(l) = 1.3^{36-l}$ for node N on level l of the Space-Tessellation Tree. This number was chosen because at the root node of the tree, i.e., at level l = 0, the expected radius is 12646.219. This radius is large enough to cover all of the samples in the data sets utilized in this work. The value 1.3 was used as a base over a more traditional value because it gave favorable results in the shape of the tree. A tree that is too wide and fat produces inefficiencies because many children must be explored at a high level of the tree; a tree that is too thin and tall produces inefficiencies because many projections must be done in order to find the nearest neighbor in the database.

4.3 Experimental Results

In order to verify the proper functionality of the system, we experimented with a large multifarious set of real-world objects found in natural scenes. In this section, we demonstrate the ability of the MDF space to tolerate within-class variations and to discount such imaging artifacts as lighting direction, and show how the tree structure provides the ability for view-based recognition.

4.3.1 Space comparison

Since the analysis showed that the MDF space should perform better than the MEF space (or the image space directly), a study was performed to demonstrate this fact. Furthermore, though the hierarchy described provides efficiency to both the learning and retrieval phases, the recognition performance may suffer somewhat from not examining all possibilities in the database. Therefore, a study was also performed to examine the performance difference between using a "flat" database, in which all images in the database are examined for possible matches, and the described hierarchical database. Finally, the recognition performance will be dependent on whether a single MDF projection is used, performing the space tessellation in this single space, or whether a new MDF projection is performed at each node of the tree. A study examining the performance between these two modes was also done.

The training images come from a set of real-world objects in natural settings. At least two training images from each of 36 object classes were provided; a disjoint set of test images was used in all of the tests. For each of the tests performed, the identical set of training images and test images were used. The training images used are given in Figure A.1.
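For reference in the experiments that follow, which all report a number of competing nodes explored at each level, the sketch below illustrates the two retrieval parameters just described: the k competing paths of section 4.2.2 and the expected radius of section 4.2.3. It is a simplified illustration only; the node objects, their attributes, and the distance callback are hypothetical, not the actual SHOSLIF-O data structures.

```python
import heapq

def expected_radius(level):
    """Expected radius r(l) = 1.3 ** (36 - l); about 12646 at the root (l = 0),
    large enough to cover all samples in the data sets used here."""
    return 1.3 ** (36 - level)

def k_path_search(root, probe, distance, k=10):
    """Descend the space-tessellation tree keeping the k closest nodes per level.
    `root` is assumed to expose .children (list) and .is_leaf; `distance(probe, node)`
    can be any of the measures of section 4.2.1 (DFS in the final system)."""
    frontier, leaves = [root], []
    while frontier:
        candidates = []
        for node in frontier:
            for child in node.children:
                candidates.append((distance(probe, child), child))
        best = heapq.nsmallest(k, candidates, key=lambda t: t[0])  # k best at this level
        leaves.extend(c for _, c in best if c.is_leaf)
        frontier = [c for _, c in best if not c.is_leaf]
    return sorted(leaves, key=lambda c: distance(probe, c))
```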
Pixel weighting was engaged, and the images provided for the training set show the results of the pixel weighting. For those tests where subspaces are used, the Distance From Subspace (DFS) distance metric was used, and fifteen nodes were explored at each level when using the tree structure.

Effects of the Feature Spaces and the Tree Hierarchy

The results of the studies are summarized in Table 4.3.

                                             Flat        Single subspace   Multiple subspaces
                                             database    with tree         with tree
Image space   Correct class is top choice     90%         86%               86%
              Correct class is in top 15     100%         90%               90%
MEF space     Correct class is top choice     90%         76%               86%
              Correct class is in top 15     100%         83%               90%
MDF space     Correct class is top choice     90%         89%               97%
              Correct class is in top 15     100%        100%              100%

Table 4.3: Results of subspace comparison study. For the "flat" databases, only a single level of the tree was created. For the single projection databases, the MEF/MDF projection matrices were only computed at the root node, and the same projection matrices were used throughout the tree. For the multiple projection trees, a new projection was made at each node of the tree.

The test probes and their top five responses for the multiple-projection experiment are given in Figure A.2. Note that a band of fifteen competing paths was explored, but only the top five responses are given in this figure.

As expected, the data in Table 4.3 shows the MEF subspace tracing the same sorts of responses that the original image space produces. This is expected because the MEF subspace is a good representation of the original image space, and can therefore be used as a compact means for storing images. The data also shows, however, that the MDF subspace can outperform the MEF subspace. This is also expected, since the MDF subspace utilizes more information supplied by the user in the form of the image labels.

The MDF performance for a flat database shows the same results as the MEF performance. This could be due to the fact that there was not enough within-class variation in the training samples to capture the variation present in each class. But it may be more likely that the classes in a flat database were not linearly separable. This is supported by the data, because when a single projection is done and a space tessellation tree is produced based on that single projection, the rate drops off a little; when multiple projections are done throughout the tree, the rate improves significantly. The data lends credence to the claim that the multiple projections in the tree do indeed reduce the problem to a smaller, more manageable size, and can therefore successfully separate the classes more easily as the processing moves down the tree. The MDF subspaces generated at the various nodes of the tree are adaptive in that they optimally separate the classes for the samples contained in the node in the sense of a linear transform. As the processing moves along the nodes of the tree, a different set of features tuned specifically for those samples contained in the node is utilized for subclass selection.

A Comparison Between the MEF and the MDF feature spaces

A further experiment was performed to demonstrate the relatively small number of MDF vectors required to produce good recognition results versus the number of MEF feature vectors. In this experiment, we compare the performance of the system using the MEF space alone versus using the MDF space.
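As background for this comparison, the sketch below shows one standard way to compute the two kinds of features from a labeled training set: MEF vectors as the principal components of the total scatter, and MDF vectors from a within-class versus between-class scatter eigenproblem solved in the MEF coordinates. This is a generic illustration, not the exact DKL procedure of chapter 3; the dimensions, the random data, and the 95% variance cutoff are only examples.

```python
import numpy as np

def mef_basis(X, var_keep=0.95):
    """MEF (principal component) basis retaining var_keep of the total
    variance. X is (n_samples, n_pixels); columns of the result are MEFs."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(cum, var_keep)) + 1
    return Vt[:k].T

def mdf_basis(Y, labels, n_mdf):
    """MDF (discriminant) basis in MEF coordinates Y (n_samples, k): directions
    that maximize between-class scatter relative to within-class scatter."""
    mean_all = Y.mean(axis=0)
    dim = Y.shape[1]
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)
        Sb += len(Yc) * np.outer(mc - mean_all, mc - mean_all)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_mdf]].real

# Illustrative use on random data standing in for vectorized fovea images.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5632))          # 60 images of 88 x 64 = 5632 pixels
labels = np.repeat(np.arange(30), 2)     # two images per class (illustrative)
V = mef_basis(X)                         # MEF projection matrix
W = mdf_basis((X - X.mean(axis=0)) @ V, labels, n_mdf=5)
```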
The number of features from each subspace was varied to show how the system performed using up to 95% of the variance for the training images used. Figure 4.9 shows the response rates for the system using these two subspaces.

The images for this experiment came from the Weizmann Institute and contained well-framed faces with different expressions taken under different lighting conditions. Each individual in the set of images used had two expressions; each expression image was taken with three different lighting conditions. This set of images seemed particularly suited to testing the claim that the MDF feature space could outperform the MEF feature space because of its ability to discount expression changes and imaging artifacts such as lighting direction. The major difference between the MEF and the MDF performances is due to the fact that for this experiment, we had enough training images with sufficient within-class variation. With fewer training images, or training images that do not sufficiently capture the desired variations, the performance of the MEF and the MDF spaces will come closer together. An example of the pool of available images for a sample of individuals is given in Figure 4.10. The training for this experiment utilized a flat database to demonstrate the superiority of the MDF subspace over the MEF subspace without the (possibly) confounding factor of the tree.

Figure 4.9: The performance of the system for different numbers of MEF/MDF features. The number of features from the subspace used was varied to show how the MDF subspace outperforms the MEF subspace. 95% of the variance for the MDF subspace was attained when fifteen features were used; 95% of the variance for the MEF subspace did not occur until 37 features were used. Using 95% of the MEF variance resulted in an 89% recognition rate, and that rate was not improved using more features.

For this experiment, a disjoint test set was utilized. This test set was formed by randomly choosing one image from the set of images available for each individual. Therefore, each individual was trained with all of the different lighting conditions for one expression, but not for the expression image in the test set. The results in Figure 4.9 show that the MDF subspace performs better than the MEF subspace. With eight MDF feature vectors being used, the system retrieved the correct result 100% of the time; the MEF performance when eight features were used retrieved the correct result in only 71% of the cases. At fourteen feature vectors, the MDF feature space retrieval rate remained at 100%, and the MEF retrieval rate was at the 82% level.

Figure 4.10: A sample of the Weizmann Institute face data. The frontal images for an individual contain two expressions; for each expression, a set of three images with differing lighting conditions forms the set of images available for an individual.
The MEF retrieval rate went up to 89% when 24 feature vectors were active, and did not improve when more features were used.

4.3.2 The Clustering Effect of the MDF subspace using the DKL projection

To show how the MDF subspace effectively discounts factors unrelated to classification, Figure 3.12 shows samples of the data in the space of the best (in terms of the largest eigenvalues) two MEFs and MDFs for the experiment summarized in Table 4.6 for the root node of the tree. From Figure 3.12, it is clear that the MDF subspace has a significantly higher capability than the MEF subspace to correctly classify an input image that is considerably different from those used in training, because classes in the MDF subspace have larger between-class distance and smaller within-class scatter.

The results shown in Table 4.4 demonstrate a better result using the MDF subspace than the MEF subspace. For this experiment, a total of 38 images were used. These images came from 11 face classes (i.e., individuals) taken from the Michigan State University Pattern Recognition and Image Processing face database, and eight non-face objects. There was a significant amount of contrast change in the face images. The non-face images came from outdoor scenes.

                  Training sample size   Test sample size    Correct classification   Correct classification
                                                             using MDF                using MEF
Faces             22 (2 samples each     11 (one from        100%                     91%
                  from 11 classes)       each class)
Other objects     16 (2 samples each     8 (one from         100%                     88%
                  from 8 classes)        each class)

Table 4.4: Results showing the superiority of the MDF subspace for subclass selection.

The power for handling within-class variation using the MDF subspace

The capability of the system for handling large within-class variation is demonstrated in Figure 4.11. Each of these search probes retrieved samples from the correct class defined by the training exemplars. This example supports the claim that those features unimportant for subclass selection are weighted appropriately by the Most Discriminating Feature selection process. For example, the change in the shape of the mouth among the different expressions is unimportant for determining to which subclass these images belong. As long as the variation in a class is present in the training set for MDF feature computation, the approach performs quite well.

Figure 4.11: Example of how well within-class variation is handled. All search images were correctly classified into the class defined by the training images.

4.3.3 Face database

In order to compare with the results of others in the community utilizing eigenfeature methods, a test was performed on a database comprised of only faces. In this experiment, when an image that was used for training was also used as a test probe, such as is done with Photobook [123] [122] [113] (i.e., the re-substitution method), the SHOSLIF always retrieved the correct image as its first choice 100% of the time. In re-substitution, we have an exact match in the database, so the first choice will always retrieve the correct image. The second best match from the database using this re-substitution method also came from the correct class for 98% of the test probes in a database of 1042 face images. Table 4.5 shows a summary of the results obtained both by re-substituting the training samples as test probes and by using a disjoint set of images for testing.
Number of training images                       1042 of 384 individuals
Number of nodes in generated tree               1761
Number of nodes expanded at each level          10
Average number of nodes explored per probe      102
Re-substitution:
  Correct retrieval is top choice               100.0%
  Correct retrieval is second choice            98%
Disjoint test set:
  Number of test images                         246 of 246 individuals
  Correct retrieval is top choice               95.5%
  Correct retrieval is in top 10                97.6%

Table 4.5: Summary of experiment on a face database of 1042 images (384 individuals). For re-substitution, each training image was given as a test probe. For the disjoint test set, a list of 246 images not found in the training set was used for testing.

The 98% correct match for the second response in the re-substitution method is comparable to Photobook's response rates for face images alone [113]. The images in the face database are shown in Figure B.1. The face database was organized by individual; each individual had a pool of images from which to draw training and test data sets. Each individual had at least two images for training with a change of expression. The images of 38 individuals (182 images) came from the Michigan State University Pattern Recognition and Image Processing laboratory. Images of individuals in this set were taken under uncontrolled conditions, over several days, and under different lighting conditions. 303 classes (654 images) came from George Mason University under an Army Research Labs contract. All of these classes had at least two images of an individual taken under controlled lighting, with a change of expression. 24 of these classes had additional images taken of the subjects on a different day with very poor contrast. Sixteen classes (144 images) came from the MIT Media Lab under identical lighting conditions (ambient laboratory light). Twenty-nine classes (174 images) came from the Weizmann Institute, and are images with three very controlled lighting conditions for each of two different expressions.

4.3.4 Combination database: Faces and Other Objects

To show the general applicability of the method, we have trained the system on a diverse set of objects from natural scenes, ranging from human faces to street signs to aerial photographs. The database is a combination of those images shown in Figures A.1 and B.1. A list of some examples from the various classes learned is given in Figure 4.12. Most classes in the database were represented by two images, and 13% of the classes had three or more images, up to twelve for some objects (e.g., fire hydrant). Each image consisted of a well-framed object of interest. The different images from each class were taken either in a different setting or from a different angle; where possible a change in the lighting arrangement was used to provide variation in the training images.

Figure 4.12: Representatives of training images from some of the various classes learned (car, csign, chair, mouse, cup, dog, handle, egg, isign, monitor, phone, renault, seated, sharpener, sidewalk, dbody). Fovea images are shown without the pixel weighting; this pixel weighting will tend to suppress the background. Objects of interest are at the center of the fovea images. In the learning phase for this experiment, the fovea was generated using manual extraction of the areas of interest.

Following training, the system was tested using a test set completely disjoint from the training set of images. A summary of the makeup of the test and the results are shown in Table 4.6.

Table 4.6: Summary of large natural scene experiment. The training images were drawn at random from the pool of available images, with the remaining images serving as a disjoint set of test images.

The instances where the retrieval failed were due in large part to significant differences in object shape and three-dimensional (3D) rotation. Figure 4.13 shows an example of a face image which the system failed to properly retrieve. This search probe failed because the training data for this class did not include any 3D object rotation and the search probe did. In fact, when the test image was swapped with one of the training images, a correct retrieval was attained. The recognition results for this system rely heavily on the assumption that the training images are representative of images which will be seen during the recognition phase. Since the SHOSLIF network is open-ended, it allows the system to continuously learn and improve while the low logarithmic complexity makes learning a huge number of samples practical.

Figure 4.13: Example of a failed search probe ((a) search probe; (b) training images). The retrieval failed to select the appropriate class due to a lack of 3D rotation in the set of training images.

4.3.5 Handling 2D variation

Since the attention mechanism described in [159] provides a well-framed object of interest in a fovea image, the problem of two-dimensional (2D) transformation computation becomes less significant. However, because the attention point is only a rough estimate of the position, scale, and orientation of the object of interest, a better estimate of the 2D transformation to align the test probe with the appropriate training image needs to be done. As described in section 3.2, this can be accomplished by greatly increased image acquisition for points surrounding the fixation point from the visual attention mechanism. The increase in the image acquisition is expensive in terms of time and storage space, however, and could instead be accomplished by extending the training set for a grid of points surrounding the extracted attention image in terms of the position, scale, and 2D orientation foveation parameters.

Each fovea image comes from an attention mechanism and contains a fixation point and a scale that the attention mechanism provides to the recognition module. This fixation point can be overlaid with a grid in both the position and the scale parameters, and a new fovea image extracted from the original image at each of these grid points. Thus the variation due to size and position can be handled by extracting a set of fovea images to add to the Space-Tessellation tree for each grid point surrounding the attention point and scale instead of just extracting a single image. Figure 4.14 demonstrates the variability that the system can handle by extending the training set in this manner. When the training set is thus extended, the Space-Tessellation tree grows in size, but not by the factor of the increased number of samples.

Figure 4.14: Generalization for size and position of objects. The search probes were synthetically generated from images in a disjoint test set. Each search probe retrieved an image from the appropriate class. The images shown in the first line were the original images. The scaling shown is 50% of the fovea size (both positively and negatively); the position change shows a 12% negative change in the fixation point for both the row and the column coordinates.
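The grid-based training-set expansion just described can be sketched as follows. This is only an illustration of the idea: the crop helper, the grid spacings, and the image sizes are assumptions, not the parameters of the actual attention mechanism. For each original fovea, additional foveas are extracted on a grid of positions and scales around the attention point.

```python
import numpy as np

def expand_training_set(image, fix_row, fix_col, fovea_size,
                        pos_offsets=(-0.12, 0.0, 0.12),      # fraction of fovea size
                        scale_factors=(0.85, 1.0, 1.15)):    # 15% scale steps
    """Extract one fovea per (position, scale) grid point around the attention
    point; P = len(pos_offsets)**2 and S = len(scale_factors)."""
    foveas = []
    for s in scale_factors:
        half = int(round(fovea_size * s / 2))
        for dr in pos_offsets:
            for dc in pos_offsets:
                r = int(round(fix_row + dr * fovea_size))
                c = int(round(fix_col + dc * fovea_size))
                patch = image[max(r - half, 0):r + half, max(c - half, 0):c + half]
                foveas.append(patch)      # in practice, each patch is rescaled to 88 x 64
    return foveas

# Illustrative use: a 3 x 3 position grid and 3 scales give 27 foveas per image.
img = np.zeros((480, 512))
crops = expand_training_set(img, fix_row=240, fix_col=256, fovea_size=100)
print(len(crops))   # 27
```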
If we assume that the search tree provides O(log s) levels for s training samples, then the tree has $k \log s$ levels for some constant $k$. Now if the number of training samples is expanded by P to handle the positional variation, and by S to handle the scale variation, then we have a total of PSs samples to put into a tree. Then there will be

$k \log(PSs) = k(\log P + \log S + \log s)$   (4.1)
$= k' \log s$   (4.2)

levels in the Space-Tessellation Tree, for some constant $k'$, which is still O(log s) levels in the tree. Typical values for P will be P = 9 or 25 for a 3 x 3 or a 5 x 5 grid surrounding the attention point, respectively; S will typically be S = 3 or 5 to deal with the various scales.

Although the number of levels in the tree remains O(log n), the number of nodes in the tree will increase nearly linearly with the extension of the training set. For example, the number of nodes in a complete binary tree with n leaf nodes is 2n - 1 [40], or roughly n internal nodes. Our trees will have fewer total nodes because the fan-out at each node is not restricted to a binary fan-out, but the order will still be linear. We have been using 1316 training images for our large combination data set, each with dimension 88 x 64 = 5632 pixels. Expanding this to P = 3 x 3 = 9 position grid points and S = 3 scale grid points gives us n = 35,532 training images to use. If we retain only k MDF vectors at each node, then in order to project to the MDF space, we need to retain k + 1 vectors of dimension 5632, one for each MDF vector and one mean vector, each of these a floating point vector (i.e., 4 bytes per entry). Since only the internal nodes require projection matrices, we would need to store roughly 35,532 nodes capable of projecting test probes to the local MDF space. Using k = 5 MDF vectors, we would require roughly 4.8 GB of storage, which we do not have in one place in Michigan State University's Pattern Recognition and Image Processing laboratory at the time of this writing.

Therefore, in order to test this generalization to size and position that the SHOSLIF paradigm can produce, we have tested the system on two different data sets. The first data set is the full combination data set described in section 4.3.4 with more than 1300 original training images. But the positional variation was not included in the test due to the space considerations described above. Instead, we ran the experiment using P = 1 (i.e., on the attention mechanism's attention point alone) and S = 3 (i.e., 3 grid points surrounding the attention mechanism's scale). For training, we set the scale span to be 30% of the fovea size. That is, each grid point represented a 15% change in the fovea size: one training image at a 15% smaller scale than the attention mechanism dictated, one at the attention mechanism's specified scale, and one at 15% larger than the attention mechanism's specified scale. For testing purposes, we took the disjoint test set described in 4.3.4 and generated a set of test images from this set with a random scale change in the range of [-20, +20]% of the fovea size. The characterization and results of this test are shown in Table 4.7. The table shows the difference between the performance when the scaling was built into the training phase and when it was not.
As can be seen from the data, for the test probes, either the attention supplier must correspond to the scaling done in the training phase, or the training set must be expanded to include those images in the range of scales that need to be properly retrieved.

Number of training images                   1316 (expanded to 3948)
Positional enhancement                      0% of fovea size
Scale enhancement                           15% of fovea size
P                                           1
S                                           3 (3 grid points)
Number of image probes                      461
Number of nodes in generated tree           6324
Number of nodes expanded at each level      ≤ 15
Tree built with no scaling in training phase:
  Top choice correct                        47.9%
  In top 15                                 64.6%
Tree built with 15% scaling in training phase:
  Top choice correct                        93.3%
  In top 15                                 98.7%

Table 4.7: Summary of the scale generalization experiment on the large combination face and other object data set. A disjoint test set was used for testing the retrieval capability; each test probe was randomly scaled in the range of [-20, +20]% of the fovea size to test the ability of the system to generalize over various scales.

Figure 4.15 shows example test probes and the images retrieved when scaling was enabled for the training phase and when it was not built into the tree. The data contained in the tree structure for this experiment required roughly 430 MB of storage. Utilizing the system on the large combination data set shows the ability of the system both to operate on large image database sizes and to handle a specified variation in the scale of the extracted area of interest for an attention point and scale.

Figure 4.15: Example probes and their retrieved images when scaling was enabled on the tree and when scaling was disabled on the tree during the training phase. (a) Sample test probe and retrieved images from the database when scaling was disabled during the training phase. (b) Sample test probe and retrieved images from the database when scaling was enabled during the training phase.

To show the positional extension of the training data, a smaller original training set was used. For the second experiment in the demonstration of the system's ability to handle 2D parameter changes, the data set described in section 4.3.1 was used. Table 4.8 shows the data pursuant to this experiment.

Number of training images                   86 (expanded to 2322)
Positional enhancement                      20% of fovea size
Scale enhancement                           20% of fovea size
P                                           9 (3 x 3 grid)
S                                           3 (3 grid points)
Number of image probes                      29
Number of nodes in generated tree           4634
Number of nodes expanded at each level      ≤ 10
Top choice correct                          93.1%
In top 10                                   96.6%

Table 4.8: Summary of the learning-based parameter generalization experiment. A disjoint test set was used for testing the retrieval capability.

When no training set expansion was performed, a total of 222 nodes were generated from the 86 training images. We know from Algorithm 1 that there is a leaf node for every training sample. Therefore, we know that there are 136 internal nodes in the tree. If we assume that we have a complete k-ary tree, we know that the number of internal nodes is $\frac{k^h - 1}{k - 1}$ for tree height h [40]. If we know that there are exactly s leaf nodes for s training samples, we have that the number of internal nodes is $I = \frac{s - 1}{k - 1}$. So we have, for s = 86 and I = 136,

$I = \frac{s - 1}{k - 1} \quad\Rightarrow\quad k = \frac{s - 1}{I} + 1 = \frac{85}{136} + 1 = 1.625.$

Using this base, we would expect that the expanded training set of 2322 images would produce a tree with $\frac{2322 - 1}{1.625 - 1} \approx 3714$ internal nodes.
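This estimate can be checked with a few lines of arithmetic; the sketch below simply reproduces the calculation above and assumes, as the text does, a complete k-ary tree with one leaf per training sample.

```python
def effective_branching(n_leaves, n_internal):
    """Solve I = (s - 1) / (k - 1) for k, given s leaves and I internal nodes."""
    return (n_leaves - 1) / n_internal + 1

def estimated_internal_nodes(n_leaves, k):
    """Expected internal node count (s - 1) / (k - 1) for a complete k-ary tree."""
    return (n_leaves - 1) / (k - 1)

k = effective_branching(86, 136)                 # 85/136 + 1 = 1.625
print(k, estimated_internal_nodes(2322, k))      # 1.625, about 3714
```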
The data in Table 4.8 shows that 4634 nodes were generated from the expanded set of 2322 training images in the tree, or 2312 internal nodes, even less than our estimate. Since the tree with the expanded training set was built separately from the original tree, a slightly different tree shape is expected.

4.3.6 Handling different 3D views

We want to determine whether the system can handle variation in 3D orientation. For this experiment, we used the Weizmann Institute face database. This database was well-suited to test the handling of 3D rotation because for each of 29 individuals, five camera viewpoints were available under identical lighting conditions. A sample of the available images is shown in Figure 4.16. When the training set contains a representative set of views of the objects under consideration, the system is able to successfully find objects from a novel view, as shown in Figure 4.17. The image pool for a particular individual consisted of images taken from five viewpoints under identical lighting and expression conditions. A total of four views from each individual were used for training, with the remaining view left out to form a disjoint test set. We used a disjoint test set for determining the accuracy of the learning-based view generalization. The results of this experiment are summarized in Table 4.9.

Figure 4.16: A sample of the Weizmann Institute face data. Each individual for this experiment contained five viewpoints under identical lighting conditions.

Figure 4.17: View-based generalization. When sufficient training images are given for a particular class, the system is able to accurately retrieve images from the correct class.

Number of training images                       112
Number of image probes                          28
Number of internal nodes in generated tree      177
Number of nodes expanded at each level          ≤ 10
Average number of nodes explored per probe      28.5
Correct class is top choice                     78.9%
Correct class is in top 10 choices              89.4%

Table 4.9: Summary of the learning-based view generalization experiment. A disjoint test set was used for testing the retrieval capability.

Though Table 4.9 shows favorable results, 100% accuracy was not achieved. The failures occurred where the test probe viewing angle did not fall between two training sample viewing angles, as shown in Figure 4.18. Because of the open-ended nature of the SHOSLIF system, however, mistakes such as these can be identified by a human operator, and the system can incorporate this new knowledge. The system is able to easily and efficiently incorporate new knowledge because of the logarithmic complexity of adding new samples to an existing tree, provided by the incremental update mechanism.

Figure 4.18: Example failure in the learning-based view generalization experiment. The failures occurred only when the viewing angle of the test probe did not fall between the viewing angles of two training images. These images are courtesy of the Weizmann Institute.

4.3.7 Neighborhood information

In the experimentation described so far, much image information is thrown away in the form of a pixel's location and its neighborhood. The distances between pixels are unused, since each pixel is treated as another dimension in the high-dimensional space when the image is viewed as a vector. Two approaches could be considered to utilize the neighborhood information: multi-level resolution and blurring. Multi-level resolution involves a stack of images, each level of which is a subsample of pixels at the next lower level, with the lowest level being the original image, as shown in Figure 4.19.

Figure 4.19: A stack of multi-level resolution images. Each level is a subsample of pixels from a group in the next lower level.

The problem with this multi-level resolution stack is that only the neighborhood within the defined group is utilized. Neighboring pixels which are outside the group are not utilized for higher-level computations.

An alternative to this multi-level resolution stack is the use of blurring to maintain neighborhood information. With blurring, every level of the image stack can consist of the same number of pixels. A higher level in the image stack has more blurring; it is created by replacing every pixel in the image with a linear combination of its neighborhood pixels from the lower level, as shown in Figure 4.20. So in some ways, blurring is superior to the multi-resolution image stack because neighborhood information is not discarded at any level.

Figure 4.20: A stack of blurred images. The pixels in each level correspond to a linear combination of neighborhood pixels at the next lower level.

At levels near the root of the tree, a coarser classification is being done, and the level of detail necessary for the leaf node classifications is not necessary. We experimented using a stack of blurred images to traverse the Space-Tessellation tree in both the learning and the recognition phases. At levels of the tree near the root, we used the most blurred images in order to provide the most global information to this coarse classification. As we move away from the root node, we used images with less and less blurring applied to them to provide more details for the lower levels of the tree. An example of the blurring applied to a face image and a non-face image at the various levels is shown in Figure 4.21.

Figure 4.21: Examples of blurring applied to different images. At levels of the tree nearer the root, more blurring is applied to both the training and the test images. As the processing moves away from the root node of the tree, less and less blurring is applied to provide more detail for the finer classifications. Shown here are examples of images with progressive blurring applied to them.

Using this approach on the large data set described in section 4.3.4 gives the results shown in Table 4.10. These results, though not as good in performance as those achieved without blurring at all, show that a smaller tree was generated. Furthermore, at each node of the tree, a smaller number of MEF and MDF vectors were required to produce 95% of the variance for the samples contained in the node.

Table 4.10: Summary of blurring experiment.
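A minimal sketch of the blurred image stack described above is given below. It is illustrative only; the 3 x 3 averaging weights and the number of levels are assumptions, not the values used in the experiments. Every level keeps the full pixel count, and each level is produced from the one below it by replacing every pixel with a linear combination of its neighbors.

```python
import numpy as np

def blur_once(img):
    """Replace each pixel by the average of its 3x3 neighborhood (edges are
    handled by padding); this is one linear-combination step."""
    padded = np.pad(img, 1, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out += padded[1 + dr:1 + dr + img.shape[0], 1 + dc:1 + dc + img.shape[1]]
    return out / 9.0

def blurred_stack(img, n_levels=5):
    """Level 0 is the original image; each higher level is a further-blurred
    copy used nearer the root of the space-tessellation tree."""
    stack = [img.astype(float)]
    for _ in range(n_levels - 1):
        stack.append(blur_once(stack[-1]))
    return stack

fovea = np.random.default_rng(2).normal(size=(88, 64))   # stand-in fovea image
levels = blurred_stack(fovea)
print(len(levels), levels[-1].shape)    # 5 (88, 64): same size at every level
```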
For the first ten MDF nodes created (i.e., those closest to the root of the tree, with the most variation) without blurring, an average of 19.4 MDF vectors was produced at each node. For the first ten MDF nodes created when blurring was engaged for just five levels of the tree, an average of 12.6 MDF vectors was produced. The reason for the reduced performance is the fact that in some cases, the class separation was done before five levels of the tree were formed. In that case, blurred images were used for projection into the subspaces, and the distances were based on the blurred images rather than the original resolution ones.

4.3.8 Timing

The hierarchical organization of the database inherent in the SHOSLIF-O paradigm provides the means for efficient retrieval of images from the database. The tree structure provides O(log n) access time for n samples in the database. To demonstrate this on some real data, we provide the timing results shown in Table 4.11. The timing data for the tree version shows that the average time required for a test probe to obtain retrieval results is linear in the number of competing nodes. The data in this table is plotted in Figure 4.22. The figures compare the average time required for a single test probe in a "flat" database (i.e., one with no hierarchy; all the samples are in a single pool) with the time required using the same samples in the SHOSLIF hierarchy.

Average timing data per test probe

Tree mode:
  Number of competing nodes    Average seconds required per probe
  2                            2.1865772 sec
  3                            2.8681208 sec
  4                            3.5721477 sec
  5                            4.2000000 sec
  6                            4.8077181 sec
  7                            5.3697987 sec
  8                            6.0080537 sec
  9                            6.5325503 sec
  10                           7.1597315 sec
  11                           7.5885906 sec
  12                           8.0248322 sec
  13                           8.3714765 sec
  14                           8.7355705 sec
  15                           9.1271812 sec

Non-tree mode (flat database):  400.7245 sec

Table 4.11: Average time required to perform a single probe in Tree mode and Non-tree mode. In contrast to the rapid retrieval of 9.1 seconds for images from the tree version using fifteen competing paths, when a "flat" database is used, i.e., one in which no hierarchy is utilized, a test probe into the same data requires an average of 400.7245 seconds to complete.

The training and test images for both organizational schemes came from the large combination database described in section 4.3.4. The tree structure can have multiple competing paths, whereas the flat database structure will examine the same number of possibilities in all cases. So the tree database runs show the difference in timing over the different numbers of competing paths. Recognition results from the tree structure using these different numbers of competing paths are shown in Figure 4.8. For the flat database, the retrieval rate was as shown in Table 4.12.

Figure 4.22: Timing results obtained from running the system in the tree mode. This data shows the linear relationship between the time required to complete a test probe retrieval and the number of competing paths. This linear relationship demonstrates how the k competing paths explore only a band of the tree.
Number of training images              1316 from 526 classes
Number of test images                  298 from 298 classes
Correct retrieval is top choice        88.9%
Correct retrieval is in top 10         96.2%

Table 4.12: Recognition results using the flat database for the combination database.

So the hierarchical structure in the SHOSLIF paradigm not only provides a means for improving the correct retrieval rate from a database of objects, but it also provides the low computational complexity to allow a large database of objects to be used as a query database.

4.4 Summary

In this chapter we describe our experimentation using the SHOSLIF paradigm applied to object recognition (SHOSLIF-O). We describe two basic modes of operation: the manually-defined set of hierarchical labels for characterizing the tree, and the automatic space tessellation mode, where only a single (lowest-level) label is given to a training sample image.

In the manually defined set of hierarchical labels, the operator supplies a hierarchical set of labels to define the structure of the tree (and what samples go into which nodes). The MDF computation is done using the grandchildren of a particular node as sample points. The results show that allowing the system to determine its own tree structure is a better approach.

In the automatic space tessellation mode of operation, there are several options available to the user. We investigated several distance measures and provide retrieval results obtained using these different measures. We have found that the Distance from Subspace (DFS) measure is theoretically and practically the measure that we should use. In this same vein, we also experimented with maintaining a historical deviation from the node centers in the path traversed to the leaf nodes of the tree. We discovered that when a proper distance measure was utilized, this historical deviation was not needed.

In our experimental results, we first compare the recognition results obtained using a "flat" database (i.e., one with no hierarchical structure imposed), a tree structure in which a single projection was done and the space tessellation obtained using this single projection, and the case where multiple projections were done in the tree, one for each node. The data supports our claim that the multiple projection hierarchy can divide the problem into smaller, solvable pieces and can therefore improve the recognition rate. We also compared the performance of the image space directly and the MEF and MDF subspaces under these different modes of database organization. Our data obtained for these tests also supports our claim that the MDF space can provide better results than the MEF space does. We show the clustering effect that the MDF subspace has, and demonstrate the power that the DKL projection that produces the MDF space has for handling within-class variation in a particular class.

In order to compare with the results of others in the community utilizing eigenfeature methods, we performed a test on a set of data comprising only faces. We produced results consonant with those obtained with Photobook [123] (100% same-image retrieval; 98% accuracy in retrieving the correct class as the second image) with the re-substitution paradigm (as Photobook uses), and produced results nearly as good using a disjoint test set (96% accuracy as the top choice; 98% for the correct choice being in the top 10 responses).
To show the utility of the system on a wide range of objects, we tested the system on a combination database, comprised of both faces and non-face objects taken from natural scenes. These objects ranged from human faces to street signs to dogs to aerial photographs. The recognition results obtained using this database were very good: 95% correct top choice response; 99% correct class in the top 10 responses.

In order to handle 2D variations, such as size and position, we experimented with simulating more image acquisition by extending the training set. We impose a grid over the fixation point and scale and extract multiple fovea images from the original image. We show results that demonstrate the correct retrieval of images from different positions and scales. We demonstrate how the system is able to successfully find objects from a novel 3D view, provided that the training set of images contains a representative set of views of the objects under consideration.

We incorporated neighborhood information by blurring the images at higher levels of the tree, producing less and less blurring as we moved down the tree. Though this did not improve our recognition rate (in fact, it actually brought it down somewhat), it did make for a more efficient tree and a faster retrieval. And finally, we provided some timing data for the system to demonstrate the order of magnitude difference between utilizing the hierarchical structure and using a simple flat database.

Chapter 5 Summary and Future Research

This chapter summarizes the results of the research described in this document, and provides an outline for future research directions.

5.1 Summary

We have developed an object recognition system that performs automatic feature selection and extraction and utilizes a hierarchical database organization to provide efficient retrieval of images. The system generalizes an image training set to handle size and position variations, handles a variety of objects from natural scenes, and identifies a visual attention point in an image for further exploration.

The automatic hierarchical discriminant analysis method used in this work recursively decomposes a huge, high-dimensional nonlinear problem into smaller, simpler, tractable problems. We study the Most Expressive Features, or principal component features, for class characterization. These features are demonstrated by other researchers and this study to be good at efficiently describing a class and can be used for image reconstruction. We discuss the Most Discriminating Features, or discriminant analysis features, for subclass selection, and demonstrate their superiority over the Most Expressive Features for subclass selection. We discuss the new DKL projection for dealing with discriminant analysis in high dimensions. The MDF subspace is an effective way of automatically selecting features while discounting unrelated factors present in the training data, such as illumination variation, face expressions, etc.

The hierarchical Quasi-Voronoi tessellation provides a means for dividing the space into smaller pieces for further analysis. This allows the DKL projection to produce better local features for more efficient and tractable subclass separation. We discuss problems with the Space-Tessellation tree, such as the Adoption Problem, and describe effective ways to circumvent these problems before they arise.
The Space-Tessellation Tree introduced in this work provides an average time complexity of O(log n) for a search for the best match in a model database of n objects. This low complexity opens the door towards learning a huge database of objects. Our experiments with a large number of classes of objects are made possible by the efficient Space-Tessellation structure. In order to maintain a given retrieval rate, however, it may be necessary to increase the number of competing paths k if the number of images increases. It is not clear whether this k must be a function of n to keep a given retrieval rate; this issue should be investigated further.

We have demonstrated the described system for a manually-defined set of hierarchical labels associated with each search probe. We discuss the parameters pursuant to the automatic space tessellation using the techniques of chapter 3. We discovered that the use of the "Distance from Subspace" metric produced not only theoretically but experimentally good results. We experimented with various numbers of competing paths to determine the benefits of maintaining multiple paths through the tree juxtaposed with the computational expense of pursuing them, and arrived at the conclusion that ten competing paths produced the best recognition performance.

The experimentation compares using the different spaces, i.e., the image space itself, the MEF subspace, and the MDF subspace. A comparison between a flat and a hierarchical database organization was made to show how the hierarchical decomposition of the problems produces better results than can be achieved using a single level. We compare our method to the results of others in the community who are using eigenfeature methods, such as Photobook [123], and show a comparable result to their work. To show the general applicability of the method, we trained the system on a diverse set of objects from mostly outdoor natural scenes. Using a disjoint test set, we found that the best response from the system was from the correct class in 95% of the cases, and was in the top ten for 99.0% of the 300 test cases.

We have demonstrated how the MDF subspace can effectively discount those factors unrelated to subclass selection, significantly better than using the principal component features alone. Furthermore, although the MDF subspace does well in dealing with large variation within a particular class, the Space-Tessellation Tree provides a mechanism for dealing with the problem of many classes in the database by reducing the complexity of a nonlinear class separation problem into smaller, manageable problems that are linearly separable.
The system described has several inherent strengths and limitations, which are summarized in Figure 5.1. Although the system might be extended to other sensing modalities, the current system is view-based: in order to retrieve a correct image from an image database, the SHOSLIF requires that the system has been trained on an image taken at a very similar viewpoint (i.e., within a few degrees), as is demonstrated by Figure 4.18. This limitation is typical of approaches that are based on a pixel-level representation. However, the generality of the representation, effected by selecting the features directly from the intensity pixel values, is a fundamental strength of this approach. The only representation used is the digital image. This representation itself does not incorporate knowledge, which may be seen by some as a weakness. The incorporation of knowledge in the representation scheme, however, is also a limiting factor. Typically knowledge that is incorporated into the system design is not generally applicable. The resulting system tends to be brittle under general settings. The view-based method can deal with intensity images directly, rather than using other, limiting modalities such as range scanners. This requires many views for nontrivial problems, however. The generality of the real-world problems that can be handled by this system is another of its major strengths; the variation handling, however, has not been tested extensively. The system requires well-framed images with non-occluded objects. Occlusion and segmentation must still be investigated. And although the system does not cover high-level reasoning, the hierarchical structure provides for both a logarithmic computational complexity and a problem decomposition made possible by the tree structure. Object recognition systems typically have a reject option to indicate that an object was not retrieved from the database. The SHOSLIF-O could learn a reject distance threshold value to provide this functionality. In this work, however, all of the top k candidates were provided to the user.

Figure 5.1: Major strengths and limitations of the described system.
  - Strength: The generality of the representation obtainable by selecting features directly from intensity image pixel values.
    Limitation: The only imposed representation is the digital image.
  - Strength: The exclusion of specific domain knowledge in the system design and implementation phases; the capability of the system to handle multifarious variations without system modification.
    Limitation: Variations must be covered in the training samples.
  - Strength: The method is view-based to deal with images directly rather than using other sensing modalities of limited applicability (such as range scanners).
    Limitation: The system requires many views as training input for nontrivial problems.
  - Strength: The generality of real-world problems that can be handled by the system.
    Limitation: The system requires well-framed images.
  - Strength: The logarithmic complexity and the class-separation problem decomposition made possible by the tree structure.
    Limitation: Occlusion has not been investigated, but will not be handled well without additional attention mechanisms.
  - Strength: Labeled and unlabeled samples (i.e., supervised and unsupervised learning) can be incorporated into the same tree.
    Limitation: The system does not cover high-level reasoning at the current stage.

5.2 Future research

There are a number of issues that should be researched further.
The algorithm for determining the number of features to use in both the MEF and the MDF subspaces should be analyzed further. In our system, the operator selects a percentage P of variance to discard. Typically P = 5%, keeping a large portion of the variance in the sample set while providing a substantial reduction in the number of features used. In our experiments, retaining 95% of the variance allowed us to retain approximately half of the features on the average. However, this is simply a parameter chosen by the operator as something that works well. An investigation into the optimum number of features to retain at each node should be started.

Because of the computational complexity involved in computing the eigensystem decomposition for a large set of training images, our system takes a random sample from the training set to use to compute the eigenvalues and eigenvectors; the full training set is projected into the MEF and the MDF subspaces for the tree construction. This random sample of training images may or may not produce a good selection of features to use in the decomposition of the spaces. Because of the hierarchical nature of the system, even if a bad set of features is used at one level, as the training set is partitioned to different parts of the tree, the system will be able to find better features for classification and class characterization. However, this can lead to an inefficient tree structure, and more work must be done by the system to produce the kinds of results that we reported. We should study alternative ways for finding the optimum set of images from the training set to use in the eigensystem computation. Since this search for an optimum set of samples at each node involves an enormous search space, it seems like a good candidate for a genetic algorithm solution.

The SHOSLIF-O uses a relatively large number of features at each node, and produces a tree for which a different number of children can appear for every node. Weng and Chen [161] have explored using a much simpler binary tree structure for autonomous navigation. The technique finds a single eigenvector associated with the largest eigenvalue, and projects the mean of the training set onto that eigenvector. All of the training samples are then also projected onto the single eigenvector and compared with the mean projection. Those projections which are greater than the mean projection are assigned to the right subtree, while those less than the mean projection are relegated to the left subtree. This means that a very coarse classification is done at each level of the tree, but the single-vector projection is very fast. Furthermore, to find the single eigenvalue/eigenvector pair, a much less computationally expensive method can be utilized than is required when the complete eigensystem must be found. Therefore, the training phase becomes near-real-time as well as the recognition phase. We need to study this paradigm applied to object recognition.

For our 2D transformation handling, we simulate fixation points in our visual attention mechanism with our imposed grid in the scale and position of the object of interest. The system would perform better if instead a collection of images were taken of the object at different fixation points. Images taken at novel views of the object of interest would provide a host of further information, such as different lighting conditions and shadows.
We have done a limited number of experiments exploring how the system performs on 3D rotation of objects, predominantly faces. We should study how the system performs on a larger class of objects with 3D rotation, and we need to explore the system in detail using much larger data sets.

The SHOSLIF-O as described works on 2D grayscale intensity images. There is no inherent limitation in the SHOSLIF paradigm that restricts its use to grayscale images; a multi-spectral image, such as a color image, could be utilized in the same framework. In addition, 2.5D and 3D images could be cast into this SHOSLIF paradigm. In a 2D grayscale image, the rows are concatenated together to form a long vector. If each pixel in this 2D grayscale image were instead a 3D vector of (Red, Green, Blue), the same setup could be utilized. A range image, such as that obtained from the Technical Arts 100 White Scanner, has a z value in place of an intensity value in the image; this could be used directly in the SHOSLIF framework. Finally, stereo cameras provide a depth map, and this information could also be included in the input to the SHOSLIF-O system.

In [159], there is a description of how the SHOSLIF could perform image segmentation during the test phase at the same time that the recognition is being done. During the learning phase, the human teacher specifies the areas of interest on the training images. During the recognition phase, a size and position estimate is improved as the test probe traverses the Space-Tessellation tree. While that size and position estimate is improving, the background in the attention images is gradually masked off using the feature vector projections computed through the levels of the tree. This is done because the components corresponding to the background in each of the MEF and MDF vectors are zeros in the learning phase. Therefore, the coarse-to-fine classification masks the background pixels that may be present in the original fovea image for recognition. Finally, by backtracking through the tree, the nonzero components in the MEF space can be recorded back into the input image. This is clearly something that we should pursue further.

Finally, the visual attention mechanism should be explored further. This visual attention mechanism provides a well-framed portion of the image that contains a single object of interest in a standard position and scale. This mechanism is fundamental for the SHOSLIF to operate successfully.

To summarize, we have noted the following research topics for further study:

- Investigate the optimum number of features to retain.
- Explore an optimum sample of the training set to use for the eigensystem computation.
- Construct a binary tree based on a single eigenvector projection.
- Study the SHOSLIF-O system for 2D transformation handling and 3D rotation of objects in more detail and with larger data sets.
- Extend the work from 2D grayscale intensity images to multi-spectral images, such as color images.
- Utilize the basis vectors to produce a segmentation of the input test images.
- Develop the visual attention mechanism.

The SHOSLIF paradigm can be used for a host of applications. Some application areas being explored include object recognition [146], motion understanding [43], vision-guided robot manipulation [80], and autonomous navigation [33]. All of these application areas should be further explored to provide a unified framework for all sorts of computer vision-based applications.
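As a small illustration of the image-to-vector setup described earlier in this section (a hypothetical NumPy sketch, not part of the original system), grayscale, color, and range images all reduce to the same kind of long sample vector, so the rest of the SHOSLIF machinery would be unchanged.

```python
import numpy as np

def vectorize(image):
    """Flatten an image into the single long input vector used by the
    framework: rows are concatenated, and any per-pixel channels (e.g.
    R, G, B values, or a z value from a range image) are simply appended,
    so grayscale, color, and range images all become 1-D sample vectors."""
    return np.asarray(image, dtype=float).reshape(-1)

gray = np.random.rand(64, 64)        # 2D grayscale image
color = np.random.rand(64, 64, 3)    # each pixel is an (R, G, B) vector
depth = np.random.rand(64, 64)       # z values from a range scanner

print(vectorize(gray).shape)   # (4096,)
print(vectorize(color).shape)  # (12288,)
print(vectorize(depth).shape)  # (4096,)
```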
APPENDICES

Appendix A

Space comparison test database

This appendix contains the images used in the space comparison experiment (section 4.3.1). Pixel weighting was engaged for this experiment, and this weighting is reflected in the images displayed here. The test probes and their top five responses for the multiple-projection experiment are given in Figure A.2. Note that a band of fifteen competing paths was explored, but only the top five responses are given here.

Figure A.1: Training images for space and hierarchy study (the image classes include bike, bushes, centersign, mouse, currency, desk, door handle, egg, hand, house, lab, monitor, parkinglot, pedestrian, pentagon, phone, renault, rome, runner, ruts, seated, sharpener, sidewalk, standingdog, truck, chair, dog, and hydrant).

Figure A.2: Test images for space and hierarchy study. Each probe image is followed by the list of the top five responses using pure image space, MEF space, and MDF space. Beneath each retrieved image is the distance measure computed between the test probe and the retrieved image.
Appendix B

Face Database

This appendix contains the database of face images used for various experiments. Some of the data is from Michigan State University's Pattern Recognition and Image Processing laboratory; some come from the FERET database maintained at George Mason University under the direction of the Army Research Labs (Jonathan Phillips, program director); some come from the Weizmann Institute in Israel; some from the MIT Media Laboratory. A selection of the images for training the face database is given in Figure B.1. A selection of the test probe images used for the various experiments is given in Figure B.2.
Figure B.1: Database of face training images.

Figure B.2: Database of face test images.

BIBLIOGRAPHY

Bibliography

[1] H. M. Abbas and M. M. Fahmy, “A neural model for adaptive Karhunen-Loeve Transform (KLT),” in Proc. of IEEE Intl. Joint Conf. on Neural Networks, vol. 2, (Baltimore, Maryland), pp. 975–980, June 1992.
[2] S. Akamatsu, T. Sasaki, H. Fukamachi, and Y. Suenaga, “A robust face identification scheme - KL expansion of an invariant feature space,” in SPIE Proceedings Vol. 1607: Intelligent Robots and Computer Vision X: Algorithms and Techniques, pp. 71–84, 1991.
[3] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, pp. 53–58, 1989.
[4] D. H. Ballard and C. M. Brown, Computer Vision. Englewood Cliffs: Prentice-Hall, 1982.
[5] P. Ballard and G. Stockman, “Twinkle,” Master's thesis, Michigan State University, August 1991.
[6] R. Baron, “Mechanisms of human facial recognition,” International Journal of Man-Machine Studies, pp. 137–178, 1981.
[7] J. C. Bartlett and A. Fulton, “Familiarity and recognition of faces in old age,” Memory and Cognition, vol. 19, pp. 229–238, May 1991.
[8] E. B. Baum, “On the capabilities of multilayer perceptrons,” Journal of Complexity, vol. 4, pp. 193–215, 1988.
[9] E. B. Baum, “When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples?,” in Proc. EURASIP Workshop, Sezimbra, Portugal (L. B. Almeida and C. J. Wellekens, Eds.), pp. 2–25, New York: Springer-Verlag, 1990.
[10] J. L. Bentley, “Multidimensional search trees used for associative searching,” Comm. ACM, vol. 18, pp. 509–517, September 1975.
[11] M. Bichsel, Strategies of Robust Object Recognition for the Automatic Identification of Human Faces. PhD thesis, Eidgenössischen Technischen Hochschule Zürich, 1991. Diss. ETH Number 9467.
[12] I. Biederman, “Human image understanding: Recent research and a theory,” Computer Vision, Graphics, Image Process., pp. 29–73, 1985.
[13] W. F. Bischof and T. Caelli, “Learning structural descriptions of patterns: A new technique for conditional clustering and rule generation,” Pattern Recognition, vol. 27, no. 5, pp. 689–697, 1994.
[14] R. C. Bolles and P. Horaud, “3DPO: A three-dimensional part orientation system,” International Journal of Robotics Research, vol. 5, pp. 3–26, Fall 1986.
[15] O. Bourdon and G. Medioni, “Object recognition using geometric hashing on the connection machine,” in Proceedings International Conference on Pattern Recognition, (Atlantic City), pp. 596–600, 1990.
[16] M. Brady and H. Asada, “Smoothed local symmetries and their implementation,” International Journal of Robotics Research, vol. 3, no. 3, pp. 36–61, 1984.
[17] C. Bregler and S. M. Omohundro, “Nonlinear manifold learning for visual speech recognition,” in Proc. International Conference on Computer Vision, pp. 494–499, 1995.
[18] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall, 1993.
[19] R. Brooks, “Model-based three-dimensional interpretations of two-dimensional images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 5, no. 2, pp. 140–150, 1983.
[20] V. Bruce, “Perceiving and recognizing faces,” Mind and Language, pp. 342–364, 1990.
[21] R. Brunelli and T. Poggio, “HyperBF networks for gender classification,” in Proceedings DARPA Image Understanding Workshop, pp. 311–314, 1992.
[22] R. Brunelli and T. Poggio, “Face recognition: features versus templates,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1042–1052, October 1993.
[23] J. Buhmann, M. Lades, and C. V. D. Malsburg, “Size and distortion invariant object recognition by hierarchical graph matching,” in Proceedings International Joint Conference on Neural Networks, pp. 411–416, 1990.
[24] R. Buse and Z.-Q. Lie, “Feature extraction and analysis of handwritten words in grey-scale images using Gabor filters,” in Proceedings IEEE International Conference on Image Processing, vol. I, pp. 164–168, 1994.
[25] T. Caelli and A. Dreier, “Variations on the evidence-based object recognition theme,” Pattern Recognition, vol. 27, no. 2, pp. 185–204, 1994.
[26] T. Caelli and S. Xu, “A new method for shape-from-focus,” in AI '90 (C. P. Tsang, Ed.), World Scientific, 1990.
[27] R. A. Campbell, S. Cannon, G. Jones, and N. Morgan, “Individual face classification by computer vision,” in Proc. Conf. Modeling Simulation Microcomput., pp. 62–63, 1987.
[28] S. Carey, R. Diamond, and B. Woods, “The development of face recognition - a maturational component?,” Developmental Psychology, vol. 16, pp. 257–269, 1980.
[29] S. Carey, “A case study: Face recognition,” in Explorations in the Biological Language, New York: Bradford Books, 1987.
[30] C. Chen and A. Kak, “A robot vision system for recognizing 3-D objects in low-order polynomial time,” IEEE Trans. Syst., Man, Cybern., vol. 19, no. 6, pp. 1535–1563, 1989.
[31] J. L. Chen and G. C. Stockman, “The curvature method for modeling and matching smooth objects,” tech. rep., Michigan State University, Dept. of Computer Science, 1992.
[32] J.-L. Chen, G. C. Stockman, and K. Rao, “Recovering and tracking pose of curved 3D objects from 2D images,” in Proceedings Computer Vision and Pattern Recognition, (New York), pp. 233–239, 1993.
[33] S. Chen and J. Weng, “Vision-based navigation using self-organizing learning,” in Proc. 2nd Asian Conf. on Computer Vision, (Singapore), December 1995.
[34] S. Chen and J. Weng, “Autonomous navigation through case-based learning,” in Proc. IEEE Int'l Symposium on Computer Vision, (Coral Gables), November 1995.
[35] S. Chen and J. Weng, “Autonomous navigation using recursive partition tree,” in Proc. Workshop on Vision for Robots, (Pittsburgh), August 1995.
[36] Y. Cheng, K. Liu, J. Yang, and H. Wang, “A robust algebraic method for human face recognition,” in Proc. Int. Conf. on Pattern Recognition, pp. 221–224, 1992.
[37] C. H. Chien and J. K. Aggarwal, “Shape recognition from a single silhouette,” in Proc. ICCV, (London), 1987.
[38] M. J. Conlin, “A rule based high level vision system,” in SPIE Proceedings Volume 726: Intelligent Robots and Computer Vision, pp. 314–320, 1986.
[39] J. H. Connell and M. Brady, “Generating and generalizing models of visual objects,” Artificial Intelligence, vol. 31, pp. 159–183, 1987.
[40] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. The MIT Press, 1991.
[41] I. Craw, H. Ellis, and J. Lishman, “Automatic extraction of face features,” Pattern Recognition, vol. 5, pp. 183–187, 1987.
[42] I. Craw, D. Tock, and A. Bennett, “Finding face features,” in Proceedings 2nd European Conference on Computer Vision, pp. 92–96, 1992.
[43] Y. Cui, D. L. Swets, and J. J. Weng, “Learning-based hand sign recognition using SHOSLIF-M,” in Proc. International Conference on Computer Vision, (Cambridge, Massachusetts), pp. 631–636, June 1995.
[44] Y. Cui and J. Weng, “Learning-based hand sign recognition,” in Proceedings, International Workshop on Automatic Face- and Gesture-Recognition, (Zurich), pp. 201–206, June 1995.
[45] L. Devroye, “Automatic pattern recognition: A study of the probability of error,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 530–543, July 1988.
[46] S. J. Dickinson, A. P. Pentland, and A. Rosenfeld, “3-D shape recovery using distributed aspect matching,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 174–198, 1992.
[47] R. Duda and P. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[48] H. D. Ellis, “Introduction to aspects of face processing: Ten questions in need of answers,” in Aspects of Face Processing (H. Ellis, M. Jeeves, F. Newcombe, and A. Young, Eds.), pp. 3–13, Nijhoff, 1986.
[49] H. D. Ellis, D. M. Ellis, and J. A. Hosie, “Priming effects in children's face recognition,” British Journal of Psychology, vol. 84, pp. 101–110, February 1993.
[50] M. Fang and S. Singh, “Private communication.” Siemens Corporate Research, Princeton, New Jersey, 1993.
[51] R. A. Fisher, “The statistical utilization of multiple measurements,” Annals of Eugenics, vol. 8, pp. 376–386, 1938.
[52] P. J. Flynn and A. K. Jain, “3D object recognition using invariant feature indexing of interpretation tables,” Computer Vision, Graphics, Image Process., vol. 55, no. 2, pp. 119–129, 1992.
[53] P. J. Flynn and A. K. Jain, “BONSAI: 3-D object recognition using constrained search,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 1066–1075, October 1991.
[54] P. Foldiak, “Adaptive network for optimal linear feature extraction,” in Proc. of IEEE Intl. Joint Conf. on Neural Networks, vol. 1, (Washington DC), pp. 401–405, 1989.
[55] K. Fukunaga, Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd ed., 1990.
[56] K. Fukushima, S. Miyake, and T. Ito, “Neocognitron: A neural network model for a mechanism of visual pattern recognition,” IEEE Trans. Systems, Man, Cybernetics, vol. 13, no. 5, pp. 826–834, 1983.
[57] K. Fukushima, “Cognitron: A self-organizing multilayered neural network,” Biological Cybernetics, vol. 20, pp. 121–136, 1975.
[58] S. F. Galton, “Personal identification and description-I,” Nature, pp. 173–177, 1888.
[59] A. G. Goldstein, “Race-related variation of facial features: Anthropometric data I,” Bulletin of the Psychonomic Society, vol. 13, pp. 187–190, 1979.
[60] A. G. Goldstein, “Facial feature variation: Anthropometric data II,” Bulletin of the Psychonomic Society, vol. 13, pp. 191–193, 1979.
[61] B. A. Golomb and T. J. Sejnowski, “SEXNET: A neural network identifies sex from human faces,” in Advances in Neural Information Processing Systems III, pp. 572–577, San Mateo: Morgan Kaufmann, 1991.
[62] G. Gordon and I. Vincent, “Applications of morphology to feature extraction for face recognition,” in SPIE Proceedings, Vol. 1658: Nonlinear Image Processing, 1992.
[63] G. Gordon, “Face recognition based on depth maps and surface curvature,” in SPIE Proceedings, Vol. 1570: Geometric Methods in Computer Vision, pp. 234–247, 1991.
[64] V. Govindaraju, S. N. Srihari, and D. B. Sher, “A computational model for face location,” in Proceedings 3rd International Conference on Computer Vision, pp. 718–721, 1990.
[65] P. Green, “Biology and cognitive development: The case of face recognition,” Animal Behaviour, vol. 43, pp. 526–527, March 1992.
[66] W. E. L. Grimson and T. Lozano-Pérez, “Localizing overlapping parts by searching the interpretation tree,” IEEE Trans. Pattern Anal. Machine Intell., vol. 9, no. 4, pp. 469–482, 1987.
[67] W. E. L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints. The MIT Press, 1990.
[68] P. W. Hallinan, “Recognizing human eyes,” in SPIE Proceedings Vol. 1570: Geometric Methods in Computer Vision, pp. 214–226, 1991.
[69] R. W. Hamming, Coding and Information Theory. Englewood Cliffs: Prentice-Hall, 1980.
[70] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, vol. II. Addison-Wesley, 1993.
[71] L. Harmon and W. Hunt, “Automatic recognition of human face profiles,” Computer Graphics and Image Processing, vol. 6, pp. 135–156, 1977.
[72] L. Harmon, M. Khan, R. Lasch, and P. Ramig, “Machine identification of human faces,” Pattern Recognition, vol. 13, pp. 97–110, 1981.
[73] L. Harmon, S. Kuo, P. Ramig, and U. Raudkivi, “Identification of human face profiles by computer,” Pattern Recognition, vol. 10, pp. 301–312, 1978.
[74] L. Harmon, “The recognition of faces,” Scientific American, vol. 229, no. 5, pp. 71–82, 1973.
[75] D. C. Hay and A. W. Young, “The human face,” in Normality and Pathology in Cognitive Function (H. D. Ellis, Ed.), pp. 173–202, London: Academic Press, 1982.
[76] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley Publishing Company, 1991.
[77] D. D. Hoffman and W. A. Richards, “Parts of recognition,” Cognition, vol. 18, pp. 65–96, 1985.
[78] Z. Hong, “Algebraic feature extraction of images for recognition,” Pattern Recognition, vol. 24, pp. 211–219, 1991.
[79] B. K. P. Horn, Robot Vision. Cambridge, MA: MIT Press, 1986.
[80] S. J. Howden and J. Weng, “Hand-eye coordinated learning using hierarchical space tessellation,” Tech. Rep. CPS-94-29, Michigan State University, 1994.
[81] D. P. Huttenlocher and S. Ullman, “Object recognition using alignment,” in Proc. Int'l Conf. Computer Vision, pp. 102–111, 1987. London, England.
[82] K. Ikeuchi and T. Kanade, “Automatic generation of object recognition programs,” Proceedings of the IEEE, vol. 76, no. 8, pp. 1016–1035, 1988.
[83] C. L. Jackins and S. L. Tanimoto, “Oct-trees and their use in representing three-dimensional objects,” Computer Vision, Graphics, Image Process., vol. 14, pp. 249–270, November 1980.
[84] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, New Jersey, 1988.
[85] A. K. Jain and P. J. Flynn, 3D Object Recognition Systems. Amsterdam: Elsevier, 1993.
[86] A. K. Jain and R. Hoffman, “Evidence-based recognition of 3-D objects,” IEEE Trans. Pattern Anal. Machine Intell., vol. 10, pp. 783–802, November 1988.
[87] A. K. Jain and M. D. Ramaswami, “Rule-based 3D object recognition from range images,” tech. rep., Michigan State University, 1988.
[88] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs: Prentice Hall, 1989.
[89] I. T. Jolliffe, Principal Component Analysis. Springer-Verlag, New York, 1986.
[90] T. Kanade, Computer Recognition of Human Faces. Basel and Stuttgart: Birkhauser, 1977.
[91] G. J. Kaufman and K. J. Breeding, “The automatic recognition of human faces from profile silhouettes,” IEEE Trans. Systems, Man, and Cybernetics, vol. 6, pp. 113–121, 1976.
[92] Y. Kaya and K. Kobayashi, “A basic study on human face recognition,” in Frontiers of Pattern Recognition (S. Watanabe, Ed.), pp. 265–289, New York: Academic Press, 1972.
[93] W. Kim and P. Yuan, “A practical pattern recognition system for translation, scale, and rotation invariance,” in Proceedings Conf. on Computer Vision and Pattern Recognition, pp. 391–396, 1994.
[94] M. Kirby and L. Sirovich, “Application of the Karhunen-Loeve procedure for the characterization of human faces,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 103–108, January 1990.
[95] J. J. Koenderink and A. J. van Doorn, “The internal representation of solid shape with respect to vision,” Biological Cybernetics, vol. 32, no. 4, pp. 211–216, 1979.
[96] T. Kohonen, “Self-organized formation of topologically correct feature maps,” Biol. Cybern., vol. 43, pp. 59–69, 1982.
[97] T. Kohonen, “Self-organized network,” Proc. IEEE, vol. 43, pp. 59–69, 1990.
[98] T. Kohonen, Self-Organization and Associative Memory. Berlin: Springer-Verlag, 1988.
[99] H. Kuhnel and P. Tavan, “A network for discriminant analysis,” in Artificial Neural Networks (T. Kohonen, K. Makisara, O. Simula, and J. Kangas, Eds.), Elsevier Science, 1991.
[100] S. Kung and K. Diamantaras, “A neural network learning algorithm for adaptive principal component extraction (APEX),” in IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, pp. 861–864, 1990.
[101] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. V. D. Malsburg, and R. Wurtz, “Distortion invariant object recognition in the dynamic link architecture,” IEEE Trans. Computers, vol. 42, pp. 300–311, 1993.
[102] Y. Lamdan and H. J. Wolfson, “Geometric hashing: A general and efficient model-based recognition scheme,” in Proc. 2nd Int'l Conf. Computer Vision, pp. 238–249, 1988.
[103] X. Li and R. C. Dubes, “Tree classifier design with a permutation statistic,” Pattern Recognition, vol. 19, pp. 229–235, 1986.
[104] R. Linsker, “Self-organization in a perceptual network,” Computer, pp. 105–117, March 1988.
[105] M. M. Loeve, Probability Theory. Princeton, NJ: Van Nostrand, 1955.
[106] D. G. Lowe, “Fitting parameterized three-dimensional models to images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 441–450, May 1991.
[107] B. Manjunath, R. Chellappa, and C. von der Malsburg, “A feature based approach to face recognition,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 373–378, 1992.
[108] B. Manjunath, C. Shekhar, R. Chellappa, and C. von der Malsburg, “A robust method for detecting image features with application to face recognition and motion correspondence,” in Proc. Int. Conf. on Pattern Recognition, pp. 208–212, 1992.
[109] D. Marr and K. Nishihara, “Representation and recognition of the spatial organization of three-dimensional shapes,” in Proceedings of the Royal Society of London, pp. 269–294, 1977.
[110] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: W. H. Freeman and Company, 1982.
[111] R. Mauro and M. Kubovy, “Caricature and face recognition,” Memory and Cognition, vol. 20, pp. 433–441, July 1992.
[112] E. A. Maylor, “Recognizing and naming faces: aging, memory retrieval, and the tip of the tongue state,” Journal of Gerontology, vol. 45, pp. P215–P226, November 1990.
[113] B. Moghaddam and A. Pentland, “Maximum likelihood detection of faces and hands,” in International Workshop on Automatic Face- and Gesture-Recognition (M. Bichsel, Ed.), pp. 122–128, 1995.
[114] J. Morton and M. H. Johnson, “Conspec and conlern: a two-process theory of infant face recognition,” Psychological Review, vol. 98, pp. 164–181, April 1991.
[115] J. K. Mui and K. S. Fu, “Automated classification of nucleated blood cells using a binary tree classifier,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, pp. 429–443, September 1980.
[116] H. Murase and S. K. Nayar, “Illumination planning for object recognition in structured environments,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 31–38, June 1994. Seattle, Washington.
[117] O. Nakamura, S. Mathur, and T. Minami, “Identification of human faces by identity maps,” Pattern Recognition, vol. 26, no. 3, pp. 263–272, 1991.
[118] R. Nevatia and T. Binford, “Description and recognition of complex-curved objects,” Artificial Intelligence, vol. 8, pp. 77–98, 1977.
[119] M. Nixon, “Eye spacing measurements for facial recognition,” in SPIE Proc. Vol. 575: Applications of Digital Image Processing VIII, pp. 279–285, 1985.
[120] E. Oja, “Neural networks, principal components, and subspaces,” Intl. Journal of Neural Systems, vol. 1, pp. 61–68, 1989.
[121] F. Pan and M. Keane, “A new set of moment invariants for handwritten numeral recognition,” in Proceedings IEEE International Conference on Image Processing, vol. I, pp. 154–158, 1994.
[122] A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” in Proc. IEEE Computer Soc. Conf. on Computer Vision and Pattern Recognition, pp. 84–91, June 1994. Seattle, Washington.
[123] A. Pentland, R. W. Picard, and S. Sclaroff, “Photobook: Tools for content-based manipulation of image databases,” in SPIE Storage and Retrieval for Image and Video Databases II, no. 2185, (San Jose), February 1994.
[124] A. P. Pentland, “Perceptual organization and the representation of natural form,” Artificial Intelligence, vol. 28, pp. 29–73, 1986.
[125] T. Poggio and F. Girosi, “Networks for approximation and learning,” Proc. IEEE, vol. 78, pp. 1481–1497, 1990.
[126] A. Rahardja, A. Sowmya, and W. Wilson, “A neural network approach to component versus holistic recognition of facial expressions in images,” in SPIE Proceedings Vol. 1607: Intelligent Robots and Computer Vision X: Algorithms and Techniques, pp. 62–70, 1991.
[127] S. S. Rakover and B. Cahlon, “To catch a thief with a recognition test: the model and some empirical results,” Cognitive Psychology, vol. 21, pp. 423–468, October 1989.
[128] D. Reisfeld and Y. Yeshurun, “Robust detection of facial features by generalized symmetry,” in Proceedings 11th International Conference on Pattern Recognition, pp. 117–120, 1992.
[129] G. Rhodes and T. Tremewan, “The Simon then Garfunkel effect: semantic priming, sensitivity, and the modularity of face recognition,” Cognitive Psychology, vol. 25, pp. 147–187, April 1993.
[130] J. Rubner and K. Schulten, “Development of feature detectors by self-organization,” Biol. Cybern., vol. 62, pp. 193–199, 1990.
[131] J. Rubner and P. Tavan, “A self-organizing network for principal component analysis,” Europhysics Letters, vol. 10, pp. 693–698, 1989.
[132] T. Sakai, M. Nagao, and S. Fujibayashi, “Line extraction and pattern recognition in a photograph,” Pattern Recognition, vol. 1, pp. 233–248, 1969.
[133] A. Samal and P. A. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey,” Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[134] A. Samal, “Minimum resolution for human face detection and identification,” in Proc. SPIE/SPSE Symposium on Electronic Imaging, 1991.
[135] H. Samet, “Region representation: Quadtrees from boundary codes,” Comm. ACM, vol. 23, pp. 163–170, March 1980.
[136] M. Seibert and A. M. Waxman, “Recognizing faces from their parts,” in SPIE Proceedings Vol. 1611: Sensor Fusion IV: Control Paradigms and Data Structures, pp. 129–140, 1991.
[137] J. Sergent, “Microgenesis of face perception,” in Aspects of Face Processing, Dordrecht: Nijhoff, 1986.
[138] I. K. Sethi and G. P. R. Savarayudu, “Hierarchical classifier design using mutual information,” IEEE Trans. Pattern Anal. Machine Intell., vol. 1, pp. 194–201, April 1979.
[139] L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human faces,” Journal of the Optical Society of America, vol. 4, pp. 519–524, 1987.
[140] W. Sjoberg, B. Gruber, and C. Swatloski, “Visual-field asymmetry in face recognition as a function of face discriminability and interstimulus interval,” Perceptual and Motor Skills, vol. 72, pp. 1267–1271, June 1991.
[141] F. Stein and G. Medioni, “Efficient two dimensional object recognition,” in 10th International Conference on Pattern Recognition, (Atlantic City), 1990.
[142] F. Stein and G. Medioni, “Structural indexing: Efficient three-dimensional object recognition,” IEEE Trans. Pattern Anal. Machine Intell., pp. 125–145, 1992.
[143] G. C. Stockman, “Object recognition and localization via pose clustering,” Computer Vision, Graphics and Image Processing, vol. 40, pp. 361–387, 1987.
[144] T. J. Stonham, “Practical face recognition and verification with WISARD,” in Aspects of Face Processing (H. D. Ellis, M. A. Jeeves, F. Newcombe, and A. Young, Eds.), pp. 426–441, Dordrecht: Nijhoff, 1984.
[145] D. L. Swets, B. Punch, and J. J. Weng, “Genetic algorithms for object recognition in a complex scene,” in Proceedings, International Conference on Image Processing, (Washington, DC), pp. 595–598, October 1995.
[146] D. L. Swets and J. J. Weng, “Image-based recognition using learning for generalizing parameters,” in Proceedings, Asian Conference on Computer Vision, (Singapore), December 1995. Accepted and presented at the conference; not in the Proceedings due to an oversight between authors.
[147] D. L. Swets and J. J. Weng, “Efficient image retrieval using a network with complex neurons,” in Proceedings, International Conference on Neural Networks, (Perth, Western Australia), November 1995. Invited Paper.
[148] D. L. Swets and J. J. Weng, “View-based recognition using SHOSLIF,” in Proceedings, International Conference on Neural Networks and Signal Processing, (Nanjing, China), December 1995. Most Excellent Paper award.
[149] D. L. Swets and J. J. Weng, “Efficient content-based image retrieval using automatic feature selection,” in Proceedings, International Symposium on Computer Vision, (Coral Gables, Florida), pp. 85–90, November 1995.
[150] D. L. Swets and J. J. Weng, “Using discriminant eigenfeatures for image retrieval,” IEEE Trans. Pattern Anal. Machine Intell., 1996. In preparation.
[151] D. L. Swets and J. J. Weng, “SHOSLIF-O: SHOSLIF for object recognition and image retrieval (phase II),” Tech. Rep. CPS 9539, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, October 1995.
[152] D. L. Swets and J. J. Weng, “SHOSLIF-O: SHOSLIF for object recognition (phase I),” Tech. Rep. CPS 94-64, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, December 1994.
[153] M. Szpir, “Accustomed to your face,” American Scientist, vol. 80, pp. 537–540, November-December 1992.
[154] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[155] T. Valentine and A. Ferrara, “Typicality in categorization, recognition and identification: evidence from face recognition,” British Journal of Psychology, vol. 82, pp. 87–102, February 1991.
[156] A. Wahlin, L. Backman, T. Mantyla, A. Herlitz, M. Viitanen, and B. Winblad, “Prior knowledge and face recognition in a community-based sample of healthy, very old adults,” Journals of Gerontology, vol. 48, pp. P54–P61, March 1993.
[157] M. A. Wallace and M. J. Farah, “Savings in relearning face-name associations as evidence for ‘covert recognition’ in prosopagnosia,” Journal of Cognitive Neuroscience, vol. 4, pp. 150–154, Spring 1992.
[158] J. J. Weng, “Cresceptron and SHOSLIF: Toward comprehensive visual learning,” in Early Visual Learning (S. K. Nayar and T. Poggio, Eds.), Oxford University Press, 1996.
[159] J. J. Weng, “SHOSLIF: The hierarchical optimal subspace learning and inference framework,” Tech. Rep. CPS 94-15, Michigan State University, Department of Computer Science, A714 Wells Hall, East Lansing, Michigan 48824, March 1994.
[160] J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and segmentation using the Cresceptron,” in Proc. International Conference on Computer Vision, pp. 121–128, May 1993. Berlin, Germany.
[161] J. Weng and S. Chen, “SHOSLIF-N: SHOSLIF for autonomous navigation (phase II),” Tech. Rep. CPS-95-22, Michigan State University, Department of Computer Science, May 1995.
[162] J. Weng, T. S. Huang, and N. Ahuja, Motion and Structure from Image Sequences. Springer-Verlag, 1993.
[163] S. S. Wilks, Mathematical Statistics. Wiley, New York, 1963.
[164] P. H. Winston, Artificial Intelligence. Addison-Wesley, third ed., 1984.
[165] R. J. Woodham, “Photometric method for determining surface orientation from multiple images,” in Shape from Shading (B. K. P. Horn and M. J. Brooks, Eds.), Cambridge, MA: MIT Press, 1989.
[166] C. J. Wu and T. S. Huang, “Human face profile recognition by computer,” Pattern Recognition, vol. 23, no. 3/6, pp. 255–259, 1990.
[167] A. Yuille, D. Cohen, and P. Hallinan, “Feature extraction from faces using deformable templates,” in Proceedings Computer Vision and Pattern Recognition, pp. 104–109, 1989.