COSMOS: A FRAMEWORK FOR REPRESENTATION AND RECOGNITION OF 3D FREE-FORM OBJECTS

By

Chitra Dorai

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1996

ABSTRACT

COSMOS: A FRAMEWORK FOR REPRESENTATION AND RECOGNITION OF 3D FREE-FORM OBJECTS

By Chitra Dorai

This dissertation presents a new approach to automated representation and recognition of 3D free-form rigid objects using dense surface data. We describe a computer vision system that recognizes arbitrarily curved 3D rigid objects from a single view when (a) the viewpoint can be arbitrary, (b) the objects may vary in shape and complexity, and (c) no restrictive assumptions are made about the types of surfaces on the objects. We assume that a range image of a scene is available which contains a view of a rigid 3D object without occlusion. Availability of CAD models of 3D objects, although not a necessity, is considered an advantage and is exploited in our system to easily generate multiple views of an object for model construction. Our surface representation scheme, COSMOS, describes an object concisely in terms of maximal surface patches of constant shape index. These maximal patches are mapped onto the unit sphere via their orientations and aggregated via shape spectral functions. Surface properties such as area, curvedness, and connectivity, which are required to capture local and global information about the object, are also built into the representation. The scheme not only yields a meaningful and rich description that preserves the recoverability of several classes of objects, but also provides a set of powerful matching primitives for recognition.

We present a recognition strategy consisting of a multi-level matching mechanism that employs shape spectral analysis and features derived from the COSMOS representations of objects for fast and efficient object identification and pose estimation. Shape spectrum based object view representations are employed for efficient view grouping and model organization with large databases. Given a range image of an uncluttered view (allowing self-occlusion) of an object, the shape spectrum based model selection scheme short-lists a few promising candidate views from a database of object views. The COSMOS-based view verification scheme then establishes the correct object identity of the input by comparing the COSMOS representations of the views in detail using a "patch-group graph" matching technique. Estimation of the pose of the recognized object is formulated as registration of the sensed data with the range image of the best matched view of the object.
We present a minimum variance estimator to robustly register two range images of a complex object and compute their relative view transformation accurately. All theoretical aspects of this work have been experimentally validated via a prototype system, which has been tested on a database of over 6,000 object views generated from CAD models and surface triangulations, and on 100 range images of several different complex objects acquired using a 3D laser range scanner.

To Sunil

Acknowledgments

It is a great pleasure to record my sincere appreciation of the support and encouragement shown by my colleagues, friends and family during the course of my Ph.D. First of all, I wish to express my gratitude to Prof. Anil K. Jain, my thesis advisor, for his generous support, guidance and willingness to help throughout my research. I have benefited greatly from the extensive research training and opportunities he has provided me, and from his breadth and depth of experience. His insightful questions and remarks on the significance of various aspects of my research have influenced and improved the presentation of my work immensely. His vision and the care he gives to all his students have been a source of inspiration to me.

I would like to thank Profs. George Stockman, John Weng, William Punch and Hira Koul for serving on my Ph.D. committee. Special thanks to Dr. Stockman for his steady encouragement and for always keeping an open door for me to discuss research with him at any time throughout these years. Right from the moment I arrived at MSU, he has shown me a great deal of warmth and friendliness which I truly appreciate. I am very grateful to Dr. Weng for his helpful suggestions and constructive criticism that contributed significantly to Chapter 6 of this thesis. Thanks to Dr. Punch for his cheerful disposition and his willingness to make time for our discussions. Dr. Koul initially suggested the analogy between the shape spectral distribution and probability distributions. I also want to thank all the faculty members from the Computer Science, Mathematics and Statistics departments, especially the late Prof. Dubes, for the excellent instruction that I received. Dr. Patrick Flynn at WSU played a key role in the early development of my interest in free-form surface matching. I would like to thank both Pat Flynn and Tim Newman for sharing their 3D data and source code, and for their prompt assistance whenever I turned to them with questions about the laser range scanner. I also gratefully acknowledge Dr. S. W. Chen's help with the cobra model.

My PRIP friends and colleagues from all over the world contributed to a stimulating and friendly work environment, full of lively discussions and cheerful banter. I would like to thank all of them, in particular Sateesha Nadabar, N. S. Raja, Deborah Trytten, Sushil Bhattacharjee, Timothy Newman, Hansye Dulimarta, Qian Huang, Jianchang Mao, Marie Pierre Jolly, Jinlong Chen, Dan Swets, Sally Howden, Marilyn Wulfekuhler, Shanti Vedula, Aditya Vailaya, Yonghong Li, Karissa Miller, Lin Hong, and Gang Wang. Special thanks to Jayashree, Ram, Geeta, Natraj, Sujatha, and Shankar for the moral support and fun times outside school. Many old friends such as Madan and Shreekumar often sent their affection and encouragement via e-mail.
I am grateful to Innovision Corporation for a Graduate Fellowship award, the MSU Graduate School for the Dissertation Completion Fellowship, and Northrop Corporation, NASA Lewis Research Center and the Department of Computer Science for their generous financial support over the years. I am thankful to Dr. Anbalagan Reddiar at Innovision Corporation, and Dr. Harpreet Sawhney and Dr. Byron Dom at the IBM Almaden Research Center, for their interest. I wish to thank all the system administrators in the department, especially Lisa Lees, for maintaining excellent computing facilities in PRIP and the instructional laboratories. I appreciate the enthusiastic helpfulness of the departmental support staff, particularly Cathy Davison, Linda Moore, and Lora Mae Higbee, at all times.

Special thanks to my parents and brothers whose encouragement, love and noninterference made this all possible. My mother in particular was a source of inspiration to me in many aspects. I am also most thankful for Sunil's unfailing support and love over the last eleven years, from IIT to MSU. I am indebted to him for serving as a (very!) critical sounding board at many points during my research, for his generous sacrifice of our personal time to accommodate long discussions on COSMOS, and finally for being always there. His brilliance and his keen sense of perfection have influenced me greatly.

Contents

List of Tables

List of Figures

1 Introduction
  1.1 Automatic Model-Based Scene Analysis
  1.2 Challenges in Three-Dimensional Object Recognition
  1.3 Free-Form Object Recognition
    1.3.1 Motivation and Statement of the Problem
    1.3.2 Definition of Free-Form Surfaces
    1.3.3 Problem Definition
    1.3.4 Problem Difficulty
  1.4 Overview of the Thesis
  1.5 Main Components of the Recognition System
  1.6 Organization of the Thesis

2 Three-Dimensional Object Recognition
  2.1 Recognition of 3D Objects
    2.1.1 Sensors
    2.1.2 3D Object Representations and Models
    2.1.3 Matching Strategies
    2.1.4 Difficulties and Challenges
  2.2 Free-Form Object Recognition
    2.2.1 Representation Schemes
    2.2.2 Recognition Techniques
  2.3 Summary

3 COSMOS - A New Representation Scheme for Free-Form Objects
  3.1 How Do We Describe Rigid 3D Objects?
    3.1.1 Local Surface Attributes
    3.1.2 Combining Local and Global Descriptions
  3.2 COSMOS - A New Representation Scheme
    3.2.1 Definitions
    3.2.2 Definition of the COSMOS of a 3D Object
    3.2.3 Definition of Shape Spectrum
    3.2.4 Relationship between Shape Spectrum and EGI
  3.3 Properties of the COSMOS Representation
    3.3.1 Compactness
    3.3.2 Convex Objects
    3.3.3 Nonconvex Objects
  3.4 3D Objects and their COSMOS Representations: Examples
    3.4.1 Simple Objects
    3.4.2 Torus
  3.5 Deriving COSMOS Representation of an Object from Range Data
    3.5.1 Construction of the COSMOS of a Single View of an Object
    3.5.2 Constrained Region Growing
    3.5.3 Sensitivity Analysis of Shape Index, S_I
    3.5.4 Shape Spectrum of an Object View: Examples
  3.6 Summary

4 Object View Grouping and Model Database Organization
  4.1 Object-centered versus Viewer-centered Representations
  4.2 3D Object Model as a Collection of Representations of Multiple Views
  4.3 View Sensitivity of the Shape Spectrum
  4.4 Organizing Object Views
    4.4.1 Feature Representation and Similarity between Shape Spectra
    4.4.2 Object View Grouping
  4.5 Experimental Results
    4.5.1 Matching Accuracy: Resubstitution
    4.5.2 Matching Accuracy: Testing Phase
    4.5.3 Testing with 6400 Object Views
    4.5.4 Model View Selection with Real Range Data
    4.5.5 Shape Spectrum of Objects with Planar Surfaces
  4.6 Summary

5 Multi-level Matching for Free-Form Object Recognition
  5.1 COSMOS-Based Free-Form Object Recognition
  5.2 Shape Spectrum-Based Model View Selection
  5.3 COSMOS-Based Refined View Matching
    5.3.1 Patch Grouping, Correspondences and Graph Isomorphism
    5.3.2 The Matching Algorithm
    5.3.3 Goodness Measure of a Correspondence
    5.3.4 Highlights of the Matching Algorithm
    5.3.5 Estimation of Object Pose
    5.3.6 Experimental Results
  5.4 Performance of the Recognition System
    5.4.1 The COSMOS Representation Scheme
    5.4.2 View Grouping and Model View Selection
    5.4.3 Matching of Object Views using COSMOS
    5.4.4 Pose Estimation
  5.5 Summary

6 Pose Estimation by Registering Object Views
  6.1 Robust Object View Registration
  6.2 Previous Work
  6.3 Error in Surface Measurements
  6.4 A Non-Optimal Algorithm for Registration
  6.5 Registration and Error Modeling
    6.5.1 Fitting Planes to Surface Data with Noise
  6.6 An Optimal Registration Algorithm
    6.6.1 Estimation of the Variance σ²
  6.7 Experimental Results
    6.7.1 Selection of Control Points
    6.7.2 Initial Estimate of the Transformation
    6.7.3 Errors in the Estimated Transformation
    6.7.4 Results
  6.8 Surface Geometry and Registration
  6.9 Summary

7 Summary and Directions for Future Research
  7.1 Summary
  7.2 Future Research
    7.2.1 Incorporation of Explicit Edge Information within COSMOS
    7.2.2 Improving the Segmentation Algorithm
    7.2.3 Deriving COSMOS from a 3D Object Model
    7.2.4 Better Distance Measures and Matching Efficiency
    7.2.5 Occlusion
    7.2.6 Integrating Color and Texture

Bibliography

List of Tables

2.1 An overview of popular object representation schemes.
2.2 An overview of major matching strategies.
2.3 Current representation schemes for complex curved objects.
3.1 Shape index and curvedness values of the surfaces shown in Figure 3.5.
3.2 COSMOS and orientation-based representations.
3.3 Surface connectivity and support functions on the unit sphere for a convex polyhedron.
3.4 Surface connectivity and support functions on the unit sphere for a nonconvex polyhedron.
3.5 Surface connectivity and support functions on the unit sphere for a cylinder truncated with spherical ends.
3.6 Surface connectivity and support functions on the unit sphere for a truncated cylinder with planar ends.
3.7 Surface connectivity and support functions on the unit sphere for a telephone handset.
3.8 The COSMOS representation: Support functions for Vase2-1 on the unit sphere.
4.1 View classification accuracy with view groups of a single object.
4.2 Object matching accuracy with an independent test set of 2,000 views.
4.3 Shape spectrum based selection of the five best matched model views among all twenty-five views at the second level.
5.1 Matching scores of the five model hypotheses determined by the view verification stage.
6.1 Estimated transformation for the cobra data.
6.2 Registration of cobra data with 156 control points.
6.3 Registration of Big-Y views using 81 control points.
6.4 Registration of Big-Y views using 154 control points.
6.5 Registration of Face1 views with 250 control points.
6.6 Registration of Face2 views with 142 control points.

List of Figures

1.1 Key components of a 3D object recognition system.
1.2 Example of a free-form surface.
1.3 Range images of objects with free-form surfaces.
1.4 Representation and object shape complexity.
1.5 An overview of the proposed approach.
2.1 Approaches to building a 3D object recognition system.
2.2 Object models with vertices and inflection points as local features.
2.3 Examples of super segments and splash adopted from [147]: (a) a 3D super segment with 4 grouped segments; k1, k2, k3 are the curvature angles, t1 is the torsion angle; (b) a splash with n, the reference normal; ρ, the geodesic radius; p, the location vector; θ, the angle.
3.1 Representing objects: several levels of abstraction.
3.2 An example of a nonconvex object.
3.3 Nine well-known shape types and their locations on the S_I scale.
3.4 Nine representative shapes on the S_I scale.
3.5 Simple surfaces with different shape index and curvedness values.
3.6 Shape index (S_I) and curvedness (R) in the (κ1, κ2)-plane.
3.7 Continuous spectrum of various surface shapes and their shape index values ranging from spherical cap to saddle: (a) S_I = 1.0; (b) S_I = 0.96; (c) S_I = 0.92; (d) S_I = 0.87; (e) S_I = 0.81; (f) S_I = 0.78; (g) S_I = 0.75; (h) S_I = 0.72; (i) S_I = 0.69; (j) S_I = 0.63; (k) S_I = 0.58; (l) S_I = 0.54; (m) S_I = 0.5.
3.8 Continuous spectrum of various surface shapes and their shape index values ranging from saddle to spherical cup: (a) S_I = 0.5; (b) S_I = 0.46; (c) S_I = 0.42; (d) S_I = 0.37; (e) S_I = 0.31; (f) S_I = 0.28; (g) S_I = 0.25; (h) S_I = 0.22; (i) S_I = 0.19; (j) S_I = 0.13; (k) S_I = 0.08; (l) S_I = 0.04; (m) S_I = 0.
3.9 Surface shapes from spherical cap to saddle when the object scale changes: (a) S_I = 1.0; (b) S_I = 0.96; (c) S_I = 0.92; (d) S_I = 0.87; (e) S_I = 0.81; (f) S_I = 0.75; (g) S_I = 0.69; (h) S_I = 0.63; (i) S_I = 0.58; (j) S_I = 0.54; (k) S_I = 0.5.
3.10 Surface shapes from saddle to spherical cup when the object scale changes: (a) S_I = 0.5; (b) S_I = 0.46; (c) S_I = 0.42; (d) S_I = 0.37; (e) S_I = 0.31; (f) S_I = 0.25; (g) S_I = 0.19; (h) S_I = 0.13; (i) S_I = 0.08; (j) S_I = 0.04; (k) S_I = 0.0.
3.11 Maximal patches of constant shape index (colors indicate different shape index values): (a) Range image of a vase; (b) CSMPs detected on the vase.
3.12 Example of a 3D free-form object and its spherical mapping G_O(P).
3.13 COSMOS and EGI of a convex polyhedron. (The support functions are shown only for normals N1 and N4 for clarity.)
3.14 A convex object (Object1) and a nonconvex object (Object2) that have identical EGI representations.
3.15 COSMOS representation: (a) a sphere of radius a (a ≥ 1); (b) the Gauss patch map with the support functions indicated.
3.16 COSMOS representation: (a) a convex polyhedron; (b) the Gauss patch map with the support functions.
3.17 COSMOS representation: (a) a nonconvex polyhedron; (b) the Gauss patch map with the support functions.
3.18 COSMOS representation: (a) a truncated cylinder with spherical caps; (b) the Gauss patch map with the support functions.
3.19 COSMOS representation: (a) a truncated cylinder with planar ends; (b) the Gauss patch map with the support functions.
3.20 COSMOS representation: (a) a telephone handset; (b) the Gauss patch map with the support functions.
3.21 Shape index values on the surface of the torus.
3.22 Construction of COSMOS from range data of an object.
3.23 Representation of objects with free-form surfaces: (a) range image; (b) constant-shape maximal patches; (c) the Gauss patch map.
3.24 Sensitivity of the shape index to principal curvatures.
3.25 Shape spectra of (a) Vase2-1 shown in Figure 1.3 and (b) Big-Y-1 shown in Figure 1.3.
4.1 Shape spectrum: (a) Range image of Vase2; (b) shape spectrum of Vase2; (c) a view of the cobra head - Cobra-1; (d) shape spectrum of Cobra-1; (e) another view - Cobra-2; (f) shape spectrum of Cobra-2.
4.2 A subset of views of Cobra chosen from a set of 320 views.
4.3 Hierarchical grouping of 320 views of Cobra.
4.4 Visualization of the centroids of eleven view clusters of 320 views of Cobra using Chernoff faces.
4.5 View clusters of Cobra and their views: (a) views belonging to Cluster5; (b) Cluster6; (c) Cluster9; (d) Cluster10.
4.6 View classification accuracy vs. number of clusters examined in the database.
4.7 Model view selection with the view-grouping and matching system.
4.8 Misclassification vs. number of clusters examined.
4.9 Range images of objects generated from arbitrary viewing directions from twenty object models.
4.10 Range images of 50 model views.
4.11 Range images of 50 test views.
4.12 Incorrect model view selection: (a) Test view; (b) top 5 view hypotheses generated by the model selection scheme.
5.1 Overview of our 3D object recognition system.
5.2 3D object recognition and pose estimation.
5.3 Shape spectrum based two-tiered organization of a model database.
5.4 Correspondence between patch-group graphs: (a) View 1 of Vase2; (b) view 2 of Vase2; (c) correspondence established between the CSMPs in the views.
5.5 Correspondence between the views of Phone: (a) View 1; (b) view 2; (c) correspondence established between the CSMPs visible in the views.
5.6 Range images of object views stored in the model database.
5.7 Five model hypotheses (b)-(f) generated using shape spectral analysis for a test view (a) of Cobra.
5.8 COSMOS-based matching: (a) CSMPs on the test view; (b) CSMPs on the model view with the highest matching score; (c) scene-model patch correspondence established between the views of Cobra.
5.9 Matching a test view in the COSMOS-based recognition system.
5.10 Pose estimation: (a) Model view registered with the test view of Vase2 at the end of the first iteration; (b) registered views after 3 iterations; (c) registered views after 4 iterations; (d) registered views after 5 iterations; (e) registered views at the convergence of the algorithm.
5.11 Pose estimation: (a) Model view registered with the test view of Phone at the end of the first iteration; (b) registered views after 2 iterations; (c) registered views after 3 iterations; (d) registered views after 4 iterations; (e) registered views at the convergence of the algorithm.
5.12 Pose estimation: (a) Model view registered with the test view of Cobra at the end of the first iteration; (b) registered views after the second iteration; (c) registered views at the convergence of the algorithm.
6.1 Point-to-plane distance: (a) Surfaces P and Q before the transformation T^k at iteration k is applied; (b) distance from the point p_i to the tangent plane S_i^k of Q.
6.2 Effect of noise in z measurements on the fitted normal when the plane is horizontal. The double-headed arrows indicate the uncertainty in depth measurements.
6.3 Effect of noise in z measurements on the fitted normal when the plane is inclined. The double-headed arrows indicate the uncertainty in depth measurements.
6.4 Effect of noise in z measurements on the fitted plane using the eigenvector approach: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.5 Effect of noise in z measurements on the planar fit using linear regression: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.6 Actual standard deviation of d_i versus the planar orientation: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.7 Estimated standard deviation of the distance d_i using the perturbation analysis versus the plane orientation: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.8 Actual standard deviation of d_i versus planar orientation using linear regression for plane-fitting: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.9 Estimated standard deviation of d_i versus planar orientation using linear regression for plane-fitting: (a) i.i.d. Gaussian noise; (b) uniform noise.
6.10 Relative error of the rotation matrix R.
6.11 Range images and the principal axes: (a) Cobra head with depth rendered as pseudo intensity - view 1; (b) cobra head rotated - view 2; (c) view 1 of Big-Y generated from its CAD model; (d) view 2 of Big-Y.
6.12 Range images of Face1: (a) View 1; (b) view 2.
6.13 Range images of Face2: (a) View 1; (b) view 2.

Chapter 1

Introduction

...Today's small piece is basically about arbitrary smooth lumps of three-dimensional space such as could be occupied by your typical potato or favorite torso, say... [Jan J. Koenderink, "Solid Shape"]

One of the major goals of computer science is to build machines that mimic human capabilities. Towards realizing this objective, research in building intelligent systems has traditionally concentrated on advanced cognitive skills such as reasoning, problem solving, and natural language understanding. However, a crucial component of intelligent behavior is the ability to sense and affect the world. The field of computer vision, which studies perception and how it can be combined with action, is therefore an important adjunct to constructing intelligent machines. The growing importance of computer vision is evident from the fact that it was identified as one of the "Grand Challenges" [57] and also from its prominent role in the National Information Infrastructure [52]. The primary goal in computer vision is to build a system that can automatically interpret a scene, given a snapshot (image) of the scene in terms of an array of brightness or depth values. While human vision is an existence proof of a system that operates flexibly in multiple environments, computer vision research so far has attempted to deal separately with images of outdoor and indoor scenes.

1.1 Automatic Model-Based Scene Analysis

Our interest lies in building systems to automatically interpret images of a scene, primarily in an industrial setting. We define a scene as consisting of one or more 3D man-made objects. An interpretation of an image (data) of a scene is defined as knowing which 3D objects are where in the scene. This interpretation conceptually binds the entities in the scene to objects that we already have knowledge about. Thus, we are dealing with model-based scene analysis. Deriving an interpretation of a scene involves solving two interrelated problems. The first is identification or classification, where a label is assigned to an object to indicate the category to which it belongs. The second problem involves the estimation of the pose (position and orientation), or localization, of the recognized object with respect to some global coordinate system attached to the scene.
The term "recognition" is used in computer vision to describe the entire process of automatic identification and localization of objects from the sensed images of scenes in the real world. Automatic scene analysis is indeed a difficult problem, as a recognition system has to make sense out of the pixels of an image array which by themselves contain very little information. Some knowledge of image formation and of how the objects in the world are structured is essential to make any assertion about the scene that is being viewed. Object modeling is the first step that makes this knowledge of the object structures and how they appear in an image explicit. For example, objects can be modeled as volumes, sets of bounding surfaces, or just lists of salient local features, and the image formation process can be described using either perspective or orthographic projection of the 3D scene into a 2D image array. The resulting image has to be processed to extract the regularities in the sensed 2D data, organize them as connected entities or "blobs", and characterize them in a way that is indicative of the objects and their spatial interrelations which are in fact manifest in the scene. Note that this process is confounded by practical problems such as sensor inaccuracies, variations in ambient illumination, and clutter and occlusion owing to objects being close to or overlapping one another in a scene.

Figure 1.1: Key components of a 3D object recognition system.

Recognition is the processing step that compares a derived description from the image, which is most likely to be incomplete, with stored models or representations of objects in order to identify what is present in the scene. A recognition module has to search among the possible candidate object representations to identify the best match and then verify whether the candidate solution is indeed correct. This search procedure can be very time-consuming because the search space of possible feature correspondences and view transformations is very large. The time complexity of matching depends crucially on the number of stored objects, how detailed or sophisticated the stored representations are, and how they are organized. In addition, the matching process has to deal with missing information in the input scene representation due to occlusion, and sometimes with spurious additional information resulting from incorrect merging of connected "blobs". Automatic recognition of objects from images has to be performed accurately and quickly for practical use in the real world. The integrated representation of various types of visual information in the scene, along with intelligent associations and real-time retrieval mechanisms, is among the challenges in building an automatic recognition system. Figure 1.1 presents the important stages in the design and development of a recognition system.

All the processing steps in an automatic recognition system have to be performed reliably and efficiently in order to be employed in many challenging real-world applications such as robot bin-picking, automated inspection of assembly parts, face recognition for security analysis, and autonomous navigation.
Reconstruction and recognition of various aspects of shape and other physical properties of the objects in the real world are some of the fundamental issues that need to be addressed during the creation of vision-augmented virtual environments. Medical diagnosis from X-ray and ultrasound images, and robot assistance in complicated surgeries, are some of the important areas where automatic recognition can make a significant impact and be beneficial to humanity.

1.2 Challenges in Three-Dimensional Object Recognition

The dominant and popular paradigm in 3D object recognition [135] assumes two-stage processing of scene data: first, an internal representation of the scene is derived from the input data that may have been obtained from one or more sensors such as CCD cameras and range scanners; at the second stage, it is matched against stored representations or models of known objects. This paradigm is also hypothesized to underlie human visual processing [113]. During the model building stage, analytical or CAD models of objects, if available, are typically used to construct descriptions of objects that may be present in the scene. Alternatively, information about 3D objects in the scene is gathered from various viewing directions and organized to construct their descriptions. An explicit model or a computer representation of a 3D object is important to subsequent recognition by the machine, as it influences the processing of sensed data that attempts to derive a similar description from the measurements in an image. A poor scene representation would place a heavy burden on the recognition algorithm when it matches the input against a collection of stored object representations, and would most likely lead to incorrect identification of the object. Therefore, while designing an object recognition system for a specified task, care should be taken to develop approaches that solve both the representation and matching problems in an integrated manner.

A number of representational and matching themes have recently been recognized by several researchers [70] as challenging and important, following a survey of a decade or so of intense activity in the computer vision field. These themes are related to the complexity, speed and generality issues that need to be addressed by a recognition system. We discuss below some of the emerging themes that are of interest to us.

1. Object shape complexity: 3D object recognition has so far dealt mainly with geometric entities in two and three dimensions, such as points or groups of points [106, 139], planar patches and normals [80, 78], straight edges and polylines [112, 147], polyhedral and quadric surfaces [22, 59, 30, 68, 164], and superquadrics [124, 144, 132]. The success of several existing object recognition systems can be attributed to the restrictions they impose on the classes of geometrical objects that can be handled. However, there has been a notable lack of systems that can handle arbitrary surfaces with very few restrictive assumptions about their geometric shapes. Interest has recently emerged in matching arbitrarily curved surfaces that cannot be modeled using volumetric primitives and that may or may not have easily detectable landmark features such as vertices and edges of polyhedra, vertices of cones and centers of spheres. A sculpted object may possess various smoothly blended surface features which may not lead to easy object segmentation into simple analytical primitives.
Given the complexity of the arbitrary shapes one can encounter in practical situations and the difficulty of representing them in a general fashion, it is not surprising that most computer vision systems have sought to address recognition of only a very restricted class of objects. However, complex real-world applications of computer vision systems have recently begun to stimulate the development of general 3D recognition systems that can handle arbitrarily curved objects.

2. Size of object model database: The organization of knowledge is strongly tied to efficient retrieval of pieces of information in an application. The analogue of stored knowledge in a recognition system is the set of descriptions, models, or representations of objects of interest that might be encountered by the system. Efficient retrieval here implies faster matching. In real-world applications, one typically finds the need to handle databases containing a large number of complex object models. By "large" we mean typically a thousand or more models. As the number of objects to be recognized by a system increases, the computational time to perform recognition of even a simple input image becomes discouragingly high. This is primarily because in most systems the input representation is matched against all the object models in the database. There is an increased awareness of this computational cost, and it has resulted in substantial research to prune the number of matches needed, either by using focus features [21] or by indexing a hash table based on invariant features [105, 23, 69, 147, 32] during recognition. In addition, approaches that can organize the representations in a hierarchical fashion, to eliminate unlikely matches quite early during recognition and subsequently present only a few candidate objects for final verification of their identity and pose, have begun to be considered seriously.

3. Learning: As the need for adaptation and flexibility in vision systems grows, we find a growing interest in systems that incorporate at least some aspect of learning [17, 170]. Most object recognition systems are built without the ability to reject portions of the scene as unrecognizable, and are therefore limited to domains wherein the set of objects to be recognized is pre-specified. A recognition system may perform well within the scope of its knowledge; but any slight deviation, such as noisy segmentation or representation outside the narrow expertise of the system, causes the performance to deteriorate rapidly. Many application domains of computer vision systems are by and large unstructured. Less controlled environments, therefore, mandate construction of systems that can automatically build models of "unexpected" objects. It is desirable to have the recognition system learn the description of an unknown object so that future instances of the object, when encountered by the system, will not make the system break down [127]. The ability to learn to recognize new inputs as well as to remember previous instances of objects is known as the adaptation or plasticity property. This property is lacking in current recognition systems, which are thus rendered quite brittle. It is also vital to be able to learn generalized representations of an object from multiple training instances in order to recognize a degraded or noisy instance of the object in an image. Learning can aid in overcoming the disadvantages of fixed parameters [137] and representations that are typical of current vision systems.
Further, learning enforces the use of a large variety of real data in performance evaluation of recognition systems, due to the need for the feedback that is necessary for valid generalizations.

4. Individual and generic object categories: A recognition system can represent and recognize either individual objects, or store and match descriptions of generic classes of objects. For example, descriptions can be stored for individual chairs and/or the generic class of chairs. Recognition of the latter category is a more difficult problem, since there is no unique structural description that characterizes the entire class of chairs, although a functional description of the class of chairs is available and simple. Functionality is tied to the description of geometric features that need to be present in an object for it to be recognized as an instance of a generic category. Although it has been known that function-based representations are appropriate for constructing "generic" models of objects, issues of implementation and experimentation with systems that use function-based representations to represent and recognize elements of a generic class of objects have begun to be addressed only recently [146].

5. Articulated objects: Some objects, such as a pair of scissors, have movable parts. Such objects are commonly referred to as articulated objects. The representation of 3D objects should encode the range of relative movement between the parts of articulated shapes. The representation that appears suitable here is parts-based, i.e., the individual components or parts of an object and their interrelationships have to be extracted reliably and used to represent the object as a whole. A family of shapes for a single object can be specified by parameterizing the point patterns representing the shape [163]. A multi-level representation can also be used to describe the parts in an object and their adjacency relationships, and also to allow some parameterized range on the geometry of the connections between the parts.

6. Non-rigidity of objects: Almost all object recognition systems assume that the objects under consideration are rigid. Flexible objects, such as organs (for example, heart and lung) in a human body, are those whose shape need not remain constant with time. A deformable object, unlike an articulated object, is one which is entirely non-rigid. Since most current representation schemes use either volumetric or surface-based primitives, it appears that they are inappropriate for representing flexible objects. It still remains to be investigated whether there exist primitives of fixed volume or of fixed surface shape at suitable scales to describe non-rigid objects. Applications in medical imaging have spurred some of the research in deformable shape representations such as deformable superquadrics and finite-element methods [158, 157, 86].

7. Information-preserving representations: An information-preserving representation of an object is one from which the original object can be reconstructed. Representations based on constructive solid geometry possess this feature. Extended Gaussian Image representations [85] of convex polyhedra can be used to recover them uniquely. Information-preserving representations are interesting as they can be used to synthesize or reconstruct objects. The other class of representations is the discriminatory representation, where only those features that discriminate between objects are captured. Here, reconstruction is not possible.
An obvious drawback of this representation is that the addition of new classes of objects to the database would require redesigning and rebuilding the knowledge base of object descriptions.

8. Occlusion: Another major issue in the design of a recognition system is how to reliably recognize partial data of objects in the scene. Recognition systems that deal with multiple objects present within the field of view of a sensor need to be able to recognize objects that may be partly occluded by others [161]. Self-occlusion is also possible with objects that are complex-shaped. The stored object description should be such that recognition of objects is possible even from partial information. A suitable representation possessing this property is one in terms of local attributes or features (i.e., those that require information from only a small neighboring region around their locations) of objects. However, if a representation utilizes only local information, then it may fail to capture the global shape of an object. The presence of "salient" local features would certainly be a crucial and deciding criterion in recognizing the object. Even if an object is partially seen, a total lack of distinguishing features may render its recognition impossible.

1.3 Free-Form Object Recognition

This dissertation deals with one of the main challenges described above. We concentrate on building a recognition system that can handle arbitrarily shaped objects. We present here our motivation to address this problem, and the advantages that result from the proposed solution to the problem.

1.3.1 Motivation and Statement of the Problem

As mentioned before, most current vision systems tend to be restrictive about the shape of the objects that can be handled. As Flynn and Jain note in [70], an obstacle to the widespread acceptance of 3D object recognition systems is representational: most current systems cannot accommodate sculpted surfaces and large model databases. Current recognition systems are limited to domains where the set of objects to be recognized is pre-specified, and the objects tend to be mostly polyhedral, quadric and superquadric. We focus on the need to represent and recognize arbitrarily curved objects without restricting our recognition system to limited geometrical shapes. Free-form object recognition is also sometimes referred to as sculpted object recognition. Observe that we do not include in our study statistically defined shapes such as textures and foams, arborizations (trees and bushes), crumpled surfaces (fractals), regular and quasi-regular tessellations of space and surface patches, and objects described using integral geometry. We also exclude from consideration surfaces that possess self-intersections and non-orientable surfaces [116] such as the Möbius strip and the Klein bottle. In general, the complexity of the scenes analyzed by a recognition system can be characterized in terms of (a) scenes containing multiple occluding objects, (b) scenes containing multiple non-overlapping objects, and (c) scenes containing single objects with possible self-occlusion. We restrict ourselves to recognizing scenes containing a single object. Further, the focus of this research is automatic identification of objects in industrial applications, and not automated inspection or face recognition.

1.3.2 Definition of Free-Form Surfaces

A free-form surface S is defined to be a smooth surface such that the surface normal is well defined and continuous almost everywhere, except at vertices, edges, and cusps [11].
Figure 1.2 shows a smooth free-form surface. Since there is no other restriction on S, it is not constrained to be polyhedral or piecewise-quadric. As Besl points out in his seminal paper [11], discontinuities in surface depth, surface normal or curvature may be present anywhere on the object, and the curves that connect these points of discontinuity may meet or diverge smoothly. The shape of the object can be arbitrary. Some representative objects with free-form surfaces are human faces, cars, boats, airplanes, sculptures, etc. In this thesis, we use the terms "objects with free-form surfaces" and "free-form objects" interchangeably.

Figure 1.2: Example of a free-form surface.

Figure 1.3 shows a set of 3D objects with free-form surfaces that is representative of objects that we wish a vision system to recognize automatically. The range images of objects shown in Figure 1.3 were obtained using a laser range scanner (Technical Arts White scanner) that produces depth data on an X-Y grid. The figures show surface depth as pseudo intensity, displaying the relative orientation of the surfaces; points oriented almost vertically are shown in darker shades. Observe that the surface data obtained from a single-view range imaging sensor typically take the form of a graph surface, and hence the surface parameterization takes a very simple form: s(u, v) = [u, v, f(u, v)]^T, where T indicates the transpose. However, we note that our proposed representation and recognition schemes work on any collection of (x, y, z) points on which the fundamental notions of metric, tangent space, curvature and natural coordinate frames can be suitably defined.
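Since much of what follows rests on curvatures computed from such graph surfaces, it is worth recalling the standard differential-geometry relations for a surface of this (Monge patch) form; these are textbook identities rather than notation specific to this thesis. Writing f_u, f_v, f_uu, f_uv, f_vv for the partial derivatives of f, and setting W^2 = 1 + f_u^2 + f_v^2, the Gaussian and mean curvatures are

\[
K = \frac{f_{uu} f_{vv} - f_{uv}^2}{W^4},
\qquad
H = \frac{(1 + f_v^2)\, f_{uu} - 2 f_u f_v\, f_{uv} + (1 + f_u^2)\, f_{vv}}{2 W^3},
\]

and the principal curvatures follow as

\[
\kappa_{1,2} = H \pm \sqrt{H^2 - K}.
\]

These two quantities per point are all that the shape-based descriptors developed later require from the raw depth data.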
1 .3.4 Problem Difficulty Design of an appropriate representation scheme for 3D objects is a crucial factor that influences the ease with which they are recognized. Most of the object representa- tion schemes surveyed in Chapter 2 have adopted some form of surface or volumetric parametric models to characterize the shapes of objects. Current volumetric rep- resentations rely on representing objects in terms of spatial occupancy, generalized cylinders, superquadric or set-theoretic combinations of volume primitives as in con- structive solid geometry (CSG). However, objects with free-form surfaces, in general, may not have simple volumetric shapes that can be easily expressed with, for example, a superquadric primitive, even though it may contain eight (including bending and tapering) parameters. Further, the difficulty with recognizing an object via match— ing volumetric descriptions is that many views of the object must be used because there is uncertainty in the extent of the object in a direction parallel to the line of view. However, humans can identify objects even from a partial view. This aspect is 14 Local patches of unconstrained geometry Free-form surfaces Volumetric primitives Supenquadrics Analyticalsurface _ primitives Quadnc surfaces Edges and normals Polyhedral objects Representation scheme Points and contours Simple 2D shapes Complexity of object domain Figure 1.4: Representation and object shape complexity. a motivating factor for matching objects using “surface—based” representations that describe an object in terms of the properties of the surfaces bounding the object, such as the surface normals, curvatures, etc. This is more commonly employed for recognition since the representation directly corresponds to features easily derived from the sensed image of the scene. In addition, matching only the observed surface patches with those of a stored representation can aid in recognition of partially seen objects. Surface representations are mostly based on a small set of analytical surface prim- itives that either exclude sculpted objects from their domains, or allow free—form sur- faces at the expense of approximating the object with a very large number of simple primitives such as quadric and bicubic surface patches. Such an approximation tends to be coarse or fine depending on the number of primitives used. If it is coarse, then it may not capture the shape of the object accurately and hence can be ambiguous. If it is too fine, the number of primitives will be too large leading to a loss of global shape. Global representations such as the extended Gaussian image (EGI) [85] and other orientation-based descriptors [118, 96, 109, 115] describe 3D objects in terms of their surface-normal distributions on the unit sphere, with appropriate support functions. They handle convex polyhedra efficiently, but arbitrarily curved objects have to be either approximated by planar patches or divided into regions based on the Gaussian curvature. Figure 1.4 portrays various classes of objects and the representation schemes that are commonly adopted to describe them. 15 Real World (single object view) Sensed Data Scene l (Range) Representation J Multiple Object Views Identity & pose of the input Pose Estimation Structured Database of Object View Representations Figure 1.5: An overview of the proposed approach. 
1.4 Overview of the Thesis The goals of the proposed 3D object recognition system are the following: e Design a representation scheme that can be used to describe sculpted objects with free-form surfaces as well as objects composed of simple analytical surface primitives. The scheme should be as compact and expressive as possible for accurate recognition of objects from a single range image. e Recognize objects from a single view using range data. 0 Estimate the pose and orientation of the recognized object from range data robustly. Figure 1.5 presents an overview of the proposed approach. In our system the fig- ure/ ground separation problem has been alleviated by considering uncluttered scenes and by using range images. Object recognition is performed from a single range image of a view of the object. The geometric approach we propose is based on surfaces and their shapes. We employ a modified definition of shape index originally proposed by Koenderink for graphical visualization of surfaces [101] to identify the shape cate- gory to which each surface point on an object belongs. Our approach makes use of the shape index to represent complex objects for recognition. An object is concisely characterized by a set of maximally sized surface patches of constant shape index and 16 their orientation-dependent mapping onto the unit sphere. This spherical mapping of the maximal patches not only results in a description of the object’s rotation in 3D space but also aggregates those patches that get assigned to the same point on the sphere using shape spectral functions; this allows us to summarize objects by the shape categories of the surface components, especially when multiple components of the same shape index and orientation are present. The points on the unit sphere that are mapped by the maximal patches are further characterized by a set of appropriate support functions describing the shape, average curvedness and surface area of the patches. The average curvedness of a surface patch specifies whether it is highly or gently curved; the surface area quantifies its extent or spread in three-dimensional space; the orientation (mean surface normal) of the patch describes how it is oriented or directed in 3D space. The relative spatial arrangement of the surface patches as captured by their adjacency is also built into the representation. We refer to our representation scheme as COSMOS (Curvedness- Orientation-Shape Map On Sphere). The main strength of our scheme is the integration of local and global shape infor- mation that can be computed easily from sensed data and is reflective of the under- lying surface geometry. Thus, our scheme provides a meaningful and rich description maintaining the recoverability of several classes of objects. The representation is compact for many classes of objects that contain only a few distinguishable surface patches of constant shape index, i.e, whose surface shapes do not change rapidly over large regions of the object. It is also a general scheme capable of representing arbitrar- ily curved 3D objects and objects with holes, as it does not rely on analytical surface primitives to approximate regions. A novel concept of shape spectrum of an object is also introduced within the framework of COSMOS for object recognition. We pro- pose a new shape-spectral feature based scheme for grouping object views of sculpted objects, that obviates object segmentation into parts and edge detection. 
These fea- tures allow object views to be grouped meaningfully in terms of the shape categories of the visible surfaces and their surface areas. By exploiting view-grouping in model databases, a small number of plausible correct matches can be quickly retrieved for more refined matching. We have demonstrated that in a database containing 6400 17 views of 20 different objects, only 20% of the database was examined, on the average when 2,000 independent test views were tested for their correct classification. The proposed model view selection scheme is general and relatively easy to use. A novel multi-level matching strategy that employs shape spectral analysis and features derived from the COSMOS representations of objects is proposed for fast and accurate recognition of free-form objects. Given a range image of an uncluttered view (allowing self-occlusion) of an object, the shape spectrum-based model selection scheme short-lists a few promising candidate views from a database of object views. During View hypothesis verification, we use the COSMOS representations of object views to determine the scene-model feature correspondences using a combination of search methods, and thus identify and localize the object in the input view. Object pose estimation is formulated as registration of the sensed data with the range image of the best matched view of the object. We present a minimum variance estimator to robustly register two range images of a complex object and compute their relative view transformation accurately. Experiments on a database of over 6,000 object views generated from CAD models and surface triangulations and 100 range images of several complex objects acquired using a 3D laser range scanner have demonstrated the strengths of our COSMOS based 3D object recognition system. 1.5 Main Components of the Recognition System We describe here some of the key traits that distinguish our proposed 3D recognition system from most other work in computer vision. e Free-form surfaces: Our system can handle general 3D rigid objects that may be arbitrarily curved. e Representation: Our representation provides a shape-based description of ob- jects and is suitable for representing free-form surfaces without requiring com- plex 3D analytical modeling of objects. It can describe complex shaped objects concisely in terms of the surface patches that are mapped to a unit sphere by 18 their orientations, along with a set of support functions specifying their geo- metric attributes. The spherical mapping of surface patch normals allows us to derive a global orientation of objects and at the same time provides us an elegant method to derive high-level geometric feature summaries of multiple surface patches with the same orientation that may map to identical points on the unit sphere (a realistic situation with nonconvex objects) in terms of their various shape categories and their local surface attributes. Recognition: Our multi-level recognition strategy uses a powerful pruning tech- nique at the first level to eliminate the non matching candidate object models quickly by matching “shape spectral” features of an input object view with those of the views present in a structured database. At the second level it per- forms detailed matching of components of the COSMOS representations of the retrieved candidate views to establish the model-scene feature correspondences. The second level thus determines the correct object identity of the input view. 
The 3D rotation of the object is computed using the corresponding surface normals established from the CSMP correspondences between the scene and the stored object view.

• Pose estimation as registration: In order to accurately estimate the pose of the object in the scene, we derive a robust minimum variance estimator to compute the transformation between the scene view and the matched object view using their range data. The initial estimate of the object pose is refined using an iterative minimization scheme. We have proposed an error model that takes into account uncertainties in z measurements at different orientations of the surfaces, in order to handle numerical errors, surface orientation effects, etc. We show that the transformation parameters estimated using our weighted objective function [50] are significantly more accurate than those obtained using an unweighted distance criterion [38].

• Object database: The model database used in our experiments consists of two categories of range images of object views that are used to test the strengths of our representation and recognition schemes. The first category consists of 6,400 object views generated from 320 viewing directions using 20 different models (available as CAD models and surface triangulations) of free-form objects. These 3D surface models were obtained from an on-line database resource (Section 4.5.3). The other category includes range images of multiple views of ten different objects; these were obtained by scanning the objects using a laser range scanner.

1.6 Organization of the Thesis

The rest of this thesis details the key ideas that have been outlined above. Chapter 2 presents a literature survey of previous work related to 3D object recognition. Separate sections are devoted to 3D object representation, recognition strategies and free-form object matching techniques. Chapter 3 describes our proposed COSMOS representation scheme. The COSMOS representation characterizes an object by a set of new surface primitives referred to as CSMPs (Constant Shape Maximal Patches). Definitions of this and other components of COSMOS, and properties of this representation scheme, are provided in Chapter 3. This chapter introduces the concept of the shape spectrum of an object and describes techniques for deriving the COSMOS representation of an object view from surface depth data obtained using a laser range scanner. It also presents experimental results with real range images of several different objects. Chapter 4 describes a multiple-view based object model and addresses the problem of constructing view aspects of free-form objects for efficient matching during recognition. It introduces a novel view representation based on shape spectral features and proposes a general and powerful technique for organizing multiple views of objects of complex shape and geometry into compact and homogeneous clusters. It also describes the structuring of a large model base of views for quick matching against input object views. Chapter 5 proposes a multi-level recognition strategy that exploits the representational power of COSMOS to establish the correct identity of the objects in the scene and their spatial pose, and presents experimental results using real range images. Chapter 6 looks at object pose estimation as a registration problem and proposes a new minimum variance estimator for accurate pose estimation from a given pair of views of an object. It also presents and discusses experimental results.
Chapter 7 summarizes the important results of this work and outlines possible directions for future research.

Chapter 2

Three-Dimensional Object Recognition

The object recognition problem discussed in this thesis addresses a number of significant research issues in computer vision: representation of a 3D object, identification of the object, robust estimation of its pose, and registration of multiple views of the object for automatic model construction. This chapter surveys the previous work in three important topics of computer vision that relate to this thesis: representation, matching and pose estimation of a 3D object. It also presents an overview of the free-form surface matching problem, and describes current research efforts to solve this problem. Previous work in other relevant topics such as registration of multiple object views is discussed in Chapter 6.

2.1 Recognition of 3D Objects

Three-dimensional object recognition is a topic of active interest motivated by a desire to provide computers with "human-like" visual capabilities and also by the pragmatic need to aid numerous real-world applications such as robot bin-picking, autonomous navigation, automated visual inspection and assembly tasks. The dominant paradigm in computer vision proposes to achieve recognition and localization of 3D objects [91] from images by a two-stage process: first derive an internal representation of a scene from the sensed input data and then match it against stored representations of objects in the database. Figure 2.1 shows popular approaches to the design and development of these processing stages. Besl and Jain [14], Suetens et al. [152] and Arman and Aggarwal [3] present comprehensive surveys of 3D object recognition systems. The spectrum of 3D object recognition problems is also discussed in [5, 40]. Sinha and Jain [143] also provide an overview of geometry-based representations derived from range data of objects. In this chapter we discuss some of the influential schemes for representation of sensed data and recognition.

Model-based 3D object recognition systems differ in terms of a number of factors, namely: (i) the type of sensors used, (ii) the kinds of features extracted from an image, (iii) the class of objects that can be handled, (iv) the approaches employed to hypothesize possible matches from the image to object models, (v) the conditions for ascertaining the correctness of hypothesized matches, and (vi) the techniques to estimate the pose of the object. We discuss these various design issues in our descriptions of the popular schemes prevalent for recognition and representation.

2.1.1 Sensors

Among the various types of image sensors used for 3D object recognition, the two most commonly used are: (a) intensity sensors that produce an image (a 2D array) containing a brightness measurement I(x, y) at each pixel location (x, y) of the image, and (b) range sensors which provide the range or distance z(x, y) of a visible surface point (pixel) on the object from the sensor. The brightness measurement of a scene at a location in an image is a function of the surface geometry, the reflectance properties of the surfaces, and the number and positions of the light sources that illuminate the scene. Range sensors are often calibrated to result in images with coordinates that are directly comparable with the coordinates used in object representations. A big advantage of range data over intensity (brightness) data is that it explicitly represents surface information.
This makes it easier to extract and fit mathematical surfaces to the data. Range data are also usually not sensitive to ambient lighting. Regions of interest belonging to objects can be separated from the background easily in range images based on the distinctive background depth value provided by the sensor, as opposed to intensity images, which can have complex backgrounds with varying grey-levels that are similar to those of the objects themselves. A taxonomy and detailed descriptions of range sensing methods can be found in [94]. A comprehensive description of the Technical Arts 100X scanner that was used to obtain depth data for the experiments reported in this thesis is found in [66]. A recent surge of interest in medical applications has also motivated the use of magnetic resonance imaging (MRI) and other types of 3D data obtained through medical imaging modalities.

[Figure 2.1: Approaches to building a 3D object recognition system.]

2.1.2 3D Object Representations and Models

We now discuss various geometric approaches to representing objects and their applicability to different object domains. We hasten to note that the performance of a recognition system can in general be improved by incorporating information about other cues such as the color and texture of objects. However, in this thesis we concentrate only on geometry-based representations. Any representation employed in object recognition should have the following properties: (i) it should be rich, so that similar objects can be clustered together easily from their descriptions, (ii) it should be stable, such that local changes do not radically alter the description, and (iii) it should also have local support, so that partially visible objects can be identified. We discuss below some of the well-known representation schemes that attempt to satisfy one or more of the above conditions.

Representations used in computer vision can be fundamentally classified into object-centered and view-centered categories. Geometric techniques that use an object-centered representation attempt either to describe the entire volume of the 3D space occupied by a solid opaque object or to use intrinsic, viewpoint-independent features of objects, such as corners, holes and straight edges, that are projected onto the image under non-accidental viewing conditions and are detectable by various image processing operations. On the other hand, viewer-centered representations rely on specifying the "appearance" of an object from a single viewpoint or a set of multiple viewpoints, and use viewpoint-dependent features, such as occluding contours, silhouettes and T-junctions of a shape, that are not intrinsic to an object. A further distinction can be made among these classes of representations depending on whether they are local or global shape descriptors.

Among the object-centered representations are the boundary-based methods, volumetric descriptors and sweep representations. The boundary-based local methods represent objects as lists of faces, edges and vertices.
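As a schematic picture of such a boundary representation, the sketch below stores an object as indexed lists of vertices, edges and faces; the exact fields are illustrative assumptions rather than any particular system's data structure.

    from dataclasses import dataclass, field

    @dataclass
    class BRep:
        # Minimal boundary representation: geometry lives in the vertex
        # list; topology lives in the edge and face index lists.
        vertices: list = field(default_factory=list)  # (x, y, z) tuples
        edges: list = field(default_factory=list)     # (i, j) vertex index pairs
        faces: list = field(default_factory=list)     # tuples of vertex indices

    # A single triangular face as a one-face object:
    tri = BRep(vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
               edges=[(0, 1), (1, 2), (2, 0)],
               faces=[(0, 1, 2)])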
Viewpoint-independent features, such as long and straight edges on the objects that are projected onto the image, are included in such boundary-based representations. Early systems [112, 78] represented a polyhedral object domain using edges, straight line segments and normals, and reported promising results. Since polyhedral representations of curved objects require large amounts of space to adequately approximate them, both planar and quadric equations were then used to describe the surfaces [59, 173, 58]. Some early work also extracted, from range data, surface patches enclosed by boundaries that were orientation discontinuities [122]. Bolles and Horaud [22] used cylindrical and planar surfaces surrounded by circular arcs and straight lines. Surfaces were also classified into primitive shapes such as peak, pit, saddle, etc. based on the signs of the Gaussian and mean curvatures [10, 167]. A structural representation [147] of objects that specifies edges and local surface patches in terms of their surface normal distributions has recently been advocated to handle general free-form surfaces. The local boundary and surface-based methods, in general, are sensitive to noise in the sensed data and depend on reliable extraction of the primitives describing the objects from input images. In situations where the data sampled from a surface are sparse, the surface can be represented as a collection of triangular patches, and more advanced surface reconstruction techniques can then be used to interpolate from these patches. Triangular approximations of surfaces provide very little information about object parts or components, but are nonetheless useful when no other method is suitable.

The volumetric methods describe the subset of points in 3D space that are contained within an object by representing the object as an implicit or parametric function in an object-centered coordinate system. As the implicit function represents the entire shape of the object, the representation is global. Representations using voxels, octrees [136] and superquadrics [6] belong to the class of global, volumetric representations. In a voxel representation, an object is described by the union of non-overlapping cubes, where the voxels (cubes) are oriented in a rectilinear fashion and are positioned in a 3D square lattice. Octrees describe objects in a hierarchical manner: the root of the tree is a cube that encloses the object completely, and it is subdivided hierarchically into octants to decompose the space occupied by the object to a very fine resolution. A superquadric primitive is a generalization of a class of ellipsoids called superellipsoids, and it has been used in computer graphics [6]. Pentland proposed it for shape representation in computer vision applications [124] and demonstrated its use in building realistic-looking object models. A superquadric representation for an object from range data is obtained by fitting an implicit equation [144] to a set of input data points. The limited set of shapes represented by superquadric primitives can be extended to build more complex volumetric primitives by adding parameters and global deformations, such as tapering, bending, and twisting, to the generic implicit equations. The disadvantage, however, is that the fitting process becomes much more expensive and numerically unstable.
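For reference, a sketch of the standard superquadric inside-outside function that such fitting works with; the parameter names (a1, a2, a3 for size, eps1, eps2 for shape) follow common usage, and the squared-residual criterion in the comment is one typical choice rather than the only one.

    def superquadric_F(x, y, z, a1, a2, a3, eps1, eps2):
        # Inside-outside function of a superquadric in canonical pose:
        # F < 1 inside, F == 1 on the surface, F > 1 outside.
        # Fitting typically searches for parameters minimizing a residual
        # such as sum((F ** eps1 - 1) ** 2) over the range data points.
        xy = (abs(x / a1) ** (2.0 / eps2)
              + abs(y / a2) ** (2.0 / eps2)) ** (eps2 / eps1)
        return xy + abs(z / a3) ** (2.0 / eps1)

With eps1 = eps2 = 1 this reduces to an ellipsoid; pushing the exponents toward 0 squares the cross-sections off toward a box, which is what gives the primitive its expressive range.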
One of the proposed solutions is to segment objects into a set of parts that can be modeled using superquadrics [61]; however, the computational complexity still remains prohibitively high. Although powerful in describing shapes, superquadrics have some drawbacks. They are intrinsically symmetric along the x, y, and z axes, and their geometric bounds are just simple Cartesian cubes. While the original parts-based representation used superquadric functions with multiplicative deformations, Pentland and Sclaroff [123] used spheres augmented with "modal" deformations that are additive. This approach, when tested on the recognition of human head shapes, yielded a high accuracy (96%), but it required that a parts segmentation and an initial orientation estimate be available. Objects have also been modeled using a constructive solid geometry (CSG) based approach [37], where an object is represented as a binary tree; each leaf represents an instance of a simple volumetric primitive and each internal node in the tree represents a regularized Boolean operation on its children. A recent effort [20] emphasizes multiple surface representations, ranging from quadrics to superquadrics to generalized cylinders, to handle a large class of natural objects. A particular choice of representation is made based on the nature of the information available from the range image.

While volumetric representations describe objects as solid 3D primitives, sweep representations define objects by combining 2D surfaces with a sweeping rule. The generalized cylinder (GC), also called the generalized cone, is a representative of this class. A generalized cylinder is defined by a 3D space curve that serves as an axis, a 2D closed cross-sectional shape, and a sweeping rule along the axis. GCs are especially suited for elongated shapes containing axial symmetry [25]. However, recognition using generalized cylinders has in general been hard due to the difficulty of extracting GCs from input images. GCs may not be a natural representation for non-elongated shapes such as polyhedra and for other shapes that may not have any elongated part.

Another set of global representations includes a class of orientation-based descriptors such as the extended Gaussian image (EGI) [85], the support-function-based representation (SFBR) [118], the complex EGI (CEGI) [96] and the generalized Gaussian image (GGI) [109]. These map surface normal distributions of an object to the unit sphere with appropriate support functions, thus creating an orientation histogram. The support function at every point on the unit sphere with the EGI representation is the Gaussian curvature. An attractive feature of the EGI is that it is based on Minkowski's theorem, which states that for a closed convex object, the Gaussian curvature, as a function of the unit surface normal, is sufficient to uniquely determine the surface up to a translation [128]. However, it cannot uniquely represent nonconvex objects [85, 128]. In addition, the translation of an object cannot be recovered when the EGI is used for recognition purposes. In the case of the SFBR, the support function associated with the spherical mapping of normals on the object is the distance of the tangent plane at a point from a predefined origin. It is less compact, but can uniquely determine a closed surface [118, 109]. The drawback here is its dependence on the choice of the origin; representations of the same object vary when the origin is chosen differently.
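In discrete form, all of these orientation histograms share the same skeleton: each surface element votes its support value into a cell of a tessellated unit sphere indexed by its normal. The sketch below uses a simple latitude-longitude binning and patch area as the support function; both are simplifying assumptions (practical systems use more uniform tessellations and the scheme-specific support functions described here).

    import math

    def orientation_histogram(normals, areas, n_lat=16, n_lon=32):
        # normals: unit vectors (nx, ny, nz); areas: corresponding surface areas.
        # Accumulates area into latitude-longitude cells of the unit sphere,
        # yielding a discrete EGI-style orientation histogram.
        hist = [[0.0] * n_lon for _ in range(n_lat)]
        for (nx, ny, nz), a in zip(normals, areas):
            theta = math.acos(max(-1.0, min(1.0, nz)))   # polar angle in [0, pi]
            phi = math.atan2(ny, nx) % (2.0 * math.pi)   # azimuth in [0, 2*pi)
            i = min(int(theta / math.pi * n_lat), n_lat - 1)
            j = min(int(phi / (2.0 * math.pi) * n_lon), n_lon - 1)
            hist[i][j] += a
        return hist

Swapping the accumulated quantity (Gaussian curvature, tangent-plane distance, a complex-valued support, or neighbor links) is what distinguishes the EGI, SFBR, CEGI and GGI variants from one another.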
The CEGI has been proposed as a representation for 3D pose estimation of primarily convex objects. It augments the support function of the EGI by also storing the distance of a point from a specified origin in the direction of the normal, and allows one to recover the translation of a convex object uniquely. In the CEGI representation, the support function is stored as a complex number, and it is possible for several nonconvex objects to have the same CEGI representation. The GGI [110, 109] extends the EGI approach by storing the connectivity information between immediately neighboring points on the unit sphere, thus ensuring the uniqueness of the representation for all objects. The multiple folds present in the Gaussian image arising from concavities in the object are explicitly modeled using a linked list of neighbors, thus preserving the connectivity between the points. This aids in representing nonconvex objects uniquely. Note that except in the case of convex polyhedra (where all the points lying on a planar face map to the same point on the unit sphere), all of these representations in general map every point on the object onto the unit sphere. They mostly attempt to answer the question of the recoverability of an object from its representation. From the point of view of recognition, however, they are verbose and they fail when parts of objects are occluded. It is also not clear how to segment the Gaussian image extracted from an image of a scene containing multiple objects into distinct regions corresponding to individual objects, except when the object shapes are simple.

Among the recent global approaches to object representation are techniques that fit bounded algebraic surfaces of a fixed degree to a set of data points [156]. Algebraic surfaces are attractive because they can be used to compute the limb edges and other properties of the object. During recognition, invariant quantities are computed from the algebraic equations of observed and reference surfaces [71] and then compared. This is a growing area of research, and issues such as bounding constraints, convergence of surface fitting, and recognition need to be investigated thoroughly. Occlusion again is a problem here, as there is no guarantee that the polynomial computed from a partial view of an object is similar to the polynomial computed from its complete view, all around the object. Surface reconstruction using parametric surfaces such as B-spline surface patches has also been employed to approximate a 3D surface [111].

All the above object-centered representations concentrate on describing the complete 3D shape of an object in an intrinsic manner. The goal of a viewer-centered representation scheme is to summarize the set of possible 2D appearances of a 3D object. Motivated by some of the psychophysical findings [154], objects are represented by a set of two-dimensional views rather than a single object-centered three-dimensional model. The aspect graph approach [99] attempts to group what is possibly an infinite set of 2D views of a 3D object into a set of meaningful clusters of appearances. It partitions the viewpoint space of an object into regions of "similar" views called aspects, separated by a "visual event" that occurs on the boundary between two neighboring classes of views. The visual event signals a change in the topology of the silhouette of the object.
The aspect-based representations take the form of a connected graph in which every node denotes an aspect and every connecting edge a visual event. The topic of aspect graphs has proved to be a fertile area of research [55, 74, 75, 103, 126, 34, 145, 149, 148, 169, 56]. While some researchers [55, 74, 75, 103, 126, 145] have assumed orthographic projection to compute the aspect graphs of objects, others [54, 149, 148, 169, 56] have attempted to compute the more general perspective projection aspect graph. The enormous size and complexity of aspect graphs for even simple polyhedral objects are the primary reasons for the lack of their widespread use in recognition. Computing aspect graphs of general 3D objects is still an unsolved problem. Efforts have also been made to organize multiple views generated using CAD models into aspects; in [89], edges and faces of polyhedra are used to classify an object appearance into a small number of aspect groups. Hansen and Henderson [83], Ikeuchi and Kanade [90], Camps et al. [28], and Arman and Aggarwal [2] advocate an "automatic programming" approach in which recognition procedures that exploit both the CAD models and view aspects are constructed by processing the database of object models. The utility measures of features extracted from CAD models in this context are discussed in [31].

View-based recognition strategies [127, 53, 26, 117] have been particularly applied to object domains for which geometric object models can be difficult to obtain. A wing representation [36] models 3D objects using a set of views, each of which is a set of 2½D primitives called wings that describe a pair of surface patches separated by a 2D contour segment. A recent approach [9] that advocates a viewer-centered representation uses a set of silhouettes of objects as models to recognize smooth objects using an alignment approach. The projection of the object's boundaries in other views can be expressed as a linear combination of the three model views, given that the correspondence between points in all views can be specified. Although not geometry-based, a recent technique [119, 117] called parametric eigenspace has been proposed to project a large set of 2D appearances of an object into an "eigenspace" (parameterized by pose and illumination) using principal component analysis [121, 73, 160]. The eigenspace is constructed by computing the eigenvectors of a complete image set and retaining only a few eigenvectors to capture the variations in the appearances of the object. Swets and Weng [153] built upon the basic eigenspace features by augmenting them with discriminant analysis to achieve better inter-class separability. Chen and Jain [35] organized the appearances in a hierarchical manner so that only optimal views are examined at every level of representation. The main drawback of most view-centered representations is the lack of terseness in object descriptions. If an object can be specified using a few parametric equations, then a viewer-centered representation is certainly not appropriate. However, in describing complex objects whose shapes cannot be captured by a single analytical form, or compactly by a set of equations, viewer-centered representations can play an important role.
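A sketch of the eigenspace idea using plain numpy: stack the training appearances as vectors, compute the principal components, and represent each appearance by its first few coefficients. This is a generic PCA illustration, not the specific parametric-eigenspace machinery of [119, 117].

    import numpy as np

    def build_eigenspace(images, k=10):
        # images: (n_views, n_pixels) array of vectorized appearances.
        X = np.asarray(images, dtype=float)
        mean = X.mean(axis=0)
        # SVD of the centered image set; rows of Vt are eigenvectors
        # ("eigenimages") of the covariance, ordered by variance.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = Vt[:k]
        coeffs = (X - mean) @ basis.T      # each view as a k-vector
        return mean, basis, coeffs

    def project(image, mean, basis):
        # New view -> point in the k-dimensional eigenspace; recognition
        # then reduces to nearest-neighbor search among stored coefficients.
        return (np.asarray(image, dtype=float) - mean) @ basis.T

Retaining only k eigenvectors is what buys the terseness that raw view collections lack, at the price of sensitivity to pose and illumination coverage in the training set.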
An integrated approach to describing an object using both view-independent attributes and view-dependent features has been adopted by Flynn and Jain [68], wherein a relational graph-based description of planar and quadric surface patches obtained from the CAD models of the object is stored along with the patch areas of the object from a large number (320) of viewpoints. The combined information serves as an object model.

Some of the recent and more general shape representation schemes attempt to capture a variety of details about objects: viewing an object as a composition of primitive parts called geons [18, 43, 133]; representing the articulatedness of an object as a parameterization of relative movement between the parts that comprise the object; modeling non-rigid objects such as hearts and lungs using deformable superquadrics [157] and methods using finite element analysis [86]; and employing function (utility)-based attributes to describe a generic category of objects [146]. General difficulties with parts-based representation schemes are the lack of consensus on the set of part primitives that need to be used, and on justifying why they are necessary, sufficient and appropriate. The generality of a parts-based representation often leads to vagueness in terms of its practical application. In addition, the computation of all the primitives from a single (image) view of an object is difficult. Some of these issues are receiving serious attention.

Before we conclude this section, we note that most of the object representation schemes reported in the literature have adopted specific parametric forms to characterize the shapes of objects, thus constraining the applicability of the schemes to a restricted class of objects. Table 2.1 presents an overview of some of the key representation schemes and applicable object domains. An emerging theme of importance is the design of representations that can handle general 3D objects, which can be arbitrarily curved with complex shapes.

2.1.3 Matching Strategies

In the previous section we discussed various approaches for representing objects. The next step is the recognition and localization of objects that may be present in the scene. Recognition is achieved by matching features derived from the scene with stored object model representations. Each successful match of a scene feature to a model feature imposes a constraint on the matches of other features and their locations in the scene. A consistent set of matches is referred to as a consistent scene interpretation. Approaches vary in terms of how the match between the scene and model features is achieved, how a consistent interpretation is derived from the scene-model feature matches, and how the pose is estimated from a consistent interpretation. In the following discussion, "scene" and "image" are used interchangeably, and so are "model" and "object model". A "model" indicates a stored representation.

Major Approaches

The popular and important approaches to recognition and localization of 3D objects are the following: (i) hypothesize-and-test, (ii) matching relational structures, (iii) Hough (pose) clustering, (iv) geometric hashing, (v) interpretation tree (I.T.) search, and (vi) iterative model fitting techniques.

In the hypothesize-and-test paradigm, a 3D transformation from the object model coordinate frame of reference to the scene coordinate frame of reference is first hypothesized.
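A schematic of the paradigm, with its two stages reduced to callables: propose a candidate transform from a minimal set of tentative correspondences, then accept or reject it by the residual alignment error. Both callables are placeholders for the scheme-specific machinery discussed next; the trial loop and threshold are illustrative assumptions.

    def hypothesize_and_test(scene, model, propose, residual, max_error, n_trials=100):
        # propose(scene, model): guess a rigid transform T from a minimal
        #   set of tentative feature correspondences.
        # residual(T, scene, model): alignment error of the model features
        #   projected into the scene under T.
        best, best_err = None, float("inf")
        for _ in range(n_trials):
            T = propose(scene, model)
            err = residual(T, scene, model)
            if err <= max_error and err < best_err:  # accept this hypothesis
                best, best_err = T, err
        return best  # None if every hypothesis was rejected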
Hypothesis generation generally involves formulating a system of over-constrained linear or non-linear equations that relate the model features to the scene features through a transformation. The system of equations is solved to provide the transformation that minimizes the squared error, which characterizes the quality of the match between the model and scene features. The transformation is used to verify the match of model features to image features by aligning the model features (typically points and edge segments) with the scene features. The hypothesized match is either accepted or rejected depending on the amount of matching error. Lowe [112], Huttenlocher and Ullman [87] and Seales and Dyer [138] have presented representative work in this paradigm.

Table 2.1: An overview of popular object representation schemes (scheme | type of shape descriptor | object domain | sensing modality | viewpoint dependency).

Points (corners and inflection points along edge contours) [87] | local | objects with well-defined local features | intensity | stable over changes in viewpoint
Straight line segments [112] | local | polyhedra | intensity | viewpoint-invariant over wide ranges
Points, planar faces and edges [78] | local | polyhedra | range | viewpoint-independent
Silhouettes of 2D views [8, 162, 33] | global | curved objects | intensity | viewpoint-dependent
Circular arcs, straight edges, cylindrical and planar surfaces [22] | local | planes and cylinders | range | viewpoint-independent
Planar and quadric surface patches [58, 59, 68] | local | planes, quadric surfaces | range | viewpoint-independent
Gaussian and mean curvature based patches [10] | local | curved objects | range | viewpoint-invariant
Generalized cylinders (GC) [24] | global | generalized cylinders | intensity | object-centered
Superquadrics [124, 144] | global | curved objects | range | object-centered
Geons [133, 44] | parts-based | curved (including articulated) objects | range | object-centered
Constructive surface geometry (simple volumetric primitives) [37] | local | curved objects | range | object-centered
Extended Gaussian image (EGI) [85] | global | convex objects | intensity | object-centered
Algebraic polynomials [156] | global | curved objects | range | object-centered
Splash and super (polygonal) segments [147] | local | arbitrarily curved objects | range | viewpoint-independent
Eigen faces [117, 153] | global | general 3D objects | intensity | viewpoint-dependent
Aspect graphs [99, 126, 56] | global | convex polyhedra and a class of curved surfaces | intensity | viewpoint-sensitive
Convex, concave and planar surfaces [93] | local | arbitrarily curved objects | range | object-centered

An earlier recognition system, 3DPO [22], uses a distinctive scene feature to match the corresponding model feature, generating a hypothesis to search for matches of other scene features adjacent to the already matched distinctive feature. Verification is done by comparing the synthetically generated range map of the model object at the hypothesized pose with the scene data. The measured error during the verification stage in turn drives the hypothesis generation process.

Fischler and Bolles [64], in their RANSAC system, solve for a perspective transformation to project a planar model into an image. The transformation is computed using a heuristic, and it is verified by projecting a set of coplanar points into image coordinates. Lowe, in his system SCERPO [112], solves for a perspective transformation that relates the three-dimensional constraints to a single set of image measurements. The plausible group of matches between scene and model features that are straight line segments is searched for in the image of an object taken from a single viewpoint. The correspondences are established by refining a given estimate of the transformation using an iterative least squares technique. In the alignment approach [87], three pairs of non-collinear points, each pair containing a point in the image and its corresponding point on the model, determine the transformation. A model of an object is represented as a combination of a wire frame and local point features (vertices and inflection points). It consists of the three-dimensional locations of the edges of its surfaces (a wire-frame) and the corresponding corner and inflection features, as shown in Figure 2.2. In [87], different models of an object from different viewpoints were used. Possible alignment of the object model with the image is tested using the computed transformation. Seales and Dyer [138] represent occluding contours of polyhedra as a function of viewpoint, deriving viewpoint constraints associated with each occluding contour feature.
Lowe, in his system SCERPO [112], solves for a perspective transformation that relates the three-dimensional constraints to a single set of image measurements. The plausible group of matches between scene and model features that are straight line segments is searched in the image of an object taken from a single viewpoint. The correspondences are established by refining a given estimate of transformation using an iterative least squares technique. In the alignment approach [87] three pairs of non-collinear points, each pair containing a point in the image and its correspond- ing point on the model, determine the transformation. A model of an object is represented as a combination of a wire frame and local point features (vertices and inflection points). It consists of the three-dimensional locations of the edges of its sur- faces (a wire—frame), and the corresponding corner and inflection features as shown in figure 2.2. In [87], different models of an object from different vieWpoints were used. Possible alignment of the object model with the image is tested using the computed transformation. Seales and Dyer [138] represent occluding contours of polyhedra as a 35 Figure 2.2: Object models with vertices and inflection points as local features. function of viewpoint, deriving viewpoint constraints associated with each occluding contour feature. The transformations are searched by associating the model and the image contours. Chen and Stockman [33] also employ the alignment approach using both the contour and internal edges for recognition of curved objects. Representations using relational structures attempt to capture the structural properties of objects explicitly for ease of recognition. Both scene and object mod- els are described using attributed-relational graphs (ARGs), where each node in the ARG stands for a primitive scene or model feature and the are between a pair of nodes represents a relation between the two features. Matching of a scene ARG with a model ARG is carried out using graph-theoretic matching techniques such as maxi- mal clique detection, sub-graph isomorphism, etc. [7]. Relational representations are attractive as they capture both the structural aspects of the objects and their geo— metrical inter—relationships. However, recognition using relational graphs is difficult because graph matching algorithms are NP-complete. This becomes especially true when scenes contain multiple objects that may be partially occluded. An extension of an ARG is an attributed hypergraph representation (AHR) that contains hyper- edges and hypernodes. The hyperedges and hypernodes are themselves ARGs where each hypervertex is associated with an ARG representing a face and each hyperedge corresponds to a primitive block graph representing a primitive block such as a poly— hedron, a cylinder, or a conical surface. The AHRs are less difficult to match as we can perform hierarchical matching, thus leading to an overall reduction in complex- 36 ity. However, the AHR matching problem is still NP-complete, that can result in exponential time algorithms in the worst case. Recognition schemes using relational graphs have been explored extensively. Brooks [24] organizes image feature relations using a graph and matches scene to a model graph using sub-graph isomorphism. 
An attributed region adjacency graph is constructed from multiple registered object views to form a complete 3D model of an object [166], and matching an input view with the object model graph is performed by identifying a subgraph of the object model. Fan et al. [58] derive relational descriptions of objects in terms of their visible surface patches from dense range data, where nodes characterize the surface patches and the arcs between the nodes specify connectivity and occlusion. The largest subgraph in the model graph that matches the scene graph is found by a depth-first search. Kim and Kak [98] combine a discrete relaxation technique with bipartite graph matching to make the search efficient. Here, scene and model are represented using bipartite graphs, and these graphs are used to establish the compatibility between model and scene surfaces. A hypergraph [174] is used to represent an object in a hierarchical fashion by imposing a grouping on the vertices of a graph in which each vertex corresponds to a surface of the object and each group of vertices to a primitive block of the object. Multiple AHRs obtained from different views of the object are put together to form a complete AHR, which is then compared with the stored model AHRs. Recognition using the primitive blocks can quickly eliminate incorrect matches. Shapiro et al. [141] propose a relational pyramid to represent multiple view classes and to rapidly select the view class that best matches an unknown view of the object.

In the pose clustering approach, also referred to as the generalized Hough transform, evidence is collected for possible transformations (poses) from image-model matches and clustered in the transformation space to select the pose hypothesis with the strongest support. Each scene feature is matched with each possible model feature; matches are then eliminated based on local geometric constraints such as angle and distance measurements. A geometric transformation is computed from each successful match and is stored as a point in the Hough (transformation parameter) space. The Hough space is six-dimensional if we deal with 3D objects with six degrees of freedom, whereas it is three-dimensional for 2D planar objects with three degrees of freedom. Maxima determination or clustering of points in the Hough space results in a globally consistent pose hypothesis for the object present in the scene. Some representative work using pose clustering is found in [150, 104, 142]. Grimson and Huttenlocher [76] analyze the sensitivity of Hough clustering using a statistical "occupancy model". They conclude that the probability of false maxima in the Hough accumulators can be fairly high for scenes with multiple cluttered objects and degraded by occlusion and sensor noise. Some bounds on the likelihood of false peaks in the parameter space as a function of sensor noise, occlusion and quantization effects are also presented.

In geometric hashing, also known as indexing, feature correspondence determination and model database search are replaced by a table look-up mechanism. Invariant features are computed from an image and used as indices into a table containing references to the object models. The pioneering work by Lamdan and Wolfson [105, 106] uses a two-stage methodology: (i) creation of a model hash table and (ii) indexing into the table to match an image.
A model hash table is constructed by first selecting a k-tuple (k = 3 for a 2D object and k = 4 for a 3D object) of model points that forms a basis for a coordinate system into which all other model points are mapped. The mapped model points serve as indices into the hash table, where the corresponding k-tuple that functions as the basis is stored. This is repeated for every k-tuple of model points, thus mapping all model points in a transformation-invariant manner as they are rewritten in terms of each of the reference frames. During recognition, a basis is first created by selecting a k-tuple of sensor points. The remaining scene points are then re-mapped into the coordinate system defined by the basis and used to index into the hash table. Each successful access of the hash table produces a model basis stored in the accessed location, and a counter associated with the retrieved basis is incremented. The model basis with the maximum support is used to compute a rigid transformation from the model to the scene coordinate system. The process is repeated for every possible scene basis definition until the object in the scene is recognized. Structural indexing is a variant of geometric hashing in which invariant structural or geometric features are extracted from an object model. The invariant features are used to compute an indexing feature that can be used to access an object model database organized as an index table. Each set of invariant features is designed to aid in the recovery of the object pose when matched with the corresponding model in the scene. Stein and Medioni [147] and Flynn and Jain [69] have employed structural indexing and geometric hashing for 3D object recognition. Grimson and Huttenlocher have analyzed the sensitivity of geometric hashing to inexact sensor data. They conclude that it performs well in a scene containing a single object and with noise-free and perfect data, while the presence of noise and occlusion results in a significant reduction in performance. Flynn [67] argues for features with high saliency to reduce the accrual of spurious evidence in the entries of the hash table.

The interpretation tree (I.T.) search, or constrained search, is a very popular recognition scheme and has been a subject of active work over the past ten years. An I.T. consists of nodes that represent a potential match between a scene feature and a model feature. During search, a scene feature is paired with a model feature, and thus a node at level n of the tree characterizes a partial interpretation, i.e., the path from the root to a node at level n specifies an assignment of model features to the first n scene features. Instead of searching the tree exhaustively for a complete and consistent interpretation, local geometric constraints such as pairwise angle and distance measurements between features are used to discard or prune inconsistent matches between scene features and model features. A global transformation is computed to determine and verify the pose of the object when a path of sufficient length is found. The control structure of the algorithm takes the form of a sequential hypothesize-and-test with backtracking. The I.T. search has been formulated and well explored by Grimson [80, 78]. A robot vision system, 3D-POLY [30], uses I.T. search to recognize occluded objects in a cluttered scene, and exploits a data structure called the feature sphere that aids in fast retrieval of features for verification.
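A sketch of this constrained search over point features, in which pairwise distance consistency prunes branches as the tree is descended. Using inter-feature distances as the only constraint, and a fixed tolerance, are simplifying assumptions; real systems add angle constraints and a wildcard branch for spurious scene features.

    import math

    def it_search(scene_pts, model_pts, tol=1e-3):
        # Level k of the interpretation tree pairs scene feature k with some
        # model feature; a path survives only while all pairwise distances
        # agree with the model's, which prunes most of the tree.
        def consistent(pairing, s, m):
            return all(abs(math.dist(s, s2) - math.dist(m, m2)) <= tol
                       for s2, m2 in pairing)
        def search(level, pairing):
            if level == len(scene_pts):
                return pairing  # complete interpretation: verify via global pose
            s = scene_pts[level]
            for m in model_pts:
                if m not in (mm for _, mm in pairing) and consistent(pairing, s, m):
                    result = search(level + 1, pairing + [(s, m)])
                    if result is not None:
                        return result
            return None         # dead end: backtrack
        return search(0, [])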
Flynn and Jain [68] and Vayda and Kak [164] have employed constrained search of the interpretation tree for CAD-based object recognition. Both unary and binary geometric constraints are effectively used to prune the incorrect interpretations. Ikeuchi [88, 89] uses an interpretation tree search to classify a scene object into one of the stored aspects of the object models and then estimates the pose of the scene object within this aspect group. CAD models are used to generate the multiple views of the objects, and these are clustered into aspects. Since the task addressed in [88] is bin picking, only one type of object, the same as the model, appears in the scene. Grimson also establishes that under some simplifying assumptions, the I.T. has a search complexity of O(n^2), where n is the number of model and scene features, for single-object scenes, but can become exponential in the worst case for multiple-object scenes. He also showed that if indexing is employed within the constrained search paradigm, along with clustering scene features into subsets likely to have arisen from a single object, then an I.T. based search can reduce the complexity to O(n^3), which is otherwise exponential for multiple-object scenes. If the search is terminated with a "good enough" interpretation [79], then the complexity becomes O(n^4) provided the scene clutter is small, and it is still exponential when the scene clutter becomes very high. A wrong object model can be prevented from being chosen as a matched object by selecting a threshold on the fraction of model features that must be matched, as a function of the number of model and sensor features [77].

Iterative model fitting is used when 3D objects are represented by parametric representations wherein the parameters specify both the shape and the pose of the objects. There is no feature detection and no correspondence determination between model and scene features. Object recognition and pose estimation reduce to estimating the (pose) parameters of the model from the image data, and matching with stored parametric representations. If a sufficient number of image data points is available, then the estimation of parameters can be done by solving a system of over-constrained linear or nonlinear equations for a solution that is best in the minimum-sum-of-squared-error sense. Solina and Bajcsy [144], Gupta et al. [81, 82] and Pentland [125] have modeled 3D objects as superquadrics with local and global deformations for recognition purposes. Deformable superquadric models [157] have been proposed for modeling and tracking non-rigid organs such as hearts and lungs. Implicit equations [129] have been used to fit observed image features to the 3D position and orientation of object models. Pose determination here becomes an iterative fitting of the observed data to the stored implicit equation.

In addition to the popular approaches described above, which can be used to classify a majority of the existing matching strategies, there are a few others that perform matching differently. One of them is a rule-based approach proposed by Jain and Hoffman [93] to recognize 3D objects based on evidence accumulation. Instead of matching object features to scene features, they construct an evidence rule base that stores salient information about surfaces, their morphological and patch attributes and their relational features, along with the evidence of their occurrences in each of the objects in the database.
The rules are used to compare the similarity between scene features and the support features as specified in the evidence rules for each object. Extensions of this work in terms of using more view-independent features, generating rules using a minimum entropy clustering scheme, estimating evidence weights, and matching using a neural network have been proposed by Caelli and Dreier [27]. The other approach views matching as a registration problem [15]: it matches sets of surface data with one another directly, without any surface fitting. In this approach, the distance between two point sets obtained from surfaces is computed and minimized to find the best transformation between the model and scene data. The quality of the alignment between the model and the scene depth data can be used to determine whether or not a model object matches the scene data closely. Table 2.2 presents an overview of the matching strategies.

[Table 2.2: An overview of the matching strategies.]

2.1.4 Difficulties and Challenges

In summary, some of the issues that affect the existing 3D object representation and recognition systems are the following. The first issue is "representational": although several representation schemes have been proposed (see Section 2.1.2), none of them seems to satisfy the conflicting requirements of global shape description and local feature support. The need to handle partially visible objects, which is frequently encountered in practice, dictates the use of local features, but these are ineffective in capturing the complete object shape. They are easy to extract from sensed data but may not be discriminating enough. On the other hand, global representations are more descriptive, but fail to be useful in recognizing partially visible instances of objects in images. They have greater discriminating ability but are difficult to compute with a purely data-driven approach. Most of the representation schemes are limited by the class of shapes that they can describe. In particular, they cannot handle free-form surfaces in a compact and unambiguous manner.

Matching strategies such as Hough clustering compare global features or shapes, and are relatively fast.
However, they are error-prone when there is occlusion. The local feature-based matching schemes can handle occlusion but are computationally expensive. Recognition systems have to be made faster (e.g., by parallelizing both segmentation and matching algorithms) in order to handle large databases. A preferred solution to the problem of building a robust and fast 3D object recognition system is to combine representations of the 3D objects at both numerical and symbolic levels, to describe objects using hierarchical representations, to derive mechanisms to match the hierarchical representations efficiently via indexing, and to design strategies that are general and reduce the search.

2.2 Free-Form Object Recognition

The success of several object recognition systems described in the previous section can be attributed to the restrictions they impose on the geometry of objects. However, there has been a notable lack of systems that can handle arbitrary surfaces with very few restrictive assumptions about their geometric shapes. To our knowledge, there have been only a few systems to date that address the problem of matching general surfaces that may or may not possess easily detectable point or curve features. Besl, in his seminal article [11], pushed forward the idea of recognizing objects containing free-form surfaces as an emerging theme of importance, since objects found in natural environments are arbitrarily shaped, complex and curved. These objects may not be modeled easily using volumetric primitives, and they may not have easily detectable landmark (salient) features such as edges, vertices of polyhedra, vertices of cones and centers of spheres.

The approach that has been commonly adopted for the recognition of free-form objects falls into the class of model-based recognition. For a definition of free-form surfaces, see Section 1.3.2. The free-form surface recognition task is generally formulated as that of establishing correspondences between features detected in the scene and features of similar types previously stored in models of the objects of interest. Since the underlying surfaces can be arbitrarily curved, most approaches attempt to define features of interest that are not constrained by any assumption of analytical forms present in the objects. Based on this formulation, there are several important questions that need to be addressed, regarding (i) the type of image data to be used, (ii) the kind of features to be extracted from the input images, and (iii) the strategy to be used to match image features to model features.

2.2.1 Representation Schemes

Parametric representations are mainly employed in CAD systems to design and analyze free-form surfaces. However, vision researchers have sought to employ both global and local structural descriptions based on local surface curvatures or normals to represent free-form surfaces.

Parametric Representations

Parametric free-form surface representations [11] include:

• Piecewise polynomials (splines) over rectangular domains.
• Piecewise polynomials (splines) over triangular domains.
• Polynomials over rectangular domains.
• Polynomials over triangular domains.
• Non-polynomial functions (e.g., exponentials, sinusoids, etc.) defined over arbitrary domains.

The IGES standard used in CAD representations of free-form surfaces is NURBS (Non-Uniform Rational B-spline Surfaces) [11]. Surfaces represented by NURBS form a superset of commonly used surfaces such as spheres, cylinders, and cones.
A NURBS surface entity of order (m, n) (with degree (m - 1, n - 1)) is given by

\vec{F}(u, v) = \frac{\sum_{i=0}^{N_u - 1} \sum_{j=0}^{N_v - 1} w_{ij} \, B_i^m(u; T_u) \, B_j^n(v; T_v) \, \vec{p}_{ij}}{\sum_{i=0}^{N_u - 1} \sum_{j=0}^{N_v - 1} w_{ij} \, B_i^m(u; T_u) \, B_j^n(v; T_v)},

where the \vec{p}_{ij} are the 3D surface control points, with N_u control points in the u-direction and N_v control points in the v-direction for a total of N_u N_v independent control points; the w_{ij} are the rational weight factors; and B_i^m(u; T_u) is the m-th order B-spline basis function defined on the knot sequence T_u, consisting of a set of K_u = (N_u + m) non-decreasing constants that subdivide the interval [u_{m-1}, u_{N_u}] of evaluation. Use of NURBS has not yet become prevalent in the computer vision community because of the difficulty in obtaining non-proprietary algorithms to fit a NURBS surface to an arbitrary set of 3D points, and also due to the difficulty in matching NURBS-based representations of objects.

Local and Global Geometry-based Representations

Some recent approaches have specifically sought to address the issue of representing sculpted surfaces using local and global geometry. Table 2.3 presents an overview of these approaches.

Table 2.3: Current representation schemes for complex curved objects (scheme | type of descriptor | object domain | sensing modality).

Algebraic polynomials [155, 97, 130] | global | objects described by quartic curves and surfaces | range
Splash and super segments [147] | local | arbitrarily curved objects | range
Simplex angle image [42] | global | objects topologically equivalent to the sphere | range
Registration using point sets [15] | global | arbitrarily curved objects | range
Registration using image contours [107] | global | free-form surfaces | 2D X-ray projections
2D silhouettes with internal edges [33] | global | arbitrarily curved objects | intensity
Triangles and crease angle histograms [13] | global | free-form objects | range
HOT (High Order Tangent) curves [95] | global | arbitrarily curved objects | video sequences
Convex, concave and planar surfaces [93] | local | arbitrarily curved objects | range

Global Representations

Algebraic surfaces [155] are more flexible than quadrics or superquadrics in representing complex curved objects and are used to describe and segment objects in terms of volumetric primitives. Keren et al. [97] describe the use of implicit fourth-degree polynomials to represent arbitrary shapes. Ponce et al. [130] derive geometric constraints, which are also algebraic polynomials, for estimating the pose of the object. Issues such as bounding constraints and convergence of surface fitting need to be investigated thoroughly, as surface approximations using implicit functions are generally less stable and can be more computation-intensive than approximations using parametric forms. Occlusion is a problem with these approaches, since there is no guarantee that the polynomial computed from a partial view of an object is similar to the polynomial computed from its complete view. Ultimately, even these representations are limited in their scope.

Joshi et al. [95] propose a non-parametric representation called HOT (High Order Tangent) curves. It is a collection of the parabolic, flecnodal, limiting and asymptotic bitangent, and tritangent curves that capture the structure of the image contours of an object. Using this representation, viewer-centered image features such as the inflection points and bitangents of the contours are computed and used for recognition and pose estimation. Note that the recognition accuracy can be poor if the inflection points are not localized accurately.
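Since contour inflections recur as features throughout this literature, a sketch of one way they can be localized on a sampled contour: the discrete curvature sign is read off the cross product of successive edge vectors, and a sign change marks an inflection. Noise handling (smoothing, hysteresis) is deliberately omitted here, which is exactly why localization can be poor in practice, as noted above.

    def inflections(contour):
        # contour: ordered (x, y) samples along an image contour.
        # The z-component of the cross product of successive edge vectors
        # carries the sign of the discrete curvature.
        def turn(p, q, r):
            return (q[0] - p[0]) * (r[1] - q[1]) - (q[1] - p[1]) * (r[0] - q[0])
        idx = []
        for i in range(1, len(contour) - 2):
            if turn(contour[i - 1], contour[i], contour[i + 1]) * \
               turn(contour[i], contour[i + 1], contour[i + 2]) < 0:
                idx.append(i + 1)  # approximate inflection between samples i and i+1
        return idx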
The simplex angle image (SAI) [41, 42] is also a global mapping of a curved object, wherein a mesh covering the surface of the object is mapped to a mesh on the unit sphere. Using deformable surfaces [41], free-form objects can be reconstructed using a mesh. A general parametric surface is initialized in the vicinity of observed features, and the surface is deformed under smoothness constraints to recover the object shape. Each mapped node on the unit sphere stores the simplex angle, a measure of the local surface curvature at the corresponding node on the object. The SAI is independent of the translation of the object, and it also preserves the connectivity of the surface patches on the object. The SAIs of an observed scene and a model object are matched by minimizing the sum of the squared differences between the simplex angles at the nodes of the scene and model SAIs under every possible rotation. The SAI representation can be used only for those objects that are topologically equivalent to the sphere. In addition, an appropriate mesh resolution for each object needs to be determined. Curved objects have also been modeled using a set of viewpoint-dependent features such as 2D silhouettes, along with internal edges [33]. The edge map derived from a set of model images is aligned with the scene edge map to estimate the pose of the object in the scene. Segmentation is a critical issue here.

Local Structure-Based Representations

Local structural descriptions of free-form surfaces can be derived from small surface patches directly, without using volumetric primitives or planar, quadric, and superquadric surfaces. The representation using splash and super segments [147] is an example of this approach. A splash is a virtual feature encoding a local Gaussian map, thus characterizing the distribution of surface normals on a surface patch, and a super segment is a group of line segments resulting from approximating the edges present on the surfaces. Figure 2.3 shows an example of a 3D super segment and a splash. Although this scheme can be applied to general surfaces, the representation does not provide a higher-level description of the object. The features derived can be sensitive to noise in the sensed data and also to occlusions, thus affecting the reliability of the matching. A similar approach employs a curvature map, which is essentially a layered structure containing the Gaussian curvature and the mean curvature at each point on a surface; these can be derived from the principal curvatures estimated at surface points [168].

Besl [13] advocates employing triangles as a unifying representation for all kinds of 3D objects, including the very precise curved surfaces used in manufacturing. Since such a description is typically verbose, in order to efficiently match these representations for object recognition he proposes the crease angle histogram, which provides information about the crease angles at the edges between adjacent triangles. The stability of this feature depends on a good triangulation of the surfaces and also on smoothing algorithms that retain the important features of the surface with a minimal number of triangles. Occlusion is a potential problem with crease angle histograms.

Registration

An alternate representational approach that does not require feature detection and correspondence determination is to match surface data directly, without any surface fitting [15].
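A compact sketch of this register-then-score idea in the spirit of iterative closest point methods: alternate between closest-point correspondences and the least-squares rigid transform for them (here the SVD-based fit). The brute-force nearest-neighbor step and the fixed iteration count are simplifications for illustration.

    import numpy as np

    def best_rigid_transform(P, Q):
        # Least-squares rotation R and translation t mapping points P onto Q.
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
        D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflection
        R = Vt.T @ D @ U.T
        return R, cq - R @ cp

    def icp(scene, model, n_iter=30):
        # scene, model: (n, 3) arrays of surface points. Iterate: match each
        # scene point to its closest model point, then re-fit the transform.
        R, t = np.eye(3), np.zeros(3)
        for _ in range(n_iter):
            moved = scene @ R.T + t
            # brute-force nearest neighbors (a k-d tree would be used in practice)
            idx = np.argmin(((moved[:, None, :] - model[None, :, :]) ** 2).sum(-1),
                            axis=1)
            R, t = best_rigid_transform(scene, model[idx])
        return R, t  # residual distances can score the quality of the match

As discussed next, nothing in such an iteration guarantees convergence to the global optimum; the final residual distances are what score a hypothesized match.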
In this approach, the distance between two point sets obtained from surfaces is computed and minimized to find the best transformation between the model and scene data. The main feature of the method is that it avoids any surface segmentation or surface fitting for recognition. It also does not require an explicit correspondence between model and scene features for recognition. This scheme, however, requires specification of a procedure to find the closest point on a geometric entity, such as a curve, to a given point. The main disadvantage is that, just as in any iterative minimization technique, it is not guaranteed to find the global optimum, especially if the scene contains occlusions, if the point densities in the model and the scene are different, or if there are numerous spurious points from different objects. Note that this approach assumes that the object identity is known beforehand, and the primary interest here is to estimate the pose of the object in the input scene. Recent work by Lavalle and Szeliski [107] also studies the problem of matching 3D surfaces using 2D X-ray images as a registration problem, where segmented 3D anatomical structures are matched using two or more contours of the same structure derived from its X-ray projections, using a least-squares minimization technique. The emphasis of this work is on recovering the spatial pose accurately rather than on discriminating or recognizing multiple objects.

In summary, current representation schemes work best for a limited class of 3D objects. The generality of the shape of free-form objects is a major difficulty that is not easily overcome with analytical representations. Edges need not be present or detected reliably on smooth objects in order to be used as landmark features with edge-based representations. These difficulties have motivated us to find other means of capturing the object shape in a general manner.

2.2.2 Recognition Techniques

Approaches that use parametric descriptions tend to match a (scene) surface description extracted from a range image with model (surface) descriptions derived from CAD models of objects in the database. Some of the difficulties in comparing a scene surface description with a model surface description using parametric forms are: (a) it may not be possible to parameterize the scene surface accurately such that the derived parameters can be directly compared with those of the model surface, and (b) the surfaces may not be aligned in 3D space. Hence, it is assumed in these approaches that a one-to-one correspondence between all points on the scene surface and all points on the model surface is not required. Matching is done by computing the "shape distance" between a surface description and a subset of another surface description [11].

Approaches that use local structural descriptions tend to derive invariant descriptions from input scene data, and match these local descriptions with the stored model descriptions. Matching schemes use indexing [147] or search methods employing heuristics to reduce the search complexity [168]. Since these approaches do not utilize CAD-based representations of object models, the range images of objects themselves serve as models. The model descriptions are constructed using range data of different views of the objects. The object may then be rotated arbitrarily and sensed to provide the scene data. Jain and Hoffman [93] presented an evidence-based approach for identifying 3D rigid objects that are arbitrarily shaped.
A rule-based database was developed using evidence conditions based on the presence of distinguished features that represent objects. In their approach, the rules were used to compare the similarity between scene features and the support features specified in the evidence rules. They also developed a learning process for automatically extracting evidence conditions from different views of objects.

2.3 Summary

We have presented a survey of 3D object representation and recognition schemes in this chapter. We also discussed the inadequacy of a number of prevalent representation techniques in handling objects with free-form surfaces. We presented an overview of the current research efforts that specifically attempt to solve the free-form surface matching problem.

Figure 2.3: Examples of super segments and splash, adopted from [147]: (a) a 3D super segment with 4 grouped segments; k1, k2, k3 are the curvature angles and t1 is the torsion angle; (b) a splash with n, the reference normal, ρ, the geodesic radius, p, the location vector, and θ, the angle.

Chapter 3

COSMOS - A New Representation Scheme for Free-Form Objects

In this chapter, we describe a new representation scheme called COSMOS for handling general 3D objects using surface depth data. We reiterate that by the term "general objects" we imply that we make very few restrictive assumptions about their shapes; the objects of interest to us are free-form surfaces such as cars, turbines, human faces, sculptures, etc. However, we do not include statistically defined shapes such as foams, crumpled objects such as fractals, and arborizations such as trees or bushes. Note that the class of 3D free-form objects includes convex and nonconvex smooth surfaces. Since the emphasis of this chapter is on deriving an effective representation of a sculpted object for its recognition, the topic of multi-object segmentation in the scene will not be discussed here. We assume that a range image of a scene is available which contains a view of a single rigid 3D object without occlusion for the purpose of model construction. However, we briefly note here that our object recognition system based on the COSMOS scheme, described in Chapter 5, can handle multiple objects in the scene. Our proposed representation scheme has been designed to aid identification of objects and discrimination between them, and hence it has not been evaluated for other industrial tasks such as object inspection.

Objects can be represented using descriptions at several levels, as shown in Figure 3.1. The kinds of objects that need to be handled and the specific task that has to be carried out by an automatic system would determine the appropriate level of description that needs to be used. If the goal is to render surfaces graphically for visualization, or to reconstruct general surfaces as accurately and in as much detail as possible, then the lowest level of point-based representation might be appropriate. If the task is to create graphical user interfaces, and the object domain is limited to blob-like entities, then we can choose a volumetric representation to describe objects realistically to the extent possible. The task addressed in our system is recognition, and the object domain to be handled is general and not restricted to certain classes of surfaces such as polyhedra or quadrics. Low-level descriptions of free-form surfaces are not stable with respect to viewing directions and are also not robust at handling noise in the image data.
On the other hand, higher levels of description, such as edge-based representations, parametric-form fitting, and volumetric representations, are also not suitable for handling general free-form surfaces. The difficulty is primarily due to the fact that the shape of the objects under consideration can be arbitrary; currently popular parametric forms seem to fit only a restricted class of shapes. Edges need not be present in a general smooth object in order to represent it. Volumetric representations are not appropriate, as it is usually difficult to infer the extent of a 3D object from a single view accurately without making simplifying assumptions to compensate for the unseen part. A parts-based representation is difficult, as we do not know what part primitives should be chosen as building blocks that can help us in recognizing general free-form surfaces. These difficulties motivate us to find descriptions that are based on visible surfaces and also to determine other means of capturing the shapes of objects in a general manner.

3.1 How Do We Describe Rigid 3D Objects?

In order to represent objects in a way that leads to their recognition, one is forced to answer the following question: What is the intrinsic property of an object that can aid us in its recognition?

Figure 3.1: Representing objects: several levels of abstraction. (From the highest level to the lowest: parts-based representation (geons); volumetric representation (ellipsoids, hyperellipsoids); surface-based representation (planar, quadric surface patches); contour-based representation (silhouettes, jump and crease edges); salient-point-based representation (vertices, corners, high/low curvature locations); 3D point-based representation (depth, normal, curvature values).)

Humans seem to disambiguate objects in terms of their shapes. Webster's Dictionary defines shape as follows: shape n 1a: the visible makeup characteristic of a particular item or kind of item 1b1: spatial form 1b2: a standard or universally recognized spatial form 2: the appearance of the body as distinguished from that of the face : FIGURE 3a: PHANTOM, APPARITION 3b: assumed appearance : GUISE 4: form of embodiment 5: a mode of existence or form of being having [sic] identifying features 6: something having a particular form 7: the condition in which someone or something exists at a particular time.

The shape is the visible or perceived form of an object that differentiates a cylinder from a sphere, a vase from a bottle. The shape of a rigid object does not depend on its position and orientation in space. It is purely a geometric structure that does not depend on isometries [102]. The notion of shape is also independent of scale; two spherical bowls of large and small sizes are both usually described as spherical-shaped. Thus, shape seems to be one of the major criteria that characterize the geometry or structure of an object, and it seems to play a conspicuous and important role in differentiating one object from another. Note that since many naturally occurring objects (e.g., human faces, terrain, etc.) are not single-shaped but made of regions of distinguished shapes that often merge with one another in a smooth fashion, even the shapes of the local regions in an object can characterize it.
An object can be partitioned into a number of different shapes, and this division is often done in a local manner; i.e., we not only describe objects as being entirely cylindrical or spherical, but also sometimes as containing a planar face on the top, a large convex patch on the bottom, and so on. Thus, we are able to capture the global shape if the object has a single, easily describable and distinguished shape, or a pattern of local shapes if many easily perceived shapes are present in an object.

3.1.1 Local Surface Attributes

Shape alone does not constitute a complete description of an object. What differentiates between a soccer ball (a sphere of large radius) and a cricket ball (a sphere of smaller radius) is the curvedness (scale), or the amount of curvature in an object. We should also be able to characterize how curved an object is. A fast-curving convex object and a gently curving convex object are often perceived as distinct shapes. Attneave [4] points out how humans concentrate on points of high and low curvature on a surface for recognizing it. With 2D line drawings of objects, an indication of how curved each portion of a line segment is provides us clues about what part it can possibly constitute in the overall shape of an object. A sharply curved protrusion of an object captures our attention before the rest of the object. Thus, a characterization of the curvedness of a surface is also an important factor that aids in its recognition.

The next crucial characterization of an object is how a surface is oriented in three-dimensional space. This helps us to distinguish between a bowl resting on its curved sides and a bowl resting on its flat side. The experiments conducted by Pinker and Tarr [154] demonstrate how humans are able to rotate objects mentally in order to recognize them. These experimental results provide a deeper insight into the representation problem. Unless a representation scheme characterizes how objects or their sub-parts are oriented with respect to one another in three-dimensional space, it is difficult to perform mental rotations of the stored representations to recognize an input in a new orientation. Often, in our attempts to describe objects, we resort to saying, "assume that the vase is vertical, the bottle is horizontal, lying on its side," and so on. The orientation of a point, a region, or the entire object seems to be intertwined with its description. In addition to the orientation information, the extent or the spread of the object in 3D space is usually included in its representation, characterizing how big it is. This can be measured quantitatively as the area occupied by an object.

Note that although the curvedness, or the average amount of curvature in a region of an object, captures the scale of the object, both area and curvedness are needed to describe a region, because area only provides the extent of the region in 3D space, whereas curvedness characterizes the curvature of the entire region. It is possible to imagine a long and highly curved, truncated (by planes on both ends) cylinder having the same area as that of a short and less curved, truncated cylinder. What disambiguates these two objects is the amount of curvedness present in each of them.

Shape, curvedness and area are intrinsic quantities that do not change with re-parameterizations of an object or with changes in coordinate (rigid) transformations of the object.
However, the orientation of an object or of regions present on the object can change with coordinate transformations. It gives us a sense of how the object is localized in space, and it should therefore be sensitive to transformations of the coordinate system.

3.1.2 Combining Local and Global Descriptions

The next important question is whether these local descriptors, which capture (a) the type of surface, (b) how curved it is, (c) how big it is, and (d) how it is oriented in three-dimensional space, are enough to discriminate between objects. In other words, do these local descriptors suffice to uniquely represent all kinds of surfaces, convex and nonconvex? The answer is in the negative, for the following reason: Consider a nonconvex object O that has two convex regions R1 and R2 on it that are rigid translations of one another. An example of such an object is shown in Figure 3.2. The two regions R1 and R2 are identical; they have the same shape (convex), same curvedness, same area and same orientation.

Figure 3.2: An example of a nonconvex object.

Create a new object O1 by distorting a small surface patch in only R1; the distortion can be as simple as multiplying the curvatures of the points in the patch by a constant. Now perform the same distortion in R2 ∈ O and call the resulting object O2. The representations of O1 and O2 are identical in terms of their local descriptors alone (shape, curvedness, area, and orientation); however, O1 and O2 are dissimilar to each other, as the distortions are in two different regions. Unless we encode adjacency information explicitly in object O1 (for example, the region containing the distorted portion R1 is on the right of the undistorted region R2), it is difficult to characterize the two objects uniquely. Adjacencies of the regions in an object specify how they are present together in the object, and clearly different arrangements of regions can result in many dissimilar objects. This motivates the necessity of characterizing the connectivity of regions or patches explicitly, along with local descriptions, if one is interested in designing a representation to discriminate objects. Note that maintaining connectivity information is tantamount to characterizing the global structure of an object in a sense.

In the above discussion, it is essential to note that the descriptors we discussed are applicable to all classes of objects. A free-form surface can thus easily be described using the above quantities. However, we do not claim that these descriptors form a complete set. Compactness of a representation is also vital in designing an automatic recognition system. Hence, it is important how the geometric information present in the different attributes is specified; simple objects should have simple representations, while complex objects are represented in as much detail as required.

3.2 COSMOS - A New Representation Scheme

The novelty of our scheme [45] lies in its description of an object as a smooth composition or arrangement of regions of arbitrary shapes that can be detected regardless of the complexity of the object.

3.2.1 Definitions

Each of the local and global attributes used in the COSMOS scheme captures a specific geometric aspect of the object and is defined using differential geometry-based concepts such as shape index, curvedness, and surface normals. A general parametric form of a surface S with respect to a known coordinate system is given by

$$ S : \left\{ \vec{x} \in \mathbb{R}^3 : \vec{x} = \begin{pmatrix} x(u,v) \\ y(u,v) \\ z(u,v) \end{pmatrix}, \ (u,v) \in \Omega \subseteq \mathbb{R}^2 \right\}. \qquad (3.1) $$
We assume without any loss of generality that the surface S can be adequately modeled as being at least piecewise smooth, i.e., it contains smooth surface patches separated by discontinuities in depth and orientation, where the extent of smoothness depends on the type of attributes that need to be made explicit at a point P on the surface S. Most free-form surfaces are smooth. Differential geometry [116] is used for describing the local behavior of a surface in a small neighborhood. To represent orientation and curvature, for example, the functions used to represent S must at least be of class $C^2$ (twice differentiable). Then the first and second fundamental forms [116] of a surface are well-defined on these patches. These fundamental forms are used to define the geometric quantities that are of interest to us.

Shape Index

A quantitative measure of the shape of a surface at a point p, called the shape index $S_I$, is defined as

$$ S_I(p) = \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \left( \frac{\kappa_1(p) + \kappa_2(p)}{\kappa_1(p) - \kappa_2(p)} \right), \qquad (3.2) $$

where $\kappa_1$ and $\kappa_2$ are the principal curvatures of the surface, with $\kappa_1 \geq \kappa_2$. Note that in Koenderink's definition [101] of the shape index, its range is $[-1, 1]$, whereas with our definition all shapes can be mapped onto the interval $S_I = [0, 1]$ and conveniently assigned non-negative values, allowing aggregation of surface patches based on their shapes. The shape index values need to be non-negative in our formulation because we use them in the definition of the shape spectral functions of surface patches. These spectral functions are used in the aggregation of the attributes of surface patches in terms of their shape index values, as shown in Section 3.2.4. With mirror shapes such as the spherical cap and the cup, whose $S_I$ values are +1 and -1 with Koenderink's definition, shape index values interact during attribute aggregation, and local information about each shape category is not maintained distinctly. Hence, it is necessary to redefine the shape index to take only non-negative values.

Every distinct surface shape corresponds to a unique value of $S_I$, excepting the planar shape. Note that a planar surface has an indeterminate shape index, as $\kappa_1 = \kappa_2 = 0$. For computational purposes in our implementation, a symbolic label (or sometimes, a shape index value of 2.0) is used to indicate surface planarity. The umbilic points on the object ($\kappa_1 = \kappa_2 \neq 0$) are not affected by the permutation of the principal curvatures, i.e., by a rotation of the shape by ninety degrees. The endpoints of the $S_I$ scale represent the concave ($\kappa_1 = \kappa_2 = -a$, where $a$ is some positive constant) and convex umbilics ($\kappa_1 = \kappa_2 = b$, where $b$ is some positive constant). All convex shapes have values of $S_I$ greater than 0.5 and concave shapes have values less than 0.5. This also generalizes the common concave-convex surface classification to anticlastic patches (surfaces that have concave and convex curvatures that are opposite, for example, a saddle surface, where the surface lies on both sides of its tangent plane); their shape indices lie in [0.25, 0.75]. The symmetrical saddle shape with $S_I = 0.5$ is neither convex nor concave. Nine well-known shape categories and their locations on the shape index scale are shown in Figure 3.3. The representative shapes from each category are graphically illustrated in Figure 3.4.

Figure 3.3: Nine well-known shape types and their locations on the $S_I$ scale: spherical cup [0, 0.0625), trough [0.0625, 0.1875), rut [0.1875, 0.3125), saddle rut [0.3125, 0.4375), saddle [0.4375, 0.5625), saddle ridge [0.5625, 0.6875), ridge [0.6875, 0.8125), dome [0.8125, 0.9375), and spherical cap [0.9375, 1].
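To make the definition concrete, the short sketch below computes shape index values from arrays of principal curvatures. It is a minimal sketch under the conventions stated above, not the system's actual code: the symbolic planar label of 2.0 is handled explicitly, and the curvature sign convention is whatever the curvature estimator produces.

```python
import numpy as np

def shape_index(kappa1, kappa2, planar_label=2.0, eps=1e-8):
    """Shape index S_I of equation (3.2), mapped onto [0, 1].

    kappa1, kappa2 : principal curvatures, with kappa1 >= kappa2.
    Planar points (both curvatures ~ 0) receive the symbolic label 2.0,
    since the shape index is indeterminate there.
    """
    k1 = np.asarray(kappa1, dtype=float)
    k2 = np.asarray(kappa2, dtype=float)
    planar = (np.abs(k1) < eps) & (np.abs(k2) < eps)
    # arctan2 handles the umbilic case kappa1 == kappa2 (zero denominator).
    si = 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    return np.where(planar, planar_label, si)
```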
The shape index captures the intuitive notion of the 'local' shape of a surface. The shape classification into eight basic surface types based on the signs of the Gaussian and mean curvatures, which was employed by Besl [10] for surface segmentation, can be carried out within our framework by quantizing the continuous $S_I$ scale into eight categories. Since Gaussian curvature is intrinsic to a surface, bending a surface without stretching preserves the Gaussian curvature, although the "shape" is modified by this action. Any operation except an isometry or similarity (change of scale) may destroy the shape, and therefore the Gaussian and mean curvatures are less useful for characterizing the notion of "extrinsic shape" [101]. The shape index provides a more continuous gradation between salient shapes such as convex, saddle and concave, and thus has a large vocabulary to describe subtle shape variations very well.

Figure 3.4: Nine representative shapes on the $S_I$ scale: spherical cap (1.0), dome (0.875), ridge (0.75), saddle ridge (0.625), saddle (0.5), saddle rut (0.375), rut (0.25), trough (0.125), and spherical cup (0.0).

Curvedness

The shape of a rigid object is not only independent of its position and orientation in space, but also independent of its scale. In order to capture the scale differences between objects (e.g., a soccer ball and a cricket ball), we use the curvedness, or the amount of curvature in a region. The curvedness [102] of a surface at a point p is defined as

$$ R(p) = \sqrt{\frac{\kappa_1^2(p) + \kappa_2^2(p)}{2}}. \qquad (3.3) $$

It is a measure of the scale of the surface, and its dimension is that of the reciprocal of length. A unit sphere has unit curvedness, and so does a unit saddle surface ($|\kappa_1| = |\kappa_2| = 1$). A unit cylinder has $R = 1/\sqrt{2}$ ($\kappa_1 = 1$ and $\kappa_2 = 0$). The curvedness becomes zero only for planar patches, unlike the Gaussian curvature, which vanishes on parabolic surfaces (e.g., the cylindrical ridge and rut) although these surfaces appear definitely curved to humans. We can observe that as the curvatures of a surface tend to 0, it becomes planar. Surfaces that are identically shaped may have different amounts of curvedness. For example, bell-shaped objects with different widths have the same shape index but different values of R. The same is true for two spheres of differing radii. Table 3.1 presents the shape index and curvedness values of different objects, such as the spheres, ridge and rut surfaces shown in Figure 3.5.

Figure 3.5: Simple surfaces with different shape index and curvedness values: (a) spherical cap of radius r1; (b) spherical cap of radius r2; (c) ridge with cross-sectional radius r; (d) rut surface with cross-sectional radius r.

Relationship between $S_I$-$R$ and $\kappa_1$-$\kappa_2$

The new parameters $(S_I, R)$ can be viewed as polar coordinates in the $(\kappa_1, \kappa_2)$-plane, with planar points mapped to the origin. The direction indicates the surface shape, whereas the distance from the origin captures the size. All shapes, except for the plane, can be mapped to the unit circle whose center is at the origin (0, 0) of the $(\kappa_1, \kappa_2)$-plane, as shown in Figure 3.6. The unit circle contains shapes with unit curvedness. Rays through the origin contain identical shapes that merely differ in their curvedness. The shape index conveniently shows "which" shape a surface has, and the curvedness indicates "how much" of that shape is present.

Table 3.1: Shape index and curvedness values of the surfaces shown in Figure 3.5.
| Object | Shape index, $S_I$ | Curvedness, R |
| Spherical cap of radius r1 | 1.0 | $1/r_1$ |
| Spherical cap of radius r2 | 1.0 | $1/r_2$ |
| Ridge surface with cross-sectional radius r | 0.75 | $1/(\sqrt{2}\,r)$ |
| Rut surface with cross-sectional radius r | 0.25 | $1/(\sqrt{2}\,r)$ |

Figure 3.6: Shape index ($S_I$) and curvedness (R) in the $(\kappa_1, \kappa_2)$-plane.

Koenderink [101] provides several good reasons for using the shape index for surface classification over those schemes based on the $(\kappa_1, \kappa_2)$-plane: Parameterizing 'shape' by the two-sided rays of the $(\kappa_1, \kappa_2)$-plane provides the shape space with the topology of a projective line. In this case, we cannot distinguish inside from outside. Nor can the half-rays at the origin be used; this would lead to a topology such as that of a unit circle, and the same shape would appear twice in shape space. Rather, the space of shapes has the topology of a one-dimensional disc or a 'linear segment', and this is clearly brought out by the definition of the shape index. Koenderink states, "...in the majority of applications dealing in one way or another with the perception of form, the shape index is the most valuable measure and the curvedness comes next" [102].

Since $(S_I, R)$ are polar coordinates in the $(\kappa_1, \kappa_2)$-plane, $\kappa_1$ and $\kappa_2$ can be recovered from $S_I$ and R in the usual manner. The principal curvature $\kappa_1$ can be derived from $S_I$ and R as

$$ \kappa_1 = R\sqrt{1 + \sin(2\pi S_I)}, \qquad (3.4) $$

and $\kappa_2$ as

$$ \kappa_2 = R\sqrt{1 - \sin(2\pi S_I)}. \qquad (3.5) $$

Thus it can be seen that no loss of information occurs when one goes from the $(\kappa_1, \kappa_2)$ representation to the $(S_I, R)$ space. What we have gained is the decoupling of the shape (quality) of a surface from its scale (quantity). Qualitative information about the surface is provided by the shape index, whereas the curvedness provides the quantitative characterization. Besl [12] also discusses four different sets of functions based on the surface curvatures, and these include a polar transformation of the $\kappa_1$ and $\kappa_2$ plane akin to $(S_I, R)$. In his formulation, the functions $(\rho, \psi)$ are used, where $\rho$ measures the total "bending energy" of the surface in both curvature directions and $\psi$ is the angle in the principal curvature plane.

Continuity of Surface Shapes and $S_I$ Values

A measure like the shape index is better than classical measures for identifying visually meaningful local shape features, since the shape index scale maps continuously to the shape space, i.e., the local neighborhood relations are preserved. This is illustrated in Figure 3.7. As the shape index value varies continuously over $0.75 \leq S_I \leq 1$, the local shapes of surfaces become ellipsoidal; they start from the ridge shape, where one of the curvatures is zero and the other is positive ($S_I = 0.75$), turn ellipsoidal (the curvature parameter that was zero becomes slowly positive), and tend towards the spherical shape as $S_I$ approaches unity (where both curvatures become equal and positive). As the shape index changes from 0.75 towards 0.5, the local surfaces achieve a saddle shape, with the curvature parameter that was zero taking on more and more negative values. Similarly, the continuity of surface shapes from saddle to spherical cup is demonstrated in Figure 3.8. Even if the scale of the object changes, the shape of the object surface remains the same, and so does its shape index value. Figures 3.9 and 3.10 show the same surfaces present in Figures 3.7 and 3.8, but at a smaller scale.
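As an illustrative check of equations (3.3)-(3.5), and continuing the hedged sketch above, curvedness and the recovery of the principal curvature magnitudes from $(S_I, R)$ are direct to code:

```python
import numpy as np

def curvedness(k1, k2):
    """Curvedness R of equation (3.3)."""
    return np.sqrt((k1**2 + k2**2) / 2.0)

def curvatures_from_si_r(si, r):
    """Recover the principal curvatures from (S_I, R), equations (3.4)-(3.5).
    Note the square roots return magnitudes; signs follow the convention
    underlying the shape index definition."""
    k1 = r * np.sqrt(1.0 + np.sin(2.0 * np.pi * si))
    k2 = r * np.sqrt(1.0 - np.sin(2.0 * np.pi * si))
    return k1, k2

# A unit cylinder (kappa1 = 1, kappa2 = 0) has R = 1/sqrt(2), and a unit
# sphere has R = 1, as stated in the text.
print(curvedness(1.0, 0.0))   # 0.7071...
print(curvedness(1.0, 1.0))   # 1.0
```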
Shape Spectral Function

We define the shape spectral function as $S_S : \mathcal{P}_O \rightarrow \mathcal{C}$, where $\mathcal{P}_O$ is a set of surface patches and $\mathcal{C}$ is the space of complex-valued functions, such that

$$ S_S(P) = e^{jtS_I(P)}, \quad P \in \mathcal{P}_O, \qquad (3.6) $$

where $S_I(P)$ is the shape index of the patch P and t is a parameter that allows us to study the shape properties of objects in a suitable transform domain (see Section 3.2.3), just as time-varying waveforms can be studied conveniently in the Fourier transform domain. It is used for the aggregation of the geometric attributes of surface patches when multiple patches of identical shapes are found on different parts of an object (a realistic situation in the cases of nonconvex and symmetrical objects) and need to be summarized without any loss of information. It thus provides a way to qualitatively characterize "which shape categories are present and how much of each shape category is present in an object." This characterization is developed in Section 3.2.3.

Figure 3.7: Continuous spectrum of surface shapes and their shape index values, ranging from spherical cap to saddle: (a) $S_I$ = 1.0; (b) 0.96; (c) 0.92; (d) 0.87; (e) 0.81; (f) 0.78; (g) 0.75; (h) 0.72; (i) 0.69; (j) 0.63; (k) 0.58; (l) 0.54; (m) 0.5.

Figure 3.8: Continuous spectrum of surface shapes and their shape index values, ranging from saddle to spherical cup: (a) $S_I$ = 0.5; (b) 0.46; (c) 0.42; (d) 0.37; (e) 0.31; (f) 0.28; (g) 0.25; (h) 0.22; (i) 0.19; (j) 0.13; (k) 0.08; (l) 0.04; (m) 0.

Figure 3.9: Surface shapes from spherical cap to saddle when the object scale changes: (a) $S_I$ = 1.0; (b) 0.96; (c) 0.92; (d) 0.87; (e) 0.81; (f) 0.75; (g) 0.69; (h) 0.63; (i) 0.58; (j) 0.54; (k) 0.5.

Figure 3.10: Surface shapes from saddle to spherical cup when the object scale changes: (a) $S_I$ = 0.5; (b) 0.46; (c) 0.42; (d) 0.37; (e) 0.31; (f) 0.25; (g) 0.19; (h) 0.13; (i) 0.08; (j) 0.04; (k) 0.0.

Other Geometric Attributes

COSMOS also characterizes how objects and their sub-parts are oriented with respect to one another in three-dimensional space, in terms of their average surface normals. The extent or spread of the object in 3D space is also encoded quantitatively in our representation by the surface area occupied by the object. Along with these local descriptors, our representation scheme encodes the relative arrangement of the surface patches on the object by their adjacencies. Note that maintaining connectivity information is tantamount to characterizing the global structure of an object in a local manner. These descriptors are applicable to arbitrarily curved objects, as they are quite expressive and do not depend on the presence of any analytical primitives. Since the compactness of a representation scheme is crucial to the performance of object recognition systems, it is equally important to determine how the geometric information present in the various surface attributes can be specified together, such that the complexity of the representation reflects the complexity of the shapes present on the object. Our representation scheme composes the attributes together in a novel way to characterize an object compactly. It treats both convex and nonconvex objects alike.
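As a small hypothetical illustration of how the shape spectral function of equation (3.6) aggregates patches by shape, the sketch below sums area-weighted spectral terms for a few invented patches; patches with equal shape index reinforce the same complex exponential, while distinct shapes remain separable in the parameter t:

```python
import numpy as np

def shape_spectral(si, t):
    """S_S(P) = exp(j * t * S_I(P)), equation (3.6)."""
    return np.exp(1j * t * np.asarray(si))

# Three hypothetical patches: two ridges (S_I = 0.75) and one rut (S_I = 0.25).
areas = np.array([2.0, 1.0, 4.0])
sis   = np.array([0.75, 0.75, 0.25])

t = 3.0  # any fixed value of the transform parameter
g = np.sum(areas * shape_spectral(sis, t))
# The two ridge patches contribute (2 + 1) * exp(j*t*0.75); the rut patch
# contributes 4 * exp(j*t*0.25): areas aggregate only within a shape class.
print(g, 3.0 * np.exp(1j * t * 0.75) + 4.0 * np.exp(1j * t * 0.25))
```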
3.2.2 Definition of the COSMOS of a 3D Object

The fundamental component of the COSMOS representation of an object is the description of the object in terms of surface patches of different shapes. We denote a maximal patch of constant shape index in an object O by the name CSMP (Constant-Shape Maximal Patch).

Definition: A CSMP is a maximally sized surface patch $P \subseteq O$ on the object such that $\forall p, \forall q \in P$: (i) $S_I(p) = S_I(q)$, and (ii) $\exists$ a path from p to q consisting of points $r \in P$ such that $S_I(r) = S_I(p) = S_I(q)$.

The second condition imposes connectedness of the points in P. An object can now be compactly described in terms of the CSMPs that are present on it. For example, a spherical surface has a single CSMP of spherical cap (convex) shape, whereas a truncated cylinder bounded by hemispherical caps at its ends has three CSMPs, one with cylindrical ridge shape and the other two of spherical cap shape. An n-faced convex polyhedron would contain n CSMPs of planar shape, separated by edges of surface normal discontinuity (the shape index is indeterminate for these edges).

In theory, edges and vertices would form their own patches (CSMPs) in COSMOS. An edge has infinite curvature in one direction (between the faces it separates) and finite curvature in the other direction (along the edge). As a special case, in a polyhedron an edge can be viewed as the limit of a cylinder whose radius approaches zero (implying that the curvedness has approached infinity while the shape index has stayed constant). Thus each edge forms a CSMP of its own (recall that planar faces have an indeterminate shape index, which is representationally distinct from the shape index of a cylinder) and would separate each face. Vertices would not aggregate into either edges or faces, since they have a different shape index; they can usually be viewed as the limiting case of a cup or cap with curvedness approaching infinity. Therefore, a polyhedron theoretically gets segmented into its individual faces in COSMOS. Some of the issues that need to be addressed in practice are discussed in Section 3.5.2.

In a digital implementation of the representation scheme, with the surface depth data of an object obtained using a laser range scanner that produces data on an x-y grid, a CSMP would be computed as a region containing surface points (pixels) whose shape indices are the same and which are eight-connected with one another. With other types of surface data, the connectedness can be suitably defined in order to determine the CSMPs on the object. The shape index of a CSMP is the same as that of the surface points contained within it. Note that when the shape index varies at each surface point on the object, each CSMP contains a single point instead of a set of points. Figure 3.11 shows the constant-shape maximal patches detected in a range image of a view of a vase, obtained using the segmentation technique described in Section 3.5.1.

Figure 3.11: Maximal patches of constant shape index (colors indicate different shape index values): (a) range image of a vase; (b) CSMPs detected on the vase.

The COSMOS representation of an object segmented into a set of homogeneous patches (CSMPs) is composed of the two sets of functions discussed below: the Gauss patch map and surface connectivity list $(G_O, V)$, and the support functions $(G_1, G_2)$. The first set captures the orientation and connectivity information of the object, and the latter captures salient local surface information of the CSMPs.
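A plausible discrete realization of the CSMP criterion on a range-image grid is a connected-component labeling of a quantized shape index image under eight-connectivity, as the text describes. The sketch below uses scipy for the labeling; it is only an approximation of the segmentation procedure of Section 3.5.1, and the bin count is an assumed tuning parameter:

```python
import numpy as np
from scipy import ndimage

def extract_csmps(si_image, mask, n_bins=16):
    """Label CSMP-like regions: 8-connected pixels with equal (binned) S_I.

    si_image : 2D array of per-pixel shape index values.
    mask     : 2D boolean array, True on object pixels.
    Returns a label image (0 = background) and the number of patches.
    """
    # Quantize the continuous shape index so that 'equal' is meaningful on
    # noisy data. Pixels carrying the planar label 2.0 are clipped into the
    # top bin here; a real implementation would treat planarity separately.
    binned = np.clip((si_image * n_bins).astype(int), 0, n_bins - 1)
    eight = np.ones((3, 3), dtype=int)        # 8-connectivity structure
    labels = np.zeros(si_image.shape, dtype=int)
    n_total = 0
    for b in np.unique(binned[mask]):
        comp, n = ndimage.label((binned == b) & mask, structure=eight)
        labels[comp > 0] = comp[comp > 0] + n_total
        n_total += n
    return labels, n_total
```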
Given an object O segmented into a set of patches $\mathcal{P}_O$ according to the CSMP criterion, we define the Gauss patch map as a function $G_O : \mathcal{P}_O \rightarrow S^2$ into the unit sphere ($S^2$) such that

$$ G_O(P) = \frac{\int\int_{p \in P} \vec{n}(p)\, dO}{\int\int_{p \in P} dO}, \qquad (3.7) $$

where $\vec{n}(p)$ is the normal at $p \in P$, dO is the differential area element in P, and $P \in \mathcal{P}_O$. The integral $\int\int_{p \in P} dO$ denotes the total surface area of the CSMP P. $G_O(P)$ is the average surface normal over P. For example, in a computer implementation with discretely sampled object surface data, this will be computed as

$$ G_O(P) = \begin{cases} \vec{n}(p) & \text{for } P = \{p\}, \\ \frac{1}{N} \sum_{i=1}^{N} \vec{n}(p_i) & \text{for } P = \{p_1, \ldots, p_N\}. \end{cases} $$

When we refer to the "unit sphere" we also include an extra point, the center of the sphere, to which we map zero normals. Thus $G_O$ maps each CSMP, P, on O to a point on the unit sphere whose normal corresponds to the orientation (mean surface normal) of the patch P. Viewed inversely, the Gauss patch map associates each point $s \in S^2$ on the unit sphere with a (possibly empty) set of patches at s,

$$ G_O^{-1}(s) = \{P \in \mathcal{P}_O \mid G_O(P) = s\}. \qquad (3.8) $$

As a notational convenience, we generalize this and define the inverse Gauss patch map of a region of the unit sphere $S \subseteq S^2$ as

$$ G_O^{-1}[S] = \{P \in \mathcal{P}_O \mid \exists s \in S,\ G_O(P) = s\}. \qquad (3.9) $$

Figure 3.12 shows a telephone handset, the CSMPs on the object, and their spherical mapping as given by $G_O$.

Figure 3.12: Example of a 3D free-form object and its spherical mapping $G_O(P)$.

The surface connectivity list $V : \mathcal{P}_O \rightarrow 2^{\mathcal{P}_O}$ is defined as

$$ V(P) = \{Q \in \mathcal{P}_O \mid Q \neq P,\ \forall(\delta > 0)\ \exists(p \in P, q \in Q)\ \|p - q\| < \delta\}, \qquad (3.10) $$

where $2^{\mathcal{P}_O}$ is the power set of $\mathcal{P}_O$ and $\|p - q\|$ is the Euclidean distance between the points p and q. That is, V associates each patch $P \in \mathcal{P}_O$ with the set of patches $V(P) = \{Q\} \subseteq \mathcal{P}_O$ that are adjacent to P, and thus represents connectivity information about the segmented object. It can be seen that the traditional region adjacency graph data structure can easily be abstracted from the set of CSMPs, $\mathcal{P}_O$, and the surface connectivity list V of an object, where each CSMP P on the object serves as a node in the graph and V(P) provides information about the edges (or the connectivity) that link the nodes in the region adjacency graph.

The orientation of the patches determined by $G_O$ associates the CSMPs with points on the unit sphere. The mapping $G_O$ for any given convex object is one-to-one, since surface normals are unique on a convex object and no two CSMPs on the object will have the same surface orientation. However, in the case of a nonconvex object, it is possible to have identical surface normals at multiple points on the object's surface (see Figure 3.12). So, several CSMPs may map to the same point on the unit sphere, leading to multiple folds on the sphere. The surface connectivity list V maintains the connectivity information of the patches by keeping a list of adjacent patches for each patch on the object, and thus identifies each fold with unique information that is in many cases useful for a coarse approximation of the object's surface, even when full recoverability is not assured. Furthermore, observe that for symmetrically closed smooth objects containing a single CSMP, the Gauss patch map of the object is a single point on the unit sphere, its center. In a practical implementation, this mapping may turn out to be unstable, depending on how the object is discretely sampled and also on the amount of noise that may be present in computing the surface normal.
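Given per-pixel unit normals and a CSMP label image such as the one sketched earlier, the discrete forms of the Gauss patch map (equation 3.7) and of the connectivity list (equation 3.10) reduce to simple aggregations. The following is a hypothetical sketch, assuming grid adjacency as a stand-in for the limit-based adjacency of equation (3.10):

```python
import numpy as np

def gauss_patch_map(normals, labels):
    """Average surface normal per patch (equation 3.7, discrete form).

    normals : (H, W, 3) array of unit normals; labels : (H, W) ints, 0 = background.
    Returns {patch_id: mean normal}. A zero mean maps to the sphere center.
    """
    return {p: normals[labels == p].mean(axis=0)
            for p in np.unique(labels) if p != 0}

def connectivity_list(labels):
    """Adjacency V(P): patch pairs whose pixels touch on the grid."""
    V = {}
    H, W = labels.shape
    # Compare each pixel with its right and lower neighbors.
    for sa, sb in (((slice(None), slice(0, W - 1)), (slice(None), slice(1, W))),
                   ((slice(0, H - 1), slice(None)), (slice(1, H), slice(None)))):
        a, b = labels[sa], labels[sb]
        m = (a != b) & (a > 0) & (b > 0)
        for x, y in zip(a[m], b[m]):
            V.setdefault(int(x), set()).add(int(y))
            V.setdefault(int(y), set()).add(int(x))
    return V
```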
However, as will be shown in Section 3.5, we have adopted an object model which is a collection of views of an object, and hence only 2D views (appearances) of objects are dealt with in our system. Since 2D object views need not provide complete information on the object (e.g., the back of the object may not be visible in some views), we are not likely to encounter situations where all the surface normals sum to zero, and thus the problem of possible instability of the Gauss patch map of objects subject to sampling is avoided in our implementation.

While the domain of the first set of functions was $\mathcal{P}_O$, i.e., $G_O$ and V are defined over the object itself, the second set of functions is defined over the unit sphere $S^2$. In essence, these functions summarize at each point in $S^2$ the local information about all the patches that have been mapped by $G_O$ to that point. Let $s \in S^2$ be a point on the unit sphere and dS(s) be a small neighborhood of s on $S^2$. Several CSMPs from different parts of the object O may have been mapped to s by $G_O$, i.e., the Gauss patch map may have several folds over s. Then $G_O^{-1}(s)$ is the set of patches from O mapped at s, and $G_O^{-1}[dS]$ is the inverse Gauss patch map of dS, i.e., the set of all patches of O that map to points within dS. For any patch $P \in G_O^{-1}(s)$, let us denote the restriction of $G_O^{-1}[dS]$ to neighbors of P as

$$ G_O^{-1}[dS]\big|_P = G_O^{-1}[dS] \cap V(P). \qquad (3.11) $$

This restriction, $G_O^{-1}[dS]|_P$, defines a single continuous region of O (since all the patches in $G_O^{-1}[dS]|_P$ are neighbors of P, they are joined into one large patch whose image is entirely within dS),

$$ dO\big|_P = \bigcup_{Q \in G_O^{-1}[dS]|_P} Q, \qquad (3.12) $$

corresponding to a single fold over s. We then define the patched Gaussian object density $D(s \mid P)$ on the single fold corresponding to P as

$$ D(s \mid P) = \lim_{dS \to 0} \frac{dO|_P}{dS}. \qquad (3.13) $$

Note that $D(s \mid P)$ becomes equal to the inverse of the Gaussian curvature of a point on the object when the patch P consists of a single point. When the patch is of finite nonzero area, $D(s \mid P)$ can be written as the product of the surface area of the patch and a Dirac delta function whose spike is at s (see Section 3.2.3 for a definition of the Dirac delta function), as explained later. We define the two support functions, $G_1 : S^2 \rightarrow \mathcal{C}$ and $G_2 : S^2 \rightarrow \mathcal{C}$, where $\mathcal{C}$ is the space of complex-valued functions, such that

$$ G_1(s,t) = \sum_{P \in G_O^{-1}(s)} D(s \mid P)\, S_S(P) \qquad (3.14) $$

and

$$ G_2(s,t) = \sum_{P \in G_O^{-1}(s)} R_m(P)\, D(s \mid P)\, S_S(P), \qquad (3.15) $$

where $S_S(P)$ is the shape spectral function of a CSMP P, and $R_m(P)$ is the mean curvedness over P, given by

$$ R_m(P) = \frac{\int\int_{p \in P} R(p)\, dO}{\int\int_{p \in P} dO}. \qquad (3.16) $$

The support functions $(G_1, G_2)$ defined on the unit sphere capture the local geometric attributes of the mapped CSMPs. $G_1(s,t)$ integrated over a region on $S^2$ maintains a summary of the surface areas of all the mapped CSMPs in each shape category. As a special case, the integral of $G_1(s,0)$ over a region on $S^2$ provides the surface area of the CSMPs that are mapped into the region. The term $D(s \mid P)$ in $G_1$ becomes equal to the inverse of the Gaussian curvature of a point on the object when the CSMP P is a point, and it is equal to the product of the area and the Dirac delta function when the CSMP is of finite, nonzero area, for the following reason. Recollect that the Gaussian curvature K at a surface point p on O is defined by

$$ K(p) = \lim_{dO \to 0} \frac{dS}{dO}, $$

where dS is the area on the unit sphere to which a region $dO \subset O$ has been mapped. The inverse of the Gaussian curvature (1/K) is given by
$$ \frac{1}{K(p)} = \lim_{dS \to 0} \frac{dO}{dS}. $$

If the shape index changes everywhere on O, then each single point $p \in O$ becomes a CSMP P on the object. Then $G_1$, defined at the point on the unit sphere where P is mapped to, is equal to $(1/K(p))\, e^{jtS_I(P)}$. In the case of objects where the shape index is distinct and constant over different regions, the CSMPs are well-defined and their surface areas are nonzero. Then $\lim_{dS \to 0} dO/dS$ for a CSMP P becomes equal to the surface area of P multiplied by the Dirac delta function, because dO is nonzero for P while dS tends to zero; it is equivalent to the area dO of P multiplied by a Dirac delta function whose spike is at the image point of P. Using the sifting property of the Dirac delta function, we define that, for CSMPs of nonzero area, $G_1$ at a point $s_0$ on the unit sphere is equal to the area of the mapped patch multiplied by the Dirac delta function and also by its shape spectral function. Then the integral of $G_1(s_0, 0)$ in a region around $s_0$ on $S^2$ provides the surface area of the CSMPs that are mapped into the region. From now on, for ease of discussion, when we refer to $G_1$ defined for patches of finite surface area, we will simply state that $G_1$ contains information about the surface area of the mapped patches.

Similarly, $G_2(s, 0)$, when integrated over a region around a point on the unit sphere and normalized by the area of the mapped patches, provides the mean curvedness of the patches mapped into the region. When a point on the unit sphere is the image of multiple CSMPs on the object, $G_2(s,t)$ after integration and normalization provides a weighted summary of the mean curvedness of these patches, categorized in terms of the shape index. Note that in the definitions of both $G_1$ and $G_2$ at a point on the unit sphere, the area and the mean curvedness of each patch P are multiplied by the shape spectral function, $e^{jtS_I(P)}$. The shape spectral function aids in maintaining the surface area and curvedness of each patch individually, without aggregating them if their shapes are different, although many patches may have been mapped to the same
Similarly, the support function 02 results in a summary of the mean—curvedness of the CSMPs in each shape category. A suitable transform of the support functions result in high level feature summaries that characterize the local geometric attributes of an object. This will be established in the following sections. 3.2.3 Definition of Shape Spectrum We define the shape spectrum H : [0,1] -—) [0, 00] as follows: H(h)=//06(81(p)— h)d0 (3.17) 79 where h is the shape index variable, d0 is a small region containing a point p on object 0, 31(p) is the shape index at p, and 6() is the Dirac delta function. The latter is defined by /6(”_k)f(rc)dx= f() _ _ 0 otherwise. for all functions f(:r).l As a consequence of its definition, 6(33 — k) is zero everywhere except at x = k where it has an infinite value (a “spike”). Clearly, the delta function has a ‘sifting’ or ‘sampling’ property, in the sense that it picks up the value of f(.r) at the point where its spike appears. Our objective in defining the shape spectrum is to determine “how much” of the object’s surface area has a particular shape index value h, and therefore equation (3.17) utilizes the Dirac delta to accumulate the area of 0 for each shape index value h.2 In practice, we need a discretized definition of the shape spectrum since we work with pixels when we deal with range images of the object; we call this the shape histogram. Let us partition the shape index scale [0,1] into n “bins,” such that the kth bin is the half open interval [(k — 1)/n, k/n) (the shape index value 1 is included in the nth bin by definition). The value of the shape histogram in bin k is the number 1Strictly speaking, for a suitably chosen set of all “test. functions’ on the domain of a: [151], which places no restriction on us in practice. 2An alternative way of defining the shape spectrum is as the derivative of a shape distribution function, in analogy with the way probability density functions can be defined from probability distribution functions: H(h) = Edgsnru.) dé‘ dih (//Ou(s,(p) — h)d0) where u(:r) is the unit step function (also known as the Heaviside function): 12:20 ”(I): 0 x<0. Thus SDF(h) is the surface area of O that has shape index value less than or equal to h. Moving the differentiation inside the integral and observing that du(.r)/da: = 6(3), we obtain equation (3.17). 80 of pixels whose shape index falls in that bin: k N H(h = g) = ZXHSMPJ) (3.18) where p,- is a pixel, N is the total number of object pixels in the range image, and X is the characteristic function of a bin: 0 otherwise While equation (3.17) is the precise definition required for our theory, equation (3.18) is not as important, in the sense that we are free to employ other discretizations that may display algorithmically better properties. For example, instead of “binning” the shape index scale into n fixed-width bins, we often employ a more flexible binning scheme which has narrower bins (i.e., “higher resolution”) near values of the shape in- dex where a large number of pixels accumulate. Similarly, when we work with patches (the CSMPs used to segment an object) explicit bin boundaries are stored in each patch and inverted to approximate the shape spectrum. All these practical methods are qualitatively equivalent to the concept of “how much of the object has a given shape value,” formally embodied in equation (3.17). 
For computational purposes, such as comparing the spectra of two different objects (Chapter 4), we use a normalized (with respect to the object area) shape histogram, where each bin contains the percentage surface area of the object.

3.2.4 Relationship between Shape Spectrum and G1

As has been signaled by the term "spectrum" itself, there is a fundamental connection between the support function $G_1(s,t)$ and the shape spectrum H(h): the shape spectrum is the Fourier transform of the integral of $G_1$ over the entire unit sphere. In other words, if "shape" is a "frequency-domain" concept, $g_1(t) = \int\int G_1(s,t)\, dS$ is the corresponding "time-domain" concept. Indeed, this is what originally motivated our definition of the support function $G_1$: $G_1$ spreads the information over the unit sphere, which makes it much easier to work with patches (our CSMPs), whereas H spreads the information over the shape index scale, which makes it much easier to work with object view correlations.

To formally prove the relationship between $G_1$ and H, we begin with the integral of $G_1$, summarizing it over the entire unit sphere $S^2$:

$$ g_1(t) = \int\int_{S^2} G_1(s,t)\, dS = \int\int_{S^2} \left[ \sum_{P \in G_O^{-1}(s)} \frac{dO|_P}{dS}\, e^{jtS_I(P)} \right] dS. \qquad (3.19) $$

To simplify the right-hand side, we make the following observations. Whereas in a normal integral we would replace $(dO/dS)\, dS$ by dO, the term $(dO|_P/dS)\, dS$ is included because we wished to consider the point sets P on each fold in $G_O^{-1}(s)$ separately. This need to consider each fold of O separately is also the reason for the presence of the summation: the right-hand side of the definition of $G_1(s,t)$ (equation 3.14) aggregates all the CSMPs that get mapped to a point s on the unit sphere. We now note that the integral is over the entire unit sphere $S^2$, and therefore each fold (and CSMP P) on O will be duly considered in turn, and will be considered exactly once. Therefore, in changing the domain of the integral to dO, we may safely drop the restriction $dO|_P$, as well as the summation sign, and just traverse the entire object O. For the same reason, we can change the variable and integrate over points $p \in O$ instead of considering point sets $P \in G_O^{-1}(s)$. Thus, the above equation can be simplified to

$$ g_1(t) = \int\int_O e^{jtS_I(p)}\, dO. $$

Taking the Fourier transform of $g_1$ with h as the transform variable,

$$ \mathcal{F}(g_1(t)) = \int_{-\infty}^{\infty} \left( \int\int_O e^{jtS_I(p)}\, dO \right) e^{-jth}\, dt = \int\int_O dO \int_{-\infty}^{\infty} e^{jt(S_I(p) - h)}\, dt = \int\int_O dO\, \big( 2\pi\, \delta(S_I(p) - h) \big) = 2\pi\, H(h), $$

where we have used the facts that the Fourier transform of $e^{jat}$ is $2\pi\, \delta(a - h)$ for any constant a, and that $\delta(a) = \delta(-a)$. The factor of $2\pi$ is of no particular significance, as it can be made to disappear with an alternative definition of the Fourier transform (see, e.g., the treatment in [151]).

In passing, we observe that the above treatment of "shape" as an analogue of "frequency" explains our need to relocate the shape index scale to the interval [0,1] from Koenderink's original [-1,+1]. When all the "frequencies" are nonnegative numbers, terms with different frequencies in the expressions for $G_1$, H, etc. do not interact with each other. On the other hand, transform theory makes it clear that two spikes, one at -1 and another at +1, do interact to define a single signal, the sinusoid. We need to keep the shape content in an object at $S_I = -0.5$ (rut shape) distinct from (i.e., unaggregated with) the shape content at $S_I = +0.5$ (ridge shape), and hence the redefinition of the shape index.
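The Fourier relationship derived above can be checked numerically on made-up data: build $g_1(t)$ for a hypothetical object with two shape classes, then correlate against $e^{-jth}$ over a truncated t-range. Truncation turns each Dirac spike into a narrow sinc, but the approximate spectrum still peaks at exactly the shape index values present:

```python
import numpy as np

# Hypothetical object: area 3.0 of ridge (S_I = 0.75), area 2.0 of rut (0.25).
areas = np.array([3.0, 2.0])
sis   = np.array([0.75, 0.25])

t = np.linspace(-200.0, 200.0, 8001)          # truncated "time" axis
g1 = (areas * np.exp(1j * np.outer(t, sis))).sum(axis=1)

h = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
# (1/2pi) * integral of g1(t) exp(-j t h) dt approximates H(h).
H = (g1 * np.exp(-1j * np.outer(h, t))).sum(axis=1).real * dt / (2.0 * np.pi)
print(h[np.argmax(H)])                         # ~0.75, the dominant shape class
```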
Observe that a similar formulation can be used to define "how much" of the object's curvedness has a particular shape index value h, and this can be seen to provide a qualitative measure of the scale of each surface shape present in the object.

3.3 Properties of the COSMOS Representation

In this section we briefly highlight some of the important properties of the COSMOS representation. Table 3.2 discusses the features of various orientation-based descriptors and contrasts them with the COSMOS scheme. As detailed in Section 2.1.2, all the other orientation-based descriptors are verbose from the point of view of recognition, and their main emphasis is on abstracting an object description that can aid in the recoverability of the object. Their matching can fail when parts of objects are occluded, as only spherical maps of the objects are available to establish object similarities. However, COSMOS provides local surface features, the CSMPs, and their various local and global attributes, which can potentially be used for recognition even when objects are occluded.

Some important issues in the design and evaluation of a representation scheme, as put forth by Requicha [134], are as follows: (i) object domain (objects that can be modeled with the scheme), (ii) validity of the representation (i.e., whether it avoids nonsensical representations), (iii) uniqueness and completeness (whether the mapping from objects to representations is one-to-one), and (iv) conciseness of the representation. The properties of the COSMOS scheme are discussed and evaluated with respect to each of these issues.

Although the decomposition of an object in terms of surface patches has been studied earlier, the COSMOS scheme is new because of its use of the continuous-scale shape index for segmenting and matching free-form objects. The continuous scale increases the expressiveness of our scheme and makes it suitable for representing complex objects. Observe that the previous approaches which use the signs of the Gaussian and mean curvatures [10, 109] can describe a surface in terms of only eight categories (based on user-specified thresholds), whereas the shape index scale can be divided into as many classes as needed to represent an object, depending on the complexity of its shape. In addition, the mapping of the segmented CSMPs on the object surface onto a unit sphere, based on their mean surface normals, is novel. The representation combines both local and global aspects of describing an object, via the support functions defined on the unit sphere and the CSMP connectivity, which can be used for the reconstruction and recognition of surfaces.

A COSMOS of an object is independent of the position of the origin of the object coordinate system, because the orientations of the CSMPs are projected onto the unit sphere in parallel transformation. The scale information is captured in terms of the curvedness of the maximal patches. A rotation of the coordinate system is reflected by the rotation of the orientations of the CSMPs mapped to the unit sphere.

Table 3.2: COSMOS and orientation-based representations.

| Representation scheme | Mapping | Support functions on the unit sphere | Applicable object domain | Salient features |
| EGI [85] | spherical mapping of surface normals at all points | Gaussian curvature at a surface point | convex closed objects | objects can be recovered; translation cannot be recovered uniquely |
| SFBR [118] | spherical mapping of surface normals at all points | distance of the tangent plane at a point from a predefined origin | convex closed objects | closed surfaces can be uniquely determined; representation depends on the choice of origin |
| CEGI [96] | spherical mapping of surface normals of all points | complex number storing the Gaussian curvature and the distance of a point from a specified origin in the direction of the normal | convex closed objects | the translation of a convex object can be recovered uniquely |
| GGI [109] | spherical mapping of surface normals of constant Gaussian curvature patches | connectivity information | convex and nonconvex objects | nonconvex objects can be represented uniquely and recovered |
| OBR [108] | the Gauss map (surface normals) and the dilation map | distance function; first and second curvature functions; radial distance function | convex and star-shaped objects | convex and star-shaped objects can be recovered from their representations |
| COSMOS [45] | spherical mapping of orientations of CSMPs | the Gaussian curvature; mean curvedness | convex and nonconvex objects | shape-based analysis of objects; connectivity information is maintained as a list; convex polyhedra and convex objects whose shape index varies continuously can be recovered |

3.3.1 Compactness

The COSMOS representation compactly captures the geometry of an object in terms of a set of easily detectable maximal patches. Objects of simple shapes, such as polyhedra, cylinders, and spheres, have very compact representations. For example, for a full sphere, we have a single CSMP ($S_I = 1$), and it is mapped to the center of the unit sphere (0, 0, 0), which is then associated with the appropriate support functions for area and curvedness. The adjacency set V is empty, as there is only a single CSMP. This is compact in comparison with other orientation-based descriptors, where every surface normal of a closed convex object would be mapped to the sphere. It is also compact for many classes of objects that contain only a few distinguishable surface patches of constant shape index, i.e., whose surface shapes do not change rapidly over large regions of the object. If an object is complex and composed of different shapes with rapid protrusions and indentations, then its shape complexity is reflected in an increased number of CSMPs on the object.

3.3.2 Convex Objects

The COSMOS representation is equivalent to the EGI scheme in the case of convex polyhedra. For a convex polyhedron O, each planar face $f_i$ of the polyhedron becomes a maximal patch of constant shape, separated by edges consisting of points of surface normal discontinuity: $f_i \in O \Leftrightarrow P_i \in O$. $G_O$ maps each $P_i$ to the unit sphere, based on the direction of its surface normal (orientation). The adjacencies of the planar faces are stored in V. Since the curvedness of a planar surface is zero, the support function $G_2 = 0$ for all patches $P_i$ (i.e., for all faces $f_i$ in the polyhedron), and $G_1$ provides the area of each of the planar faces. Thus the COSMOS for a convex polyhedron is a mapping of the normals of each of the planar faces of the polyhedron onto the unit sphere, along with their associated areas and connectivity (denoted by
We can now define a one-to-one mapping F : S²_COSMOS → S²_EGI between the spherical map S²_EGI generated by the EGI representation and the map S²_COSMOS generated by the COSMOS description, such that every normal defined on S²_EGI has a normal from S²_COSMOS associated with it. The support function stored at the normals on S²_EGI is also included among the support functions stored at the points on S²_COSMOS. Thus, COSMOS is isomorphic to EGI for a convex polyhedron.

Figure 3.13: COSMOS and EGI of a convex polyhedron. (The support functions are shown only for normals N1 and N4 for clarity.)

COSMOS also reduces to EGI in the case of a continuous smooth convex closed object on which S_I(p), p ∈ O, is different at every point p. Let O be a convex object (for a convex object, S_I ≥ 0.5) whose S_I(p) differs at every point p ∈ O; then a maximal patch P_i with constant shape index reduces to a point p_i, as the shape index is not constant over any neighborhood around p_i. The orientation of P_i is then the same as the normal N_{p_i} at p_i, and each p_i on the surface gets mapped to a point on the unit sphere by the function G_O:

  G_O(P_i) = N_{p_i}.

As explained earlier, we associate the inverse of the Gaussian curvature at the point, 1/K(p_i), as the coefficient of the support function G_1, and G_2 stores the curvedness of the point, R(p_i) e^{j S_I(p_i)}. Since the EGI of a closed smooth convex object is a spherical map S²_EGI containing the normal at each point p on the object, along with a support function of 1/K(p), we can define a function F_1 : S²_COSMOS → S²_EGI that associates each point on the spherical map S²_COSMOS with a point on S²_EGI. The support function stored on S²_EGI is included among the support functions stored at the points on S²_COSMOS.

Thus, both convex polyhedra and smooth convex closed objects whose shape index varies continuously are recoverable (up to a translation) from their COSMOS representations, as the surface recoverability theorems of EGI apply in these cases. Though it may be possible that the COSMOS representation of a convex object containing many CSMPs of nonzero area is unique up to a translation, recoverability is not guaranteed. The uniqueness property appears intuitively plausible, as no two surface normals have the same direction on a convex object, leading to a non-identical mapping of the orientation of each patch on the unit sphere. Thus, the mapping of a convex object in terms of its maximal patches onto the unit sphere is one-to-one, and a coarse approximation of the object in terms of its CSMPs can be reconstructed.

3.3.3 Nonconvex Objects

In dealing with a nonconvex object, global information, i.e., adjacency, is required due to the nonuniqueness of the surface normals on the object. The main difficulty with EGI and CEGI in representing nonconvex objects is that the connectivity (neighborhood) information between surface points is lost; global connectivity or adjacency information of the surface points is required to reconstruct the object unambiguously. Consider two objects, one convex and the other nonconvex, as shown in Figure 3.14. Both of the surface normals N1 and N2 on Object2 map to the same point on the unit sphere with its EGI representation. Object1 and Object2 have identical EGI representations if we assume that the surface areas stored at the points on the unit sphere corresponding to these normals are the same.
Figure 3.14: A convex object (Object1) and a nonconvex object (Object2) that have identical EGI representations.

This example clearly demonstrates that it is impossible to uniquely recover a nonconvex object from its EGI. It may be possible that the COSMOS representation of a nonconvex object containing many CSMPs of nonzero surface area can be used to reconstruct the object coarsely, in terms of the patches, up to a translation. However, we hasten to add that recoverability is not always guaranteed. Recollect that our representation of an object not only carries the local descriptions of the CSMPs on the object, but also maintains their adjacencies explicitly. Since the connectivity information between the patches is captured, even when more than one patch maps onto an identical point on the unit sphere, a sufficient amount of information is preserved to obtain the inverse map of the unit sphere unambiguously. The surface adjacency information helps in discriminating between different patches with identical orientation. Note, however, that since we use the average surface normals in our representation, information about the orientation of each individual surface point is lost, and this can complicate the recoverability of the object. Our representation also does not model the holes that may be present in an object; hence, recoverability of such an object from its COSMOS representation is not possible.

3.4 3D Objects and their COSMOS Representations: Examples

In this section, we give examples of 3D object surfaces and their COSMOS representations. Note that we omit an explicit mention of the Dirac delta function in specifying the support functions G_1 and G_2 on the unit sphere for patches of finite, nonzero area.

3.4.1 Simple Objects

o A sphere of radius a (a ≥ 1): Its COSMOS representation is shown in Figure 3.15. The entire object is described by a single patch P1, whose surface orientation N1 maps it to the center of the unit sphere, (0, 0, 0). The shape index of P1 is that of a spherical cap, with S_I = 1 (convex umbilic). The support functions, indicated by (S1) in Figure 3.15 at (0, 0, 0), are G_1 = (4πa²) e^{j·1.0} and G_2 = (1/a) e^{j·1.0}. Note that the adjacency set V is empty, as there is only a single patch on the object.

Figure 3.15: COSMOS representation: (a) a sphere of radius a (a ≥ 1); (b) the Gauss patch map with the support functions indicated by (S1).

o A convex polyhedron: The object has six CSMPs, each of which is planar in shape and mapped to the unit sphere as shown in Figure 3.16. The surface connectivity information and the support functions, indicated by (Si), i = 1, ..., 6, in Figure 3.16, are listed in Table 3.3. Note that for a planar patch P, e^{j S_I(P)} is defined to be e^{j·2.0}, as the shape index of a planar shape is indeterminate and is assigned an arbitrary value of 2.0 in our implementation. Observe that the curvedness of a planar patch is zero, resulting in G_2 = 0 for all CSMPs on this object.

Figure 3.16: COSMOS representation: (a) a convex polyhedron; (b) the Gauss patch map with the support functions.

Table 3.3: Surface connectivity and support functions on the unit sphere for a convex polyhedron.

  CSMP P   Connectivity list V(P)     ∬ G_1 (surface area)   ∬ G_2 (mean curvedness)
  P1       {P2, P3, P4, P5}           A1 e^{j·2.0}            0
  P2       {P3, P1, P5, P6}           A2 e^{j·2.0}            0
  P3       {P2, P4, P1, P6}           A3 e^{j·2.0}            0
  P4       {P3, P5, P1, P6}           A4 e^{j·2.0}            0
  P5       {P2, P4, P1, P6}           A5 e^{j·2.0}            0
  P6       {P2, P3, P4, P5}           A6 e^{j·2.0}            0

o A nonconvex polyhedron: The eight planar-shaped CSMPs are mapped to the unit sphere as shown in Figure 3.17. The support functions on the unit sphere, denoted by (Si), i = 1, ..., 8, in Figure 3.17, are specified in Table 3.4. Observe that the CSMPs P2 and P4 map to the same point on the unit sphere, thus causing a fold in the spherical mapping.

Figure 3.17: COSMOS representation: (a) a nonconvex polyhedron (only visible normals are shown for clarity); (b) the Gauss patch map with the support functions.

Table 3.4: Surface connectivity and support functions on the unit sphere for a nonconvex polyhedron.

  CSMP P   Connectivity list V(P)         ∬ G_1 (surface area)   ∬ G_2 (mean curvedness)
  P1       {P2, P5, P6, P7}               A1 e^{j·2.0}            0
  P2       {P1, P3, P5, P7}               A2 e^{j·2.0}            0
  P3       {P2, P4, P5, P7}               A3 e^{j·2.0}            0
  P4       {P3, P5, P7, P8}               A4 e^{j·2.0}            0
  P5       {P2, P3, P4, P6, P8, P1}       A5 e^{j·2.0}            0
  P6       {P1, P5, P7, P8}               A6 e^{j·2.0}            0
  P7       {P1, P2, P3, P4, P6, P8}       A7 e^{j·2.0}            0
  P8       {P4, P5, P6, P7}               A8 e^{j·2.0}            0
However, the surface patch adjacency information stored explicitly in the connectivity list V of each CSMP aids in distinguishing them as different patches on the object.

o A truncated cylinder with spherical caps: This object has 3 CSMPs: one of cylindrical ridge shape and two patches of spherical cap shape. The cylinder has a radius a and a height h. The spherical mapping of the orientations of the patches is shown in Figure 3.18. Note that the mean orientation of the entire cylindrical patch reduces to (0, 0, 0) and hence maps to the center of the unit sphere. The support functions on the unit sphere, denoted by (Si), i = 1, ..., 3, in Figure 3.18, and the surface adjacency list are given in Table 3.5.

Figure 3.18: COSMOS representation: (a) a truncated cylinder with spherical caps; (b) the Gauss patch map with the support functions.

Table 3.5: Surface connectivity and support functions on the unit sphere for a cylinder truncated with spherical ends.

  CSMP P   Connectivity list V(P)   ∬ G_1 (surface area)   ∬ G_2 (mean curvedness)
  P1       {P2, P3}                 (2πah) e^{j·0.75}       (1/(√2·a)) e^{j·0.75}
  P2       {P1}                     (2πa²) e^{j·1.0}        (1/a) e^{j·1.0}
  P3       {P1}                     (2πa²) e^{j·1.0}        (1/a) e^{j·1.0}

o A truncated cylinder with planar ends: It contains 3 CSMPs: one of cylindrical ridge shape and two patches of planar shape. The cylinder has a radius a and a height h. The spherical mapping of the orientations of the CSMPs is shown in Figure 3.19. The orientation of the cylindrical surface of the object maps to the center of the unit sphere. The support functions on the unit sphere, denoted by (Si), i = 1, ..., 3, in Figure 3.19, and the surface adjacency list are given in Table 3.6. Note that for the objects illustrated in Figures 3.18 and 3.19, only the CSMPs P2 and P3 differ in their shape index and curvedness values.

Figure 3.19: COSMOS representation: (a) a truncated cylinder with planar ends; (b) the Gauss patch map with the support functions.

Table 3.6: Surface connectivity and support functions on the unit sphere for a truncated cylinder with planar ends.

  CSMP P   Connectivity list V(P)   ∬ G_1 (surface area)   ∬ G_2 (mean curvedness)
  P1       {P2, P3}                 (2πah) e^{j·0.75}       (1/(√2·a)) e^{j·0.75}
  P2       {P1}                     A2 e^{j·2.0}            0
  P3       {P1}                     A3 e^{j·2.0}            0

o A telephone handset: A simplified drawing of a telephone handset, as shown in Figure 3.20, reveals 6 CSMPs: two surface patches of planar shape (P1, P6), three CSMPs of cylindrical ridge shape (P2, P3, P5), and one of saddle rut shape (P4). P2 and P5 are cylindrical ridge surfaces with radius a1 and height h1; P3 is a ridge-shaped surface with radius a2 and height h2. The spherical mapping of the patches is illustrated in Figure 3.20. Note that the mean orientation of an entire cylindrical patch, for example P2, sums to (0, 0, 0) and thus maps to the center of the unit sphere. The support functions on the unit sphere, denoted by (Si), i = 1, ..., 6, in Figure 3.20, and the surface adjacency list are given in Table 3.7.

Figure 3.20: COSMOS representation: (a) a telephone handset; (b) the Gauss patch map with the support functions.

Table 3.7: Surface connectivity and support functions on the unit sphere for a telephone handset. (The curvedness coefficient of the saddle rut patch P4 is written here as its mean curvedness R̄4.)

  CSMP P   Connectivity list V(P)   ∬ G_1 (surface area)    ∬ G_2 (mean curvedness)
  P1       {P2}                     A1 e^{j·2.0}             0
  P2       {P1, P3, P4}             (2πa1h1) e^{j·0.75}      (1/(√2·a1)) e^{j·0.75}
  P3       {P2, P4, P5}             (2πa2h2) e^{j·0.75}      (1/(√2·a2)) e^{j·0.75}
  P4       {P2, P3, P5}             A4 e^{j·0.375}           R̄4 e^{j·0.375}
  P5       {P3, P4, P6}             (2πa1h1) e^{j·0.75}      (1/(√2·a1)) e^{j·0.75}
  P6       {P5}                     A6 e^{j·2.0}             0
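To make the bookkeeping in the preceding examples concrete, the sketch below shows one plausible in-memory record for a CSMP. This is an illustrative data structure of our own, not the thesis implementation; the field names, the PLANAR_LABEL constant, and the cylinder example are assumptions that merely follow the conventions stated above (planar patches labeled 2.0, support-function coefficients given by surface area and mean curvedness).

```python
# A minimal sketch (not the thesis code) of one CSMP record in a COSMOS
# description.  Field names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import math

PLANAR_LABEL = 2.0  # arbitrary label assigned to planar patches (see text)

@dataclass
class CSMP:
    shape_index: float                        # constant shape index (2.0 if planar)
    orientation: Tuple[float, float, float]   # mean surface normal of the patch
    area: float                               # coefficient of support function G1
    curvedness: float                         # coefficient of support function G2
    adjacency: List[int] = field(default_factory=list)  # connectivity list V

    def sphere_point(self) -> Optional[Tuple[float, float]]:
        """Point (zeta, eta) on the unit sphere; degenerate orientations such
        as a full cylinder's, which sum to (0, 0, 0), map to the center."""
        nx, ny, nz = self.orientation
        if nx == ny == nz == 0.0:
            return None  # center of the unit sphere
        return (math.atan2(ny, nx), math.acos(max(-1.0, min(1.0, nz))))

# P1 of the truncated cylinder with spherical caps (Table 3.5), radius a, height h:
a, h = 1.0, 2.0
P1 = CSMP(shape_index=0.75, orientation=(0.0, 0.0, 0.0),
          area=2 * math.pi * a * h, curvedness=1 / (math.sqrt(2) * a),
          adjacency=[2, 3])
```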
3.4.2 Torus

In this section, we derive the shape index values on a torus as an example of a continuous smooth nonconvex object whose parametric representation is available. The parametric equation of the torus is

  torus(u, v) = ((a + b\cos v)\cos u,\; (a + b\cos v)\sin u,\; b\sin v).    (3.20)

The principal curvatures κ1 and κ2 are given by

  \kappa_1 = \frac{\cos v}{a + b\cos v}, \qquad \kappa_2 = \frac{1}{b}.

Thus we note that κ1 of the torus vanishes along the curves given by v = ±π/2; these values correspond locally to the convex cylindrical shape category. The set of hyperbolic (saddle-shaped) points is given by

  \{ torus(u, v) \mid \pi/2 < v < 3\pi/2 \},

and the set of elliptic (dome-shaped) points on the surface is

  \{ torus(u, v) \mid -\pi/2 < v < \pi/2 \}.

Figure 3.21 shows the values of the shape index computed on the surface of the torus. The shape index values are color coded in the figure: the blue points are elliptic, the green points are hyperbolic, and the parabolic points are shown in red.

Figure 3.21: Shape index values on the surface of the torus.
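The closed-form curvatures above make the torus a convenient test case. The sketch below evaluates them on a parameter grid and converts them to shape index values. It is our own illustration, not the thesis code, and it adopts the sign convention in which convex points carry positive principal curvatures so that a convex umbilic receives S_I = 1; conventions differ with the choice of normal direction, so the formula is stated under that assumption.

```python
import numpy as np

def shape_index(k1, k2):
    """Shape index on the [0, 1] scale (0 = spherical cup, 0.25 = rut,
    0.5 = saddle, 0.75 = ridge, 1 = spherical cap), assuming convex
    points have positive curvature.  Umbilics are handled by arctan2;
    planar points (k1 = k2 = 0) come out as 0.5 here and would be
    relabeled 2.0 in practice."""
    kmax, kmin = np.maximum(k1, k2), np.minimum(k1, k2)
    return 0.5 + (1.0 / np.pi) * np.arctan2(kmax + kmin, kmax - kmin)

# Principal curvatures of torus(u, v) from the closed form above.
a, b = 3.0, 1.0
u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 256),
                   np.linspace(0, 2 * np.pi, 256))
k1 = np.cos(v) / (a + b * np.cos(v))
k2 = np.full_like(v, 1.0 / b)
si = shape_index(k1, k2)

# Sanity checks against the point sets derived above:
# v = +-pi/2 (kappa_1 = 0) gives si = 0.75, the convex cylindrical category;
# -pi/2 < v < pi/2 gives elliptic points (si > 0.75 on this torus);
# the remaining band is hyperbolic (0.5 < si < 0.75 here).
```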
3.5 Deriving the COSMOS Representation of an Object from Range Data

A complete description of a 3D object may be available as an analytic function, for example, with simple 3D objects such as a sphere, a hyperboloid, or a polyhedron. In such situations, we can construct the COSMOS of the complete object easily, whether the object is convex or nonconvex. We compute the shape index of the points on the object using its analytic form, and we then determine the maximal patches of constant shape index present on the object. We compute the area A_{P_i}, the orientation, and the mean curvedness of each of the CSMPs P_i, 0 < i ≤ n. The orientation O_{P_i} provides the position (ζ, η) on the unit sphere, thus specifying the spherical mapping of the CSMP. At the mapped points on the unit sphere, the support functions G_1 and G_2 are computed. In the case of a convex object, since the surface normal at each point on the object is unique, the mapping from the object to the unit sphere is one-to-one, and G_1 and G_2 take simple forms. In the case of a nonconvex object, however, it is possible that there are multiple CSMPs whose orientations map to the same point on the unit sphere. In such a case, if the shape index is the same for the multiple patches mapped together, their support functions are added, depending on their shape categories, as specified in Section 3.2.2.

In practice, however, with free-form surfaces we may have neither complete object models nor descriptions in the form of analytic functions. Hence, we need to build an internal representation of the object from range data of its multiple views.

The COSMOS representation of an object derived from complete 3D surface data is viewpoint-independent. However, the COSMOS derived from a given range image of an object is view-dependent, due to two factors: (i) the orientations of the CSMPs, and (ii) the curved nature of the object. For a point p on a surface of the object to be visible from a point q in space, the outward-pointing surface normal at p must make a positive projection on the vector from p to q. Surface normals on the "outer side" of a surface project out of the volume occupied by the surface. In the case of polyhedra, a planar face is either fully visible or not visible at all; a planar face is never partially visible, because the normals at every point on the face make the same angle with the viewing direction. In the case of curved objects, however, this is not necessarily true. Since the normals on a single CSMP can vary, it is possible that only a part of the complete CSMP present on the entire object is visible, and thus the area and the curvedness of the CSMP seen in a specific view of the object reflect only what is visible in that view. Hence, multiple views are needed to fully describe a curved object.

Given multiple views of an object, we compute the COSMOS of each view from its range data. We assume that the scene contains only a single object. This is not a restrictive assumption, as a cluttered scene can be broadly segmented using depth discontinuities into several connected components, each of which serves as an object of interest. A set of COSMOS representations derived from various views of an object forms a 3D model of a single object. We have adopted a "collection of views" description of an object instead of building the complete COSMOS representation of the object by registering and integrating the multiple views. This is primarily because the input to a recognition system is typically a 2D appearance of an object. Since object identification has to proceed from a single view of an object, our approach is to maintain a set of views in our object database and to match only observed surface patches with those of the model views.

The range data of the views of the object are obtained either using a laser range sensor or using the CAD models or surface triangulations, if available. The catalog of possible views of an object is based only on a chosen tessellation of the view sphere. Therefore, the number of views chosen to represent an object is model-driven, based on the complexity of the object. For a simple object, a very coarse tessellation resulting in a few views may be sufficient. The issue of determining the number of views necessary for recognition is addressed in Chapter 4.

3.5.1 Construction of the COSMOS of a Single View of an Object

We first present a scheme to derive the COSMOS representation from the range data of a single view of an object. For the discussion below, we assume that a range image of an object from an arbitrary viewpoint is obtained using a Technical Arts White scanner [66]. The computational scheme outlined is applicable to data obtained using other kinds of range scanners, with suitable definitions of surface curvatures, pixel connectedness, etc. The processing steps are shown in Figure 3.22.
The range image obtained using the laser range scanner available in our laboratory provides the surface depth data of the object visible to the camera, with the pixels in the image arranged in a cartesian X–Y grid. The accuracy of the depth values is of the order of 0.001 inches. Since the camera used in the scanner does not use square pixels, and since the camera is mounted at an angle to the object to be scanned (it is not placed directly overhead), some views of the object appear compressed in their length while other views appear stretched. This is aggravated by the fact that the image sampling along the X direction is not the same as along the Y direction. Currently, no preprocessing is done on our data to correct for this distortion, as our intent is to demonstrate that our recognition system can perform well in spite of the distortions that may be present in the data. We treat these distortions as different forms of noise in the sensed surface data.

Figure 3.22: Construction of COSMOS from range data of an object. (The processing steps are: compute and smooth the principal curvatures at each point; compute the shape index at each point; construct maximal patches of constant shape index using the constrained region growing scheme; construct the Gauss patch map using the surface patch orientations; compute the surface connectivity list; and compute the support functions at the mapped points on the unit sphere.)

We first compute the surface curvatures at each pixel (object point) in the image by estimating them using a bicubic approximation of the local neighborhood. In our implementation, we use a neighborhood of 5 × 5 pixels to locally approximate the underlying surface. We then apply the iterative curvature smoothing algorithm based on the curvature consistency criterion [63] to improve the reliability of the estimated curvatures. We process all the data by running the curvature smoothing algorithm for 15 iterations; it can be seen from the segmentation results shown in Section 3.5.2 that this smoothing is sometimes excessive for some object views.
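The following sketch illustrates the curvature-estimation step. The thesis fits a bicubic patch over a 5 × 5 neighborhood; for brevity this illustration uses finite differences of the Monge patch z = Z(x, y), which yields the same principal curvatures via the first and second fundamental forms. It is a simplification under stated assumptions, not the thesis implementation, and it omits the 15-iteration curvature-consistency smoothing.

```python
import numpy as np

def principal_curvatures(Z, dx=1.0, dy=1.0):
    """Principal curvatures of a range image Z(x, y) treated as a Monge
    patch z = Z(x, y).  Derivatives come from finite differences; the
    thesis instead fits a bicubic surface over each 5x5 neighborhood."""
    Zy, Zx = np.gradient(Z, dy, dx)           # first partial derivatives
    Zxy, Zxx = np.gradient(Zx, dy, dx)        # second partial derivatives
    Zyy, _ = np.gradient(Zy, dy, dx)
    E, F, G = 1 + Zx**2, Zx * Zy, 1 + Zy**2   # first fundamental form
    W = np.sqrt(1 + Zx**2 + Zy**2)
    L, M, N = Zxx / W, Zxy / W, Zyy / W       # second fundamental form
    denom = E * G - F**2
    K = (L * N - M**2) / denom                # Gaussian curvature
    H = (E * N - 2 * F * M + G * L) / (2 * denom)  # mean curvature
    disc = np.sqrt(np.maximum(H**2 - K, 0.0))
    return H + disc, H - disc                 # kappa_1 >= kappa_2
```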
Once the curvatures are reliably estimated at each pixel, the shape index values are computed. The next challenging task is to obtain as few maximally connected patches of constant shape index as possible, while retaining sufficient information about the different shapes that may be present. Since the range data are of finite resolution, and since curvature estimates can be noisy, we need to take into account the possibility of noise in the shape index values of surface points belonging to the same shape. Connected points whose shape indices are similar have to be merged to obtain a CSMP.

An obvious approach to finding maximal patches of constant shape index in the range image is to partition the shape scale a priori (independent of the actual distribution of shape index values in the image) into a finite number of bins of either fixed or varying width. With such a method, however, the bin boundaries are artificial and may not correspond to a "natural" segmentation of the image: a region of the image containing pixels with similar shape index values may be split into two maximal patches just because the values cross an a priori bin boundary. Therefore, the shape index boundaries should depend on the contents of the image. More generally, the problem is to group image pixels into CSMPs with the following objectives: (i) minimize the total number of CSMPs in the image (to avoid fragmentation into hundreds of small patches); and (ii) minimize some measure of the spread of shape index values of the pixels within a CSMP. These two objectives obviously conflict, and global information is required to achieve objective (i) subject to some constraints. A brief description of our segmentation algorithm is given below.

3.5.2 Constrained Region Growing

Our segmentation algorithm constructs maximal patches by repeatedly merging smaller patches, starting with very small (pixel-sized) patches. Since in each step it merges the two (adjacent) patches that would result in the least merged shape diameter (the difference between the minimum and maximum shape index values within the patch) among all the feasible (connected) pairs of patches, the spread of shape indices of the pixels within a patch is minimized. In addition, since the algorithm is applied until the merged shape diameter would exceed the constrained value for every remaining pair of patches, it minimizes the total number of patches. Using the shape index in this way improves the segmentation over methods that use only fixed-width, fixed-threshold quantized bins. The algorithm can be supplemented with a post-processing step in which purely local adjustments are made to pixels on the boundaries of patches; for example, a boundary pixel can be moved to another patch depending on its influence on the means and deviations of the maximal patches.

To illustrate the effectiveness and the generality of our scheme, we show the maximal patches for different objects in Figure 3.23. In this figure, part (a) shows the range image of an object and part (b) shows the image of the various CSMPs (shown in different colors) on the object. The Gauss map of the CSMPs is shown in part (c). A shape diameter of 0.25 yielded good CSMPs in most images. An increased shape diameter resulted in bigger patches in the cases of the cobra head and the cup. This parameter can be adaptively adjusted depending on the size of the smallest CSMP that is detected in a given image. If the design philosophy is that the smallest CSMP found in an object view should contain at least 10% of the visible surface area in the view, then the shape diameter can be adaptively adjusted depending on the object present in the view. We note that perceptual accuracy is not the only criterion by which to evaluate object segmentation in our case; one can have a visually unsatisfactory segmentation (e.g., over-segmentation) which is nevertheless adequate for object recognition. Our recognition system (Chapter 5) has been designed to handle imperfect segmentation results by merging connected CSMPs if necessary while establishing the model-scene patch correspondences during recognition.
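A compact sketch of the merging rule just described is given below. It is our own illustration under stated assumptions (4-connected pixels, a lazy priority queue over adjacent patch pairs, and the 0.25 shape-diameter constraint as the default), not the thesis code, and it omits the boundary-pixel post-processing step.

```python
import heapq
import numpy as np

def grow_csmps(si, max_diameter=0.25):
    """Greedy constrained region growing over a shape-index image `si`.
    Start from pixel-sized patches; repeatedly merge the adjacent pair of
    patches with the smallest merged shape diameter (max - min shape index
    inside the merged patch) until every feasible merge would exceed
    max_diameter.  Returns a patch-label image."""
    h, w = si.shape
    parent = list(range(h * w))
    lo = si.ravel().astype(float).copy()  # per-patch min shape index
    hi = lo.copy()                        # per-patch max shape index

    def find(a):                          # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    heap = []                             # lazy priority queue of patch pairs
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:
                heapq.heappush(heap, (abs(lo[i] - lo[i + 1]), i, i + 1))
            if r + 1 < h:
                heapq.heappush(heap, (abs(lo[i] - lo[i + w]), i, i + w))

    while heap:
        d, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        true_d = max(hi[ri], hi[rj]) - min(lo[ri], lo[rj])
        if true_d > max_diameter:
            continue                      # diameters only grow; pair is dead
        if true_d > d:                    # stale entry: re-queue with new key
            heapq.heappush(heap, (true_d, ri, rj))
            continue
        parent[rj] = ri                   # merge the two patches
        lo[ri] = min(lo[ri], lo[rj])
        hi[ri] = max(hi[ri], hi[rj])
    return np.array([find(i) for i in range(h * w)]).reshape(h, w)
```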
A few remarks about the role of edges on the objects (surface depth and surface normal discontinuities) are in order here. In range images of objects, edges are never knife-sharp, nor are corners needle-like. Particularly with free-form sculpted objects, the edges present on the objects are typically smooth, without sudden jumps in the depth or normals (contrast a human face with a polyhedron!). Since our current segmentation algorithm does not explicitly detect edges prior to region growing, our experimental results indicate that the detected CSMPs gracefully blend into neighboring CSMPs without a sharp boundary between them. Future work on improving the segmentation results should integrate edge detection schemes with the region growing algorithm, to obtain stable region boundaries and to prevent leaking of a CSMP across the discontinuities into its surrounding regions.

The surface attributes stored by the coefficients of the support functions on the unit sphere for the vase are given (without the shape spectral functions) in Table 3.8. The support functions are defined at points on the unit sphere as indicated in the table; at all other points on the unit sphere, they are undefined.

Table 3.8: The COSMOS representation: Support functions for Vase2-1 on the unit sphere.

  Point on the unit sphere (ζ, η)   CSMP               Mean shape index   Surface area   Mean curvedness
  (3.106989, 0.677436)              P1 (red)           0.854807           5725           0.593455
  (-3.079618, 0.482079)             P2 (green)         0.720356           112            0.366058
  (0.065550, 0.607387)              P3 (blue)          0.473433           289            0.454395
  (-2.893622, 0.616532)             P4 (yellow)        0.656924           793            0.737030
  (3.091251, 0.926560)              P5 (magenta)       0.576004           343            0.571970
  (0.230828, 0.580261)              P6 (cyan)          0.821453           520            0.932052
  (-3.129172, 0.983077)             P7 (white)         0.747665           326            1.105722
  (-3.115084, 0.647045)             P8 (light green)   0.678569           1088           0.474518
  (0.053756, 0.681923)              P9 (pine green)    0.666768           17             0.394001

Figure 3.23: Representation of objects with free-form surfaces (Vase2-1, Phone-1, Cobra-1; Small-vase-1, Giraffe-2, Cup-4; Fork-4): (a) range image; (b) constant-shape maximal patches; (c) the Gauss patch map.

3.5.3 Sensitivity Analysis of the Shape Index, S_I

When the shape index is used to segment digital range images of objects into CSMPs, it is apparent that any inaccuracy in the surface curvature values, resulting from sensor noise and from numerical errors introduced by the curvature estimator, affects the accuracy of the computed shape index and of the CSMPs. The curvatures computed from range images are sensitive to noise in the surface depth values, as they are second-order differentials of the surface depth function. We therefore studied the sensitivity of the shape index to small changes in the principal curvatures in order to improve our segmentation results. The sensitivity is defined as the change in the shape index with an infinitesimal change in the principal curvatures, κ1 and κ2, and is given by

  \frac{\partial S_I}{\partial \kappa_1} = \frac{\kappa_2}{\pi(\kappa_1^2 + \kappa_2^2)},    (3.21)

  \frac{\partial S_I}{\partial \kappa_2} = \frac{-\kappa_1}{\pi(\kappa_1^2 + \kappa_2^2)}.    (3.22)

The plot of the differential sensitivity of the shape index S_I with respect to both κ1 and κ2 is shown in Figure 3.24. It can be seen from Figure 3.24 that when the principal curvatures are zero or near zero, any small variation in their values results in a very large change in the shape index; in fact, when the principal curvatures are nearly zero, the sensitivity goes to infinity. Thus the shape index is very sensitive to variations in the curvature values when they are almost zero. This information can be utilized in identifying planar surfaces. Observe that in practice, to detect the planar points on a surface, we can only assert that the principal curvatures have to be less than some value; they are never exactly zero in an implementation on a digital computer. The difficulty of choosing a threshold near zero is avoided by computing the sensitivity of the shape index at all points and declaring all those points that have sensitivity values higher than a threshold to be planar points. This threshold is easier to set in practice, as it is a relatively large value. In addition, pixels where the curvature values are unreliable can also be detected and excluded from further analysis: when the shape sensitivity values computed at a pixel exceed a certain threshold, the curvature values are too low to reliably compute the shape index, and further processing of such points is avoided.

Figure 3.24: Sensitivity of the shape index to principal curvatures.
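The planarity test this suggests is easy to state in code. The sketch below applies equations (3.21)–(3.22) and flags pixels whose sensitivity exceeds a threshold; the threshold value shown is an illustrative guess of ours, not the thesis setting.

```python
import numpy as np

def planar_mask(k1, k2, sens_threshold=50.0):
    """Flag pixels whose shape index is too unstable to trust.

    Implements the sensitivity test described above: |dS_I/dkappa| blows up
    as (k1, k2) -> (0, 0), so points whose sensitivity exceeds a (large,
    hence easy to choose) threshold are declared planar or unreliable."""
    denom = np.pi * (k1**2 + k2**2)
    with np.errstate(divide="ignore", invalid="ignore"):
        s1 = np.abs(k2 / denom)   # |dS_I / d kappa_1|, eq. (3.21)
        s2 = np.abs(k1 / denom)   # |dS_I / d kappa_2|, eq. (3.22)
    sens = np.maximum(s1, s2)
    # NaN (exactly zero curvatures) and very large sensitivities are planar.
    return ~np.isfinite(sens) | (sens > sens_threshold)
```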
3.5.4 Shape Spectrum of an Object View: Examples

As described in Section 3.2.3, the shape spectrum of an object is similar to the frequency spectrum of a time-domain signal. It characterizes the shape content of an object by summarizing the surface area of the object at each shape index value. The shape spectrum of an object view is obtained from its range data by constructing a histogram H(h) of the shape index values—we used 0.005 as the bin width—and accumulating all the object pixels that fall into each bin. Note that the shape spectrum could also have been constructed using the surface area and shape index values of the CSMPs generated by segmentation. However, we preferred to use directly the original shape index values computed for each pixel in the image (instead of the mean and variance information stored in the CSMPs), thus avoiding being affected by any segmentation imperfections. Under theoretically ideal conditions (noiseless curvatures, zero CSMP shape diameter) these two ways of constructing the spectrum would yield identical results, but in practice the former is preferable. Since the shape index of planar points on an object surface is indeterminate, we have assigned a symbolic label to the planar points. For plotting and computational purposes, this label is arbitrarily assigned the value 2.0; therefore, the shape index values used in computing the shape spectrum of a view lie in the range [0, 1], together with the value 2.0.

The non-planar shape spectral plot of the vase (Vase2-1), shown in Figure 3.25(a), indicates that the main shape category present in this object is dome, along with a few smaller peaks in the ridge shape type and the saddle-ridge category. From Figure 3.25(b), it can be seen that the dominant shape category in the Big-Y-1 object is ridge (0.6875 ≤ S_I ≤ 0.8125).

Figure 3.25: Shape spectra of (a) Vase2-1 shown in Figure 1.3 and (b) Big-Y-1 shown in Figure 1.3.
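A direct implementation of this per-pixel histogram is sketched below. It is illustrative (the constant names are ours), following the conventions just stated: a 0.005 bin width, normalization by the visible object area, and a separate planar bin at the symbolic label 2.0.

```python
import numpy as np

PLANAR_LABEL = 2.0   # symbolic shape index assigned to planar pixels
BIN_WIDTH = 0.005    # bin width used in the text

def shape_spectrum(si_values):
    """Shape spectrum of a view from its per-pixel shape index values:
    a histogram over [0, 1], normalized by the visible surface area
    (pixel count), plus the fraction of planar pixels kept separately."""
    si_values = np.asarray(si_values, dtype=float).ravel()
    planar = si_values == PLANAR_LABEL
    bins = np.arange(0.0, 1.0 + BIN_WIDTH, BIN_WIDTH)
    counts, _ = np.histogram(si_values[~planar], bins=bins)
    spectrum = counts / max(si_values.size, 1)
    planar_fraction = planar.mean() if si_values.size else 0.0
    return spectrum, planar_fraction
```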
A few concavities in the object are characterized by the nonzero bins below the 0.5 shape index level. It can be observed that, based on the shape spectral information alone, it is difficult to discriminate between objects that are purely polyhedral. However, the concept of the shape spectrum can be used effectively for categorizing objects into purely polyhedral and non-polyhedral: polyhedral object views exhibit a peak (maximum surface area) only at the planar shape index (which has been chosen to be 2.0 in our experiments). Since there is a huge body of techniques available for polyhedral object recognition, we do not emphasize polyhedral object recognition further. As the main focus of this thesis is the representation and recognition of free-form surfaces, we will mainly study the use of non-planar shape spectra for grouping object views and for fast matching, as discussed in Chapter 4. Whenever the shape spectral information utilizes the planar surface information explicitly, the reader will be alerted to this fact. Note that occlusion of an object in a view by other objects, or self-occlusion, results in some of the surface patches being only partially visible in the image; this affects the shape spectrum by influencing the surface area count (percent) stored at the various shape categories. We have devised a novel technique for comparing two shape spectra that can tolerate the presence of occlusion of an object to a certain extent; it will be presented in Chapter 4.

3.6 Summary

In this chapter, we have presented our new representation scheme for 3D objects, called COSMOS. It is a shape-based description of objects suitable for representing free-form surfaces without requiring complex 3D analytical modeling of objects. In the next chapter, we discuss how a model of a 3D free-form object can be constructed as a collection of COSMOS representations of its multiple views, and we also present results on clustering a large number of views of a 3D object into a small number of salient groups that can be used effectively for view matching during recognition. Chapter 5 presents a recognition strategy which consists of a multi-level matching mechanism that employs shape spectral analysis and features derived from the COSMOS representations of objects for fast and efficient object identification and pose estimation. It also presents experimental results on real range images of complex free-form objects.

Chapter 4

Object View Grouping and Model Database Organization

As discussed in Chapter 3, a multiple-view based description of a free-form object has been adopted in this thesis for recognition. A 3D rigid object can give rise to arbitrarily many different 2D appearances (views). For objects with free-form or sculpted surfaces, only a part of a surface will typically be visible from a single viewpoint, due to the object's curvedness. Variations in viewing directions and angles can result in very distinct views of the object, with more of the curved surface(s) either coming into view or disappearing from view. Thus, a sculpted object can give rise to infinitely many different views owing to its smoothly curved nature. In practice, only a finite number of such views can be stored. Therefore, an important issue is which and how many of these object views are actually necessary and useful for recognition. The problem we address in this chapter is as follows.
Given a set of views of a free-form 3D rigid object, how do we represent and organize these views in a meaningful and efficient manner?

4.1 Object-centered versus Viewer-centered Representations

Previous approaches to 3D object representation can be categorized as either viewpoint-independent (object-centered) or viewpoint-dependent (viewer-centered). A viewpoint-independent representation attaches a coordinate system to an object; all points or object features are specified with respect to this system. The description of the object thus remains canonical, independent of the vantage point. Although this approach has been favored by Marr and Nishihara [114] and others, it is difficult to derive an object-centered representation from an input image: a unique coordinate system needs to be identified from the input images first, which becomes difficult when the object has many natural axes. Practical implementation becomes a complicated task, as only 2D or 2.5D information is usually available from a single image, and perspective projection effects have to be corrected in the image before building the representation. Note that this approach is well suited to simple 3D objects that can be specified by analytic functions. A viewer-centered approach, on the other hand, describes an object relative to the viewer; as one does not have to compensate for the viewpoint, view representations can be easily computed from images. A major disadvantage is that a large number of views needs to be stored for each object, since different views of an object are in essence treated as containing distinct objects. However, representing an object with multiple views is quite useful in view-based matching, alleviating the need for expensive 3D model construction.

A viewpoint-independent scheme suffers more difficulty when representing an object with free-form surfaces. Such an object may have neither a complete geometric model nor a description in terms of analytic functions. As noted earlier, it may not be an assembly of simple surfaces like planes and cylinders, and the constituent surfaces can be of such a high degree that reliable surface segmentation from image data is difficult. Therefore, a practical solution is to build a multiple-view based representation of the object. However, a sculpted object gives rise to infinitely many different views owing to its curved nature, and in practice only a finite number of such views can be stored. Therefore, an important issue is: which and how many of these object views are actually necessary and useful for recognition? The number of views chosen to represent an object for quick and accurate identification depends entirely on the complexity of the object. This chapter addresses the following question specifically: how do we generate a representative and adequate grouping of the views such that a new object view can be indexed effectively and efficiently to one of the stored views in the database? We place emphasis on automatically obtaining the clusters of views without requiring segmentation of object surfaces. Such a set of view-clusters serves as a view-based representation of each object in the database, and efficient retrieval of a cluster of views provides a small set of plausible correct matches for further refined matching. The term view refers to a range image of an unoccluded object obtained from any arbitrary viewpoint.
For the purposes of this chapter, two views of an object are not considered distinct if they produce appearances of the object that merely differ from each other by a rotation about the view plane.

4.2 3D Object Model as a Collection of Representations of Multiple Views

When constructing a multiple-view based description of an object, we adopt an "approximate visibility technique" to restrict the set of possible viewpoints to a sphere of large but finite radius, centered around the object. The surface of the viewing sphere is tessellated in a quasi-regular fashion to provide a discrete set of points, each of which provides a viewpoint vector in the "approximate visibility" space. Range data of an object surface seen from each of the sampled view directions are obtained using the laser range scanner or from the CAD model or surface triangulation of the object. The collection of representations of these 2D views then constitutes a multi-view description of the object.

View clustering for a single object has been addressed by several researchers under the topics of characteristic views (CVs) [72, 34] and aspect graphs (AGs) [100]. A lot of research has been directed towards obtaining the aspect graph representation for different classes of objects [55, 54, 74, 75, 103, 126, 34, 145, 149, 148, 169, 56]. Although there exists an extensive body of work dealing with aspect graphs, a major difficulty that confounds the use and implementation of AGs for object recognition is that complicated objects can result in enormous and complex AGs. To derive aspect graphs of manageable sizes, appropriate heuristics need to be designed; the problem of computing the aspect graph of an arbitrary object still remains unsolved. Ikeuchi's practical approach [88] relied on detecting the planar and curved faces of an object using photometric stereo in order to form the aspects of the object containing topologically similar views. However, such an approach is difficult with a free-form object, since each viewpoint gives rise to a slightly different view of the object because of its smoothly curved nature, and it is also hard to define a single face on a sculpted object. A recent paper [140] organizes the model base hierarchically using parametric structural descriptions built from the CAD models of objects, where it is assumed that a complete 3D description of an object is available for its recognition.

4.3 View Sensitivity of the Shape Spectrum

As shown in Chapter 3, with the COSMOS representation scheme, an object's shape and surface area can be characterized quantitatively in terms of its shape spectrum. The shape spectrum of an object derived from complete 3D surface data of the object is viewpoint-independent; however, the spectrum derived from a single range image of the object is view-dependent. The view sensitivity of this high-level feature is exploited for object view grouping, as shown in Section 4.4.1. Figure 4.1 shows a set of object views to demonstrate how the shape spectra of views of various objects differ, and how spectra computed from range images obtained by observing an object at nearby viewpoints are similar to one another. Recollect that, as explained in Section 3.5.4, we construct the shape spectrum of a view directly using the original shape index values computed for each pixel in its image (thus avoiding segmentation of the object into parts).
Figure 4.1(b) shows the non-planar shape histogram of a view of Vase2; it indicates that the main shape category present in this view is dome (S_I = 0.875), along with a few smaller peaks in the ridge (0.75) shape type and the saddle-ridge (0.625) category. Figures 4.1(d) and 4.1(e) show the strong similarities between the spectral plots of two different views of Cobra. These plots also indicate the predominance of rut (0.25), ridge (0.75), and trough (0.125) shapes in Cobra. Since the spectra of purely polyhedral objects exhibit a single peak at the shape index value of 2.0 (all the planar patches contribute to this bin), it is difficult to discriminate between various views of these objects. However, shape spectrum based classification can be used to categorize object views in a database into two classes: those that are purely planar and those that contain non-planar shapes on the object surfaces.

Figure 4.1: Shape spectrum: (a) range image of Vase2; (b) shape spectrum of Vase2; (c) a view of the cobra head, Cobra-1; (d) shape spectrum of Cobra-1; (e) another view, Cobra-2; (f) shape spectrum of Cobra-2.

4.4 Organizing Object Views

We now describe how the non-planar shape spectra of object views (spectra computed without taking the planar points on the surfaces into account) can be used efficiently for (a) view grouping and (b) view matching. Note that when the model database is populated during its construction, the object identities of the views to be stored in the database are known. We first investigate whether multiple views of the same object can be clustered into meaningful groups based on their shape spectra. We have chosen to perform clustering instead of supervised classification in order to find out whether there is any inherent clustering tendency present among the training set views. Secondly, the object view grouping can be repeated with each set of object views, and the model database can thus be structured into a collection of distinct groups of views of each object. For example, if n_i object views are originally used to represent an object i, the grouping scheme may result in m_i groups of views, where m_i << n_i. We propose to determine the matching efficiency and accuracy by hierarchically comparing an input view with the view cluster representatives first, followed by matching it with the views within the clusters themselves. Our primary concern is to structure a large database of object views in order to avoid matching the input view with all the stored views before the object identity can be ascertained, and to narrow down the possible set of views that need to be matched more comprehensively, as described in Chapter 5.

4.4.1 Feature Representation and Similarity between Shape Spectra

A group of object views organized on the basis of "similarity" of shape spectral features would contain views that exhibit the characteristics of the same set of visible surfaces of the object [46]. Note that views that can be obtained by rotations about the viewing direction are likely to possess similar shape spectral features and are hence grouped together. No surface segmentation or edge detection is required with this approach. We have proposed a feature representation that emphasizes the spread characteristics of the spectral distribution.
Our feature vector representation R of a view is based on the first ten moments of the normalized (with respect to the visible object surface area) shape spectral distribution H(h) of an object view. By normalizing the spectrum with respect to the total object area, we remove the scale (size) differences that may exist between different objects. The first moment is the mean, which gives the average concentration of the shape index values of the surfaces visible in the view; the second feature is the variance, the third is the skewness, and the fourth is the kurtosis (peakedness around the mean) of the shape index distribution. Features five through ten are the central moments of the shape spectrum of orders five through ten, respectively. These features are best understood if we observe the likeness between the shape spectrum of an object view and the probability density function of a random variable [60]. The first moment is computed as the weighted mean of H(h):

  m_1 = \sum_h h \, H(h).    (4.1)

The other moments m_p, 2 ≤ p ≤ 10, are computed as

  m_p = \sum_h (h - m_1)^p \, H(h).    (4.2)

The feature vector is then denoted R = (m_1, m_2, ..., m_10). Note that the range of each of these moments is [-1, 1].

Let O = {O^1, O^2, ..., O^n} be a collection of n 3D objects whose views are present in the model database MD. The jth view of the ith object O^i in the database is represented by (L_j^i, R_j^i), where L_j^i is the object label and R_j^i is the shape spectral moment vector. Given a set of object representations R^i = {(L_1^i, R_1^i), ..., (L_m^i, R_m^i)} that describe m views of the ith object, the goal is to derive a partition of the views, P^i = {C_1^i, C_2^i, ..., C_{c_i}^i}. Each cluster in P^i contains views that have been judged similar based on the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between R_j^i and R_k^i is defined as

  D(R_j^i, R_k^i) = \sum_{l=1}^{10} (R_{jl}^i - R_{kl}^i)^2.    (4.3)

4.4.2 Object View Grouping

In order to provide a meaningful categorization of the views of an object O^i, the views are clustered based on their dissimilarities D(R_j^i, R_k^i) using a hierarchical clustering scheme such as the complete-link algorithm [92]. The partition P^i is obtained by splitting the hierarchical grouping of O^i at a specific level of dissimilarity in the dendrogram. The split level is chosen at a dissimilarity value of 0.1 or less, to result in a set of compact and well-separated clusters. If the number of resultant clusters is pre-specified as a design criterion, then the cut level can be selected automatically.

Once the partition P^i is determined from the training views of O^i, the database MD is organized into a two-level structure, MD = {P^1, ..., P^n}, where each P^i is itself a set of view clusters. A summary representation for each view cluster C_j^i, such as the centroid of the view cluster, is abstracted from the moment vectors of its constituent views. Given an input view, its object label and best matching view are identified quickly and accurately in two stages: (i) the object identity is established by first comparing the moment vector of the input view with the cluster summary representations and selecting the best matched cluster; (ii) comparison of the input view with the moment vectors of the views in the best matched cluster determines the view that matches most closely with the input.
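Equations (4.1)–(4.3) translate directly into code. The sketch below is our own illustration, assuming the spectrum arrives as a normalized histogram whose bins span [0, 1] and letting bin centers stand in for the shape index values h.

```python
import numpy as np

def moment_vector(H, h=None):
    """Feature vector R = (m1, ..., m10) of a normalized shape spectrum
    H(h), following equations (4.1) and (4.2)."""
    H = np.asarray(H, dtype=float)
    if h is None:
        h = np.linspace(0.0, 1.0, len(H))     # bin centers (an assumption)
    H = H / max(H.sum(), 1e-12)               # guard: enforce unit total mass
    m1 = np.sum(h * H)                        # eq. (4.1): weighted mean
    return np.array([m1] + [np.sum((h - m1) ** p * H) for p in range(2, 11)])

def dissimilarity(R_j, R_k):
    """Eq. (4.3): squared Euclidean distance between moment vectors."""
    return float(np.sum((np.asarray(R_j) - np.asarray(R_k)) ** 2))
```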
The range images from 320 possible vieWpoints (deter— mined by the tessellation of the view-sphere using the icosahedron) of the objects were synthesized either from the CAD models when available or from the hand-constructed object triangulations. Figure 4.2 shows a subset of the collection of views of Cobra used in the experiment. We computed the shape spectrum of each view and then determined its feature vector R. We clustered the views of each object based on the dissimilarity measure ’D between their moment vectors using the complete—link hier- archical clustering scheme [92]. The hierarchical grouping obtained with 320 views of the Cobra object is shown in Figure 4.3. The view grouping hierarchies of the other objects are similar to the dendrogram in Figure 4.3. These clusterings demonstrate that the views of each object fall into several distinguishable clusters. The hierarchi- cal grouping obtained from each object was then cut at a dissimilarity level of 0.1 or less to result in compact view clusters. The centroid of each of these clusters was determined by computing the mean of the moment vectors of the views falling into the cluster. Figure 4.4 shows the visualization of the centroids of the view clusters of Cobra (obtained by splitting its hierarchy at a dissimilarity level of 0.05) using 120 Figure 4.2: A subset of views of Cobra chosen from a set of 320 views. Chernoff faces [39] where the ten moment features were depicted using the area of the face, shape of the face, length of nose, location of the mouth, curve of the smile, width of the mouth, location, separation, angle and shape of the eyes, respectively. Figure 4.5 shows some of the clusters obtained with 320 views of Cobra and the views contained in them. Observe that the shape spectra of these views do not change with a rotation of the views about a single axis and this leads to a more concise method for grouping multiple views. 121 0.25 I 0.20 l 0.15 1 0.10 1 0.05 I,1. Ifo 11,; ['qu I]! [H] I": H [llhllll] [[l[[ [[‘I .[]l , [l[1][[[‘ Figure 4.3: Hierarchical grouping of 320 views of Cobra. 4.5.1 Matching Accuracy: Resubstitution The goal of our experiment is to examine how view grouping facilitates matching in terms of classification accuracy and the number of matches necessary for correct classification of views. View Groups of a Single Object For this experiment, the database was assumed to contain view groups of only a single object at a time. Each training pattern (feature vector of a view) from the object class was matched with each of the cluster centroids in the database, and the best matched cluster was identified. Then the input pattern was matched with all the constituent patterns in the cluster and once again, the stored pattern giving rise to the minimum distance with the input pattern was identified. Since the viewpoint of each input view was known, the viewpoint of the best matched pattern was compared with the actual object vieWpoint to compute the view classification error. The input 122 6 9 Figure 4.4: Visualization of the centroids of eleven view clusters of 320 views of Cobra using Chernoff faces. view was matched with the views belonging to only the best matched cluster; this is the “top-one—cluster” strategy. Table 4.1 illustrates the correct classification rate obtained when tested with 320 input views. It can be seen that a view classification accuracy of at least 90% can be obtained even by matching views from the best matched cluster alone. 
4.5.1 Matching Accuracy: Resubstitution

The goal of our experiment is to examine how view grouping facilitates matching, in terms of classification accuracy and the number of matches necessary for correct classification of views.

View Groups of a Single Object

For this experiment, the database was assumed to contain view groups of only a single object at a time. Each training pattern (the feature vector of a view) from the object class was matched with each of the cluster centroids in the database, and the best matched cluster was identified. The input pattern was then matched with all the constituent patterns in the cluster, and once again the stored pattern giving rise to the minimum distance from the input pattern was identified. Since the viewpoint of each input view was known, the viewpoint of the best matched pattern was compared with the actual object viewpoint to compute the view classification error. The input view was matched with the views belonging to only the best matched cluster; this is the "top-one-cluster" strategy. Table 4.1 lists the correct classification rates obtained when testing with 320 input views. It can be seen that a view classification accuracy of at least 90% can be obtained even by matching views from the best matched cluster alone.

Table 4.1: View classification accuracy with view groups of a single object.

  Object         No. of clusters   Correct view classification (%)
                                   Top 1 cluster   Top 2 clusters
  Vase2          12                96.9            100
  Vase1          10                92.8            100
  Big-Y           9                93.4            100
  Two-mag-cyl    10                91.9            100
  Long-mag-cyl   16                91.3            99.7
  Cobra          11                94.1            100
  Cup            10                96.6            100
  Apc            10                91.9            99.4
  Jeep           11                90.0            100
  Truck          11                91.3            100

We then modified our matching strategy to allow the input pattern to match all the patterns falling into the two best clusters, i.e., the clusters giving the smallest and the next smallest distances among all view groups. We refer to this as the "top-two-clusters" strategy. This improved the view classification accuracy to 100% in eight object classes. These results show that the object views are grouped into compact and homogeneous view clusters, demonstrating the discriminatory power of the shape spectrum based feature representation.

View Groups of Several Objects

We now consider a database containing views of different objects, organized into a two-tiered structure: the first level containing all the view groups obtained from clustering the views of each object individually, and the second level consisting of the views themselves in these clusters. We verified the accuracy of view matching in the presence of groups of views arising from various objects in the database. We collected together the view groups of, for example, four objects obtained with a training set of 320 views per object, resulting in a total of 48 groups. For each view group, the centroid-based representative pattern was determined. Each input pattern was first matched with all 48 centroid patterns, and the group with the least distance from the input was chosen for further examination to identify the final match. We repeated this experiment, increasing the number of object classes in the database to five, eight, and finally ten.

Figure 4.6 presents the view classification accuracy obtained when tested with the training patterns themselves, with different numbers of objects in the database. The horizontal axis shows the number of best matched clusters (in terms of the dissimilarity between the input pattern and the centroid patterns) that were examined for the final match, for a given level of accuracy. As shown in Figure 4.6, as the number of objects in the database increased, 14 best-match clusters had to be examined to obtain 100% accuracy with correct view identification. We observe that even when tested with a large number (3,200) of object views, an accuracy of 99% can be achieved by choosing only the top 8 clusters. Only 20% of the views in this large database were examined on average, even when allowing the top 14 clusters and their constituent patterns to match with the input in order to achieve 100% correct object identification and view determination accuracy. As the number of object classes increases, the percentage of comparisons performed decreases, confirming the efficiency of matching with a structured database of views.

In the resubstitution mode, an error in the classification of an input view arises only from the view being placed in the wrong object cluster in the first place. This further indicates that the simple centroid-based generalization that we adopted is a reasonable scheme, and that the clusters are tight enough that once a view falls into a cluster, it very rarely matches a wrong pattern within the cluster.

Figure 4.6: View classification accuracy vs. number of clusters examined in the database.
"8-objects-with-2560-patterns" ~G--- 90 h "lO-objects-with-BZOO-patterns" "at ------- ‘ S9 V 80 _ - S 5 it c: 70 " - .2 ‘8' .‘E g 60 - _ U 50 - . x’ 40 l 1 ‘ 1 I 1 O 2 4 6 8 10 12 14 Number of clusters examined Figure 4.6: View classification accuracy vs. number of clusters examined in the database. indicates that the simple centroid-based generalization that we adopted is a reasonable scheme and the clusters are tight enough that after a view falls into a cluster, the view very rarely matches with a wrong pattern within the cluster. 4.5.2 Matching Accuracy: Testing Phase We trained the view grouping system with 3,200 training views (320 views per ob— ject) and tested it with 1,000 independent test views (100 per object). At the top level in the database, there were 110 cluster centroids to match with the input test pattern. We studied the number of clusters that had to be examined in order to at- tain several levels of misclassification rates. The computational steps are summarized in Figure 4.7. Figure 4.8 provides several interesting observations. First, for eight complex objects (e.g., Cobra) the correct object identity of 97% or more of their test views can be determined within the top 10 best cluster matches. With the other two objects, 20 best matched clusters are needed to be examined before the correct 126 Input range image of an object view 1 Compute and smooth principal curvatures at each object pixel I Compute shape index at each object pixel in the image ....... l Database of 3D free-form objects Object view clusters Comp“tc Shape Spectrum of the View . . . Compute moments Indrvrdual v1ews from the shape spectrum \Eatchijg/ Identify the tap m clusters that best match the input moment vector 1 Identify the top in views that best match the input moment vector from the selected clusters Best matched model views Figure 4.7: Model view selection with the view-grouping and matching system. object identity of an input can be obtained; this is due to the fact that these objects contain mostly surfaces of predominantly cylindrical ridge shape. Twenty one best matched clusters were needed to be examined for 100% correct classification of the test views when the percentage planar surface area of the object was incorporated as an eleventh feature (see Section 4.5.5). Even in the worst case, allowing the best 20 out of 110 clusters to be examined, only about 23.5% of the 3,200 view compar- isons were performed. In addition, on the average, across all the ten object classes, only 15% of the database was examined to identify the correct object identity of an unknown input view. Each test pattern took about 20 ms to be correctly classified on a SPARCstation 20. This demonstrates the efficiency of the view grouping and matching system even with simple centroid-based cluster summaries. With more so- phisticated methods of generalizing the cluster patterns, further reduction of these matching costs can be expected. 127 100 I I I I l ' I l 1 X "Vase2" +- 90 b "Vasel" -+--~ - "Big-Y" . G- . - - 80 - "Long-mag-cyl" Wx "TWO'mag'Cyl" -&.-.. -4 0 "Cobra" -*-—- £030 70 - "Cup" ..o .. . .1 C "Jeep" .-+.. ‘ § 60 " "Truck" --a---- £ "Ape" - E H 8 50 - _ 8 SE 3 40 - _ .53 8 g 30 r _ 20 —- _ 10 - _ 0 0 2 4 6 8 10 12 14 16 18 20 Number of clusters examined Figure 4.8: Misclassification vs. number of clusters examined. 
4.5.3 Testing with 6,400 Object Views

In order to study the performance of the shape spectrum-based view grouping and matching system with a larger model database of views, we added views of ten additional free-form objects, resulting in a total of 6,400 training views in the database. We discuss below the results of some of our experiments conducted with this enlarged database. Figure 4.9 shows the twenty complex objects, each of which was modeled using 320 different views to populate the database. The polyhedral models of the ten added objects were collected from a public domain database on the Internet (http://www.eecs.wsu.edu/~flynn) and were used to generate the views from multiple viewpoints.

As before, we generated 100 random views of each of the twenty free-form objects in the database for testing. At the top level, the database contained 229 view clusters obtained by grouping the views of all the objects. The second level contained the training views themselves. Each of the 2,000 test views was used as a query view, and the number of best-matched clusters that needed to be examined in order to correctly identify the object class of the query view was noted.

Figure 4.9: Range images of objects generated from arbitrary viewing directions from twenty object models.

Table 4.2 summarizes the results of this experiment. It can be observed that despite the increase in the size of the database, only ten (about 4.4%) of the view clusters needed to be examined to obtain an accurate classification of 95% of the test views. Complex free-form objects such as Cobra, Vase2, Beethoven, Cow, Violin, and Venus required fewer clusters (15 or fewer) to be examined to obtain 100% correct classification of the test views drawn from their object categories. Model views from the vehicle category, such as Porsche and Camaro, were often retrieved as the best matched views for test views from either of these two classes. It can also be observed from the table that the number of best matched clusters that need to be examined for 100% object classification accuracy depends on the size of the database. These results further demonstrate that the shape spectrum-based moment vector for view representation can serve as a useful pruning primitive during matching with a model database containing many complex free-form objects. Only 20% of the database was matched for view classification, on the average over the 2,000 test views, even when the top 30 clusters were examined.

4.5.4 Model View Selection with Real Range Data

The shape spectrum based matching scheme was also tested on real range images of free-form objects obtained using the Technical Arts White scanner in our laboratory. A total of 100 range images of different free-form objects (10 views per object) were collected using the White scanner. The views were randomly separated into two categories: (i) a model database containing 50 views, with five views obtained from each of the ten different objects, and (ii) an independent test set containing the remaining 50 views of the objects (5 views per object). Figure 4.10 shows the range images of model views from each of these ten object classes and Figure 4.11 shows the fifty test views.
Table 4.2: Object matching accuracy with an independent test set of 2,000 views.

    Object class   Correct object classification (%) when the K
                   best-matched clusters were examined
                   K=1   K=2   K=5   K=10   K=15   K=20   K=25   K=30
    Vase2          30.0  43.0  79.0  98.0   100.0
    Vase1          27.0  51.0  84.0  97.0    97.0   97.0   97.0   97.0
    Big-Y           3.0   7.0  35.0  68.0    92.0  100.0
    Cobra          61.0  84.0 100.0
    Cup            31.0  56.0  90.0  98.0   100.0
    Apc            21.0  44.0  97.0 100.0
    Jeep           25.0  48.0  94.0 100.0
    Truck          17.0  48.0  88.0  98.0    98.0   99.0   99.0   99.0
    A1             25.0  51.0  86.0  99.0    99.0   99.0   99.0   99.0
    Beethoven      33.0  63.0  91.0  99.0   100.0
    Cow            29.0  62.0  96.0 100.0
    Dinosaur       23.0  46.0  70.0  92.0    98.0   99.0   99.0   99.0
    Porsche        16.0  44.0  75.0  98.0    99.0  100.0
    Shark          36.0  46.0  80.0  93.0    96.0   96.0   98.0   99.0
    Shoe           39.0  59.0  78.0  91.0    95.0  100.0
    Triceratops    23.0  41.0  77.0  97.0    99.0  100.0
    Venus          26.0  46.0  81.0 100.0
    Violin         66.0  95.0  99.0 100.0
    Camaro         21.0  26.0  66.0  97.0    99.0   99.0  100.0
    Mustang        19.0  23.0  47.0  77.0    85.0   91.0   96.0  100.0

Figure 4.10: Range images of 50 model views.

Figure 4.11: Range images of 50 test views.

The shape spectral moment representation was derived for each of the views. The model database was then structured into two levels: at the first level were ten view clusters, each containing at the second level five views from its object class. Each test view was first matched with the ten view clusters to rank the three best matched view clusters, and the views falling into these three view groups were then examined clusterwise to select a best matched view from each of the three clusters. This resulted in a view classification accuracy of 92%, with only 4 of the 50 test views failing to select even a single model view from their correct object classes among their top three matches. When five best matched clusters were examined to select a view from each of them, the accuracy increased to 98%, with only one test view incorrectly classified. The wrongly classified test view belonged to the Creamer class. The range image of the incorrectly classified view and the five candidate views that matched best are shown in Figure 4.12. The shapes of the surfaces visible in this view of Creamer were shared by views from other objects, leading to the incorrect classification.

The average number of view comparisons performed in retrieving the top five hypotheses that matched a test view was 35, which is smaller than the number of view comparisons required when the test view is matched linearly against all 50 model views. However, observe that for each correctly classified test view, there was only a single model view from its correct object class among its top five view hypotheses. More sample views in each object class in the model base are needed to increase the discrimination between the object shapes visible in the views.

We also repeated the matching experiment with a slight variation in the way the best matched views were chosen from the clusters selected at the first level. Each test view was used as a probe to select the five best matched view clusters as before. However, instead of choosing one best matched view from each of these clusters, the five best matched views among all the twenty-five views collected from the selected view clusters were determined.
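The two selection policies just described differ only in how the second-level views are pooled. A minimal sketch of both variants follows, under the same assumptions and naming conventions as the earlier sketch; ranked_clusters is assumed to be the list of (centroid, views) pairs already sorted by centroid distance.

    import numpy as np

    def per_cluster_best(query, ranked_clusters, k=5):
        """Variant 1: one best matched view from each of the k best clusters."""
        return [min(views, key=lambda v: np.linalg.norm(query - v[0]))
                for _, views in ranked_clusters[:k]]

    def pooled_top(query, ranked_clusters, k=5, m=5):
        """Variant 2: pool all views of the k best clusters (twenty-five views
        here) and keep the m globally closest, so that a single object class
        may contribute several hypotheses."""
        pool = [v for _, views in ranked_clusters[:k] for v in views]
        return sorted(pool, key=lambda v: np.linalg.norm(query - v[0]))[:m]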
Although this method incurs the additional computational cost of sorting the dissimilarity values of the twenty-five views to select the top five, it increases the possibility of more than one view from the correct model object class featuring among the best matched ones.

Figure 4.12: Incorrect model view selection: (a) test view; (b) top 5 view hypotheses generated by the model selection scheme.

Table 4.3 shows, for each object class, the number of test views that elicited at least one model view from their correct object class and the number of test views that did not find even a single model view from their correct object class. Three views out of the test set of 50 did not short-list even a single model view from their correct object classes among the 5 views selected for each of them, resulting in a model view selection accuracy of 94%. Moreover, thirty-seven of the correctly classified 47 test views retrieved two or more model views from their correct object classes.

Comparing the shape spectral moments-based representation of an object view with the approach proposed by Dudani et al. [51], where rotation- and scale-invariant moments are derived to identify aircraft types, brings forth an important observation: although both schemes are global representations of object views, the shape-based representation provides a richer discrimination between objects (e.g., disks and spheres) in terms of their surface shapes, which would not be possible with the scheme proposed by Dudani et al.

Table 4.3: Shape spectrum based selection of the five best matched model views among all the twenty-five views at the second level.

    Object class    No. of views           No. of views
                    correctly classified   wrongly classified
    Big-Y                  4                      1
    Cobra                  5                      0
    Creamer                3                      2
    Cup                    5                      0
    Giraffe                5                      0
    Phone                  5                      0
    Small-vase             5                      0
    Spoon                  5                      0
    Vase2                  5                      0
    Vase3                  5                      0

4.5.5 Shape Spectrum of Objects with Planar Surfaces

We also studied the performance of the view grouping and matching system using spectral information that included the amount (percentage) of planar surface area visible in a view of the object. The non-planar shape spectrum of a view was augmented by accumulating the percentage of planar points on the surface into a bin at the shape index value 2.0. Using this augmented spectrum, an eleventh feature was added to the 10 original moments derived from the non-planar shape spectrum to form the view feature vector. The additional feature described the percentage of surface area present in the planar shape category. We obtained comparable performance when the experiments described in Section 4.5.1 were repeated. Note that the objects in the database were mainly smooth, with planar surfaces present in only a very few views. Hence, our results did not change drastically.
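The augmented feature vector can be assembled as in the sketch below. The histogram resolution and the particular choice of moments are our simplifying assumptions for illustration, not the exact definitions used by the system.

    import numpy as np

    def view_feature_vector(shape_index, planar_mask, n_moments=10):
        """Assemble the 11-dimensional view feature vector: moments of the
        non-planar shape spectrum plus the percentage of planar area.

        shape_index: 1D array of shape index values at the non-planar pixels.
        planar_mask: boolean array over all object pixels (True = planar).
        """
        # Eleventh feature: percentage of visible surface area that is planar.
        planar_pct = 100.0 * planar_mask.sum() / planar_mask.size
        # Non-planar shape spectrum: area-normalized histogram over [-1, 1].
        hist, edges = np.histogram(shape_index, bins=100, range=(-1.0, 1.0))
        spectrum = hist / max(hist.sum(), 1)
        centers = 0.5 * (edges[:-1] + edges[1:])
        # Ten base features: the mean and higher-order central moments of the
        # spectral distribution (one simple choice of moment set).
        mean = float(np.sum(centers * spectrum))
        moments = [mean] + [float(np.sum(((centers - mean) ** k) * spectrum))
                            for k in range(2, n_moments + 1)]
        return np.array(moments + [planar_pct])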
4.6 Summary

We addressed the problem of constructing view clusters of free-form objects. By exploiting view grouping in model databases, a small number of plausible correct matches can be quickly retrieved for more refined matching. We have proposed a novel shape-spectral feature based scheme for grouping views that obviates object segmentation into parts and edge detection. These features allow object views to be grouped meaningfully in terms of the shape categories of the visible surfaces and their surface areas. The proposed approach is general and relatively easy to use. We demonstrated that in a database containing 110 view clusters of 10 different objects, only the top 20 best-matched clusters need to be examined for 100% recognition accuracy. Only 23.5% of the database was examined in the worst case for a 100% classification accuracy of test views. Even with a larger database containing 6,400 views of 20 different objects, only 20% of the database was examined, on the average over 2,000 independent test views, for correct classification. We also demonstrated the effectiveness of the scheme on real range images of views of free-form objects.

Chapter 5

Multi-level Matching for Free-Form Object Recognition

This chapter focuses on utilizing the COSMOS representation scheme for recognizing a given object view and estimating its position and orientation with respect to the views stored in the model database. Given a database consisting of different objects and their views obtained from many different viewpoints, our goal is to recognize and estimate the pose of an object view using its range image. The difficulties that make this task challenging are as follows: (i) the recognition procedure must cope efficiently with our view-based representation scheme, and (ii) object identification and pose estimation must be achieved accurately and quickly. We address the following specific issues in this chapter: (i) What sort of indexing (model view selection) mechanism should be used for identifying a subset of candidate object views that can then be matched in detail with the input view? (ii) What recognition strategy should be used to match the COSMOS representation derived from an input view of an object with the selected model view representations?

There are a number of processing steps involved in a COSMOS-based 3D object recognition system, and they can be divided into two stages: model construction and recognition. During the model construction stage, the following steps are carried out: (i) acquisition of dense depth data of an object from multiple viewpoints, (ii) segmentation of the range images of the object into maximal patches of constant shape index, (iii) construction of the COSMOS representation using the CSMPs detected in each range image, and (iv) building a collection of representations of multiple views of the object to be stored as its model.

During the recognition stage, a range image of an unidentified object view is presented to the system. Given the sensed data, the following actions need to be taken: (i) determine the identity of the unknown object view from its COSMOS representation, and (ii) estimate the pose of the recognized object view. Note that the models of objects can be constructed and stored a priori, and the recognition can then be performed on-line. The performance of the recognition stage, in terms of its accuracy and speed, essentially determines the usefulness of the system in real world situations. Figure 5.1 shows the various modules that comprise our recognition system.

Figure 5.1: Overview of our 3D object recognition system. (Model database construction: range images of multiple views of objects; COSMOS representations of object views; view clustering and database organization. Recognition and pose estimation: range image of an object view; COSMOS representation of the view; multi-level matching; candidate view(s); verification and pose estimation; object identity and pose.)
During recognition, as the model database is updated with an increasing number of model objects, the computational time required to establish the identity of a given input grows prohibitively high. With free-form objects especially, this computational cost can be crucial, as the sculpted objects themselves may use complex representations and hence require large computational resources even to match a single pair of object representations. A preferred solution to building a robust and fast 3D object recognition system, therefore, is to derive strategies to efficiently prune the model database (i.e., indexing or model selection) and to design methods that are general and that reduce the search while matching the candidate object representations with the input in a detailed manner. In this spirit, we propose a multi-level matching strategy that employs shape spectral analysis and features derived from the COSMOS representations of objects (including patches of constant shape index, mean curvedness, orientation, etc.) for fast and accurate recognition of arbitrarily curved 3D objects using range data.

5.1 COSMOS-Based Free-Form Object Recognition

The proposed multi-level matching scheme [48] makes use of the components of the COSMOS representation to prune the set of possible matches to the input view, to establish the input-model view feature correspondences, and to estimate the pose of the object in the input view. The terms "input view" and "scene view" will be used interchangeably in this chapter, as will the terms "model views" and "stored object views." The input scene is assumed to contain an uncluttered view of an object (allowing self-occlusion).

In the first level, objects are matched efficiently on the basis of shape spectral information (see Section 5.2). The shape spectrum is an easily computable feature of an object view. The comparison of feature vectors containing the moments derived from the view spectra is also based on a simple measure, allowing it to be used for rapid pruning of a model database of object views to obtain a small set of candidate views. As demonstrated in Chapter 4, shape spectra of object views can also be used to sort a large model base into structurally homogeneous subsets, leading to a meaningful organization of the database.

In the second level of matching, we may encounter two kinds of scenarios, as depicted in Figure 5.2: (i) the input scene contains multiple overlapping objects, or (ii) the input scene contains unoccluded objects. Our current study is restricted to unoccluded object views only.

Figure 5.2: 3D object recognition and pose estimation. (Occluded scenes permit shape spectral analysis only, with accurate pose estimation unavailable; unoccluded scenes permit shape spectral analysis and COSMOS-based matching and view verification, with accurate pose estimation available.)

When multiple nonoverlapping objects are present in the scene, our recognition system can encounter two kinds of situations. In the first, the database consists of object views generated either from CAD models of objects or from surface triangulations of objects, as typically encountered in industrial applications. In that event, the potential matches selected from the database are tagged with their identity as well as their 3D pose. The recognition and pose estimation problem then becomes one of verifying which of the views in the selected subset is most similar to the sensed object.
Flynn and Jain [68] propose a simple verification scheme that generates synthetic range images of the hypothesized objects at the hypothesized poses and performs a pixel-by-pixel comparison of these data with the input view to establish the correct identity and pose of the sensed object. In the second situation, the database consists of object views acquired using a laser range scanner without a control device having six degrees of freedom of motion. The 3D positions and orientations of the stored views are therefore unknown and need to be estimated from the data during matching. In our system, we exploit the COSMOS representations of object views to determine the image-model feature correspondences using a combination of search methods, and thus identify and localize the object in the input view.

5.2 Shape Spectrum-Based Model View Selection

A shape spectrum based hierarchical organization of the model database is exploited in our recognition system to efficiently match a given input view with the stored representations in the database. Let O = {O_1, O_2, ..., O_n} be a collection of n 3D objects whose views are present in the model database, MD. The jth view of the ith object, O_i, is stored in the database with its tag (L^i_j, PO^i_j), where L^i_j is the object label and PO^i_j is its pose (position and orientation) vector, if already known. The feature vector representation R^i_j of a view is based on the first ten moments of its shape spectral distribution, which has been normalized with respect to the visible object surface area [47].

The database MD is organized into the two-tiered structure shown in Figure 5.3, MD = {P^1, ..., P^n}, where each P^i is itself a set of view clusters, {C^i_1, C^i_2, ..., C^i_{k_i}}. Each cluster in P^i contains views that have been adjudged similar based on the dissimilarity between the corresponding moment features of the shape spectra of the views belonging to the ith object. A summary representation for each view cluster C^i_l is abstracted from the moment vectors of its constituent views by computing the centroid of the view cluster. Given an input view, a small set of object views that are most similar to the input is determined quickly and accurately in two stages: first, compare the moment vector of the input view with all the cluster summary representations and select the K best matched clusters; then, match it with the moment vectors of the views in these best matched clusters and select the top m closest views. This spectrum based first-level matching step results in a set of probable model object views that are similar to the input in terms of visible surface shapes.

Figure 5.3: Shape spectrum based two-tiered organization of a model database. (Top level: view clusters of the database of 3D free-form objects; second level: individual views with their tags (L^i_j, PO^i_j).)
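In a minimal sketch, the tagged views and the two-tiered structure translate into data structures such as the following; this is our Python rendering under our own naming, not the implementation described in the text.

    from dataclasses import dataclass, field
    from typing import List, Optional
    import numpy as np

    @dataclass
    class ModelView:
        moments: np.ndarray                # first ten shape spectral moments
        label: str                         # object label L^i_j
        pose: Optional[np.ndarray] = None  # pose tag PO^i_j, if already known

    @dataclass
    class ViewCluster:
        views: List[ModelView] = field(default_factory=list)

        def centroid(self):
            # Summary representation: centroid of the member moment vectors.
            return np.mean([v.moments for v in self.views], axis=0)

    # MD is then a list of per-object collections of ViewClusters; selection
    # proceeds as described above: K nearest centroids, then the top m views.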
5.3 COSMOS-Based Refined View Matching

Having tackled the task of short-listing the model views to a few promising candidate views, we now address the problem of exploiting our COSMOS representation to compare views. The function of this view verification stage is to determine a correct object match among the selected model view hypotheses. Specifically, the objective of this section is to elucidate how, given the COSMOS representations of two object views, we may establish correspondences between the CSMPs in the views. The model view features and the scene features need to be related in a detailed manner to establish the identity and the pose of the sensed object accurately. The matching algorithm, however, must deal with noise, missing patches, and spurious information due to imperfect segmentation.

Since COSMOS provides a structural description of the surface patches in an object view (CSMPs arranged in a definite pattern of organization), scene-model feature matching is formulated as a search and optimization problem exploiting the region adjacency graph data structure (Section 3.2.2) that can be abstracted from the COSMOS representation of the view. A good and consistent correspondence between the two graphs has to be found, where goodness implies similarity of matched components, and consistency means that violation of connectivity relationships between matched components is not allowed, or is allowed only to a small degree. In essence, our solution is to merge some patches into "patch-groups" (which, along with their connectivity information, define a "patch-group graph"), construct a consistent correspondence between the two patch-group graphs, and compare the resulting graphs iteratively until they become isomorphic. The goal is to construct a solution that maximizes a given measure of goodness of matched patch-groups in the graphs, given the topological constraints imposed by the surface connectivity information, and to obtain the finest grain mapping possible between the patch-groups in the views.

5.3.1 Patch Grouping, Correspondences and Graph Isomorphism

We now introduce some concepts and terminology used in the matching algorithm. Most of these concepts translate directly into data structures in the implementation of the algorithm.

In a region adjacency graph, each vertex represents a patch (in COSMOS, a CSMP), and each edge represents the fact that the patches represented by the adjoining vertices are directly connected to each other. (We defined direct connectedness as 8-connectedness in the range images obtained using the Technical Arts White scanner in Section 3.2.2; this information is available in COSMOS through the surface connectivity list V.) In Figure 5.4(c), a CSMP is denoted by a circle, and a bidirectional arrow between two circles indicates the adjacency of the CSMPs. When a set of connected patches is grouped together, we call such a set a patch-group. We extend the notion of direct connectedness to patch-groups and say that two patch-groups 𝒫1 and 𝒫2 are directly connected if and only if there exists some patch P1 ∈ 𝒫1 and another patch P2 ∈ 𝒫2 such that P1 and P2 are directly connected. We then define a patch-group graph as a graph in which each vertex represents a patch-group and each edge represents direct connectedness between the two patch-groups denoted by the adjoining vertices. For example, in Figure 5.4(c), where the correspondence shown has been obtained by matching two views of Vase2, the ellipses denote patch-groups. We are interested only in patch-group graphs in which the patch-groups are disjoint, i.e., a given CSMP appears inside one and only one vertex of the graph.

We make use of groups rather than individual patches because there may be excess patches in the input (scene) view or in the model view, so CSMPs need to be combined in both the scene and the model views to identify a good equivalent on the other side. Thus, robustness to noise, missing data, and spurious information resulting from imperfect segmentation of object views is built into the design of the algorithm.
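The notions of direct connectedness and the patch-group graph defined above translate directly into code. A minimal sketch follows (our names), with the patch adjacency given as a dictionary from each patch to the set of patches directly connected to it, i.e., the information carried by the surface connectivity list V.

    def groups_directly_connected(group1, group2, adjacency):
        """Two patch-groups are directly connected iff some patch in group1
        is directly connected (8-connected in the range image) to some patch
        in group2."""
        return any(p2 in adjacency.get(p1, ())
                   for p1 in group1 for p2 in group2)

    def patch_group_graph(groups, adjacency):
        """Edges of the patch-group graph over a list of disjoint
        patch-groups, as unordered pairs of group indices."""
        edges = set()
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups_directly_connected(groups[i], groups[j], adjacency):
                    edges.add(frozenset((i, j)))
        return edges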
Figure 5.4: Correspondence between patch-group graphs: (a) View 1 of Vase2; (b) View 2 of Vase2; (c) correspondence established between the CSMPs in the views.

Given two object views, each of which has been partitioned into a patch-group graph, we need to uniquely associate a patch-group in one view with a patch-group in the other view. Such an ordered pair of CSMP groups is called a MappedPair structure in the COSMOS implementation; the two sides of a MappedPair may, of course, contain different numbers of patches. In Figure 5.4(c), a dashed line associates each patch-group established in one view with its matching patch-group in the other view. Given two objects decomposed into patch-group graphs with equal numbers of vertices, a bijective mapping between the vertex sets of the two graphs is called a correspondence. Thus a correspondence is a set of MappedPairs that fully covers both object views. In Figure 5.4(c), the correspondence shown between the views of Vase2 is given by

    {({P1_4}, {P2_6}), ({P1_2}, {P2_3}), ({P1_1, P1_9}, {P2_1, P2_2, P2_7}),
     ({P1_7, P1_3, P1_8}, {P2_4}), ({P1_5}, {P2_5}), ({P1_6}, {P2_8})}

where the P1_i denote CSMPs from the scene (View 1 of Vase2) and the P2_i denote patches from the model view (View 2 of Vase2).

There are many ways of constructing correspondences between two images, and we need to identify one that best detects any similarity between the two objects in the presence of noise, displacement, etc. Therefore, our search for a good match is conducted conceptually over the entire space of possible correspondences, and each correspondence is a candidate solution to the problem of image comparison. In practice, we use heuristics and examine only a small part of this space.

In general, a correspondence arbitrarily associates each patch-group in one image with a unique patch-group in the other image. However, we are primarily interested only in feasible correspondences, defined as those that satisfy patch-group graph isomorphism. To recapitulate the concept of isomorphism, we say that two patch-group graphs are isomorphic if we can establish a one-to-one mapping (i.e., a correspondence) from the patch-groups in one image to the patch-groups in the other image in such a way that adjacency is maintained (i.e., for every edge in one image, there is a unique corresponding edge in the other image whose adjoining pair of vertices has been mapped from the adjoining pair of vertices of the first edge). Thus, if a correspondence preserves the local connectivity information between the two graphs, it is called feasible.
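Feasibility can thus be checked mechanically: under the vertex mapping, the two edge sets must coincide. A minimal sketch (our names, with edges represented as unordered pairs of patch-group identifiers):

    def is_feasible(mapping, edges1, edges2):
        """A correspondence is feasible iff it is a patch-group graph
        isomorphism: every edge of graph 1 maps onto an edge of graph 2,
        and vice versa."""
        mapped = {frozenset(mapping[v] for v in e) for e in edges1}
        return mapped == set(edges2)

    # Toy usage on two three-vertex path graphs A-B-C and X-Y-Z:
    e1 = {frozenset("AB"), frozenset("BC")}
    e2 = {frozenset("XY"), frozenset("YZ")}
    assert is_feasible({"A": "X", "B": "Y", "C": "Z"}, e1, e2)      # adjacency kept
    assert not is_feasible({"A": "Y", "B": "X", "C": "Z"}, e1, e2)  # adjacency broken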
But since there is no guarantee in general that the region adjacency graphs of the two images even resem- ble each other, we may have to terminate our matching algorithm with a fairly coarse correspondence (possibly even with the trivial correspondence). Therefore, a measure of “fineness of grain” of a correspondence is a necessary component in any definition of the overall goodness measure of the correspondence, and coarse correspondences must be penalized implying that the images are probably of different objects. We have thus shown how connectivity information is used as a hard constraint in the sense that any correspondences that do not preserve patch-group graph isomor~ phism are promptly eliminated as being infeasible. The fineness of a correspondence, and other measures based on surface attributes 81, R and surface area act as soft constraints, i.e., we attempt to maximize those measures. However, the fineness of a correspondence is also treated in a way that is central to our matching algorithm, whose details are presented in the next section. The algorithm begins with the trivial 147 correspondence and then iteratively attempts to refine the correspondence to a finer grain by splitting patch-groups on each side and reassigning them creating new map- pings. Thus, the main loop of the algorithm tries to progressively increase the fineness of the correspondence while maintaining feasibility at all times and terminates when it is not possible to further refine the correspondence without losing isomorphism. Thus, only those model view hypotheses that are geometrically consistent with the input are retained and their goodness measures ranked to determine the object view that results in the best scene-model correspondence. 5.3.2 The Matching Algorithm We will present the matching algorithm that we employed using pseudo code. “//” denotes the beginning of a comment, which continues to the end of the line. The top level function, match, takes two image views represented using the COSMOS scheme and returns a complete feasible correspondence that is in some practical sense the “best” correspondence that could be established between the views: match (viewl, view2) // Start with the trivial correspondence current-correspondence 2 {({merge all CSMPs in viewl}, {merge all CSMPs in view2})} ; loop: // Refine current-correspondence { // Identify where to split current-correspondence: mp 2 select-MappedPair—from (current-correspondence); // Replace the selected portion with its refinement new-correspondence 2 current-correspondence — mp + refine (mp); if (quality(new-correspondence) > quality(current—correspondence)) current-correspondence 2 new-correspondence; } until all pairs in current-correspondence are fully refined; return current-correspondence. 148 Match simply keeps trying to refine each of its constituent MappedPairs until none of them is further refinable. A MappedPair is unrefinable when either side contains a single patch, or when connectivity constraints between neighboring patch- groups would be violated for every possible way of breaking up the MappedPair into smaller MappedPairs; this is more fully explained in the definition of the refine function below. Refine takes a MappedPair as its argument, examines each side of the MappedPair (i.e., the patch-group from view 1 that has been assigned to the patch-group from view 2 by the MappedPair), and tries to split both patch-groups in such a way that the resulting patch-group graphs are isomorphic. 
match simply keeps trying to refine each of its constituent MappedPairs until none of them is further refinable. A MappedPair is unrefinable when either side contains a single patch, or when the connectivity constraints between neighboring patch-groups would be violated by every possible way of breaking up the MappedPair into smaller MappedPairs; this is explained more fully in the definition of the refine function below. refine takes a MappedPair as its argument, examines each side of it (i.e., the patch-group from view 1 that has been assigned to the patch-group from view 2 by the MappedPair), and tries to split both patch-groups in such a way that the resulting patch-group graphs are isomorphic:

    refine (mp)
        left-patch-group = left (mp);
        right-patch-group = right (mp);
        loop:
        {
            lp = select-patch-from (left-patch-group)   // E.g., largest patch
            left-components = split (left-patch-group, {lp})
            for all rp in right-patch-group
                right-components = split (right-patch-group, {rp})
                if (# left-components = # right-components)
                    sub-corresp = correspond each left component
                                  to some right component (feasibly!)
                    record best sub-corresp
                for all rp2 = neighbor of rp
                    right-components = split (right-patch-group, {rp, rp2})
                    repeat as above
                for all rp3 ...
                    ... till some groupsize limit
            for all lp2 = neighbor of lp
                left-components = split (left-patch-group, {lp, lp2})
                repeat as above
                ... till some groupsize limit
        } until some maximum number of correspondences checked
        return best sub-corresp

To describe the effect of refine more precisely, we need to understand how a patch-group is split. A patch-group P is, by definition, a set of connected patches. Given a strict subset P1 of P, two or more graph components become defined on the underlying patch adjacency graph, namely the subgraph identified by the patches in P1 itself, and the components¹ in the rest of the graph (i.e., the subgraph obtained by deleting all vertices in P1 from P). Thus, selecting a subset P1 is equivalent to splitting patch-group P into some n smaller patch-groups including P1, based on the connectivity information between the patches. Further, examining the group connectivity (described in Section 5.3.1) between these new patch-groups establishes a unique patch-group graph for a given P1.

Having split left-patch-group into n_l components and right-patch-group into n_r components, we wish to establish a correspondence (sub-corresp) between the components, i.e., to create MappedPairs (P_l1, P_r1), etc., which will finally be returned as the replacement for the old MappedPair mp. Clearly, both left-patch-group and right-patch-group must be split in such a way that they have an equal number of component patch-groups; otherwise there would be no way to establish an isomorphism between the corresponding patch-group graphs. Further, having n_l = n_r is not enough; the connectivity relationships between the patch-groups must also be identical on both sides to satisfy isomorphism. Therefore, the refine function attempts to pair every component on the left with every component on the right and thus tries out all possible sub-correspondences. If a patch-group graph edge on one side cannot be associated with a corresponding edge on the other side, the sub-correspondence is rejected.

¹A component C of a graph G is a subgraph such that all vertices of C are connected to each other, and no vertex in G that is not in C is connected to any vertex in C.
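The splitting operation described above is a connected-components computation. A minimal sketch (our names; adjacency as in the earlier sketches):

    def split(patch_group, seed, adjacency):
        """Split patch_group by carving out the patches in `seed`: the result
        is `seed` itself plus the connected components (in the sense of the
        footnote above) of the remaining patches."""
        components = [set(seed)]
        unvisited = set(patch_group) - set(seed)
        while unvisited:
            stack = [unvisited.pop()]
            comp = set()
            while stack:
                p = stack.pop()
                comp.add(p)
                for q in adjacency.get(p, ()):
                    if q in unvisited:
                        unvisited.remove(q)
                        stack.append(q)
            components.append(comp)
        return components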
Therefore we typically pick the largest patch I p in left-patch—group and try to find a patch rp in right—patch-group that matches it most closely in size, shape index, and all the other quality-of—match criteria. Since there may not be a single patch rp that closely resembles lp, we need to try combining pairs of adjacent patches in right-patch-group and comparing with l p. Then we need to try all possible triplets of adjacent patches in right-patch-group as a possible match for lp, and so on. Since this quickly gets combinatorial, we impose a “groupsize limit.” Since it is possible for a group of patches from left-patch- group, rather than a single patch I p to form a better match with rp, we need to try out combinations on the left side too, once again up to some practical computational limit. While we have used simple thresholds, smarter heuristics are obviously possible. 5.3.3 Goodness Measure of a Correspondence The goodness measure (the quality function) evaluates the similarities in the shape index, mean curvedness, area and the neighbours’ attributes of matched components in a correspondence. The goodness measure of an entire correspondence is computed as the weighted average of the goodness values of the MappedPairs contained within the correspondence. The weight that multiplies a MappedPair is proportional to the combined area of the patches within the MappedPair; thus larger patch-groups have a greater say in establishing the goodness of the correspondence. quality (correspondence) 151 total-goodness 2 0; for mp in correspondence weight 2 sum of percentage-area of each patch in left(mp) + sum of percentage-area of each patch in right(mp); total-goodness 2 total-goodness + (weight X goodness(mp)); return total-goodness Note that total-goodness is normalized (by using percentage areas) to lie in the [0, 1] interval. A smarter version of this quality function has been employed that adds a factor to reflect the overall coarseness of the correspondence. The goodness function which measures the quality of an individual MappedPair is a product of the goodness measures of various features derived from the Mapped- Pair. We have used 0 the area-goodness: how similar is the total percentage-area of the left patch- group to the total percentage-area of the right patch-group? o the SI-goodness: how close is the mean shape index of the left patch—group to that of the right patch-group? e the mean curvedness-goodness: how close are the mean curvedness values? 0 neighbor—goodness: how close is the mean shape index of the neighbours adja- cent to the left patch-group to that of the neighbours of the right patch-group? This incorporates a measure of similarity between the neighbours adjacent to the left and right patch-groups. The quantitative definitions of the above criteria are both simple and intuitive. For example, area-goodness is computed as [areal — area2] 1 areal + area2 where areal is the total percentage-area of the left patch-group of mp, and area2 that of the right patch-group. Thus the area-goodness has the value 1 when both groups are of identical size, and goes to 0 as the areas diverge in size. Thus, an 152 overall matching score characterizing the similarity of the features of the matched patch-groups established in the correspondence between the two views is returned by the COSMOSbased matching scheme, along with the correspondence. 
5.3.4 Highlights of the Matching Algorithm

Our COSMOS-based matching algorithm determines the overall goodness measure of the similarity between the given views both locally (i.e., by examining the constituent patches in the views and their attributes) and globally (i.e., by using the neighbor-goodness). Local patch comparisons are difficult when the two views are of very different objects, since it is hard to establish a correspondence between the different graph components of the two images. Therefore, a very important requirement of a good matching algorithm is that it be robust, in the sense that it should display graceful degradation as the amount of relevant information in the scene data useful for matching decreases. An appealing feature of our COSMOS-based matching algorithm is that it is robust in this sense, since it always returns some feasible correspondence between the given views along with its goodness measure. The greater the dissimilarity between the two objects, the coarser the correspondence returned by COSMOS; in the worst case, the trivial correspondence is returned. In other words, an intuitively meaningful comparison is performed in all cases, and the goodness measure reflects this. Since verification of consistency with respect to surface connectivity is embedded within the algorithm at all points of the computation, the structures manipulated (candidate solutions) are always feasible. An extension of this approach would be to temporarily permit inconsistencies that are cleaned out before the algorithm terminates.

Our scheme exploits a combination of bottom-up (merging patches) and top-down (splitting patch-group graphs) approaches. It thus combines their advantages, namely utilizing both local and global information simultaneously, so that a good correspondence can be determined.

Recognition schemes using relational graphs have been explored extensively in the literature, as described in Section 2.1.3. Although our matching scheme bears a resemblance to the hypergraph approach [174], where a grouping is imposed on the vertices of a graph extracted from an object view just as in our scheme, the motivating reasons behind the grouping are very different. In [174], each group of vertices (hypernode) corresponds to a face of a geometric primitive, and a collection of face graphs corresponds to a primitive block of an object, thus forming a hyperedge in the AHRs. Recognition proceeds by merging multiple AHRs obtained from different views of the object into a complete AHR, which is compared with the stored model AHRs by matching the subgraphs depicted by hyperedges. In our scheme, however, patch merging is carried out to offset segmentation errors and to obtain the best possible correspondence between the views.

The problem of finding an isomorphism between an arbitrary graph and a subgraph of another graph falls into the class of NP-complete problems. Since patch-group graph isomorphism is required by our matching algorithm, without heuristics it would incur this theoretical complexity. However, our matching algorithm uses heuristics at several places (as described in the context of the refine function) to limit the exploration of isomorphic correspondences, ultimately trading off perfection in establishing the finest grain correspondence possible in favor of efficiency. Because of these heuristics and groupsize limits, all steps of the algorithm are of polynomial complexity, with the sole exception of the step during refinement (immediately after the split step) where two sets of components are mapped to each other in establishing the graph isomorphism. The complexity of this step is exponential because all permutations of the components on one side are tried against the other side. Nevertheless, this step's complexity is not significant in practice, because the number of graph components generated by splitting out a patch or a group of adjacent patches is likely to be on the order of 2 to 5. Thus the effective performance of the algorithm is a (large) polynomial in the size (number of patches and edges) of the graphs.
Because of these heuristics and groupsize limits, all steps of the algorithm are of polynomial complexity, with the sole exception of the step during refinement (immediately after the split step) where two sets of components are mapped to each other in establish- ing the graph isomorphism. The complexity of this step is exponential because all permutations of the components on one side are tried against the other side. Nev- ertheless, this step’s complexity is not significant in practice because the number of graph components generated by splitting out a patch or group of adjacent patches is likely to be of the order of 2—5. Thus the effective performance of the algorithm is a (large) polynomial in the size (number of patches and edges) of the graphs. 154 5.3.5 Estimation of Object Pose Once we ascertain the stored model view in the database that best matches with the input view, we can estimate the rotational component of the pose of the object in the view by aligning the surface normals of corresponding CSMPS. We adopt the technique proposed by Flynn and Jain [66] for the estimation of the rotation of the model. Given a single pair of MappedPairs, ( 3,, If) and ( :2, 3), let their mean orientation vectors be so, am, 73.32, and am, respectively. Our goal is to find a 3 x 3 rotation matrix which, when applied to am, aligns it with a,,, and also aligns nm2 with 13,2 due to the rigid nature of the object. An alternate representation of this rotation matrix consists of an axis of rotation, r and an angle 0 which can be determined in the following manner [80]. Given two non parallel model normal vectors and their corresponding scene vectors, the rotation axis is given by 1‘ = (nml — 71.91) X (”m2 — n32). The angle of rotation, 0 is given by [Ir x W] 1 — (as, - am.) The range of values for the components of r and 0 are examined for each pair of scene-model MappedPairs to ensure that the rotation estimated is valid. If the rotation axis and the angle vary only by a small amount (less than 19 degrees in our experiments), we average these estimates to get an average axis f‘ and 9. Once a coarse estimate of the rotation is obtained, it can then be refined using an optimal range image registration algorithm [50]. The translational component of the object pose can directly be estimated using the view registration algorithm. Note that the correctness of the object identity of the input view as determined by COSMOS can be further confirmed by registering the input range image with the range image of the best—matched model View using our registration algorithm. 155 5.3.6 Experimental Results We demonstrate the performance of our COSMOS-based matching algorithm with sev- eral pairs of views obtained from diflerent objects. Figure 5.4 shows the CSMPs obtained from two different views of Vase2 and the correct correspondence deter- mined by the COSMOS-based matching algorithm between the patch-group graphs. Figure 5.5 shows the correspondences between the CSMPs detected in the two views of Phone. Observe that since our matching algorithm does not model symmetry ex- plicitly, the correspondence shown in Figure 5.5(c) inversely matches the symmetrical structures in the two views. In our second experiment, we tested the performance of our complete recogni- tion strategy using a model database containing 10 different object views obtained using a laser range scanner. Figure 5.6 shows these model views. A view of Cobra (Figure 5.7(a)) was independently obtained and used as a test view. 
5.3.6 Experimental Results

We demonstrate the performance of our COSMOS-based matching algorithm with several pairs of views obtained from different objects. Figure 5.4 shows the CSMPs obtained from two different views of Vase2 and the correct correspondence determined between the patch-group graphs by the COSMOS-based matching algorithm. Figure 5.5 shows the correspondences between the CSMPs detected in two views of Phone. Observe that since our matching algorithm does not model symmetry explicitly, the correspondence shown in Figure 5.5(c) inversely matches the symmetrical structures in the two views.

In our second experiment, we tested the performance of our complete recognition strategy using a model database containing 10 different object views obtained using a laser range scanner. Figure 5.6 shows these model views. A view of Cobra (Figure 5.7(a)) was independently obtained and used as a test view. The moments computed from its shape spectrum were compared with those in the model database to select the top five best matched model views for further verification. Figures 5.7(b)-(f) show the five model views selected on the basis of their low dissimilarity values, computed using the shape spectrum based matching technique. The COSMOS representation of the input test view of Cobra was then matched with the five model view hypotheses using our view verification algorithm. Figures 5.8(a) and (b) show the segmentations of the test view and of the stored model view with the highest matching score of 0.447. Observe that despite the differences in the segmentation results of these two views, our matching algorithm was able to successfully merge patches, overcoming the imperfections arising from segmentation, and to provide a structurally correct correspondence between the scene and model patches, as shown in Figure 5.8(c). Table 5.1 lists the matching scores obtained when the COSMOS representation of the test view was compared in detail with those of the five model view hypotheses. Notice that the matching score for the correct hypothesis is significantly higher than the matching scores for the incorrect hypotheses.

Figure 5.5: Correspondence between the views of Phone: (a) View 1; (b) View 2; (c) correspondence established between the CSMPs visible in the views.

Table 5.1: Matching scores of the five model hypotheses (shown in Figure 5.7) determined by the COSMOS-based view verification stage.

Figure 5.6: Range images of object views stored in the model database.

5.4 Performance of the Recognition System

Three primary components can be identified in our recognition strategy for handling 3D free-form rigid objects: (i) COSMOS as a representation scheme for free-form rigid objects; (ii) the concept of the shape spectrum, its use in establishing view clusters from a large number of views of a free-form object, and its potential use as a "fast" matching primitive when a large model database of object views is available; and (iii) a graph-based matching scheme for comparing the COSMOS representations of two object views, establishing the correct object interpretation via correspondences of surface primitives, and estimating the object pose.

Experimental results obtained by testing each of these components individually have been presented in Chapters 3, 4 and 5. In this section, we concentrate on testing the complete system in an integrated manner. For the discussion below, we categorize the objects used in our experiments into three kinds: (i) Type I, real objects that are available for obtaining view data using a laser range scanner and whose geometric models (in the form of CAD models or surface polygonal models) are also available for generating multiple views from arbitrary viewpoints; (ii) Type II, objects for which only geometric models are available (typically polygonal/CAD models obtained electronically from multiple ftp sites); and (iii) Type III, real objects that can be used for obtaining data with a laser scanner, but for which no geometric models are available for synthetic view generation.

Figure 5.7: Five model hypotheses (b)-(f) generated using shape spectral analysis for a test view (a) of Cobra.
Although there are many data resources available for collecting Type II objects, these models are available in various image formats, and one encounters the practical difficulty of implementing or searching for various filters to convert these formats into one uniform format that conforms with the local computing environment. With Type III objects, since our laser scanner is not equipped with six degree motion device, images of multiple views of objects can be obtained to a limited extent by rotating them about the Z axis and this renders the task of obtaining the ground truth information about the objects’ orientation harder. 5.4.1 The COSMOS Representation Scheme Currently, data belonging to Type III objects have been used to illustrate the var— ious components of the COSMOS representation scheme—shape index based surface primitive extraction, building the Gauss patch map, surface connectivity list and the support functions. A total of 21 object views obtained using 11 different real objects 159 Figure 5.8: COSMOS—based matching: (a) CSMPs on the test view; (b) CSMPs on the model view with the highest matching score; (c) scene—model patch correspondence established between the views of Cobra. have been used to demonstrate the strengths of the COSMOS representation. 5.4.2 View Grouping and Model View Selection Here, we have used a mixture of objects belonging to Type I and II categories to illustrate the strengths of shape spectral based matching, view grouping, and model database organization. We have used 20 different complex shaped objects, with 5 from Type I category and 15 from Type II category. We populated an object view database with 6,400 views (320 views/object), trained the view grouping system us- ing this database and tested the performance of our spectral matching scheme with 160 2,000 independent test views (100 views/object). We also demonstrated the good performance of the scheme on 100 real range images of arbitrarily curved (Type III) free—form objects. 5.4.3 Matching of Object Views using COSMOS In Section 5.3.6, the strengths of this matching module have been demonstrated with pairs of views of two complex free—from objects. In addition, the integrated recognition strategy involving shape spectral feature-based model selection and detailed COSMOS- based matching was illustrated using an image of Cobra as a test view on a database of multiple views of ten free—form objects (belonging to Type III category). The pose estimation results for views of Vase2, Phone and Cobra are shown in Section 5.4.4. The representation and recognition system was tested as a whole using 50 in- dependent test images on a database containing 50 model views obtained from ten different free-form objects. The range images of these 100 views were obtained using a laser range scanner (Section 4.5.4). Figure 4.10 shows the range images of the model views in the database. The identity and pose of the objects in the test views shown in Figure 4.11 was established as shown in Figure 5.9. For each of the 50 test views, its non-planar shape spectrum was computed and a moment vector was derived from the shape spectrum. The database was organized into two levels of hierarchy with the first level containing 10 view clusters correspond- ing to ten different object classes. The second level contained the fifty model views with five views in each cluster. 
Comparison of the moment vector of a single test view with those in the database using shape spectral matching yielded five model view hypotheses that matched the input moment vector most closely among all the views present within the selected clusters. Forty-seven of the test views retrieved at least one model view from their correct object classes among the selected five view hypotheses, resulting in a view classification accuracy of 94%.

The COSMOS representation was then computed for each of the fifty test views. During CSMP detection, the shape diameter was adaptively determined to derive a maximum of fifteen surface patches in each object view. The matching scheme presented in Section 5.3.1 was used to determine the best object view match among the five candidate view hypotheses short-listed by the shape spectral analysis for each test view. For each test view, the matching scores of the view correspondences returned by the patch-group graph matching algorithm were ranked to determine the view with the highest goodness measure among the five model hypotheses. The recognition system was able to identify 82% of the fifty test views correctly by returning a model view from the correct object class with the highest matching score. Of the nine test views matched incorrectly, three did not have any view from the correct object class present among the five hypotheses examined by the detailed matching scheme. The remaining six errors were mainly caused by errors in the surface connectivity information introduced by noisy small patches.

Observe that the correct object identity and pose of the input are determined in our recognition system by examining only the few best-matched view hypotheses returned by the shape spectral matching scheme. Hence, the system can fail to recognize the object in the input scene when only model views from incorrect object classes are presented as hypotheses to the COSMOS-based detailed matching stage by the spectrum-based pruning strategy. In addition, the current version of our matching algorithm does not tolerate any violation of the connectivity relationships between matched patch-groups, and, as observed in our experiments, noisy small patches can introduce serious errors in the adjacency relationships between the patches, thus affecting the recognition accuracy. Note also that in the current implementation of the recognition system, we have not incorporated a reject option to prevent matches with high shape spectral dissimilarity values from being examined in detail. Even among the best subset of view hypotheses determined using shape spectral features, it is possible to examine only a few hypotheses in detail in the COSMOS matching stage by enforcing a reject option through a threshold on the dissimilarity levels of the hypotheses. The algorithm can also be improved by allowing violations of connectivity to a small degree, depending on the strength of the adjacency as determined by the number of boundary pixels shared between a pair of patches.

5.4.4 Pose Estimation

In this section, we continue to use the views of Vase2, Phone and Cobra to illustrate the strength of our pose estimation technique. The rotational component of the test view of Vase2 (View 1, shown in Figure 5.4(a)) with respect to the model view (View 2, shown in Figure 5.4(b)) was estimated using the surface normals of corresponding patch-groups.
For each patch-group established in the correspondence, the average surface normal was computed as the mean of the surface normal vectors of the patches present in the group. A total of 10 pairs of MappedPairs was used to estimate the average rotation axis and the angle of rotation. These rotation parameters (r = (0.005288, -0.004433, 0.024552) and θ = 0.180429 radians) were used to compute the 3 x 3 rotation matrix [66], which was then used as an initial guess to register the model view (View 2) with the test view (View 1) of Vase2 using the registration technique presented in Chapter 6. We note here that the computational procedure described in Chapter 6 has been further augmented with a verification mechanism during its implementation [49]; the results presented in this section were derived using this augmented procedure. Figure 5.10 shows the iterative registration of the model view with the scene view. It can be seen that the views are in complete registration with one another at the end of seven iterations. Figure 5.11 shows the registration of the model view with the scene view of Phone through several iterations of the algorithm. The registration scheme converged with the lowest error value at the sixth iteration.

The initial transformation matrix (incorporating both the rotation and translation components of the pose) for aligning the best matched model view (View 2, shown in Figure 5.8(b)) with the test view of Cobra (View 1, shown in Figure 5.8(a)) was computed as

             [ 0.9838  -0.0044  -0.0008   0.0 ]
    T_init = [ 0.0044   0.9838  -0.0009   0.0 ]                (5.1)
             [ 0.0008   0.0009   0.9838   0.0 ]
             [ 0        0        0        1   ]

and after three iterations of the registration algorithm, the final transformation matrix was given by

              [ 0.8849  -0.4501  -0.0041   1.8645  ]
    T_final = [ 0.4499   0.8848  -0.0126   0.6214  ]           (5.2)
              [ 0.0094   0.0094   0.9927  -0.02292 ]
              [ 0        0        0        1       ]

Figure 5.12 shows the evolution of the registration of the model view with the scene view of Cobra. It can be seen that even with a coarse initial estimate of the rotation (see Figure 5.12(a)), the registration technique can align the two views successfully within a few iterations.
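For reference, the step from the averaged axis-angle estimate to the 4 x 4 initial transformation can be sketched with the standard Rodrigues construction (our code; reference [66] itself is not reproduced here):

    import numpy as np

    def initial_transform(axis, theta, translation=(0.0, 0.0, 0.0)):
        """Compose the homogeneous transform used as the registration seed:
        Rodrigues' rotation about the unit axis by theta, plus a translation."""
        k = np.asarray(axis, dtype=float)
        k = k / np.linalg.norm(k)
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = translation
        return T

    # E.g., the Vase2 estimate above:
    # initial_transform((0.005288, -0.004433, 0.024552), 0.180429)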
Figure 5.11 shows the registration of the model view with the scene view of Phone through several iterations of the algorithm. The registration scheme converged with the lowest error value at the sixth iteration.

The initial transformation matrix (incorporating both the rotation and translation components of the pose) for aligning the best matched model view (View 2, shown in Figure 5.8(b)) with the test view of Cobra (View 1, shown in Figure 5.8(a)) was computed as

$$T_{init} = \begin{bmatrix} 0.9838 & -0.0044 & -0.0008 & 0.0 \\ 0.0044 & 0.9838 & -0.0009 & 0.0 \\ 0.0008 & 0.0009 & 0.9838 & 0.0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad (5.1)$$

and after three iterations of the registration algorithm, the final transformation matrix was given by

$$T_{final} = \begin{bmatrix} 0.8849 & -0.4501 & -0.0041 & 1.8645 \\ 0.4499 & 0.8848 & -0.0126 & 0.6214 \\ 0.0094 & 0.0094 & 0.9927 & -0.02292 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \quad (5.2)$$

Figure 5.12 shows the evolution of the registration of the model view with the scene view of Cobra. It can be seen that even with a coarse initial estimate of the rotation (see Figure 5.12(a)), the registration technique can align the two views successfully within a few iterations.

Computational Time Requirements

Given a range image, estimation of the local curvatures at each surface point and smoothing of the curvatures using the curvature consistency algorithm take between 5 and 15 minutes on average for images of size 240 × 240 containing about 15,000 surface points. The CSMP extraction step, once the shape index values have been determined from the curvatures, requires computational time of the order of an hour. Obtaining the COSMOS representation once the CSMPs have been determined, together with the computation of the shape spectrum, takes a few seconds on average. Matching shape spectral features to identify a small set (10) of object views out of 3,200 views takes about 20 msec. View verification using the graph-matching scheme was carried out with only the fifteen largest (in surface area) patches in the views, and takes on the order of several minutes to compare a pair of views. Given a coarse but correct initial guess, the registration stage in our system takes about 30 seconds on average to register two range images of size 640 × 480 on a SPARCstation 10 with 32 MB RAM.

With our current two-stage matching strategy, comprising shape spectral moment comparison and COSMOS-based detailed matching, the system scales linearly with respect to the number of objects in the database. By adding more levels to the database hierarchy (as mentioned in Section 7.2.4), the cost of matching can be further reduced. Constructing such a search tree based on COSMOS features remains an interesting line of future research.

5.5 Summary

We addressed the problem of recognizing 3D rigid free-form objects using the COSMOS representation scheme in this chapter. We proposed a novel multi-level matching strategy that employs shape spectral analysis and features derived from the COSMOS representations of objects for fast and accurate recognition of sculpted objects. A small subset of candidate model views is isolated from the database using the shape spectrum based model selection scheme. During view hypothesis verification, we use the COSMOS representations of object views to determine the scene-model feature correspondences using a combination of search methods, and thus identify and localize the object in the input view. Experiments on a database containing 100 real range images of views of ten objects have demonstrated the encouraging performance of our COSMOS-based 3D object recognition system. Although we have assumed unoccluded views of the objects, we expect that small amounts of occlusion will be tolerated by the graph-based matching scheme in our recognition system as long as the salient surfaces that provide the characteristic shape information about the objects are visible.

Figure 5.9: Matching a test view in the COSMOS-based recognition system. (The flowchart proceeds: input range image; compute the shape index at each object pixel in the image; compute the shape spectrum of the view; compute moments from the shape spectrum; compute the COSMOS representation of the input view; identify the top m clusters that best match the input moment vector; identify the top k views that best match the input moment vector from the selected clusters; compute the COSMOS representations of the selected views; match the input COSMOS representation with the model view representations and rank the view correspondences; output the best matched view correspondence and a coarse pose estimate; refine the pose estimate by registering the input range image with the range image of the best matched view; output the object identity and 3D pose.)

Figure 5.10: Pose estimation: (a) model view registered with the test view of Vase2 at the end of the first iteration; (b) registered views after 3 iterations; (c) registered views after 4 iterations; (d) registered views after 5 iterations; (e) registered views at the convergence of the algorithm.

Figure 5.11: Pose estimation: (a) model view registered with the test view of Phone at the end of the first iteration; (b) registered views after 2 iterations; (c) registered views after 3 iterations; (d) registered views after 4 iterations; (e) registered views at the convergence of the algorithm.

Figure 5.12: Pose estimation: (a) model view registered with the test view of Cobra at the end of the first iteration; (b) registered views after the second iteration; (c) registered views at the convergence of the algorithm.

Chapter 6

Pose Estimation by Registering Object Views

This chapter presents a technique to estimate the pose of a free-form object once its identity has been ascertained using the COSMOS-based recognition scheme. Pose estimation is cast as a registration problem. We assume that the object in the input range image has been recognized as one of the objects stored in the model database. The sensed range image and the range image of the stored view are then registered, thereby estimating the transformation between them.
An initial estimate of the transformation between the sensed and the stored images is determined from the surface normals once the correspondences between the maximal patches visible in both views have been established using our COSMOS-based graph-matching scheme; it is then refined using the method presented in this chapter. If such an initial estimate is not available, our method can estimate one from the range data themselves. A more accurate pose is computed by refining the initial estimate using an iterative minimization scheme.

6.1 Robust Object View Registration

Our approach to pose estimation is to formulate it as the problem of obtaining an accurate registration between the input and the matched object views, thereby estimating the transformation between the range images of a free-form object even in the presence of noise or uncertainties in the surface data. A registration-based approach is especially suitable for free-form surfaces as it does not rely on the presence of any salient features. In addition, a registration-based approach can offset errors introduced into the initial estimate of the object pose by poor feature localization and noise-corrupted surface normals. However, view registration itself can be affected by noise in the sensed data to a certain extent. Most approaches to range image registration assume that the surface depth data are accurate and hence do not take into account the sensitivity of the estimated transformation to noise. In order to provide a view registration algorithm that is reliable in the presence of uncertainties in the surface depth measurements, we derive a minimum variance estimator (MVE) [50] for computing the transformation parameters from range data of views of an object.

Another important application of our registration technique is the automatic construction of object models from multiple range views. Automatic construction of object models involves three steps: (i) data acquisition, (ii) registration of different views, and (iii) integration. Data acquisition involves obtaining either intensity or depth data of an object from multiple views. Integration of multiple views depends on the representation chosen for the model description and also requires knowledge of the transformation relating the data obtained from the multiple views. The intermediate step, registration, is also known as the correspondence problem; its goal is to find the transformations that relate multiple views. In this chapter we focus on the issue of registering multiple range images of different views of an object.

In order to register object views accurately, we explicitly model the errors introduced into the range data by sensor inaccuracies. We have not seen any work reported to date that establishes the dependencies between the orientation of a surface, the noise in the sensed surface data, and the accuracy of surface normal estimation, or that shows how these dependencies can affect the estimation of the transformation parameters relating a pair of object views. We present a detailed analysis of this "orientation effect" with geometrical arguments and experimental results.

6.2 Previous Work

There have been several research efforts directed at solving the registration problem. They fall into two categories: (i) the first relies on a precisely calibrated data acquisition device to determine the transformations that relate the views, and (ii) the second involves techniques to estimate the transformations from the data directly.
Bhanu [16] describes an object modeling system in which objects are rotated through known angles to obtain multiple views. Ahuja and Veenstra [1] use orthogonal views to construct octree object models. In such work the correspondence problem is solved easily through the calibration of the data acquisition facilities. Vemuri and Aggarwal [165] used a base-plane pattern to estimate the interframe rotation of objects by corresponding the control pattern in intensity images acquired simultaneously with the range images. These techniques are inadequate for constructing a complete description of complex shaped objects because the views are restricted to rotations or to some known viewpoints only; we cannot make use of the object surface geometry in the selection of vantage views from which to obtain measurements.

Inter-image correspondence has also been established by matching surface features derived from the data [62]. The accuracy of the feature detection method employed determines the accuracy of the feature correspondences. Potmesil [131] matched multiple range views using a heuristic search in the view transformation space. Though quite general, this technique involves searching a huge parameter space, and even with good heuristics it may be computationally very expensive. Chen and Medioni avoid the search by assuming an initial approximate transformation for the registration, which is improved with an iterative algorithm [38] that minimizes the distance from points in one view to tangential planes at corresponding points in other views. Besl and McKay [15] proposed an iterative closest point algorithm for registration of free-form surfaces which requires the specification of an appropriate procedure to find the closest point on a geometric entity to a given point. Blais and Levine [19] propose a reverse calibration of the range-finder to determine the point correspondences between the views directly, and use stochastic search to estimate the transformation. These approaches, however, do not take into account the presence of noise or inaccuracies in the data and their effect on the estimated view transformation. A related work by Turk and Levoy [159], which describes an entire system for registration and integration of object views, uses a variant of the iterated closest-point algorithm [15]. Our registration technique also uses a distance minimization algorithm to register a pair of views, but we do not impose the requirement that one surface be strictly a subset of the other.

6.3 Error in Surface Measurements

Range data are often corrupted by measurement errors and sometimes suffer from missing data. The errors in surface measurements of an object include scanner errors, camera distortion, and spatial quantization. The missing data can be due to self-occlusion, overlapping objects, or sensor shadows. Therefore, the fusion of multiple views during 3D model construction should take into account the different uncertainties in the observations. Registration errors affect the integration stage in model construction; they can also affect surface classification. Even if the noise in the range data is small, it is important to ensure that the estimated transformation is accurate: when data from different views are merged on the basis of an inaccurately estimated transformation, the merged data may have gaps, resulting in holes, and may exhibit discontinuities in the surfaces when multiple data points are mapped to the same physical point on the surface.
Due to noise, it is generally impossible to obtain a solution for a rigid transformation that fits two sets of noisy three-dimensional points exactly. The least-squares solution in [38] is non-optimal, as it treats all surface measurements with different reliabilities equally. Our objective is to derive a transformation that globally registers the noisy data in some optimal sense.

With range sensors that provide measurements in the form of a graph surface $z = f(x, y)$, it is assumed that the error is present along the $z$ axis only, as the $x$ and $y$ measurements are usually laid out in a grid. The effects of errors in the $z$ measurements on the estimation of surface attributes such as normals may vary depending on the orientation of a surface patch. For example, a noisy $z$ measurement on a horizontal surface patch affects the estimation of the surface normal more than an erroneous $z$ value on an inclined surface patch, as shown in Figures 6.2 and 6.3. We discuss this aspect in more detail in Section 6.5.1. There are different uncertainties along different surface orientations, and they need to be handled appropriately during view registration. Furthermore, the measurement error is not uniformly distributed over the entire image; the error may depend on the position of a point relative to the object surface. A measurement error model dealing with the sensor's viewpoint has been previously proposed [84] for surface reconstruction, where the emphasis was on recovering straight line segments from noisy single-scan 3D surface profiles.

In this chapter we investigate the effect of measurement noise on the registration of multiple views to accurately estimate the relative transformation between them, and propose a new method that improves upon the approach of Chen and Medioni [38]. The registration technique is an iterative least squares algorithm minimizing the sum of weighted distances between a set of control points, chosen from the range data of an object observed from some viewpoint, and the tangential planes fitted at the corresponding control points in a range image of another view of the object. We use the terms "view" and "image" interchangeably in this chapter. We formulate a noise model that characterizes the error in estimating tangent planes from noisy range data, and present a minimum variance estimation of the transformation parameters that relate the data from two views. Our model handles numerical errors in $z$ values and surface orientation effects. The minimum variance estimator proposed here handles the inaccuracies introduced into the range data by the sensor. It only assumes that the noise distributions of the data are well behaved and possess short tails. When a Gaussian distribution is assumed to model the noise in the data, the minimum variance estimator becomes equivalent to a weighted linear least-squares algorithm.
6.4 A Non-Optimal Algorithm for Registration

Two views of a surface are said to be in registration [38] when any pair of points, $p$ and $q$, from the two views representing the same object surface point can be related to each other by a single rigid 3D spatial transformation $T$, such that

$$\forall p \in P, \; \exists q \in Q \ \text{ such that } \ \|Tp - q\| = 0, \quad (6.1)$$

where $P$ and $Q$ are two views of the same surface, $Tp$ is the point obtained by applying the transformation $T$ to $p$, and $T$ is a transformation expressed in homogeneous coordinates as given below:

$$T = T(\alpha, \beta, \gamma, t_x, t_y, t_z) = \begin{bmatrix} \cos\gamma\cos\beta & \cos\gamma\sin\beta\sin\alpha - \sin\gamma\cos\alpha & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma & t_x \\ \sin\gamma\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\gamma\sin\beta\cos\alpha - \cos\gamma\sin\alpha & t_y \\ -\sin\beta & \cos\beta\sin\alpha & \cos\beta\cos\alpha & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (6.2)$$

where $\alpha$, $\beta$ and $\gamma$ are the rotation angles about the $x$, $y$ and $z$ axes, respectively, and $t_x$, $t_y$ and $t_z$ are the translation parameters. The transformation $T$ needed to bring the two views into registration has 6 degrees of freedom. Thus, the problem of registration is to search this six-dimensional parameter space for a transformation that satisfies Eq. (6.1).

The approach of [38] is based on the assumption that an approximate transformation between the two views is already known, i.e., data from the two views are approximately registered, and the goal is to refine the initial estimate to obtain a more accurate global registration. Given a set of $N$ pairs of corresponding points, called control points, in the two views, $p_i \in P$ and $q_i \in Q$, $i = 1 \ldots N$, the transformation can be estimated by minimizing

$$e = \sum_{i=1}^{N} \|T(p_i) - q_i\|^2, \quad (6.3)$$

where $N \geq 3$. Since the $q_i \in Q$ corresponding to a point $p_i \in P$ is not usually known, Chen and Medioni [38] used the following objective function to iteratively minimize the distances from points on one surface to the other:

$$e^k = \sum_i d_s^2(T^k p_i, S_i^k), \quad (6.4)$$

where $T^k$ is the 3D transformation applied to a control point $p_i \in P$ at the $k$th iteration, $l_i = \{a \mid (p_i - a) \times \mathbf{n}_{p_i} = 0\}$ is the line normal to $P$ at $p_i$, $q_i^k = (T^k l_i) \cap Q$ is the intersection point of surface $Q$ with the transformed line $T^k l_i$, $\mathbf{n}_{q_i^k}$ is the normal to $Q$ at $q_i^k$, $S_i^k = \{s \mid \mathbf{n}_{q_i^k} \cdot (q_i^k - s) = 0\}$ is the tangent plane to $Q$ at $q_i^k$, and $d_s$ is the signed distance from a point to a plane as given in Eq. (6.5). (Here '$\cdot$' stands for the scalar product and '$\times$' for the vector product.) Figure 6.1 illustrates the distance measure $d_s$ between surfaces $P$ and $Q$.

The registration algorithm thus finds a $T$ that minimizes $e^k$ iteratively, using a least squares method. The tangent plane $S_i^k$ serves as a local linear approximation to the surface $Q$ at a point. The intersection point $q_i^k$ is an approximation to the actual corresponding point $q_i$, which is unknown at each iteration $k$. An initial $T^0$ that approximately registers the two views is used to start the iterative process. The signed distance $d_s$ from a transformed point $Tp_i$, $p_i \in P$, to a tangential plane $S_i^k \in Q$ is given by

$$d_s = \frac{\mathcal{A}x + \mathcal{B}y + \mathcal{C}z + \mathcal{D}}{\sqrt{\mathcal{A}^2 + \mathcal{B}^2 + \mathcal{C}^2}}, \quad (6.5)$$

where $Tp_i = (x, y, z)^T$ and $S_i^k = (\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})^T$ define the transformed point and the tangential plane, respectively. Note that $(x, y, z)^T$ is the transpose of the vector $(x, y, z)$.

Figure 6.1: Point-to-plane distance: (a) surfaces $P$ and $Q$ before the transformation $T^k$ at iteration $k$ is applied; (b) distance from the point $p_i$ to the tangent plane $S_i^k$ of $Q$.

By minimizing the distance from a point to a plane, only the direction in which the distance can be reduced is constrained. The algorithm that performs this minimization is as follows:
1. A set of control points $p_i \in P$, $i = 1, 2, \cdots, N$, is selected and the surface normal $\mathbf{n}_{p_i}$ is computed at each point. Let an initial transformation be $T^0$.

2. The following procedure is repeated for every iteration $k$, for $k = 1, 2, \cdots$ until the process converges.

(i) For each control point $p_i$:

- The transformation $T^{k-1}$ is applied to both the control point $p_i$ and the normal $\mathbf{n}_{p_i}$ to get $p_i'$ and $\mathbf{n}_{p_i}'$.
- The intersection $q_i^k$ of surface $Q$ and the normal line $l_i$ defined by $p_i'$ and $\mathbf{n}_{p_i}'$ is computed.
- The tangent plane $S_i^k$ of $Q$ is computed at $q_i^k$.
- The distance $d_s$ between $p_i'$ and $S_i^k$ is determined.

(ii) The transformation $T$ that minimizes $e^k$ in Eq. (6.4) is estimated with a least squares method.

(iii) Now let $T^k = T\,T^{k-1}$.

The convergence of the process can be tested by verifying that the difference between the errors $e^k$ at any two consecutive iterations is less than a pre-specified threshold. The line-surface intersection given by the intersection of the normal line $l_i$ and $Q$ is found using an iterative search in the neighborhood of prospective intersection points. (A condensed code sketch of this loop is given below, following Figure 6.2.)

6.5 Registration and Error Modeling

Chen and Medioni's algorithm is not optimal because it does not handle the errors in the $z$ measurements. Though there are no actual outliers in the range data, occluded and noisy surface points can be considered as such. A simple least-squared-error estimation procedure is then undesirable, since the estimated transformation parameters will be affected more by the outliers than by the actual data. The least-squares estimation method treats outliers and good data alike, since all errors are weighted in proportion to their magnitude. Instead, we would like an estimation procedure that discards, or gives low weight to, the noisy measurements.

We show that the noise in the $z$ values affects the estimation of the tangential plane parameters differently depending on how a surface is oriented. Since the estimated tangential plane parameters play a crucial role in determining the distance $d_s$ (which is being minimized to estimate $T$), it is important to study the effect of noise on the estimation of the parameters of the plane and on the minimization of $d_s$. Note that the error in the iterative estimation of $T$ is a combined result of errors in each control point $(x, y, z)^T$ from view 1 and errors in fitting a tangential plane at the corresponding control points in view 2.

6.5.1 Fitting Planes to Surface Data with Noise

Figures 6.2 and 6.3 illustrate the effect of noise in the values of $z$ on the estimated plane parameters. For the horizontal plane shown in Figure 6.2, an error in $z$ (the uncertainty region around $z$) directly affects the estimated surface normal. In the case of an inclined plane, the effect of errors in $z$ on the normal to the plane is much less pronounced, as shown in Figure 6.3. Here, even if the error in $z$ is large, only its projected error along the normal to the plane affects the normal estimation. This projected error becomes smaller than the actual error in $z$ as the normal becomes more and more inclined with respect to the vertical axis. Therefore, our hypothesis is that as the angle between the vertical ($Z$) axis and the normal to the plane increases, the difference between the fitted plane parameters and the actual plane parameters should decrease. This hypothesis has been verified by our simulations, as explained below.

Figure 6.2: Effect of noise in $z$ measurements on the fitted normal when the plane is horizontal. The double-headed arrows indicate the uncertainty in depth measurements.
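For reference, here is the promised condensed sketch of the Section 6.4 loop in Python. It simplifies the procedure in two places, both assumptions of ours: a nearest-neighbour search stands in for the normal-line/surface intersection that defines $q_i^k$, and the least-squares step uses the usual small-angle linearization of the rotation. All names are ours.

    import numpy as np

    def point_to_plane_icp(P, Q, Q_normals, T0, iters=20, tol=1e-8):
        """Refine a rigid transform registering control points P (N x 3) to a
        surface Q (M x 3, with unit normals Q_normals) by minimizing summed
        squared point-to-tangent-plane distances, as in Eq. (6.4)."""
        T, prev_err = T0.copy(), np.inf
        for _ in range(iters):
            Pk = (T[:3, :3] @ P.T).T + T[:3, 3]           # transformed control points
            # Nearest surface point stands in for the normal-line intersection q_i^k.
            idx = ((Pk[:, None, :] - Q[None, :, :]) ** 2).sum(-1).argmin(1)
            q, n = Q[idx], Q_normals[idx]
            d = ((Pk - q) * n).sum(1)                     # signed distances d_s
            # Small-angle linearization: residual ~ d + (p x n).w + n.t
            J = np.hstack([np.cross(Pk, n), n])
            x, *_ = np.linalg.lstsq(J, -d, rcond=None)    # x = (alpha, beta, gamma, t)
            dT = np.eye(4)
            a, b, g = x[:3]
            dT[:3, :3] = [[1, -g, b], [g, 1, -a], [-b, a, 1]]
            dT[:3, 3] = x[3:]
            T = dT @ T                                    # step 2(iii): T^k = T T^(k-1)
            err = (d ** 2).sum()
            if abs(prev_err - err) < tol:                 # convergence test on e^k
                break
            prev_err = err
        return T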
We carried out simulations to study the actual effect of the noise in the $z$ measurements on the estimation of the plane parameters and to verify the above hypothesis. We obtained the planar parameters from the surface measurements using two methods: (i) an eigenvector method and (ii) linear regression.

Figure 6.3: Effect of noise in $z$ measurements on the fitted normal when the plane is inclined. The double-headed arrows indicate the uncertainty in depth measurements.

Eigenvector Method for Planar Fitting

The conventional method for fitting planes to a set of 3D points uses a linear least squares algorithm, which is described in the following section. Note that the linear regression method implicitly assumes that two of the three coordinates are measured without error. In general, however, surface points can have errors in all three coordinates, and surfaces can be in any orientation. Hence, we use a classical eigenvector method (principal components analysis) [65] that allows us to extract all linear dependencies.

Let the plane equation be $\mathcal{A}x + \mathcal{B}y + \mathcal{C}z + \mathcal{D} = 0$. Let $X_i = (x_i, y_i, z_i)^T$, $i = 1, 2, \cdots, n$, be a set of surface measurements used in fitting a plane at a point on a surface. Let

$$A = \begin{bmatrix} x_1 & y_1 & z_1 & 1 \\ x_2 & y_2 & z_2 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & z_n & 1 \end{bmatrix} \quad (6.6)$$

and let $h = (\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})^T$ be the vector containing the plane parameters. We solve for the vector $h$ such that $\|Ah\|$ is minimized. The solution for $h$ is the unit eigenvector of $A^T A$ associated with the smallest eigenvalue. We renormalize $h$ such that $(\mathcal{A}, \mathcal{B}, \mathcal{C})^T$ is the unit normal to the fitted plane and $\mathcal{D}$ is the distance of the plane from the origin of the coordinate system. This planar fit minimizes the sum of the squared perpendicular distances between the data points and the fitted plane, and it is independent of the choice of the coordinate frame.

In our computer simulations, we used synthetic planar patches as test surfaces. The simulation data consisted of surface measurements from planes at various orientations with respect to the vertical axis. Independent and identically distributed (i.i.d.) Gaussian and uniform noise with different variances was added to the $z$ measurements of the synthetic planar data. The standard deviation of the noise was in the range 0.001-0.005 in., as this realistically models the error in $z$ introduced by the Technical Arts 100X White range scanner [120] that was employed to obtain the range data for our experiments. The planar parameters were estimated using the eigenvector method at different surface points, with a neighborhood of size $5 \times 5$. The error $E_{fit}$ in fitting the plane was defined as the norm of the difference between the actual normal to the plane and the normal of the fitted plane estimated with the eigenvector method.

Figure 6.4 shows the plot of $E_{fit}$ versus the orientation of the normal to the simulated plane (with respect to the vertical axis) at different noise variances. The plot shows the error values averaged over 1,000 trials. It can be seen from Figure 6.4 that the error in fitting a plane decreases with an increase in the angle between the vertical axis and the normal to the plane, since the error in $z$ contributes less to the estimation of the normal of an inclined plane. When the plane is nearly horizontal (the angle between the vertical axis and the normal to the plane is small), the error in $z$ contributes entirely to the error in fitting, as shown in Figure 6.2. The error plot in Figure 6.4 was observed to have the same behavior for varying amounts of variance with the Gaussian noise model. The results were found to have similar characteristics with a uniform noise model as well, as shown in Figure 6.4. These simulations confirm our hypothesis about the effect of noise in $z$ on the fitted plane parameters as the surface orientation changes.
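Both fitting procedures are compact enough to reproduce directly. A minimal sketch of each (function names ours), operating on an $n \times 3$ array of neighborhood points:

    import numpy as np

    def fit_plane_eigen(X):
        """Eigenvector (total least squares) fit: h = (A, B, C, D) is the
        eigenvector of A^T A for the smallest eigenvalue, renormalized so
        that (A, B, C) is a unit normal."""
        A = np.hstack([X, np.ones((len(X), 1))])
        h = np.linalg.svd(A)[2][-1]          # right singular vector, smallest sigma
        return h / np.linalg.norm(h[:3])

    def fit_plane_regression(X):
        """Linear regression fit z = ax + by + c (errors assumed in z only),
        converted to the normalized form (A, B, C, D)."""
        A = np.hstack([X[:, :2], np.ones((len(X), 1))])
        m, *_ = np.linalg.lstsq(A, X[:, 2], rcond=None)   # m = (a, b, c)
        h = np.array([m[0], m[1], -1.0, m[2]])            # ax + by - z + c = 0
        return h / np.linalg.norm(h[:3])

The fitting error $E_{fit}$ of the simulations is then simply the norm of the difference between the fitted and true unit normals.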
When the plane is nearly horizontal (the angle between the vertical axis and the normal to the plane is small), the error in z entirely contributes to the error in fitting as shown in Figure 6.2. The error plot in Figure 6.4 was observed to have the same behavior for varying amounts of variance with the Gaussian noise model. The results were found to have similar characteristics with a uniform noise model also as shown in Figure 6.4. These simulations confirm our hypothesis about the effect of noise in 2 on the fitted plane parameters as the surface orientation changes. 179 Plane Fitting Using Linear Regression Since we assume errors in z direction only, it may be more appropriate to use a linear least squares method to fit the plane in order to verify our hypothesis instead of the general eigenvector method described in the above section. The linear regression method brings out the errors in z direction explicitly. With the linear regression method, a plane equation is redefined as z = ax+by+c. Let X,- = (x;,y,,z,-)T, i = 1,2,--- ,n, be a set of surface measurements used in fitting a plane at a point on a. surface. Let - - [- - $1 yl 1 21 a: 1 z A: .2 “2 . and Z: .2 (6.7) .. $71, yn 1 _ L. 2n .- and m = (a,b,c)T. We solve for m such that Z = Am is satisfied in a least square sense. The solution of m is given by (ATA)"1ATZ and the covariance matrix Fabc associated with m is given by (ATI‘ZA). We compute the planar parameters (A, B, C )T from (a,b,c)T such that (A,B,C)T is the unit normal to the plane and ’D is the distance of the fitted plane from‘the origin. Note that the planar fit here minimizes the squared differences between the actual z measurements and the z coordinates of the fitted plane. We repeated our simulations described in Section 6.5.1, but used the linear re- gression method to fit planes to data. The experimental details for these simulations were identical to those described before. Figure 6.5 shows the estimated E fit between the fitted and actual normals to the plane at various surface orientations. It can be seen from Figure 6.5 that the error between the actual and the estimated normals again decreases with an increase in the angle between the vertical axis and the normal to the plane, thus confirming our hypothesis. The error plot was observed to exhibit the same behavior for varying amounts of variance with the Gaussian noise model. The results were found to be similar with uniform noise model also. Our model 180 works for general noise distributions as long as errors in different measurements are uncorrelated and their distributions have short tails. 6.6 An Optimal Registration Algorithm Since the estimated tangential plane parameters are affected by the noise in z mea- surements, any inaccuracies in them, in turn influence the accuracy of the estimates of d,, thus affecting the error function being minimized during the registration. There- fore, we characterize the error in the estimates of d, by modeling the uncertainties associated with the 2 measurements using weights. Our approach is inspired by the Gauss-Markov theorem [171] which states that an unbiased linear minimum variance estimator of a parameter vector In when y = f (m) + 6,, is the one that minimizes (y — f(m))TF;1(y — f(m)), where 6,. is a random vector with zero mean and covari— ance matrix Fy. Based on this theorem, we formulate an optimal error function for registration of range images as e’° = i id? (6 8) i=1 039 s? . where 035 is the estimated variance of the distance d,. 
When the reliability of a $z$ value is low, the variance $\sigma_{d_{s_i}}^2$ of the distance is large and the contribution of $d_{s_i}$ to the error function is small; when the reliability of the $z$ measurement is high, $\sigma_{d_{s_i}}^2$ is small and the contribution of $d_{s_i}$ is large. In other words, distances with minimum variance affect the error function the most. One of the advantages of this minimum variance criterion is that we do not need the exact noise distribution. What we require is that the noise distribution be well-behaved and have short tails. In our simulations, we employ both Gaussian and uniform noise distributions to illustrate the effectiveness of our method. We need to know only the second-order statistics of the noise distribution, which in practice can often be estimated. In the following section, we present a formulation to characterize and estimate $\sigma_{d_s}^2$.

6.6.1 Estimation of the Variance $\sigma_{d_s}^2$

We need to estimate $\sigma_{d_s}^2$ to model the reliability of the computed $d_s$; this can then be used in our optimal error function in Eq. (6.8). Let the set of all the surface points be denoted by $P$ and the errors in the measurements of these points be denoted by a random vector $\epsilon$. The error $e_{d_s}$ in the computed distance is due to the error in the estimated plane parameters and the $z$ measurement. It is a function of $P$ and $\epsilon$:

$$e_{d_s} = f(P, \epsilon). \quad (6.9)$$

Our goal is to estimate the error $e_{d_s}$ given the surface measurements $P$. However, we do not know $\epsilon$. If we can estimate the standard deviation of $e_{d_s}$ (with $\epsilon$ as a random vector) from the noise-corrupted surface measurements $P$, we can use it in Eq. (6.8).

Estimation of $\sigma_{d_s}^2$ Based on Perturbation Analysis

Perturbation analysis is a general method for analyzing the effect of noise in data on the eigenvectors obtained from the data. It is general in the sense that errors in $x$, $y$ and $z$ can all be handled in this model. The analysis is also related to the general method of plane fitting that we studied, the eigenvector approach. The analysis to estimate $\sigma_{d_s}^2$ is simpler, as discussed in the next section, if we use the linear regression method to do the plane fitting; note that when we use linear regression we can assume that an error is present in the $z$ component only.

Since we fit a plane with the eigenvector method, which uses the symmetric matrix $C = A^T A$ computed from the $(x, y, z)$ measurements in the neighborhood of a surface point, we need to analyze how a small perturbation in the matrix $C$ caused by the noise in the measurements can affect the eigenvectors. Recall that these eigenvectors determine the plane parameters $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})^T$, which in turn determine the distance $d_s$.

We assume that the noise in the measurements has zero mean and some variance, and that the latter can be estimated empirically. The correlation in the noise at different points is assumed to be negligible. Estimation of the correlation in the noise is very difficult, but even if we estimated it, its impact may turn out to be insignificant. We estimate the standard deviation of the errors in the plane parameters and in $d_s$ on the basis of first-order perturbations, i.e., we estimate the "linear terms" of the errors.

Before we proceed, we introduce some notational conventions: $I_m$ is an $m \times m$ identity matrix; $\mathrm{diag}(a, b)$ is a $2 \times 2$ diagonal matrix with $a$ and $b$ as its diagonal elements. Given a noise-free matrix $A$, its noise matrix is denoted by $\Delta_A$, and the noise-corrupted version of $A$ is denoted by $A(\epsilon)$, i.e., $A(\epsilon) = A + \Delta_A$. The vector $\delta$ is used to indicate the noise vector, $X(\epsilon) = X + \delta_X$.
We use $\Gamma$ with a corresponding subscript to specify the covariance matrix of a noise vector/matrix. For a given matrix $A = [A_1\; A_2\; \cdots\; A_n]$, a vector $\vec{A}$ can be associated as

$$\vec{A} = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix}.$$

In other words, $\vec{A}$ consists of the column vectors of $A$ lined up together.

As proved in [172], if $C$ is a symmetric matrix ($A^T A$) formed from the measurements and $h$ is the parameter vector $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})^T$ given by the eigenvector of $C$ associated with the smallest eigenvalue, say $\lambda_1$, then the first-order perturbation in the eigenvector (parameter vector) $h$ is given by

$$\delta_h \cong H \Lambda H^T \Delta_{A^T A}\, h, \quad (6.10)$$

where

$$\Lambda = \mathrm{diag}\{0, (\lambda_1 - \lambda_2)^{-1}, (\lambda_1 - \lambda_3)^{-1}, (\lambda_1 - \lambda_4)^{-1}\}, \quad (6.11)$$

and $H$ is an orthonormal matrix such that

$$H^{-1} C H = \mathrm{diag}\{\lambda_1, \lambda_2, \lambda_3, \lambda_4\}. \quad (6.12)$$

$\Delta_{A^T A}$ is the $4 \times 4$ noise or perturbation matrix associated with $A^T A$. If the noise matrix $\Delta_{A^T A}$ can be estimated, then the perturbation $\delta_h$ in $h$ can be estimated to a first-order approximation as given in Eq. (6.10).

We estimate $\Delta_{A^T A}$ from the perturbation in the surface measurements. With this general model, we assume for the sake of simplicity of analysis that only the $z$ component of a surface measurement $X_i = (x_i, y_i, z_i)^T$ has errors; the analysis extends easily and directly to errors in $x$ and $y$ if their noise variances are known. Let $z_i$ have additive errors $\delta_{z_i}$, for $1 \leq i \leq n$. We then get

$$\Delta_{A^T} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \delta_{z_1} & \delta_{z_2} & \cdots & \delta_{z_n} \\ 0 & 0 & \cdots & 0 \end{bmatrix}. \quad (6.13)$$

If the errors in $z$ at different points on the surface have the same variance $\sigma^2$, we get the covariance matrix

$$\Gamma_{A^T} = \sigma^2\, \mathrm{diag}(P_1, P_2, \cdots, P_n), \quad (6.14)$$

where each $P_i$, $1 \leq i \leq n$, is the $4 \times 4$ submatrix

$$P_i = \mathrm{diag}(0, 0, 1, 0). \quad (6.15)$$

Now, consider the error in $h$. As stated before, we have

$$\delta_h \cong H \Lambda H^T \Delta_{A^T A}\, h = H \Lambda H^T [\mathcal{A} I_4\; \mathcal{B} I_4\; \mathcal{C} I_4\; \mathcal{D} I_4]\, \delta_{A^T A} \equiv G_h\, \delta_{A^T A}. \quad (6.16)$$

In the above equation, we have rewritten the matrix $\Delta_{A^T A}$ as a vector $\delta_{A^T A}$ and moved the perturbation to the extreme right of the expression. The perturbation of the eigenvector is then a linear transformation (by the matrix $G_h$) of the perturbation vector $\delta_{A^T A}$. Since we have $\Gamma_{A^T}$ ($= \Gamma_\delta$), we need to relate $\delta_{A^T A}$ to $\delta_{A^T}$. Using a first-order approximation [172], we get

$$\Delta_{A^T A} \cong \Delta_{A^T} A + A^T \Delta_A. \quad (6.17)$$

Letting $A^T = [a_{ij}]^T = [A_1\; A_2\; \cdots\; A_n]$, we write

$$\delta_{A^T A} \cong G_{A^T A}\, \delta_{A^T}, \quad (6.18)$$

where $G_{A^T A}$ is easily determined from the equation $G_{A^T A} = [F_{ij}] + [G_{ij}]$; here $[F_{ij}]$ and $[G_{ij}]$ are matrices with $4 \times 4$ submatrices $F_{ij}$ and $G_{ij}$, respectively, $F_{ij} = a_{ji} I_4$, and $G_{ij}$ is a $4 \times 4$ matrix with its $i$th column being the column vector $A_j$ and all other columns being zero. Thus, we get

$$\delta_h \cong G_h\, \delta_{A^T A} \cong G_h G_{A^T A}\, \delta_{A^T} \equiv D_h\, \delta_{A^T}. \quad (6.19)$$

Then the covariance matrix of $h$ is given by

$$\Gamma_h \cong D_h \Gamma_{A^T} D_h^T. \quad (6.20)$$

The distance $d_s$ is affected by the errors in the estimation of the plane parameters $(\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})^T$ and the $z$ measurement in $(x_1, y_1, z_1)^T$. The error variance of the distance $d_s$ is therefore given by

$$\sigma_{d_s}^2 = \left[ \frac{\partial d_s}{\partial \mathcal{A}}\;\; \frac{\partial d_s}{\partial \mathcal{B}}\;\; \frac{\partial d_s}{\partial \mathcal{C}}\;\; \frac{\partial d_s}{\partial \mathcal{D}}\;\; \frac{\partial d_s}{\partial z} \right] \Gamma_{h,z} \left[ \frac{\partial d_s}{\partial \mathcal{A}}\;\; \frac{\partial d_s}{\partial \mathcal{B}}\;\; \frac{\partial d_s}{\partial \mathcal{C}}\;\; \frac{\partial d_s}{\partial \mathcal{D}}\;\; \frac{\partial d_s}{\partial z} \right]^T, \quad (6.21)$$

where the covariance matrix $\Gamma_{h,z}$ is given by

$$\Gamma_{h,z} = \begin{bmatrix} \Gamma_h & 0 \\ 0 & \sigma^2 \end{bmatrix}. \quad (6.22)$$

Once the variance $\sigma_{d_s}^2$ of $d_s$ is estimated, we rewrite the error function that is to be minimized to estimate $T$ as

$$e^k = \sum_{i=1}^{N} \frac{1}{\sigma_{d_{s_i}}^2}\, d_s^2(T^k p_i, S_i^k). \quad (6.23)$$

Figure 6.6 shows the plot of the actual standard deviation of the distance $d_s$ versus the orientation of the plane with respect to the vertical axis. Note that the mean of $d_s$ is zero when the surface points are in complete registration and there is no noise. We generated two views of synthetic planar patches with the transformation $T$ between the views being the identity transformation.
We experimented with the planar patches at various orientations. We added uncorrelated Gaussian noise independently to the two views, estimated the distance $d_s$ at different control points from Eq. (6.5), and computed its standard deviation. The plot shows the values averaged over 1,000 trials. As indicated by our hypothesis, the actual standard deviation of $d_s$ decreases as the planar orientation goes from horizontal to vertical. As the variance of the Gaussian noise added to the $z$ measurements increases, $\sigma_{d_s}$ also increases. The figure also shows the results obtained when we added uniform noise to the data.

We compare the actual variance with the estimated variance of the distance (Eq. (6.21)) in order to verify whether our modeling of the errors in the $z$ values at various surface orientations is correct. We computed the estimated variance of the distance $d_s$ using our error model from Eq. (6.21), with the same experimental setup as described above. Figure 6.7 illustrates the behavior of the estimated standard deviation of $d_s$ as the inclination of the plane (the surface orientation) changes. A comparison of Figures 6.6 and 6.7 shows that the actual and the estimated standard deviations of $d_s$ behave similarly with varying planar orientation, and that their values are proportional to the amount of noise added. This establishes the correctness of our error model of $z$ and its effect on the distance $d_s$.

Estimation of $\sigma_{d_s}^2$ with Linear Regression for Planar Fitting

The analysis in the previous section for estimating the variance of the distance $d_s$ becomes simpler when we use linear regression to fit a plane to the surface measurements. Since we assumed errors in the $z$ measurements only, it is possible to obtain the covariance matrix $\Gamma_h$ directly using the linear regression method; the difficulty with the eigenvector method was in estimating the covariance matrix $\Gamma_h$ of the fitted planar parameters. Recall from Section 6.5.1 that the covariance matrix $\Gamma_{abc}$ of the fitted plane parameters $(a, b, c)$ is given directly by $(A^T \Gamma_z^{-1} A)^{-1}$. Since we assume that the errors in $z$ have zero mean and the same variance, $\Gamma_z = \mathrm{diag}(\sigma^2, \sigma^2, \cdots, \sigma^2)$, and therefore $\Gamma_{abc} = \sigma^2 (A^T A)^{-1}$, from which $\Gamma_h$ follows. Then $\sigma_{d_s}^2$ can be computed using Eq. (6.21) as before.

We repeated the experiments described in Section 6.6.1 to compute the actual variance of $d_s$ using the planar parameters obtained with linear regression. Figure 6.8 shows the plot of the actual standard deviation of the distance $d_s$ versus the orientation of the plane with respect to the vertical axis when linear regression was used. We experimented with planar patches at various orientations. As indicated by our hypothesis, the actual standard deviation of $d_s$ decreases as the planar orientation goes from horizontal to vertical. As the variance of the Gaussian noise added to the $z$ measurements increases, $\sigma_{d_s}$ also increases. The figure also shows the results obtained when we added uniform noise to the data.

We also estimated the variance of the distance $d_s$ using the simpler formulation described above. Figure 6.9 illustrates the behavior of the estimated $\sigma_{d_s}$ with varying planar orientations when linear regression is used to fit a plane. We can observe here, too, that the estimated standard deviation of the distance $d_s$ decreases as the planar orientation goes from horizontal to vertical. The plots of the estimated variance $\sigma_{d_s}^2$ resemble those of the actual variance, demonstrating the validity of our method of estimating $\sigma_{d_s}^2$ using linear regression.
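The quadratic form of Eq. (6.21) is straightforward to evaluate once $\Gamma_h$ is available. A minimal sketch (names ours), assuming the $4 \times 4$ covariance $\Gamma_h$ of $h = (\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D})$ has already been obtained, e.g. from Eq. (6.20) or propagated from the regression covariance:

    import numpy as np

    def distance_variance(p, h, Gamma_h, sigma2_z):
        """First-order variance of the signed point-to-plane distance d_s,
        Eq. (6.21): grad^T Gamma_{h,z} grad, Gamma_{h,z} = diag(Gamma_h, sigma2_z).
        p: transformed control point (x, y, z); h: fitted plane (A, B, C, D)."""
        x, y, z = p
        A, B, C, D = h
        N = np.sqrt(A * A + B * B + C * C)
        S = A * x + B * y + C * z + D
        g = np.array([x / N - S * A / N**3,       # d(d_s)/dA
                      y / N - S * B / N**3,       # d(d_s)/dB
                      z / N - S * C / N**3,       # d(d_s)/dC
                      1.0 / N,                    # d(d_s)/dD
                      C / N])                     # d(d_s)/dz
        Gamma = np.zeros((5, 5))
        Gamma[:4, :4] = Gamma_h
        Gamma[4, 4] = sigma2_z
        return float(g @ Gamma @ g)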
The two sets of plots, Figures 6.6 and 6.8 and Figures 6.7 and 6.9, together illustrate yet another important point: the similar behavior of the actual and estimated variance in both sets of figures as the planar orientation varies demonstrates that the particular method used for planar fitting does not bias our results.

6.7 Experimental Results

In this section we demonstrate the improvements in the estimation of transformation parameters obtained with the minimum variance estimator (MVE). Henceforth, we refer to the technique of Chen and Medioni [38] as the C-M method.

6.7.1 Selection of Control Points

We perform uniform subsampling of the depth data to locate the control points in view 1 that are to be used in registration. From these subsampled points we choose only those that lie in smooth surface patches. The local smoothness of the surface was verified using the residual standard deviation resulting from the least-squares fitting of a plane in the neighborhood of a pixel. The algorithm does not rely on exact correspondence of the control points, but uses the constraints from the geometry, or shape, of the surfaces.

6.7.2 Initial Estimate of the Transformation

Even if an initial estimate of the transformation is not available, a good initial guess for the iterative algorithm can be determined automatically when the range images contain the entire object surface and the rotations of the object in the views are primarily in the plane. It is based on estimating an approximate rotation and translation from the major (principal) axes of the object views. Figure 6.11 depicts the two major axes of the objects. The normalized eigenvectors from view 1 form the column vectors of a matrix $M_1$, and the eigenvectors from view 2 form the matrix $M_2$. The rotation matrix $R$ is then given by $R = M_2 M_1^{-1}$. If $t = (t_x, t_y, t_z)$ is the translation vector, then a point $X_i'$ in view 2 is related to a point $X_i$ in view 1 by $X_i' = R X_i + t$. The translation vector $t$ is given by $t = \bar{X}' - R\bar{X}$, where $\bar{X}$ and $\bar{X}'$ are the centers of mass of views 1 and 2, respectively.

We use this estimated transformation as an initial guess for the iterative procedure in our experiments, since we assume no prior knowledge of the sensor placement. This also illustrates the effectiveness of our method in refining such a rough estimate to be close to the ground truth. In all our experiments, the same initial guess was used with both the C-M method and the proposed MVE. We used Newton's method for minimizing the error function iteratively.

6.7.3 Errors in the Estimated Transformation

In order to measure the error in the estimated rotation parameters, we define an error measure that does not depend on the actual rotation parameters. The relative error of the rotation matrix $R$, $E_R$, is defined as $E_R = \|\hat{R} - R\| / \|R\|$, where $\hat{R}$ is an estimate of $R$. Since $R$ is orthonormal, $\|R\| = \sqrt{3}$, and the geometric sense of $E_R$ is the square root of the mean squared distance between the three unit vectors of the rotated orthonormal frames. This is illustrated in Figure 6.10. Since the frames are orthonormal, $E_R = \sqrt{dx^2 + dy^2 + dz^2}/\sqrt{3}$. The error in translation, $E_t$, is defined as the square root of the sum of the squared differences between the estimated and actual $t_x$, $t_y$ and $t_z$ values.

Table 6.1: Estimated transformation for the cobra data.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 5            | 0.285706              | 4.524133
β (degrees)    | 0            | -2.225014             | 0.014339
γ (degrees)    | 10           | 12.97406              | 10.38106
t_x (inches)   | 0            | 0.942902              | 0.668358
t_y (inches)   | 0            | 0.348335              | 0.021682
t_z (inches)   | 0            | 0.226312              | -0.297580
E_R            |              | 0.084830              | 0.008691
E_t            |              | 1.030348              | 0.73193
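Both error measures, as just defined and tabulated above, take only a few lines to compute. A minimal sketch (names ours):

    import numpy as np

    def rotation_error(R_est, R_true):
        """E_R = ||R_est - R_true|| / ||R_true|| (Frobenius norms)."""
        return np.linalg.norm(R_est - R_true) / np.linalg.norm(R_true)

    def translation_error(t_est, t_true):
        """E_t: root of the summed squared differences in t_x, t_y, t_z."""
        return float(np.sqrt(((np.asarray(t_est) - np.asarray(t_true)) ** 2).sum()))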
6.7.4 Results

Figure 6.11 shows the range data of a cobra head and Big-Y. The figure renders depth as pseudo-intensity; points oriented almost vertically are shown darker. View 2 of the cobra head was obtained by rotating the surface by 5° about the X axis and 10° about the Z axis. Table 6.1 shows the values of $E_R$ and $E_t$ for the cobra images estimated using as few as 25 control points. It can be seen that the transformation parameters obtained with the MVE are closer to the ground truth than those estimated using the unweighted objective function of the C-M method. Table 6.2 shows the improved results when more control points were used; in that case, too, the estimates obtained with our method were closer to the ground truth than those obtained with the C-M method.

Table 6.2: Registration of cobra data with 156 control points.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 5            | 4.297135              | 4.351065
β (degrees)    | 0            | -0.601008             | -0.190772
γ (degrees)    | 10           | 10.61468              | 10.55809
t_x (inches)   | 0            | 0.704665              | 0.686584
t_y (inches)   | 0            | 0.060029              | -0.020930
t_z (inches)   | 0            | -0.230945             | -0.267845
E_R            |              | 0.015795              | 0.012487
E_t            |              | 0.743970              | 0.737276

We also show the performance of our method when the two views are substantially different and the depth values are very noisy. Figure 6.11 shows two views of Big-Y generated from its CAD model. The second view was generated by rotating the object about the Z axis by 45°. We also added Gaussian noise with zero mean and a standard deviation of 0.5 mm to the $z$ values of the surfaces in view 2. Table 6.3 shows $E_R$ and $E_t$ computed with 81 control points. When the number of control points used for registration was increased to 154, the results improved, as shown in Table 6.4.

Table 6.3: Registration of Big-Y views using 81 control points.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 0            | -15.563313            | -10.706766
β (degrees)    | 0            | -2.826640             | -1.991803
γ (degrees)    | 45           | 42.18119              | 43.29852
t_x (inches)   | 0            | 0.040259              | 0.049479
t_y (inches)   | 0            | 0.096469              | 0.062579
t_z (inches)   | 0            | -0.401167             | -0.23081
E_R            |              | 0.229121              | 0.157230
E_t            |              | 0.414563              | 0.244215

Table 6.4: Registration of Big-Y views using 154 control points.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 0            | 1.263640              | 0.606140
β (degrees)    | 0            | 2.015097              | 1.191350
γ (degrees)    | 45           | 44.46735              | 44.77864
t_x (inches)   | 0            | 0.016323              | 0.000602
t_y (inches)   | 0            | 0.131396              | 0.084499
t_z (inches)   | 0            | 0.043766              | 0.037624
E_R            |              | 0.034801              | 0.019322
E_t            |              | 0.139452              | 0.092498

It can be seen from these tables that the proposed MVE method estimates the transformations more accurately than the C-M method in the presence of noise. The transformation matrix, and especially the rotation matrix, obtained with the MVE is closer to the ground truth than that obtained using the C-M method. The errors in the translation components of the final estimates of the transformation matrices are mainly due to the approximate initial guess; our method refined these initial values to provide a final solution very close to the actual values.

Table 6.5: Registration of Face1 views with 250 control points.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 5            | 5.311695              | 4.608456
β (degrees)    | 0            | -2.300209             | -0.741129
γ (degrees)    | 0            | 3.14468               | 0.7895
t_x (inches)   | 0            | -0.120410             | -0.087956
t_y (inches)   | 0            | 0.450793              | 0.366696
t_z (inches)   | 0            | -0.301092             | -0.251068
E_R            |              | 0.055615              | 0.016433
E_t            |              | 0.555310              | 0.453031
Our method can also handle large transformations between views robustly. We show further results on range images of faces; the depth data are noisy owing to the roughness of the face masks used to obtain the range images. Figure 6.12 shows the range data of Face1. View 2 of the face was obtained by rotating the surface by 5° about the X axis. Table 6.5 shows $E_R$ and $E_t$ computed with 250 control points. When convergence was achieved, only 76 control points were being used in updating the transformation.

Table 6.6: Registration of Face2 views with 142 control points.

Parameter      | Actual value | Chen and Medioni [38] | New method
α (degrees)    | 5            | 4.053439              | 4.669192
β (degrees)    | 5            | 5.917358              | 5.045794
γ (degrees)    | 5            | 6.50263               | 5.08654
t_x (inches)   | 0            | -0.077274             | -0.104206
t_y (inches)   | 0            | 0.248883              | 0.317506
t_z (inches)   | 0            | -0.132019             | -0.192163
E_R            |              | 0.029432              | 0.005019
E_t            |              | 0.292136              | 0.385481

Figure 6.13 shows the range data of Face2. In this experiment, view 2 was obtained by rotating the face by 5° about the X, Y, and Z axes. Table 6.6 shows $E_R$ and $E_t$ computed with 142 control points. When convergence was achieved, only 35 control points were being used in updating the transformation. In our experiments, we found that even when the depth data contained noise owing to the roughness of the surface texture of the object and to self-occlusion, more accurate estimates of the transformation were obtained with the MVE. Note that the measurement error is random and we minimize the expected error in the estimated solution; the method does not guarantee that every component of the solution will have a smaller error in every single case. Additional results on real range images using the MVE for estimating the object pose of the input with respect to the best matched model view were presented in Section 5.4.4.

6.8 Surface Geometry and Registration

In this section, we discuss the performance of the weighted and unweighted registration algorithms on surfaces of various geometries, ranging from planar surfaces, where the normal is constant everywhere, to elliptical surfaces, where the normal changes as we move along the major and minor axes of the surface.

Since any registration method that uses estimated normals in its computation is affected by the noise in the $z$ values depending on the orientation of the surfaces, our objective is to study the extent to which the registration accuracy for different surfaces is affected by the errors in $z$. We first define a measure to characterize the average vertical orientation of a surface. A surface may be totally vertical (e.g., a vertically placed half-plane), partially inclined, or even fully horizontal. The average vertical orientation $V_S$ of a surface $S$ is defined as the average of the $z$ components of the surface normals $\mathbf{n}_p$ computed at each point $p$ on the surface wherever data are available. This measure captures how much a surface is oriented vertically in three-dimensional space. The orientation measure $V_S$ equals 0 if a planar surface is fully vertical, as the normals to the surface are then parallel to the $x$-$y$ plane and the $z$ components of the normals are all zero. The value of $V_S$ increases, becoming equal to 1, as the plane goes from vertical to horizontal.
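$V_S$ is directly computable from a field of unit normals. A minimal sketch (the name is ours, and the absolute value guarding against inconsistently signed normals is an assumption on our part):

    import numpy as np

    def average_vertical_orientation(normals):
        """V_S: mean of the z components of the unit surface normals over the
        points where data are available (normals: (N, 3) array)."""
        n = normals[np.isfinite(normals).all(axis=1)]   # drop missing data
        return float(np.abs(n[:, 2]).mean())            # abs: sign-consistency guard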
We expect that a surface that is nearly vertically oriented ($V_S \approx 0$) will be affected very little by the noise in $z$. This is because the estimated surface normals are then more or less aligned with the horizontal direction, and they are least affected by the noise in $z$ when the surface is vertical, as established by our results in Section 6.5.1. Any registration method that uses surface normals in its computation should therefore estimate the view transformation fairly accurately for vertically oriented surfaces. However, as the surface becomes more horizontally oriented, the errors in surface normal estimation become more pronounced and affect the registration accuracy.

For nearly horizontal planes, the errors in $z$ affect the surface normal computation, as established in Section 6.5.1. The rotational errors are therefore expected to be fairly high near horizontal planes with both methods. However, since the MVE compensates for this "orientation effect" by explicitly modeling the uncertainties in $z$, the relative rotational and translational errors in the estimates computed using the MVE should be much smaller than those of the C-M method for the same planar orientation. For the noiseless case, the behavior of the two methods should be similar. This orientation effect on registration accuracy holds across several classes of surfaces, irrespective of the surface type (cylinder or plane), and depends only on the amount of vertical orientation of the surface.

In general, all the orientation parameters of an object will be improved by the proposed MVE method if the object surface covers a wide variety of orientations, which is true of many natural objects. This is because each locally flat surface patch constrains the global orientation estimate of the object via its surface normal direction. For example, if the object is a flat surface, then only the global orientation component that corresponds to the surface normal can be improved, but not the other two components that are orthogonal to it. For the same reason, the surface normals of a cylindrical surface (without end surfaces) cover only a great circle of the Gaussian sphere, and thus only two components of its global orientation can be improved. The more surface orientations an object covers, the more complete the improvement in its global orientation by the proposed MVE method can be.

6.9 Summary

We cast the problem of pose estimation as one of registering two range views of an object, and we proposed a robust technique for accurate estimation of the transformation between the range images. We formulated a noise model that characterizes the effect of error in the $z$ measurements upon the estimation of tangent planes, and we proposed a minimum variance estimator for registration in order to handle the uncertainties in the $z$ values. Our model handles numerical errors and surface orientation effects. We presented a first-order perturbation analysis of the estimation of planar parameters from surface data. We derived the variance of the point-to-plane distance that is minimized to update the transformation between views, employed this variance as a measure of the uncertainty in the point-to-plane distances resulting from noise in the $z$ values, and presented a weighted least squares method to estimate the transformation parameters reliably. The results of our experiments on real range images have shown that the pose estimates obtained using our weighted objective function (MVE) are generally significantly more reliable than those computed with an unweighted distance criterion.
[Plots not reproduced. Each of Figures 6.4 through 6.9 contains two panels, (a) for i.i.d. Gaussian noise and (b) for uniform noise, plotting the measured quantity against the angle between the normal to the plane and the vertical axis, for noise standard deviations of 0.0, 0.001, 0.002, 0.003, 0.004, and 0.005.]

Figure 6.4: Effect of noise in z measurements on the fitted plane using the eigenvector approach: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.5: Effect of noise in z measurements on the planar fit using linear regression: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.6: Actual standard deviation of the distance d versus the planar orientation: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.7: Estimated standard deviation of the distance d using the perturbation analysis versus the plane orientation: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.8: Actual standard deviation of the distance d versus planar orientation using linear regression for plane-fitting: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.9: Estimated standard deviation of the distance d versus planar orientation using linear regression for plane-fitting: (a) i.i.d. Gaussian noise; (b) uniform noise.

Figure 6.10: Relative error of the rotation matrix R.

Figure 6.11: Range images and the principal axes: (a) Cobra head with depth rendered as pseudo intensity, view 1; (b) cobra head rotated, view 2; (c) view 1 of Big-Y generated from its CAD model; (d) view 2 of Big-Y.

Figure 6.12: Range images of Face1: (a) view 1; (b) view 2.

Figure 6.13: Range images of Face2: (a) view 1; (b) view 2.
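Figures 6.4 through 6.9 contrast two ways of fitting a plane to noisy range samples: an eigenvector (total least squares) fit and an ordinary linear regression of z on (x, y). The following is a minimal sketch of both fits, assuming the samples arrive as an (N, 3) NumPy array; it is illustrative only and does not reproduce the noise model or the perturbation analysis behind these plots.

```python
import numpy as np

def fit_plane_eigen(points):
    """Total least squares: the fitted plane passes through the centroid,
    and its normal is the eigenvector of the 3x3 scatter matrix with the
    smallest eigenvalue (perpendicular distances are minimized).
    points: (N, 3) array of noisy samples of a planar surface."""
    centroid = points.mean(axis=0)
    centered = points - centroid
    _, eigvecs = np.linalg.eigh(centered.T @ centered)  # ascending order
    normal = eigvecs[:, 0]
    d = -float(normal @ centroid)        # plane equation: normal . x + d = 0
    return normal, d

def fit_plane_regression(points):
    """Ordinary least squares z = a*x + b*y + c, which minimizes errors
    in the z coordinate only."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    a, b, c = np.linalg.lstsq(A, points[:, 2], rcond=None)[0]
    return a, b, c
```

Because the regression fit penalizes errors in z alone, its accuracy depends on how far the plane tilts away from horizontal, which is precisely the angle swept along the horizontal axes of these figures; the eigenvector fit minimizes perpendicular distances and has no such preferred direction.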
Chapter 7

Summary and Directions for Future Research

This chapter summarizes the results of the research reported in this thesis. Several exciting directions for future research are also outlined.

7.1 Summary

The most important contribution of this thesis is the introduction of the COSMOS representation scheme and the associated framework for the representation and recognition of 3D free-form rigid objects. COSMOS addressed the issue of representing and recognizing arbitrarily curved 3D objects using range data when (i) the object viewpoint (2D appearance) is not constrained, (ii) the objects may assume complex shapes and forms, and (iii) there is no restriction on the types of surfaces on the object. It is within this framework that the techniques for representing and recognizing arbitrarily shaped 3D objects have been developed.

Given surface depth data of a scene containing multiple nonoccluding objects (in general), the COSMOS-based recognition system derives a shape-based surface representation of each object (region of interest) in the scene. Under the formulation of the COSMOS scheme, an object is described concisely in terms of maximal surface patches of constant shape index (CSMPs). The maximal patches that represent the object are mapped onto the unit sphere via their orientations. This spherical mapping not only preserves the orientation information about the object, but is also employed to aggregate the local geometric attributes of the CSMPs via shape spectral functions. Geometric attributes such as surface area, curvedness and adjacency, which are required to capture local and global information, are built into the representation using the connectivity list and the support functions defined on the unit sphere. The representation scheme has been demonstrated to provide a meaningful and rich description of objects that is useful for the recognition of arbitrarily curved objects.

We also introduced a powerful matching primitive, the shape spectrum of an object, for fast matching of an input object view against views stored in a model database and for eliminating unlikely views during recognition. It characterizes the shape content of an object by summarizing the area on the surface of the object at each shape index value. We also established that the shape spectrum is the Fourier transform of the integral of the support function G1 over the entire unit sphere.
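To make these primitives concrete, the sketch below computes a per-point shape index and curvedness from principal curvature maps and bins them into a shape spectrum. It assumes the principal curvatures (ordered k1 >= k2) and a per-point surface-area estimate have already been extracted from the range image; the [0, 1] shape index scale follows the convention used in this thesis, while the bin count and the NaN treatment of planar points are illustrative choices, not requirements of the scheme.

```python
import numpy as np

def shape_index_and_curvedness(k1, k2):
    """Per-point shape index (on a [0, 1] scale) and curvedness, from
    principal curvature arrays with k1 >= k2 everywhere.  Planar points
    (k1 = k2 = 0) have no defined shape index; they are returned as NaN
    so the caller can treat them separately."""
    k1 = np.asarray(k1, dtype=float)
    k2 = np.asarray(k2, dtype=float)
    s = 0.5 - (1.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2)
    s[np.isclose(k1, 0.0) & np.isclose(k2, 0.0)] = np.nan
    r = np.sqrt((k1 ** 2 + k2 ** 2) / 2.0)   # curvedness
    return s, r

def shape_spectrum(s, area, bins=64):
    """Area-weighted histogram of shape index values: each bin accumulates
    the surface area whose shape index falls in that bin, so the spectrum
    summarizes how much of the surface exhibits each shape category."""
    valid = ~np.isnan(s)
    hist, edges = np.histogram(s[valid], bins=bins, range=(0.0, 1.0),
                               weights=area[valid])
    return hist / hist.sum(), edges
```

Values near the two ends of the scale correspond to the spherical umbilic shapes, with saddle-like and cylindrical shapes in between.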
The concept of the shape spectrum is especially appealing as it gives us the ability to construct an intuitive "frequency domain" (shape domain) characterization of spatially varying curvatures. We also discussed the compactness of the COSMOS representation of an object. We studied the recoverability of objects from COSMOS representations, and we established that the recovery of several classes of objects, such as convex polyhedra and convex closed surfaces on which the shape index varies continuously at every point, is feasible from their COSMOS representations from both theoretical and practical viewpoints.

We adopted a multiple-view based model for an object, where the 3D model of the object is a collection of the COSMOS representations of its views seen from different viewpoints. Then we demonstrated how the COSMOS representation can be derived from range data of an object view and illustrated the representation using range images of several complex objects.

Next, we studied the problem of organizing a database of multiple views of a free-form 3D rigid object in a meaningful and efficient manner. This is important because a complex smooth object can give rise to infinitely many different views owing to its smoothly curved nature. We demonstrated how moment features derived from the shape spectrum of an object view are used to group views of objects of complex shape and geometry into compact and homogeneous clusters. The proposed method is general and easy to use, and offers a practical solution to the construction of view aspects of complex sculpted objects. We demonstrated with experimental results on a database of 6,400 views of 20 objects that view aspects can be determined for sculpted objects easily and effectively. We also demonstrated that when view grouping is exploited to structure a large model base of views, even with a relatively flat (two-tiered) arrangement, a small set of plausible correct matches to an input object view can be determined quickly. Experimental results on a database of 6,400 views of 20 objects show that when tested with 2,000 independent views, our matching technique examined on average only 20% of the database for correct classification of the test views.

We proposed and implemented a matching strategy that combines shape spectrum based model database pruning with graph-based search for establishing scene-model feature correspondences, thereby exploiting the advantages of these two techniques for fast and efficient recognition. We tested the generality and the effectiveness of our scheme on a database of 100 range images of several complex objects acquired using a 3D laser range scanner. The shape spectral feature based model selection module yielded a view classification accuracy of 94% over fifty independent test views when the top five out of ten clusters were examined in the database for each input image. The COSMOS-based view verification stage exhibited 82% accuracy in establishing the correct object identity of the test images. Better CSMP detection will aid in increasing this accuracy.

Our approach to pose estimation was that of deriving a robust registration between the input and the matched object views, thereby estimating the transformation between the range images of a free-form object.
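Given patch correspondences, one standard closed-form way to estimate such a transformation between corresponded 3D points is the SVD-based least-squares solution sketched below. It is offered only as an illustrative baseline: it is not the minimum variance estimator developed in this thesis, which, as described next, starts from a normal-based initial estimate and refines it iteratively.

```python
import numpy as np

def rigid_transform(P, Q):
    """Closed-form least-squares rigid motion (R, t) minimizing
    sum_i ||R P[i] + t - Q[i]||^2 for corresponded 3D point sets
    P, Q of shape (N, 3), via the SVD of the cross-covariance matrix."""
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mq - R @ mp
    return R, t
```

A weighted or iteratively refined scheme improves on this baseline when the correspondences themselves are noisy, which is the situation the registration approach below is designed for.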
An initial estimate of the transformation between the sensed and the stored images is determined using the surface normals, once the correspondences between the maximal patches visible in both views have been established by our recognition strategy. This initial estimate of the pose is then refined to obtain a more accurate set of translation and orientation parameters using an iterative minimization scheme. The registration-based approach thus compensates for the errors that may have been introduced in the initial pose due to poor localization and noise-corrupted surface normals.

The proposed representation, recognition and pose estimation schemes are designed to (i) handle general 3D rigid objects that are arbitrarily shaped, (ii) aid in easy model view selection from a large model database for recognition, and (iii) help in computing the object identity and pose accurately and robustly. The shortcomings of the COSMOS-based recognition system are: (i) the lack of explicit incorporation of edge information within the representation scheme, both in theory and in practical implementation; (ii) an unstable (with respect to changes in viewpoint), data-driven CSMP segmentation technique; (iii) the inability of the shape spectrum based matching scheme to handle model database pruning under occlusion of an input by other objects; and (iv) a graph-matching algorithm for final view verification that is likely to be slow on images that have been segmented into a very large number of CSMPs. It would also be more satisfying to test the representation and recognition schemes exhaustively on a larger set of objects than was possible within the constraints of this thesis. The shortcomings are not critical in the sense that there are clear means of overcoming them via further development, as outlined in the next section.

7.2 Future Research

COSMOS is a novel framework within which there are a number of exciting avenues for future research.

7.2.1 Incorporation of Explicit Edge Information within COSMOS

The COSMOS representation currently does not explicitly treat edges (curves of discontinuity of both surface depth and surface normals) on object surfaces. As indicated in Section 3.2.2, an edge can be viewed as the limiting case of a cylindrical surface with infinite curvedness, and a corner can be modeled as the limit of a spherical cup or cap shape. On the other hand, traditional edge detection algorithms explicitly represent edges as discontinuities separating homogeneous regions. While edges are often likely to be detected in our system as distinct patches as a byproduct of our segmentation algorithm, we have not made any special effort to detect them, as our focus has been on smooth curved objects. However, explicit edge representation may provide additional information when dealing with polyhedra, and may also have visual significance (high information content) in interactive/integrated human-computer vision systems. There is a whole body of earlier work [29] on edge detection which can be integrated into COSMOS. Future work can study how best to model edges theoretically and how to use them effectively during shape index based segmentation. One possibility, for example, is to use a traditional edge detection algorithm as a preprocessing step and then use the detected edges as constraints during the region-growing stage in COSMOS, as sketched below.
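A minimal sketch of such edge-constrained preprocessing follows, detecting depth (jump) discontinuities directly in the range image; the threshold value is a hypothetical, sensor-dependent choice, and the exact detector (e.g., a Canny-style operator [29]) is an open design decision.

```python
import numpy as np

def jump_edge_mask(z, thresh=0.005):
    """Boolean mask of pixels adjacent to a depth (jump) discontinuity:
    a pixel is marked if its depth differs from a 4-neighbor's by more
    than `thresh`.  z: (H, W) range image; `thresh` is a hypothetical,
    sensor-dependent value."""
    edge = np.zeros(z.shape, dtype=bool)
    dx = np.abs(np.diff(z, axis=1)) > thresh   # horizontal neighbors
    dy = np.abs(np.diff(z, axis=0)) > thresh   # vertical neighbors
    edge[:, :-1] |= dx
    edge[:, 1:] |= dx
    edge[:-1, :] |= dy
    edge[1:, :] |= dy
    return edge
```

The region-growing stage would then simply refuse to add masked pixels to any CSMP, so that patches cannot leak across depth discontinuities; crease (normal) discontinuities could be masked analogously by thresholding the angle between neighboring surface normals.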
7.2.2 Improving the Segmentation Algorithm

Since surface depth and normal discontinuities are currently not modeled explicitly within our scheme, we observed that adjacent CSMPs tend to blend with their neighboring regions during the region-growing process. An important future direction of research will be to integrate the region-growing segmentation algorithm with edge information to obtain CSMPs from a range image that are stable with respect to changes in viewpoint, for effective matching. In addition, shape index based discontinuity can also be defined and formulated in such a way that distinct shapes of objects do not blend with one another in the presence of noise in the sensed data.

7.2.3 Deriving COSMOS from a 3D Object Model

We have adopted a "collection of views" approach to modeling a 3D object. Another challenging and related issue is building the COSMOS representation for the entire 3D object from its multiple views, which can then serve as an object-centered representation of a free-form object. This problem has several components, such as (i) integrating the COSMOS representations derived from multiple views into a single representation, given the knowledge of the transformations between the multiple views, (ii) valid inferencing about the complete shape, area, and curvedness of a maximal patch that is only partially visible in several views, and (iii) accumulating evidence about its attributes from several partial pieces of information present in multiple views and collating them to form a complete and correct representation of the patch.

7.2.4 Better Distance Measures and Matching Efficiency

We informally studied the efficacy of the moment features derived from the shape spectra of object views in classifying an input view correctly, and found that only the first four moments contributed significantly to the correct classification of the input. The higher-order moments were low in magnitude and did not add much to the Euclidean distances computed for comparison of moment vectors. Alternative metrics, especially the Mahalanobis distance, could be used instead to compare the feature vectors and measure the similarity of views. The Mahalanobis distance ensures "equal" contributions of individual feature values to the distance computed between views. A thorough comparison can be made between these two distance measures to determine the utility of the high-order moments in the feature vectors derived from the shape spectra of views; a sketch of both measures appears at the end of this subsection.

A potential future contribution is to add more levels to the hierarchical database structure. For example, a set of object views can be organized based on their shape spectra into several categories: those that exhibit planar patches alone and those that exhibit other shapes in addition to planar patches. Given this broad organization, a fine-grain organization of the latter category into those views that contain purely nonconvex shapes and those that contain purely convex shapes can also be obtained. Given an input object view, its shape spectrum can be computed easily, and then, by descending through this hierarchy, it can be compared with only a small subset of views that are likely to best match it.
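The sketch below extracts moment features from a shape spectrum and compares them under both measures. The number of moments and the covariance matrix, which would have to be estimated from a set of training views, are assumptions made here for illustration.

```python
import numpy as np

def spectrum_moments(hist, edges, n_moments=4):
    """Mean plus 2nd..n-th central moments of a normalized shape spectrum,
    treating it as a distribution over shape index values."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean = float(np.sum(centers * hist))
    feats = [mean] + [float(np.sum(hist * (centers - mean) ** k))
                      for k in range(2, n_moments + 1)]
    return np.array(feats)

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def mahalanobis(u, v, cov_inv):
    """Rescales each feature by its spread (and decorrelates features), so
    that small-magnitude high-order moments still contribute on an equal
    footing with the dominant low-order moments."""
    d = u - v
    return float(np.sqrt(d @ cov_inv @ d))

# cov_inv would be estimated from training data, e.g. for an
# (n_views, n_features) matrix F of moment vectors:
#     cov_inv = np.linalg.inv(np.cov(F, rowvar=False))
```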
7.2.5 Occlusion

Currently, shape spectrum based pruning assumes that the shape spectrum computed from a view belongs to only a single region of interest, and performs the classification of that object view during matching. However, when objects overlap one another in the scene, if we cannot separate them into distinct regions, the shape spectrum computed would be that of the entire overlapping surface area in the image. Since adjacency information is not part of the shape spectrum, there is no way of distinguishing whether the spectrum was computed from a single region or from multiple regions in the image. In cases of occlusion, this spectrum-based pruning step can be avoided, and the graph-based search can be used directly to establish patch-group graph isomorphism and thus determine the object identity. Future study should investigate the use of discontinuities to separate the regions of interest, and then compute the shape spectrum of each of these regions sequentially and perform the matching; this is related to our suggestion in Section 7.2.1. Further, geometric reasoning based on model features has to be employed, once a representation of an input image has been derived, in order to detect that some features are missing in the sensed data and that this could be due to the object being occluded by others. Occlusion events have to be determined by exploiting the fact that adjacent features in the model must match adjacent features in the image, except when occlusion occurs or when there are errors in segmenting the regions of interest. Independent evidence for occlusion can be obtained by detecting the loss of support for the model features.

7.2.6 Integrating Color and Texture

Our discussion so far has dealt with the design of a geometry-based representation and recognition system for free-form objects. However, we believe that a robust recognition system benefits from using other cues such as color and texture. Therefore, a very interesting line of future work would be to investigate how color and texture features can be incorporated within the COSMOS representation of an object, and to explore the possibility of enhancing our matching scheme using features derived from these additional cues as indexing primitives.

Bibliography

[1] N. Ahuja and J. Veenstra. Generating octrees from object silhouettes in orthographic views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(2):137-149, 1989.

[2] Farshid Arman and J. K. Aggarwal. Automatic generation of recognition strategies using CAD models. In Proc. IEEE Workshop on Directions in Automated CAD-Based Vision, pages 124-133, Maui, Hawaii, 1991.

[3] Farshid Arman and J. K. Aggarwal. Model-based object recognition in dense range images: A review. Computing Surveys, 25(1):5-43, 1993.

[4] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183-193, 1954.

[5] R. Bajcsy and F. Solina. Three-dimensional object representation revisited. In Proc. First IEEE International Conference on Computer Vision, pages 231-240, London, 1987.

[6] A. H. Barr. Superquadrics and angle-preserving transformations. Computer Graphics and Applications, 1:11-23, 1981.

[7] H. G. Barrow and R. M. Burstall. Subgraph isomorphism, matching relational structures and maximal cliques. Information Processing Letters, 4(4):83-84, January 1976.

[8] R. Basri and S. Ullman. The alignment of objects with smooth surfaces. In Proc. Second IEEE International Conference on Computer Vision, pages 482-488, Tarpon Springs, FL, 1988.

[9] Ronen Basri. Viewer-centered representations in object recognition: A computational approach, chapter 5.4, pages 863-882. Handbook of Pattern Recognition & Computer Vision. World Scientific Publishing Company, 1993.
[10] Paul J. Besl. Surfaces in Range Image Understanding. Springer Series in Perception Engineering. Springer-Verlag, 1988.

[11] Paul J. Besl. The free-form surface matching problem. In Herbert Freeman, editor, Machine Vision for Three-Dimensional Scenes, pages 25-71. Academic Press, 1990.

[12] Paul J. Besl. Geometric signal processing. In Ramesh C. Jain and Anil K. Jain, editors, Analysis and Interpretation of Range Images, chapter 3, pages 141-205. Springer-Verlag, 1990.

[13] Paul J. Besl. Triangles as a primary representation. In Martial Hebert, Jean Ponce, Terry Boult, and Ari Gross, editors, Object Representation in Computer Vision, pages 191-206. Springer-Verlag, Berlin, 1995.

[14] Paul J. Besl and R. C. Jain. Three-dimensional object recognition. Computing Surveys, 17:75-145, 1985.

[15] Paul J. Besl and Neil D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.

[16] B. Bhanu. Representation and shape matching of 3-D objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(3):340-351, May 1984.

[17] B. Bhanu and T. Poggio. Introduction to the special section on learning in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(9):865-868, September 1994.

[18] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115-147, 1987.

[19] G. Blais and M. D. Levine. Registering multiview range data to create 3D computer graphics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):820-824, 1995.

[20] A. F. Bobick and R. C. Bolles. The representation space paradigm of concurrent evolving object descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):146-156, 1992.

[21] R. C. Bolles and R. A. Cain. Recognizing and locating partially visible objects: The local feature focus method. International Journal of Robotics Research, 1(3):57-82, 1982.

[22] R. C. Bolles and P. Horaud. 3DPO: A three-dimensional part orientation system. International Journal of Robotics Research, 5(3):3-26, 1986.

[23] Thomas M. Breuel. Adaptive model base indexing. In Proc. DARPA Image Understanding Workshop, pages 805-814, Palo Alto, California, 1989.

[24] Rodney A. Brooks. Model-based three-dimensional interpretations of two-dimensional images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):140-150, March 1983.

[25] Rodney A. Brooks. Model-Based Computer Vision. UMI Research Press, 1984.

[26] R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042-1052, 1993.

[27] Terry Caelli and Ashley Dreier. Variations on the evidence-based object recognition theme. Pattern Recognition, 27(2):185-204, 1994.

[28] Octavia I. Camps, Linda G. Shapiro, and Robert M. Haralick. PREMIO: An overview. In Proc. IEEE Workshop on Directions in Automated CAD-Based Vision, pages 11-21, Maui, Hawaii, June 1991.

[29] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6):679-698, 1986.

[30] C. Chen and A. Kak. A robot vision system for recognizing 3-D objects in low-order polynomial time. IEEE Transactions on Systems, Man, and Cybernetics, pages 1535-1563, 1988.

[31] Chien-Huei Chen and Prasanna G. Mulgaonkar. CAD-based feature-utility measures for automated vision programming. In Proc. IEEE Workshop on Directions in Automated CAD-Based Vision, pages 106-114, Maui, Hawaii, 1991.
[32] J. L. Chen and G. Stockman. Indexing to 3D model aspects using 2D contour features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 1996.

[33] Jin-Long Chen and G. C. Stockman. Determining pose of 3D objects with curved surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):52-56, January 1996.

[34] S. Chen and H. Freeman. Computing characteristic views of quadric-surfaced solids. In Proceedings of the 10th ICPR, pages 77-82, Atlantic City, NJ, 1990.

[35] Sei-Wang Chen and Anil K. Jain. Strategies of multiview and multi-matching for 3D object recognition. Computer Vision, Graphics and Image Processing, 57(1):121-130, January 1993.

[36] Sei-Wang Chen and George Stockman. Wing representation for rigid 3D objects. In Proc. 10th International Conference on Pattern Recognition, pages 398-402, Atlantic City, 1990.

[37] Tsu-Wang Chen and Wei-Chung Lin. A neural network approach to CSG-based 3-D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7):719-726, 1994.

[38] Yang Chen and Gérard Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145-155, April 1992.

[39] H. Chernoff. Using faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68:361-368, 1973.

[40] Jonathan H. Connell and Michael Brady. Generating and generalizing models of visual objects. Artificial Intelligence, 31:159-183, 1987.

[41] H. Delingette, M. Hebert, and K. Ikeuchi. Shape representation and image segmentation using deformable surfaces. Image and Vision Computing, 10(3):132-144, 1992.

[42] H. Delingette, M. Hebert, and K. Ikeuchi. A spherical representation for the recognition of curved objects. In Proc. Fourth IEEE International Conference on Computer Vision, pages 103-112, Berlin, 1993.

[43] Sven J. Dickinson, A. P. Pentland, and Azriel Rosenfeld. From volumes to views: An approach to 3-D object recognition. In Proc. IEEE Workshop on Directions in Automated CAD-Based Vision, pages 85-96, Maui, Hawaii, 1991.

[44] Sven J. Dickinson, Alex P. Pentland, and Azriel Rosenfeld. 3-D shape recovery using distributed aspect matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):174-198, February 1992.

[45] Chitra Dorai and Anil K. Jain. COSMOS - A representation scheme for free-form surfaces. In Proc. Fifth International Conference on Computer Vision, pages 1024-1029, Boston, Massachusetts, June 1995.

[46] Chitra Dorai and Anil K. Jain. Shape spectra based view grouping for free-form objects. In Proc. IEEE International Conference on Image Processing, volume III, pages 340-343, Washington, DC, October 1995.

[47] Chitra Dorai and Anil K. Jain. View organization and matching of free-form objects. In Proc. IEEE International Symposium on Computer Vision, pages 25-30, Coral Gables, Florida, November 1995.

[48] Chitra Dorai and Anil K. Jain. Recognition of 3D free-form objects. In Proc. 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996. To appear.

[49] Chitra Dorai, Gang Wang, Anil K. Jain, and Carolyn Mercer. From images to models: Automatic 3D object model construction from multiple views. In Proc. 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996. To appear.
[50] Chitra Dorai, John Weng, and Anil K. Jain. Optimal registration of multiple range views. In Proc. 12th International Conference on Pattern Recognition, pages 569-571, Jerusalem, Israel, October 1994.

[51] Sahibsingh A. Dudani, Kenneth J. Breeding, and Robert B. McGhee. Aircraft identification by moment invariants. IEEE Transactions on Computers, C-26(1):39-46, 1977.

[52] D. S. Weld (Ed.). The role of intelligent systems in the National Information Infrastructure. AI Magazine, 16(3):45-64, Fall 1995.

[53] S. Edelman and D. Weinshall. A self-organizing multiple-view representation of 3D objects. Biological Cybernetics, 64:209-219, 1991.

[54] H. Edelsbrunner, J. O'Rourke, and R. Seidel. Constructing arrangements of lines and hyperplanes with applications. SIAM Journal on Computing, 15:341-363, 1986.

[55] D. Eggert and K. Bowyer. Computing the orthographic projection aspect graph of solids of revolution. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 102-108, Austin, November 1989.

[56] D. Eggert and K. Bowyer. Computing the perspective projection aspect graph of solids of revolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(2):109-128, February 1993.

[57] Executive Office of the President, Office of Science and Technology Policy. The Federal High Performance Computing Program. Washington, DC, September 1989.

[58] Ting-Jun Fan, Gérard Medioni, and Ramakant Nevatia. Recognizing 3-D objects using surface descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1140-1157, 1989.

[59] O. D. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects. International Journal of Robotics Research, 5(3):27-52, 1986.

[60] W. Feller. An Introduction to Probability Theory and its Applications, volume II. John Wiley & Sons, Inc., New York, 2nd edition, 1971.

[61] F. P. Ferrie, J. Lagarde, and P. Whaite. Darboux frames, snakes, and superquadrics: Geometry from the bottom-up. In IEEE Workshop on Interpretation of 3-D Scenes, pages 170-176, Austin, Texas, 1989.

[62] F. P. Ferrie and M. D. Levine. Integrating information from multiple views. In IEEE Workshop on Computer Vision, pages 117-122, Miami Beach, FL, 1987.

[63] F. P. Ferrie, S. Mathur, and G. Soucy. Feature extraction for 3-D model building and object recognition. In Anil K. Jain and Patrick J. Flynn, editors, Three-Dimensional Object Recognition Systems, pages 57-88. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1993.
IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 13(10):1066—1075, 1991. Patrick J. Flynn and A. K. Jain. 3D object recognition using invariant feature indexing of interpretation tables. CVCIP: Image Understanding, 55(2):119—129, 1992. Patrick J. Flynn and Anil K. Jain. Three-Dimensional object recognition. In Tzay Y. Young, editor, Handbook of Pattern Recognition and Image Processing, volume 2, chapter 14, pages 497—541. Academic Press, 1994. D. A. Forsyth, J. L. Mundy, A. Zisserman, C. Coelho, A. Heller, and C. Roth- well. Invariant descriptors for 3—D object recognition and pose. IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 14:971—992, 1992. [72] [73] [74] [75] [76] [77] [78] [79] [80] 220 H. Freeman and I. Chakravarty. The use of characteristic views in the recogni— tion of three—dimensional objects. In E. Gelsema and L. Kanal, editors, Pattern Recognition in Practice. North-Holland Publishing Co., Amsterdam, 1980. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Boston, 1990. Z. Gigus, J. Canny, and R. Seidel. Efficiently computing and representing aspect graphs of polyhedral objects. In Proc. Second IEEE International Conference on Computer Vision, pages 30—39, Tarpon Springs, 1988. Z. Gigus and J. Malik. Computing the aspect graph for line drawings of poly- hedral objects. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 654—661, Ann Arbor, 1988. W. E. L. Grimson and D. P. Huttenlocher. On the sensitivity of the Hough transform for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, l2(3):255—274, 1990. W. E. L. Grimson and D. P. Huttenlocher. On the verification of hypothesized matches in model-based recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(12):1201—1213, 1991. W. E. L. Grimson and T. Lozano-Pérez. Localizing overlapping parts by search- ing the interpretation tree. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 9(4):469—482, July 1987. W. Eric L. Grimson. The combinatorics of heuristic search termination for object recognition in cluttered environments. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(9):920-935, September 1991. W. Eric L. Grimson and Tomas Lozano—Pérez. Model-based recognition and localization from sparse range or tactile data. International Journal of Robotics Research, 3(3):3—35, Fall 1984. [81] [82] [83] [84] [85] [86] [87] [88] 221 A. Gupta, L. Bogoni, and R. Bajcsy. Quantitative and qualitative measures for the evaluation of the superquadric models. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 162—169, Austin, 1989. Alok Gupta and Ruzena Bajcsy. Surface and volumetric segmentation of range images using biquadrics and superquadrics. In Proc. 11th International Con- ference on Pattern Recognition, pages 158—162, The Hague, The Netherlands, 1992. C. Hansen and T. Henderson. CAGD—based computer vision. IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 11(11):1181—1193, Novem- ber 1989. Patrick Hebert, Denis Laurendeau, and Denis Pousart. Scene reconstruction and description: Geometric primitive extraction from multiple view scattered data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 286—292, New York City, NY, 1993. B. K. P. Horn. Extended Gaussian image. Proceedings of the IEEE, 72:1671— 1686, 1984. B. Horowitz and A. P. Pentland. Recovery of non-rigid motion and structure. In Proc. 
[87] Daniel P. Huttenlocher and Shimon Ullman. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195-212, 1990.

[88] K. Ikeuchi. Generating an interpretation tree from a CAD model for 3D-object recognition in bin-picking tasks. International Journal of Computer Vision, 1:145-165, 1987.

[89] Katsushi Ikeuchi and Ki Sang Hong. Determining linear shape change: Toward automatic generation of object recognition programs. Computer Vision, Graphics and Image Processing, 53(2):154-170, 1991.

[90] Katsushi Ikeuchi and Takeo Kanade. Automatic generation of object recognition programs. Proceedings of the IEEE, 76(8):1016-1035, August 1988.

[91] A. K. Jain and P. J. Flynn (Eds.). 3D Object Recognition Systems. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1993.

[92] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.

[93] Anil K. Jain and Richard L. Hoffman. Evidence-based recognition of 3-D objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):783-802, November 1988.

[94] Ray Jarvis. Range sensing for computer vision. In Anil K. Jain and Patrick J. Flynn, editors, Three-Dimensional Object Recognition Systems, pages 17-56. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1993.

[95] T. Joshi, J. Ponce, B. Vijayakumar, and D. J. Kriegman. HOT curves for modelling and recognition of smooth curved 3D objects. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 876-880, Seattle, Washington, June 1994.

[96] S. B. Kang and K. Ikeuchi. The complex EGI: A new representation for 3-D pose determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(7):707-721, 1993.

[97] Daniel Keren, David Cooper, and Jayashree Subrahmonia. Describing complicated objects by implicit polynomials. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:38-53, 1994.

[98] Whoi-Yul Kim and Avinash C. Kak. 3-D object recognition using bipartite matching embedded in discrete relaxation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3):224-251, 1991.

[99] J. J. Koenderink and A. J. van Doorn. Internal representation of solid shape with respect to vision. Biological Cybernetics, 32(4):211-216, 1979.

[100] J. J. Koenderink and A. J. van Doorn. The internal representation of solid shape with respect to vision. Biological Cybernetics, 32:211-216, 1979.

[101] J. J. Koenderink and A. J. van Doorn. Surface shape and curvature scales. Image and Vision Computing, 10(8):557-565, October 1992.

[102] Jan J. Koenderink. Solid Shape. The MIT Press, 1990.

[103] D. J. Kriegman and J. Ponce. Computing exact aspect graphs of curved objects: Solids of revolution. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 116-122, Austin, 1989.

[104] R. Krishnapuram and D. Casasent. Determination of three-dimensional object location and orientation from range images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1158-1167, 1989.

[105] Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efficient model-based recognition scheme. In Proc. Second IEEE International Conference on Computer Vision, pages 238-249, Tarpon Springs, Florida, December 1988.

[106] Yehezkel Lamdan, Jacob T. Schwartz, and Haim J. Wolfson. Affine invariant model-based object recognition. IEEE Transactions on Robotics and Automation, 6(5):578-589, 1990.

[107] Stephane Lavallée and Richard Szeliski. Recovering the position and orientation of free-form objects from image contours using 3D distance maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4):378-390, April 1995.
on Robotics and Automation, 6(5):578—589, 1990. Stephane Lavalle and Richard Szeliski. Recovering the position and orientation of free-form objects from image contours using 3D distance maps. IEEE Trans- [108] [109] [110] [111] [112] [113] [114] [115] [116] 224 actions on Pattern Analysis and Machine Intelligence, 17(4):378—390, April 1995. Ying Li and R. J. Woodham. Orientation-based representations of 3-D shape. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 182—187, Seattle, Washington, June 1994. P. Liang and C. H. Taubes. Orientation-based differential geometric representa- tions for computer vision applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(3):249-258, 1994. P. Liang and J. Todhunter. Representation and recognition of surface shapes in range images: A differential geometry approach. Computer Vision, Graphics and Image Processing, 52:78—109, 1990. Chia—Wei Liao and Gérard Medioni. Representation of range data with B-spline suface patches. In Proc. 11th International Conference on Pattern Recognition, volume 3, pages 745—748, The Hague, The Netherlands, 1992. David G. Lowe. Three-dimensional object recognition from single two— dimensional images. Artificial Intelligence, 31:355—395, 1987. D. Marr. Vision. W. H. Freeman and Company, 1982. D. Marr and H. K. Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. In Proc. Royal Society, London, ser. B, volume 200, pages 269—294, 1978. Hiroshi Matsuo and Akira Iwata. 3-D object recognition using MEGI model from range data. In Proc. 12th International Conference on Pattern Recognition, pages 843-846, Jerusalem, Israel, October 1994. Richard S. Millman and George D. Parker. Elements of Differential Geometry. Prentice—Hall, 1977. 225 [117] Hiroshi Murase and Shree K. Nayar. Visual learning and recognition of 3-D [118] [119] [120] [121] [122] [123] [124] [125] [126] objects from appearance. International Journal of Computer Vision, 14(1):5— 24, 1995. V. S. Nalwa. Representing oriented piecewise C2 surfaces. International Journal of Computer Vision, 3:131-153, 1989. Shree K. Nayar, Hiroshi Murase, and Sameer A. Nene. Learning, positioning and tracking visual appearance. In Proc. IEEE Conference on Robotics and Automation, volume 4, pages 3237—3244, San Diego, California, 1994. Timothy S. Newman. Experiments in 3D CAD-based Inpection using Range Im- ages. PhD thesis, Michigan State University, Department of Computer Science, 1993. E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, Hertfordshire, 1983. Masaki Oshima and Yoshiaki Shirai. Object recognition using three-dimensional information. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-5(4):353—361, 1983. A. Pentland and S. Sclaroff. Closed-form solutions for physically based shape modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:715—729, July 1991. A. P. Pentland. Perceptual organization and the representation of natural form. Artificial Intelligence, 28:293—331, May 1986. A. P. Pentland. Automatic extraction of deformable part models. International Journal of Computer Vision, 4:107—126, 1990. W. H. Plantinga and C. R. Dyer. Visibility, occlusion, and the aspect graph. International Journal of Computer Vision, 5:137—160, 1990. [127] [128] [129] [130] [131] [132] [133] [134] 226 T. Poggio and S. Edelman. A network that learns to recognize three—dimensional objects. Nature, 343:263—266, 1990. A. V. 
[128] A. V. Pogorelov. Extrinsic Geometry of Convex Surfaces, volume 35 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1973.

[129] J. Ponce, A. Hoogs, and D. J. Kriegman. On using CAD models to compute the pose of curved 3D objects. CVGIP: Image Understanding, 55(2):184-197, 1992.

[130] J. Ponce, D. J. Kriegman, S. Petitjean, S. Sullivan, G. Taubin, and B. Vijayakumar. Representations and algorithms for 3D curved object recognition. In Anil K. Jain and Patrick J. Flynn, editors, Three-Dimensional Object Recognition Systems, pages 17-56. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1993.

[131] M. Potmesil. Generating models for solid objects by matching 3D surface segments. In Proc. International Joint Conference on Artificial Intelligence, pages 1089-1093, Karlsruhe, Germany, 1983.

[132] N. S. Raja and A. K. Jain. Recognizing geons from superquadrics fitted to range data. Image and Vision Computing, 10(3):179-190, 1992.

[133] N. S. Raja and A. K. Jain. Obtaining generic parts from range images using a multi-view representation. Computer Vision, Graphics and Image Processing, 60(1):44-64, July 1994.

[134] A. A. G. Requicha. Representations for rigid solids: Theory, methods, and systems. Computing Surveys, 1980.

[135] L. G. Roberts. Machine perception of three-dimensional solids. In James T. Tippett, David A. Berkowitz, Lewis C. Clapp, Charles J. Koester, and Alexander Vanderburgh, Jr., editors, Optical and Electro-Optical Information Processing, pages 159-197. MIT Press, Cambridge, Massachusetts, 1965.

[136] Hanan Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.

[137] Steven R. Schwartz and Benjamin W. Wah. Machine learning of computer vision algorithms. In Tzay Y. Young, editor, Handbook of Pattern Recognition and Image Processing: Computer Vision, volume 2, chapter 14, pages 319-359. Academic Press, 1994.

[138] W. B. Seales and C. R. Dyer. Viewpoint from occluding contour. Computer Vision, Graphics and Image Processing, 55(2):198-211, 1992.

[139] M. Seibert and A. M. Waxman. Adaptive 3-D object recognition from multiple views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):107-123, February 1992.

[140] K. Sengupta and K. L. Boyer. Organizing large structural modelbases. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(4):321-332, 1995.

[141] L. G. Shapiro and H. Lu. The use of a relational pyramid representation for view classes in a CAD-to-vision system. In Proc. 9th International Conference on Pattern Recognition, pages 379-381, Rome, 1988.

[142] T. M. Silberberg, L. Davis, and H. Harwood. An iterative Hough procedure for three-dimensional object recognition. Pattern Recognition, 17(6):621-629, 1984.

[143] Sarvajit S. Sinha and Ramesh Jain. Range image analysis. In Tzay Y. Young, editor, Handbook of Pattern Recognition and Image Processing: Computer Vision, volume 2, chapter 14, pages 185-237. Academic Press, 1994.

[144] F. Solina and R. Bajcsy. Recovery of parametric models from range images: The case for superquadrics with global deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):131-147, February 1990.

[145] T. Sripradisvarakul and R. Jain. Generating aspect graphs for curved objects. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 109-115, Austin, 1989.
Achieving generalized object recognition through reasoning about association of function to structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:1097—1104, 1991. Fridtjof Stein and Gérard Medioni. Structural indexing: Efficient 3-D object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2y125—145,1992. J. Stewman and K. Bowyer. Creating the perspective projection aspect graph of polyhedral objects. In Proc. Second IEEE International Conference on Com- puter Vision, pages 494—500, Tarpon Springs, 1988. J. H. Stewman and K. W. Bowyer. Aspect graphs for convex planar-face objects. In Proc. IEEE Workshop on Computer Vision, pages 123—130, Miami Beach, 1987. George C. Stockman. Object recognition and localization via pose clustering. Computer Vision, Graphics, and Image Processing, 40:361—387, 1987. Robert S. Strichartz. A Guide to Distribution Theory and Fourier Transforms. CRC Press, Boca Raton, 1994. P. Suetens, P. Fua, and A. J. Hanson. Computational strategies for object recognition. ACM Computing Surveys, 24(1):5—61, March 1992. Daniel L. Swets. The self-organizing hierarchical optimal subspace learning and inference framework for object recognition. PhD thesis, Michigan State Univer- sity, Department of Computer Science, East Lansing, Michigan, 1996. M. Tar and S. Pinker. Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21:233—282, 1989. 229 [155] G. Taubin. Estimation of planar curves, surfaces, and nonplanar space curves [156] [157] [158] [159] [160] [161] [162] [163] defined by implicit equations with applications to edge and range image seg- mentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(11):1115—1138, Nov. 1991. G. Taubin, F. Cukierman, S. Sullivan, J. Ponce, and D. J. Kreigman. Parametrized and fitting bounded algebraic curves and surfaces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 103—108, Champaign, Illinois, 1992. D. Terzopoulos and D. Metaxas. Dynamic 3D models with local and global de— formations: Deformable superquadrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:703-714, 1991. Demetri Terzopoulos, Andrew Witkin, and Michael Kass. Constraints on de- formable models: Recovering 3D shape and nonrigid motion. Artificial Intelli- gence, 36(1):91-123, 1988. Greg Turk and Marc Levoy. Zippered polygon meshes from range images. In SIGGRAPH .94, pages 311—318, 1994. M. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71—86, 1991. Jerry L. Turney, Trevor N. Mudge, and Richard A. Volz. Recognizing Partially Occluded Parts. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI—7(4):410—421, 1985. S. Ullman and R. Basri. Recognition by linear combination of models. IEEE Transactions on Pattern Analysis and Machine Intelligence, l3(10):992-—1006, October 1991. Shinji Umeyama. Parameterized point pattern matching and its application to recognition of object families. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(2):l36-144, 1993. [164] [165] [166] [167] [168] [169] [170] [171] [172] 230 A.J. Vayda and A.C. Kak. A robot vision systems for recognition of generic shaped objects. Computer Vision, Graphics, and Image Processing: Image Understanding, 54(1):1—46, July 1991. B. C. Vemuri and J. K. Aggarwal. 3-D model construction from multiple views using range and intensity data. In Proc. 
[166] B. C. Vemuri and J. K. Aggarwal. Representation and recognition of objects from dense range maps. IEEE Transactions on Circuits and Systems, CAS-34(11):1351-1363, November 1987.

[167] B. C. Vemuri, A. Mitiche, and J. K. Aggarwal. Curvature-based representation of objects from range data. Image and Vision Computing, 4(2):107-114, May 1986.

[168] Wu Wang and S. S. Iyengar. Efficient data structures for model-based 3-D object recognition and localization from range data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):1035-1045, October 1992.

[169] N. A. Watts. Calculating the principal views of a polyhedron. In Proceedings of the 9th ICPR, pages 316-322, Rome, 1988.

[170] J. Weng. On comprehensive visual learning. In Proc. NSF/ARPA Workshop on Performance vs. Methodology in Computer Vision, pages 152-166, Seattle, WA, June 24-25, 1994.

[171] J. Weng, Paul Cohen, and Nicolas Rebibo. Motion and structure estimation from stereo image sequences. IEEE Transactions on Robotics and Automation, 8(3):362-382, June 1992.

[172] J. Weng, T. S. Huang, and N. Ahuja. Motion and structure from two perspective views: Algorithms, error analysis, and error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):451-476, 1989.

[173] Peter R. Wilson. Conic representations for shape description. IEEE Computer Graphics and Applications, 7(4):23-30, 1987.

[174] A. K. C. Wong, S. W. Lu, and M. Rioux. Recognition and shape synthesis of 3D objects based on attributed hypergraphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(3):279-290, 1989.