Image Sequence Analysis: Motion and Structure Estimation with Transitory Sequences and Recognition of Hand Signs

By Yuntao Cui

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Computer Science Department

1996

ABSTRACT

Image Sequence Analysis: Motion and Structure Estimation with Transitory Sequences and Recognition of Hand Signs

By Yuntao Cui

This thesis reports two pieces of work related to image sequence analysis. First, we investigate the problem of a calibrated stereo camera system traveling in an unknown environment and automatically building a 3D range map of the scene. Due to the dynamic sensing process, the obtained image sequence is transitory by our definition, in that no component of the scene is visible through the entire sequence. We show that the integration of a transitory sequence has properties that are very different from those of a non-transitory one. Two representations, world-centered (WC) and camera-centered (CC), behave very differently with a transitory sequence. The asymptotic error properties derived indicate that one representation is significantly superior to the other, depending on whether one needs camera-centered or world-centered estimates. Based on these properties, we introduce an efficient "cross-frame" estimation technique for the CC representation. For the WC representation, our analysis indicates that a good technique should be based on the global pose of the camera instead of interframe motions. In addition to testing with synthetic data, rigorous experiments have been conducted on a real-image sequence taken by a fully calibrated camera system. The experimental results have demonstrated good accuracy.

Secondly, we focus on the problem of recognizing hand signs from intensity image sequences. We present a new framework to recognize hand signs from intensity image sequences. The framework has two major components: segmentation and recognition. A prediction-and-verification scheme using attention images from multiple fixations is presented to segment hands from input images. A major advantage of this scheme is that it can handle a large number of different deformable and articulated objects presented in complex backgrounds. The scheme is also relatively efficient since the segmentation is guided by past knowledge through a prediction-and-verification scheme. During recognition, the system uses multiclass, multidimensional discriminant analysis to automatically select the most discriminating features for gesture classification. A recursive partition tree approximator is presented to perform the classification. The framework has been tested to recognize 28 different hand signs. The experimental results show that the system can achieve a 93% recognition rate for test sequences that have not been used in the training phase.

To my wife Yu Zhong and my parents

ACKNOWLEDGMENTS

I would like to acknowledge all the people who have assisted me throughout my studies at Michigan State University. I am particularly grateful to my adviser, Dr. John J. Weng, for his advice and encouragement. He provided me with numerous ideas, suggestions and comments, which have been very beneficial to my development as a researcher. Dr. Anil K. Jain provided critiques and encouragement that were very useful for this work.
I am grateful to Dr. George Stockman for offering an excellent course on deformable objects and image databases which I took and learned so much from it. My gratitude also goes to my other committee members, Drs. Dennis Gilliland and Richard Hallgren, for their critiques of my research and for their personal support. My sincere thanks go to my dear wife Yu Zhong for her patience, encouragement, support, and understanding. She has provided many insightful suggestions and useful references. In my work, I used several pieces of software developed by Dan Swets and Shaoyun Chen, my fellow labmates working with Dr. Weng. Thanks also go to Yu Zhong, Kal Rayes, Doug Neal, and Valerie Bolster for making themselves available for the experiments. While working in the PRIP Laboratory, I also met a number of knowledgeable people from different parts of the world. Jinlong Chen, Shaoyun Chen, Yao Chen, Chitra Dorai, Hansye Dulimarta, Lin Hong, Sally Howden, Qian Huang, Wey Shiuan Hwang, Kalle Karu, Yonghong Li, Jianchang Mao, Karissa Elizabeth Miller, Aniati Murni, Tim Newman, Sharathcha Pankanti, Nalini Ratha, Dan Swets, Oivind Due Trier, Patchrawat Uthaisombut, Aditya Vailaya, Gang Wang, Marilyn Wulfekuhler, and Yu Zhong, are among them. I greatly appreciate Dr. Ray R. Brodeur and Ms. Maria Eppler. They have edited and proofread the thesis and made it more readable. vi TABLE OF CONTENTS LIST OF FIGURES x LIST OF TABLES xiii 1 Introduction 1 1.1 Motion and Structure Estimation from Image Sequences ......... 4 1.1.1 Feature-based and flow-based motion estimation approaches ...... 4 1.1.2 Monocular and stereo image sequence .................. 6 1.1.3 Perspective and parallel projection .................... 7 1.1.4 Two-view vs. multiple-view ........................ 8 1.2 Integration of Transitory Image Sequences ................. 9 1.3 Interpretation of Human Action ....................... 12 1.4 Hand Sign Recognition ............................ 14 2 Feature-Based Approaches for Motion Estimation 19 2.1 General Problem ............................... 20 2.2 Motion Model ................................. 22 2.3 2D—to—2D Correspondence .......................... 24 2.4 3D-to—3D Correspondences .......................... 29 2.4.1 Point features ................................ 29 2.4.2 Motion from stereo images ......................... 30 2.5 2D-to—3D Correspondences .......................... 34 2.6 Motion from Long Sequences ........................ 38 2.6.1 Kalman filter ................................ 38 2.6.2 Other approaches .............................. 40 3 Transitory Image Sequences, Asymptotic Properties, and Estima- tion of Motion and Structure 44 3.1 Basic Concepts ................................ 45 3.2 Asymptotic Error Properties of Integrations ................ 49 3.2.1 World-centered representation ....................... 50 3.2.2 Camera-centered representation ...................... 56 3.2.3 The tightness of the error rates ...................... 59 3.3 Methods and Algorithms ........................... 68 3.3.1 Cross-frame approach with CC representation .............. 69 3.3.2 World-centered representation ....................... 75 3.4 Experiments .................................. 77 vii 3.4.1 Simulation Data .............................. 78 3.4.2 Experiments with a real setup ....................... 83 3.5 Conclusions .................................. 95 4 Hand Sign Recognition 97 4.1 Glove—Based Systems ............................. 
98 4.1.1 Glove devices ................................ 98 4.1.2 Interpreting hand sign with gloved-based devices ............ 99 4.2 Vision-Based Approach ............................ 100 4.2.1 Segmentation ................................ 100 4.2.2 Recognition ................................. 103 5 Overview of the Approach 110 5.1 Time as a Dimension ............................. 111 5.2 Recognition of Spatiotemporal Pattern ................... 112 6 Hand Segmentation from Attention Images Based on Eigen-subspace Learning 1 16 6.1 Learning .................................... 118 6.1.1 Karhunen-Loeve projection ........................ 118 6.1.2 Simulated fovea image ........................... 121 6.2 Segmentation ................................. 122 6.2.1 Reconstruction ............................... 122 6.2.2 Dynamic Deformation ........................... 124 6.3 Experiments .................................. 127 6.3.1 Training ................................... 127 6.3.2 Testing ................................... 128 6.4 Conclusions .................................. 129 7 Hand Segmentation Using a Prediction-and-Verification Scheme 132 7.1 Valid Segmentation .............................. 135 7.1.1 Karhunen-Loeve projection ........................ 137 7.1.2 Approximation as function interpolation ................. 138 7.1.3 Valid segmentation ............................. 140 7.2 Predication for Valid Segmentation ..................... 140 7.2.1 Overview .................................. 140 7.2.2 Organization of attention images from fixations ............. 143 7.2.3 Prediction as querying the training set .................. 145 7.3 Experiments .................................. 153 7.3.1 Training ................................... 153 7.3.2 Hand segmentation ............................. 155 7.4 Conclusions and Future Work ........................ 159 ix 8 View-Based Hand Sign Recognition from Intensity Image Sequences 160 8.1 Nearest Neighbor Approximator in the MEF Space ............ 163 8.2 Approximation Using Recursive Partition Tree in the MDF Space . . . . 164 8.2.1 The Most Discriminating Features (MDF) ................ 164 8.2.2 Curse of dimensionality and the DKL projection ............. 167 8.2.3 Recursive partition tree .......................... 167 8.3 Convergence of the Approximators ..................... 171 8.4 k Nearest Neighbors ............................. 175 8.5 Experimental Results ............................. 175 8.5.1 Results of the nearest neighbor approximator in the MEF space . . . . 176 8.5.2 Results of the recursive partition tree approximator in the MDF space 178 8.5.3 k nearest neighbors ............................. 181 8.5.4 Experiments related to MDF ....................... 181 8.6 Conclusions .................................. 185 9 Summary and Future Work 187 9.1 Summary ................................... 187 9.1.1 Integration of transitory image sequences ................. 187 9.1.2 Hand Sign Recognition ........................... 189 9.2 Future Work ................................. 191 9.2.1 Integration of transitory image sequences ................. 191 9.2.2 Hand sign recognition ........................... 192 1.1 1.2 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 3.5 3.6 LIST OF FIGURES Perspective projection ............................. The twenty eight different signs used in the experiment. 
(1) sign of “angry”; (2) “any”; (3) “boy”; (4) “yes”; (5) “cute”; (6) “fine”; (7) “funny”; (8) “girl”; (9) “happy”; (10) “hi”; (11) “high”; (12) “hot”; (13) “later”; (14) “low”; (15) “no”; (16) “nothing”; (17) “of course”; (18) “0k”; (19) “parent”; (20) “pepper”; (21) “smart”; (22) “sour”; (23) “strange”; (24) “sweet”; (25) “thank you”; (26) “thirsty”; (27) “welcome”; (28) “wrong” (Bornstein and Saulnier 1989) ................................ Basic perspective geometry. Lower case letters refer to coordinates in the object space and upper case letters refer to coordinates on the image plane. f is the focal length. .............................. Constraint on the motion parameters derived from point matches. ...... Stereo triangulation and elongated uncertainty shape. The true point can lie anywhere inside the shaded uncertainty region. .............. Imaging geometry for 2D—to—3D correspondence problem. A point p in coor- dinate system oasyz is imaged at location P’ on the image plane which is specified in coordinate system o’z’y'z’ .................... Two systems of reference: (a) world-centered and (b) camera-centered ..... Transitory and non-transitory sequences. (a) non-transitory. (b) simple transi- tory. (c) general transitory .......................... Using cross-frame motions to integrate many views. Each elongated ellipse indicates the uncertainty in 3—D point position transformed from a single previous stereo view to the current view. The integrated uncertainty is greatly reduced using the multiple cross-frame motions instead of interframe motions. .................................. Simulation environment, where 7000mm distance is covered by the 31 frames. Camera global pose error versus time. The CC representation on the left column and WC on the right. (a) and (e): Error in rotation matrix. (b) and (f): Error in translation vector. (c) and (g): Error in ray-component of the translation vector. ((1) and (h): Error in z-component of the translation vector. ................................... Structure error versus time. The CC representation on the left column and WC on the right. (a) and (e): Global structure error. (b) and (f): my-component of the global structure error. (c) and (g): Local structure error. ((1) and (h): my-component of the local structure error. .............. X 16 34 46 48 xi 3.7 The robot and a few stereo frames in the 151-frame sequence. (a) Robot. (b) Left image of frame 0. (c) Left image of frame 50. ((1) Left image of frame 100. (e) Left image of frame 150. ..................... 3.8 The tracking record of the feature points through the 151 frames in the sequence. If a point k is successfully tracked from frame 2' to frame j, a vertical line is shown at point number A: from frame 2' to frame j. (Due to the limit of the printer resolution, lines are merged in the plot.) .............. 3.9 Stereo matching and temporal matching-and-tracking. (a) An example of stereo matching (frame 0). (b) An example of temporal matching and tracking (frame 24 to 69). A needle is draw from the feature point to its position in the target frame. Due to camera vergence, the orientation of the needles in (a) is correct ................................. 3.10 Sample test points on one frame. Each cross shows the location of a test point 3.11 Camera rotation error versus time. (a) and (d): Yaw error. (b) and (6): Roll error. (c) and (f): Pitch error. ................... 3.12 Camera position error versus time. (a) and (e): Camera position (y— and z-components). 
(b) and (f): Position error (ac-component). (c) and (g): Position error (y-component). (d) and (h): Position error (z-component). .............................. 3.13 Structure error. (a) and (d): Global structure error. (b) and (e): Global structure error (z- and y-components). (c) and (f): Global structure error (z-component). ........................... 3.14 Reconstructed 3D surface integrated from many partial views in the se— 86 88 89 91 92 93 quence, shown with original intensity viewed from an arbitrary direction. 94 3.15 Weighted texture mapping from pixel to the reconstructed surface. ...... 5.1 The sign “no” and its image sequence representation. (a) The sign of “no”, snap middle finger, index, and thumb together. (b) The sequence representation of the sign “no”. .............................. 5.2 The three-stage framework for Spatiotemporal event recognition ........ 6.1 A 2D illustration of Karhunen-Loeve projection ................. 6.2 An illustration of the reconstruction ....................... 6.3 Twenty five different hand shapes used in the experiments. .......... 6.4 The SF18 of the training samples. ....................... 6.5 Segmentation results of the fovea images which are not used in the training. (a) Input fovea images. (b) The results of the reconstruction. We also blur the results of the reconstruction. (c) The results of applying masks to the original input fovea images. The masks were derived using the dynamic spring network model to the reconstructed images. ((1) The SF13 of the images in (c). (e) The contours of the nearest neighbors are superimposed onto the input images. (f) Segmentation results ............... 7.1 An illustration of two level fixations of an input hand image. ......... 7.2 The illustration of constructing attention images. ............... 95 113 120 124 127 127 131 133 136 139 7.4 7.5 7.6 7.7 7.8 7.9 7.10 7.11 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 8.10 xii Overview of the segmentation scheme .................... 142 A 2-D illustration of a hierarchical quasi-Voronoi diagram and the corre- sponding recursive partition tree. (a) In partition, the label indicates the center of a cell. The label of the child to which its parent’s center belongs is not shown due to the lack of space. (b) The corresponding recursive partition tree. ............................... 144 A 2D illustration of nearest neighbor query theorems ............ 147 A representative subset of hand shapes used in the experiment ....... 153 The attention images from 19 mechanical fixations of a training sample.155 Eight sample sequences. From top to bottum, they represent the signs “happy”, “hot”, “nothing”, “parent”, “pepper”, “smart”, “welcome”, and “yes” ..................................... 156 Results of motion-based attention are shown using dark rectangular windows.l57 The results of the segmentation are shown after masking off the background. 158 A 2D illustration of the most discriminating features (MDF). The MDF is projection along 21. The MEF along yl can not separate the two subclasses. 164 A 2-D illustration of a hierarchical Voronoi diagram and the corresponding recursive partition tree. (a) The partition, where the circles and rectan- gulars indicate the samples used to create the Voronoi diagram. (b) The corresponding recursive partition tree .................... 169 Illustration of components in the Mixture Distance in a 3D original space. 171 Top ten MEF’s ............................... 177 Performance of the nearest neighbor approximator in the MEF space. 
The performance is given as a function of the number of MEF’s used. ..... 178 Performance of the two different nearest neighbor query approaches: linear vs. quasi-Voronoi diagram. ........................ 179 Top ten MDF’s ............................... 180 The difference between MEF and MDF in representing samples. (a) Samples represented in the subspace spanned by the first two MEFs. (b) Samples represented in the subspace spanned by the first two MDFs. The numbers in the plot are the class labels of the samples. ............... 183 Two sample sequences of signs “of course” (a) and “wrong” (b). ..... 184 The difference between the MEF and MDF. (a) Reconstruction based on the first MDF. (b) Reconstruction based on 95% MEFs. ......... 185 3.1 3.2 3.3 6.1 7.1 7.2 8.1 8.2 8.3 LIST OF TABLES Asymtotic rate for error covariance matrix in integration ......... 58 Some Data for The Real Setup ....................... 89 Average Image Plane Residual ........................ 90 Summary of the segmentation results .................... 129 The list of fixation scale and position .................... 154 Summary of the experimental data ..................... 158 The number of MEF’s vs. the variation ratio ................ 177 Summary of the experimental data for RPTA ............... 180 Summary of the experimental data for kNN ................ 182 xiii Chapter 1 Introduction The goal of a machine vision system is to recover useful information about a scene from its two—dimensional projections. Early computer vision systems were concerned primarily with static scenes. Recently, some computer vision systems are being de— signed to analyze dynamic scenes for different applications. The input to a dynamic scene analysis system is a sequence of image frames taken from moving objects. The camera used to acquire the image sequence may also be in motion. Mounting evidence, accumulated over the past century and especially of late, leaves no doubt that motion is indeed a fundamental visual dimension like color and stereopsis [138]. The functional benefits of human image motion processing are listed as follows. 0 Depth reconstruction. The motion of objects can reveal their shapes. Ullman [181] has an ingenious demonstration of this point: a set of dots is projected onto a screen. When the dots are stationary, an observer sees merely a screenful 2 of randomly distributed dots. When they move, however, the display springs to life, and the observer sees two cylinders rotating in opposite directions. 0 Image segmentation. Related to the problem of depth measurement is the need to parse the complex pattern of illumination in the optic array into different physical objects and to distinguish “figure” from “ground”. Motion is eminently suited for this job because of the mathematical relation between neighboring points in the optical velocity field at the edges of objects. Points which are well within the boundaries of a visual object, for example, generally have the same or very similar velocities between neighboring points whereas this is not necessarily the case at the boundary of an object in an image. 0 Motion in a proprioceptive sense. Gibson [72] suggested that visual motion was one of the primary sources of information for the moving organism to know about its own motion in relation to its environment. Early work by Lee and Anderson [112] measured the importance of optical flow information for pos- tural control. 
Standing infants could be made to lose their balance and fall as a result of movement in the surrounding visual environment. Environmen- tal visual motion also destabilized the posture of adults, suggesting that visual motion information can override information obtained from stretch receptors in the limbs and gravity receptors in the inner ear. Visual motion can also lead to a profound sensation of self-motion, either as a rotation about a vertical axis or as a horizontal or vertical translation [27]. o Stimulus to drive eye movement. Ever since the important experiments of Rash- 3 bass [155], it has been recognized that the oculomotor pursuit system is driven by a velocity signal. Rashbass simultaneously stepped a visual target in one di- rection and initiated a constant velocity motion in the opposite direction. Thus position information (in the form of a step) was pitted against velocity informa- tion (in the form of a ramp). Surprisingly, the eye movement system responded separately to each, generating a smooth eye movement in response to velocity ramp and an oppositely directed saccade in response to the position step. Later work has suggested some additional contribution from a visual position encod— ing system [148], but theoretical discussions of the oculomotor pursuit system still hinge directly on the notion that the visual system can indeed read velocity. Broadly speaking, there are two groups of scientists studying motion. The first group is studying human/ animal vision with the goal of understanding the operation of biological vision systems including their limitations and diversity. The second group of scientists includes computer scientists and engineers conducting research in computer vision with the objective of developing vision system. A vision system with the ability to navigate, recognize and track objects, and estimate their speed and direction is the ultimate goal of the research. The proposed work falls into the latter category. We report two pieces of works in this thesis. One is the motion and structure estimation with transitory sequences and the other one is the recognition of hand signs. 4 1.1 Motion and Structure Estimation from Image Sequences Motion estimation is a long-standing problem in computer vision, with many different applications. A proper formulation of the motion estimation problem requires the identification of several characteristics of the image acquisition process and the scene, including: the camera model, the number of cameras, the number of views. Two distinct approaches have been developed for the computation of motion from image sequences [3], namely, the feature based approach and the flow based approach. These two approaches can be further classified based on: 1) The number of images in each frame: monocular or stereo; 2) Projection model: perspective or parallel projection; 3) The number of views the system is designed to handle: two-view or multiple—view. 1.1.1 Feature-based and flow-based motion estimation ap- proaches The feature-based motion estimation from image sequences can be divided into two steps. The first step is to establish feature correspondences for all pairs of consec- utive image frames in a sequence. The features are a set of relatively sparse, but highly discriminatory, two-dimensional geometric shapes in the images corresponding to three-dimensional object features in the scene, such as corners, occluding bound- aries of surfaces, and boundaries demarcating changes in surface reflectivity. 
Such 5 points, lines and/or curves are extracted from each image. The inter-frame corre- spondence is then established between these features. The second step is to estimate motion parameters. Constraints are formulated based on assumptions such as rigid body motion, i.e., the three dimensional distance between two features on a rigid body remains the same after object / camera motion. Such constraints usually result in a system of nonlinear equations. The observed displacement of the 2-D image features are used to solve these equations leading ultimately to the computation of motion parameters of the objects in the scene. Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. Optical flow can arise from relative motion of objects and the viewer. Consequently, optical flow can give important information about the spatial arrangement of the objects viewed and the rate of change of this arrangement. The flow-based approach is based on computing the optical flow or the two—dimensional field of instantaneous velocities of brightness values (gray levels) in the image plane. Instead of considering temporal changes in image brightness values in computing the optical flow field, it is possible to also consider temporal changes in values that are the result of applying various local operators such as contrast, entropy, and spatial derivatives to the image brightness values. In either case, a relatively dense flow field is estimated, usually at every pixel in the image. The optical flow is then used in conjunction with added constraints or information regarding the scene to compute the actual three—dimensional relative velocities between scene objects and camera. There is as yet no clearcut evidence recommending one general approach over the other. The feature-based approach allows a relatively large interframe motion. The 6 disadvantages of this approach include: 1) The features are expensive to extract; 2) The correspondence problem is nontrivial; 3) Features are relatively sparse. On the contrary, the flow-based approach does not have the above disadvantages, however, 1) it is expensive to compute due to the sheer number of pixels; 2) it can only handle relatively small interframe motions; 3) it can not integrate long sequences effectively due to the absence of anchors (features). 1.1.2 Monocular and stereo image sequence The monocular sequence is a set of images obtained by a single camera. Given a sequence of monocular images of the scene, the motion and structure of an object can be estimated by both feature-based approach as well as flow-based approach. The solutions for structure and motion remain ambiguous with respect to the absolute value of the distance between the camera and the scene. In other words, structure and motion parameters are unique only up to a scaling factor. The use of stereoscopy can provide this additional parameter to uniquely determine depth and hence the absolute values for the structure and motion parameters. The fusion of stereo and motion may be effected with different objectives in mind. Stereoscopic processing may be used to aid motion recovery, or conversely, motion analysis may be used to establish feature correspondence in stereo image pairs. The fusion of these two processing modules in human and other biological visual systems has been detected via neurobiological and psychophysiological investigation [146]. 
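The absolute scale that a monocular sequence cannot supply comes directly from the known stereo geometry. As a minimal illustration (a Python/NumPy sketch added here, not code from this thesis), assuming a rectified stereo pair with known focal length and baseline, depth follows from the disparity of a matched feature:

```python
import numpy as np

def depth_from_disparity(x_left, x_right, f, baseline):
    """Depth from a rectified stereo pair: z = f * b / d.

    x_left, x_right: horizontal image coordinates (pixels) of the same
    scene point in the left and right images; f: focal length in pixels;
    baseline: distance between the two optical centers (e.g. in mm).
    The returned depth carries the baseline's units, which is exactly the
    absolute scale a single moving camera cannot recover on its own.
    """
    d = np.asarray(x_left) - np.asarray(x_right)   # disparity, assumed > 0
    return f * baseline / d

# Example: a feature at column 412.0 (left) and 396.5 (right),
# f = 700 pixels, baseline = 120 mm  ->  depth in mm.
z = depth_from_disparity(412.0, 396.5, 700.0, 120.0)
```

Once depth is available in absolute units, the translation and structure estimates discussed above inherit the same absolute scale.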
1.1.3 Perspective and parallel projection

In general, projections transform points in a coordinate system of dimension n into points in a coordinate system of dimension less than n. Perspective projection is the fundamental model for the transformation wrought by our eyes, by cameras, or by numerous other imaging devices. To a first-order approximation, these devices act like a pinhole camera in that the image results from projecting scene points through a single point onto an image plane (see Figure 1.1).

Figure 1.1: Perspective projection.

The mathematical equations for a perspective projection are given by

X = \frac{f\,x}{z}, \qquad Y = \frac{f\,y}{z}    (1.1)

where f is the focal length. The perspective transformation yields parallel projection as a special case when the viewpoint is the point at infinity in the z direction. Then all objects are projected onto the viewing plane with no distortion of their x and y coordinates.

Although perspective projection is a more precise model, it has its problems. First, when camera motion is small, the effects of camera rotation and translation can be confused with each other. Second, the shape is computed as relative depth, and it is very sensitive to noise. These difficulties are especially magnified when the objects are distant from the camera relative to their size. So, there are still cases where the parallel projection model is used [8, 178].

1.1.4 Two-view vs. multiple-view

The presence of noise in the image leads to inaccuracy in the resulting estimates of the object motion parameters. In order to combat the noise, the algorithm needs to exploit redundancy in the available data to improve the accuracy of the solution. In the two-view case, overdetermination of the estimation equations is achieved by using a larger number of feature points. The alternative is to use a larger number of image frames. If the interframe motion remains constant, then an arbitrary number of frames can be used, since the number of unknowns (motion parameters) is not a function of the number of image frames. If the interframe motion is not constant, which is probably more realistic in the navigation problem where the robot may constantly need to change direction to avoid obstacles, the problem of preventing global motion deviation is more important, since the error accumulates from each inaccurate interframe motion.

1.2 Integration of Transitory Image Sequences

In the case of motion estimation from image sequences (multiple-view), there is a special case in which the image sequence is transitory. In this thesis, we propose a new approach to deal with the transitory image sequence. The proposed approach is a feature-based method. Our method uses stereo images and the perspective projection model.

Next, we define the transitory image sequence. If a system needs to sense a large 3-D rigid scene which cannot be covered by a single view, the system may actively move and scan the scene [5]. For example, to automatically build a 3-D map of a floor in a building, a camera system moves from one room to the next on the floor. To obtain information about all the facets of a 3-D object, a camera system needs to actively circle around the object or let the object rotate. In general, during a dynamic sensing process, any component of the scene is visible only in a subsequence, and thus the resulting image sequence is transitory. Issues related to the transitory nature of scene components have mostly not yet been investigated.
Current most of the research deals with non-transitory image se- quences, and successful improvements have been achieved in this type of fusion [28, 10, 111, 131, 164, 177]. Experiments for scene construction from transitory image sequence only started recently. In Cui et al [44], the relative accuracy was reported from a transitory image sequence, which indicated that the accuracy was not fur- ther reduced once incoming and exiting feature points are comparable. Tomasi and Kanade [178] conducted experiments with transitory image sequences and discussed how to expand the measurement matrix by filling in “hallucinated” projections. The 10 results showed that the object structure and camera pose constructed from transitory sequences “Ball” and “Hand” contained larger error than that from a non-transitory sequence “Hotel” [178]. Most questions related to the integration of transitory sequences are still open. Some of them are: 1. With a transitory image sequence, is it still reasonable to expect “the more images one has, the better the accuracy” as with a non-transitory sequence? 2. How should transitory image sequences be integrated? 3. What representation(s) should one use? World-centered (WC), camera-centered (CC) or some other representation? 4. What is the asymptotic error behavior from a transitory image sequence? In other words, how does the error in an estimated object structure and camera pose change with time (or frame number), for a good estimation method? 5. Are the asymptotic error rates for the method the lowest possible? 6. What kind of accuracy one can reach in a real setup with transitory image sequences? The new contributions of our work includes 1. We show that from a transitory sequence it is inherently not possible to get better estimates with a longer sequence. The later a scene part comes into the sequence, generally the worse its global accuracy is compared to that in the first view. 11 . We introduce different techniques for two different usages of the results: global and local(e.g., global corresponds to visual map generation and global pose determination while local corresponds to obstacle avoidance and object manip- ulation belong to the latter). . It is demonstrated that different representations result in very different stabili- ties. In general, WC is better for a global usage and CC is superior for a local usage. . We establish asymptotic error rates with respect to the number of frames, which indicates how the error estimate evolves with time as well as how to minimize the pace of error accumulation. Some concise expressions have been derived in terms of asymptotic error rate for different representations, processing methods, and image sequence types. . We establish that the asymptotic error rates are in fact the lowest possible based on the Cramér-Rao error bound. . In order to provide actual accuracy with a real system setup, careful experiments have been conducted with a fully calibrated camera system. The algorithm includes feature selection, stereo matching, temporal matching and tracking, 3-D structure integration, and motion and pose estimation. 12 1.3 Interpretation of Human Action It has to be admitted that the human visual system is capable of carrying out a wide variety of image analysis and interpretation tasks with what appears to be utmost ease. You walk off the plane after a long boring travel, and sight your friend near the gate who is waving his hand to you. You lighten up a little after seeing his smiling face. 
There are three problems that your visual system solves: (i) it identifies the objects on the basis of their shapes; (ii) it detects their movement, if there is any; (iii) it recognizes the motion patterns, e.g., hand waving and smiling. There has been significant interest in the human action recognition from time— sequential images. Model—based recognition consists of the recognition of objects or motions directly from the motion information extracted from the sequence of images. The knowledge about the object or motion is used to construct models that will serve in the recognition process. There are two main steps in this approach. The first consists of finding an appropriate representation for the objects or motions, from the motion cues of the sequence, and then organizing them into useful representations. The second step consists of the matching of some unknown input with a model. The work on interpretation of human motion can be classified into three cate- gories according to their applications, that is, interpretation of facial expressions, interpretation of hand gestures, and recognition of movements of other body parts. 0 Facial expression The human face has attracted much attention in several disciplines, including psychology, computer vision, and computer graphics. Psychophysical investi- 13 gations clearly indicate that faces are very special visual stimuli. Psychologists have studied various aspects of human face perception and recognition [30]. They have also examined facial expression - the result of a confluence of volun- tary muscle articulations that deform the neutral face into an expressive face. The facial pose space is immense. The face is capable of generating on the order of 55,000 distinguishable facial expressions with about 30 semantic dis- tinctions. Studies have identified six primary expressions that communicate anger, disgust, fear, happiness, sadness, and surprise in all cultures [58]. Automatic facial recognition had an early start in image understanding, but work on the problem has been sporadic over the years, evidently due to the difficulty of extracting meaningful information from facial images. Facial clas- sification systems based on measurements derived from interactively selected fiducial points (eye and mouth corners, nose, top of head, etc.) go back to the early 1970’s [99]. Recent works concentrate on the deformation of facial features [118,175, 199]. Hand sign Humans use gestures in daily life as a means of communication (e.g., waving hands to say good bye to a friend, pointing to an object to bring someone’s at- tention to the object etc). The best example of communication through gesture is given by sign language. American sign language (ASL) incorporates the en- tire English alphabet along with many gestures representing words and phrases [41], which permits people to exchange information in a nonverbal manner. It 14 was shown that requirements in precision and resolution for ASL are relatively low as compared to greyscale images [167]. 0 Body movement Body movement is a broad term. There are several ways to view this task. The first one is to recognize the action performed by a person in a scene, among a database of human action models. The second way is to be able to recog- nize the different body parts like arms, legs, etc. throughout a sequence, using motion. The third way is to define motion as a sequence of object configura- tions or shapes through time. 
The knowledge of the shape and motion of an object, in this case, is used to guide the interpretation of an image sequence in order to analyze the motion between frames, to determine the most plausible configuration of the body or to recognize and label different parts of the body. This approach has been used mostly with humans, and sometimes is called the tracking of human motion. The work presented in this thesis focuses on the hand sign recognition. 1.4 Hand Sign Recognition The evolution of computer technology has seen a corresponding evolution of com- puter input and human-machine interaction technologies. At first, humans had to prepare punch cards and paper tapes. Later, the machine was capable of reading keyboards and providing “real time” feedback on tele-type terminals. Recently, ad- 15 vances in memory and computation technology have permitted machines to have two-dimensional pointing devices such as mice. The next step is to build machines which can interact with the human users in a natural way. That means that machines not only need to understand speech but also gestures. One of the most general def- initions from the Lexis dictionary says that gestures are “movements of body parts, particularly the arms, the hands or the head conveying, or not conveying, meaning”. Among them, hand gestures are the most important, having different sign languages such as American Sign Language. Since its first known dictionary was printed in 1856 [29], American Sign Language (ASL) is widely used in the deaf community as well as by handicapped people who are not deaf [22]. The general hand sign interpretation needs a broad range of contextual information, general knowledge, cultural background and linguistic capabilities, which are beyond the capabilities present-day computer. In our current research, we extract a set of hand gestures which have meaning in human communication. Twenty-eight different signs are extracted from [23], which are illustrated in Fig. 1.2. These hand signs have the following characteristics: 1) they represent a wide variation of hand shapes; 2) they include a wide variation of motion patterns; 3) these hand signs are performed by one hand; 4) recognition of these signs does not depend on contextual information. The gestures which require the hand to perform in a certain environment or point to a specific object are excluded. In this thesis, we present a new vision-based framework which allows the computer to interact with users through hand signs. Vision-based analysis of hand signs is one of the most natural ways of constructing a human-computer gesture interface since it is 16 (13) (14) (20) (21) (22) (23) (24) (25) (26) (27) (28) Figure 1.2: The twenty eight different signs used in the experiment. (1) sign of “angry”; (2) “any”; (3) “bOY”; (4) “m”: (5) “cute”; (6) “fine”; (7) “funny”; (8) “girl”; (9) “happy”; (10) “hi”; (11) “high”; (12) “hot”; (13) “later”; (14) “low”; (15) “no”; (16) “nothing”; (17) “of course”; (18) “ok”; (19) “parent”; (20) “pepper”; (21) “smart”; (22) “sour”; (23) “strange”; (24) “sweet”; (25) “thank you”; (26) “thirsty”; (27) “welcome”; (28) “wrong” (Bornstein and Saulnier 1989). 17 based on the major way humans perceive information about their surroundings. How- ever, it is also the most difficult one to implement in a satisfactory manner because of the limitations in machine vision today. 
The approach faces two difficult problems: 1) segmentation of the moving hand from a sometimes complex background, and 2) analysis and recognition of hand motion. In order to avoid the segmentation problem, some of the existing systems rely on markers or marked gloves (e.g. [38, 54, 168]). The others simply assume a uniform background (e.g. [19, 46, 53]). The patterns of hand motion appearing in our gesture vocabulary are extremely complicated. It can be a local deformation such as the change of the hand posture. It can also be a global motion which means the change of the hand position, or it involves both local and global motion. Many existing systems use simplified models to characterize hand motion such as 2D fingertip trajectory [54] and 3D joint angles [110]. These motion models are hand-crafted by algorithm designers. The system does not have the ability to learn models, neither can it improve or alter the model in case the model does not fit. This results in a brittle system, since a model cannot fit in all the cases in reality and typically the system cannot tell whether the model will fail given an unknown input. Another problem of the existing approaches is that they typically isolate temporal understanding from spatial understanding. For example, a system can probably tell something is moving, but cannot tell what is moving. Isolation of temporal understanding from spatial understanding makes it impossible to apply the system to unstructured (unknown) environments. In this thesis, we present a new general framework to learn and recognize hand signs. The new contributions of this framework are listed as follows. 18 1. Segmentation. A prediction—and-verification scheme using attention images from multiple fixations is presented to segment hands from complex back- grounds. A major advantage of this scheme is that it can handle a large number of different deformable objects presented in complex backgrounds. The scheme is also relatively efficient since the segmentation is guided by the past knowledge through a prediction-and—verification scheme. 0 Prediction. A hierarchical quasi-Voronoi diagram which organizes training attention images for prediction of the segmentation masks. 0 Verification. A learning-based function approximation scheme to verify the segmentation result. 2. Recognition. We propose a new framework in which Motion understanding is tightly coupled with spatial recognition. 0 Automatic feature selection. In the framework, the discriminant analysis is used to automatically select the most discriminating features (MDF) for recognition. 0 Inference and generalization. In practice, the system is unable to learn unlimited number of samples. An interpolation scheme is introduced to generalize to other variations from a limited number of learned samples. Chapter 2 Feature-Based Approaches for Motion Estimation In this chapter, we review the motion estimation problem which deals with observing some features on the surface of an object in the environment at different points in time and using this information to derive the three dimensional motion and structure of these objects. Let us assume that we have one or more cameras that are moving continuously in a static environment and following some unknown trajectory. We will consider the images obtained at a number of time instants to,t1, ...,tn_1 and assume that we can extract from these it images a number of features and match them between the images. 
We are interested in the set of finite rigid displacements D_i that bring the camera from its position and orientation at time t_i to its new position and orientation at time t_{i+1}.

Figure 2.1: Basic perspective geometry. Lower case letters refer to coordinates in the object space and upper case letters refer to coordinates on the image plane. f is the focal length.

2.1 General Problem

The basic geometry of the two-view case is sketched in Figure 2.1. The object-space coordinates are denoted by lowercase letters and the image-space coordinates are denoted by uppercase letters. The optical center of a pin-hole camera coincides with the origin of a Cartesian coordinate system (oxyz) and the positive z-axis is the direction of view. The image plane is located at a distance equal to the focal length f (which is assumed to be unity) from o along the direction of view. Using a perspective projection model, a point p = (x, y, z) on the surface of an object is projected at a point P = (X, Y) on the image plane, where

X = \frac{x}{z}, \qquad Y = \frac{y}{z}.    (2.1)

Given N corresponding features (points, lines, corners, conics, etc.) on the same rigid object, our problem is to infer the motion and structure of this rigid body with respect to the imaging system. In general, there are three different cases:

1. 2D-to-2D Correspondence: Here, N corresponding features are specified on the 2D image planes either at two different times for a single camera or at the same instant of time but from two different cameras. In the former case, the problem is to determine the 3D motion and structure of the rigid object, and in the latter case, the problem is to determine the relative orientation and location of the two imaging cameras.

2. 3D-to-3D Correspondence: We are given the 3D locations of N corresponding features at two different times and we need to estimate the 3D motion of the rigid body. The problem thus has applications in motion estimation using 3D information, which can be obtained by stereo or other range-finding techniques.

3. 2D-to-3D Correspondence: In this case, we are given correspondences of N features (f_i, f_i') such that the f_i are specified in three dimensions and the f_i' are their projections on the 2D image plane. The problem is to find the location and orientation of the imaging camera. This has applications in either the calibration of a single camera or passive navigation through known 3D landmarks.

2.2 Motion Model

Let the two views be taken at t_1 and t_2, respectively. Consider a particular physical point on the surface of a rigid body in the scene. Let (x, y, z) = object-space coordinates of the point at time t_1, and (x', y', z') = object-space coordinates of the point at time t_2. It is well known in kinematics that

\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = R \begin{bmatrix} x \\ y \\ z \end{bmatrix} + T = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + T    (2.2)

where R is a 3 x 3 orthonormal matrix, i.e., R^t R = R R^t = I (I is a 3 x 3 identity matrix) and det(R) = 1, and T is a 3 x 1 vector. Rotation can be specified in a number of equivalent ways. For example, R can be specified as three successive rotations around the x-, y-, and z-axis, by angles \theta, \psi, and \phi, respectively, and can be written as a product of these three rotations:

R = \begin{bmatrix} \cos\phi & \sin\phi & 0 \\ -\sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\psi & 0 & -\sin\psi \\ 0 & 1 & 0 \\ \sin\psi & 0 & \cos\psi \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & \sin\theta \\ 0 & -\sin\theta & \cos\theta \end{bmatrix}.    (2.3)
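The projection and motion model of equations (2.1)-(2.3) is summarized in the following Python/NumPy sketch (an illustration added here, not code from the thesis); the function names are arbitrary and the angle convention follows (2.3) exactly:

```python
import numpy as np

def rotation_from_euler(theta, psi, phi):
    """Eq. (2.3): successive rotations about the x-, y- and z-axis by
    angles theta, psi and phi, composed in the order given in the text."""
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    cf, sf = np.cos(phi), np.sin(phi)
    Rz = np.array([[cf, sf, 0.0], [-sf, cf, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, -sp], [0.0, 1.0, 0.0], [sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ct, st], [0.0, -st, ct]])
    return Rz @ Ry @ Rx

def project(p):
    """Eq. (2.1): perspective projection with unit focal length."""
    x, y, z = p
    return np.array([x / z, y / z])

# A point observed before and after the rigid motion of eq. (2.2).
R = rotation_from_euler(0.05, -0.02, 0.10)
T = np.array([0.3, 0.0, 1.0])
p = np.array([1.0, 2.0, 10.0])
P, P_prime = project(p), project(R @ p + T)
```

Matched image pairs such as (P, P') are the input to the 2D-to-2D algorithms below, while the 3D pairs (p, p') themselves are the input to the 3D-to-3D case.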
Alternatively, the rotation matrix can be expressed in terms of a rotation axis, n = [n_x, n_y, n_z], and a rotation angle, \phi, about it. The rotation matrix can be written as

R = \begin{bmatrix} (n_x^2-1)c+1 & n_x n_y c - n_z s & n_x n_z c + n_y s \\ n_y n_x c + n_z s & (n_y^2-1)c+1 & n_y n_z c - n_x s \\ n_z n_x c - n_y s & n_z n_y c + n_x s & (n_z^2-1)c+1 \end{bmatrix}    (2.4)

where c = (1 - \cos\phi) and s = \sin\phi. A rotation around an axis with direction cosines (n_x, n_y, n_z) and rotation angle \phi can also be represented by the unit quaternion

q = (s;\, l, m, n) = \left[\cos\frac{\phi}{2},\; n_x\sin\frac{\phi}{2},\; n_y\sin\frac{\phi}{2},\; n_z\sin\frac{\phi}{2}\right];    (2.5)

specifically,

(0;\, p_i') = q\,(0;\, p_i)\,q^*    (2.6)

where * denotes quaternion conjugation. In terms of q, the rotation matrix becomes

R = \begin{bmatrix} s^2+l^2-m^2-n^2 & 2(lm-sn) & 2(ln+sm) \\ 2(lm+sn) & s^2-l^2+m^2-n^2 & 2(mn-sl) \\ 2(ln-sm) & 2(mn+sl) & s^2-l^2-m^2+n^2 \end{bmatrix}.    (2.7)

2.3 2D-to-2D Correspondence

Consider the problem in which a point p_i = (x_i, y_i, z_i) on a rigid body moves to a point p_i' = (x_i', y_i', z_i') with respect to a camera-fixed coordinate system. Let the perspective projection of p_i be P_i = (X_i, Y_i, 1) and that of p_i' be P_i' = (X_i', Y_i', 1). Due to the rigid body motion, p_i and p_i' are related by

p_i' = R p_i + T    (2.8)

where R and T are the rotation and translation, respectively. Given N correspondences (P_i, P_i'), i = 1, 2, ..., N, it is impossible to determine the magnitude of the translation. If the rigid body were two times farther away from the image plane, but twice as big, and translated at twice the speed, we would get exactly the same two images. Therefore, the translation T and the object-point ranges (z_i) can only be determined to within a global positive scale factor.

Roach and Aggarwal [158] proposed an algorithm that solves for motion parameters directly from nonlinear equations. The equations that relate the three-dimensional coordinates of a point (x, y, z) and its image plane coordinates (X, Y) are

X = F\,\frac{r_{11}(x-x_0)+r_{12}(y-y_0)+r_{13}(z-z_0)}{r_{31}(x-x_0)+r_{32}(y-y_0)+r_{33}(z-z_0)}, \qquad Y = F\,\frac{r_{21}(x-x_0)+r_{22}(y-y_0)+r_{23}(z-z_0)}{r_{31}(x-x_0)+r_{32}(y-y_0)+r_{33}(z-z_0)}    (2.9)

where F is the focal length, (x_0, y_0, z_0) is the projection center, and r_{11}, r_{12}, ..., r_{33} are functions of (\theta, \psi, \phi), as shown in equation (2.3). Roach and Aggarwal showed that five points in two views are needed to recover these parameters. The five-point algorithm has a very long history [65]. The problem was solved in 1913 by Kruppa [109]. In [158], Roach and Aggarwal related the number of points to the number of equations available for the solution of 3-D coordinates and motion parameters as follows. The global coordinates of each point are unknown, so the five points produce 15 variables. The camera position and orientation parameters (x_0, y_0, z_0, \theta, \psi, \phi) in the two views contribute another 12 variables, yielding a total of 27 variables. Each 3-D point produces two projection equations, thus forming a total of 20 nonlinear equations. To make the number of unknowns equal to the number of equations, the six camera parameters of the first view are set to zero and the z-component of one of the five points is set to an arbitrary positive constant to fix the scaling factor. An iterative finite-difference Levenberg-Marquardt algorithm was used to solve these nonlinear equations.

Nonlinear equations generally have to be solved through iterative methods with an initial guess or through a global search. Searching is computationally expensive. Also, the iterative methods may diverge or converge to a local minimum.

Figure 2.2: Constraint on the motion parameters derived from point matches.

Given eight or more point correspondences, a linear algorithm can be derived [120, 179].
As shown in Figure 2.2, the 3-D point P has images p and p' at two consecutive time instants. From the figure, it is clear that p and p' form a correspondence if and only if the three vectors PO, OO' and PO' are coplanar. The constraint can be written in the coordinate system of the first camera as

p' \cdot (T \times R p) = 0.    (2.10)

Introduce the antisymmetric matrix C built from T = (t_x, t_y, t_z):

C = \begin{bmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{bmatrix}.    (2.11)

Matrix C is such that Cx = T \times x for all vectors x. Let E = CR; then equation (2.10) can be rewritten as

p'^t E\, p = 0.    (2.12)

Dividing both sides of equation (2.12) by the positive quantity z z' (i.e., dividing p by z and p' by z') gives

[X'\;\; Y'\;\; 1]\, E \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = 0.    (2.13)

Equation (2.13) is linear and homogeneous in terms of the nine unknown elements of E, e_i, i = 1, 2, ..., 9. Given N correspondences, we can write equation (2.13) in the form

B\,[e_1, e_2, e_3, e_4, e_5, e_6, e_7, e_8, e_9]^t = 0.    (2.14)

With eight point correspondences, if rank(B) = 8, E can be uniquely determined to within a scale factor. Once E is determined, R and T can be obtained.

The existing linear algorithms essentially consider noise-free data. High sensitivity to noise is reported in [179] [63]. To handle noise, Yasumoto and Medioni [200] used a regularization approach. Under the assumption of small rotation angles, the rotation matrix R is expressed by

R = \begin{bmatrix} 1 & -\Omega_z & \Omega_y \\ \Omega_z & 1 & -\Omega_x \\ -\Omega_y & \Omega_x & 1 \end{bmatrix}    (2.15)

where \Omega_x, \Omega_y, \Omega_z denote the rotation angles around the x, y and z axis, respectively. The objective function is

\sum_{i=1}^{N}\left[(u_i - \hat{u}_i)^2 + (v_i - \hat{v}_i)^2\right] + \lambda\,(\Omega_x^2 + \Omega_y^2 + \Omega_z^2)    (2.16)

where the first and second terms are the squares of the differences between the predicted and computed displacements, and the third term is the regularization function. A consequence of their approach is that the estimated motion parameters are biased towards nonrotational interpretations if the regularization factor is not zero. In their approach, the search for the global minimum of the objective function in motion parameter space is computationally expensive.

Weng et al. [189] used least squares techniques to make use of the redundancy in the data to combat noise. The algorithm first solves for the essential matrix E. Then the motion parameters are obtained from E. Finally, the spatial structure is derived from the motion parameters. All the steps of the algorithm use the redundancy in the data to combat noise. The linear equations in the linear algorithms [179] [120] [203] are converted to least squares minimization problems. Due to the coupling between different parameters in the motion vector, a large magnitude of translation is needed to obtain stable estimation of the translation direction and the structure of the scene.

If rank(B) < 8, the above linear algorithms cannot be used. However, if rank(B) = 5, 6 or 7, then the linear equations (2.14) can be solved along with the polynomial constraints on the components of matrix E [88] [90]. Specifically, E is equal to a skew-symmetric matrix post-multiplied by a rotation matrix only if {e_i}, i = 1, 2, ..., 9, satisfy the following three constraint equations. Let \hat{e}_i, i = 1, 2, 3, be the ith row of E; then

\hat{e}_3 \cdot (\hat{e}_1 \times \hat{e}_2) = 0
(\|\hat{e}_2\|^2 + \|\hat{e}_3\|^2 - \|\hat{e}_1\|^2)(\hat{e}_2 \cdot \hat{e}_3) + 2(\hat{e}_1 \cdot \hat{e}_2)(\hat{e}_1 \cdot \hat{e}_3) = 0
\|\hat{e}_3\|^4 = (\|\hat{e}_2\|^2 - \|\hat{e}_1\|^2)^2 + 4(\hat{e}_1 \cdot \hat{e}_2)^2.    (2.17)

Thus we get three polynomial equations in {e_i}, i = 1, 2, ..., 9, of degree 3, 4, 4, respectively.
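The linear step in (2.13)-(2.14) is easy to state concretely. Below is a minimal Python/NumPy sketch (added for illustration, not code from this thesis) of the eight-point estimate of E, assuming at least eight matched points given in normalized image coordinates; projecting E to rank 2 afterwards is one common way of accounting for the constraints just discussed:

```python
import numpy as np

def estimate_essential_matrix(pts1, pts2):
    """Linear eight-point estimate of the essential matrix E.

    pts1, pts2: (N, 2) arrays of matched normalized image coordinates
    (X, Y) and (X', Y') with unit focal length, N >= 8.  Builds the
    coefficient matrix B of eq. (2.14) from eq. (2.13), takes the null
    vector of B via SVD, then enforces rank(E) = 2.
    """
    X1, Y1 = pts1[:, 0], pts1[:, 1]
    X2, Y2 = pts2[:, 0], pts2[:, 1]
    ones = np.ones_like(X1)
    # Each row encodes [X' Y' 1] E [X Y 1]^T = 0 for one correspondence.
    B = np.column_stack([X2*X1, X2*Y1, X2,
                         Y2*X1, Y2*Y1, Y2,
                         X1,    Y1,    ones])
    _, _, Vt = np.linalg.svd(B)
    E = Vt[-1].reshape(3, 3)                    # solution up to scale
    U, S, Vt = np.linalg.svd(E)
    E = U @ np.diag([S[0], S[1], 0.0]) @ Vt     # project to rank 2
    return E / np.linalg.norm(E)
```

With noisy data and N > 8 points, the same SVD step gives the least-squares solution of (2.14), which is the sense in which the linear algorithms cited above exploit redundancy.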
2.4 3D-to-3D Correspondences

2.4.1 Point features

Suppose we are given N corresponding points (p_i, p_i') which obey the relationship

p_i' = R p_i + T.    (2.18)

The problem is: given (2.18), find R and T. The (p_i, p_i') are 3D coordinates of points on the surface of the rigid body in motion. It is well known that three noncollinear point correspondences are necessary and sufficient to determine R and T uniquely. Equation (2.18), when expanded, represents three scalar equations in six unknown motion parameters. With three point correspondences, we will get nine nonlinear equations. Iterative methods can be used to obtain the 'best' fits of the six unknowns. However, it is possible to get stuck in a local minimum.

Blostein and Huang [20] used linear algorithms by observing that equation (2.18) is linear in the components of R and T. Given four correspondences, (p_i, p_i'), i = 1, 2, 3, 4, we have the following linear equation:

\begin{bmatrix} p_{1x} & p_{1y} & p_{1z} & 1 \\ p_{2x} & p_{2y} & p_{2z} & 1 \\ p_{3x} & p_{3y} & p_{3z} & 1 \\ p_{4x} & p_{4y} & p_{4z} & 1 \end{bmatrix} \begin{bmatrix} r_{11} \\ r_{12} \\ r_{13} \\ t_1 \end{bmatrix} = \begin{bmatrix} p'_{1x} \\ p'_{2x} \\ p'_{3x} \\ p'_{4x} \end{bmatrix}.    (2.19)

Similar equations can be obtained to solve for (r_{21}, r_{22}, r_{23}, r_{31}, r_{32}, r_{33}, t_2, t_3). The linear method uses four points instead of the minimum of three required for uniqueness. To overcome the problem of supplying the linear method with this extra point correspondence, a "pseudo-correspondence" can be artificially constructed on the basis of the rigidity of the body.

2.4.2 Motion from stereo images

A common method for determining the three-dimensional structure of the surrounding environment is through stereo vision, and in fact, the human stereo system is remarkably adept at this computation under a wide variety of conditions. Stereo vision can be characterized by three steps, as shown in Figure 2.3: 1) locate the point in one image corresponding to the projection of a point on a surface; 2) locate the point in the other image corresponding to the projection of the same surface point; 3) use the difference in projection of the corresponding points, together with estimates of the parameters of the imaging geometry, to determine a measure of the distance to the surface point.

Figure 2.3: Stereo triangulation and elongated uncertainty shape. The true point can lie anywhere inside the shaded uncertainty region.

Marr and Poggio have proposed a feature-point based model of human stereopsis [126]. A computer implementation of their algorithm was then developed and tested [78, 79].

In the problem of motion from stereo images, (p_i, p_i'), i = 1, ..., N, are measured by stereo triangulation. Since the measurements are subject to error, one prefers to work with more than three point correspondences. In this case, R and T can be obtained as a solution to the following least squares problem:

\min_{R,\,T} \sum_{i=1}^{N} \| p_i' - (R p_i + T) \|^2    (2.20)

subject to the constraint that R is a rotation matrix, where \|\cdot\| represents the Euclidean norm. Such a constrained least squares problem can be solved with linear procedures using quaternions [87] or by singular value decomposition [8].
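A minimal Python/NumPy sketch of the SVD route for (2.20) (an added illustration, not code from this thesis): center both point sets, factor the cross-covariance, and correct the sign so that det(R) = +1.

```python
import numpy as np

def align_rigid(p, p_prime):
    """Least-squares R, T minimizing sum ||p'_i - (R p_i + T)||^2 (eq. 2.20).

    p, p_prime: (N, 3) arrays of corresponding 3-D points, N >= 3 and
    not all collinear.  Closed-form solution via SVD of the 3x3
    cross-covariance of the centered point sets.
    """
    cp, cq = p.mean(axis=0), p_prime.mean(axis=0)
    H = (p - cp).T @ (p_prime - cq)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                           # guarantees det(R) = +1
    T = cq - R @ cp
    return R, T
```

The sign correction handles the degenerate case in which an unconstrained solution would be a reflection rather than a rotation.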
Matthies and Shafer [131] studied some related issues of error modeling in stereo navigation. They modeled the error of a three-dimensional point constructed through stereo triangulation by a three-dimensional random vector with a Gaussian distribution (called an ellipsoidal model). Given a set of corresponding three-dimensional points $\{p_i\}$ before motion and $\{p_i'\}$ after motion, the interframe motion, represented by a rotation matrix R and a translation vector T, was determined to minimize
$$\sum_{i=1}^{N} (R p_i + T - p_i')^t\, W_i\, (R p_i + T - p_i') \qquad (2.21)$$
where the weighting matrix $W_i$ is the inverse of $R\,\Gamma_{p_i}R^{-1} + \Gamma_{p_i'}$ ($\Gamma$ with a subscript denotes the error covariance matrix of the variable represented by the subscript). A closed-form solution to this problem was not found; they iteratively minimized (2.21) using a least squares solution as an initial guess.

Kiang et al. [101] replaced the matrix $W_i$ in (2.21) by a scalar $w_i^2$. The distribution of error in the three-dimensional position of a point was simplified into an uncertainty line segment. Starting from a least squares solution, a few iterations gave improved motion parameters. Moravec [134] also used a simpler scalar weight, inversely proportional to the depth of the point. A scalar weight treats the uncertainties in the different components indiscriminately. This implies that either reliable components are undertrusted or unreliable components are overtrusted. Furthermore, the correlation between errors in the components of a three-dimensional point cannot be taken into account by a scalar weight.

Weng et al. [191] also used matrix-weighted least squares. They presented a closed-form approximate matrix-weighted least squares solution for motion parameters from three-dimensional point correspondences, which minimizes (2.21) with the weighting matrix $W_i$ simplified so that it does not vary with the unknown rotation matrix R. The result of the algorithm approximates the optimal solution if the rotation is small. With large rotation, the result can be used as a better initial guess for an optimal iterative approach.

Figure 2.4: Imaging geometry for the 2D-to-3D correspondence problem. A point p in coordinate system oxyz is imaged at location P' on the image plane, which is specified in coordinate system o'x'y'z'.

2.5 2D-to-3D Correspondences

This situation arises when the features are given in 3-D while their corresponding features are 2-D. If the 3-D locations of features are known along with their projections on the image plane of a camera at an unknown location, then the algorithms described in this section allow us to determine the attitude, i.e., the location and the orientation, of the camera (known as the camera calibration problem).

As shown in Figure 2.4, consider two coordinate systems. oxyz is the coordinate system in which the 3-D point features are located; thus the $p_i$ are points in this coordinate system, with coordinates $(x_i, y_i, z_i)$. The camera is referenced to the other coordinate system $o'x'y'z'$. Image coordinates on the image plane are obtained by perspective projection and denoted by $(X', Y')$. Thus the image of point $p_i$ is at $P_i'$, whose coordinates are given by
$$X_i' = \frac{x_i'}{z_i'}, \qquad Y_i' = \frac{y_i'}{z_i'}. \qquad (2.22)$$
Coordinate system oxyz is obtained by a rotation R and a translation T of the coordinate system $o'x'y'z'$, and the goal in camera calibration is to determine R and T, knowing the N point correspondences $(p_i, P_i')$, $i = 1, \dots, N$. The 3-D coordinates of $p_i'$ are related to those of $p_i$ by
$$\begin{bmatrix} x_i' \\ y_i' \\ z_i' \end{bmatrix} = R \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} + T. \qquad (2.23)$$
Combining equations (2.22) and (2.23), we get
$$X_i' = \frac{r_{11} x_i + r_{12} y_i + r_{13} z_i + t_1}{r_{31} x_i + r_{32} y_i + r_{33} z_i + t_3}, \qquad Y_i' = \frac{r_{21} x_i + r_{22} y_i + r_{23} z_i + t_2}{r_{31} x_i + r_{32} y_i + r_{33} z_i + t_3}. \qquad (2.24)$$
There are six unknowns (three for rotation and three for translation) and therefore, with three point correspondences, we have enough (i.e., six) equations.
Unfortunately, these are nonlinear transcendental equations, since the $r_{ij}$ are related to the three unknowns of the rotation matrix in a transcendental manner. Iterative methods are required to solve these nonlinear algebraic equations. In practice, the data (i.e., $p_i$ and $P_i'$) are known only approximately, and therefore one may use more than three point correspondences. In such a case, the following nonlinear least squares problem may be solved iteratively:
$$\min_{R,\,T} \sum_{i=1}^{N}\left[\left(X_i' - \frac{r_{11} x_i + r_{12} y_i + r_{13} z_i + t_1}{r_{31} x_i + r_{32} y_i + r_{33} z_i + t_3}\right)^2 + \left(Y_i' - \frac{r_{21} x_i + r_{22} y_i + r_{23} z_i + t_2}{r_{31} x_i + r_{32} y_i + r_{33} z_i + t_3}\right)^2\right] \qquad (2.25)$$
subject to R being a rotation matrix. The disadvantage of these approaches is that unless one starts with a good initial guess, the iterative procedure may not converge to the right solution.

If three noncollinear point correspondences are given, then without loss of generality we can assume that the three points lie in the plane z = 0. Then (2.24) becomes
$$X_i' = \frac{r_{11} x_i + r_{12} y_i + t_1}{r_{31} x_i + r_{32} y_i + t_3}, \qquad Y_i' = \frac{r_{21} x_i + r_{22} y_i + t_2}{r_{31} x_i + r_{32} y_i + t_3}. \qquad (2.26)$$
Thus the three point correspondences give six linear and homogeneous equations in the nine unknowns $r_{11}, r_{12}, r_{21}, r_{22}, r_{31}, r_{32}, t_1, t_2, t_3$. Assuming $t_3 \neq 0$, we can divide by $t_3$ to get six linear equations in the eight unknowns $r_{11}/t_3, r_{12}/t_3, r_{21}/t_3, r_{22}/t_3, r_{31}/t_3, r_{32}/t_3, t_1/t_3, t_2/t_3$. Additional constraints on $\{r_{ij}\}$ can be obtained from the definition of R as a rotation matrix:
$$\left(\frac{r_{11}}{t_3}\right)^2 + \left(\frac{r_{21}}{t_3}\right)^2 + \left(\frac{r_{31}}{t_3}\right)^2 = \left(\frac{r_{12}}{t_3}\right)^2 + \left(\frac{r_{22}}{t_3}\right)^2 + \left(\frac{r_{32}}{t_3}\right)^2$$
$$\left(\frac{r_{11}}{t_3}\right)\left(\frac{r_{12}}{t_3}\right) + \left(\frac{r_{21}}{t_3}\right)\left(\frac{r_{22}}{t_3}\right) + \left(\frac{r_{31}}{t_3}\right)\left(\frac{r_{32}}{t_3}\right) = 0. \qquad (2.27)$$
Thus we have six linear and two quadratic equations in eight unknowns. According to Bezout's theorem [184], the maximum number of solutions (real or complex) is four, assuming that the number of solutions is finite. Solutions can be obtained by computing the resultant, which in this case is a fourth-degree polynomial in one of the unknowns.

With four coplanar point correspondences, (2.26) yields eight linear homogeneous equations in nine unknowns. If this system of equations is nonsingular, R and T can be obtained uniquely. If four or five point correspondences are known, one can either solve a linear least squares problem or use the above method by taking three point correspondences at a time. If six point correspondences are known, then (2.24) gives twelve linear homogeneous equations in the twelve unknowns $\{r_{ij}\}_{i,j=1,\dots,3}$ and $\{t_i\}_{i=1,\dots,3}$; R and T can then be determined uniquely if the system is nonsingular.
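To make the iterative solution of (2.25) concrete, the following sketch (an illustration, not the procedure used in the thesis) parameterizes R by three angles, forms the reprojection residuals, and applies a few Gauss-Newton steps with a numerical Jacobian; the angle convention and all names are my own choices, and a reasonable initial guess is assumed to be available, for example from the closed-form planar or six-point solutions above.

```python
import numpy as np

def rot(theta):
    """Rotation matrix from angles (tx, ty, tz) about the x, y, z axes."""
    tx, ty, tz = theta
    Rx = np.array([[1, 0, 0], [0, np.cos(tx), -np.sin(tx)], [0, np.sin(tx), np.cos(tx)]])
    Ry = np.array([[np.cos(ty), 0, np.sin(ty)], [0, 1, 0], [-np.sin(ty), 0, np.cos(ty)]])
    Rz = np.array([[np.cos(tz), -np.sin(tz), 0], [np.sin(tz), np.cos(tz), 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def residuals(params, pts3d, pts2d):
    """Stacked reprojection errors of (2.25): observed minus predicted (X', Y')."""
    R, T = rot(params[:3]), params[3:]
    q = pts3d @ R.T + T                       # points in the camera system (2.23)
    pred = q[:, :2] / q[:, 2:3]               # perspective projection (2.22)
    return (pts2d - pred).ravel()

def pose_gauss_newton(pts3d, pts2d, params0, iters=20, eps=1e-6):
    """A few Gauss-Newton steps on (2.25), starting from an initial guess."""
    p = np.asarray(params0, dtype=float)
    for _ in range(iters):
        r = residuals(p, pts3d, pts2d)
        # Numerical Jacobian of the residual vector w.r.t. the six parameters.
        J = np.zeros((r.size, 6))
        for k in range(6):
            dp = np.zeros(6); dp[k] = eps
            J[:, k] = (residuals(p + dp, pts3d, pts2d) - r) / eps
        # Gauss-Newton update: solve J * delta ~= -r in the least squares sense.
        p += np.linalg.lstsq(J, -r, rcond=None)[0]
    return p
```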
2.6 Motion from Long Sequences

In the previous sections, we have concentrated on motion/structure determination using only two time instants or frames. This is sufficient for some applications (e.g., passive navigation, pose determination, camera calibration), but not for others (e.g., motion prediction). For motion prediction and general understanding, it is necessary to work with longer image sequences. Furthermore, using longer image sequences is potentially a way of combating noise in the data.

2.6.1 Kalman filter

Broida and Chellappa [28] considered the case of a two-dimensional object undergoing one-dimensional motion. They assume that the object structure is known and attempt to recover the motion parameters. The unknown model parameters are represented as a vector
$$[x_c, \dot x_c, z_c, \dot z_c, p_1, p_2, \omega]^t \qquad (2.28)$$
where $(x_c, z_c)$ is the location of the center of mass of the object, $(\dot x_c, \dot z_c)$ describes the object's translational motion, and $p_1$ and $p_2$ are unknown phase angles of the moment arms $r_1$ and $r_2$ that connect the two feature points to the center of mass. Here $r_1$ and $r_2/r_1$ are assumed known. The differential equation describing unforced motion is written in terms of the above vector as
$$\dot x(t) = [\dot x_c, 0, \dot z_c, 0, \omega, \omega, 0]^t \qquad (2.29)$$
with arbitrary initial conditions $x_c(t)$, $z_c(t)$, $p_1(t)$ and $p_2(t)$. This system yields the following state equation:
$$x(k+1) = F(k)\, x(k). \qquad (2.30)$$
The measurement model is given by
$$X_1 = L\,[x_c + r_1 \cos(p_1)]/[z_c + r_1 \sin(p_1)] = h_1[x(k)]$$
$$X_2 = L\,[x_c + r_2 \cos(p_2)]/[z_c + r_2 \sin(p_2)] = h_2[x(k)] \qquad (2.31)$$
where $X_1$ and $X_2$ are the images of the two feature points and L is the focal length of the sensor. The vector representation is given by
$$X(k) = [X_1(k), X_2(k)]^t = h[x(k)] + n(k) \qquad (2.32)$$
where $h[x] = [h_1(x), h_2(x)]^t$ and $n(k)$ is a zero-mean, Gaussian, spatially correlated and temporally white noise term. The above formulation is then used to design an iterated extended Kalman filter that solves for the state variables, in this case the translation and rotation parameters.

Ayache and Faugeras [10] also used an extended Kalman filter to deal with noisy stereo image sequences. An observation x that depends on a parameter a in a nonlinear fashion can be expressed as a relation
$$f(x, a) = 0. \qquad (2.33)$$
Assume that the observation x is corrupted by noise which can be modeled as additive zero-mean Gaussian noise:
$$x = x^* + \epsilon \qquad (2.34)$$
with $E(\epsilon) = 0$ and $E(\epsilon\epsilon^t) = \Lambda$. The problem is: given a number of stereo observations $x_i$, and starting with an initial estimate $\hat a_0$ of a and its associated covariance matrix $S_0 = E((\hat a_0 - a)(\hat a_0 - a)^t)$, the Kalman filtering approach recursively deduces an estimate $\hat a_n$ of a and its covariance matrix $S_n = E((\hat a_n - a)(\hat a_n - a)^t)$ after taking into account n observations. The estimate $\hat a_n$ is the parameter vector that minimizes the criterion
$$(a - \hat a_0)^t S_0^{-1} (a - \hat a_0) + \sum_{i=1}^{n} (y_i - M_i a)^t W_i^{-1} (y_i - M_i a) \qquad (2.35)$$
where
$$M_i = \frac{\partial f(x_i, a)}{\partial a}, \qquad W_i = \frac{\partial f(x_i, a)}{\partial x}\,\Lambda\,\frac{\partial f(x_i, a)}{\partial x}^t. \qquad (2.36)$$

2.6.2 Other approaches

For a linear problem, the result of Kalman filtering is theoretically the same as that of a batch method. However, for a nonlinear problem, the result of Kalman filtering is not as good. The key problem with Kalman filtering for a nonlinear problem is that the system Jacobian matrix computed for each old observation cannot be revised using future observations. This is a fundamental limitation of sequential processing. To see this clearly, consider a time-invariant problem $y = f(m) + \delta$, where the parameter to be estimated does not vary with time and the noise is white. Instead of minimizing the desired objective function $\sum_{i=1}^{n}[y_i - f_i(m)]^2$, the iterated extended Kalman filter minimizes
$$\sum_{i=1}^{n}\left[y_i - \tilde f_i(m)\right]^2 = \sum_{i=1}^{n}\left[y_i - \left.\frac{\partial f_i}{\partial m}\right|_{\hat m^{(i)}} m\right]^2 \qquad (2.37)$$
where $y_i$ and $f_i(m)$ are the components of y and f(m), respectively, and $\hat m^{(i)}$ is the sequentially estimated parameter vector based on the first i observations. For small i, $\hat m^{(i)}$ has a large error, since only i observations are available. Therefore, the Jacobian of $f_i$ evaluated at $\hat m^{(i)}$ gives a system matrix that is evaluated far from the true parameters. This results in inaccurate system equations. Once those inaccurate system equations are generated, they are included in the objective function (2.37), further preventing the estimated parameter $\hat m$ from approaching the correct parameters as new data are obtained.
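The effect described above is easy to reproduce on a toy nonlinear model. The sketch below is purely illustrative: the model $y_i = \exp(m\,t_i)$ plus noise, the noise levels and all names are invented for the example. It compares a sequential estimator that linearizes each observation once, at the estimate available when the observation arrives, as in (2.37), with a batch Gauss-Newton estimator that re-linearizes every observation at each iteration; the batch estimate is typically closer to the true parameter when the starting estimate is poor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy time-invariant nonlinear model: y_i = exp(m * t_i) + noise.
m_true, sigma = 0.5, 0.05
t = np.linspace(0.1, 2.0, 40)
y = np.exp(m_true * t) + sigma * rng.standard_normal(t.size)

f = lambda m, ti: np.exp(m * ti)
J = lambda m, ti: ti * np.exp(m * ti)          # df/dm

# Sequential (EKF-like) estimate: each observation is linearized once, at the
# estimate available when it arrives, and is never re-linearized.
m_seq, info = 0.0, 1e-6                         # poor prior, tiny information
for ti, yi in zip(t, y):
    Ji = J(m_seq, ti)
    info += Ji * Ji / sigma**2
    m_seq += (Ji / sigma**2) * (yi - f(m_seq, ti)) / info

# Batch Gauss-Newton: every observation is re-linearized at each iteration.
m_batch = 0.0
for _ in range(20):
    r = y - f(m_batch, t)
    Jb = J(m_batch, t)
    m_batch += (Jb @ r) / (Jb @ Jb)

print(f"true m     : {m_true:.4f}")
print(f"sequential : {m_seq:.4f}")
print(f"batch      : {m_batch:.4f}")
```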
To overcome the above problems, Cui et al. [45] have proposed a recursive-batch approach to deal with long image sequences. With batch processing, $\hat m^{(i)}$ in (2.37) is replaced by $\hat m$, which takes all the observations into account instead of just the first i observations. Computationally, when all observations are processed in a batch fashion, the modification of the parameters is reliable, and the system matrix of every observation is updated at each iteration. In other words, with a sequential algorithm the contribution, or influence, of later observations on the evaluation of the system matrices of earlier observations is neglected. In order to achieve good performance without suffering excessive computational cost, a recursive-batch approach is used. In this approach, the observation sequence is processed in relatively small batches. For each batch of data, estimates are determined in a batch fashion from the old estimates and the current batch of data. The approach is recursive because the processing step is repeated for each batch of data and the newly estimated result depends on the previous result.

Tomasi and Kanade [178] have presented a factorization method to recover shape and motion from image sequences. Assume P feature points are tracked over F frames in an image stream. The trajectories of image coordinates are $\{(u_{fp}, v_{fp}) \mid f = 1, \dots, F,\ p = 1, \dots, P\}$. Write the horizontal feature coordinates $u_{fp}$ into an $F \times P$ matrix U with one row per frame and one column per feature point. Similarly, an $F \times P$ matrix V is built from the vertical coordinates $v_{fp}$. The combined matrix of size $2F \times P$,
$$W = \begin{bmatrix} U \\ V \end{bmatrix}, \qquad (2.38)$$
is called the measurement matrix. The rows of the matrices U and V are then registered by subtracting from each entry the mean of the entries in the same row:
$$\tilde u_{fp} = u_{fp} - a_f, \qquad \tilde v_{fp} = v_{fp} - b_f \qquad (2.39)$$
where
$$a_f = \frac{1}{P}\sum_{p=1}^{P} u_{fp}, \qquad b_f = \frac{1}{P}\sum_{p=1}^{P} v_{fp}. \qquad (2.40)$$
The matrix
$$\tilde W = \begin{bmatrix} \tilde U \\ \tilde V \end{bmatrix} \qquad (2.41)$$
is called the registered measurement matrix. Without noise, the registered measurement matrix $\tilde W$ is at most of rank three. With noise, the best possible shape and rotation estimate is obtained by considering only the three greatest singular values of $\tilde W$, together with the corresponding left and right singular vectors. The approach is only applicable to orthographic projection.
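A minimal numerical sketch of the rank-3 factorization step follows. The full method also upgrades the factors by enforcing metric (orthonormality) constraints on the recovered motion, which is omitted here; the variable names are mine.

```python
import numpy as np

def factorize(U, V):
    """Rank-3 factorization of the registered measurement matrix (2.38)-(2.41).

    U, V : (F, P) arrays of horizontal and vertical image coordinates of P
           points tracked over F frames (orthographic projection assumed).
    Returns (M, S): a 2F x 3 'motion' factor and a 3 x P 'shape' factor with
    W_tilde ~= M @ S, up to an unresolved 3x3 affine ambiguity.
    """
    # Register each row by subtracting its mean, as in (2.39)-(2.40).
    Ur = U - U.mean(axis=1, keepdims=True)
    Vr = V - V.mean(axis=1, keepdims=True)
    W = np.vstack([Ur, Vr])                   # registered measurement matrix (2.41)
    # Keep only the three largest singular values and their singular vectors.
    A, s, Bt = np.linalg.svd(W, full_matrices=False)
    M = A[:, :3] * np.sqrt(s[:3])             # 2F x 3
    S = np.sqrt(s[:3])[:, None] * Bt[:3]      # 3 x P
    return M, S
```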
Chapter 3

Transitory Image Sequences, Asymptotic Properties, and Estimation of Motion and Structure

If a system needs to sense a large 3-D rigid scene which cannot be covered by a single view, the system may actively move and scan the scene [5]. For example, to automatically build a 3-D map of a floor in a building, a camera system moves from one room to the next on the floor. To obtain information about all the facets of a 3-D object, a camera system needs to actively circle around the object or let the object rotate. In general, during a dynamic sensing process, any component of the scene is visible only in a subsequence, and thus the resulting image sequence is transitory. A transitory image sequence is one in which no scene element is visible through the entire sequence. When a camera system scans a scene which cannot be covered by a single view, the image sequence is transitory. This chapter deals with some major theoretical and algorithmic issues associated with the task of estimating structure and motion from transitory image sequences.

3.1 Basic Concepts

We consider a rigid scene and a sensing system (we will call it the camera system). They undergo a motion relative to each other. No matter which one is actually moving, or whether both are moving, what we need to consider for the kinematics here is just the relative motion between the two. We first define the system of reference. Because we are considering two entities, the scene and the camera system, it does not help us to place the system of reference on any object other than these two. If the system of reference is placed on the scene, the representation with respect to this system is called world-centered (WC) (also called object-centered). If the system is placed on the camera system, the representation is called camera-centered (CC). Fig. 3.1 shows these two representations. In the WC representation, the camera is moving with respect to a static scene, while in the CC representation, the scene is moving relative to a static camera. To be specific in the discussion, we say that the scene is static and the camera is moving. Thus, the world-centered reference system is fixed (with the scene) and the camera-centered reference system is moving (with the camera).

Figure 3.1: Two systems of reference: (a) world-centered and (b) camera-centered.

A view u of a 3-D feature point x is a 2-vector (two-dimensional vector) in the monocular case and a 4-vector in the stereo case (left and right views). With random error in the image measurement u, the 3-D position of the point x determined from u becomes a probability distribution whose extent can be characterized by its error covariance matrix $\Gamma_x$. The covariance matrix of a 3-D point from a monocular view can be represented by
$$\Gamma_x = H\,\mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2)\,H^t$$
where $\sigma_3$ is an extremely large number or infinity and the orthogonal matrix H specifies the orientation of the major axes of the covariance. By using a covariance matrix also for the monocular view, we can treat the monocular and multi-ocular cases in a unified way. Our analysis is applicable to both perspective and orthographic projections. As a notation, we write a small perturbation of a vector v as $\delta_v$ and the error covariance matrix of a vector v as $\Gamma_v$.

First, we examine the error in the determination of the pose m of a camera system in a system of reference, where m is a 6-vector. For example,
$$m = (\theta_x, \theta_y, \theta_z, p_x, p_y, p_z) \qquad (3.1)$$
where $p_x, p_y, p_z$ specify the position of the camera projection center and $\theta_x, \theta_y, \theta_z$ specify the orientation of the pose, represented by a rotation matrix
$$R(\theta_x, \theta_y, \theta_z) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{bmatrix}\begin{bmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{bmatrix}\begin{bmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{bmatrix}. \qquad (3.2)$$
The pose is estimated from x, a set of 3-D points represented in that reference system, and u, their image observations in the camera. Therefore, the pose is a function of x and u: m(x, u). We can express the error in m in terms of that in x and u as
$$\delta_m = \frac{\partial m}{\partial x}\delta_x + \frac{\partial m}{\partial u}\delta_u \qquad (3.3)$$
and for its covariance matrix
$$\Gamma_m = \frac{\partial m}{\partial x}\Gamma_x\frac{\partial m}{\partial x}^t + \frac{\partial m}{\partial u}\Gamma_u\frac{\partial m}{\partial u}^t \qquad (3.4)$$
assuming that the correlation between x and u is negligibly small.

Next, we investigate the error in determining the 3-D positions of a set of 3-D points y visible by a camera system whose estimated pose is m. The points in y have been viewed as a set of image points v. The estimated 3-D position of the points y in the above system of reference is then a function y(m, v). We can express the error in the estimated y by that of m and v as
$$\delta_y = \frac{\partial y}{\partial m}\delta_m + \frac{\partial y}{\partial v}\delta_v \qquad (3.5)$$
and for its covariance matrix
$$\Gamma_y = \frac{\partial y}{\partial m}\Gamma_m\frac{\partial y}{\partial m}^t + \frac{\partial y}{\partial v}\Gamma_v\frac{\partial y}{\partial v}^t \qquad (3.6)$$
assuming that the correlation between the errors in m and v is negligibly small. The above equation indicates that the error covariance of the 3-D points has two components: one is caused by the error in the pose estimate, and the other results from error in the feature measurements.

Figure 3.2: Transitory and non-transitory sequences. (a) non-transitory. (b) simple transitory. (c) general transitory.
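Equations (3.3)-(3.6) are ordinary first-order error propagation and are easy to apply numerically. The sketch below is illustrative: the finite-difference Jacobian and the function names are mine, and the same independence assumption as in the text is made.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f at x (f maps R^n -> R^m)."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for k in range(x.size):
        dx = np.zeros_like(x); dx[k] = eps
        J[:, k] = (np.asarray(f(x + dx)) - f0) / eps
    return J

def propagate_cov(f, m, v, cov_m, cov_v):
    """First-order covariance of y = f(m, v), cf. (3.5)-(3.6).

    Assumes the errors in m and v are uncorrelated, as in the text.
    """
    Jm = numerical_jacobian(lambda a: f(a, v), m)
    Jv = numerical_jacobian(lambda b: f(m, b), v)
    return Jm @ cov_m @ Jm.T + Jv @ cov_v @ Jv.T
```

For example, f can be the function that maps a camera pose m and an image observation v to a reconstructed 3-D point, in which case the two terms correspond exactly to the pose-induced and measurement-induced parts of $\Gamma_y$ in (3.6).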
Now, we use the above result to analyze pose determination from x and the use of the estimated pose m to determine y. We consider two cases. (i) x and y correspond to the same set of scene points, as shown in Fig. 3.2(a). Thus, u and v are the same in (3.3) and (3.5), which gives
$$\delta_y = \frac{\partial y}{\partial m}\frac{\partial m}{\partial x}\delta_x + \left(\frac{\partial y}{\partial m}\frac{\partial m}{\partial u} + \frac{\partial y}{\partial v}\right)\delta_u \qquad (3.7)$$
and therefore
$$\Gamma_y = A\,\Gamma_x A^t + D\,\Gamma_u D^t \qquad (3.8)$$
where A and D are the appropriate Jacobians. (ii) x and y correspond to different scene points, as shown in Fig. 3.2(b). Substituting $\Gamma_m$ in (3.4) into (3.6), it follows that
$$\Gamma_y = A\,\Gamma_x A^t + B\,\Gamma_u B^t + C\,\Gamma_v C^t \qquad (3.9)$$
where A, B and C are the appropriate Jacobians. The first term is caused by the error in the 3-D structure x from which the pose is computed. The second term is due to error in u, the observations of x. The third term results from error in the observations of y.

3.2 Asymptotic Error Properties of Integrations

In this section, we derive how the amount of error in the estimate changes with the integration of various sequences. We assume that the algorithm obtains a linear minimum variance estimate in the sense of Gauss-Markov [122], which is the minimum variance estimate with Gaussian noise. In order to investigate the best possible result, the processing method is assumed to be batch unless stated otherwise. This means that all the observed data are available for processing and the estimate is computed with all the data as a single batch. In contrast to batch processing is recursive processing [122], where data items are used one at a time, each giving an updated estimate of the result, and once a data item has been used for updating it is discarded. In other words, recursive processing imposes a restriction on the way data are available, and thus it might not be able to compute all the estimates that batch processing techniques can compute. Thus, recursive processing may have a worse asymptotic error behavior than batch processing, unless the model is actually linear [122].

3.2.1 World-centered representation

In the WC representation, every new observation about object structure is transformed into the WC system of reference using the estimated camera pose. Then all the transformed structure observations are fused together according to each one's error covariance matrix.

Ideal non-transitory sequence

Consider a set of feature points y that is visible in all the views in the image sequence, as shown in Fig. 3.2(a). Suppose that from t = 1 to t = n, n observations are made of the structure y:
$$y = y_t + \delta_{y_t}. \qquad (3.10)$$
Without loss of generality, we can assume that the pose m is relative to the pose at t = 1. The correlation of the errors $\delta_{y_t}$ between different t's is weak because the error is random. According to the Gauss-Markov Theorem [122], the linear minimum variance estimator of z in the linear equation $Az = b + \delta$, where the noise term $\delta$ has covariance matrix $\Gamma_\delta$, is $\hat z = (A^t\Gamma_\delta^{-1}A)^{-1}A^t\Gamma_\delta^{-1}b$, with error covariance matrix $\Gamma_z = (A^t\Gamma_\delta^{-1}A)^{-1}$. Thus, the minimum variance linear estimate for y in (3.10) is
$$\hat y = \left(\sum_{t=1}^{n}\Gamma_{y_t}^{-1}\right)^{-1}\left(\sum_{t=1}^{n}\Gamma_{y_t}^{-1}y_t\right) \qquad (3.11)$$
where $\Gamma_{y_t}$ is given in (3.6). The error covariance matrix of $\hat y$ in (3.10) is given by
$$\Gamma_y = \left(\sum_{t=1}^{n}\Gamma_{y_t}^{-1}\right)^{-1}. \qquad (3.12)$$

Theorem 1. Let A and B be real n x n positive definite matrices. Then $A - (A^{-1} + B^{-1})^{-1}$ is positive definite. In particular, trace(A) > trace($(A^{-1} + B^{-1})^{-1}$).

Proof of Theorem 1.
$$A - (A^{-1} + B^{-1})^{-1} = A - \left(A^{-1}(B + A)B^{-1}\right)^{-1} = A - B(B + A)^{-1}A = \left(I - B(A + B)^{-1}\right)A$$
$$= \left((A + B)(A + B)^{-1} - B(A + B)^{-1}\right)A = A(A + B)^{-1}A.$$
Thus, it is clear that $A(A + B)^{-1}A$ is positive definite because A and B are both positive definite. □

Using the result of Theorem 1, we know from Equation (3.12) that any observation $y_t$ decreases the expected error in the structure. In order to give a concise and intuitive expression for the error covariance matrix, we need to assume some uniformity, in the sense that the difference between the error covariance matrices from individual views is neglected and each is replaced by the average error covariance matrix. Here, we assume that the difference of $\Gamma_{y_t}$ among different t is neglected. Thus,
$$\Gamma_y = \left(\sum_{t=1}^{n}\Gamma_{y_t}^{-1}\right)^{-1} = \frac{1}{n}\Gamma_{y_t} = O(1/n). \qquad (3.13)$$
Thus, it is clear that the expected error variance in the structure is inversely proportional to the number of frames n. We call the factor 1/n the error rate. The situation discussed here applies to that of the "Hotel" sequence in Tomasi and Kanade [178].

A point should be mentioned here. As indicated in (3.7), the first term on the right-hand side is present in all the observations $y_t$. This implies that the observations $y_t$ are not exactly uncorrelated. But if the structure x is re-estimated using all the observations, the correlation between this re-estimated x and $y_t$ is weak and can be neglected, especially when n is large.

Simple transitory sequence

In a simple transitory sequence, each scene point is visible in two consecutive frames. In this case, the pose m estimated from the point set $x_t = x$ and its observation $u_t = u$ is used to estimate the new structure $x_{t+1} = y$, whose observation is $u_{t+1} = v$, as shown in Fig. 3.2(b). From (3.9), we can estimate the error covariance of the structure $x_t$:
$$\Gamma_{x_t} = A_t\,\Gamma_{x_{t-1}}A_t^t + B_t\,\Gamma_{u_{t-1}}B_t^t + C_t\,\Gamma_{u_t}C_t^t. \qquad (3.14)$$
Thus, using the above expression recursively, we get
$$\Gamma_{x_n} = \sum_{t=1}^{n}\left(B_t'\,\Gamma_{u_t}B_t'^{\,t} + C_t'\,\Gamma_{u_t}C_t'^{\,t}\right)$$
where $B_t'$ and $C_t'$ are products of the appropriate Jacobians and we have neglected the error in the re-estimated structure represented in the WC reference system, just as we did in the last subsection. Now, we assume a uniformity in which the differences among the terms under the summation are neglected. Thus,
$$\Gamma_{x_n} = n\left(B_t'\,\Gamma_{u_t}B_t'^{\,t} + C_t'\,\Gamma_{u_t}C_t'^{\,t}\right) = O(n). \qquad (3.15)$$
In other words, the error covariance of the structure is proportional to the number of frames. This implies that error accumulates with the number of frames.

General transitory sequence

The general situation with a transitory sequence is shown in Fig. 3.2(c), where a point can be visible in any number of frames (but not in the entire sequence). A detailed formulation for this general case is tedious, and the resulting complex expression would not give us insight. Because we are interested in asymptotic error behavior, we may make some assumption about uniformity. Assume that every feature point is visible in 2k frames. Thus, we regard the entire sequence $F = \{f_t \mid t = 0, 1, 2, \dots, n\}$ as k subsequences $F_l = \{f_{pk+l} \mid p \ge 0 \text{ is an integer}\}$, $l = 0, 1, 2, \dots, k-1$, so that in each $F_l$ every point is visible in two frames and each $F_l$ is then a simple transitory sequence. k is called the visibility span. The entire sequence consists of k subsequences, each of which is a simple transitory sequence of length n/k.
According to the result for the simple transitory case with the uniformity assumption, the error covariance matrix of the linear minimum variance estimate based on each $F_l$ is proportional to the length n/k:
$$\Gamma_{x_n} = \frac{n}{k}\left(B_t'\,\Gamma_{u_t}B_t'^{\,t} + C_t'\,\Gamma_{u_t}C_t'^{\,t}\right) = O(n/k) \qquad (3.16)$$
where the factor in parentheses is that of a simple transitory subsequence. On the other hand, we have k subsequences, each giving an independent observation of the structure x. Thus, we can use the result for the ideal non-transitory sequence obtained when we derived (3.13), which says that the error covariance matrix is reduced by a factor of 1/k:
$$\Gamma_{x_n} = \frac{1}{k}\cdot\frac{n}{k}\left(B_t'\,\Gamma_{u_t}B_t'^{\,t} + C_t'\,\Gamma_{u_t}C_t'^{\,t}\right) = O(n/k^2) \qquad (3.17)$$
which gives an error rate of $n/k^2$ for the error covariance matrix $\Gamma_{x_n}$. This is a very interesting rate. If the temporal sampling density is increased by a factor of 2 along the same scan trajectory, n and k are both doubled, and the error covariance matrix is reduced by a factor of 1/2! We can see that when k = 1 the general rate $n/k^2$ becomes the rate of the simple transitory case, and when k = n it gives that of the non-transitory case.

The error rate $n/k^2$ in (3.17) implies that the later a part of the scene enters the view, the larger the number n, and thus the larger the variance of the error in its position with respect to the WC reference system fixed at the first view.

In the above subsequence decomposition, motion is expressed as interframe motions within each subsequence, each of which is a motion that crosses k frames of the original sequence. Of course, the above subsequence decomposition is not necessarily what is done by an actual estimation algorithm. The decomposition is in fact just a way to derive the error rate. By grouping structure observations into the defined subgroups $F_l$, errors in the estimates from different groups are uncorrelated. Thus, the decomposition makes the derivation of the error rate more concise and simpler. It does not affect the asymptotic error rate $n/k^2$, because the best estimate is still derived by processing the entire set of structure observations as a single batch.

Global pose error

In the non-transitory case, the error covariance matrix of the pose is given in (3.4), which is almost independent of n. Now consider the transitory case. According to (3.4), the error variance of the camera pose estimate is the sum of two terms, one from $\Gamma_x$ and one from $\Gamma_u$, where n - l is the past time frame that shares sufficient features with the current view at n. Therefore, we have
$$\Gamma_{m_n} = \frac{\partial m}{\partial x}\left[\sum_{t=1}^{n-l}\left(B_t\,\Gamma_{u_t}B_t^t + C_t\,\Gamma_{u_t}C_t^t\right)\right]\frac{\partial m}{\partial x}^t + \frac{\partial m}{\partial u}\Gamma_{u_n}\frac{\partial m}{\partial u}^t. \qquad (3.18)$$
Since l is of the same order as k, the asymptotic error rate is $n/k^2$. Denote the last term in the above equation by O(1), indicating that it is caused by a single view of the u vector. Thus, the pose error with a transitory sequence has the same asymptotic error rate as that of the structure estimate:
$$\Gamma_{m_n} = O(n/k^2) + O(1). \qquad (3.19)$$
This is an interesting yet expected result. When $n/k^2$ increases without bound, the error rate of the camera global pose is $n/k^2$, i.e., the first term in (3.19). However, suppose one increases the temporal density of a sequence that covers the same trajectory, i.e., lets n = ck for some constant c and increases k without bound. Then the first term in (3.19) becomes $O(ck/k^2) = O(c/k)$, which approaches zero, and thus the second term becomes dominant. The expression in (3.17) is a single term of $O(n/k^2)$; this is because the pose of a camera system is determined by a single view, while the structure can be determined by many camera views.
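The counting argument behind the $n/k^2$ rate can be checked with a crude Monte Carlo experiment. The toy model below is an illustration of the decomposition only, not of the full estimator: each of the k subsequences contributes an independent error chain of n/k unit-variance interframe errors, and the k chains are averaged; the empirical variance of the fused result tracks $n/k^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def fused_error_variance(n, k, trials=5000):
    """Empirical error variance of the toy estimate for sequence length n.

    Each of the k subsequences F_l behaves like a simple transitory chain:
    its error is the sum of n // k independent unit-variance interframe
    errors.  The k chains are independent and are averaged (Gauss-Markov
    fusion with equal covariances reduces to the mean).
    """
    steps = n // k
    chains = rng.standard_normal((trials, k, steps)).sum(axis=2)  # chain errors
    fused = chains.mean(axis=1)                                   # average of k chains
    return fused.var()

print(" n    k   empirical   n/k^2")
for n, k in [(100, 1), (100, 5), (100, 10), (400, 10), (400, 20)]:
    v = fused_error_variance(n, k)
    print(f"{n:4d} {k:4d}   {v:8.2f}   {n / k**2:7.2f}")
```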
3.2.2 Camera-centered representation

In the CC representation, object structure is represented in the camera reference system. In other words, every previous observation about object structure must be transformed into the camera reference system at the current frame and fused according to the Gauss-Markov Theorem. An important difference between the WC and CC representations is the following:

- In the WC representation, every part of the scene that has been observed but is not currently visible does not need to be updated with the current view, because the WC reference system does not change with respect to the scene.
- In the CC representation, every part of the scene that has been observed but is not currently visible must be transformed to the current camera-centered system, because the CC reference system moves with respect to the scene. The update of every part of the past structure can be computationally expensive. We will address this point when we describe our cross-frame approach.

With the CC representation, the pose m to be computed is from a past time t - p to the current time t, p = 1, 2, 3, ..., t - 1. After fusing all the past views with the current view at t, the resulting structure is called the CC structure. Theoretically, the structure error should be the same as that with the WC representation if all the past frames are treated in a batch fashion. Thus the behavior of the error covariance matrix for the CC structure is the same as that of the WC structure estimated with the WC representation, except that time t is now reversed: the older the frame, the worse the structure accuracy in the CC representation. However, the local structure, i.e., the structure visible in the current frame, does not have the above transitory problem, simply because it is visible at the current time n. Therefore, it can take advantage of the situation enjoyed by the ideal non-transitory sequence. If the CC structure takes only the past b frames into account as a batch, and those b frames share a considerable number of features with the current view at n, then, according to the result (3.13) derived for the uniform ideal non-transitory sequence, the CC structure of the currently visible part is of order
$$\Gamma_{x_n} = \frac{1}{b}\,\Gamma_{y_n} = O(1/b) \qquad (3.20)$$
where $\Gamma_{y_n}$ is the error covariance matrix of the past structure transformed to frame n, and b is the batch size. For the above expression to hold true, b should be small enough so that the past b frames share the structure $x_n$ with the view at time n.

Now, we are ready to summarize the asymptotic error rates in Table 3.1. In the table, n is the current time (or frame number), k the visibility span, and b the batch size, b <= k.

Table 3.1: Asymptotic rates for the error covariance matrix in integration

  Representation / Estimate   Non-transitory   Simple transitory   General transitory
  WC structure                O(1/n)           O(n)                O(n/k^2)
  WC pose                     O(1)             O(n)                O(n/k^2) + O(1)
  CC structure                O(1/n)           O(1)                O(1/b)
  CC pose                     0                0                   0

All the structure errors are those of the part visible at the current n-th frame. The camera pose error in the CC representation should be zero in all cases, because the pose is defined directly in the camera system itself rather than measured; in the table, "0" is used to indicate this fact. As can be seen from the table, with a general transitory sequence, for a global structure representation, which is necessary for extended scene reconstruction, one should increase the visibility span k as much as possible.
For the camera-centered local structure, which is useful in grasping or collision avoidance, one should increase the batch processing size b <= k for the best possible accuracy.

3.2.3 The tightness of the error rates

The error rates we obtained in Table 3.1 are achievable rates, using the methods explained in the derivation. However, it is necessary to answer the tightness question: how tight are those rates? In other words, are those rates the best one can possibly achieve? If they are too loose, they are of little value. If they are the best one can possibly achieve theoretically, they give an important insight into the nature of the problem. To answer this important tightness question, we need to look into theoretical bounds with respect to parameter estimation.

In general, the observation model of our problem can be expressed as
$$\tilde u = u(\alpha) + \delta_u \qquad (3.21)$$
where $\tilde u$ is a vector of image-plane observations contaminated by the noise vector $\delta_u$, and $u(\alpha)$ is the noise-free image-plane vector, which depends on the parameter vector $\alpha$. In our problem, u consists of the image coordinates of all the features in all the image frames. $\delta_u$ is the error vector, which accounts for a wide variety of errors such as errors in spatial digitization, feature detection, stereo matching and temporal matching. The vector $\alpha$ is the parameter vector one wants to estimate, such as the structure of the currently visible scene, the camera pose, motion parameters, etc.

Suppose that $\hat\alpha$ is an unbiased estimator of $\alpha$ from $\tilde u$ in (3.21), the noise vector $\delta_u$ has zero mean and covariance matrix $\Gamma_u$, and the probability density of the noise is $p(u, \alpha)$. In reality our estimator is not exactly unbiased and the noise mean does not have to be exactly zero; we assume that the absolute bias and the noise mean are negligibly small. The multi-dimensional version of the Cramer-Rao error bound [154, 192] gives
$$E(\hat\alpha - \alpha)(\hat\alpha - \alpha)^t \ge F^{-1} = \mathrm{CRB}(\alpha) \qquad (3.22)$$
where E denotes the expectation operator and F is the Fisher information matrix:
$$F = E\left\{\left[\frac{\partial \ln p(u, \alpha)}{\partial \alpha}\right]^t\left[\frac{\partial \ln p(u, \alpha)}{\partial \alpha}\right]\right\}. \qquad (3.23)$$
The inequality in (3.22) means that the difference matrix of the two sides is nonnegative definite. In particular, the diagonal elements and the trace of a nonnegative definite matrix are all nonnegative. Therefore the Cramer-Rao bound gives a lower bound for the error variance of every component of the parameter vector $\alpha$. As indicated in (3.23), such a bound is evaluated with the noise-free observation u and the true parameter vector $\alpha$. It is worth noting that the bound depends on the problem itself and is independent of the actual algorithm used to estimate $\alpha$. Thus, the bound is algorithm independent. It indicates that no matter what algorithm is used to estimate $\alpha$, the resulting error covariance matrix of $\hat\alpha$ cannot be lower than that specified by the bound.

Next, we investigate the Cramer-Rao bound of the global pose of the camera system in the WC representation. We consider a general transitory sequence of length n, $F = \{f_t \mid t = 0, 1, 2, \dots, n-1\}$, with a visibility span k. Since we are investigating asymptotic behavior in which n goes to infinity, without loss of generality we consider n to be an integral multiple of k, i.e., n = (j + 1)k for some positive integer j. Then j + 1 is the length of the k subsequences $F_l = \{f_{pk+l} \mid p = 0, 1, \dots, j\}$, $l = 0, 1, 2, \dots, k-1$, each of which is a simple transitory sequence.

The simple transitory case

Consider the subsequence $F_0$, of length j.
As explained in (3.1), the global position of the camera at the i-th frame of $F_0$, with respect to its global position at the 0-th frame of $F_0$, can be specified by a column vector $m(i) = (p_x(i), p_y(i), p_z(i), \theta_x(i), \theta_y(i), \theta_z(i))^t$, where $p(i) = (p_x(i), p_y(i), p_z(i))^t$ and $\theta(i) = (\theta_x(i), \theta_y(i), \theta_z(i))^t$ specify the global position and orientation, respectively. Define the incremental interframe displacement
$$d(i) = m(i) - m(i - 1), \qquad i = 1, 2, \dots, j. \qquad (3.24)$$
From (3.24), we have the relation $m(i) = \sum_{t=1}^{i} d(t) + m(0)$, or
$$\begin{bmatrix} m(0) \\ m(1) \\ \vdots \\ m(j) \end{bmatrix} = \begin{bmatrix} I & 0 & \cdots & 0 \\ I & I & \cdots & 0 \\ \vdots & & \ddots & \\ I & I & \cdots & I \end{bmatrix}\begin{bmatrix} m(0) \\ d(1) \\ \vdots \\ d(j) \end{bmatrix} \qquad (3.25)$$
with I denoting the identity matrix. Alternatively, we write $m_{j,0} = M_j d_{j,0}$, where we denote the left side of (3.25) by $m_{j,0}$ and the right side by the product of the matrix $M_j$ and the vector $d_{j,0}$. Geometrically, $m_{j,0}$ is the global attitude trajectory of the camera system while $d_{j,0}$ is the vector of interframe displacements, plus the initial attitude m(0).

According to the definition of the Cramer-Rao bound, we have
$$\mathrm{CRB}(d_{j,0}) = \left\{E\left[\frac{\partial \ln p(u, d_{j,0})}{\partial d_{j,0}}\right]^t\left[\frac{\partial \ln p(u, d_{j,0})}{\partial d_{j,0}}\right]\right\}^{-1}$$
and
$$\mathrm{CRB}(m_{j,0}) = \left\{E\left[\frac{\partial \ln p(u, m_{j,0})}{\partial m_{j,0}}\right]^t\left[\frac{\partial \ln p(u, m_{j,0})}{\partial m_{j,0}}\right]\right\}^{-1}.$$
Since
$$\frac{\partial \ln p(u, d_{j,0})}{\partial d_{j,0}} = \frac{\partial \ln p(u, m_{j,0})}{\partial m_{j,0}}\,\frac{\partial m_{j,0}}{\partial d_{j,0}} = \frac{\partial \ln p(u, m_{j,0})}{\partial m_{j,0}}\,M_j,$$
it follows that
$$\mathrm{CRB}(m_{j,0}) = M_j\,\mathrm{CRB}(d_{j,0})\,M_j^t. \qquad (3.26)$$
For our purpose of investigating the asymptotic behavior of the Cramer-Rao bound, we need a uniformity condition on the motion sequence, as we did earlier, since the behavior of an otherwise arbitrarily changing motion trajectory can depend more on a particular local motion than on the temporal trend of the error behavior. Now, we assume a uniformity with which the differences among the interframe motions $d(i)$, $i = 1, 2, \dots, j$, are neglected. In other words, the Cramer-Rao bound of the interframe motions, $\mathrm{CRB}(d_{j,0})$, which is a symmetric matrix, is now a band matrix:
$$\mathrm{CRB}(d_{j,0}) = \begin{bmatrix} C_0 & C_1 & C_2 & & 0 \\ C_1 & C_0 & C_1 & C_2 & \\ C_2 & C_1 & C_0 & C_1 & \ddots \\ & C_2 & C_1 & C_0 & \ddots \\ 0 & & \ddots & \ddots & \ddots \end{bmatrix}. \qquad (3.27)$$
In other words, denoting $\mathrm{CRB}(d_{j,0}) = [C_{pq}]$, we have $C_{pq} = C_{qp} = 0$ whenever $|p - q| \ge h$, for some constant h. The uniformity condition requires that the error bounds for estimating the interframe motions $d_i$ and $d_j$ are not correlated when the interframe motions are farther than h frames apart. This is a reasonable condition because an interframe motion depends mostly on the two image frames that define it. Although information about the scene structure may contribute to the estimation of an interframe motion to some degree, two far-apart interframe motions do not share any common scene element when h is large enough in a general transitory sequence. With a simple transitory sequence, two interframe motions do not share any common scene element as soon as $|p - q| \ge 2$. Without loss of generality, we can consider h = 2 for a simple transitory sequence.

Thus, (3.26) and (3.27) give
$$\mathrm{CRB}(m_{j,0}) = \begin{bmatrix} I & 0 & \cdots & 0 \\ I & I & & \\ \vdots & & \ddots & \\ I & I & \cdots & I \end{bmatrix}\begin{bmatrix} C_0 & C_1 & & 0 \\ C_1 & C_0 & C_1 & \\ & C_1 & C_0 & \ddots \\ 0 & & \ddots & \ddots \end{bmatrix}\begin{bmatrix} I & 0 & \cdots & 0 \\ I & I & & \\ \vdots & & \ddots & \\ I & I & \cdots & I \end{bmatrix}^t. \qquad (3.28)$$
It is worth noting that we have not imposed any particular distribution type on noise in the observation other than the uniformity condition, the above conclusion holds for any noise distribution provided the uniformity condition is satisfied. In (3.27), we regarded m(0) the same as d(1), d(2), - - - , d(j) in the vector djp just for notation convenience. It will not affect the order of the Cramér-Rao bound we derived in (3.29) since error in m(O) affects just a constant in CRB(m,-,o). 65 The Cramér-Rao bound of the global position of the structure can be investigated in a similar manner. Each scene element visible by the last frame in F0 is a function of global pose coupled with the structure estimate from the global pose, which results in an additive constant error covariance in addition to that of the global pose. Therefore, the Cramér-Rao bound of the global position of the structure has the same order in error rate as the global pose. The general transitory case First, we extend the above result for F0 to the other subsequences F1. We extend our notation from m 3,0 and dip to my] and d][], respectively, to denote the corresponding trajectories of Fl, starting from frame f; to frame fij, l = 0, l,~-,k — 1. Given F, the above discuss still holds for F1, except that the meaning of C: in (3.27) is the Cramér-Rao bound of the corresponding component based on the entire F instead of (I) - just F0. Therefore, the Cramér—Rao bound of the error rate of the global pose mjy, lS still of order 0(j) = 0(n/k): CRB(m§-iz)) = 0(n/k) (3.30) Consider each scene element .23., that is visible from fjp+1, the last frame of F1, l = 0,1,--- ,1: —- 1. Since Cramér-Rao bound is a lower bound and Table 3.1 means that CRB(:1:,,) S 0(n/k2). To establish CRB(:rn) = 0(n/k2), all we need to prove is CRB(:L'n) Z 0(n/k2). To do the latter, we can neglect some errors. For I = l, 2, . - - , k — 1, we neglect the interframe pose error between frame f0 and f1, and the 66 error in the process of constructing 27,, from frame fjp+l. Thus, 23,, is determined by the camera global position at fjp+1 by a function g: 0 l k—l I. = g 1, the old structure estimate is hardly useful in the fusion with that in view p. Under the second cross—frame method, each previous structure estimate at p —l is directly transformed to p under one transformation. Thus the transformed structure deteriorates by the motion error only once. The error covariance matrix of each transformed structure is a sum of two terms, one from single cross-frame motion and the other from the observation error at the frame p — l, as long as frame p — l is in the same batch as frame p. Thus, following a derivation similar to that in Section 3.2.1, we know that the second method has an asymptotic error rate of 1/b as listed in Table 3.1. Fig. 3.3 graphically explains the advantage of cross—frame motions. In practice, we define a number K, called extra batch size, to be the number 71 Cross—frame motions Figure 3.3: Using cross-frame motions to integrate many views. Each elongated ellipse indicates the uncertainty in 3-D point position transformed from a single previous stereo view to the current view. The integrated uncertainty is greatly reduced using the multiple cross-frame motions instead of interframe motions. 72 of extra image (stereo) frames that are processed as a batch in additional to the last two. Thus, at current frame number p, all the image frame batch consists of frames from p — K — 1 to p. 
Due to the transitory nature of the image sequence, any frame q, with q < p — K — 1, with not share any portion of the scene with frame p if K is sufficiently large. According to our discussion about non-transitory and transitory image sequences, it is useful for Ix" to span a subsequence that is nearly nontransitory. Otherwise, the estimate is sequential in nature, as expressed in (3.14), beyond a certain extent. With a batch at frame p, the current active cross-frame motion set is denoted by p—l W(P) : U {mp.i(Rp.ian.i)}- t:p—I(—I The cross-frame motion set completely defines the motion between any two frames within the batch. When K = 0, we have just an interframe motion in W(p). Let N be the total number of feature points being considered; 2,3, denote the three dimensional local structure of i—th point in s—th camera-centered system; um, be the 2—D image coordinate vector of i-th point on the j-th side (left, right) at the s-th frame. Assuming that the noise in the observations (um-,8) is uncorrelated and has the same variance (03,03) in the two image coordinates, expression (3.34) that is to be minimized can be written the following form min f(m, 33131;) = A + B (3.36) ng,p.Vm€W(p) 73 where N A = 2mm..-“P.;,,,_,._. Hi..-I.--.)-ls.} i=1 with Si = ($131) ‘ X(mp.p-K—lv mf,p-I\"-l)) and pRN 0‘20 B=ZZ( {Hays — “(mama mapl)’ ("this — "(mm 333)) szp-K j=L i=1 0 -2 In the above expression, X(m,,p, 13:31)) is the transformation function to transform the point mm from camera coordinate system at frame p to frame 3 based on the motion parameters map. Function u(m,,p,:r,-,p) is the noise—free projection computed from m”, and 2,4,, which includes transformation and projection. This objective function has two terms. The first term, A, reflects the integrated 3-D structure in the past up to time p — K — 1. The second term, B, is used to minimize the image plane error of the frames within the batch from from p — K up to p. The summation bound for i can be modified to include only those points that are visible in each frame so that a point does not have to be visible through the entire batch. Minimization of the objective function The objective function in (3.36) is neither linear nor quadratic in terms of cross-frame motion parameters, m, and 3—D feature points, 2:. An iterative algorithm is required to search for the solution of m and as. The dimension of the unknown parameters is 74 intractably huge due to a typically large N. Thus, a direct optimization is impractical. Our two procedures play a central role in resolving this problem: First, a (suboptimal) closed~form solution for interframe motion from p — 1 to p is first computed. This interframe motion is used together with previous pose estimate to compute a preliminary estimate for all the cross-frame motions needed. The second is to eliminate iteration on the structure. The gradient—based search is only applied to cross-frame motions, because given each candidate set of cross-frame motions the best structure for (3.36) can be directly computed in a closed-form. To show how, let us examine the objective function (3.36). The second term of the objective function corresponds to minimizing the image vector error within the batch. An alternative way to approximate this is to use the matrix-weighted discrepancy of 5mm — X(mp,j,$i’j), the 3-D position difference, to give the total discrepancy P N min 2 2137:}? _ X(mp.sa $i,.))t(Rp.st:,,Rid—Vat ’ X(mp.sa (vial) (3°37) 171' . 
where $x^s_i$ is computed from the triangulation at frame s and $\Gamma_{x^s_i}$ is the estimated covariance matrix of $x^s_i$ from triangulation. Substituting the second term of the objective function (3.36) with (3.37), we minimize
$$\min_{x_{i,p}} \tilde f(x_{i,p}) = \sum_{s=p-K-1}^{p}\sum_{i=1}^{N}\left(x_{i,p} - X(m_{p,s}, x^s_i)\right)^t\left(R_{p,s}\,\Gamma_{x^s_i}R_{p,s}^t\right)^{-1}\left(x_{i,p} - X(m_{p,s}, x^s_i)\right) \qquad (3.38)$$
given any W(p). The above is a linear minimization problem, for which we just need to solve the following linear equation [59]:
$$\left\{\sum_{s=p-K-1}^{p}\left[R_{p,s}\,\Gamma_{x^s_i}R_{p,s}^t\right]^{-1}\right\}x_{i,p} = \sum_{s=p-K-1}^{p}\left\{\left[R_{p,s}\,\Gamma_{x^s_i}R_{p,s}^t\right]^{-1}X(m_{p,s}, x^s_i)\right\} \qquad (3.39)$$
which gives
$$\hat x_{i,p} = \left\{\sum_{s=p-K-1}^{p}\left[R_{p,s}\,\Gamma_{x^s_i}R_{p,s}^t\right]^{-1}\right\}^{-1}\left\{\sum_{s=p-K-1}^{p}\left[R_{p,s}\,\Gamma_{x^s_i}R_{p,s}^t\right]^{-1}X(m_{p,s}, x^s_i)\right\}.$$
Its error covariance matrix is estimated by [59]
$$\Gamma_{\hat x_{i,p}} = \left\{\sum_{s=p-K-1}^{p}\Gamma_{i,s}^{-1}\right\}^{-1}$$
where
$$\Gamma_{i,s} = \left(\frac{\partial X(m_{p,s}, x^s_i)}{\partial m}\right)\Gamma_{m_{p,s}}\left(\frac{\partial X(m_{p,s}, x^s_i)}{\partial m}\right)^t + \left(\frac{\partial X(m_{p,s}, x^s_i)}{\partial x^s_i}\right)\Gamma_{x^s_i}\left(\frac{\partial X(m_{p,s}, x^s_i)}{\partial x^s_i}\right)^t.$$

3.3.2 World-centered representation

The WC representation follows a similar derivation. The difference is that the structure does not move in the WC system. Thus, the structure integrated in the WC system up to any time can be used directly for later WC integration.

Objective function

Without loss of generality, let the world coordinate system coincide with the camera-centered coordinate system of the first frame. $M(m, n) = \bigcup_{i=m}^{n}\{m_{i,1}\} = \bigcup_{i=m}^{n}\{R_{i,1}, T_{i,1}\}$ is the collection of all the global motions, where $(R_{i,1}, T_{i,1})$ are the rotation matrix and translation vector from frame 1 to frame i. For each feature point i, we have a structure $G_i$ represented in the world coordinate system. Slightly modifying equation (3.36), we get the appropriate objective function for the WC representation:
$$\min_{\forall G_i,\,\forall m \in M(p-K-1,\,p)} f(G_i, m) = A + B \qquad (3.40)$$
where
$$A = \sum_{i=1}^{N}\left(G_i - G_i^{*}\right)^t\,\Gamma_{G_i^{*}}^{-1}\,\left(G_i - G_i^{*}\right) \qquad (3.41)$$
and
$$B = \sum_{s=p-K}^{p}\sum_{j=L}^{R}\sum_{i=1}^{N}\left(u_{i,j,s} - u(m_{s,1}, G_i)\right)^t\begin{bmatrix}\sigma_0^{-2} & 0 \\ 0 & \sigma_0^{-2}\end{bmatrix}\left(u_{i,j,s} - u(m_{s,1}, G_i)\right). \qquad (3.42)$$
In the objective function, $u(m_{s,1}, G_i)$ is the noise-free projection computed from $m_{s,1}$ and $G_i$. The essence of the above objective function is that the newly updated global structure $G_i$ takes into account the old observation $G_i^{*}$ integrated up to frame p - K - 1, but it considers all the observations in the batch as image vectors, all properly weighted in the sense of Gauss-Markov.

Similar to the computation for the CC representation, no iteration is needed for the structure part, and a suboptimal closed-form solution is computed first for motion and structure, which is used as the initial guess for the minimization. The following equation gives the closed-form solution for the structure parameters $G_i$ when the motion parameters M(m, n) are given:
$$\hat G_i = \left(\sum_{s=p-K-1}^{p}\Gamma_{G_{i,s}}^{-1}\right)^{-1}\left(\sum_{s=p-K-1}^{p}\Gamma_{G_{i,s}}^{-1}G_{i,s}\right) \qquad (3.43)$$
where $G_{i,s}$ is the estimate based on the single frame s. The estimated error covariance matrix of the newly updated structure is
$$\Gamma_{G_i} = \left(\sum_{s=p-K-1}^{p}\Gamma_{G_{i,s}}^{-1}\right)^{-1}. \qquad (3.44)$$
This WC-based objective function is in essence similar to those of [10] and [131]. The differences are: (a) a batch parameter K is used to better deal with the transitory sequence; (b) the image-plane discrepancy is minimized, to automatically take into account the non-symmetrical nature of the error distribution in 3-D point positions; (c) the algorithm can automatically handle leaving points and arriving points, which is required with transitory sequences.
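The closed-form update (3.43)-(3.44) is the standard information-weighted (Gauss-Markov) fusion of per-frame estimates. A small sketch, with function and variable names of my own choosing:

```python
import numpy as np

def fuse_structure(estimates, covariances):
    """Information-weighted fusion of per-frame structure estimates.

    estimates   : list of 3-vectors G_{i,s}, one per frame s in the batch.
    covariances : list of 3x3 covariance matrices of those estimates.
    Returns the fused estimate and its covariance, as in (3.43)-(3.44).
    """
    info = np.zeros((3, 3))        # sum of inverse covariances
    weighted = np.zeros(3)         # sum of inverse-covariance-weighted estimates
    for G, cov in zip(estimates, covariances):
        W = np.linalg.inv(cov)
        info += W
        weighted += W @ G
    fused_cov = np.linalg.inv(info)
    return fused_cov @ weighted, fused_cov
```

Essentially the same routine serves the CC update (3.39), with the per-frame covariances replaced by the appropriately transformed covariances of the cross-frame predictions.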
3.4 Experiments

We conducted experiments with synthetic and real-world images in order to experimentally examine the error rates listed in Table 3.1 and to compare the WC and CC representations.

Figure 3.4: Simulation environment, where a distance of 7000mm is covered by the 31 frames.

3.4.1 Simulation Data

3-D feature points were generated randomly for each trial, between depth 2000mm and depth 3000mm, with a uniform distribution. The entire scene is covered by 31 frames and the distance between consecutive frames is roughly 200mm. A small rotation is added between each pair of consecutive frames. Fig. 3.4 illustrates the simulation environment. This environment is similar to the real setup shown later. The average errors we will show were obtained through 100 random trials, each with a different set of 3-D points.

Measurement

The simulated camera system has a resolution of 512 x 480 pixels, just like the real cameras we used. Measurement error was simulated by pixel round-off error. This level of measurement accuracy is generally higher than, but close to, the accuracy of our feature detector, matcher and tracker, according to our visual inspection of the algorithm. The camera's global orientation is determined by a rotation matrix $R_{m_i}$ and its position by a translation vector $T_{m_i}$. The error of a matrix or vector is measured as the Euclidean norm of the difference between the estimated and the true one. The world system is placed at the scene at the first frame, based on which the global structure of the feature points is defined. If a feature point is visible at a certain frame, it also has a local structure (i.e., with respect to the CC reference at that frame). In the WC representation, the global structure is directly estimated but the local structure needs to be computed via the estimated global pose of the camera. The situation is just the opposite in the CC representation, where the local structure is directly estimated while the global structure must be computed via the camera's global pose.

Batch Size

The visibility span determines the number of frames which share a common view. It can also be used as a criterion to select the maximum batch size. Obviously, a batch size that goes beyond the visibility span cannot help much. With our setup, in order to let the first and the last frame in the batch share at least 30% of the scene, the batch size should not be larger than 3.

Results

Fig. 3.5 shows the current camera position error $(R_{m_i}, T_{m_i})$ for different frames, where i is the index of the time as shown in the figure. The first column is for the CC representation and the second is for the WC representation.
Batch size 3 ----- .1 l 6 ll 16 21 26 Time sequence (c) XY component error of translation vector in CC 5 l I fl l 1 Batch size 0 -— Batch size I ---- Batch size 3 ----- 1 l l l l I 6 I I 16 2| 26 Time sequence (d) 2 component enor of translation vector in CC I I T I I Batch size O -— Batch size I ---- Batch size 3 ----- 1 1 1 1 1 l 6 I I 16 2| 26 Time sequence 80 Error (mm) Error (mm) Error (mm) (6) Error of rotation matrix in WC representation r r r r T Batch size 0 — Batch size 3 ---- _ (1) Error of translation vector in WC representation I l I l I Batch size 0 — r: Batch size 3 ---- 4 l l J l l l 6 ll 16 21 26 Time sequence (g) Translation XY component error in WC l I l I I Batch size 0 —- )- Batch size 3 ---- l 6 ll 16 21 26 Time sequence (h) Translation Z component error in WC T l l T I Batch size 0 — ,_ Batch size 3 ---- q l 6 ll [6 21 26 Time sequence Figure 3.5: Camera global pose error versus time. The CC representation on the left column and WC on the right. (a) and (e): Error in rotation matrix. (b) and (f): Error in translation vector. (c) and (g): Error in zy-component of the translation vector. ((1) and (h): Error in z-component of the translation vector. 81 error increases with the time, an inherent property with transitory image sequences as Table 3.1 shows. It can be seen from the figure that the batch size has more impact in the CC representation than WC. This is because in the CC representation, the reference frame moves, which introduces more nonlinearity than the WC case when the old observation is transformed into the current CC reference system. A larger batch size is more appropriate for such a nonlinear transform, because covariance for error modeling is based on a linear approximation for nonlinear systems. Fig. 3.5 clearly shows that for camera pose estimates the WC representation is a little better, which is consistent with Table 3.1. Fig. 3.6 shows the local and global structure error. The error is shown as the average error of all the visible feature points at the current frame. The result in- dicates that a larger batch size is very effective to reduce both the local and global structure errors, for CC representation, as we predicted in Section 3.3.1. Because of the structure is directly estimated in the WC representation and thus the “measure- ment equation” is linear. Thus, a lager batch size does not improve much for WC representation due to dominantly linear nature of the WC structure fusion. The fig— ure also shows that the WC representation performs better for estimating the global structure while the CC representation does better for local structure, as predicted by Table 3.1. A point worth noting here is the fact the local structure error with the CC representation is constant, while that with the WC representation grows with time, also a property predicted by Table 3.1. 82 (a) Global structure error in CC representation (e) Global structure error in WC representation T r r r r 8 r r r r r Batch size 0 — Batch size 0 -- Batch size I ---- Batch size 3 ---- I 15 - BmCh SI“ 3 ..... -I I -‘ ,~-r ------ g . ............ - ----------- . 3 5 _ ‘o ................. 
Figure 3.6: Structure error versus time. The CC representation is in the left column and the WC representation on the right. (a) and (e): global structure error. (b) and (f): xy-component of the global structure error. (c) and (g): local structure error. (d) and (h): xy-component of the local structure error.

3.4.2 Experiments with a real setup

A challenging task facing the area of motion and structure analysis is to provide data from rigorous experiments that verify the actual accuracy of the results, so that we can evaluate whether passive structure sensing is possible and reliable in the real world. The results reported here are an effort toward this goal. The data are processed in an off-line fashion.

The setup used for our image acquisition is a Denning MRV-3 mobile robot and a pair of stereo cameras, 265mm apart, mounted on a custom-designed stereo positioning setup that allows step-motor-controlled pan and tilt for each camera from a computer, as shown in Fig. 3.7. The stereo camera system was calibrated, with distortion compensation, using an algorithm from Weng et al. [190]. The field of view of each camera is about 36 degrees diagonally, and each digitized image has 512 x 480 pixels. An image sequence of 151 frames was acquired from the moving mobile robot. It contains a left-view sequence and a right-view sequence. The entire stereo sequence was used for automatic feature extraction, matching and tracking. A temporally subsampled (one sample every 5 frames) subsequence of 31 frames was used for motion and structure estimation, with the consideration that this subsequence is dense enough for estimation and yet enables cross-frame motions to cover more original frames with a relatively small batch size. Fig. 3.7 shows a few images in the 151-frame sequence.

Figure 3.7: The robot and a few stereo frames in the 151-frame sequence. (a) Robot. (b) Left image of frame 0. (c) Left image of frame 50. (d) Left image of frame 100. (e) Left image of frame 150.

A feature point detector has been developed for this project to automatically detect feature points in images. The feature detector first computes a cornerness measure (the degree to which a point looks like a corner) at every pixel. Then the local peaks of this cornerness measure are detected to form a peak histogram, ranked by the cornerness measure.
The program automatically determines the threshold so that the required number of features is obtained from the top rankings. An area-directed analysis is incorporated into the scheme so that the detected feature points spread evenly across the entire image.

Stereo matching was done using an image matching algorithm from Weng et al. [188], which provides a dense displacement field with a disparity vector for every pixel. The disparity vector at every feature point is extracted from this field. For efficiency, the algorithm uses the tracking mechanism as much as possible; the matcher is called only when tracking is not successful according to the closeness measure used by the tracker. For each addition of new points, the stereo and temporal matchings are performed using the same algorithm from Weng et al. [188].

Once a new feature point is added from a left image, a square template (7 x 7) centered at this point is recorded as the left template for this point. The stereo matching algorithm gives the matching pixel in the right image, from which a square template is recorded as the right template for the point. Temporal tracking uses a prediction-and-verification scheme for each of the left and right sequences. The temporal interframe displacement is predicted from the previous displacement of the point. Verification is then performed based on the value of template matching: let $t(i,j)$, $-s \le i, j \le s$, be the template, and let $f(i,j)$ denote the image value at the point $(i,j)$. A template difference value centered at $(x,y)$ is defined by

$$d(x,y) = \sum_{i=-s}^{s} \sum_{j=-s}^{s} \bigl|\, [t(i,j) - \bar{t}\,] - [f(x+i,y+j) - \bar{f}(x,y)] \,\bigr| \qquad (3.45)$$

where $\bar{t} = \frac{1}{(2s+1)^2}\sum_{i=-s}^{s}\sum_{j=-s}^{s} t(i,j)$ and $\bar{f}(x,y) = \frac{1}{(2s+1)^2}\sum_{i=-s}^{s}\sum_{j=-s}^{s} f(x+i,y+j)$. Namely, the template and the image are first locally normalized so that their local sums are equal to zero, and the template difference value is then the sum of the absolute differences between the locally normalized template and the locally normalized image. The best matching point is the pixel at which the difference value reaches its minimum in a small neighborhood (5 x 5 in our experiment) centered at the predicted position. A point becomes inactive if the best matching point exceeds the allowable template difference value.

Figure 3.8: The tracking record of the feature points through the 151 frames in the sequence. If a point h is successfully tracked from frame i to frame j, a vertical line is shown at point number h from frame i to frame j. (Due to the limit of the printer resolution, lines are merged in the plot.)

This temporal matching and tracking method was very successful. The trace record of the entire sequence is shown in Fig. 3.8. About 100 feature points were automatically kept at any time. Since some points may go out of view and some points may become inactive, the number of active points may fall below a tolerable number (90 in our experiment). When this happens, the feature detector is called to provide additional points from the image, and the stereo matcher is then called to give stereo matches for these new points. The times when the feature detector was automatically called can be clearly identified in Fig. 3.8. Many points were successfully tracked until they went out of view. Figure 3.9 presents an example of temporal matching-and-tracking. A careful visual inspection of the entire point trace indicates that there were no visible errors.
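The locally normalized template difference in Eq. (3.45) and the small-neighborhood verification step are straightforward to implement. The sketch below is only an illustration of the idea, not the code used in the experiments: the function names and the rejection threshold value are our own assumptions, while the 5 x 5 search neighborhood and the zero-mean normalization follow the description above.

```python
import numpy as np

def template_difference(image, template, x, y):
    """Eq. (3.45): sum of absolute differences after removing the local means.
    Bounds checking is omitted; (x, y) is assumed to lie far enough from the border."""
    s = template.shape[0] // 2                      # template is (2s+1) x (2s+1)
    patch = image[x - s:x + s + 1, y - s:y + s + 1].astype(float)
    t = template.astype(float)
    return np.abs((t - t.mean()) - (patch - patch.mean())).sum()

def track_point(image, template, predicted, search=2, max_diff=500.0):
    """Verify a predicted position by searching a (2*search+1)^2 neighborhood
    (5 x 5 for search=2).  Returns the best position, or None if even the best
    match exceeds the allowable template difference (max_diff is a placeholder
    value, not the threshold used in the experiments)."""
    px, py = predicted
    best, best_pos = np.inf, None
    for dx in range(-search, search + 1):
        for dy in range(-search, search + 1):
            d = template_difference(image, template, px + dx, py + dy)
            if d < best:
                best, best_pos = d, (px + dx, py + dy)
    return best_pos if best <= max_diff else None
```

A point rejected this way would be marked inactive, which in turn triggers the feature detector when too few active points remain.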
The measurement procedure for the real setup is similar to that of the simulation. To verify the accuracy of the structure estimates as well as the camera pose estimates, the global coordinates of a set of test points were carefully measured to within an error of 1 mm. The selection of test points was based on ease of measurement and not on the automatically selected features. Thus, a test point is not necessarily one of the feature points used by the automatic algorithms, although many of them are. The image coordinates of the test points were manually measured from the digital images. The accuracy of the reconstructed structure was measured by the following steps. (a) Compute the WC and CC representations of the feature point positions and camera pose using the fully automatic algorithm described above. (b) Manually measure the image coordinates of the test points. (c) Perform a multiframe triangulation to get the 3-D positions of the test points; the number of frames used corresponds to the batch size. (d) Measure the global position error as the difference between the true and estimated test points. This way of measuring error tests not only the pose of the camera, but also some of the reconstructed feature points, if they are also test points.

Figure 3.9: Stereo matching and temporal matching-and-tracking. (a) An example of stereo matching (frame 0). (b) An example of temporal matching and tracking (frame 24 to 69). A needle is drawn from each feature point to its position in the target frame. Due to camera vergence, the orientation of the needles in (a) is correct.

Figure 3.10: Sample test points on one frame. Each cross shows the location of a test point.

Table 3.2 lists some data of the real setup. Fig. 3.10 shows the test points of one frame.

Table 3.2: Some Data for the Real Setup

    Number of frames           31
    Distance traveled (mm)     3097
    Number of feature points   387
    Number of test points      85

First, to show how well the estimated structure and camera pose agree with the automatically detected feature points, the estimated 3-D feature points were projected onto the image plane according to the estimated pose. The average distance between each projected point and the actually detected feature point is called the image plane residual and is shown in Table 3.3. The values are around a fraction of a pixel for both representations. These numbers also indicate that the camera distortion compensation in the calibration was very effective.

Table 3.3: Average Image Plane Residual

    Representation   batch size 0   batch size 3
    WC               0.76 pixels    0.68 pixels
    CC               0.45 pixels    0.51 pixels

Fig. 3.11 shows the actual camera orientation error. Although the image sequence here is transitory, the pitch and roll errors are comparable with those in the non-transitory "Hotel" sequence experiment by Tomasi and Kanade [178] over the entire sequence. The visibility span of our setup is about 4 frames. At frame 4, the yaw error has the same magnitude as that in [178]. After frame 4, the error tends to increase due to the transitory nature of the sequence. It is interesting that the roll and pitch errors did not increase quickly with time. After traveling about 3000 mm, the total orientation error is no more than 0.02 degrees in roll, 0.23 degrees in pitch, and 2 degrees in yaw with the WC representation. Fig. 3.12 shows the camera position error and Fig. 3.13 presents the global error of the test points visible at the current time. As we predicted, the error increases with time, but the estimates appear good.
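For concreteness, the image plane residual of Table 3.3 can be computed as the mean reprojection error of the estimated 3-D feature points under the estimated camera pose. The sketch below assumes a simple pinhole model with the world-to-camera convention Xc = R Xw + t; the function name and argument layout are our own, not the thesis implementation.

```python
import numpy as np

def image_plane_residual(points_3d, pixels, R, t, f):
    """Mean distance (in pixels) between reprojected 3-D points and the
    detected feature locations.

    points_3d : (N, 3) estimated feature positions in the world frame
    pixels    : (N, 2) detected image coordinates (x, y)
    R, t      : camera pose, mapping world to camera coordinates (assumed convention)
    f         : focal length in pixels
    """
    cam = points_3d @ R.T + t                 # world -> camera coordinates
    proj = f * cam[:, :2] / cam[:, 2:3]       # perspective projection
    return np.linalg.norm(proj - pixels, axis=1).mean()
```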
After traveling about 3000 mm, the estimated camera global position error is less than 60 mm in depth Z (less than 2.3%), about 43 mm horizontally, and under 25 mm vertically with the WC representation. This seems to indicate that reasonable results can be obtained with a fully automatic algorithm, even with a transitory image sequence.

Figure 3.11: Camera rotation error versus time. (a) and (d): Yaw error. (b) and (e): Roll error. (c) and (f): Pitch error.

Figure 3.12: Camera position error versus time. (a) and (e): Camera position (y- and z-components). (b) and (f): Position error (x-component). (c) and (g): Position error (y-component). (d) and (h): Position error (z-component).

Figure 3.13: Structure error. (a) and (d): Global structure error. (b) and (e): Global structure error (x- and y-components). (c) and (f): Global structure error (z-component).

Fig. 3.14 shows the reconstructed 3D surface. The surface detail is described by mapping the intensity of the images onto the reconstructed surface. This approach is
known as texture mapping or pattern mapping [17]; the image is called a texture map, and its individual elements are often called texels. The mapping is shown in Fig. 3.15. Each tetrahedron of the reconstructed surface is assigned, as its texture map, the average intensity value of its vertices in the images. In our transitory case, multiple images may correspond to a single patch of the surface. A weighted method is used,

$$I_P = \sum_{i=1}^{k} w_i I_{P_i} \qquad (3.46)$$

where the weights $w_i$ are the inverses of the structure uncertainties of the points and $I_{P_i}$ is the intensity value of the point $P$ in the $i$th image.

Figure 3.14: Reconstructed 3D surface integrated from many partial views in the sequence, shown with original intensity viewed from an arbitrary direction.

Figure 3.15: Weighted texture mapping from pixels to the reconstructed surface.

3.5 Conclusions

In this article, we introduced the concept of a transitory image sequence for structure and motion estimation from long image sequences. It has been shown that integration of a transitory sequence has asymptotic error rates that are very different from those of a non-transitory one. The theoretical error rates listed in Table 3.1 indicate that the WC representation is better for global estimates and the CC representation is superior for local estimates. Based on the analysis, our algorithm keeps both the WC and CC representations.

The data show that the error in the local structure is effectively reduced by a relatively larger batch size and does not increase with time. The global pose of the camera and the global structure of the scene are better estimated by the WC representation. These properties are consistent with the error rates summarized in Table 3.1, but our sequence is still not long enough to clearly duplicate the theoretically proved asymptotic rates. The experiment was conducted using a fully automatic algorithm, and the accuracy of the result has been verified against ground truth. The verified accuracy appears to indicate that with off-the-shelf cameras, one can automatically determine the scene structure and the pose of the camera with good accuracy (the depth error is less than 3% of the traveling distance), although the image sequence here is of the more difficult transitory type (compared to the non-transitory one).

Chapter 4

Hand Sign Recognition

Humans have the capability to interpret hand gestures.
The study of how humans use and interpret hand gestures has a long and rich history. The first known dictionary of American Sign Language (ASL) was published in 1856 [29]. Today, American Sign Language is widely used in the deaf community, as well as by the people who are not deaf [22]. Recently, there is a significant amount of research which uses hand gestures in the field of human machine interface. There are two types of hand gestures: static hand gestures and dynamic hand gestures. Static gestures are determined by a particular hand posture while dynamic hand gestures are characterized by a dynamic process which includes the initial, the intermediate, and the final hand configuration. Sys— tems which are designed to handle dynamic hand gestures have the full potential of understanding human gestures. Two major types of approaches exist in the field of hand gesture recognition. The first approach uses glove-based input devices. Glove-based devices employ mechanical 97 98 or optical sensor attached to a glove that transduces finger flexion and abduction into electrical signals for the purpose of determining the hand posture. The second approach is the vision-based approach. This approach acquires visual information of a hand gesture by using a single video camera or a pair of cameras. Once the visual information is acquired, the sign is extracted by analyzing the temporal image sequence. In this chapter, we give the literature review of existing hand gesture recognition systems. 4 . 1 Glove-Based Systems Glove devices measure the shape of the hand as the fingers and palm flex. Over the past decade, especially in the last few years, many researchers have built hand and gesture measuring devices for computer input. In this section, we briefly describe Some significant ones. 4 .1 .1 Glove devices In the early 19803 research at MIT used a camera-based LED system to track body and limb position for real-time computer graphics animation, termed “scripting-by— enactment” [74]. This work included a glove studded with LEDs. By focusing the Camera on just the hand, they captured finger motion. Zimmerman et. al. developed the DataGlove that monitored 10 finger joints and the six degrees of freedom of the band’s position and orientation [206]. Commercial- ization of the DataGlove by VPL Research, at a reasonable cost led to its widespread 99 use around the world. Most DataGloves have 10 flex sensors, but some have been made with abduction sensors that measure the angle between adjacent fingers. Kramer and Leifer [107] developed the CyperGlove at Stanford University. It is a custom-made cloth glove with up to 22 thin foil strain gauges sewn into the fabric to sense finger and wrist bending. A small electronics box converts the analog signals into a digital stream that can be read by a computer’s standard serial port. 4.1.2 Interpreting hand sign with gloved-based devices Several projects have investigated various levels of recognizing hand signs from simple finger spelling to analysis of American Sign Language. The MIT Media Lab used their LED glove as part of an experimental system for finger-spelling using lookup tables in software to recognize finger postures [83]. Kramer and Leifer used the CyberGlove to translate ASL into spoken English [107]. They used a Bayesian decision rule—based pattern recognition scheme to map finger positions, represented as a “hand-state vector”, into predefined letters or sym- bols. 
When the instantaneous hand-state lay close enough to a recognizable state, the Corresponding ASL letter or symbol was put in an output buffer. When a word phrase Was completed, a special sign caused the result to be spoken by a voice synthesizer. ATR Research Labs in Japan developed a coding scheme to allow computer recog- nition of the Japanese kana manual alphabet [173]. Their system used the DataGlove to capture hand posture. It recognized signs through a combination of principal com- ponent analysis (to determine the contributions of each finger joint to the differences 100 between signs) and cluster analysis (to group hand configurations). Fels used a DataGlove to interpret hand motion to drive a speech synthesizer [66]. His particular approach used a three-stage back-propagation neural network trained to recognize gestural “words”. He divided hand motions among 66 finger positions and 6 hand motions. Finger positions defined the root word, while hand motions modified the meaning and provided expression. These combined to form the 203 words of his “language”, loosely based on conventional gestural languages. Fels reported a high recognition rate once the system was fully trained. 4.2 Vision-Based Approach The use of computer vision makes it possible to sense human communication un- obtrusively and enables human users to interact with computers in a truly natural fashion. There are two major problems which are needed to be solved in vision-based approaches. The first problem is segmentation of the moving hand from sometimes complex background. The second prblem is recognition. This involves modeling the hand and the hand motion and then recognition of the gesture. 4.2.1 Segmentation Segmentation is a very difficult problem. Many existing hand gesture recognition systems avoid this problem by 1) using markers or marked gloves [38, 54, 168]; or 2) assuming uniform backgrounds [19, 46, 53, 110]. In dealing with dynamic gestures and assuming that the hand is moving in a 101 stationary environment, we can use motion as a visual cue to do segmentation. Several motion segmentation methods have been proposed. These approaches fall into two categories. Approaches in the first category are designed to deal with rigid moving objects (e.g. [24, 56]). This type of approaches achieves a segmentation by either building a reference image of the static background [56], or extracting the motion entity based on 3-D motion models or 2-D velocity-field models [24]. Since hands are highly articulated and non-rigid objects, the above approaches are not suitable for hand segmentation. The second type of approach fits a shape to deformable objects (e.g. [77, 81, 93, 98, 100, 175]). These models typically need a good initial position to converge. They also need a relatively clean background since the external forces are defined by the image gradient. There are two classes of deformable models, namely, free-form models and para- metric models. In the free—form models, there is no prior information of the global structure of the template; the template is constrained only by local continuity and smoothness constraints [81, 98]. Due to lack of prior information about the global structure, this type of approach relies heavily on good initial positions. When this approach is applied to track moving deformable objects, it requires small interframe deformation to converge. 
This means that an increase in the sampling rate is needed to capture the deformation in detail since the change of the hand configuration can be dramatic within a single gesture. The computational cost increases as more frames are used to represent a hand sign. On the other hand, the parametric deformable models use some prior informa- 102 tion of the geometrical shape of the object. There are some efforts in this category which try to locate hands from input images [77, 93, 100]. Typically, the parametric deformable model includes a prototype template and a deformation model based on a small set of parameters. The final solution is found by minimizing certain type of objective function. Grenander et. al. used a polygon to represent the contour of a human hand [77]. The deformation is described by Markov processes on the edges. A similar scheme was used by Kervrann and Heitz [100] in their work on locating hands from image sequences. The “mean shape” is determined by the salient points of the hand contour and is obtained by the training samples. The deformations, given the “mean shape”, are modeled using linear combinations of the eigenvectors of the variations from the mean shape. Jain et. al. used a bitmap to represent the prototype template [93]. The template is then deformed to fit salient edges in the input image by applying a probabilistic transformation on the prototype contour which maintains smoothness and connectedness. The major drawback of the above approaches is that they only allow very limited deformation. Limited deformation results because the deformation is modeled as a small perturbation of the prototype [100] or because the penalty term is presented in the objective functive function when the template deforms from the prototype [93]. This drawback makes the model unsuitable for dealing with dynamic gestures where typically deformation is large. Recently, Moghaddam and Pentland [133] presented a maximum likelihood de- tection method to locate hands in a cluttered scene. The method uses the training 103 data to estimate the density of a Mixture-of—Gaussian model (for multimodal distri- bution). These probability densities are then used to formulate a maximum likelihood estimation framework to detect hands. The training set of hand shapes is obtained against a black background and the contour of the hand is extracted using a Canny edge-operator. Then, a diffusion process is applied to these binary edge maps to broaden and smear the edges. Finally, they are projected to the eigenspace. The density estimation is in the eigenspace instead of the original high-dimensional image space. The major drawback of this method is that on certain illumination condition, edge operator may fail to pick up some hand contour edges. In that case, the system can fail to locate hands. 4.2.2 Recognition Existing approaches typically include two parts, modeling hands and analysis of hand motion. Models of the human hand include the fingertip model, the three dimensional model, the two dimensional shape model, and the region-based model. Different hand models lead to different models of hand motion. The trajectory of each finger tip is Suitable to represent the motion in the case when the fingertip is used to model hands. The system which uses the three dimensional hand model is capable of modeling the real hand kinematics. The two dimensional hand model can describe two dimensional rotation and translation. 
For a system that uses the region-based model, motion can be modeled as a change of state.

Fingertip model

Cipolla, Okamoto and Kuno [38] presented a real-time structure-from-motion method in which the 3D visual interpretation of hand gestures is used in a man-machine interface. Consider an arbitrary coordinate system with the x-y plane spanning the image plane (at distance f from the optical center) and the z-axis aligned with the ray. Assume the fingertip has a translational velocity with components $U_1, U_2, U_3$ and an angular velocity with components $\Omega_1, \Omega_2, \Omega_3$. The two components of the image velocity of a point in space, $(X, Y, Z)$, due to relative motion between the observer and the scene under perspective projection are given by [119]:

$$u = \left[\frac{fU_1 - xU_3}{Z}\right] - \frac{xy}{f}\Omega_1 + \Bigl(f + \frac{x^2}{f}\Bigr)\Omega_2 - y\Omega_3,$$
$$v = \left[\frac{fU_2 - yU_3}{Z}\right] - \Bigl(f + \frac{y^2}{f}\Bigr)\Omega_1 + \frac{xy}{f}\Omega_2 + x\Omega_3. \qquad (4.1)$$

The second component depends only on the rotational motion about the viewer center. It gives no useful information about the depth of the point or the shape of the visible surface. The rotational component can be removed if, instead of the raw image motion, the difference of the image motion of a pair of points is used. This is called motion parallax. The parallax motion vector, together with the divergence, curl, and deformation components of the affine transformation of an arbitrary triangle with a point at each vertex, determines the projection of the axis of rotation, the change in scale, and the cyclotorsion.

Davis and Shah [54] created a system to recognize a sequence of multiple gestures. The library of gestures includes seven gestures. The user must start in the designated start position upon initialization of the system and is able to make gestures until the termination gesture is recognized by the system. A moving light display is obtained by extracting fingertips from each frame. A moving light display, as labeled by Rashid [156], is a sequence of binary images representing points from a moving object. The positions of the fingertips are marked by special gloves. The trajectory of each fingertip is found by minimizing the overall proximal smoothness as much as possible while remaining fair to each individual assignment [153]. The success of the algorithm requires the assumptions of smooth motion in the sequence and small motion between consecutive frames. The system is generally unable to handle occlusion. A finite state machine is used to guide the flow and recognition of gestures based on the motion characteristics of the hand. Owing to the nature of the machine, no warping of image sequences is necessary (i.e., a fixed number of images per gesture sequence is not required).

Three dimensional model

Kuch and Huang [110] used cubic B-splines to represent the surfaces of the palm, fingers, and thumb. The use of B-splines allows the rendering of smooth surfaces, while allowing the calibration system to keep track of a smaller set of control points rather than every vertex in the model. The hand model has a total of 23 degrees of freedom (DOF). Each of the four fingers is given four DOF, while five DOF are given to the thumb. The palm is given two internal DOF located at the base of the fourth and fifth (ring and pinky) metacarpals. The last two DOF reflect the ability of the palm to fold or curve. These DOF determine the overall orientation in space of the entire hand.
The model is used to fit a real human hand, so it can be used in tracking and can be incorporated into a virtual environment or a model-based compression scheme such as sign language communication over a phone line. The model needs to be calibrated before it can be used. The calibration is interactive and requires three specific views of the real hand. The interactive selection locates all the joints and delineates the portion currently being fitted from the background. Each tracking session starts with a person holding his hand in a predefined orientation within the field of view of a camera looking at a uniform background. The DOF of the hand model are then independently and locally perturbed to fit the moving hand.

Two dimensional model

Starner and Pentland used hidden Markov models (HMMs) to recognize American Sign Language. An eight-element feature vector consisting of each hand's x and y position, angle of the axis of least inertia, and eccentricity of the bounding ellipse is chosen to represent the hand shape. The eccentricity of the bounding ellipse is found by determining the ratio of the square roots of the eigenvalues of the matrix

$$\begin{pmatrix} a & b/2 \\ b/2 & c \end{pmatrix}$$

where a, b, and c are defined as

$$a = \int\!\!\int x'^2 \, dx'\, dy', \qquad b = 2\int\!\!\int x' y' \, dx'\, dy', \qquad c = \int\!\!\int y'^2 \, dx'\, dy'$$

(x' and y' are the x and y coordinates normalized to the centroid). The work assumes that the order of words in American Sign Language is a first-order Markov process. The topology of the HMM used in the paper is a four-state HMM with skip transitions, determined to be sufficient for the task. Six personal pronouns, nine verbs, twenty nouns, and five adjectives are used in the experiment. These words can form a sentence; however, the structure of the sentence is fixed.

Region-based model

Sometimes the extraction of precise hand and motion information from image sequences is not desirable, or it is very difficult if not impossible. In these instances, we can use features generated from the entire image or from a relatively large region of the image. These features are called region-based features.

Darrell and Pentland [53] adopted a view-based representation for learning, tracking, and recognizing human gestures from a sequence of images. The method uses an automatic view-based approach to build the set of view models from which gesture models will be created. The model views of an object are built using normalized correlation. The first view is chosen by the user as one of the images from a sequence. The object in the subsequent input images is tracked, and when the correlation score $r_m$ drops below a predetermined threshold, a new model view is created from the current input image. This process is repeated until no more models are necessary.
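As a rough sketch of the view-acquisition loop just described (an illustration only, not Darrell and Pentland's implementation; the threshold value and function names are our own), a new model view is added whenever the best normalized correlation against the existing views falls below a threshold:

```python
import numpy as np

def normalized_correlation(a, b):
    """Zero-mean normalized correlation between two equally sized patches."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def acquire_view_models(patches, threshold=0.7):
    """Build a set of model views from a sequence of tracked object patches:
    whenever the best correlation against the existing views drops below the
    threshold, the current patch becomes a new model view."""
    views = [patches[0]]                      # first view chosen from the sequence
    for patch in patches[1:]:
        best = max(normalized_correlation(patch, v) for v in views)
        if best < threshold:
            views.append(patch)
    return views
```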
The models can be learned by observation rather than needing precise CAD models. The drawback of view-based system is that complex, articulated objects (such as hands) have a very large range of appearances, making traditional approaches to view-based matching difficult. The use of grey level correlation can also be highly sensitive to noise. Bobick and Wilson [19] considered each hand image in a sequence as a point in the eigenspace. The eigenspace is computed from the training hand images. The sequence of a hand gesture is then defined as an ordered sequence of fuzzy states in the eigenspace. The fuzziness is defined by the variance of the points that fall near it. For a given gesture, these states are used to capture both the repeatability and variability evidenced in a training set of example trajectories. The states are positioned along a prototype of the gesture, and shaped such that they are narrow in the directions in which the ensemble of examples is tightly constrained, and wide in directions in which a great deal of variability is observed. They used Hastie and .C ’- t' L 'Mit'u 109 Stuetzle [84]’s “principal curves” to compute a prototype trajectory of an ensemble of trajectories. The method is demonstrated at a relatively low dimensional eigenspace (3D). Three eigenvectors may be enough for a simple hand sign such as waving hand in the paper, but for complex signs which have large amount of hand configuration variation, three eigenvectors may not be enough to capture the variance of the pixel intensity values of the training frame. With sparse and high dimensional data, the performance can degrade. Chapter 5 Overview of the Approach A hand gesture is a Spatiotemporal event. A Spatiotemporal event involves an object of interest and motion of the object. In the linguistic description of American Sign Language, Stokoe used a structural linguistic framework to analyze Sign formation [171]. He defined three “aspects” that were combined simultaneously in the formation of a particular sign - what acts, where it acts, and the act. These three aspects translate into building blocks that linguists describe as - the hand shape, the location, and the movement. In this thesis, we present a new framework which will deal with the above three “aspects” of hand signs. There are two major components in our framework. We have a prediction—and-verification scheme to locate hands from complex backgrounds. We also have a Spatiotemporal recognition component which combines motion under- standing (movement) with spatial recognition (hand shape) in an unified framework. 110 111 5.1 Time as a Dimension A natural way to represent a Spatiotemporal event is to consider input image sequence as data in space and time [42, 60] by associating the serial order of the pattern with the dimensionality of the pattern vector. The first temporal event is represented in the plane t = 0 and the second temporal event by plane t = 1, and so on. The entire spatiotemporal pattern vector is considered as a whole by the framework. Figure 5.1 shows an example in which the hand sign “no” is represented by a Spatiotemporal sequence (three images). _l (b) Figure 5.1: The sign “no” and its image sequence representation. (a) The sign of “no”, snap middle finger, index, and thumb together. (b) The sequence representation of the sign “no” . 112 5.2 Recognition of Spatiotemporal Pattern As shown in Fig. 5.1, a Spatiotemporal event includes two kinds of information: the object of interest, and the movement of the object. 
The movement of the object can be further decomposed into two components: global and local motion. The global motion captures gross motion in terms of position. The local motion characterizes deformation, orientation, and gesture changes. In the case of sign language, the hand is the object of interest. The position change of the hand is a global movement, and the change of the hand gesture and orientation is a local movement.

In this thesis, we propose a three-stage framework for spatiotemporal event recognition, as illustrated in Fig. 5.2. The first stage, sequence acquisition, acquires image sequences representing the event. This involves motion detection and motion-based visual attention. The start and end of motion mark the temporal attention window in which the event occurs. We map this temporal window to a standard temporal length (e.g., 5) to form what is called a motion clip, while the speed information remains available from the mapping performed in this stage. In a motion clip, only the temporal dimension is normalized.

The second stage is visual attention and object segmentation. This stage directs the system to focus on the object of interest in the image. Given an image, the object of interest may appear anywhere in the image with a certain size and orientation. Besides the object of interest, the image may contain a complex background as well. Therefore, the first step is to determine where to look, or in other words, to select visual attention. If we assume that the object of interest is moving in a stationary environment, it is not very difficult to roughly determine the position of a moving object in the image using motion information. However, it is not simple if the task is to extract the contour of the object from various backgrounds.

Figure 5.2: The three-stage framework for spatiotemporal event recognition. The video passes through Stage 1 (sequence acquisition), which produces an image sequence; Stage 2 (hand segmentation), which produces a fovea vector and a global motion vector; and Stage 3 (sign recognition), which produces the recognition result.

In Chapter 6, we present an eigen-subspace learning method to segment hands from attention images. In that method, we assume that visual attention is accomplished and that the object of interest is centered in a fixed-size attention window. In Chapter 7, a prediction-and-verification segmentation scheme is proposed to locate hands in complex backgrounds. The scheme uses past experience to guide the search for a valid segmentation and is more efficient and effective than other stochastic approaches such as simulated annealing.

After stage two, the object of interest in each image of a sequence is segmented and mapped to a fovea image of a standard fixed size. Segmented fovea images at different times form a standard spatiotemporal fovea sequence, in which both the temporal and spatial dimensions are normalized. The global motion information of the object of interest is placed in a global motion vector, which records the size and position information of the segmented object in the original image. This vector is necessary because once the object is segmented and mapped to a fovea sequence with a standard spatiotemporal size, the global motion information is lost.

Let a fovea image f of m rows and n columns be an (mn)-dimensional vector. For example, the set of image pixels $\{f(i,j) \mid 0 \le i < m, 0 \le j < n\}$ can be written as a vector $V = (v_1, v_2, \ldots, v_d)$ where $v_{mi+j} = f(i,j)$ and $d = mn$.
Note that although pixels in an image are lined up to form a 1-D vector V this way, 2-D neighborhood information between pixels will be characterized by the scatter matrix of V to be my!) ‘ .4 A. -_‘.ng.._ Ar“. . .—‘ ..4‘“‘::-l. 115 discussed later. Let p be the standard temporal length and f,- be the hand fovea image corresponding to the frame t. Then we create a new vector X, called the fovea vector, which is a concatenation of the hand foveas and global motion vector G, X: (f19f23'°'9fpaG)' (5.1) The third stage is to recognize the Spatiotemporal event from the fovea vector. In Chaptor 8, we present a new framework to recognize hand signs from fovea vectors. Chapter 6 Hand Segmentation from Attention Images Based on Eigen-subspace Learning Given an image, the object of interest may appear anywhere in the image with certain size and orientation. Therefore, in order to segment the object of interest from the input image, the first step is to determine where to look, or in other words, to select visual attention. Human’s visual attention is accomplished by the rotation of eye [170]. Specifically, it may rotate in a rapid jump-like manner (“saccade”) so as to bring the retinal image of an unattended, but eccentric, stationary object to fall at the fovea center. Different visual cues such as motion [24], the generalized symmetry [157], and the casual semantics [16] are used in computer vision to select visual attention. In the 116 117 next chapter, we present our own learning-based method to select visual attention and segment hands from images. In this chapter, we present an eigen—subspace learning method to segment hands from images. We assume the visual attention is accomplished and the object of interest is centered in the fixed size attention window. However, we do allow the background in the attention window. Our goal is to segment the hand from the attention image. Ever since it was first proposed by Kass et al. [98], the snake model has drawn a lot of attention. Various modifications have been proposed to segment objects from intensity images (e.g. [81, 117]). However, in general, these models need good initial position to converge. They also need relatively clean background since the external forces are defined by the gradient. In Malladi et al.’s front propagation approach [123], the initial curve can be arbitrary as long as it is either inside or outside the object. This method also suffers when the background (or object) is textured if starting from outside (or inside). In our work, we also use a dynamic deformation model, called spring network [35]. The difference is that we first reconstruct the input fovea image. This reconstruction is based on the learning and has the advantage to preserve the object of interest while blurring the background. Like all the other deformable models, the spring network model has quite a few parameters that need to be tuned. Tuning these parameters is not easy work and these parameters usually depend on the input. Our solution to this problem again resorts to the learning. We first try to get the best results from the spring network, then we go back to the learned examples to find the best match. ...n.‘. ‘1. -. a -.1--.“ ‘_~. 118 Finally, we apply the spring network model again using the best match as the starting curve to get the segmentation result. 6. 1 Learning Given a set of training attention images of hands with black background, we first derive an eigen-subspace using Karhunen-Loeve projection. 
Next, we transform the training attention images into the simulated fovea images. The definition of simulated fovea image will be given later. We store these simulated fovea images in a database. During the testing session, the database is queried to give an initial contour for deformation model. 6.1.1 Karhunen-Loeve projection Let a training attention image f of m rows and 12 columns be an (mn)—dimensional vector. For example, the set of image pixels {f(i,j) I 0 S i < m,0 S j < n} can be written as a vector V = (’01,?)2, . - - ,vd) where "om”,- = f(i,j) and d = mn. Note that although pixels in an image are lined up to form a 1-D vector V this way, 2-D neighborhood information between pixels will be characterized by the scatter matrix of V to be discussed later. Typically an image space is very large. For a moderate 128 x 128-pixel image, the dimension is d = rc = 16,384. The Karhunen—Loeve projection [121] is a very efficient way to represent a small subspace in a high-dimensional space. It reduces the dimension of representation from d in S' to a much lower dimension for 5’ yet still -..,. -. ‘aix 1:— --.~ . h4-’4 119 keeps most information in the data. A d—dimensional random vector X can be expanded exactly by d orthonormal vectors, v1,v2, - . - ,vd, so that d X = Zygv, = VY (6.1) i=1 where V is an orthogonal d x of square matrix consisting of orthonormal column vectors vi. Without loss of generality, we can assume that the mean of the random vector X is a zero vector, since we can always redefine X —- EX as the vector to consider. If the set of samples of a class typically occupy a very small portion of the entire d- dimensional space, we can expect that a relatively small number of vectors (or called features) v,- is suflicient to expand the space of the class. Suppose we use m vectors (features), each corresponds to a component in Y. The approximate representation is X(m) = 2:11 ygvi. It has been proved [121] that the best unit vectors v1,v2, - ~ - ,vm that minimize 62(m) = 153ll53'i(m)ll2 = EIIX - X(m)||2 (6-2) are the m unit eigenvectors of the covariance matrix EX of X, associated with the m largest eigenvalues. Let V consists of these m vectors {v,-} as column vectors. Then Y = VT(X — Mx), where Mx is the mean vector of X, is a projection from a d—dimensional space to a lower m-dimensional space. It is called the Karhunen-Loeve projection. We can choose m so that the ratio r = 21;", +1 /\,°/ 2;] A, is smaller than a given percentage (e.g., 5%). We call these m vectors {vi} the most expressive T...» n» A? q .. ---4 f -' I.- . .._—a 120 features (MEF) in that they best describe the sample population in the sense of linear transform. Fig. 6.1 gives a 2D illustration of Karhunen-Loeve projection to select the MEFs. yl MEF 2 MEF 1 F x Figure 6.1: A 2D illustration of Karhunen—Loeve projection. If we are given It discrete training samples, X1, X2, - - - ,Xk, Ex is approximated by the corresponding scatter matrix: I: S = 2(Xi — M)(X.- — M)‘ = UU‘ (6.3) :=l where U = [U1,U2, . - . ,Uk], and U,- = x.- — M, M =(1/k) 1;, x,. In our case, typically k < n and thus, 5 is degenerate. We can find the eigenvectors and eigenvalues of k X 11: matrix U tU , which has the same non-zero eigenvalue as = U U f. If w,- is an eigenvector of U tU associated with the eigenvalue /\,-, then B... .- A ll 1 , 121 v; = U w.- is the eigenvector of S = U U ‘ with the same eigenvalue. Therefore, we just need to compute the eigenvectors and eigenvalues of a much smaller k x k matrix U tU . 
For some earlier works on using MEFs for recognition-related problems, the reader is referred to Turk & Pentland 1991 [180] and Murase & Nayar 1994 [136].

6.1.2 Simulated fovea image

It is well known that the resolution acuity of the human fovea varies with eccentricity [115]; the center has the highest resolution. In our case, we know that the hand is centered in the attention image, so it is reasonable to put more weight on the central pixels than on the peripheral ones. Here, we simulate the human acuity by transforming the attention image as follows. Let the size of the attention window be $2n \times 2n$ and the center of the attention window be at row $r_c$ and column $c_c$. After the transformation, the intensity value $I'_{i,j}$ of the pixel at row i and column j is

$$I'_{i,j} = \begin{cases} (1 - d/n)\, I_{i,j} & \text{if } d \le n \\ 0 & \text{otherwise} \end{cases} \qquad (6.4)$$

where $d = \sqrt{(i - r_c)^2 + (j - c_c)^2}$ and $I_{i,j}$ is the original intensity. The new image is called the simulated fovea image (SFI). The simulated fovea images from the training set are stored in a database for the initial contour query during testing.

6.2 Segmentation

During the segmentation stage, an object is presented in an attention image with a complex background and the object is centered in the attention image. Our segmentation scheme includes the following steps:

1. Reconstruct the input attention image based on the learned feature values in the eigenspace.
2. Generate the mask from the reconstructed image using the dynamic spring network model.
3. Apply the mask to the input attention image for segmentation.
4. Transform the result of the previous step into an SFI using equation (6.4) and project the SFI into the eigenspace.
5. Find the nearest neighbor among the training samples as the recognized model.
6. Apply the spring network model again using the contour of the nearest neighbor as the starting curve.

In the above steps, steps 3, 4, and 5 are straightforward, and step 6 simply applies the model again. Next, we focus on steps 1 and 2.

6.2.1 Reconstruction

Given an image f with background and a set of eigenvectors $v_1, v_2, \ldots, v_m$, we first obtain the projection coefficients $e_i = v_i \cdot f$, where $\cdot$ is the dot product of two vectors. Then we reconstruct the image using

$$f' = \sum_{i=1}^{m} e_i v_i. \qquad (6.5)$$

The image f can be decomposed into two parts: the object of interest with zero background, $f_o$, and the background occluded by the object, $f_b$. Thus, we have $f = f_o + f_b$. Since our back-projection is linear, we have

$$f' = f'_o + f'_b. \qquad (6.6)$$

If we choose m so that after the projection the mean-square error is less than 5%, then statistically $f'_o$ has less than 5% loss compared with $f_o$, if $f_o$ happens to be a training sample. The loss for $f_b$ is expected to be more severe, since the background information was never used in the training stage. If $f_o$ is not a training sample but the training includes some similar objects that can represent this class of objects, the reconstruction is still expected to be effective. Fig. 6.2 illustrates the reconstruction.

Of course, the degree of background reduction depends on the training samples. If we train the system to segment and recognize objects that have similar appearance, the reduction will be more obvious. This kind of background reduction is essential to the success of our dynamic deformation model. If the background has texture, we can largely remove it due to the information loss in the reconstruction.

Figure 6.2: An illustration of the reconstruction.
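To make these two steps concrete, the sketch below implements the SFI weighting of Eq. (6.4) and an eigen-subspace reconstruction in the spirit of Eqs. (6.5)-(6.6), with the sample mean handled explicitly; the function names are our own and this is not the thesis software.

```python
import numpy as np

def simulated_fovea(image):
    """Eq. (6.4): weight each pixel by (1 - d/n), where d is its distance to the
    window center; pixels outside the fovea disk are set to zero."""
    rows, cols = image.shape
    n = rows // 2                                   # attention window is 2n x 2n
    rc, cc = rows / 2.0, cols / 2.0
    i, j = np.mgrid[0:rows, 0:cols]
    d = np.sqrt((i - rc) ** 2 + (j - cc) ** 2)
    w = np.clip(1.0 - d / n, 0.0, None)
    return image * w

def reconstruct(image, mean, V):
    """Project the attention image onto the learned eigenvectors (columns of V)
    and back-project (mean handled explicitly).  The learned object part is
    largely preserved while the unlearned background is blurred away."""
    f = image.reshape(-1).astype(float)
    e = V.T @ (f - mean)                            # projection coefficients e_i
    f_prime = mean + V @ e                          # back-projection
    return f_prime.reshape(image.shape)
```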
6.2.2 Dynamic Deformation

In this stage, our input is f'. We want to single out the object of interest from the reconstructed image f'. First, we apply the Canny edge operator to find the edge map of f'. The edge map includes the boundary edges of the object, and the number of edges outside the object boundary is minimal, since f' preserves most of the information of the object and at the same time reduces the background, provided that this type of object has been presented in the training session.

Second, we apply our dynamic deformation model to obtain the mask. The pixels on edges act as attractors that pull a deformable spring contour, which is closed in shape and is presumably made of an elastic material. This type of model has been widely used (e.g., [143, 174]). In our case, we use a two-stage deformation mechanism to suppress undesirable effects appearing in the course of model construction [35]. These effects are oversmoothing and local concentration of the deformable model. In the first stage of global deformation, globally convex portions of the edge map are recovered. More specifically, the spring network is restricted to contract down to the convex hull of the given edge pixels. In the second stage, we deform the convex hull to fit the data.

Assume that the vertices of the polygon are interconnected with imaginary springs. The dynamic behavior of this spring network can be described using the equilibrium of forces. Let the spring polygon be SP = (V, E), where $V = \{x_i \mid i = 1, \ldots, n\}$ is the set of vertices and $E = \{e_i \mid i = 1, \ldots, m\}$ is the set of links. Assume that all nodes have the same mass m and all springs have the same stiffness k. The motion equation for a single vertex is

$$m \frac{d^2 x_i}{dt^2} + k \frac{d x_i}{dt} + g_i = f_i. \qquad (6.7)$$

Here $g_i$ is the internal spring force. Define the adjacent set of a node $x_i$ as the set of nodes connected to $x_i$. The spring force applied to a node is the net force summarizing the individual forces of the springs connecting the node to its adjacent nodes. Let c be a constant coefficient; then

$$g_i = \sum_{x_j \in \mathrm{adj}(x_i)} c\, e^{|x_j - x_i|} (x_j - x_i). \qquad (6.8)$$

The $f_i$ is the external force. In our case, the external force is defined as the gravitational force between the spring nodes and the sampled points. Let y be an edge pixel and S be the collection of all the edge pixels; then

$$f_i = \sum_{y \in S} G \,\frac{m^2}{|y - x_i|^2} \,\frac{y - x_i}{|y - x_i|}, \qquad (6.9)$$

where G is the gravitational coefficient. The explicit Euler time-integration procedure can be used to solve equation (6.7) [151]. Specifically,

$$x_i^{t+\Delta t} = x_i^t + \Delta t\, v_i^{t+\Delta t}, \qquad v_i^{t+\Delta t} = v_i^t + \Delta t\, a_i^{t+\Delta t}, \qquad a_i^{t+\Delta t} = \hat{f}_i^{\,t+\Delta t}/m, \qquad \hat{f}_i = f_i - k v_i - g_i, \qquad (6.10)$$

where $\Delta t$ is the time step between iterations, $a_i = dv_i/dt$ is the acceleration, and $v_i = dx_i/dt$ is the velocity. At each iteration, the forces, accelerations, and velocities are evaluated. The model becomes stable when $a_i \approx 0$.

The final polygon is used as a mask to mark off the image f. The masking is simple: we assign zero intensity to the pixels outside the polygon. The result of the masking is transformed into an SFI and then projected into the eigenspace. Finally, we find a nearest neighbor in the space decomposition tree. This nearest neighbor is our segmentation result.

6.3 Experiments

In our experiments, we have applied the above approach to the task of segmentation and recognition of hands from fovea images.

6.3.1 Training

The system has been trained with 25 classes of different hand shapes. These hand shapes are illustrated in Fig. 6.3.
For each hand shape, five or six training samples were used. In the training session, these samples were manually segmented.

Figure 6.3: Twenty-five different hand shapes used in the experiments.

There are two steps in the training. First, we derived the eigenvectors of the training samples. In the current implementation, we keep m eigenvectors so that after the projection the mean-square error is less than 5%. Second, we obtained the SFIs based on equation (6.4). Fig. 6.4 shows a few SFIs of the training samples.

Figure 6.4: The SFIs of the training samples.

6.3.2 Testing

We have conducted two types of tests. In the first type of test, we used the training images; however, during testing we do not perform any manual segmentation, so this type of testing image combines the hands presented in the training with backgrounds. In the second type of test, a new set of fovea images was used. The segmentation routine for a test image f is outlined as follows.

Outline of the segmentation algorithm:

begin
  1) Reconstruct f using the m eigenvectors.
  2) Build the mask using the dynamic model.
  3) Apply the mask to f to obtain the segmented image.
  4) Compute the SFI of the segmented image.
  5) Reconstruct the SFI using m' eigenvectors.
  6) Search for the nearest neighbor.
  7) Adjust the curve using the contour of the nearest neighbor as the initial guess.
end

In the dynamic model, there is a tradeoff between shape fidelity and time complexity when choosing the number of nodes of a spring network. It is natural to expect that an object with a complicated shape and large size requires a network with a large number of nodes. Currently, we use an adaptive-size model: if two nodes become too close or too far apart, we delete old vertices or add new vertices. We also decrease the time step monotonically with the number of iterations l. Empirically, we define $\Delta t(l) = \alpha l^{-\beta}$, where $\alpha$ and $\beta$ are positive constants.

For the first type of test, if the nearest neighbor is right, we have a perfect segmentation. The number of images used in the first type of test is 135; the results show correct segmentation for 131 of them. The misclassifications are due to backgrounds that were not presented in the training. The correct rate is 97%. For the test using the new set of fovea images, we classify the result of a segmentation as correct if the hand shape of the nearest neighbor is similar to the one in the input fovea image. Out of 115 testing images, we correctly classified 107 of them. The correct rate is 93%. The testing results are summarized in Table 6.1. We show the results of 9 testing images from the second type of test in Fig. 6.5.

Table 6.1: Summary of the segmentation results

    Test No.   Number of images   Correct rate
    1          135                97%
    2          115                93%

6.4 Conclusions

A learning-based segmentation scheme is presented in this chapter. During the training, we use the Karhunen-Loeve projection of the training set to obtain a set of eigenvectors. The eigenvectors are used to reconstruct the test attention image. A dynamic spring network model is used to generate the proper mask. The system was tested on segmenting hands from fovea images. The experimental results show a 97% correct rate for the hands presented in the training and a 93% correct rate for the hands that were not used in the training phase.

The scheme also has its drawbacks. The reconstruction is not going to work well when the attention image is dominated by the background. Imagine we have
an elongated object in the square attention image; in this case the fovea image is dominated by the background. If we try to reconstruct this fovea image, the object can be lost in the reconstructed image. The SFI is also going to fail, since it is symmetrical in all directions. We believe that in this kind of situation a single fixation cannot solve the problem. Like human vision, multiple fixations are needed. We need to examine different parts of the object first, and then assemble the individual results together to get the global picture of the object.

Figure 6.5: Segmentation results for fovea images that were not used in the training. (a) Input fovea images. (b) The results of the reconstruction (we also blur the results of the reconstruction). (c) The results of applying the masks to the original input fovea images; the masks were derived by applying the dynamic spring network model to the reconstructed images. (d) The SFIs of the images in (c). (e) The contours of the nearest neighbors superimposed onto the input images. (f) Segmentation results.

Chapter 7

Hand Segmentation Using a Prediction-and-Verification Scheme

In Chapter 6, we presented an eigen-subspace learning method to segment hands from attention images. The approach was motivated by the fixation mechanism of human vision. A fixation is the time when the human selects and examines objects with the fovea [170]. In that approach, the object was assumed to be positioned in a rectangular attention image together with the background. The attention image first went through a reconstruction based on features. Then, the reconstructed image was used to predict the segmentation result. Such a reconstruction is based on learning and can reduce the background interference to a certain degree. This type of reconstruction is motivated by studies in psychology: it is well known that in the retrieval stage, humans may make some kind of reconstructive changes based on past knowledge [161]. However, the reconstruction is not able to fully eliminate the background interference.

Figure 7.1: An illustration of two levels of fixation on an input hand image: the input image, the attention window of the first-level fixation, and the attention windows of the second-level fixations.

One attention window from a single fixation cannot solve the segmentation problem completely. Similar to human vision, multiple fixations are needed. This kind of multiple fixation has a hierarchical structure. As shown in Fig. 7.1, the first-level fixation concentrates on the entire hand, while the next level of fixation takes care of different parts of the hand. The attention window of the first-level fixation usually contains part of the background. But as we continue zooming in on the object through different fixations, the attention windows focus on different parts of the object. One important feature of these attention windows is that they typically contain much less background than the attention window of the first-level fixation. These attention images from multiple fixations can be used as important visual cues to segment the object of interest from the input image.

In this chapter, we present a new approach that efficiently utilizes the attention images obtained from multiple fixations through a prediction-and-verification scheme to perform the task of hand segmentation.
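As a rough illustration of how such multi-fixation attention images can be generated, the sketch below crops sub-windows at a given scale and center from the first-level attention window and rescales each to a fixed size. The particular scales, centers, and function names here are our own assumptions and not the exact fixation policy of the system.

```python
import numpy as np

def attention_window(image, center, scale, out_size=32):
    """Extract a square fixation window of relative size `scale`, centered at
    `center` (row, col), and rescale it to out_size x out_size by nearest-
    neighbor sampling (a simple stand-in for a proper antialiasing scaler)."""
    rows, cols = image.shape
    half = int(0.5 * scale * min(rows, cols))
    r0 = int(np.clip(center[0] - half, 0, rows - 1))
    c0 = int(np.clip(center[1] - half, 0, cols - 1))
    r1 = int(np.clip(center[0] + half, r0 + 1, rows))
    c1 = int(np.clip(center[1] + half, c0 + 1, cols))
    patch = image[r0:r1, c0:c1]
    ri = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    ci = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(ri, ci)]

def multi_fixation_windows(image):
    """First-level fixation on the whole window, then second-level fixations
    on its quadrants, mimicking the hierarchy of Fig. 7.1."""
    rows, cols = image.shape
    centers = [(rows // 2, cols // 2),
               (rows // 4, cols // 4), (rows // 4, 3 * cols // 4),
               (3 * rows // 4, cols // 4), (3 * rows // 4, 3 * cols // 4)]
    scales = [1.0] + [0.5] * 4
    return [attention_window(image, c, s) for c, s in zip(centers, scales)]
```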
A general object segmentation system accepts an input image I and an intention signal P, which specifies the type of object that it is looking for, and outputs the segmentation result C = S(I, P). To check the validity of the segmentation result, we need a verifier f. In order to build such a segmentation system, we need to answer two questions: 1) how to find C; 2) how to construct f.

We present a prediction-and-verification scheme to answer the above two questions in this chapter. First, we introduce the concept of valid segmentation and provide a criterion to evaluate whether a segmentation is valid or not. Secondly, we develop a systematic approach to predict the valid segmentation using attention images of multiple fixations as visual cues. The prediction is based on the nearest neighbor decision rule. It has been shown that the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error [43]. Unlike the Bayesian approach, the nearest neighbor decision rule is independent of the underlying joint distribution on the sample points. In practice, the joint distribution is unknown, and in many cases a normal distribution is assumed. The assumption may be invalid and could lead to poor performance.

A hierarchical quasi-Voronoi tessellation is presented to organize the training samples, and we propose an efficient algorithm to query the nearest neighbor in high-dimensional space. Thus, unlike the exhaustive search method or other stochastic search approaches such as simulated annealing, our search for a valid solution is guided by prediction using the past knowledge and is significantly more efficient.

The outline of the chapter is as follows. We dedicate the next sections to the major components of this chapter, namely, verification and prediction. In Section 7.3, we present the experimental results.

7.1 Valid Segmentation

In this section, we define the verifier f mentioned in the previous section. Let the segmentation result of an intensity image I with r pixel rows and c pixel columns be represented by a vector C in (rc)-dimensional space, where C[i x c + j] = 1 if pixel (i, j) in I belongs to the object, and otherwise C[i x c + j] = 0. Let P = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be the set of pixels in I such that C[x_i x c + y_i] = 1. We denote x_min = min_{i=1..n} x_i, x_max = max_{i=1..n} x_i, y_min = min_{i=1..n} y_i, and y_max = max_{i=1..n} y_i.

Definition 1 An extractor E extracts a subimage I' with s rows and t columns from an image I based on the segmentation result C, such that I'[i, j] = I[i + x_min, j + y_min] if C[(i + x_min) x c + j + y_min] = 1, and otherwise I'[i, j] = 0, for all 0 ≤ i < s and 0 ≤ j < t, where s = x_max - x_min + 1 and t = y_max - y_min + 1.

Intuitively speaking, an extractor extracts a subimage from an image I according to the segmentation result C, and C acts like a mask which marks off the background.

Definition 2 A scaler M maps an image I' with s rows and t columns to an attention image F with m rows and n columns, such that F[i, j] = g(I', is/m, jt/n) for all 0 ≤ i < m and 0 ≤ j < n, where g is an appropriate antialiasing function which reduces the sampling effect when mapping one digital image to another.

Using an extractor and a scaler, we can construct an attention image from the result of the segmentation. This entire process is illustrated in Fig. 7.2.

Definition 3 A verifier f is defined such that f(F, P) = 1 if the segmentation result is correct, and otherwise f(F, P) = 0.
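Definitions 1 and 2 amount to a masked bounding-box crop followed by a resampling to a fixed attention-image size. A minimal sketch is given below; the helper names are ours, and SciPy's spline interpolation stands in for the antialiasing function g.

```python
import numpy as np
from scipy.ndimage import zoom

def extract(image, mask):
    """Extractor E (Definition 1): crop the bounding box of the segmentation
    mask C and zero out every pixel that does not belong to the object."""
    rows, cols = np.nonzero(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    return np.where(mask[r0:r1, c0:c1], image[r0:r1, c0:c1], 0)

def scale_to_attention(subimage, m=64, n=64):
    """Scaler M (Definition 2): map the s-by-t subimage I' to an m-by-n
    attention image F; zoom's interpolation plays the role of g."""
    s, t = subimage.shape
    return zoom(subimage.astype(float), (m / s, n / t), order=1)

# The attention image used by the verifier is F = M(E(I, C)):
# F = scale_to_attention(extract(I, C))
```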
Here F is an attention image based on a 136 Input image Hand window Attention image scale to attention Figure 7.2: The illustration of constructing attention images. segmentation result C such that F = M(8(I,C,P)), where I is the input image and P is the intention signal. The verifier defined in Definition 3 makes a decision based on the information presented in the attention window. The biggest advantage of using the attention window is that we can achieve size and position invariance with a fixed size attention window. The function f is extremely complex, because of the high—dimensionality of the attention image F. A challenging task is to approximate f. One common way to address this problem is to extract a particular type of feature and then design some rules to make decisions. One major difficulty of this common approach is in dealing with different appearances of the same object. It is intractable to manually define the features that can characterize the variety of appearances of objects in the real world. In this chapter, we use a learning-based approach to approximate the function f. The method assumes no restriction on the type of the object the system can handle. Therefore, it can be applied to a wide variety of objects. The approximation takes three steps: 137 1. Manually extract the object of interest from each training image. The extracted objects are mapped to attention images using the extractor and scaler. 2. Extract the most expressive features (the principle components) from the train- ing set. 3. Build an interpolation function to approximate f. In the following subsections, we discuss step 2 and 3 in detail. 7 .1.1 Karhunen-Loeve projection Definition 4 A vectorizer operator T transforms an attention image F ofr rows and c columns to a d-dimensional vector V, such that V[i X c + j] = F[i,j] for any 0Sil = 0 otherwise 7 .1 .3 Valid segmentation Based on the definition for the verifier f, we define the concept of valid segmentation. Definition 9 A segmentation result C defined on an input image I and an intention signal P is valid iff(’P(’T(M(E(I,C,P)))),P) = 1, where f is the verifier. Intuitively, a segmentation result C is valid if there is a training sample that is sufficiently close to it. 7 .2 Predication for Valid Segmentation This section investigate the first major problem presented in the introduction section, that is, how to find a valid segmentation. Our solution to this problem again resorts to learning. 7 .2.1 Overview Definition 10 An attention image F from a fixation of image I’ ofm rows and n columns with scale r S 1 and center position (s,t), where 0 S s < m and 0 S t < n, is defined as F = M(B), where M is a scaler and B is an image with m’ = r X m rows and n’ = r X n columns and 141 BIMI I'[b1+i,b2+j] if 0Sb1+i 0 we have P{IIX - XOII S 6} > 0, where P{e} denotes the probability of the event e. If S consists of a finite number of discrete points, a point X in P is positively sup- ported means that the probability of selecting X as a sample is not a zero—probability event. If S consists of infinitely many points, a point X in P is positively supported means that in any small neighborhood centered at X, the probability of selecting any point in the neighborhood is not a zero-probability event. Theorem 4 Suppose X0 is a positively supported point in S. Given any small 6 > 0, there is a positive number k0 > 0, such that as long as we independently draw I: > k0 learning set L 2 {L1, L2, . 
- - , Lk}, the probability that X0 is d-supported by L has the following property P{||Xo—L,-|I1—c. Proof of Theorem 4. X0 is positively supported, we have P{I|Xo—XII < d} > 0 = p. If we independently draw k learning samples L, the probability P{IIX0 — XI” 2 dIVX; E L} = (1 — p)k. Thus, limk_.,00 P = 0. C1 The theorem says that as training size increases, the training set pointwisely con- verges to a d—supportive set. Next two theorems show the fact that if the training set is d—supportive, we have an efficient query algorithm. ‘2‘...“ AMA: -iul 1 ___ -_ -5 4. anva." 147 Theorem 5 We have a set of d-supportive training set L = {L1,L2,-~,Ln}, a hierarchical quasi-Voronoi diagram P = {P1,P2,---,Pn} corresponding to L and a query sample X E S. Let the ith partition be P,- = {P,-,1,P,-,2,~~,P,-,n,.} and C = {C1,C2,---,Cn,} be the corresponding centers of regions in Pg. Assume C1 be the center to X such that “Cl —XII S IIC; -XII for anyi # 1. Let C2 be any other center and P1 be a boundary hyperplane between regions represented by C1 and C; as illustrated in Fig. 7.6. Then the region of C2 does not contain the nearest training sample to X if the distance between X and the hyperplane P1 is greater than (1. Proof of Theorem 5. If the distance from X to P1 is greater than (I, then according to the definition of the d, there is going to a training sample X1 in C1 such that IIX — XIII < d. Since for any samples Y in region Cg, IIX —— YII _>_ d, so the closest training sample will not be in the region of 02. D l d I..— l C I c 1 e M. n f 2 I b a m x l | ml in boundary hyperplane Figure 7.6: A 2D illustration of nearest neighbor query theorems. In order to avoid to calculate the point to hyperplane distance in a high dimen- W ’.h (..a‘ d..m“.. . ‘r..-. 148 sional space, we can use following equivalent theorem. Theorem 6 Let “C; — C2” = r, f = g, e = %— d,IIC1— XI] = a and [[02 -X|I = b as shown in Fig. 7.6. The region of C; does not contain the nearest training sample toX ifaQ—e2 0 and b2—f2 > m2 due to cos(180°-a) < 0. Therefore, we have a2 — e2 < b2 — f2. [:1 Now we are ready to present our query algorithm. Theorem 7 shows the time complexity of the query algorithm. 149 Given query sample X, a hierarchical quasi-Voronoi diagram P corresponding to begin nodesJist = root; while nodesJist ¢ nil do Pop the first node nd from nodesJist; if nd is leaf then add center Cnd to the centerJist; else for regions under the node nd do Add region which is closest X to nodesJist; Add regions which satisfy Theorem 3 to nodes-list; end for end if end while Output the center in the centerJist that minimizes IIX — GII. end begin a d—supportive learning set with 0,3, denotes to the center of the partition region PM. The next theorem proves that the average time complexity of the query algorithm is O(log n) where n is the size of the training set. Theorem 7 Assume that n training samples in L are independently drawn and the training set L is d-supportive. The hierarchical quasi-Voronoi diagram P is con- structed based on the training such that each Pm- has no more than p branches and the volume of region PM reduces by a constant factor as the i increases, Vi,j = fV,-+1,j and f > 1. Assume PM be the region Cl as shown in Fig. 7.6. We select d such that for each region PM in P, V],- C V231 where VLJ- is the volume of the region P,-,,- and V],- is the volume of the region between the hyperplane P1 and the hyperplane P2 in Fig. 7.6. Then, the above algorithm has the average time complexity O(log n) in terms of an A. a.-. 
- 1 r __-_ ...~. - ~-~.. nan... 150 finding the nearest neighbor. Proof of Theorem 7. Assume at the stage i, the query example X is in the region PM. Let PM be the C1 region as shown in Fig. 7.6. Let Vgg be the volume of the region Pm- and V,',- C Vig be the volume of the region (between P1 and P2 in Fig. 7.6) where the multiple search is needed. Thus, the average time complexity V.’. V —V.'. "J + i—fl)T(z’+ 1) + 1, (7.4) Tf’) S (19V; Vs where V5 is the volume of the entire space. Equation (7.4) means that if X E if], then may need to search maximum p branches, otherwise one branch is needed. We expand the equation (7.4) . . m (P — 1) i'+k,j m (P — 1) i’+k,j T(i) S T(i+m)l];[1(1 + Vs )+k§=_:1(1+ Vs )+1 Since Vt, = fV,+1,,-, we have V], < Vi,j = fkl/,-+k,j. Then (P— 1) ."+k,j < (P— Ill/Li VS ka5 For f > 1, it is not difficult to show that flu + ($71,151) = 0(1) k=l -' "E‘: 151 and Thus, we have Let i = 0 be the root and m be the height of the tree, we have m = 0(log n). Therefore, T(0) = 0(logn). D The query algorithm finds the nearest training sample from L using a tree struc- ture. If the training set is d-supportive, the nearest neighbor is guaranteed to be found. However, in order to achieve fast retrieval, as shown in Theorem 7, d should be selected such that for each region Pm- in P, V],- C ng- where V,”- is the volume of the region Pid- and V,’,- is the volume of the region between the hyperplane P1 and the hyperplane P2 in Fig. 7.6. This means that d can not be too large. If d is small, the assumption that the training set is d—supportive may be too strong. Another problem is that there is no way to verify whether the training set is d-supportive or not because the true distribution of samples is unknown. The next theorem shows what kind of the solution the query will obtain if the training set is not d—supportive. Definition 15 Let PM be a region in the hierarchical quasi- Voronoi diagram P and C be the center of the region. A minimum ball B(C,r) of the region PM is defined T = 3n rERa (7.5) where C is the center of the ball, r is the radius of the ball, and R is a set such that 152 for any r' E R, we have PM C B(C, r’). The minimum ball B(C,r) is centered at C and has the smallest radius for all balls which contain the region PM. Theorem 8 Let P = {P1,P2,---,Pm} be the hierarchical quasi-Voronoi diagram based on the training set L. Let the XC E PmJc be the solution found by the query algorithm and XN be the nearest neighbor of the query X in L. Then, we have r IIX — x.“ s IIX -— XNII x 3, (7.6) where r is the radius of the minimum ball of the region PmJc and d is our assumption of the training set. Proof of Theorem 8. First, since Pm is the finest partition and according to the definition of the hierarchical quasi-Voronoi diagram, there exists a ij, such that XC E ijc. Next, since XC is the solution, we have X in the minimum ball of the region Pch. So, we have IIX — XCII S r. Finally, we have IIX — XNII > (1 because otherwise Xc will not be the only choice according to Theorem 5. Thus, we have r IIX — x.“ s IIX — XNII x 3. (7.7) C] Theorem 8 gives us some insights to select d. There is a trade off between speed and accuracy. In practice, given a training set L = {11,12, - - - , In}, we choose the d as 153 follows: zeizenm—un n—l d , (7.8) where I: is the nearest neighbor of l,- in L. Since the region PmJ, in Theorem 8 only has one training sample, Xc, we can expect that r is comparable to d. 
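Equation (7.8) chooses d from the training data itself: it is essentially the average distance from each training sample to its nearest neighbor in L (the printed denominator is n - 1). A brute-force sketch of that computation, under our own naming:

```python
import numpy as np

def choose_d(samples):
    """Estimate d as the average nearest-neighbor distance within the
    training set, in the spirit of equation (7.8)."""
    X = np.asarray(samples, dtype=float)
    n = len(X)
    total = 0.0
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf              # exclude the sample itself
        total += dist.min()
    return total / (n - 1)
```

A larger d makes the d-supportive assumption easier to satisfy but weakens the pruning in the query, which is the speed/accuracy trade-off described above.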
This means that a reasonable neighbor is found, although it is not guaranteed to be the nearest one.

7.3 Experiments

We have applied our segmentation scheme to the task of hand segmentation in the experiments.

Figure 7.7: A representative subset of hand shapes used in the experiment.

7.3.1 Training

Two types of training were conducted in the experiments. The first type of training is to get the approximation of the verifier f, which is used later to check the validity of the segmentation. For each gesture, a number of training samples (between 27 and 36) were used to obtain the approximation of the verifier f for that gesture. Given a set of training samples L = {l_1, l_2, ..., l_n} for gesture k, we empirically determined the damping factor σ in the interpolation function as follows:

σ = 0.2 Σ_{i=1}^{n-1} ||X_i - X_{i+1}|| / (n - 1),   (7.9)

where X_i = P(T(l_i)), with T the vectorizer operator and P the projection operator.

The second type of training was to generate the attention images from multiple fixations of the training samples. In the current implementation, the selection of the fixations is mechanical. In total, 19 fixations were used for each training sample. The scales and center positions (s, t) of these 19 fixations for an image with m rows and n columns are listed in Table 7.1. Fig. 7.8 shows the attention images of the 19 fixations from one training sample. The attention images with more than 30% background pixels presented in the attention window were discarded. The total number of remaining attention images used in the experiment is 1742.

Table 7.1: The list of fixation scales and positions.

Figure 7.8: The attention images from 19 mechanical fixations of a training sample.

7.3.2 Hand segmentation

The trained system was tested to perform the segmentation task on temporal sequences of intensity images. Each sequence represents a complete hand sign. Fig. 7.9 shows eight sample sequences.

Figure 7.9: Eight sample sequences. From top to bottom, they represent the signs "happy", "fifiltl", "nothing", "parent", "pepper", "smart", "welcome", and "yes".

Motion-based visual attention

In order to speed up the process of the segmentation, we utilize motion information to find a motion attention window. The algorithm to find the motion attention window is outlined as follows. Fig. 7.10 shows results of motion-based visual attention.

Given an image I and a neighboring image I' in the sequence.
begin
1. Get the difference image D such that D[i, j] = |I[i, j] - I'[i, j]|.
2. Threshold D.
3. Find the smallest rectangular window containing the largest connected component in D.
end

Figure 7.10: Results of motion-based attention are shown using dark rectangular windows.

Segmentation

The task of segmentation would be a lot easier if the motion attention window generated by the attention algorithm were centered at the right position and included the entire hand. Unfortunately, this is not always true. The attention algorithm can detect the rough position of a moving object, but the accuracy is not guaranteed. In Fig. 7.10, we show some of the results of motion-based attention. The dark rectangle is the smallest rectangle which contains the largest connected component of the image difference. The positions of these rectangles only give us the rough positions of the hand, and they can deviate from the desired positions.
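The motion-based attention step outlined above (difference image, threshold, largest connected component) can be sketched as follows. The threshold value and the connected-component routine are our choices for illustration, not necessarily those of the original implementation.

```python
import numpy as np
from scipy.ndimage import label, find_objects

def motion_attention_window(frame_a, frame_b, threshold=20):
    """Return (r0, r1, c0, c1): the smallest rectangle containing the largest
    connected component of the thresholded difference image D."""
    diff = np.abs(frame_a.astype(int) - frame_b.astype(int)) > threshold
    labels, num = label(diff)
    if num == 0:
        return None
    sizes = np.bincount(labels.ravel())[1:]       # pixels per component
    largest = int(np.argmax(sizes)) + 1
    sl = find_objects(labels)[largest - 1]
    return sl[0].start, sl[0].stop, sl[1].start, sl[1].stop
```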
We solve this problem by doing some limited search based on the motion attention window. In the current implementation, given a motion attention window with m rows and n columns, we try the candidates with size from (0.5m,0.5n) to (2m,2n) using step size (0.5m, 0.5n). The search stops if a valid segmentation is found. There is a trade off between training and testing. The more fixations we use in the training, the less search we need in the testing. We tested the system with 161 sequences (each has 5 images) which were not used in the training. A result was rejected if the system could not find a valid segmentation with a confidence level I. The segmentation was considered as a correct one if the correct gesture segmentation C was retrieved and placed in the right position of the 158 test image. For the case of l = 0.2, we have achieved 95% correct segmentation rate with 3% false rejection rate. Fig. 7.11 shows some segmentation results. The average computational cost for each image was 58.3 seconds on a SGI INDIGO 2 workstation. The experimental results are summarized in Table 7.2. Table 7.2: of the data correct Figure 7.11: The results of the segmentation are shown after masking off the background. 159 7 .4 Conclusions and Future Work A segmentation scheme using attention images from multiple fixations is presented in this chapter. The major advantage of this scheme is that it can handle a large number of different deformable objects presented in various complex backgrounds. The scheme is also relatively efficient since the search of the segmentation is guided by the past knowledge through a predication-and-verification scheme. In the current implementation, the fixations are generated mechanically. The number of fixations and the positions of fixations are the same regardless of the types of gestures. This is not very efficient. Some gestures may be very simple so that a few fixations are enough to recognize them. In order to achieve the optimal performance, different gestures require different positions of fixations. In the future, we plan to investigate the generation of the fixations based on learning. The previous fixations are used to guide the next action. The next action could be (a) termination of the process of generating fixation if the gesture has already been recognized; or (b) finding the appropriate position for next fixation. Chapter 8 View-Based Hand Sign Recognition from Intensity Image Sequences After segmentation, the hand in each image of a sequence is mapped to a fovea image of a standard fixed size. Segmented fovea images at different times form a standard Spatiotemporal fovea sequence, in which both temporal and spatial dimensions are normalized. The global motion information of the object of interest is placed in a global motion vector, which records the size and position information of the segmented object in the original image. This vector is necessary because once the object is segmented and mapped to a fovea sequence with a standard Spatiotemporal size, the global motion information is lost. Let a fovea image f of m rows and n columns be an (mn)-dimensional vector. For 160 161 example, the set of image pixels {f(i,j) I 0 S i < m,0 S j < n} can be written as a vector V = (v1,v2, - - - ,vd) where vm,+j = f(i,J) and d = mn. Note that although pixels in an image are lined up to form a 1-D vector V this way, 2—D neighborhood information between pixels will be characterized by the scatter matrix of V to be discussed later. 
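The pixel-lining convention just described is a row-major flattening of the fovea image; the natural reading of the definition above puts pixel (i, j) at index i*n + j. A trivial sketch under our own naming:

```python
import numpy as np

def fovea_to_vector(f):
    """Write an m-by-n fovea image as a d = m*n vector so that pixel (i, j)
    lands at index i*n + j (row-major order)."""
    m, n = f.shape
    return f.reshape(m * n)

# Example: for a 2-by-3 fovea image, pixel (1, 2) is entry 1*3 + 2 = 5.
```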
Let p be the standard temporal length and f_i be the hand fovea image corresponding to frame i. Then we create a new vector X, called the fovea vector, which is a concatenation of the hand foveas and the global motion vector G,

X = (f_1, f_2, ..., f_p, G).   (8.1)

In this chapter, we present a learning-based method to recognize hand signs from the fovea vector. An automatic hand gesture recognition system accepts an input fovea vector X and outputs the recognition result C which classifies X into one of the gestures. Thus, a recognition system can be denoted by a function f that maps elements in the space of X to elements in the space of C. Our objective of constructing a recognition system is equivalent to approximating the function f: S -> C by another function f̂: S -> C. The error of an approximation can be indicated by a certain measure of the error f̂ - f. One such measure is the mean square error

E(f̂ - f) = ∫_S (f̂(X) - f(X))^2 dF(X),

where F(X) is the probability distribution function of X in S. In other words, f̂ can differ a lot from f in parts where X never occurs without affecting this error measure. Another measure is the pointwise absolute error ||f̂(X) - f(X)|| for any point X in S', where S' ⊂ S is a subset of S that is of interest to a certain problem.

Of course, f is typically high-dimensional and highly complex. A powerful method of constructing f̂ is learning. Specifically, a series of cases is acquired as the learning data set L = {(X_i, f(X_i)) | i = 1, 2, ..., n}. Then, f̂ is constructed based on L. For notational convenience, the set of sample points in L is denoted by X(L):

X(L) = {X_i | i = 1, 2, ..., n}.   (8.2)

X(L) should be drawn from the real situation so that the underlying distribution of X(L) is as close to the real distribution as possible.

In this chapter, we compare two different approximators of the function f. The first approximator uses the nearest neighbor decision rule in the MEF space. In the second approach, we use a recursive partition tree to approximate the function f in the MDF space. The definitions of MEF and MDF are given in the following sections.

8.1 Nearest Neighbor Approximator in the MEF Space

Typically an image space is very large. The Karhunen-Loeve projection (see Chapter 6.1.1 for details) is a very efficient way to reduce a high-dimensional space to a much lower dimensional space. The basis vectors of this low-dimensional space are called the most expressive features (MEF) in that they best describe the sample population in the sense of a linear transform. Given a training set of fovea vectors L = {F_1, F_2, ..., F_n}, we first obtain the Karhunen-Loeve projection matrix V and the mean vector M_F. Then, we project each training sample F_i to a vector X_i in the MEF space, where X_i = V^t(F_i - M_F). Similarly, we also project any query sample onto the above MEF subspace.

Definition 16 Given a learning set L = {X_1, X_2, ..., X_n} in its MEF space and its corresponding labels M = {l_1, l_2, ..., l_n}, a nearest-neighbor (NN) approximator f̂ of f associated with L is defined as follows. For any query sample X, f̂(X) = l_i, where X_i is the nearest neighbor of X in L.

An NN approximator is a piecewise constant function, constant in every P_i, where P_i is a region of the Voronoi diagram based on the training set L. In Chapter 7.2.3, we presented an efficient nearest neighbor query algorithm which uses the hierarchical quasi-Voronoi diagram.
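The nearest-neighbor approximator of Definition 16 can be sketched end to end: compute the Karhunen-Loeve projection matrix V and mean M_F from the training fovea vectors, project everything into the MEF space, and label a query by its nearest training sample. Brute-force search is used here for clarity; the thesis replaces it with the hierarchical quasi-Voronoi query. The class and method names are ours.

```python
import numpy as np

class MEFNearestNeighbor:
    """Nearest-neighbor approximator in the MEF space (Definition 16)."""

    def __init__(self, n_mef):
        self.n_mef = n_mef

    def fit(self, fovea_vectors, labels):
        F = np.asarray(fovea_vectors, dtype=float)
        self.mean_ = F.mean(axis=0)
        _, _, vt = np.linalg.svd(F - self.mean_, full_matrices=False)
        self.V_ = vt[:self.n_mef].T                  # columns are the MEF's
        self.X_ = (F - self.mean_) @ self.V_         # training samples in MEF space
        self.labels_ = list(labels)
        return self

    def predict(self, fovea_vector):
        x = (np.asarray(fovea_vector, dtype=float) - self.mean_) @ self.V_
        i = int(np.argmin(np.linalg.norm(self.X_ - x, axis=1)))
        return self.labels_[i]
```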
8.2 Approximation Using a Recursive Partition Tree in the MDF Space

The MEF's are, in general, not the best features for classification, because the features that describe some major variations in the class are typically irrelevant to how the subclasses are divided, as illustrated in Fig. 8.1.

Figure 8.1: A 2D illustration of the most discriminating features (MDF). The MDF is the projection along z_1. The MEF along y_1 cannot separate the two subclasses.

8.2.1 The Most Discriminating Features (MDF)

In this chapter, multiclass, multivariate discriminant analysis [197] is used to select the MDF's. It is a generalization of Fisher's linear discriminant [57]. Suppose samples of Y are m-dimensional random vectors from c classes. The ith class has a probability p_i, a mean vector m_i and a scatter matrix Σ_i. The within-class scatter matrix is defined by

S_w = Σ_{i=1}^{c} p_i E{(Y - m_i)(Y - m_i)^t | ω_i} = Σ_{i=1}^{c} p_i Σ_i.   (8.3)

The between-class scatter matrix is

S_b = Σ_{i=1}^{c} p_i (m_i - m)(m_i - m)^t,   (8.4)

where the grand mean m is defined as m = EY = Σ_{i=1}^{c} p_i m_i. The mixture scatter matrix is the covariance matrix of all the samples regardless of their class assignments:

S_m = E{(Y - m)(Y - m)^t} = S_w + S_b.   (8.5)

Suppose we use k-dimensional linear features Z = W^t Y, where W is an m x k rectangular matrix whose column vectors are linearly independent. The above mapping represents a linear projection from the m-dimensional space to the k-dimensional space. The samples Y_1, Y_2, ..., Y_n project to a corresponding set of samples Z_1, Z_2, ..., Z_n whose within-class and between-class scatter matrices are S_{Zw} and S_{Zb}, respectively. It is a straightforward matter to show that

S_{Zw} = W^t S_w W   (8.6)

S_{Zb} = W^t S_b W.   (8.7)

These equations show how the within-class and between-class scatter matrices are transformed by the projection to the lower dimensional space. What we seek is a transformation matrix W that maximizes, in some sense, the ratio of the between-class scatter to the within-class scatter. A simple scalar measure of scatter is the determinant of the scatter matrix. The determinant is the product of the eigenvalues, and hence is the product of the "variances" in the principal directions, thereby measuring the square of the hyperellipsoidal scattering volume. Using this measure, we obtain the criterion function to maximize:

J(W) = |W^t S_b W| / |W^t S_w W|.   (8.8)

It can be proved [197] that the columns of the optimal W that maximizes the above function are the generalized eigenvectors that correspond to the largest eigenvalues in

S_b w_i = λ_i S_w w_i.   (8.9)

In order to avoid computing the inverse of S_w, we can find the eigenvalues as the roots of the characteristic polynomial

|S_b - λ_i S_w| = 0   (8.10)

and then solve

(S_b - λ_i S_w) w_i = 0   (8.11)

directly for the eigenvectors w_i. Since the rank of S_b is at most c - 1, we know that at most c - 1 features {w_i} are needed, and we call these features the most discriminating features (MDF's).

8.2.2 Curse of dimensionality and the DKL projection

The discriminant analysis procedure breaks down when the within-class scatter matrix S_w becomes degenerate, which is our case due to the high dimension of the input image and the much smaller number of training samples. Weng [193] proposed the DKL projection (short for Discriminant Karhunen-Loeve projection). In the DKL projection, the discriminant analysis is carried out in the space of the Karhunen-Loeve projection (the MEF space), where the degeneracy typically does not occur.

8.2.3 Recursive partition tree

For a large number of classes, the overall feature set may not be best for specific pairs of classes.
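Before turning to that hierarchical alternative, the MDF construction of Sections 8.2.1 and 8.2.2 can be summarized in code: project to the MEF space first (the DKL step, which keeps S_w nondegenerate), form the scatter matrices of equations (8.3) and (8.4) there, and solve the generalized eigenproblem of equation (8.9). This is a sketch under our own naming; SciPy's generalized symmetric eigensolver stands in for the characteristic-polynomial route of equations (8.10)-(8.11), and at least two samples per class are assumed.

```python
import numpy as np
from scipy.linalg import eigh

def dkl_projection(samples, labels, n_mef, n_mdf):
    """Return (mean, V, W): V projects to the MEF space, W to the MDF space."""
    X = np.asarray(samples, dtype=float)
    y = np.asarray(labels)
    mean = X.mean(axis=0)
    # MEF step: Karhunen-Loeve projection of the raw samples.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    V = vt[:n_mef].T
    Y = (X - mean) @ V
    # Within-class and between-class scatter in the MEF space (eqs. 8.3, 8.4).
    grand = Y.mean(axis=0)
    Sw = np.zeros((n_mef, n_mef))
    Sb = np.zeros((n_mef, n_mef))
    for c in np.unique(y):
        Yc = Y[y == c]
        p = len(Yc) / len(Y)
        mc = Yc.mean(axis=0)
        Sw += p * np.cov(Yc, rowvar=False, bias=True)
        Sb += p * np.outer(mc - grand, mc - grand)
    # Generalized eigenproblem Sb w = lambda Sw w (eq. 8.9); keep the
    # eigenvectors with the largest eigenvalues (at most c - 1 are useful).
    vals, vecs = eigh(Sb, Sw)
    order = np.argsort(vals)[::-1]
    W = vecs[:, order[:n_mdf]]
    return mean, V, W
```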
An alternative classification scheme is the hierarchical classifier, where the most obvious discriminations are done first, postponing the more subtle distinctions to a later stage [135, 163]. In this chapter, we present a recursive partition tree approximator in the MDF space. Definition 17 Given a training set of fovea vectors L = {F1,F2,---,Fn}, a recursive partition tree participates the space S as follows. Given a partition P,- = {P,°,1,P,-,2,---,P,-,,,,} of S at level i, the partition at level i + 1 P,“ = {P,-+1,1, P;+1,2, - - - , P,+1,n,+,} is afiner partition of P,- such that for each PM E P,, PM contains either PM itself or P,+1,k, P,-+1,k+1, - - - , P,-+1,k+b so that PM = U2”, P,+1,k+m. 168 A hierarchical Voronoi diagram is hierarchical partition that satisfies an additional condition: Every cell is further partitioned at a deeper level only by a Voronoi di- agram. The Voronoi diagram in each cell is determined by a few training samples within the cell. In the current implementation, we use an adaptive radius r to split the cell. Once r is determined, we have k (k > 1) training samples and the distance between each pair is greater than r. These k samples are the centers to generate a Voronoi diagram and then we move on to the next level. The graphic description in Fig. 8.2 gives an simplified but intuitive explanation of the hierarchical Voronoi diagram. In Fig. 8.2, the partition under the root is created by three samples indi- cated by circles. Two additional samples indicated by rectangulars are used to get the partition of the next level. The partition of a region ends when the samples under the region are from the same class. Due to the complexity of the problem, the overall MDF feature set may not be best for specific pairs of classes. In our implementation, the MDF’s are computed locally. For each subregion PM, we obtain DKL projection matrices VM and WM and mean vector MM based on the training samples within PM, where VM is the projection matrix to the MEF space and WM is the projection matrix to the MDF space as defined previously. The leaves of the partition tree correspond to the regions which contain the training samples from a single class. The approximator uses the following decision rule to classify the query fovea vector X to the class of a leaf cell. Definition 18 Given a training set of fovea vectors L = {F1,F2, - - . , Fn} and cor- responding recursive partition tree, for any query fovea vector X, if the current level 169 class 1 class 3 class 2 class 4 class 2 (b) Figure 8.2: A 2-D illustration of a hierarchical Voronoi diagram and the corresponding recursive partition tree. (a) The partition, where the circles and rectangulars indicate the samples used to create the Voronoi diagram. (b) The corresponding recursive partition tree. 170 is not a leaf, the recursive partition tree approximator (RPTA) selects the cell with center C,- iffor any other cell with center 03-, we have Rd(X,C.-) < Rd(X,C,-). If the current level is a leaf node, RPTA designates the label of the leaf to the query X. Since each local cell has its own DKL projection, in order to logically compare between two different cells, we use a measurement called Mixture Distance (Rd). Definition 19 Let C be the center of the region P, V be the projection matrix to the MEF space and W be the projection matrix to the MDF space. The Mixture Distance (MD) from a query X fovea vector of the center G is defined as follows. 
R_d(X, C) = √( ||X - VV^tX||^2 + ||VWW^tV^tC - VWW^tV^tX||^2 ).

Intuitively, what is being measured can be seen in Fig. 8.3. In Fig. 8.3, the original image space is a 3D space, the MEF space is a 2D subspace, and the MDF space is a 1D subspace, since the two classes are well separated along the first MDF vector. The first term under the radical indicates the distance of the original vector from the population, which indicates how well the MEF subspace represents the query vector X. This term is necessary since it is entirely possible that a query vector that is miles away from a particular subregion's MEF subspace would project very near to the region's center. The second term indicates the distance between the MDF components of the query vector and the MDF components of the center vector in the original image space.

Figure 8.3: Illustration of the components in the Mixture Distance in a 3D original space.

8.3 Convergence of the Approximators

An important issue to study here is how well the above approximators can approximate a function f. The answer is closely related to the way samples are generated for the learning set L. In [196], Weng and Chen have shown that the nearest neighbor approximator approaches f pointwise in probability. In this section, we are going to show that the same result also holds for the recursive partition tree approximator.

Due to the high complexity and the undetermined nature of the way in which a learning set L is drawn from the real world, it is effective to consider that X(L), the set of samples in S, is generated randomly. We know that a fixed L is a special case of a random L in that the probability distribution is concentrated at a single location. Thus, we consider X in X(L) as a random sample from S.
Proof of Theorem 9. Let X’ be the nearest neighbor of the query X in the training set ()le X ( L5). We show that as long as we independently draw the training samples, the probability that X traverses the recursive partition tree exactly same as X’ approaches 1 as n —) oo. Assume at the subregion PM, we have the projection matrices V and W. The Mixture distance from X to a center C under region PM is R3(x, C) = ”x — vv‘xn2 + IIVWW‘V‘X — vww‘v‘on2 Using the triangle inequality property of the Mixture distance, we have Rifx. 0) S ||X - X'll2 + ”X — VVtX'll2 + llVVt(X — X')||2 + Ilvwwtv‘x' — VWW‘V‘CII2 + ||VWW‘V‘(X — x')||2 R§(X’. 0) + IIX - X'II2 + llVV’(X - X')||2 + IIVWW’V’(X - X')||2 If the sum of the last three terms (S3) of the above equation —> 0, we would have Rd(X’,C) which means that the approximator would make the same Ed (X, C) '2 decision for X and X’. Let Dn be the event that the nearest neighbor of X in the 174 training set U?:1X(L,) has a distance larger than 77 while E,- denotes the event that the the nearest neighbor of X in the single X(L,) has a distance larger than 77. Since each L, is independently and identically drawn, we have Panm} = IIP{E.-.n}~ i=1 On the other hand, X is a positively supported, which means that 1 — P{E,-,n} = fl > 0. Thus, PIDnan}=(1_fi)n which approaches zero when n —> 00. So, given any value S3 = 77 and e > 0, we can have a positive N, so that for any n > N, we have P{D,,,n} < 6. Now, we have shown that the approximator makes the same decision for X and X’ as n —~> oo 1. Finally, the fact that X is an interior point guarantees that f(X)=f(X’) asn—>oo. Theorem 9 means that the RPTA approaches f pointwisely in probability: P{f(X) 75 f(X)} —> 0, as the size of training set L increases without bound. lThis may not be true if X’ lies on the decision boundary. However, the event that X’ lies on the decision boundary is a zero probability event. 175 8.4 k Nearest Neighbors One drawback of using decision tree to do recognition is that it is unable to reject the undesirable input. In this section, we present a method to measure the confidence of the recognition result based on ls: nearest neighbors. The confidence can be used as a criteria to do rejection. Definition 22 Given a learning set L and k nearest neighbors {X1,X2,-~ - ,Xk) of a testing sample X, the confidence level ofX belonging to class c is defined as k IIX-XII 1: :m(X,,c)al-l——‘_‘+"x‘xllll. i=1 In the above definition, m(X,, c) is a membership function which takes value 1 if X, is a member of class c, otherwise it takes —1. e is a small positive number to avoid the denominator to become zero. Intuitively speaking, this confidence level is the sum of weight of each neighbor. The weight is inversely proportional to the distance between the query and the neighbor. The distance here is the Mixture Distance. The value of 0 determines how fast the weight will decrease for other runner up. A points X, at twice the distance compared to that of the nearest neighbor X1 will have its weight decreased by a factor of l/a. 8.5 Experimental Results The framework has been applied to recognize the twenty eight different signs as illustrated in the Fig. 1.2. The image sequences are obtained while subjects perform 176 hand signs in front of a video camera. The variation of hand size in images is limited. Two different lighting conditions are used. In the current implementation, each hand sign was represented by five images sampled from the video. Figure 7.9 shows several examples of these sequences. 
We first applied our segmentation scheme as discussed in Chapter 7 to segment hands from input images. Then we construct the fovea vectors as shown in Chapter 5.2. These fovea vectors were used as the input for sign recognition. The problem now is how to deal with the sequences which has some images that have been rejected by the segmentation routine. In this case, we still output those sequences because there are still good chances that they can be recognized if Only one or two images in the sequences are rejected while the rest of them are fine. The number of images used in the training is 3300 (660 sequences). The number of testing images is 805 (161 sequences). 8.5.1 Results of the nearest neighbor approximator in the MEF space We show some experimental results to indicate the performance of the nearest neigh- bor approximator in the MEF space. We computed MEF’s using 660 training se- quences. Fig. 8.4 shows top 10 MEF’S. The number of MEF’s was selected based on the variation ratio r = 2:17;, A, / Z:le A,, where m out of n MEF vectors were used, as defined in Section 6. 1 . 1 . Table 8. 1 shows the number of MEF’s corresponding to the variation ratio. Figure 8.4: Top ten MEF’s Fig. 8.5 shows the performance of the nearest neighbor approximator under the different variation ratio. The performance first improves when the ratio r increases. Then, at the point r = 0.4, the performance saturates at the recognition rate 87.0%. Fig. 8.6 shows average computation time for each sequence on a SGI INDIGO 2. The time was obtained based on the two different nearest neighbor query approaches, The variation ratio The number of MEF’s 10 1 20 2 io Tabl 6 80 48 95 178 namely, the linear search and the hierarchical quasi-Voronoi diagram in 7.2.3. The use of the hierarchical quasi-Voronoi diagramapproach dramatically improves the query time. 1-2 1 1 1 r NNin MEF _— ,, 0.9 — - E" G .9. a 06 - .. 010 O 8 °‘ 0.3 1- - 0 I l l I 0 0.2 0.4 0.6 0.8 1 Variation ratio Figure 8.5: Performance of the nearest neighbor approximator in the MEF space. The performance is given as a function of the number of MEF’s used. 8.5.2 Results of the recursive partition tree approximator in the MDF space In this experiment, the same 660 training sequences were used to build a recursive partition tree. For each nonterminal region, we selected an adaptive radius r as defined in Section 8.2.3 to split the region into subregions. Given r for a nonterminal region, we have k > 1 training samples and the distance between each pair of these 1: samples is greater than r. These Is samples were the centers to generate a Voronoi diagram. The distance here is the Euclidean distance in the MDF space corresponding 179 14 I I I I A 12 - linear search — _ 3 quasi-Voronoi diagram ---- 3 10 - _ 8 g 8 r- _ 8‘ :2 6 - _ 8. a) 4 - _ .§ 5" 2 __ .- 0 l 1 ______ E. _______ 1. ______ 0 0-2 0.4 0.6 0.8 1 Variation ratio Figure 8.6: Performance of the two different nearest neighbor query approaches: linear vs. quasi—Voronoi diagram. to the region. Fig. 8.7 shows the top 10 MDF’s at the root level. Once we have created the recursive partition tree, we used it to recognize the sign. As we did in the experiments for the nearest neighbor approximator in the MEF space, the segmentation result was used as the input for sign recognition. The results are summarized in Table 8.2. The correct recognition rate of 161 testing sequences is 93.2% which is better than the recognition rate (87.0%) of the nearest neighbor approximator in the MEF space. 
The average recognition time per sequence is 0.63 second on an SGI INDIGO 2. The time is longer than the time (0.27 seconds) of the nearest neighbor approximator when the quasi-Voronoi diagram is used in the query. This is because each nonterminal node in the recursive partition tree has its own DKL projection matrices, and each time the query vector traverses a node it has to go through the local projection, whereas in the case of the nearest neighbor approximator only one projection is necessary.

Figure 8.7: Top ten MDF's.

Table 8.2: Summary of the experimental data for RPTA
Training: number of training samples   660 (3300 images)
          height of the tree           7
          number of nodes              90
Testing:  number of testing sequences  161 (805 images)
          recognition rate             93.1% (87% for MEF)
          time per sequence (sec.)     0.63

8.5.3 k nearest neighbors

There are 16 sequences among the above 161 testing sequences in which at least two out of five images have bad segmentation results. What we want to do is to reject these sequences. We used the method of k nearest neighbors (see Chapter 8.4) to evaluate the confidence level of each recognition result. Then, we set up a threshold and reject any result which has a confidence level below the threshold. In the experiments, we used the top 5 nearest neighbors and selected 1.1 as the confidence threshold. Here, each of the top 5 nearest neighbors can vote for the result. The weight of the voting power of each neighbor is the weight given in Definition 22, where X is the test sample and X_1 is the nearest neighbor of the test sample. The sign of the weight is positive if it agrees with the vote of the nearest neighbor; otherwise, it is negative. The weight for the nearest neighbor is 1.0. The selection of the threshold is heuristic. The threshold 1.1 requires slightly more support than a single nearest neighbor. The system accepts 89% of the input sequences. Within these 89% of the input sequences, the correct recognition rate is 97%. Among the 11% (18) rejected sequences, 56% (10) have segmentation errors. The experimental results are shown in Table 8.3.

Table 8.3: Summary of the experimental data for kNN
Accept rate                                                 89%
Reject rate                                                 11%
Recognition rate among the accepted                         97%
Misclassification rate among the accepted                   3%
False reject due to segmentation error among the rejected   56%
False reject due to recognition error among the rejected    44%

8.5.4 Experiments related to MDF

We have shown that the approximator in the MDF space has better performance than the one in the MEF space. This is because the pixel-to-pixel distance, whether in the original image space or the MEF space, cannot well characterize the difference between two signs under effects such as lighting, viewing angle, and hand variation between different subjects. On the other hand, the MDF's are the features that best characterize the different categories. In this section, we show some experimental results that indicate quantitatively how the MEF and the MDF may perform very differently in classifying signs.

Clustering effects

We computed MEF's and MDF's, respectively, using 50 sequences (10 for each sign). These signs were obtained from different subjects, and the viewing positions are slightly different. Fig. 8.8 (a) shows the samples in the subspace spanned by the first two MEF's and Fig. 8.8 (b) shows them in the subspace spanned by the first two MDF's.
As clearly shown, in the MEF subspace, samples from a single class spread out widely and samples of different classes are not far apart. In fact, some samples from different classes mingle together. However, in the MDF subspace, samples of each class are clustered more tightly and samples from different classes are farther apart. This shows that the MDF's are better in terms of classification of signs.

Figure 8.8: The difference between MEF and MDF in representing samples. (a) Samples represented in the subspace spanned by the first two MEF's. (b) Samples represented in the subspace spanned by the first two MDF's. The numbers in the plot are the class labels of the samples.

Geometric meaning of the MDF

In the MDF's, factors that are not related to classification are discarded or weighted down, which is accomplished by minimizing the within-class scatter; factors that are crucial to classification are emphasized, which is achieved by maximizing the between-class scatter. In this experiment, we show an example in which the MDF's capture the important geometric features.

Figure 8.9: Two sample sequences of the signs "of course" (a) and "wrong" (b).

In our gesture vocabulary, the image sequences of two signs, "of course" and "wrong", are visually very similar. Fig. 8.9 illustrates two sample sequences of the above signs. The nearest neighbor approximator generally has difficulty distinguishing them, but the recursive partition tree approximator in the MDF space does not. Fig. 8.10 shows the difference between the MEF and the MDF. The left sequence in Fig. 8.10 is a reconstruction of the sequence "of course" based on the first MDF, and the right sequence is a reconstruction of the same sequence using 95% of the MEF's. We can see that the MEF's are good in terms of preserving the information but are not much help for classification. On the other hand, the first MDF captures the feature locations (edges) because it accounts for the major between-sign variation.

Figure 8.10: The difference between the MEF and MDF. (a) Reconstruction based on the first MDF. (b) Reconstruction based on 95% of the MEF's.

8.6 Conclusions

In this chapter, we have presented a new approach to recognize hand signs. In our approach, motion understanding (the hand movement) is tightly coupled with spatial recognition (hand shape). To achieve high applicability and adaptability to various conditions, we do not impose prior features that the system must use; rather, the system automatically selects features from the images during learning. The system uses multiclass, multidimensional discriminant analysis to automatically select the most discriminating linear features for gesture classification. The recursive partition tree approximator is proposed to do the classification. This approach, combined with our previous work on hand segmentation, forms a new framework which addresses three key aspects of hand sign interpretation, that is, the hand shape, the location, and the movement. The framework has been tested to recognize 28 different hand signs. The experimental results have shown that the system achieved a 93% recognition rate without a rejection option. The recognition rate was 97% with an 11% reject rate.
Chapter 9 Summary and Future Work This chapter summarizes the results of the research described in this thesis. Several directions for future research are also outlined. 9. 1 Summary Two pieces of work are reported in this thesis. The first one is about motion and structure estimation using image sequences. The second one is about recognition of hand signs using intensity image sequences. 9.1.1 Integration of transitory image sequences We have developed a system to estimate motion and structure from transitory image sequences. A transitory image sequence is one in which no scene element is visible through the entire sequence. When a camera system scans a scene which cannot be covered by a single view, the image sequence is transitory. We have shown that 187 188 from a transitory sequence it is inherently not possible to get better estimates with a longer sequence. The later a scene part comes into the sequence, generally the worse its global accuracy compared to that in the first view. We establish asymptotic error rates with respect to the number of frames, which indicate how the error in the esti- mates evolves with time and how to minimize the pace of error accumulation. Some concise expressions have been derived in terms of asymptotic error rate for different representations, processing methods, and image sequence types. The asymptotic er- ror rates are in fact the lowest possible error rates based on the Cramér-Rao error bound. We have proposed two different techniques for two different usages of the results: global and local(e.g., visual map generation and global pose determination for the former and obstacle avoidance and object manipulation belong for the latter). We conducted experiments with synthetic and real world images in order to ex- perimentally examine the error rates and compare the two representations (WC and CC). The performance of the algorithm can be demonstrated through simulations, where ground truth and the amount of noise can be well controlled and the errors in the estimates can be accurately measured. In the simulation, 3-D feature points were generated randomly for each trial, between depth 2000mm and depth 3000mm, with a uniform distribution. The entire scene is covered by 31 frames and the distance between the consecutive frames is roughly 200mm. A small rotation was added be- tween each pair of two consecutive frames. Average errors were obtained through 100 random trials each with a different set of 3D points. The results demonstrated that different representations have very different stabilities. In general, the world-centered (WC) coordinate system is better for a global usage and the camera-centered (CC) 189 coordinate system is superior for a local usage. In order to provide actual accuracy with a real system setup, careful experiments have been conducted with a fully calibrated camera system. The setup used for our image acquisition was a Denning MRV-3 mobile robot and a pair of stereo cameras. The stereo camera system was calibrated with distortion compensation using an algo— rithm from Weng et al [190]. An image sequence of 151 frames was acquired from the moving mobile robot. A temporally subsampled (one sample every 5 frames) subse- quence of 31 frames was used for motion and structure estimation with a consideration that this subsequence is dense enough for estimation and yet enables cross-frame mo- tions to cover more original frames with a relatively small batch size. 
The algorithm includes feature selection, stereo matching, temporal matching and tracking, 3D structure integration, and motion and pose estimation. The results have been compared with the ground truth of the test points on the scene and the pose of the camera system. As we predicted, the error increases with the time. But the estimates appear good. After traveling about 3000mm, the estimated camera global position error is less than 60mm in depth Z (less than 2.3%), about 43mm horizontally and under 25mm vertically with the WC representation. This seems to indicate that reasonable results can be obtained, even with a transitory image sequence. 9.1.2 Hand Sign Recognition In this thesis, we have presented a new general framework to learn and recognize hand signs from intensity image sequences. The framework has two major components, {aifJuth 'UMB‘ -35....) Kin-w 5’1 190 namely, segmentation and recognition. A prediction-and—verification scheme using attention images from multiple fixa- tions is presented to segment hands from complex backgrounds. During the predic- tion stage, we first apply the motion—based attention to find the rough position of the hands from the input sequences. Then, the attention images are generated around the initial position. These attention images are used to query the database of the training attention images to predict the mask. One important feature of small atten- tion images which focus on the portion of the object is that they typically contain little background. These attention images can be used as important visual cues to segment the object of interest from the input image. We present a hierarchical quasi-Voronoi tessellation to organize the training atten- tion images. Based on the hierarchical quasi-Voronoi tessellation, we propose a new efficient query algorithm for the nearest neighbor in the high dimensional space. The result of prediction is verified using a learning-based function approximation scheme. A major advantage of this scheme is that it can handle a large number of different deformable objects presented in complex backgrounds. The scheme is also relatively efficient since the segmentation is guided by the past knowledge through a prediction- and-verification scheme. In our recognition scheme, motion understanding (hand movement) is tightly cou- pled with spatial recognition (hand shape). To achieve a high applicability and adapt- ability to various conditions, we do not impose prior features that the system must use, but rather the system automatically selects features from images during learning. The system uses multiclass, multidimensional discriminant analysis to automatically 4 .S""‘ 9‘ '1. 1‘ 191 select the most discriminating features for gesture classification. A recursive partition tree approximator is proposed to do classification. The framework has been tested to recognize 28 different hand signs. The ex- perimental results show that the system can achieve a 93% recognition rate without rejection. The recognition rate is 97% with 11% reject rate. 9.2 Future Work In this section, we discuss a number of research issues which could be addressed in we P' *“ "‘ the future. 9.2.1 Integration of transitory image sequences In current experiments for the real setup, we have only tested one sequence and we realize that in order to fully test the algorithm, we need to collect more data and run the program. The current design is suitable for a passive navigation system where the navigation of the sensor is guided by the human operator. 
During a passive navigation, the sensor grabs the image sequences. Later, these images are processed in an off-line fashion to obtain the global structure of the unknown scene. The next question is whether it is possible to build an automatic navigation system. In order to build an automatic navigation system, we must have 0 Real time processing. Currently, we are unable to do this is because feature extraction and tracking are very time consuming. 192 0 Scene understanding. For a goal oriented navigation, the 3D structure infor- mation of the scene alone is not enough. The navigator has to understand the scene so that it can follow the road and find the way. 9.2.2 Hand sign recognition We have presented a three stage framework in Chapter 5. This thesis addresses the stage 2 and 3, which are segmentation and recognition. However, the stage 1 which is the image acquisition is not a trivial problem if the speed of the hand motion is not uniform, for example, person A first performs a sign from a slow pace to a fast pace and then performs the same sign changing the pace from fast to slow. Then, we have the problem of selecting the frames to represent the sign since using the same time interval between the consecutive frame could end up with two very different sequences. In that case, time warping is necessary. In the current implementation of the segmentation, the fixations and the positions of fixations are the same regardless of the types of gestures. This is not very efficient. Some gestures may be very simple so that a few fixations are enough to recognize them. In order to achieve the optimal performance, different gestures require different positions of fixations. In the future, we need to investigate the generation of the fixations based on learning. The previous fixations are used to guide the next action. The next action could be (a) termination of the process of generating fixation if the gesture has already been recognized; or (b) finding the appropriate position for next fixation. 193 Training in the current implementation is done in a batch fashion. A complete new training session is needed each time we update the training set. This can be very inefficient when the training set becomes large. In the future, we need to investigate the problem of incremental learning. Finally, we would like to give some personal views on the problem of interpretation of the sign language such as American Sign Language (ASL). Can we extend our current work to interpret ASL for the disabled? The extension faces the following obstacles: 0 Currently, we are working at the word level (sign). If we want to understand ASL, we have to understand the sentence made up by a sequence of signs. 0 Our experiments only include 28 different signs. Although 28 is a good number in the current state of the art, it is just a small fraction if we compare it with the number of the common words in ASL, which is around 5000. 0 There are signs in ASL that use both hands and signs whose meaning of depends on contextual information. Currently our system can not handle these types of signs. Based on the above discussion, we think there are still a few large gaps between our system and a future system which can interpret ASL. Bibliography [1] [2] [3] [4] [5] [6] J. R. Anderson, Cognitive Psychology and Its Implications, 3rd ed., Freeman, New York, 1990. G. Adiv, Determining three-dimensional motion and structure from optimal flow generated by several moving objects, in IEEE Trans. PAM], vol. 7, pp. 384-401, July 1985. J. 