THESIS

This is to certify that the dissertation entitled FACE DETECTION AND MODELING FOR RECOGNITION presented by Rein-Lien Hsu has been accepted towards fulfillment of the requirements for the Doctoral degree in Computer Science & Engineering.

FACE DETECTION AND MODELING FOR RECOGNITION

By

Rein-Lien Hsu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science & Engineering

2002

ABSTRACT

FACE DETECTION AND MODELING FOR RECOGNITION

By Rein-Lien Hsu

Face recognition has received substantial attention from researchers in the biometrics, computer vision, pattern recognition, and cognitive psychology communities because of the increased attention being devoted to security, man-machine communication, content-based image retrieval, and image/video coding. We have proposed two automated recognition paradigms to advance face recognition technology. Three major tasks involved in face recognition systems are: (i) face detection, (ii) face modeling, and (iii) face matching. We have developed a face detection algorithm for color images in the presence of various lighting conditions as well as complex backgrounds. Our detection method first corrects the color bias by a lighting compensation technique that automatically estimates the parameters of reference white for color correction. We overcame the difficulty of detecting low-luma and high-luma skin tones by applying a nonlinear transformation to the YCbCr color space. Our method generates face candidates based on the spatial arrangement of detected skin patches. We constructed eye, mouth, and face boundary maps to verify each face candidate. Experimental results demonstrate successful detection of faces with different sizes, colors, positions, scales, orientations, 3D poses, and expressions in several photo collections.

3D human face models augment appearance-based face recognition approaches to assist face recognition under illumination and head pose variations. For the two proposed recognition paradigms, we have designed two methods for modeling human faces based on (i) a generic 3D face model and an individual's facial measurements of shape and texture captured in the frontal view, and (ii) alignment of a semantic face graph, derived from a generic 3D face model, onto a frontal face image. Our modeling methods adapt recognition-oriented facial features of a generic model to those extracted from facial measurements in a global-to-local fashion. The first modeling method uses displacement propagation and 2.5D snakes for model alignment. The resulting 3D face model is visually similar to the true face, and proves to be quite useful for recognizing non-frontal views based on an appearance-based recognition algorithm.
The second modeling method uses interacting snakes for graph alignment. A successful interaction of snakes (associated with eyes, mouth, nose, etc.) results in appropriate component weights based on the distinctiveness and visibility of individual facial components. After alignment, facial components are transformed to a feature space and weighted for semantic face matching. The semantic face graph facilitates face matching based on selected components, and effective 3D model updating based on 2D images. The results of face matching demonstrate that the proposed model can lead to classification and visualization (e.g., the generation of cartoon faces and facial caricatures) of human faces using the derived semantic face graphs.

© Copyright 2002 by Rein-Lien Hsu. All Rights Reserved.

To my parents; my lovely wife, Pei-Jing; and my son, Alan

ACKNOWLEDGMENTS

First of all, I would like to thank all the individuals who have helped me during my Ph.D. study at Michigan State University. I would like to express my deepest gratitude to my advisor, Dr. Anil K. Jain, for his guidance in academic research and his support in daily life. He broadened my view in research areas, especially in pattern recognition and computer vision, and taught me how to focus on research problems. I will never forget his advice, "Just do it," while being caught in multiple tasks at the same time. I am grateful to my Ph.D. committee, Dr. Mohamed Abdel-Mottaleb, Dr. George Stockman, Dr. John J. Weng, and Dr. Sarat C. Dass, for their valuable ideas, suggestions, and encouragement. I would also like to thank Dr. Chaur-Chin Chen and Dr. Wey-Shiuan Hwang for their help at the beginning of my study at MSU, and Dr. Shaoyun Chen and Yonghong Li for their help in the NASA modeling project. I am very grateful to Dr. Helen Shen and Dr. Mihran Tuceryan for their numerous suggestions and discussions on model compression. Special thanks are due to Philips Research-USA for offering me summer internships in 2000 and 2001; to Dr. Mohamed Abdel-Mottaleb for his guidance and suggestions in my work on face detection; to Dr. Patrick Flynn for providing the range datasets; to Dr. Wey-Shiuan Hwang for providing his face recognition software; and to Dennis Bond for his help in creating a graphical user interface for face editing. Thanks are also due to Cathy M. Davison, Linda Moore, Starr Portice, and Beverly J. Wallace for their assistance in administrative tasks.

Special thanks to all the Prippies: Lin Hong, Aditya Vailaya, Nicolae Duta, Salil Prabhakar, Dan Gutchess, Paul Albee, Arun Ross, Anoop Namboodiri, Silviu Minut, Umut Uludag, Xiaoguang Lu, Martin Law, Miguel Figueroa-Villanueva, and Yilu Zhang for their help during my stay in the PRIP Lab in the Department of Computer Science and Engineering at MSU. I would also like to thank Michael E. Farmer for giving me an opportunity to work on human tracking research.

I would like to thank Mark H. McCullen for mentoring me to be a teaching assistant in CSE 232; Dr. Jeffrey A. Fessler at the University of Michigan, Ann Arbor, for his valuable help during my transfer to the Department of Computer Science at Michigan State University; and Dr. Yung-Nien Sun and Dr. Chin-Hsing Chen in Taiwan for their encouragement and spiritual support. Special thanks to NASA, Philips Research-USA, Eaton Corporation, and ONR (grant no. N00014-01-1-0266) for their financial support during my Ph.D. studies. Finally, but not least, I would like to thank my parents, my wife, Dr.
Pei-jing Li, and my son, Alan, for all the happiness they have shared with me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Challenges in Face Recognition
1.2 Semantic Facial Components
1.3 Face Recognition Systems
1.4 Face Detection and Recognition
1.5 Face Modeling for Recognition
1.5.1 Face Alignment Using 2.5D Snakes
1.5.2 Model Compression
1.5.3 Face Alignment Using Interacting Snakes
1.6 Face Retrieval
1.7 Outline of Dissertation
1.8 Dissertation Contributions
2 Literature Review
2.1 Face Detection
2.2 Face Recognition
2.3 Face Modeling
2.3.1 Generic Face Models
2.3.2 Snakes for Face Alignment
2.3.3 3D Model Compression
2.4 Face Retrieval
2.5 Summary
3 Face Detection
3.1 Face Detection Algorithm
3.2 Lighting Compensation and Skin Tone Detection
3.3 Localization of Facial Features
3.3.1 Eye Map
3.3.2 Mouth Map
3.3.3 Eye and Mouth Candidates
3.3.4 Face Boundary Map
3.3.5 Weight Selection for a Face Candidate
3.4 Experimental Results
3.5 Summary
4 Face Modeling
4.1 Modeling Method
4.2 Generic Face Model
4.3 Facial Measurements
4.4 Model Construction
4.5 Summary
5 Semantic Face Recognition
5.1 Semantic Face Graph as Multiple Snakes
5.2 Coarse Alignment of Semantic Face Graph
5.3 Fine Alignment of Semantic Face Graph via Interacting Snakes
5.3.1 Interacting Snakes and Energy Functional
5.3.2 Parametric Active Contours
5.3.3 Geodesic Active Contours
5.4 Semantic Face Matching
5.4.1 Component Weights and Matching Cost
5.4.2 Face Matching Algorithm
5.4.3 Face Matching
5.5 Facial Caricatures for Recognition and Visualization
5.6 Summary
6 Conclusions and Future Directions
6.1 Conclusions
6.2 Future Directions
6.2.1 Face Detection & Tracking
6.2.2 Face Modeling
6.2.3 Face Matching
APPENDICES
A Transformation of Color Space
A.1 Linear Transformation
A.2 Nonlinear Transformation
A.3 Skin Classifier
B Distance between Skin Patches
C Image Processing Template Library (IPTL)
C.1 Image and Image Template
C.2 Example Code
BIBLIOGRAPHY

LIST OF TABLES

2.1 Summary of various face detection approaches.
2.2 Geometric compression efficiency.
2.3 Summary of performance of various face detection approaches.
3.1 Detection results on the HHI image database (image size 640 x 480) on a PC with 1.7 GHz CPU. FP: False Positives, DR: Detection Rate.
3.2 Detection results on the Champion database (image size ~ 150 x 220) on a PC with 860 MHz CPU. FP: False Positives, DR: Detection Rate.
5.1 Error rates on a 50-image database.
5.2 Dimensions of the semantic graph descriptors for individual facial components.

LIST OF FIGURES

1.1 Applications using face recognition technology: (a) and (b) automated video surveillance (downloaded from Visionics [1] and FaceSnap [2], respectively); (c) and (d) access control (from Visionics [1] and from Viisage [3], respectively); (e) management of photo databases (from Viisage [3]); (f) multimedia communication (from Eyematic [4]). Images in this dissertation are presented in color.
1.1 (Cont'd).
1.1 (Cont'd).
1.2 Comparison of various biometric features: (a) based on Zephyr analysis (downloaded from [5]); (b) based on MRTD compatibility (from [6]).
1.3 Intra-subject variations in pose, illumination, expression, occlusion, accessories (e.g., glasses), color, and brightness.
1.4 Face comparison: (a) face verification/authentication; (b) face identification/recognition. Face images are taken from the MSU face database [7].
1.5 Head recognition versus face recognition: (a) Clinton and Gore heads with the same internal facial features, adapted from [8]; (b) two faces of different subjects with the same internal facial components show the important role of hair and face outlines in human face recognition.
1.6 Caricatures of (a) Vincent Van Gogh; (b) Jim Carrey; (c) Arnold Schwarzenegger; (d) Einstein; (e) G. W. Bush; and (f) Bill Gates. Images are downloaded from [9], [10] and [10]. Caricatures reveal the use of component weights in face identification.
1.7 Cartoons reveal that humans can easily recognize characters whose facial components are depicted by simple line strokes and color characteristics: (a) and (b) are frames adapted from the movie Pocahontas; (c) and (d) are frames extracted from the movie Little Mermaid II. (Disney Enterprises, Inc.)
1.8 Configuration of facial components: (a) face image; (b) face image in (a) with enlarged eyebrow-to-eye and nose-to-mouth distances; (c) inverted face of the image in (b). A small change of component configuration results in a significantly different facial appearance in an upright face in (b); however, this change may not be perceived in an inverted face in (c).
1.9 Facial features/components: (a) five kinds of facial features (i.e., eyebrows, eyes, nose, ears, and mouth) in a face for reading faces in physiognomy (downloaded from [11]); (b) a frontal semantic face graph, whose nodes are facial components that are filled with different shades.
1.10 Similarity of frontal faces between (a) twins (downloaded from [12]); and (b) a father and his son (downloaded from [13]).
1.11 System diagram of our 3D model-based face recognition system using registered range and color images.
1.12 System diagram of our 3D model-based face recognition system without the use of range data.
1.13 Face images taken under unconstrained environments: (a) a crowd of people (downloaded from [14]); (b) a photo taken at a swimming pool.
1.14 Face images for our detection algorithm: (a) a montage image containing images adapted from the MPEG7 content set [15]; (b) a family photo.
1.15 Face images not suitable for our detection algorithm: (a) cropped image (downloaded from [16]); (b) a performer wearing make-up (from [14]); (c) people wearing face masks (from [14]).
1.16 Graphical user interfaces of the FaceGen Modeller [17]. A 3D face model shown (a) with texture mapping; (b) with wireframe overlaid.
1.17 A face retrieval interface of the FACEit system [18]: the system gives the most similar face in a database given a query face image.
2.1 Outputs of several face detection algorithms: (a), (b) Féraud et al. [19]; (c) Maio et al. [20]; (d), (e) Garcia et al. [21]; (f) Schneiderman et al. [22]; (g) Rowley et al. [23]; (h), (i) Rowley et al. [24]; (j) Sung et al. [25]; (k) Yow et al. [26]; (l) Lew et al. [27].
2.1 (Cont'd).
2.2 Examples of face images selected from (a) the FERET database [28]; (b) the MIT database [29]; (c) the XM2VTS database [30].
2.3 Internal representations of the PCA-based approach and the LDA-based approach (from Weng and Swets [31]). The average (mean) images are shown in the first column. Most Expressive Features (MEF) and Most Discriminating Features (MDF) are shown in (a) and (b), respectively.
2.4 Internal representations of the EBGM-based approach (from Wiskott et al. [32]): (a) a graph is overlaid on a face image; (b) a reconstruction of the image from the graph; (c) a reconstruction of the image from a face bunch graph using the best fitting jet at each node. Images are downloaded from [33]; (d) a bunch graph whose nodes are associated with a bunch of jets [33]; (e) an alternative interpretation of the concept of a bunch graph [33].
2.5 Internal representations of the LFA-based approach (from Penev and Atick [34]). (a) An average face image is marked with five localized features; (b) five topographic kernels associated with the five localized features are shown in the top row, and the corresponding residual correlations are shown in the bottom row.
2.6 A breakdown of face recognition algorithms based on the pose-dependency, face representation, and features used in matching.
2.7 Face modeling using anthropometric measurements (downloaded from [35]): (a) anthropometric measurements; (b) a B-spline face model.
2.8 Generic face models: (a) Waters' animation model; (b) anthropometric measurements; (c) six kinds of face models for representing general facial geometry.
3.1 Face detection algorithm. The face localization module finds face candidates, which are verified by the detection module based on facial features.
3.2 Skin detection: (a) a yellow-biased face image; (b) a lighting compensated image; (c) skin regions of (a) shown in white; (d) skin regions of (b).
3.3 The YCbCr color space (blue dots represent the reproducible color on a monitor) and the skin tone model (red dots represent skin color samples). (a) The YCbCr space; (b) a 2D projection in the Cb-Cr subspace; (c) a 2D projection in the (Cb/Y)-(Cr/Y) subspace.
3.4 The dependency of skin tone color on luma. The skin tone cluster (red dots) is shown in (a) the rgY, (c) the CIE xyY, and (e) the HSV color spaces; the 2D projection of the cluster is shown in (b) the r-g, (d) the x-y, and (f) the S-H color subspaces, where blue dots represent the reproducible color on a monitor. For a better presentation of cluster shape, we normalize the luma Y in the rgY and the CIE xyY spaces by 255, and swap the hue and saturation coordinates in the HSV space. The skin tone cluster is less compact at low saturation values in (e) and (f).
3.5 2D projections of the 3D skin tone cluster in (a) the Y-Cb subspace; (b) the Y-Cr subspace. Red dots indicate the skin cluster. Three blue dashed curves, one for the cluster center and two for the boundaries, indicate the fitted models.
3.6 The nonlinear transformation of the YCbCr color space. (a) The transformed YCbCr color space; (b) a 2D projection of (a) in the Cb-Cr subspace, in which the elliptical skin model is overlaid on the skin cluster.
3.7 Nonlinear color transform. Six detection examples, with and without the transform, are shown. For each example, the images shown in the first column are skin regions and detections without the transform, while those in the second column are results with the transform.
3.8 Construction of the face mask. (a) Face candidates; (b) one of the face candidates; (c) grouped skin areas; (d) the face mask.
3.9 Construction of eye maps: (a) from chroma; (b) from luma; (c) the combined eye map.
3.10 An example of a hemispheric structuring element for grayscale morphological dilation and erosion with a = 1.
3.11 Construction of the mouth map.
3.12 Computation of face boundary and the eye-mouth triangle.
3.13 Geometry of an eye-mouth triangle; the unit vectors are perpendicular to the interocular segment and the horizontal axis, respectively.
3.14 Attenuation term plotted as a function of the angle (in degrees); it has a maximal value of 1 at 0 degrees and a value of 0.5 at 25 degrees.
3.15 Face detection examples containing dark skin-tone faces. Each example contains an input image, grouped skin regions shown in pseudo color, and a lighting-compensated image overlaid with detected face and facial features.
3.16 Face detection results on closed-eye or open-mouth faces. Each example contains an original image (top) and a lighting-compensated image (bottom) overlaid with face detection results.
3.17 Face detection results in the presence of eye glasses. Each example contains an original image (top) and a lighting-compensated image (bottom) overlaid with face detection results.
3.18 Face detection results for subjects with facial hair. Each example contains an original image (top) and a lighting-compensated image (bottom) overlaid with face detection results.
3.19 Face detection results on half-profile faces. Each example contains an original image (top) and a lighting-compensated image (bottom) overlaid with face detection results.
3.20 Face detection results on a subset of the HHI database: (a) input images; (b) grouped skin regions; (c) face candidates; (d) detected faces are overlaid on the lighting-compensated images.
3.21 Face detection results on a subset of the Champion database: (a) input images; (b) grouped skin regions; (c) face candidates; (d) detected faces are overlaid on the lighting-compensated images.
3.22 Face detection results on a subset of eleven family photos. Each image contains multiple human faces. The detected faces are overlaid on the color-compensated images. False negatives are due to extreme lighting conditions and shadows. Notice the difference between the input and color-compensated images in terms of color balance. The bias color in the original images has been compensated in the resultant images.
3.22 (Cont'd).
3.22 (Cont'd).
3.23 Face detection results on a subset of 24 news photos. The detected faces are overlaid on the color-compensated images. False negatives are due to extreme lighting conditions, shadows, and low image quality (i.e., high compression rate).
3.24 Graphical user interface (GUI) for face editing: (a) detection mode; (b) editing mode.
4.1 The system overview of the proposed modeling method based on a 3D generic face model.
4.2 3D triangular-mesh model and its feature components: (a) the frontal view; (b) a side view; (c) feature components.
4.3 Phong-shaded 3D model shown at three viewpoints. Illumination is in front of the face model.
4.4 Facial measurements of a human face: (a) color image; (b) range map; and the range map with texture mapped for (c) a left view; (d) a profile view; (e) a right view.
4.5 Facial features overlaid on the color image, (a) obtained from face detection; (b) generated for face modeling.
4.6 Global alignment of the generic model (in red) to the facial measurements (in blue): the target mesh is plotted in (a) in a hidden-line-removal mode for a side view; (b) in a see-through mode for a profile view.
4.7 Displacement propagation.
4.8 Local feature alignment and displacement propagation shown for the frontal view: (a) the input generic model; the model adapted to (b) the left eye; (c) the nose; (d) mouth and chin.
4.9 Local feature refinement: initial (in blue) and refined (in red) contours overlaid on the energy maps for (a) the face boundary; (b) the nose; (c) the left eye; and (d) the mouth.
4.10 The adapted model (in red) overlapping the target measurements (in blue), plotted (a) in 3D; (b) with colored facets at a profile view.
4.11 Texture mapping. (a) The texture-mapped input range image. The texture-mapped adapted mesh model shown for (b) a frontal view; (d) a left view; (e) a profile view; (f) a right view.
4.12 Face matching: the top row shows the 15 training images generated from the 3D model; the bottom row shows 10 test images of the subject captured from a CCD camera.
5.1 Semantic face graph shown in a frontal view, whose nodes are (a) indicated by text; (b) depicted by polynomial curves; (c) filled with different shades. The edges of the semantic graph are implicitly stored in a 3D generic face model and are hidden here.
5.2 3D generic face model: (a) Waters' triangular-mesh model shown in the side view; (b) model in (a) overlaid with facial curves including hair and ears at a side view; (c) model in (b) shown in the frontal view.
5.3 Semantic face graphs for the frontal view are reconstructed using Fourier descriptors with spatial frequency coefficients increasing from (a) 10% to (j) 100% at increments of 10%.
5.4 Face detection results: (a) and (c) are input face images of size 640 x 480 from the MPEG7 content set; (b) and (d) are detected faces, each of which is described by an oval and a triangle.
5.5 Boundary map and eye component map for coarse alignment: (a) and (b) are gradient magnitude and orientation, respectively, obtained from multi-scale Gaussian-blurred edge response; (c) an eye map extracted from a face image shown in Fig. 5.4(c); (d) a semantic face graph overlaid on a 3D plot of the eye map; (e) image overlaid with a coarsely aligned face graph.
5.6 Shadow maps: (a) and (c) are luma components of face images in Figs. 5.4(a) and 5.4(c), overlaid with rectangles within which the average values of skin intensity are calculated; (b) and (d) are shadow maps where bright pixels indicate the regions that are darker than average skin intensity.
5.7 Coarse alignment: (a) input face images of size 640 x 480 from the MPEG7 content set (first three rows), and of size 256 x 384 from the MSU database (the fourth row); (b) detected faces; (c) locations of eyebrow, nostril, and mouth lines using shadow maps; (d) face images overlaid with coarsely aligned face graphs.
5.8 Interacting snakes: (a) face region extracted from a face image shown in Fig. 5.4(a); (b) image in (a) overlaid with a (projected) semantic face graph; (c) the initial configuration of interacting snakes obtained from the semantic face graph shown in (b).
5.9 Repulsion force: (a) interacting snakes with index numbers marked; (b) the repulsion force computed for the hair outline; (c) the repulsion force computed for the face outline.
5.10 Gradient vector field: (a) face region of interest extracted from a 640 x 480 image; (b) thresholded gradient map based on the population of edge pixels shown as dark pixels; (c) gradient vector field.
5.11 Component energy (darker pixels have stronger energy): (a) face region of interest; (b) eye component energy; (c) mouth component energy; (d) nose boundary energy; (e) nose boundary energy shown as a 3D mesh surface.
5.12 Fine alignment: (a) snake deformation shown every five iterations; (b) aligned snakes (currently six snakes, for the hairstyle, face border, eyes, and mouth, are interacting); (c) gradient vector field overlaid with the aligned snakes.
5.13 Fine alignment with evolution steps: (a) a face image; (b) the face in (a) overlaid with a coarsely aligned face graph; (c) initial interacting snakes with different shades in facial components (cartoon face); (d) curve evolution shown every five iterations (55 iterations in total); (e) an aligned cartoon face.
5.14 Fine alignment using geodesic active contours: (a) a generic cartoon face constructed from interacting snakes; (b) to (f) for five different subjects. For each subject, the image in the first row is the captured face image; the second row shows semantic face graphs obtained after coarse alignment, and overlaid on the color image; the third row shows semantic face graphs with individual components shown in different shades of gray; the last row shows face graphs with individual components after fine alignment.
5.15 A semantic face matching algorithm.
5.16 Five color images (256 x 384) of a subject.
5.17 Face images of ten subjects.
5.18 Examples of misclassification: (a) input test image; (b) semantic face graph of the image in (a); (c) face graph of the misclassified subject; (d) face graph of the genuine subject obtained from the other images of the subject in the database (i.e., without the input test image in (a)). Each row shows one example of misclassification.
5.19 Cartoon faces reconstructed from Fourier descriptors using all the frequency components: (a) to (j) are ten average cartoon faces for ten different subjects based on five images for each subject. Individual components are shown in different shades in (a) to (e).
5.20 Cartoon faces reconstructed from Fourier descriptors using only 50% of the frequency components: (a) to (j) are ten average cartoon faces for ten different subjects based on five images for each subject. Individual components are shown in different shades in (a) to (e).
5.21 Cartoon faces reconstructed from Fourier descriptors using only 30% of the frequency components: (a) to (j) are ten average cartoon faces for ten different subjects based on five images for each subject. Individual components are shown in different shades in (a) to (e).
5.22 Facial caricatures generated based on a generic 3D face model: (a) a prototype of the semantic face graph, G0, obtained from a generic 3D face model, with individual components shaded; (b) face images of six different subjects; (c)-(g) caricatures of faces in (b) (semantic face graphs with individual components shown in different shades) with different values of exaggeration coefficients, k, ranging from 0.1 to 0.9.
5.23 Facial caricatures generated based on the average face of 50 faces (5 for each subject): (a) a prototype of the semantic face graph, G0, obtained from the mean face of the database, with individual components shaded; (b) face images of six different subjects; (c)-(g) caricatures of faces in (b) (semantic face graphs with individual components shown in different shades) with different values of exaggeration coefficients, k, ranging from 0.1 to 0.9.
6.1 A prototype of a face identification system with the tracking function.
6.2 An example of motion detection in a video frame: (a) a color video frame; (b) extracted regions with significant motion; (c) detected moving skin patches shown in pseudocolor; (d) extracted face candidates described by rectangles.
6.3 Face tracking results on a sequence of 25 video frames. These images are arranged from top to bottom and from left to right. Detected faces are overlaid on the lighting-compensated images.
A.1 Color spaces: (a) RGB; (b) YCbCr.
C.1 Architecture of IPTL class templates.

Chapter 1

Introduction

In recent years face recognition has received substantial attention from researchers in the biometrics, pattern recognition, and computer vision communities (see surveys in [36], [37], [38]). This common interest among researchers working in diverse fields is motivated by our remarkable ability to recognize people (although in the case of certain rare brain disabilities, e.g., prosopagnosia or face blindness [39], this recognition ability is lost) and by the fact that human activity is a primary concern both in everyday life and in cyberspace. Besides, there are a large number of commercial, security, and forensic applications requiring the use of face recognition technology. These applications (see Fig. 1.1) include automated video surveillance (e.g., Super Bowl face scans and airport security checkpoints), access control (e.g., to personal computers and private buildings), mugshot identification (e.g., for issuing driver licenses), design of human-computer interfaces (HCI) (e.g., classifying the activity of a vehicle driver), multimedia communication (e.g., generation of synthetic faces), and content-based image database management [40]. These applications involve locating, tracking, and recognizing a single (or multiple) human subject(s) or face(s).

Face recognition is an important biometric identification technology, and the facial scan is an effective biometric attribute/indicator. Different biometric indicators are suited for different kinds of identification applications due to their variations in intrusiveness, accuracy, cost, and (sensing) effort [5] (see Fig. 1.2(a)). Among the six biometric indicators considered in [6], facial features scored the highest compatibility, shown in Fig. 1.2(b), in a machine readable travel documents (MRTD) system based on a number of evaluation factors, such as enrollment, renewal, machine requirements, and public perception [6].

1.1 Challenges in Face Recognition

Humans can easily recognize a known face in various conditions and representations (see Fig. 1.3). Such a remarkable ability of humans to recognize faces with large intra-subject variations has inspired vision researchers to develop automated systems for face recognition based on 2D face images. However, current state-of-the-art machine vision systems can recognize faces only in a constrained environment. Note that there are two types of face comparison scenarios: (i) face verification (or authentication) and (ii) face identification (or recognition). As shown in Fig. 1.4, face verification involves a one-to-one match that compares a query face image against a template face image whose identity is being claimed, while face identification involves one-to-many matches that compare a query face image against all the template images in a face database to determine the identity of the query face.
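The contrast between these two matching modes can be made concrete with a small sketch. The C++ fragment below is illustrative only and is not part of the recognition systems developed in this dissertation: the feature-vector representation, the Euclidean distance, the acceptance threshold, and the function names are hypothetical placeholders for whatever face representation and matcher a real system would use.

```cpp
// Illustrative sketch only: a "face template" is reduced to a generic feature
// vector, and matching is a thresholded Euclidean distance.  Real systems use
// far richer representations (e.g., the 3D models and semantic face graphs
// developed in this dissertation).
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using FaceTemplate = std::vector<float>;   // hypothetical feature vector

// Dissimilarity between two templates (placeholder metric).
double templateDistance(const FaceTemplate& a, const FaceTemplate& b) {
    double d2 = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const double diff = a[i] - b[i];
        d2 += diff * diff;
    }
    return std::sqrt(d2);
}

// Verification (authentication): one-to-one match against the claimed identity.
bool verify(const FaceTemplate& query, const FaceTemplate& claimed,
            double threshold) {
    return templateDistance(query, claimed) <= threshold;
}

// Identification (recognition): one-to-many match against the whole gallery.
// Returns the index of the closest gallery template, or -1 if no template is
// close enough (i.e., the query matches no enrolled subject).
int identify(const FaceTemplate& query,
             const std::vector<FaceTemplate>& gallery, double threshold) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < gallery.size(); ++i) {
        const double d = templateDistance(query, gallery[i]);
        if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    return (bestDist <= threshold) ? best : -1;
}
```

The practical consequence of the distinction is cost and error behavior: verification performs a single comparison against the claimed identity, while identification scales with the gallery size and must also reject queries that match no enrolled subject.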
Applications using face recognition technology: (a) and (b) automated video surveillance (downloaded from Visionics [1] and FaceSnap [2], respectively); (c) and (d) access control (from Visionics [I] and from Viisage [3] respectively); (e) management of photo databases (from Viisage [3]); (f) multimedia communication (from Eyematic [4]). (d) Figurc1.1. (Cont’d). [. ll lam-.1. . . a Figure 1.1. Zeohyr" Analysis @ Ko troke- Sex: Hand-Scan Signature- Faclal- Scan Scan Retina- Finger- can Scan Iris-Scan VoiceScan o W W lbs-tic em ] uletm-Ammy 0M oEllon (a) .5 c | Weighted percentage N o l Face Finger Iland Voice En Signature (1)) Figure 1.2. Comparison of various biometric features: (a) based on zephyr analysis (downloaded from [5]); (b) based on MRTD compatibility (from [6]). .l: g El. 5. a. a E' Figure 1.3. Intra—subject variations in pose, illumination, expression, occlusion, accessories (e.g., glasses), color, and brightness. are One-to-one ,1 One-to:many Query Template Query Template (8) (b) Figure 1.4. Face comparison: (a) face verification/ authentication; (b) face identifi- cation/recognition. Face images are taken from the MSU face database [7]. lenge in vision-based face recognition is the presence of a high degree of variability in human face images. There can be potentially very large intra—subject variations (due to 3D head pose, lighting, facial expression, facial hair, and aging [41]) and rather small inter-subject variations (due to the similarity of individual appearances). Cur- rently available vision-based recognition techniques can be mainly categorized into two groups based on the face representation which they use: (i) appearance~based which use holistic texture features, and (ii) geometry-based which use geometrical features of the face. Experimental results Show that appearance-based methods gen- erally perform a better recognition task than those based on geometry, because it is difficult to robustly extract geometrical features especially in face images of low 6 resolutions and of poor quality (i.e., to extract features under uncertainty). However, the appearance-based recognition techniques have their own limitations in recognizing human faces in images with wide variations in 3D head pose and in illumination [38]. Hence, in order to overcome variations in pose, a large number of face recognition techniques have been developed to take into account the 3D face shape, extracted either from a video sequence or range data. As for overcoming the variations in illumination, several studies have explored features such as edge maps (e.g., eigen- hills and eigenedges in [42]), intensity derivatives, Gabor-filter responses [43], and the orientation fields of intensity gradient [44]. However, none of these approaches by themselves lead to satisfactory recognition results. Hence, the explicit 3D face model combined with its reflectance model is believed to be the best representation of human faces for the appearance-based approach [43]. 1.2 Semantic Facial Components Face recognition technology provides useful tools for content-based image and video retrieval based on a semantic (high-level) concept, i.e., human faces. Is all face pro- cessing holistic [45]? Some approaches, including feature—based and appearance-based [46] methods, emphasize that internal facial features (i.e., pure face regions) play the most important role in face recognition. 
On the other hand, some appearance-based methods suggest that in some situations face recognition is better interpreted as head recognition [8], [31]. An example supporting the above argument was demon- strated for Clinton and Gore heads [8] (See Fig. 1.5(a)). While the two faces in 7 Democrat coalition. ' ,._ (a) (b) Figure 1.5. Head recognition versus face recognition: (a) Clinton and Gore heads with the same internal facial features, adapted from [8]; (b) two faces of different subjects with the same internal facial components show the important role of hair and face outlines in human face recognition. Fig. 1.5(a) have identical internal features, we can still distinguish Clinton from Gore. We notice that in this “example” the hair style and the face outline are significantly different. We reproduce this scenario, across genders, in Fig. 1.5(b). Humans will usually identify these two persons with different identities. This prompted Liu et a1. [47] to emphasize that there is no use of face masks (to remove the “non-pure- face” portion) in their appearance—based method. As a result, we believe that the separation of external and internal facial features/ components is helpful in assigning weights on external and internal facial features in the face recognition process. Modeling facial components at the semantic level (i.e., eyebrows, eyes, nose, mouth, face outline, ears, and the hair outline) helps to separate external and in- ternal facial components, and to understand how these individual components con- tribute to face recognition. Examples of modeling facial components can be found in the faces represented in caricatures and cartoons. However, the fact that humans 8 can recognize known faces in caricature drawings (e.g., faces shown in Fig. 1.6) and cartoons (see Fig.1.?) without any difficulty has not been fully explored in research studies on face recognition [48], [49], [50], [51]. Note that some of the faces shown (a) (9) Figure 1.6. Caricatures of (a) Vincent Van Gogh; (b) Jim Carrey; (c) Arnold Schwarzenegger; (d) Einstein; (e) G. W. Bush; and (f) Bill Gates. Images are down- loaded from [9], [10] and [10]. Caricatures reveal the use of component weights in face identification. (a) (b) (C) (d) Figure 1.7. Cartoons reveal that humans can easily recognize characters whose facial components are depicted by simple line strokes and color characteristics: (a) and (b) are frames adapted from the movie Pocahontas; (c) and (d) are frames extracted from the movie Little Mermaid II. (Disney Enterprises, Inc.) in Fig. 1.6 are represented only by strokes (geometrical features), while some others have parts of facial features dramatically emphasized with some distortion. Cartoon faces are depicted by line drawings and color without shading. People can easily identify faces in caricatures (see, Fig. 1.6) that exaggerate some of the facial compo- nents/landmarks. Besides, we can also easily identify known faces merely based on some salient facial components. For example, we can quickly recognize a known face 9 with a distinctive chin no matter whether the face appears in a caricature (e.g., Jim Carrey shown in Fig. 1.6(b)) or in a real photo [52]. Caricatures reveal that there are certain facial features which are salient for each individual and that a relatively easier identification of faces can occur by emphasizing distinctive facial components (using weights) and their configuration. 
Besides, the spatial configuration of facial components has been shown to take a more important role in face recognition than local texture by using inverted faces [53] in which the (upright) face recognition is disrupted (see Fig. 1.8). Therefore, we group these salient facial components [48] as (a) (b) (C) Figure 1.8. Configuration of facial components: (a) face image; (b) face image in (a) with enlarged eyebrow—to—eye and nose-to—mouth distances; (c) inverted face of the image in (b). A small change of component configuration results in a significantly different facial appearance in an upright face in (b); however, this change may not be perceived in an inverted face in (c). a graph and derive component weights in our face matching algorithm to improve the recognition performance. In addition, humans can recognize faces in the presence of occlusions, i.e., face recognition can be based on a (selected) subset of facial components. This explains 10 the motivation for studies that attempt to recognize faces from eyes only [54]. The use of component weights can facilitate face recognition based on selected facial com- ponents. Furthermore, the shape of facial components (see Fig. 1.9(a)) has been used in physiognomy (or face reading, an ancient art of deciphering a person’s past and personality from his/her face). In light of this art, we design a semantic face graph for face recognition (see in Chapter 5), shown in Fig. 1.9(b), in which ten facial components are filled with different shades in a frontal view. (a) Figure 1.9. Facial features/components: (a) five kinds of facial features (i.e., eye- brows, eyes, nose, ears, and mouth) in a face for reading faces in physiognomy (down- loaded from [11]); (b) a frontal semantic face graph, whose nodes are facial compo- nents that are filled with different shades. For each facial component, the issue of representation also plays an important role in face recognition. It has been believed that local facial texture and shading are crucial for recognition [52]. However, some frames of a cartoon video, as shown in Fig. 11 1.7, reveal that line drawings and color characteristics (shades) of facial components (e.g., dark colors for eyebrows and both bright and dark colors for eyes) provide sufficient information for humans to recognize the faces of characters in cartoons. People can even recognize cartoon faces without the use of shading information, which is rather unstable under different lighting conditions. Consequently, we believe that curves (or sketches) and shades of facial‘components provide a promising solution to the representation of facial components for recognition. However, very little work has been done in face recognition based on facial sketches [55], [56] and (computer- generated [57]) caricatures [58], [48], [50]. In summary, external and internal facial components, and distinctiveness, config- uration and local texture of facial components all contribute to the process of face recognition. Humans can seamlessly blend and independently perform appearance- based and geometry-based recognition approaches efficiently. Therefore, we believe that merging [59], [60] the holistic texture features and the geometrical features (es- pecially at a semantic level) is a promising method to represent faces for recognition. While we focus on the 3D variations in faces, we should also take the temporal (aging) factor into consideration while designing face recognition systems [41]. 
In addition to large intra—subject variations, another difficulty in recognizing faces lies in the small inter-subject variations (Shown in Fig. 1.10). Different persons may have very similar appearances. Identifying people with very similar appearances remains a challenging task in automatic face recognition. 12 (a) (b) Figure 1.10. Similarity of frontal faces between (a) twins (downloaded from [12]); and (b) a father and his son (downloaded from [13]). 1.3 Face Recognition Systems Face recognition applications in fact involve several important steps, such as face detection for locating human faces, face tracking for following moving subjects, face modeling for representing human faces, face coding / compression for efficiently archiv- ing and transmitting faces, and face matching for comparing represented faces and identifying a query subject. Face detection is usually an important first step. De- tecting faces can be viewed as a two-class (face vs. non-face) classification problem, while recognizing faces can be regarded as a multiple—class (multiple subjects) classi- fication problem within the face class. Face detection involves certain aspects of face recognition mechanism, while face recognition employs the results of face detection. We can consider face detection and recognition as the first and the second stages in a sequential classification system. The crucial issue here is to determine an appro- priate feature space to represent a human face in such a classification system. We believe that a seamless combination of face detection, face modeling, and recogni- tion algorithms has the potential of achieving high performance for face identification 13 applications. With this principle, we propose two automated recognition paradigms, shown in Fig. 1.11 and Fig. 1.12, that can combine face detection as well as tracking (not included in this thesis, but can be realized based on our current work), modeling, and recognition. The first paradigm requires both video sequences and 2.5D/ 3D facial measurements as its input in the learning/ enrollment stage. In the recognition/test stage, however, face images are extracted from video input only. Faces are identified based on an appearance-based algorithm. The second paradigm requires only video sequences as its input in both learning and recognition stages. Its face recognition module makes use of a semantic face matching algorithm to compare faces based on weighted facial components. Both paradigms contain three major modules: (i) face detection and feature ex- traction, (ii) face modeling, and (iii) face recognition. The face detection/location and feature extraction module is able to locate faces in video sequences. The most important portion of this module is a feature extraction sub—module that extracts geometrical features (such as face boundary, eyes, eyebrows, nose, and mouth), and texture/ color features (estimation of the head pose and illumination is left as a future research direction). The face modeling module employs these extracted features for modifying the generic 3D face model in the learning and recognition stages. In this thesis, we describe the implementation of the face modeling module in both proposed paradigms for the frontal view only. The extension of the face modeling module to non-frontal views can be a future research direction. The recognition module makes use of facial features extracted from an input image and the learned 3D models to 14 verify the face present in an image in the recognition stage. 
This thesis has developed a robust face detection module which is used to facilitate applications such as face tracking for surveillance, and face modeling for identification (as well as verification). We will briefly discuss the topics of face detection and recognition, face modeling as well as compression, and face—based image retrieval in the following sections. 1.4 Face Detection and Recognition Human activity is a major concern in a wide variety of applications such as video surveillance, human computer interface, face recognition [37], [36], [38], and face image database management [40]. Detecting faces is a crucial step and usually the first one in these identification applications. However, due to various head poses, illumination conditions, occlusion, and distances between teh sensor and the subject (which may result in a blurred face), detecting human faces is an extremely difficult task under unconstrained environments (see images in Figs. 1.13 (a) and (b)). Most face recognition algorithms assume that the problem of face detection has been solved, that is, the face location is known. Similarly, face tracking algorithms (e.g., [61]) often assume the initial face location is known. Since face detection can be viewed as a two—class (face vs. non-face) classification problem, some techniques developed for face recognition (e.g., holistic/template approaches [21], [62], [63], [64], feature- based approaches [65], and their combination [66]) have been used to detect faces. However, these detection techniques are computationally very demanding and cannot handle large variations in faces. In addition to the face location, a face detection 15 Swen: .830 can awash woacumwwm: wEm: Scum? nosEchE can“ womefiamwwca Om .30 mo 8.9%ch 889mm .2; magma 5.5508: 8mm .. . .3 .4 .. _ a...“ 32%.; y ...:.._3..,,... =o.=o§xm 258m a 5:033 8mm 16 dump owns: mo 8: 2: 5055 839mm coSEmoOg 83 8.83-chpo QM So we Seawese Ecuam .NHA 8:th 8a". m 25. 8a". 85228 =8 8 fine”... swank.“ 8.8.3 8r. one; :5 ass is r at “UN. 3.. _, .t . 0 .0 a. . ,m , y ,2. 3...??an 5:52 mmmem. 22:20.88; i I macaw mEEmmq coaogxm Sega a 3:088 83 17 a ‘- ,. *.-_. ‘ ,—,,‘.::--».«-, . a.“ -_ . “3..-kn ~. Figure 1.13. Face images taken under unconstrained environments: (a) a crowd of people (downloaded from [14]); (b) a photo taken at a swimming pool. 18 algorithm can also provide geometrical facial features for face recognition. Merging the geometrical features and holistic texture (appearance—based) features is believed to be a promising method of representing faces for recognition [59], [60]. Therefore, we believe that a seamless combination of face detection and recognition algorithms has the potential of providing a high performance face identification algorithm. Hence, we have proposed a face detection algorithm for color images, which is able to generate geometrical as well as texture features for recognition. Our approach is based on modeling skin color and extracting geometrical facial features. The skin color is detected by using a lighting compensation technique and a nonlinear color transformation. The geometrical facial features are extracted from eye, mouth, and face boundary maps. The detected faces, including the extracted facial features, are organized as a graph for modeling and recognition processes. Our algorithm can detect faces under different head poses, illuminations, and expressions (see Fig. 1.14(a)), and family photos (see Fig. 1.14(b)). 
However, our detection algorithm is not designed for detecting faces in gray-scale images, cropped face images (see Fig. 1.15(a)) and faces wearing make—up or mask (see Figs. 1.15(b) and (0)). 1.5 Face Modeling for Recognition Our face recognition systems are based on 3D face models. 3D models of human faces have been widely used to facilitate applications such as video compression/ coding, human face tracking, facial animation, augmented reality, recognition of facial ex- pression, and face recognition. Figure 1.16 shows two graphical user interfaces of a 19 Figure 1.14. Face images for our detection algorithm: (a) a montage image containing images adapted from MPEG7 content set [15]; (b) a family photo. Figure 1.15. Face images not suitable for our detection algorithm: (a) cropped image (downloaded from [16]); (b) a performer wearing makeup (from [14]); (0) people wearing face masks (from [14]). 20 commercial parametric face modeling system [17], FaceGen Modeller, which is based on face shape statistics. It can efficiently create a character with specified age, gen- der, race, and caricature morphing. Current trend in face recognition is to employ 3D face model explicitly [67], because such a model provides a potential solution to identifying faces with variations in illumination, 3D head pose, and facial expres- sion. These variations, called the intra—subject variations, also include changes due to aging, facial hair, cosmetics, and facial accessories. These intra—subject variations constitute the primary challenges in the field of face recognition. As object-centered representations of human faces, 3D face models not only can augment recognition systems that utilize viewer-centered face representations (based on 2D face images), but also can blend together holistic approaches and geometry-based approaches for recognition. However, the three state-of-the—art face recognition algorithms [68], (1) the principal component analysis (PCA)-based algorithm; (2) the local feature anal- ysis (LFA)-based algorithm; and (3) the dynamic-link-architecture—based algorithm, use only viewer-centered representations of human faces. A 3D model—based match- ing algorithm is likely to provide a potential solution for advancing face recognition technology. However, for face recognition, it is more important to capture facial distinctiveness of recognition-oriented components than to generate a realistic face model. We briefly introduce our face modeling methods for recognition (i.e., face alignment) and model compression in the following subsections. 21 Generate [Shape] Texture] Tween] Animate] Photo ] All Races ]Nncan] European] SE Asran] E Indran] All Races Controls Step 1 Oplronal Generate l Make a random (are Set Mragei Reset to Iverage lace rep 2 '5' .ms :ep morph_ 'T' . Tenur morph Use ‘Sync Lock' to synchronize movement OHM 2 sMers Gender Age Carlcature Asymmetry S l T S I T S / T S I T ] very male The average T Symmetric ] ] zu [ i 30 l Typlcal ] Tywul Mlle [ 1 40 [ Cancature | Female 1 ' 5° 3 "ouster ] ‘ 1 Warner: so ] Q V 0 1 s Very Tamale 4 lmng p m P" Syn Lock 9 c Lock l" Lock l" Lock Shading Mode Movement C M 3"" 8W 7 Smoolh r‘ Flat 5 Rolalton r Tnnalalron Race Morphing rTenureModt—"a -- ~ 7 - , r: 0,. r , ] ] . 7 ~ 7 , T NrRacea All Races NlRacns ’ All Race: Video Gamma Lighting Option ] f————- T‘_'——_l . . . . . . . . . . . 
[FrontilLrghtmg Ll ‘ 1-0 1 0 Miran Europe-n SE Asian East lnalen [— Wlmame [— Shiny Display Background Colour Rrght.’ rack and drag on the face to Still: rt Lefi clrclr and drag to .7 . ,, , 1., unit rem facet um» furl-(um .‘Jtnlullrrr [.IJ [JIMU rMurlt‘l: llrtmult Moduli Fle Ed Generate] Shape] Tenure TWO" [Animate] Photo ] Target Face: Load Target Symmetry Asymmetry S / T S / T r [ Lees similar ] l ] a T T Comm .1 ]r ] ] in between I ] Vrmng Optrons [ r g l Shldrng Mode Movement ; [ 73’9“ [ ’; Smooth (‘ Flat I" Rotation r Translation ] , Exaggerate ] Texture Mode , » 7'7 Sync Loclr F Sync Lock P On "' Vrdeo Gamma Lrghtmg Oprron * F—-—‘ ~_. 1 0 2 o 7 V WIMrame F Shmy DlSplZy Background Colour I Rrg‘ht clrck and drag on the {ace to scale rt Len tlrcka and drag to more rl There eare (1340 late and 9858‘"? (b) Figure 1.16. Graphical user interfaces of the FaceGen modeller [17]. A 3D face model shown (a) with texture mapping; (b) with wireframe overlaid. 22 1.5.1 Face Alignment Using 2.5D Snakes In our first recognition system (shown in Fig. 1.11), we have proposed a face modeling method which adapts an existing generic face model (a priori knowledge of a human face) to an individual’s facial measurements (i.e., range and color data). We use the face model that was created for facial animation by Waters [69] as our generic face model. Waters’ model includes details of facial features that are crucial for face recognition. Our modeling process aligns the generic model onto extracted facial features (regions), such as eyes, mouth, and face boundary, in a global—to—local way, so that facial components that are crucial for recognition are fitted to the individual’s facial geometry. Our global alignment is based on the detected locations of facial components, while the local alignment utilizes two new techniques which we have developed, displacement propagation and 2.5D active contours, to refine local facial components and to smoothen the face model. Our goal of face modeling is to generate a learned 3D model of an individual for verifying the presence of the individual in a face database or in a video. The identification process involves (i) the modification of the learned 3D model based on different head poses and illumination conditions and (ii) the matching between 2D projections of the modified 3D model, whose facial shape is integrated with facial texture, and sensed 2D facial appearance. 1.5.2 Model Compression Requirements of easy manipulation, progressive transmission, effective visualization and economical storage for 3D (face) models have resulted in the need for model 23 compression. The complexity of an object model depends not only on object geom- etry but also on the choice of its representation. The 3D object models explored in computer vision and graphics research have gradually evolved from simple polyhedra, generated in mechanical Computer Aided Design (CAD) systems, to complex free- form objects, such as human faces captured from laser scanning systems. Although human faces have a complex shape, modeling them is useful for emerging applica- tions such as virtual museums and multimedia guidebooks for education [70], [71], low-bandwidth transmission of human face images for teleconferencing and interactive TV systems [72], virtual people used in entertainment [73], sale of facial accessories in e—commerce, remote medical diagnosis, and robotics and automation [74]. 
The major reason for us to adopt the triangular mesh as our generic human face model is that it is suitable for describing and simplifying the complexity of facial geometry. In addition, there are a number of geometry compression methods available for compressing triangular meshes (e.g., topological surgery [75] and multi-resolution mesh simplification [76]). Beyond these existing techniques, we can obtain a more compact representation of a 3D face model by carefully selecting vertices of the triangular mesh to represent the facial features that are extracted for face recognition. Our proposed semantic face graph used in the semantic recognition paradigm (see Fig. 1.12) is such an example.

1.5.3 Face Alignment Using Interacting Snakes

For the semantic recognition system (shown in Fig. 1.12), we define a semantic face graph. A semantic face graph is derived from a generic 3D face model for identifying faces at the semantic level. The nodes of a semantic graph represent high-level facial components (e.g., eyes and mouth), whose boundaries are described by open (or closed) active contours (or snakes). In our recognition system, face alignment plays a crucial role in adapting a priori knowledge of facial topology, encoded in the semantic face graph, onto the sensed facial measurements (e.g., face images). The semantic face graph is first projected onto a 2D image, coarsely aligned to the output of the face detection module, and then finely adapted to the face images using interacting snakes. Snakes are useful models for extracting the shape of deformable objects [77]. Hence, we model the component boundaries of a 2D semantic face graph as a collection of snakes. We propose an approach for manipulating multiple snakes iteratively, called interacting snakes, that minimizes the attraction energy functionals on both the contours and the enclosed regions of individual snakes and the repulsion energy functionals among multiple snakes that interact with each other. We evaluate the interacting snakes through two types of implementations, explicit (parametric active contours) and implicit (geodesic active contours) curve representations, for face alignment. Once the semantic face graph has been aligned to face images, we can derive component weights based on the distinctiveness and visibility of individual components. The aligned face graph can also be easily used to generate cartoon faces and facial caricatures by exaggerating the distinctiveness of facial components. After alignment, facial components are transformed to a feature space spanned by Fourier descriptors of facial components for face recognition, called semantic face matching. The matching algorithm computes the similarity between the semantic face graphs of face templates in a database and a semantic face graph that is adapted to a given face image. The semantic face graph allows face matching based on selected facial components, and effective 3D model updating based on 2D face images. The results of our face matching demonstrate that the proposed face model can lead to classification and visualization (e.g., the generation of cartoon faces and facial caricatures) of human faces using the derived semantic face graphs.

1.6 Face Retrieval

Today, people can accumulate a large number of images and video clips (digital content) because of the growing popularity of digital imaging devices, and because of the decreasing cost of high-capacity digital storage.
This significant increase in the amount of digital content requires database management tools that allow people to easily archive and retrieve content from their digital collections. Since humans and their activities are typically the subject of interest in consumers' images and videos, detecting people and identifying them will help to automate image and video archival based on a high-level semantic concept, i.e., human faces. For example, we can design a system that manages digital content of personal photos and amateur videos based on the concept of human faces, e.g., "retrieve all images containing Carrie's face." Using merely low-level features (e.g., skin color or color histograms) for retrieval and browsing is neither robust nor acceptable to the user. High-level semantics have to be used to make such an image/video management system useful. Fig. 1.17 shows a graphical user interface of a facial feature-based retrieval system [18].

Figure 1.17. A face retrieval interface of the FaceIt system [18]: the system returns the most similar face in a database given a query face image.

In summary, the ability to group low-level features into a meaningful semantic entity is a critical issue in the retrieval of visual content. Accurately and efficiently detecting human faces plays a crucial role in facilitating face identification for managing face databases. In face recognition algorithms, the high-level concept, a human face, is implicitly expressed by face representations such as locations of feature points, surface texture, 2D graphs with feature nodes, 3D head surfaces, and combinations of them. The face representation plays an important role in the recognition process because different representations lead to different matching algorithms. We can design a database management system that utilizes the outputs of our face detection and modeling modules as indices to search a database based on semantic concepts, such as "find all the images containing John's face" and "search for faces which have Vincent's eyes (or face shape)."

1.7 Outline of Dissertation

This dissertation is organized as follows. Chapter 2 presents a brief literature review on face detection and recognition, face modeling (including model compression), and face retrieval. In Chapter 3, we present our face detection algorithm for color images. Chapter 4 discusses our range data-based face modeling method for recognition. Chapter 5 describes the semantic face recognition system, including face alignment using interacting snakes, a semantic face matching algorithm, and the generation of cartoon faces and facial caricatures. Chapter 6 presents conclusions and future directions related to this work.

1.8 Dissertation Contributions

The major contributions of this dissertation are categorized into the topics of face detection, face modeling, and face recognition. In face detection, we have developed a new face detection algorithm for multiple non-profile-view faces with complex backgrounds in color images, based on the localization of skin-tone color and facial features such as eyes, mouth, and face boundary. The main properties of this algorithm are listed as follows.

• Lighting compensation: This method corrects the color bias and recovers the skin-tone color by automatically estimating the reference white pixels in a color image, under the assumption that an image usually contains "real white" (i.e., white reference) pixels and that the dominant bias color in an image always appears in the "real white".
• Non-linear color transformation: In the literature, the chrominance components of the skin tone have been assumed to be independent of the luminance component of the skin tone. We found that the chroma of the skin tone depends on the luma. We overcome the difficulty of detecting the low-luma and high-luma skin tone colors by applying a nonlinear transform to the YCbCr color space. The transformation is based on the linearly fitted boundaries of our training skin cluster in the Y-Cb and Y-Cr color subspaces.

• Modeling a skin-tone color classifier as an elliptical region: A simple classifier which constructs an elliptical decision region in the chroma subspace, Cb-Cr, has been designed, under the assumption of a Gaussian distribution of skin tone color.

• Construction of facial feature maps for eyes, mouth, and face boundary: With the use of gray-scale morphological operators (dilation and erosion), we construct these feature maps by integrating the luminance and chrominance information of facial features. For example, eye regions have high Cb (difference between the blue and green colors) and low Cr (difference between the red and green colors) values in the chrominance components, and have both brighter and darker values in the luminance component.

• Construction of a diverse database of color images for face detection: The database includes the MPEG7 content set, mug-shot style web photos, family photos, and news photos.

In face modeling, we have designed two methods for aligning a 3D generic face model onto facial measurements captured in the frontal view: one uses facial measurements of registered color and range data; the other merely uses color images. In the first method, we have developed two techniques for face alignment:

• 2.5D snake: A 2.5D snake is designed to locally adapt a contour to each facial component. The design of the snake includes an iterative deformation formula, the placement of initial contours, and the minimization of an energy functional. We reformulated 2D active contours (a dynamic programming approach) for 3D contours of eye, nose, mouth, and face boundary regions. We constructed initial contours based on the outputs of face detection (i.e., the locations of the face and facial components). We form energy maps for individual facial components based on the 2D color image and 2.5D range data, hence the name 2.5D snake.

• Displacement propagation: This technique is designed to propagate the displacement of a group of vertices on a 3D face model from contour points on facial components to other points on non-facial components. The propagation can be applied to a 3D face model whenever a facial component is coarsely relocated or is finely deformed by the 2.5D snake.

In the second face modeling method, we developed a technique for face alignment:

• Interacting snakes: The snake deformation is formulated by a finite-difference approach. The initial snakes for facial components are obtained from the 2D projection of the semantic face graph on a generic 3D face model. We have designed the interacting snakes technique for manipulating multiple snakes iteratively; it minimizes the attraction energy functionals on both the contours and the enclosed regions of individual snakes and minimizes the repulsion energy functionals among multiple snakes.

In face recognition, we have proposed two paradigms, as shown in Figs. 1.11 and 1.12.

• The first (range data-based) recognition paradigm: This paradigm is designed to automate and augment appearance-based face recognition approaches based on 3D face models.
In this system, we have integrated our face detection algorithm, the face modeling method using the 2.5D snake, and an appearance-based recognition method using hierarchical discriminant regression [78]. However, the recognition module can be replaced with other appearance-based algorithms such as PCA-based and LDA-based methods. The system can learn a 3D face model for an individual, and generate an arbitrary number of 2D face images under different head poses and illuminations (this can be extended to different expressions) for training an appearance-based face classifier.

• The second (semantic) recognition paradigm: This paradigm is designed to automate the face recognition process at a semantic level based on the distinctiveness and visibility of facial components in a given face image captured in near-frontal views. (This paradigm can be extended to face images taken in non-frontal views.) We have decomposed a generic 3D face model into recognition-oriented facial components and non-facial components, and formed a 3D semantic face graph for representing facial topology and extracting facial components. In this recognition system, we have integrated our face detection algorithm, our face modeling method using interacting snakes, and our semantic face matching algorithm. The recognition can be achieved at a semantic level (e.g., comparing faces based on the eyes and the face boundary only) due to the alignment of facial components. We have also introduced component weights, which play a crucial role in face matching, to emphasize a component's distinctiveness and visibility. The system can generate cartoon faces from aligned semantic face graphs and facial caricatures based on an averaged face graph for face visualization.

Chapter 2

Literature Review

We first review the development of face detection and recognition approaches, followed by a review of face modeling and model compression methods. Finally, we will present one major application of face recognition technology, namely, face retrieval. We primarily focus on the methods that employ the task-specific cognition or behaviors specified by humans (i.e., artificial intelligence pursuits), although developmental approaches for facial processing (e.g., autonomous mental development [79] and incremental learning [80] methods) have emerged recently.

2.1 Face Detection

Various approaches to face detection are discussed in [19], [20], [81], [82], and [83]. The major approaches are listed chronologically in Table 2.1 for comparison. For recent surveys on face detection, see [82] and [83]. These approaches utilize techniques such as principal component analysis (PCA), neural networks, machine learning, information theory, geometrical modeling, (deformable) template matching, the Hough transform, extraction of geometrical facial features, motion extraction, and color analysis.

Table 2.1 SUMMARY OF VARIOUS FACE DETECTION APPROACHES.

Authors | Year | Approach | Features Used | Head Pose | Test Databases | Minimal Face Size
Féraud et al. [19] | 2001 | Neural networks | Motion; color; texture | Frontal to profile | Sussex; CMU; Web images | 15 x 20
DeCarlo et al. [61] | 2000 | Optical flow; deformable face model | Motion; edge; texture | Frontal to profile | Videos | NA
Maio et al. [20] | 2000 | Facial templates; Hough transform | Texture; directional images | Frontal | Video images | 20 x 27
Abdel-Mottaleb et al. [84] | 1999 | Skin model; feature | Color | Frontal to profile | HHI | 13 x 13
Garcia et al. [21] | 1999 | Statistical wavelet analysis | Color; wavelet coefficients | Frontal to near frontal | MPEG videos | 80 x 48
Wu et al. [85] | 1999 | Fuzzy color models; template matching | Color | Frontal to profile | Still color images | 20 x 24
Rowley et al. [24], [23] | 1998 | Neural networks | Texture | (Upright) frontal | FERET; CMU; Web images | 20 x 20
Sung et al. [25] | 1998 | Learning | Texture | Frontal | Video images; newspaper scans | 19 x 19
Colmenarez et al. [86] | 1997 | Learning | Markov processes | Frontal | FERET | 11 x 11
Yow et al. [26] | 1997 | Feature; belief networks | Geometrical facial features | Frontal to profile | CMU | 60 x 60
Lew et al. [27] | 1996 | Markov random field; DFFS [64] | Most informative pixels | Frontal | MIT; CMU; Leiden | 23 x 32

Typical detection outputs are shown in Fig. 2.1. In these images, a detected face is usually overlaid with graphical objects such as a rectangle or an ellipse for a face, and circles or crosses for eyes. The neural network-based [24], [23] and the view-based [25] approaches require a large number of face and non-face training examples, and are designed primarily to locate frontal faces in grayscale images. It is difficult to enumerate "non-face" examples for inclusion in the training databases. Schneiderman and Kanade [22] extend their learning-based approach for the detection of frontal faces to profile views. A feature-based approach combining geometrical facial features with belief networks [26] provides face detection for non-frontal views. Geometrical facial templates and the Hough transform were incorporated to detect grayscale frontal faces in real-time applications [20]. Face detectors based on Markov random fields [27], [87] and Markov chains [88] make use of the spatial arrangement of pixel gray values. Model-based approaches are widely used in tracking faces and often assume that the initial location of a face is known. For example, assuming that several facial features are located in the first frame of a video sequence, a 3D deformable face model was used to track human faces [61]. Motion and color are very useful cues for reducing the search space in face detection algorithms. Motion information is usually combined with other information (e.g., face models and skin color) for face detection and tracking [89]. A method of combining a Hidden Markov Model (HMM) and motion for tracking was presented in [86]. A combination of motion and color filters and a neural network model was proposed in [19].

Figure 2.1. Typical outputs of several face detection algorithms (Féraud et al. [19], Maio et al. [20], Garcia et al. [21], Schneiderman et al. [22], Rowley et al. [23], [24], Sung et al. [25], Yow et al. [26], and Lew et al. [27]).

Categorizing face detection methods based on their representations of faces reveals that detection algorithms using holistic representations have the advantage of finding small faces or faces in poor-quality images, while those using geometrical facial features provide a good solution for detecting faces in different poses. A combination of holistic and feature-based methods [59], [60] is a promising approach to face detection as well as face recognition. Motion [86], [19] and skin-tone color [19], [84], [90], [85], [21] are useful cues for face detection. However, the color-based approaches face difficulties in robustly detecting skin colors in the presence of complex backgrounds and variations in lighting conditions.
Two color spaces (YCbCr and HSV) have been proposed for detecting skin color patches while compensating for lighting variations [21]. We propose a face detection algorithm that is able to handle a wide range of color variations in static images, based on a lighting compensation technique in the RGB color space and a nonlinear color transformation in the YCbCr color space. Our approach models skin color using a parametric ellipse in a two-dimensional transformed color space and extracts facial features by constructing feature maps for the eyes, mouth, and face boundary from color components in the YCbCr space.

2.2 Face Recognition

The human face has been considered the most informative organ for communication in our social lives [49]. Automatically recognizing faces by machines can facilitate a wide variety of forensic and security applications. The representation of human faces for recognition can vary from a 2D image to a 3D surface. Different representations result in different recognition approaches. Extensive reviews of approaches to face recognition were published in 1995 [37], 1999 [31], and 2000 [38]. A workshop on face processing in 1985 [91] presented studies of face recognition mainly from the viewpoint of cognitive psychology. Studies of feature-based face recognition, computer caricatures, and the use of face surfaces in simulation and animation were summarized in 1992 [49]. In 1997, Uwechue et al. [92] gave details of face recognition based on high-order neural networks using 2D face patterns. In 1998, lectures on face recognition using 2D face patterns were presented from theory to applications [36]. In 1999, Hallinan et al. [93] described face recognition using both statistical models for 2D face patterns and 3D face surfaces. In 2000, Gong et al. [94] emphasized the statistical learning methods in holistic recognition approaches and discussed face recognition from the viewpoint of dynamic vision. The above studies show that face recognition techniques, especially holistic methods based on statistical pattern theory, have greatly advanced over the past ten years. Face recognition systems (e.g., FaceIt [1] and FaceSnap [2]) are being used in video surveillance and security monitoring applications. However, more reliable and robust techniques for face recognition as well as detection are required for several applications. Except for the recognition applications based on static frontal images that are taken under well-controlled environments (e.g., indexing and searching a large image database of drivers for issuing driving licenses), the main challenge in face recognition is to be able to deal with the high degree of variability in human face images. The sources of variations include inter-subject variations (distinctiveness of individual appearance) and intra-subject variations (in 3D pose, facial expression, facial hair, lighting, and aging). Some variations are not removable, while others can be compensated for during recognition. Persons who have similar face appearances, e.g., twins, and an individual who could have different appearances due to cosmetics or other changes in facial hair and glasses are very difficult to recognize. Variations due to different poses, illuminations, and facial expressions are relatively easy to handle. Currently available algorithms for face recognition concentrate on recognizing faces under those variations which can somehow be compensated for.
Because facial variations due to pose cause a large amount of appearance change, more and more systems are taking advantage of 3D face geometry for recognition. The performance of a recognition algorithm depends on the face databases it is evaluated on. Several face databases, such as the MIT [95], Yale [96], Purdue [97], and Olivetti [98] databases, are publicly available to researchers. Figure 2.2 shows some examples of face images from the FERET [28], MIT [29], and XM2VTS [30] databases. According to Phillips [68], [28], the FERET evaluation of face recognition algorithms identifies three state-of-the-art techniques: (i) the principal component analysis (PCA)-based approach [99], [100], [29]; (ii) the elastic bunch graph matching (EBGM)-based paradigm [32]; and (iii) the local feature analysis (LFA)-based approach [34], [101]. The internal representations of the PCA-based, EBGM-based, and LFA-based recognition approaches are shown in Figs. 2.3, 2.4, and 2.5, respectively. To represent and match faces, the PCA-based approach makes use of a set of orthonormal basis images; the EBGM-based approach constructs a face bunch graph, whose nodes are associated with a set of wavelet coefficients (called jets); the LFA-based approach uses localized kernels, which are constructed from PCA-based eigenvectors, for topographic facial features (e.g., eyebrows, cheek, mouth, etc.).

Figure 2.2. Examples of face images selected from (a) the FERET database [28]; (b) the MIT database [29]; (c) the XM2VTS database [30].

Figure 2.3. Internal representations of the PCA-based approach and the LDA-based approach (from Weng and Swets [31]). The average (mean) images are shown in the first column. Most Expressive Features (MEF) and Most Discriminating Features (MDF) are shown in (a) and (b), respectively.

The PCA-based algorithm provides a compact but non-local representation of face images. Based on the appearance of an image at a specific view, the PCA algorithm works at the pixel level. Hence, the algorithm can be regarded as "picture" recognition; in other words, it is not explicitly using any facial features. The EBGM-based algorithm constructs local features (extracted using Gabor wavelets) and a global face shape (represented as a graph), and so this approach is much closer to "face" recognition. However, the EBGM algorithm is pose-dependent, and it requires initial graphs for different poses during its training stage. The LFA-based algorithm is derived from the PCA-based method; it is also called a kernel PCA method. In this approach, however, the choice of kernel functions for local facial features (e.g., eyes, mouth, and nose) and the selection of the locations of these features still remain an open question.

Figure 2.4. Internal representations of the EBGM-based approach (from Wiskott et al. [32]): (a) a graph is overlaid on a face image; (b) a reconstruction of the image from the graph; (c) a reconstruction of the image from a face bunch graph using the best fitting jet at each node. Images are downloaded from [33]; (d) a bunch graph whose nodes are associated with a bunch of jets [33]; (e) an alternative interpretation of the concept of a bunch graph [33].

In addition to these three approaches, we categorize face recognition algorithms on the basis of pose-dependency and matching features (see Fig. 2.6). In pose-dependent algorithms, a face is represented by a set of viewer-centered images.
A small number of 2D images (appearances) of a human face at different poses are stored as a representative set of the face, while the 3D face shape is implicitly represented in the set.

Figure 2.5. Internal representations of the LFA-based approach (from Penev and Atick [34]): (a) an average face image is marked with five localized features; (b) five topographic kernels associated with the five localized features are shown in the top row, and the corresponding residual correlations are shown in the bottom row.

The representative set can be obtained either from digital cameras or extracted from videos. On the other hand, in pose-invariant approaches, a face is represented by a 3D face model. The 3D face shape of an individual is explicitly represented, while the 2D images are implicitly encoded in this face model. The 3D face models can be constructed by using either 3D digitizers or range sensors, or by modifying a generic face model using a video sequence or still face images of frontal and profile views. The pose-dependent algorithms can be further divided into three classes: appearance-based (holistic) [29], [78], feature-based (analytic) [102], [103], and hybrid (combining holistic and analytic methods) [60], [99], [32], [34] approaches. The appearance-based methods are sensitive to intra-subject variations, especially to changes in hairstyle, because they are based on global information in an image. However, the feature-based methods suffer from the difficulty of detecting local fiducial "points". The hybrid approaches were proposed to accommodate both global and local face shape information. For example, LFA-based methods, eigen-template methods, and shape-and-shape-free [104] methods belong to the hybrid approach which is derived from the PCA methodology. The EBGM-based methods belong to the hybrid approach that is based on 2D face graphs and wavelet transforms at each feature node of the graphs. Although they are in the hybrid approach category, the eigen-template matching and EBGM-based methods are much closer to feature-based approaches.

Figure 2.6. A breakdown of face recognition algorithms based on pose-dependency, face representation, and features used in matching.

In the pose-invariant algorithms, 3D face models are utilized to reduce the variations in pose and illumination. Gordon et al. [105] proposed an identification system based on 3D face recognition. The 3D model used by Gordon et al. is represented by a number of 3D points associated with their corresponding texture features. This method requires an accurate estimate of the face pose. Lengagne et al. [106] proposed a 3D face reconstruction scheme using a pair of stereo images for recognition and modeling. However, they did not implement the recognition module. Atick et al. [107] proposed a reconstruction method for 3D face surfaces based on the Karhunen-Loeve (KL) transform and the shape-from-shading approach. They discussed the possibility of using eigenhead surfaces in face recognition applications.
Yan et al. [108] proposed a 3D reconstruction method to improve the performance of face recognition by making Atick et al.'s reconstruction method rotation-invariant. Zhao et al. [109] proposed a method to adapt a 3D model from a generic range map to the shape obtained from shading, in order to enhance face recognition performance under different lighting and viewing conditions. Based on our brief review, we believe that the current trend is to use 3D face shape explicitly for recognition. In order to efficiently store an individual's face, one approach is to adapt a 3D face model [72] to the individual. There is still considerable debate on whether the internal recognition mechanism of the human brain involves explicit 3D models or not [49], [110]. However, there is enough evidence to support the fact that humans use information about the 3D structure of objects (e.g., the 3D geometry of a face) for recognition. Closing our eyes and imagining a face (or a chair) can easily verify this hypothesis, since the structure of a face (or a chair) can appear in our mind without the use of our eyes. Moreover, the use of a 3D face model can separate geometrical and texture features for facial analysis, and can also blend both of them for recognition as well as visualization [67]. Our proposed systems belong to this emerging trend.

2.3 Face Modeling

Face modeling plays a crucial role in applications such as human head tracking, facial animation, video compression/coding, facial expression recognition, and face recognition. Researchers in computer graphics have been interested in modeling human faces for facial animation. Applications such as virtual reality and augmented reality [74] require modeling faces for human simulation and communication. In applications based on face recognition, modeling human faces can provide an explicit representation of a face that aligns facial shape and texture features together for face matching at different poses and under different illumination conditions.

2.3.1 Generic Face Models

We first review three major approaches to modeling human faces and then point out an advanced modeling approach that makes use of a priori knowledge of facial geometry. DeCarlo et al. [111] use anthropometric measurements to generate a general face model (see Fig. 2.7). This approach starts with manually-constructed B-spline surfaces and then applies surface fitting and constraint optimization to these surfaces. It is computationally intensive due to its optimization mechanism.

Figure 2.7. Face modeling using anthropometric measurements (downloaded from [35]): (a) anthropometric measurements; (b) a B-spline face model.

In the second approach, facial measurements are directly acquired from 3D digitizers or structured light range sensors. 3D models are obtained after a postprocessing step, triangulation, on these shape measurements. The third approach, in which models are reconstructed from photographs, only requires low-cost and passive input devices (video cameras). Some computer vision techniques for reconstructing 3D data can be used for face modeling. For instance, Lengagne et al. [106] and Chen et al. [112] built face models from a pair of stereo images. Atick et al. [107] and Yan et al. [108] reconstructed 3D face surfaces based on the Karhunen-Loeve (KL) transform and the shape-from-shading technique. Zhao et al. [109] made use of a symmetric shape-from-shading technique to build a 3D face model for recognition. There are other methods which combine both shape-from-stereo (which extracts the low-spatial-frequency components of 3D shape) and shape-from-shading (extracting the high-spatial-frequency components) to reconstruct 3D faces [113], [114], [115]. See [116] for additional methods to obtain facial surface data. However, it is currently still difficult to extract sufficient information about the facial geometry from 2D images alone. This difficulty is the reason why Guenter et al. [117] utilize a large number of fiducial points to capture 3D face geometry for photorealistic animation. Even though we can obtain dense 3D facial measurements from high-cost 3D digitizers, it takes too much time and it is expensive to scan a large number of human subjects.
[109] made use of a symmet— ric S118.1)e-from-shading technique to build a 3D face model for recognition. There are other methods which combine both shape-from-stereo (which extracts low—spatial freclllency components of 3D shape) and shape-from—shading (extracting high—spatial frequency components) to reconstruct 3D faces [113], [114], [115]. See [116] for addi- tional methods to obtain facial surface data. However, currently it is still difficult to extract. sufficient information about the facial geometry only from 2D images. This diffiCulty is the reason why Guenter et al. [117] utilize a large number of fiducial points to capture 3D face geometry for photorealistic animation. Even though we Can C>btain dense 3D facial measurements from high-cost 3D digitizers, it takes too much time and it is expensive to scan a large number of human subjects. 48 a ‘3 fl lll An advanced modeling approach which incorporates a priori knowledge of fa— cial geometry has been proposed for efficiently building face models. We call the model representing the general facial geometry as a generic face model. Waters” face model [69], shown in Fig. 2.8(a), is a well-known instance of polygonal facial surfaces. Figure 2.8(b) shows some other generic face models. The one used by Blanz and Vet- iiflllllflhs mg "as“. r WW. w Maphnblo face Mode : E ‘Ilhlililllllill . c. , ~\ . 2,, WV will; U is ;;—:9Asav!figf3- :7 Amberw 999 (b) Figure 2.8. Generic face models: (a) Water’s animation model; (b) anthropometric meaSlll‘ements; (b) six kinds of face models for representing general facial geometry. ter is a. statistics—based face model which is represented by the principal components 0f Shape and texture data. Reinders et al. [72] used a fairly coarse wire-frame model, compared to Waters’ model, to do model adaptation for image coding. Yin et al. [118] proposed a MPEG4 face modeling method that uses fiducial points extracted from tWo face images at frontal and profile views. Their feature extraction is simply based on the results of intensity thresholding and edge detection. Similarly, Lee et a1. [1 19] have proposed a method that modifies a generic model using either two or- t'hogonal pictures (frontal and profile views) or range data, for animation. Similarly, 49 for facial animation, Lengagne et al. [120] and Fua [121] use bundle-adjustment and least—squares fitting to fit a complex animation model to uncalibrated videos. This algorithm makes use of stereo data, silhouette edges, and 2D feature points. Five manually-selected features points and initial values of camera positions are essential for the convergence of this method. Ahlberg [122] adapts a 3D wireframe model (CANDIDE—3 [123]) to a 2D video image. The two modeling methods proposed in this thesis follow the modeling approach using a generic face model; both of our meth- ods make use of a generic face model (Waters’ face model) as a priori knowledge of facial geometry and employ (i) displacement propagation and 2.5D snakes in the first method and (ii) interacting snakes and semantic face graphs in the second method for adapting recognition-orientated features to an individual’s geometry. 2.3.2 Snakes for Face Alignment AS 8. Computational bridge between the high-level a priori knowledge of object shape and the low-level image data, snakes (or active contours) are useful models for extract- ing the Shape of deformable objects. 
Similar to other template-based approaches such as Hough transform and active shape models, active contours have been employed to detect object boundary, track objects, reconstruct 3D objects (stereo snakes and interrframe snakes), and match/ identify shape. Snakes self converge in an iterative Way, and deform either with or without topological constraints. I{esearch on active contours focuses on issues related to representation (e.g., para- metric curves, splines, Fourier series, and implicit level—set functions), energy func- 50 tionals to minimize, implementation methods (e.g., classical finite difference models, dynamic programming [124], and Fourier spectral methods), convergence rates and conditions, and their relationship to statistical theory [125] (e.g., the Bayesian esti- mation). Classical snakes [77], [126] are represented by parametric curves and are de— formed by finite difference methods based on edge energies. In applications, different types of edge energies including image gradients, gradient vector flows [127], distance maps, and balloon force have been proposed. On the other hand, combined with level-set methods and the curve evolution theory, active contours have emerged as a powerful tool, called geodesic active contours (GAC) [128], to extract deformable ob- jects with unknown geometric topology. However, in the GAC approach, the contours are implicitly represented as level-set functions and are closed curves. In addition to the edge energy, region energy has been introduced to improve the segmentation re- sults for homogeneous objects in both the parametric and the GAC approaches (e. g., region and edge [129], GAC without edge [130], statistical region snake [131], region competition [132], and active region model [133]). Recently, multiple active contours [134], [135] were proposed to extract / partition multiple homogeneous regions that do not overlap with each other in an image. In our first alignment method, we have reformulated 2D active contours (a dy- namic programming approach) in 3D coordinates for energies derived from 2.5D range and 2D color data. In our second alignment method, we make use of multiple 2D snakes (a finite difference approach) that interact with each other in order to adapt facial components. 51 2.3.3 3D Model Compression Among various representations of 3-D objects, surface models can explicitly represent shape information and can effectively provide a visualization of these objects. The polygonal model using triangular meshes is the most prevalent type of surface rep- resentations for free-form objects such as human faces. The reason is that the mesh model explicitly describes the connectivity of surfaces, enables mesh simplification, and is suitable for free-form objects [136]. The polygonization of an object surface ap- proximates the surface by a large number of triangles (facets), each of which contains primary information about vertex positions as well as vertex associations (indices), and auxiliary information regarding facet properties such as color, texture, specu- larity, reflectivity, orientation, and transparency. Since we use a triangular mesh to represent a generic face model and an adapted model, model compression is preferred when efficient transmission, visualization, and storage is required. In 1995, the concept of geometric compression was first introduced by Deer- ing [137], who proposed a technique for lossy compression of 3—D geometric data. 
Deering's technique focuses mainly on the compression of vertex positions and facet properties of 3D triangle data. Taubin [75] proposed topological surgery, which further contributed connectivity encoding (compression of association information) to geometric compression. Lounsbery et al. [76] performed geometric compression through multiresolution analysis for particular meshes with subdivision connectivity. Applying remeshing algorithms to arbitrary meshes, Eck et al. [138] extended Lounsbery's work on mesh simplification. Typical compression ratios in this line of development are listed in Table 2.2.

Table 2.2 GEOMETRIC COMPRESSION EFFICIENCY.

Compression Method | Geometric Compression Ratio (GCR) | Loss Measure | Compressed Feature
Geometric [137] | 6-10 | slight losses | positions, normals, colors
Topological Surgery [75] | 20-100 / 12-30 / 20-100 | no loss / N/A / N/A | connectivity / positions, facet properties / ASCII-file sizes
Remeshing [138] | 5.4-12 | remeshing & compression tolerances | level of detail (facets)

All of these compression methods focus on model representation using triangular meshes. However, for more complex 3D shapes, the surface representation using triangular meshes usually results in a large number of triangular facets, because each triangular facet is explicitly described. We have developed a novel compression approach for free-form surfaces using 3D wavelets and lattice vector quantization [139]. In our approach, surfaces are implicitly represented inside a volume in the same way as edges in a 2D image. A further improvement in our approach can be achieved by making use of integer wavelet transformations [140], [141].

2.4 Face Retrieval

Face recognition technology provides a useful tool for content-based image and video retrieval using the concept of human faces. Based on face detection and identification technology, we can design a system for consumer photo management (or for web graphic search) that uses human faces for indexing and retrieving image content and that generates annotations (textual descriptions) for the image content automatically. Traditional text-based retrieval systems for digital libraries cannot fulfill the retrieval of visual content such as human faces, eye shapes, and cars in image or video databases. Hence, many researchers have been developing multimedia retrieval techniques based on automatically extracting salient features from the visual content (see [40] for an extensive review). Well-known systems for content-based image and video retrieval are QBIC [142], Photobook [143], CONIVAS [144], FourEyes [145], Virage [146], ViBE [147], VideoQ [148], Visualseek [149], Netra [150], MARS [151], PicSOM [152], ImageScape [153], etc. In these systems, retrieval is performed by comparing a set of low-level features of a query image or video clip with the features stored in the database and then presenting the user with the content that has the most similar features. However, users normally query an image or video database based on semantics rather than low-level features. For example, a typical query might be specified as "retrieve images of fireworks" rather than "retrieve images that have large dark regions and colorful curves over the dark regions". Since the commonly used features are usually a set of unorganized low-level attributes (such as color, texture, geometrical shape, layout, and motion), grouping low-level features can provide meaningful high-level semantics for human consumers.
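The feature-comparison loop shared by the systems listed above can be sketched in a few lines. The joint color histogram and histogram-intersection similarity below are only illustrative stand-ins for a low-level feature and a distance measure; none of the cited systems is limited to this particular choice.

```python
import numpy as np

def color_histogram(image_rgb, bins=8):
    """A simple low-level feature: a normalized joint RGB color histogram."""
    pixels = image_rgb.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def retrieve(query_hist, database_hists, top_k=5):
    """Rank database items by histogram intersection with the query feature."""
    scores = np.array([np.minimum(query_hist, h).sum() for h in database_hists])
    return np.argsort(-scores)[:top_k]   # indices of the most similar items
```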
There has been some work done on automatically classifying images into semantic categories [154], such as indoor/outdoor and city/landscape images. As for the semantic concept of faces, the generic facial topology (e.g., our proposed generic semantic face graph) is a useful structure for representing the face in a search engine. We have designed a graphical user interface for face editing using our face detection algorithm. Combined with our semantic face matching algorithm, we can build a face retrieval system.

2.5 Summary

We have briefly described the development of face detection, face recognition, face modeling, and model compression in this chapter. We have summarized the performance of currently available face detection systems in Table 2.3. Note that the performance of a detection system depends on several factors such as the face databases on which the system is evaluated, the system architecture, the distance metric, and the algorithmic parameters. The performance is evaluated based on the detection rate, the false positive rate (false acceptance rate), and the databases. In Table 2.3, we do not include the false acceptance rate because the false positive rate has not been completely reported in the literature. We refer the reader to the FERET evaluation [68], [28] for the performance of various face recognition systems. Face detection and face recognition are closely related to each other in the sense of categorizing faces.

Table 2.3 SUMMARY OF PERFORMANCE OF VARIOUS FACE DETECTION APPROACHES.

Authors | Year | Head Pose | Test Databases | Detection Rate
Féraud et al. [19] | 2001 | Frontal to profile | Sussex; CMU test1; Web images | 100% for Sussex; 81% ~ 86% for CMU test1; 74.7% ~ 80.1% for Web images
Maio et al. [20] | 2000 | Frontal | Static images | 89.53% ~ 91.34%
Schneiderman et al. [22] | 2000 | Frontal to profile | CMU; Web images | 75.24% ~ 92.7%
Garcia et al. [21] | 1999 | Frontal to near frontal | MPEG videos | 93.27%
Rowley et al. [24], [23] | 1998 | (Upright) frontal | CMU; FERET; Web images | 86% [24]; 79.6% [23] for rotated faces
Yow et al. [26] | 1997 | Frontal to profile | CMU | 84% ~ 92%
Lew et al. [27] | 1996 | Frontal | MIT; CMU; Leiden | 87% ~ 95%

Over the past ten years, based on statistical pattern theory, the appearance-based (holistic) approach has greatly advanced the field of face recognition. By categorizing face detection methods based on their representations of the face, we observe that detection/recognition algorithms using holistic representations have the advantage of finding/identifying small faces or faces in poor-quality images (i.e., detection/recognition under uncertainty), while those using geometrical facial features provide a good solution for detecting/recognizing faces in different poses and expressions. The internal representation of a human face substantially affects the performance and design of a detection or recognition system. A seamless combination of holistic 2D and geometrical 3D features provides a promising approach to representing faces for face detection as well as face recognition. Modeling the human face in 3D space has been shown to be useful for face recognition. However, the important aspect of face modeling is how to efficiently encode the 3D facial geometry and texture as compact features for face recognition.

Chapter 3

Face Detection

We will first describe an overview of our proposed face detection algorithm and then give details of the algorithm. We will demonstrate the performance and experimental results on several image databases.
3.1 Face Detection Algorithm

The use of color information can simplify the task of face localization in complex environments [19], [84], [90], [85]. Therefore, we use skin color detection as the first step in detecting faces. An overview of our face detection algorithm is depicted in Fig. 3.1, which contains two major modules: (i) face localization for finding face candidates; and (ii) facial feature detection for verifying detected face candidates. The face localization module combines the information extracted from the luminance and the chrominance components of color images with some heuristics about face shape (e.g., face sizes ranging from 13 x 13 pixels to about three fourths of the image size) to generate potential face candidates within the entire image. The algorithm first estimates and corrects the color bias based on a novel lighting compensation technique. The corrected red, green, and blue color components are first converted to the YCbCr color space and then nonlinearly transformed in this color space (see the formulae in Appendix A). The skin-tone pixels are detected using an elliptical skin model in the transformed space. The parametric ellipse corresponds to contours of constant Mahalanobis distance under the assumption of a Gaussian distribution of skin tone color. The detected skin-tone pixels are iteratively segmented using local color variance into connected components, which are then grouped into face candidates based on both the spatial arrangement of these components (described in Appendix B) and the similarity of their color [84]. Figure 3.1 shows the input color image, the color compensated image, skin regions, grouped skin regions, and face candidates obtained from the face localization module. Each grouped skin region is assigned a pseudo color and each face candidate is represented by a rectangle. Because multiple face candidates (bounding rectangles) usually overlap, they can be fused based on the percentage of overlapping areas. However, in spite of this postprocessing, there are still some false positives among the face candidates. It is inevitable that detected skin-tone regions will include some non-face regions whose color is similar to the skin tone. The facial feature detection module rejects face candidate regions that do not contain any facial features such as eyes, mouth, and face boundary. This module can detect multiple eye and mouth candidates. A triangle is constructed from two eye candidates and one mouth candidate, and the best-fitting enclosing ellipse of the triangle is constructed to approximate the face boundary. A face score is computed for each set of eyes, mouth, and the ellipse. Figure 3.1 shows a detected face and the enclosing ellipse with its associated eye-mouth triangle, which has the highest score that exceeds a threshold. These detected facial features are grouped into a structured facial descriptor in the form of a 2D graph for face description. These descriptors can be the input to subsequent modules such as face modeling and recognition.

Figure 3.1. Face detection algorithm. The face localization module finds face candidates, which are verified by the facial feature detection module based on facial features.
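The fusion of overlapping face candidates mentioned in the overview can be illustrated with a short sketch. The thesis only specifies that rectangles are fused based on their percentage of overlap; the 0.5 threshold, the normalization by the smaller rectangle, and the greedy bounding-box merge below are illustrative choices rather than the exact rule used in our implementation.

```python
def overlap_ratio(a, b):
    """Fraction of the smaller rectangle covered by the intersection.

    Rectangles are (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (ix * iy) / float(min(aw * ah, bw * bh))

def fuse_candidates(rects, threshold=0.5):
    """Greedily merge face-candidate rectangles whose overlap exceeds a threshold."""
    fused = []
    for r in rects:
        for i, f in enumerate(fused):
            if overlap_ratio(r, f) > threshold:
                x = min(r[0], f[0])
                y = min(r[1], f[1])
                x2 = max(r[0] + r[2], f[0] + f[2])
                y2 = max(r[1] + r[3], f[1] + f[3])
                fused[i] = (x, y, x2 - x, y2 - y)   # replace the pair by their bounding box
                break
        else:
            fused.append(tuple(r))
    return fused
```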
We now describe in detail the individual components of the face detection algorithm.

3.2 Lighting Compensation and Skin Tone Detection

The appearance of the skin-tone color can change due to different lighting conditions. We introduce a lighting compensation technique that uses a "reference white" to normalize the color appearance. We regard pixels with the top 5% of the luma (nonlinear gamma-corrected luminance) values as the reference white if the number of these pixels is sufficiently large (> 100). The red, green, and blue components of a color image are adjusted so that these reference-white pixels are scaled to the gray level of 255. The color components are unaltered if a sufficient number of reference-white pixels is not detected. This assumption is reasonable not only because an image usually contains "real white" (i.e., the white reference in [155]) pixels in some regions of interest (such as eye regions), but also because the dominant bias color always appears in the "real white". Figure 3.2 demonstrates an example of our lighting compensation method. Note that the yellow bias color in Fig. 3.2(a) has been removed, as shown in Fig. 3.2(b). The effect of lighting compensation on the detected skin regions can be seen by comparing Figs. 3.2(c) and 3.2(d). With lighting compensation, our algorithm detects fewer non-face areas and more skin-tone facial areas. Note that the variations in skin color among different racial groups, the reflection characteristics of human skin and its surrounding objects (including clothing), and the camera characteristics will all affect the appearance of skin color and hence the performance of an automatic face detection algorithm. Therefore, if models of the lighting source and cameras are available, additional lighting correction should be made to remove the color bias.

Figure 3.2. Skin detection: (a) a yellow-biased face image; (b) the lighting compensated image; (c) skin regions of (a) shown in white; (d) skin regions of (b).
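As a concrete illustration of the reference-white normalization described above, the following is a minimal NumPy sketch. The ITU-R BT.601 luma weights and the averaging-based per-channel gain are implementation assumptions; the thesis specifies only that the top 5% luma pixels serve as reference white, that they are scaled to gray level 255, and that the image is left unaltered when fewer than about 100 such pixels exist.

```python
import numpy as np

def lighting_compensation(rgb, top_fraction=0.05, min_pixels=100):
    """Scale R, G, B so that the reference-white pixels (top 5% luma) reach 255.

    rgb: (H, W, 3) uint8 image. Returns a corrected float image in [0, 255]."""
    img = rgb.astype(np.float64)
    # Nonlinear (gamma-corrected) luma approximated with BT.601 weights.
    luma = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    cutoff = np.percentile(luma, 100.0 * (1.0 - top_fraction))
    reference = luma >= cutoff
    if reference.sum() <= min_pixels:
        return img                       # too few reference-white pixels: leave unaltered
    gains = 255.0 / img[reference].mean(axis=0)   # one gain per color channel
    return np.clip(img * gains, 0, 255)
```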
Modeling skin color requires choosing an appropriate color space and identifying a cluster associated with skin color in this space. It has been observed that the normalized red-green (r-g) space [156] is not the best choice for face detection [157], [158]. Based on Terrillon et al.'s [157] comparison of nine different color spaces for face detection, the tint-saturation-luma (TSL) space provides the best results for two kinds of Gaussian density models (a unimodal Gaussian and a mixture of Gaussians). We adopt the YCbCr space since it is perceptually uniform [155], is widely used in video compression standards (e.g., MPEG and JPEG) [21], and is similar to the TSL space in terms of the separation of luminance and chrominance as well as the compactness of the skin cluster. Many research studies assume that the chrominance components of the skin-tone color are independent of the luminance component [159], [160], [158], [90]. However, in practice, the skin-tone color is nonlinearly dependent on luminance. In order to demonstrate the luma dependency of skin-tone color, we manually collected training samples of skin patches (853,571 pixels) from 9 subjects (137 images) in the Heinrich-Hertz-Institute (HHI) image database [15]. These pixels form an elongated cluster that shrinks at high and low luma in the YCbCr space, shown in Fig. 3.3(a). Detecting skin tone based on the cluster of training samples in the Cb-Cr subspace, shown in Fig. 3.3(b), results in many false positives. If we base the detection on the cluster in the (Cb/Y)-(Cr/Y) subspace, shown in Fig. 3.3(c), then many false negatives result. The dependency of skin-tone color on luma is also present in the normalized rgY space in Fig. 3.4(a), the perceptually uniform CIE xyY space in Fig. 3.4(c), and the HSV space in Fig. 3.4(e). The 3D cluster shape changes at different luma values, although it looks compact in the 2D projection subspaces shown in Figs. 3.4(b), 3.4(d), and 3.4(f).

To deal with the skin-tone color dependence on luminance, we nonlinearly transform the YCbCr color space to make the skin cluster luma-independent. This is done by fitting a piecewise linear boundary to the skin cluster (see Fig. 3.5). The details of the model and the transformation are described in Appendix A. The transformed space, shown in Fig. 3.6(a), enables a robust detection of dark and light skin tone colors. Figure 3.6(b) shows the projection of the 3D skin cluster in the transformed Cb-Cr color subspace, on which the elliptical model of skin color is overlaid. Figure 3.7 shows examples of detection using the nonlinear transformation. More skin-tone pixels with low and high luma are detected in this transformed subspace than in the Cb-Cr subspace.

Figure 3.3. The YCbCr color space (blue dots represent the reproducible colors on a monitor) and the skin tone model (red dots represent skin color samples): (a) the YCbCr space; (b) a 2D projection in the Cb-Cr subspace; (c) a 2D projection in the (Cb/Y)-(Cr/Y) subspace.

Figure 3.4. The dependency of skin tone color on luma. The skin tone cluster (red dots) is shown in (a) the rgY, (c) the CIE xyY, and (e) the HSV color spaces; the 2D projection of the cluster is shown in (b) the r-g, (d) the x-y, and (f) the S-H color subspaces, where blue dots represent the reproducible colors on a monitor. For a better presentation of the cluster shape, we normalize the luma Y in the rgY and the CIE xyY spaces by 255, and swap the hue and saturation coordinates in the HSV space. The skin tone cluster is less compact at low saturation values in (e) and (f).

Figure 3.5. 2D projections of the 3D skin tone cluster in (a) the Y-Cb subspace; (b) the Y-Cr subspace. Red dots indicate the skin cluster. Three blue dashed curves, one for the cluster center and two for the boundaries, indicate the fitted models.

Figure 3.6. The nonlinear transformation of the YCbCr color space: (a) the transformed YCbCr color space; (b) a 2D projection of (a) in the Cb-Cr subspace, in which the elliptical skin model is overlaid on the skin cluster.

Figure 3.7. Examples of skin detection using the nonlinear color transformation.
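The elliptical skin model in the transformed chroma subspace corresponds to a constant-Mahalanobis-distance contour under the Gaussian assumption, which can be prototyped as follows. The sketch assumes the chroma values have already been passed through the nonlinear transformation of Appendix A (not reproduced here), and the particular distance threshold is an illustrative assumption rather than the value used in our experiments.

```python
import numpy as np

def fit_skin_ellipse(chroma_samples):
    """Fit a Gaussian (mean, inverse covariance) to transformed (Cb', Cr') skin samples."""
    mean = chroma_samples.mean(axis=0)
    cov = np.cov(chroma_samples, rowvar=False)
    return mean, np.linalg.inv(cov)

def is_skin(cb_cr, mean, inv_cov, threshold=2.5):
    """Label pixels whose Mahalanobis distance to the skin cluster is small.

    cb_cr: (..., 2) array of transformed chroma values; returns a boolean mask."""
    d = cb_cr - mean
    m2 = np.einsum('...i,ij,...j->...', d, inv_cov, d)   # squared Mahalanobis distance
    return m2 <= threshold ** 2
```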
However, our approach is able to directly locate eyes, mouth, and face boundary based on their feature maps derived from the the luma and the chroma of an image, called the eye map, the mouth map and the face boundary map, respectively. For computing the eye map and the mouth map, we consider only the area covered by a face mask that is built by enclosing the grouped skin—tone regions with a pseudo convex hull, which is constructed by connecting the boundary points of skin-tone regions in horizontal and vertical directions. Figure 3.8 shows an example of the face mask. (C) (d) Figure 3.8. Construction of the face mask. (a) Face candidates; (b) one of the face candidates; (c) grouped skin areas; ((1) the face mask. 69 3.3.1 Eye Map We first build two separate eye maps, one from the chrominance components and the other from the luminance component of the color image. These two maps are then combined into a single eye map. The eye map from the chroma is based on the observation that high Cb and low C, values are found around the eyes. It is constructed from information contained in Cb, the inverse (negative) of Cr, and the ratio Cb/Cr, as described in Eq. (3.1). EyeMapC = %{ (03) + (a)2 + (Cb/c.) }, (3.1) where 03, (CA2, and Cb/C'r all are normalized to the range [0,255] and Cr is the negative of Cr (i.e., 255 — 0,). An example of the eye map from the chroma is shown in Fig. 3.9(a). The eyes usually contain both dark and bright pixels in the luma component. Based on this observation, grayscale morphological operators (e.g., dilation and ero- sion) [169] can be designed to emphasize brighter and darker pixels in the luma component around eye regions. These operations have been used to construct feature vectors for face images at multiple scales for frontal face authentication [66]. We use grayscale dilation and erosion with a hemispheric structuring element at a single estimated scale to construct the eye map from the luma, as described in Eq. (3.2). Yet, y) 69 90a, y) EyeMapL = Y(IL',y) e 90(xay) +1 , (3.2) where the grayscale dilation {9 and erosion 9 operations [169] on a function f : f C R2 ——> R using a structuring function g : g C R2 -—> R are defined as follows. 70 A” .' Hem (C) Figure 3.9. Construction of eye maps: (a) from chroma; (b) from luma; (c) the combined eye map. 71 (f 69 go)($, y) = MaX{f(IB - Ca y — 7‘) + 9(Ca7‘)}; (:r—c,y—r) 6f, (c,r)Eg, (3.3) (f 9 ga)(:v, y) = Min{f(rr - c, y - r) + 9(6, r)}; (x—c,y—r) Ef, (c,r) 69, (3.4) Ial - (ll — (R($,y)/0)2|1/2 — 1); R 5 lol, —00; R > Io], R(:c,y) = \/$2+y2, (3.6) 90(1'3 y) = (3.5) where o is a scale parameter, which will be described later in Eq. (3.11). An example of a hemispheric structuring element is shown in Fig. 3.10. The construction of the Figure 3.10. An example of a hemispheric structuring element for grayscale morpho- logical dilation and erosion with o = 1. eye map from the luma is illustrated in Fig. 3.9(b). Note that before performing the grayscale dilation and erosion operations, we fill the background of the face mask with the mean value of the luma in the face mask (skin regions) in order to smooth the noisy boundary of detected skin areas. 72 The eye map from the chroma is enhanced by histogram equalization, and then combined with the eye map from the luma by an AND (n‘Iultiplication) operation in Eq. (3.7). EyeMap = ( EyelldapC' ) AND ( EyelVIapL ) . (3.7) The resulting eye map is dilated, masked, and normalized to brighten the eyes and suppress other facial areas, as can be seen in Fig. 3.9(c). 
The locations of the eye candidates are initially estimated from a pyramid decomposition of the eye map, and then refined using iterative thresholding and binary morphological closing on this eye map.

3.3.2 Mouth Map

The color of the mouth region contains a stronger red component, relative to the blue component, than other facial regions. Hence, the chrominance component Cr, proportional to (red - Y), is greater than Cb, proportional to (blue - Y), near the mouth areas. We further notice that the mouth has a relatively low response in the Cr/Cb feature, but a high response in C_r^2. We construct the mouth map as follows:

\mathrm{MouthMap} = C_r^2 \cdot \left( C_r^2 - \eta \cdot C_r / C_b \right)^2, \qquad (3.8)

\eta = 0.95 \cdot \frac{ \frac{1}{n} \sum_{(x, y) \in \mathcal{F}_G} C_r(x, y)^2 }{ \frac{1}{n} \sum_{(x, y) \in \mathcal{F}_G} C_r(x, y) / C_b(x, y) }, \qquad (3.9)

where both C_r^2 and C_r/C_b are normalized to the range [0, 255], and n is the number of pixels within the face mask, \mathcal{F}_G. The parameter \eta is estimated as the ratio of the average C_r^2 to the average C_r/C_b. Figure 3.11 shows the major steps in computing the mouth map of the subject in Fig. 3.9. Note that after the mouth map is dilated, masked, and normalized, it is dramatically brighter near the mouth than at other facial areas.

Figure 3.11. Construction of the mouth map (mouth map; dilated and masked; difference).
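A corresponding sketch of the mouth map in Eqs. (3.8)-(3.9) is given below; it repeats the small normalization helper from the eye-map sketch, assumes a boolean face-mask array, and again omits the subsequent dilation, masking, and normalization. The epsilon guarding the divisions is ours.

```python
import numpy as np

def normalize_255(a):
    """Linearly rescale an array to [0, 255] (same helper as in the eye-map sketch)."""
    a = np.asarray(a, dtype=np.float64)
    lo, hi = a.min(), a.max()
    return (a - lo) * 255.0 / (hi - lo) if hi > lo else np.zeros_like(a)

def mouth_map(Cb, Cr, face_mask):
    """Mouth map of Eq. (3.8) with eta estimated as in Eq. (3.9).
    Cb, Cr are chroma planes; face_mask is a boolean array marking the
    pixels inside the face mask F_G."""
    Cb = np.asarray(Cb, dtype=np.float64)
    Cr = np.asarray(Cr, dtype=np.float64)
    cr2 = normalize_255(Cr ** 2)                   # Cr^2 scaled to [0, 255]
    cr_over_cb = normalize_255(Cr / (Cb + 1e-6))   # Cr/Cb scaled to [0, 255]
    # eta = 0.95 * (average Cr^2) / (average Cr/Cb), averaged over the face mask.
    eta = 0.95 * cr2[face_mask].mean() / (cr_over_cb[face_mask].mean() + 1e-6)
    return cr2 * (cr2 - eta * cr_over_cb) ** 2
```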
3.3.3 Eye and Mouth Candidates

We form an eye-mouth triangle for all possible combinations of two eye candidates and one mouth candidate within a face candidate. We then verify each eye-mouth triangle by checking (i) the luma variations and average gradient orientations of the eye and mouth blobs; (ii) the geometry and orientation constraints on the triangle; and (iii) the presence of a face boundary around the triangle. A weight is computed for each verified eye-mouth triangle, and the triangle with the highest weight that exceeds a threshold is selected. We discuss the detection of the face boundary in Section 3.3.4, and the selection of the weight and the threshold in Section 3.3.5.

Note that the eye and mouth maps are computed within the entire area of the face candidate, which is bounded by a rectangle, whereas the search for the eyes and the mouth is performed within the face mask. The eye and mouth candidates are located by using (i) a pyramid decomposition of the eye and mouth maps and (ii) iterative thresholding and binary morphological closing on the enhanced eye and mouth maps. The number of pyramid levels, L, is computed from the size of the face candidate, as defined in Eqs. (3.10) and (3.11):

L = \max\left\{ \lceil \log_2(2\sigma) \rceil,\ \lceil \log_2(\min(W, H)/F_c) \rceil \right\}, \qquad (3.10)

\sigma = \lceil \min(W, H)/(2 \cdot F_e) \rceil, \qquad (3.11)

where W and H represent the width and height of the face candidate; F_c \times F_c is the minimum expected size of a face candidate; \sigma is a spread factor selected to prevent the algorithm from removing small eyes and mouths in the morphological operations; and F_e is the maximal ratio of an average face size to the average eye size. In our implementation, F_c is 7 pixels and F_e is 12.

The coarse locations of the eye and mouth candidates obtained from the pyramid decomposition are refined by checking for the existence of eye/mouth blobs, which are obtained after iteratively thresholding and (morphologically) closing the eye and mouth maps. The iterative thresholding starts with an initial threshold value, reduces the threshold step by step, and stops either when the threshold falls below a stopping value or when the number of feature candidates reaches pre-determined upper bounds, N_{eye} for the eyes and N_{mth} for the mouth. The threshold values are automatically computed as follows:

Th = \frac{\alpha}{n} \sum_{(x, y) \in \mathcal{F}_G} Map(x, y) + (1 - \alpha) \cdot \max_{(x, y) \in \mathcal{F}_G} Map(x, y), \qquad (3.12)

where Map(x, y) is either the eye or the mouth map; the parameter \alpha is equal to 0.5 for the initial threshold value and to 0.8 for the stopping threshold. The use of upper bounds on the number of eye and mouth candidates prevents the algorithm from spending too much time searching for facial features. In our implementation, the maximum number of eye candidates, N_{eye}, is 8 and the maximum number of mouth candidates, N_{mth}, is 5.

3.3.4 Face Boundary Map

Based on the locations of the eye and mouth candidates, our algorithm first verifies whether the average orientation of the luma gradients around each eye matches the interocular direction, and then constructs a face boundary map from the luma. Finally, it utilizes the Hough transform to extract the best-fitting ellipse; the fitted ellipse is used for computing the eye-mouth triangle weight. Figure 3.12 shows the boundary map, which is constructed from both the magnitude and the orientation components of the luma gradient within the regions that have positive (i.e., counterclockwise) gradient orientations. We have modified the Canny edge detection algorithm [170] to compute the gradient of the luma as follows. The gradient of a luma subimage S(x, y), which is slightly larger than the face candidate, is estimated by

\nabla S(x, y) = (G_x, G_y) = \left( D_\sigma(x) \otimes S(x, y),\ D_\sigma(y) \otimes S(x, y) \right), \qquad (3.13)

where D_\sigma is the derivative of a Gaussian with zero mean and variance \sigma^2, and \otimes is the convolution operator. Unlike the Canny edge detector, our edge detection requires only a single standard deviation \sigma (a spread factor) for the Gaussian, which is estimated from the size of the eye-mouth triangle:

\sigma^2 = \frac{-w_s^2}{8 \ln(w_h)}, \qquad w_s = \max(dist_{eye}, dist_{em}), \qquad (3.14)

where w_s is the window size for the Gaussian, which is the maximum of the interocular distance (dist_{eye}) and the distance between the interocular midpoint and the mouth (dist_{em}); w_h = 0.1 is the desired value of the Gaussian distribution at the border of the window.

Figure 3.12. Computation of the face boundary and the eye-mouth triangle (orientation; magnitude; face boundary map).

In Fig. 3.12, the magnitudes and orientations of all gradients have been squared and scaled between 0 and 255. The figure shows that the gradient orientation provides more information for detecting the face boundary than the gradient magnitude. Therefore, an edge detection algorithm is applied to the gradient orientation, and the resulting edge map is thresholded to obtain a mask for computing the face boundary. The gradient magnitude and the magnitude of the gradient orientation are masked, added, and scaled into the interval [0, 1] to construct the face boundary map. The center of the face, indicated by a white rectangle in the face boundary map in Fig. 3.12, is estimated from the first-order moment of the face boundary map.

The Hough transform is used to fit an elliptical shape to the face boundary map. An ellipse in the plane has five parameters: an orientation angle, the two coordinates of the center, and the lengths of the major and minor axes. Since we know the locations of the eyes and the mouth, the orientation of the ellipse can be estimated from the direction of the vector that starts at the midpoint between the eyes and points towards the mouth. The location of the ellipse center is estimated from the face boundary map. Hence, we need only a two-dimensional accumulator, over the two axis lengths, for estimating the ellipse that bounds the face. The accumulator is updated by perturbing the estimated center by a few pixels for a more accurate localization of the ellipse.
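One way to realize the single-scale derivative-of-Gaussian gradient of Eqs. (3.13)-(3.14) is with separable Gaussian filters; the sketch below estimates the standard deviation from the eye and mouth candidate locations and then filters the luma subimage. The interface, the coordinate conventions, and the use of scipy's gaussian_filter are our assumptions, not the dissertation's implementation.

```python
import numpy as np
from scipy import ndimage

def luma_gradient(S, eye_left, eye_right, mouth, w_h=0.1):
    """Derivative-of-Gaussian gradient of a luma subimage S (Eqs. 3.13-3.14).
    eye_left, eye_right, mouth are (x, y) locations of the candidates."""
    eye_left, eye_right, mouth = map(np.asarray, (eye_left, eye_right, mouth))
    mid = (eye_left + eye_right) / 2.0
    dist_eye = np.linalg.norm(eye_right - eye_left)   # interocular distance
    dist_em = np.linalg.norm(mouth - mid)             # midpoint-to-mouth distance
    w_s = max(dist_eye, dist_em)
    sigma = np.sqrt(-w_s ** 2 / (8.0 * np.log(w_h)))  # Eq. (3.14)

    S = np.asarray(S, dtype=np.float64)
    # order=1 along one axis applies the derivative of the Gaussian along that
    # axis, i.e., D_sigma(x) convolved with S and D_sigma(y) convolved with S.
    gx = ndimage.gaussian_filter(S, sigma, order=(0, 1))
    gy = ndimage.gaussian_filter(S, sigma, order=(1, 0))
    return gx, gy, sigma
```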
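Since the orientation and the center of the ellipse are fixed by the eye-mouth triangle and by the boundary-map moment, the Hough accumulator only needs to range over the two axis lengths. The dissertation does not spell out the voting scheme, so the following is just one plausible sketch: every strong boundary pixel, expressed in the ellipse-aligned frame, votes for the (a, b) pairs consistent with it. The axis ranges, the vote threshold, and the binning are ours.

```python
import numpy as np

def ellipse_axes_hough(boundary_map, center, theta, a_range, b_range, thresh=0.3):
    """2D Hough accumulator over the semi-axis lengths (a, b) of an ellipse
    with known center and orientation.

    boundary_map : 2D array scaled to [0, 1]
    center       : (cx, cy) estimated from the first-order moment
    theta        : ellipse orientation (eye-midpoint-to-mouth direction)
    a_range, b_range : sorted 1D arrays of candidate semi-axis lengths
    """
    acc = np.zeros((len(a_range), len(b_range)))
    ys, xs = np.nonzero(boundary_map > thresh)          # strong boundary pixels
    votes = boundary_map[ys, xs]

    # Rotate boundary pixels into the ellipse-aligned frame.
    cx, cy = center
    u = (xs - cx) * np.cos(theta) + (ys - cy) * np.sin(theta)
    v = -(xs - cx) * np.sin(theta) + (ys - cy) * np.cos(theta)

    for i, a in enumerate(a_range):
        inside = np.abs(u) < a
        # From u^2/a^2 + v^2/b^2 = 1, each pixel determines one b for this a.
        b = np.abs(v[inside]) / np.sqrt(1.0 - (u[inside] / a) ** 2)
        j = np.searchsorted(b_range, b)
        valid = (j > 0) & (j < len(b_range))
        np.add.at(acc, (i, j[valid]), votes[inside][valid])

    ai, bj = np.unravel_index(acc.argmax(), acc.shape)
    return a_range[ai], b_range[bj], acc.max()          # best axes and peak count
```

The peak accumulator count returned here is the quantity that enters the triangle weight described in the next subsection.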
3.3.5 Weight Selection for a Face Candidate

For each face in the image, our algorithm can detect several eye-mouth-triangle candidates constructed from the eye and mouth candidates. Each candidate is assigned a weight, which is computed from the eye and mouth maps, the maximum accumulator count in the Hough transform for ellipse fitting, and a face-orientation term that favors vertical faces and symmetric facial geometry, as described in Eqs. (3.15)-(3.19). The eye-mouth triangle with the highest weight (face score) that is above a threshold is retained.

In Eq. (3.15), the triangle weight tw(i, j, k) for the i-th and j-th eye candidates and the k-th mouth candidate is the product of the eye-mouth weight emw(i, j, k), the face-orientation weight ow(i, j, k), and the boundary quality q(i, j, k). The eye-mouth weight is the average of the eye-pair weight ew(i, j) and the mouth weight mw(k), as described in Eq. (3.16):

tw(i, j, k) = emw(i, j, k) \cdot ow(i, j, k) \cdot q(i, j, k), \qquad (3.15)

emw(i, j, k) = \frac{1}{2} \left( ew(i, j) + mw(k) \right), \qquad (3.16)

ew(i, j) = \frac{ \mathrm{EyeMap}(x_i, y_i) + \mathrm{EyeMap}(x_j, y_j) }{ 2 \cdot \mathrm{EyeMap}(x_m, y_m) }, \qquad i > j;\ i, j \in [1, N_{eye}], \qquad (3.17)

mw(k) = \frac{ \mathrm{MouthMap}(x_k, y_k) }{ \mathrm{MouthMap}(\ldots) }, \qquad k \in [1, N_{mth}]