Face Recognition: Role of Aging and Quality Covariates

By

Lacey Best-Rowden

A Dissertation Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2016

Abstract

Face Recognition: Role of Aging and Quality Covariates

By Lacey Best-Rowden

A technology once seen only in television dramas, automatic face recognition systems are now deployed in many important applications. Recognition of individuals from facial images is used for de-duplication of identification cards (e.g., driver’s licenses and passports), verification of prisoner identities, and tag suggestions for personal photo collections. Face images acquired in such applications are conducive to the current capabilities of face recognition algorithms; state-of-the-art systems are able to recognize constrained face images with close to 99% accuracy. However, the performance of automatic face recognition degrades when processing unconstrained face images (i.e., image acquisition is uncontrolled and subjects may be uncooperative). In such scenarios, a face image may simultaneously contain multiple confounding factors, or covariates, such as variations in facial pose, illumination, expression, occlusion, resolution, and facial aging.

The first contribution of this dissertation is a framework for matching a collection of unconstrained face media (images, videos, 3D models, demographics, facial sketch) when multiple instances of a subject’s face are available. This is particularly relevant to forensic investigations where the goal is to identify a “person of interest” based on low-quality face images and videos (e.g., captured by surveillance cameras or mobile phones of bystanders) and other information compiled during the investigation (e.g., gender, race, age, facial sketch). While traditional face matching methods generally take a single medium (i.e., a still face image, video track, or face sketch) as input, this work considers using the entire gamut of media as a probe to generate a single candidate list for the person of interest. We show that the proposed approach boosts the likelihood of correctly identifying the person of interest through the use of different fusion schemes, 3D face models, and incorporation of quality measures for fusion and video frame selection.

Secondly, this dissertation proposes an automatic measure of the quality of an unconstrained face image, where quality is defined as a measure of the utility of a face image to automatic face recognition. A large database of unconstrained face images is first annotated with target quality labels using two methods: (i) human assessments of face image quality, and (ii) quality values computed from similarity scores. A support vector regression model trained on image features automatically extracted using a deep convolutional neural network is then used to predict the quality of an unseen face image. Results demonstrate that target quality values from human assessments and similarity scores are not highly correlated with each other, but both are useful for applications of face image quality, such as rejecting low-quality face images prior to matching and ranking a collection of face images based on quality.
Finally, this dissertation addresses the important problem of facial aging, which is a challenge for both constrained and unconstrained applications. The two underlying premises of automatic face recognition are uniqueness and permanence. We investigate the permanence property by addressing the following: Does the face recognition ability of state-of-the-art systems degrade with elapsed time between enrolled and query face images? If so, what is the rate of decline with respect to the elapsed time? While previous studies have reported degradations in accuracy, no formal statistical analysis of large-scale longitudinal data has been conducted. We conduct such an analysis on two mugshot databases, which are the largest facial aging databases studied to date in terms of number of subjects, images per subject, and elapsed times. Longitudinal analysis shows that despite decreasing genuine scores, 99% of subjects can still be recognized at 0.01% FAR up to approximately 6 years of elapsed time, and that age, sex, and race only marginally influence these trends. The methodology presented in this dissertation should be periodically repeated to determine the age-invariant properties of face recognition as the state of the art evolves to better address facial aging.

Acknowledgments

I would like to extend my sincerest gratitude to my advisor, Dr. Anil Jain, to my parents, family, friends, and PRIP lab members. This thesis would not be possible without the overwhelming kindness and support that they have all given me throughout this journey. Additional thanks to Patrick Grother and Mei Ngan at the National Institute of Standards and Technology (NIST) for their collaboration on the longitudinal study of face recognition in this thesis, and to Jane Wankmiller and Sarah Krebs, sketch artists at the Michigan State Police.

Table of Contents

LIST OF TABLES
LIST OF FIGURES

Chapter 1  Introduction
  1.1 Background
    1.1.1 Automatic Face Recognition Pipeline
  1.2 Research Progression
    1.2.1 Face Databases
    1.2.2 Holistic Representation
    1.2.3 Local Representation
    1.2.4 Learned Representation
      1.2.4.1 Deep ConvNets
  1.3 Video Face Recognition
  1.4 Face Image Quality
  1.5 Facial Aging
  1.6 Benchmarking State of the Art
    1.6.1 Unconstrained Face Recognition
      1.6.1.1 Drawbacks of the LFW Protocol
    1.6.2 Age-Invariant Face Recognition
  1.7 Contributions
  1.8 Thesis Organization

Chapter 2  Face Recognition with Media Collection
  2.1 Introduction
    2.1.1 Overview
  2.2 Related Work
  2.3 Media-as-Input
    2.3.1 Still Image and Video Track
    2.3.2 3D Face Models
    2.3.3 Demographic Attributes
    2.3.4 Forensic Sketches
  2.4 Media Fusion
  2.5 Experimental Setup
    2.5.1 Closed Set Identification
    2.5.2 Open Set Identification
  2.6 Experimental Results
    2.6.1 Pose Correction
    2.6.2 Forensic Identification: Media-as-Input
    2.6.3 Quality-based Media Fusion
    2.6.4 Forensic Sketch Experiments
    2.6.5 Watch List Scenario: Open Set Identification
    2.6.6 Large Gallery Results
  2.7 Conclusions

Chapter 3  Automatic Face Image Quality
  3.1 Related Work
  3.2 Face Image Databases and COTS Matchers
  3.3 Face Image Quality Labels
    3.3.1 Human Ratings of Face Image Quality
      3.3.1.1 Crowdsourcing Comparisons of Face Quality
      3.3.1.2 Matrix Completion
    3.3.2 Recognition-based Face Image Quality Labels
  3.4 Automatic Prediction of Face Quality
  3.5 Experimental Evaluation
    3.5.1 Target Face Image Quality Values
    3.5.2 Predicted Face Image Quality Values
      3.5.2.1 Train, Validate, and Test on LFW
      3.5.2.2 Train and Validate on LFW, Test on IJB-A
  3.6 Conclusion

Chapter 4  Longitudinal Study of Automatic Face Recognition
  4.1 Introduction
  4.2 Related Work
  4.3 Longitudinal Face Databases
    4.3.1 LEO LS Face Database
    4.3.2 PCSO LS Face Database
    4.3.3 Face Comparison Scores
  4.4 Mixed-Effects Models
    4.4.1 Model Formulations
      4.4.1.1 Function of Elapsed Time
      4.4.1.2 Function of Elapsed Time and Age at Enrollment
    4.4.2 Model Comparison and Evaluation
  4.5 Results
    4.5.1 Model Assumptions
    4.5.2 Unconditional Means Model (Model A)
    4.5.3 Unconditional Growth Model (Model BT)
    4.5.4 Age at Enrollment (Models CT and D)
    4.5.5 Sex and Race (Model E)
    4.5.6 Face Image Quality (Model Q)
    4.5.7 LEO LS Database
  4.6 Conclusions

Chapter 5  Summary and Future Work
  5.1 Contributions
  5.2 Future Work

BIBLIOGRAPHY

List of Tables

Table 1.1 Face image databases in the public domain.
Table 1.2 Characteristics of popular face video databases in the public domain.
Table 1.3 Face recognition performance on frontal, constrained face images as reported over the years in NIST evaluations.
Table 1.4 Comparison of performance on the LFW [62] vs. BLUFR [85] protocols.
Table 2.1 A summary of published methods on unconstrained face recognition (UFR). Performance is reported as True Accept Rate (TAR) at a fixed False Accept Rate (FAR) of 0.1% or 1%, unless otherwise noted.
Table 2.2 Number of probe face images (from the LFW database) and video tracks (from the YTF database) available for the 596 subjects that are common to the two databases.
Table 2.3 Closed-set identification accuracies (%) for pose-corrected gallery and/or probe face images using a 3D model. The gallery consists of 4,249 LFW frontal images and the probe sets are (a) 3,143 LFW images and (b) 1,292 YTF video tracks. Performance is shown as rank retrieval results at Rank-1, 20, 100, and 200. Computation of match scores s1, s2, s3, and s4 is shown in Fig. 2.5.
Table 2.4 Closed-set identification accuracies (%) for matching consolidated 3D face models built from (a) all frames of a video track or (b) a subset of high-quality (HQ) video frames.
Table 2.5 Closed-set identification accuracies (%) for quality-based fusion (QBF) (a) within a single image, and (b) across multiple images.
Table 2.6 Retrieval ranks for probe images (1a, 1b) and sketch (1c) matched against gallery images 1x, 1y, and 1z with an extended set of one million mug shots (a) without and (b) with demographic filtering. Rows max and mean denote score fusion of multiple images of this suspect in the gallery; columns max and sum are score fusion of the three probes.
Table 3.1 Summary of related work on automatic methods for face image quality.
Table 3.2 Performance of face recognition algorithms on the BLUFR protocol [85].
Table 3.3 Rank correlation, (a) Kendall's tau and (b) Spearman, between target and predicted quality labels (mean ± standard deviation over 10 random splits of LFW images).
Table 4.1 Related work on the effects of facial aging on face recognition performance.
Table 4.2 Facial aging databases.
Table 4.3 Overall true accept rates (TARs) at fixed false accept rates (FARs) for various face matchers on the PCSO LS and LEO LS databases.
Table 4.4 Mixed-effects model formulations.
Table 4.5 Bootstrap results for mixed-effects models on the PCSO LS database and COTS-A genuine scores.
Table 4.6 Bootstrap results for mixed-effects models with elapsed time and face quality covariates for the PCSO LS database and COTS-A genuine scores.
Table 4.7 Elapsed times (in years) for when population-mean trends in genuine scores drop below the decision thresholds at 0.001% and 0.01% FAR for different measures related to face quality (frontalness and IPD) of the enrollment image Qie and the query image Qij.
Table 4.8 Mixed-effects model results for the LEO LS database and COTS-B genuine scores.
Table 5.1 Published works which have reported results using the experimental protocols introduced in Chapter 2 for the LFW database [62] (single-image matching). COTS results were reported in Chapter 2.

List of Figures

Figure 1.1 (a) Rank-1 miss rates of six vendors for closed-set identification of (b) mugshot and (c) webcam face images against a gallery of mugshot photos of 1.6 million individuals, as reported by the NIST FRVT 2013 evaluation [55].
Figure 1.2 Sources of intra-class variability: (a) pose, (b) illumination, and (c) expression. Although all images shown here are of different people, such variations typically cause two images of the same person to appear very different.
Figure 1.3 A teacher who wears the same outfit for his school picture every year; while the outfit is the same, his face and eyeglasses change over time. The overall quality of the image also changes (i.e., improves over time). Such temporal aspects are additional sources of intra-class variation. [Images are from: http://fillthewell.com/yearbook-pictures/]
Figure 1.4 Sources of inter-class similarity: (a) kinship similarities (in this case, twins) and (b)-(d) different people with no kinship relation who happen to exhibit very similar facial characteristics. This is sometimes referred to as a doppelgänger; (b) shows, as an example, that President Barack Obama (left) has a doppelgänger (right) from Indonesia. [Images in (b) are from: http://www.theguardian.com/theguardian/2010/dec/05/barack-obama-doppelganger-ilham-anas]
Figure 1.5 A flowchart of automatic face recognition in identification mode. A probe face image (with unknown identity) is matched against all face images enrolled in a gallery database. The top-k most similar identities retrieved from the database are then manually adjudicated by human analysts to determine whether the top-k candidates contain the identity of the probe face image. In verification mode, the probe image would be accompanied by a claimed identity and then only compared to the gallery image with the same identity as that which is claimed by the probe.
Figure 1.6 The automatic face recognition pipeline typically consists of (i) face detection, (ii) face normalization (to mitigate geometric and photometric variations), (iii) feature extraction, and (iv) comparison of resulting face representations.
Figure 1.7 Example face detection results. Faces were (a) detected and (b) not detected by an implementation of the Viola-Jones algorithm [135]. Face images in (b) can be better detected by (c) a COTS face recognition system. However, the COTS detector also encounters (d) errors due to occlusion and facial pose, in particular. The small and large rectangles in (c) and (d) show bounding boxes of face and head detections, respectively. The circles are detected eye locations. All face images are from the LFW database [62].
Figure 1.8 Example images from eight face tracks in the YouTube Faces (YTF) database where all images in that track could not be enrolled by one of the COTS matchers. These images display extreme pose and illumination conditions, low resolution, and motion blur.
Figure 1.9 Example face images from different databases: (a) FERET [97], (b) FRGC [97], (c) AR [92], (d) LFW [62], and (e) IJB-A [72]. Databases (a)-(c) contain variations such as illumination, expression, and occlusion to challenge face recognition research, but they represent relatively controlled acquisition conditions because such variations are simulated/staged (subjects are typically students and members of research groups). Databases (d) and (e) contain more unconstrained face images (e.g., collected from the internet).
Figure 1.10 Example images from face tracks of two subjects in the YouTube Faces (YTF) database. The top two and bottom two rows are face tracks from the same subject.
Figure 1.11 Face images of two example subjects from the FG-NET database [78]: (a) female at ages 3–38 years and (b) male at ages 19–63 years. As shown in these examples, the FG-NET database contains a significant amount of variation (pose, illumination, inter-pupillary distances, image quality, etc.), in addition to intrinsic variations due to facial aging.
Figure 1.12 Face images and corresponding ages (in years) of three example subjects from the MORPH database [113]. The largest commercial version of MORPH has 78,207 face images of 20,569 subjects. However, there are only 317 subjects with at least 5 images acquired over at least 5 years (these are three of the 317).
Figure 2.2 Forensic investigations by law enforcement agencies using face images typically involve six main stages: obtaining face media, preprocessing, automatic face matching, generating a suspect list, human analysis, and suspect identification. Feedback occurs after human analysis reveals that, for example, additional preprocessing of the input image (e.g., illumination correction and/or manual eye locations), demographic filtering of the gallery, and/or a different face sample from the media collection is necessary.
Figure 2.3 Schematic diagram of a person identification task given a face media collection as input.
Figure 2.4 Example (a) face images from the LFW database and (b) face video tracks from the YTF database. All faces shown are of the same subject.
Figure 2.5 Pose correction of probe (left) and gallery (right) face images using CyberExtruder's Aureus 3D SDK. We consider the fusion of four different match scores (s1, s2, s3, and s4) between the original probe and gallery images (top) and synthetic pose-corrected probe and gallery images (bottom).
Figure 2.6 Pose-corrected faces (b) in a video track (a) and the resulting "consolidated" 3D face model (c). The consolidated 3D face model is a summarization of all frames in the video track.
Figure 2.7 An example of a sketch drawn by a forensic artist by looking at a low-quality video. (a) Video shown to the forensic artists, (b) facial region cropped from the video frames, and (c) sketch drawn by the forensic artist. Here, no verbal description of the person of interest is available.
Figure 2.8 Examples of different face media types with varying quality values (QV) of one subject: (a) images, (b) video frames, (c) 3D face models, and (d) demographic information. The range of QV is [0,1].
Figure 2.12 Face verification performance of a gallery of 4,249 frontal LFW images and probe media collections of 596 subjects.
Figure 2.13 A comparison of quality-based fusion (QBF) vs. simple sum rule fusion (SUM). (a) Examples where quality-based fusion provides better identification accuracy than sum fusion; (b) examples where quality-based fusion leads to lower identification accuracy compared with sum fusion.
Figure 2.15 Three examples where the face sketches drawn by a forensic artist after viewing the low-quality videos improve the retrieval rank. The retrieval ranks without and with combining the demographic information (gender and race) are given in the form of #(#).
Figure 2.16 Face images used in our case study on identification of Tamerlan Tsarnaev, one of the two suspects of the 2013 Boston Marathon bombings. Probe (1a, 1b) and gallery (1x, 1y, and 1z) face images are shown. 1c is a face sketch drawn by a forensic sketch artist after viewing 1a and 1b, and a low quality video frame from a surveillance video.
Figure 2.18 An example of two face images of the same subject in the LFW database where facial aging has occurred.
Figure 3.1 Examples of (a) high and (b) low quality mugshots from the PCSO database.
Figure 3.2 (a) Video frames from a sample video in the IJB-A [72] unconstrained face database and (b) corresponding cropped faces sorted from high to low quality by the proposed approach.
Figure 3.3 Sample face images from the (a) LFW [62] and (b) IJB-A [72] unconstrained face databases.
Figure 3.4 The interface used to collect responses for pairwise comparisons of face image quality from MTurk workers.
Figure 3.5 Face images (from the LFW database) used for the 6 tutorial pairs used to check whether MTurk workers understood the task before completing the pairwise comparisons used in our study of face image quality. For each of the tutorial pairs, one image was selected from the top row (high quality images) and one image was selected from the bottom row (low quality images), so the pairwise comparison of face quality had an unambiguous answer.
Figure 3.6 The resulting range of the face quality values (after matrix completion) for a particular worker inversely depends on the number of pairs that the worker marked "Similar" quality. Although collection of relative responses avoids the bias present when workers are asked to rate individual images on an absolute scale, bias is still present from the tendency to respond "Similar". This indicates that normalization is required to transform the quality ratings from each worker to the same scale.
Figure 3.7 Histogram of rank correlations between the face image quality ratings of all pairs of MTurk workers ((194 choose 2) = 18,721 total pairs of workers). The quality ratings are those obtained after matrix completion. The degree of concordance between workers is 0.37, on average.
Figure 3.8 Illustration of the pairwise quality issue. Images in the left and right columns are individually of high and low quality, respectively. However, when compared with the other images, they can produce both high and low similarity scores. (Similarity scores are from COTS-A with range of [0, 1].)
Figure 3.9 Rank correlations between the different target face quality values considered in this work. COTS-B FQ is a face quality measure output by COTS-B (a black-box method to us, included for comparison). Three red asterisks indicate that the correlations are statistically significant at α = 0.001. The score-based measures of face quality (zij) from COTS-A and COTS-B have the strongest correlation, while the human quality ratings have the weakest correlation with the other quality measures.
Figure 3.10 Error vs. Reject curves for (a) FNMR and (b) FMR on the LFW database (5,749 gallery and 7,484 probe images). Probe images were rejected in order of target (i.e., "ground truth") quality values of human quality ratings or score-based quality values (zij). Thresholds are fixed at (a) 0.2 FNMR and (b) 0.01 FMR for comparison of the three face matchers (COTS-A, COTS-B, and DCNN [136]).
Figure 3.11 Error vs. Reject curves for target and predicted face image quality values. The curves show the efficiency of rejecting low quality face images in reducing FNMR at a fixed FMR of 0.001%. The model used for the face quality predictions in (a)-(c) is support vector regression on the deep-320 features from the deep ConvNet in [136].
Figure 3.12 Face images from a subject in LFW are rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing face quality. The Spearman correlation between the target and predicted rank orderings for this subject is 0.72.
Figure 3.13 Face images from LFW are rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have positive rank correlation between target and predicted rankings. For each of the three example subjects, the Spearman correlations between the target and predicted rank orderings are 0.94, 0.90, and 0.50 (top to bottom).
Figure 3.14 Face images from LFW rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have negative (or zero) rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlations between the target and predicted rank orderings are -0.50 and 0.00 (top to bottom).
Figure 3.15 Face images from LFW rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have strong negative rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlations between the target and predicted rank orderings are -0.90 and -0.70 (top to bottom).
Figure 3.16 Face images from LFW rank-ordered by target (left) and predicted (right) score-based quality values (COTS-A zij), in order of increasing quality. Examples shown have negative rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlations between the target and predicted rank orderings are -0.33 and -0.37 (top to bottom).
Figure 3.17 Face images from LFW rank-ordered by target (left) and predicted (right) score-based quality values (COTS-A zij), in order of increasing quality. Examples shown have negative rank correlation between target and predicted rankings. For each of the three example subjects, the Spearman correlations between the target and predicted rank orderings are -1.00, -0.20, and -0.31 (top to bottom).
Figure 3.18 Face images from IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136] trained on human quality ratings from the LFW database) and (right) the Rank-based Quality Score (RQS) [35] for comparison.
Figure 3.19 Face images from IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136] trained on human quality ratings from the LFW database) and (right) the Rank-based Quality Score (RQS) [35] for comparison.
Figure 3.20 Face images from two subjects in IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136] trained on human quality ratings from the LFW database) and (right) the Rank-based Quality Score (RQS) [35] for comparison.
Figure 3.21 Face images from the videos of example subjects in IJB-A [72] sorted by face image quality (best to worst), which was automatically predicted by the proposed approach using a model (SVR on Deep-320 image features [136]) trained on human quality ratings from the LFW database.
Figure 4.1 Face image pairs of four subjects from the PCSO LS mugshot database which are age-separated by eight to ten years. Similarity scores from a state-of-the-art face matcher (COTS-A) are shown in parentheses (score range is [0.0, 1.0]). The thresholds at 0.01% and 0.1% FAR are 0.533 and 0.454, respectively. Hence, all of these genuine pairs would be falsely rejected at 0.01% FAR, while the two female subjects, (a) and (b), would also be rejected at 0.1% FAR.
Figure 4.2 Statistics of the two longitudinal face image databases (PCSO LS and LEO LS) used in this study: (a) and (e) number of face images per subject, (b) and (f) the time span of each subject (i.e., the number of years between a subject's youngest and oldest face image acquisitions), (c) and (g) demographic distributions of sex (male, female) and race (white, black, Asian, Indian, unknown), and (d) and (h) the age of the youngest image of each subject (in years).
Figure 4.3 Three examples of labeling errors in the PCSO LS face database. All pairs show two different subjects who are labeled with the same subject ID number in the database.
Figure 4.4 Examples of facial occlusions (sunglasses, bandages, and bruises) in the PCSO LS face database.
Figure 4.5 Face images of six example subjects from the PCSO LS database. The enrollment face image (leftmost column) is the youngest image of each subject, and all query images are in order of increasing age. In this study, genuine similarity scores are computed by comparing the query images of each subject to his/her enrollment image.
Figure 4.6 An example of cross-sectional vs. longitudinal analysis. In (a), a cross-sectional approach (ordinary least squares (OLS) linear regression) is applied, which incorrectly assumes that all the scores are independent. In (b), OLS is instead applied six times, separately to each subject's set of scores (subjects shown in Fig. 4.5). The slope estimated by cross-sectional analysis (black dotted line) is much flatter than the slopes of the subject-specific trends (solid colored lines in (c)). The longitudinal analysis in this work utilizes mixed-effects models, which provide "shrunken" OLS estimates for each subject, where the OLS trends shrink towards a population-mean trend [44, 118], further accounting for the correlation that exists between scores from the same subject.
Figure 4.7 Age distribution of a random sample of 200 subjects from the PCSO LS database. Each line denotes the age span of a subject (i.e., age of the youngest image to age of the oldest image), separated along the y-axis by the elapsed time for each subject (i.e., the length of the age span).
Figure 4.8 Distributions of standardized genuine comparison scores from the two longitudinal face databases used in this study: (a) COTS-A on PCSO LS and (b) COTS-B on LEO LS. There are a total of 129,773 and 26,216 genuine scores in (a) and (b), respectively.
Figure 4.9 Normal probability plots of ((a) and (d)) level-1 residuals, εij, and level-2 random effects for ((b) and (e)) intercepts, b0i, and ((c) and (f)) slopes, b1i, from Model BT on the PCSO LS and LEO LS databases (top and bottom rows, respectively). Departure from normality at the tails of the distributions is likely due to low quality face images or errors in subject IDs.
Figure 4.10 Results from Model BT on COTS-A genuine scores from the PCSO LS database. The bootstrap-estimated population-mean trend is shown in black (bootstrap confidence intervals are too small to be visible). The blue and green bands plot regions of 95% and 99% confidence, respectively, for subject-specific variations around the population-mean trend. Grey dotted lines additionally add one standard deviation of estimated residual variation, σε. Hence, Model BT estimates that 95% and 99% of the subject trends fall within the blue and green bands, but scores can vary around their trends, extending to the grey dotted lines. Thresholds at 0.01% and 0.1% FAR for COTS-A are shown as dashed red lines.
Figure 4.11 Example outlier subjects, i.e., subjects whose subject-specific trends, estimated by Model BT, significantly deviate from the spread of the population in the PCSO LS database. All images were aligned using COTS-A eye locations.
Figure 4.12 Model E fit to COTS-A genuine scores from the PCSO LS database. Population-mean trends are plotted by subject demographics of sex and race. Each trend line represents seven years of elapsed time since enrollment at five different ages (20–60 years old). For example, the solid blue line beginning at AGEij = 20 years represents the average decrease in genuine scores for white males enrolled at age 20 with query images until age 27.
Figure 4.13 A boxplot of interpupillary distances (IPDs) versus year of acquisition shows that mean IPDs systematically changed over time for the PCSO LS database, likely due to booking stations adhering to face imaging standards only in more recent years.
Figure 4.14 Results from Model BT on COTS-B genuine scores from the LEO LS database. The population-mean trend is shown in black. The blue and green bands plot regions of 95% and 99% confidence, respectively, for subject-specific variations around the population-mean trend. Grey dotted lines additionally add one standard deviation of estimated residual variation, σε. Hence, Model BT estimates that 95% and 99% of the subject trends fall within the blue and green bands, but scores can vary around their trends, extending to the grey dotted lines. Thresholds at 0.01% and 0.1% FAR for COTS-B are shown as dashed red lines.
Figure 4.15 Model E for COTS-B genuine scores from the LEO LS database. Population-mean trends are plotted by subject demographics of sex and race, in addition to five different ages at enrollment (20 to 60 years). Each trend line represents seven years of elapsed time since enrollment. For example, the solid blue line beginning at AGEij = 20 years represents the average decrease in genuine scores for white males enrolled at age 20 with query images until age 27.

Chapter 1

Introduction

Automatic face recognition systems are currently deployed in many important applications. Face recognition plays a key role in identity card de-duplication to prevent a person from obtaining multiple ID cards, such as driver's licenses and passports, under different names. Face recognition is used by the United States Department of Defense (DoD) to assist soldiers in determining friend or foe at security checkpoints and village assessments, and law enforcement officers in the field are able to capture face images with mobile devices, submit them to face recognition systems on central servers, and quickly identify people who refuse to give their name, provide false information, or are injured and unresponsive. Face recognition systems are additionally utilized for surveillance purposes and access control to secure locations. Commercial applications of automatic face recognition are also now abundant, including "tag" suggestions on Facebook, organization of personal photo collections, and mobile phone unlock.

Face image acquisition conditions for many of these applications are conducive to the current capabilities of face recognition systems (i.e., relatively controlled environments and/or cooperative subjects). Face images for identification documents require a neutral expression and no facial accessories, a uniform background, and controlled lighting. For example, de-duplication entails frontal-to-frontal face matching of controlled images. In these types of scenarios, state-of-the-art commercial off-the-shelf (COTS) face recognition systems are highly accurate and have proven to be extremely useful. As of 2013, at least 37 states1 are using face recognition technology to assist in the detection of fraudulent identification documents; the state of New York alone attributes more than 2,500 arrests in three years to the use of face recognition technology.2 In terms of accuracy, a large-scale evaluation conducted by the National Institute of Standards and Technology (NIST) [55] demonstrated that error rates of the top-performing COTS face recognition systems were lower than 10% for identifying mugshot face images at rank-1 against a gallery database of 1.6 million individuals (see Fig. 1.1).

Figure 1.1 (a) Rank-1 miss rates of six vendors for closed-set identification of (b) mugshot and (c) webcam face images against a gallery of mugshot photos of 1.6 million individuals, as reported by the NIST FRVT 2013 evaluation [55].

1 www.washingtonpost.com/business/technology/state-photo-id-databases-become-troves-for-police/2013/06/16/6f014bd4-ced5-11e2-8845-d970ccb04497_story.html
2 http://www.governor.ny.gov/news/governor-cuomo-announces-13000-identity-fraud-cases-investigated-dmv-using-facial-recognition
As the demand for automatic recognition of individuals continues to increase, face offers a number of advantages over other biometric traits (e.g., fingerprint and iris): (i) Recognition by faces is how humans naturally interact with each other, so face images do not contain any information that people do not also disclose to the public on a daily basis. Face recognition tends to be more publicly accepted (compared to fingerprints, for example, which are commonly associated with criminal accusations). (ii) Large legacy face image databases already exist that can be searched against (e.g., passport and driver's license). (iii) The face reveals other attributes (gender, race, age) that can be used as side information. (iv) The face can be captured unobtrusively, at a distance, and in a covert manner, if necessary. (v) No specialized sensors are required; digital cameras are readily available (i.e., in mobile phones) and/or relatively inexpensive.

The above advantages of the face biometric lend themselves to new emerging applications of face recognition, which are largely due to the increasing ubiquity of surveillance cameras and mobile imaging devices. According to a 2013 survey, there is one surveillance camera for every 11 people in the UK,3 and a study conducted in 2014 estimates that the video surveillance market will reach $42 billion by the year 2019.4 With recent tragic and controversial police-civilian incidents, such as the deaths of Michael Brown in Ferguson, Missouri, and Eric Garner in New York City, many police agencies are now equipping patrol officers with body cameras, and national debates are ensuing about whether police should be required to wear them at all times.5,6 Personal collections of photos have also skyrocketed, as front-facing cameras on mobile phones sparked the "selfie" boom (i.e., taking a picture of yourself) and an era of constant documentation of personal lives on social media.

3 http://www.telegraph.co.uk/technology/10172298/One-surveillance-camera-for-every-11-people-in-Britain-says-CCTV-survey.html
4 http://www.securitysales.com/article/report_video_surveillance_market_to_reach_42b_by_2019
5 http://www.npr.org/2015/04/10/398704487/eyewitness-video-a-controversial-tool-for-holding-police-accountable
6 http://www.msnbc.com/msnbc/missouri-lawmaker-police-body-camera-footage

This increase in available imagery has been put to use in solving high-profile crimes. For example, the 2011 London riots, which resulted in one fatality, had over 100,000 hours of surveillance footage for law enforcement officials to utilize.7 The 2013 Boston marathon bombings resulted in four fatalities and more than 250 injured; again, law enforcement acquired a daunting amount of surveillance footage to sift through, as well as images and videos from the mobile phones of bystanders and marathon runners.8 In both of these cases, large amounts of manual resources were immediately devoted to searching for investigative leads from the acquired media, and face images of suspects were released to the public for identification.
As these recent tragic events, in addition to countless other routine crimes (e.g., robbery, kidnapping, assault), have made evident, government and law enforcement officials could greatly benefit from automated (or semi-automated) face recognition to assist with the identification of persons of interest. A face recognition system designed for the 2012 Olympics was available for use in the London riots but did not play a major role in identifying the rioters,9 and there have been no reports that automatic face recognition was attempted for the Boston bombings. However, a recent case study demonstrated that a state-of-the-art commercial face recognition system had the potential to identify one of the suspects, Dzhokhar Tsarnaev (the younger brother), at Rank-1 amongst one million mugshot images if he was in the database [73]. The success of face recognition technology in these scenarios is currently limited by the unconstrained nature of the imagery typically available. Accuracies of current COTS systems are highly sensitive to the quality of available face images. The large-scale face recognition evaluation by NIST (FRVT 2013 [55]) also reported that error rates of the top six COTS systems more than doubled when matching lower quality webcam images to the mugshot gallery (see Fig. 1.1). While the feasibility and utility of fully automated face recognition for surveillance purposes are limited, used as an investigative tool, face recognition can still assist law enforcement in searching for a list of suspects for manual examination.

7 http://www.independent.co.uk/news/uk/crime/more-support-for-cctv-after-riots-2375768.html
8 http://www.washingtonpost.com/wp-srv/special/national/boston-marathon-bombing-victims/
9 http://latimesblogs.latimes.com/technology/2011/08/london-riots-facial-recognition-tech-being-used-by-police.html

In unconstrained scenarios where face image acquisition is not well controlled and subjects may be uncooperative (or unknowing), multiple factors which are known to confound the performance of face recognition systems are simultaneously present. Such confounding factors include facial pose, non-uniform illumination, facial expression, as well as occlusion and low image resolution.

• Pose: Facial pose can be categorized as in-plane (roll) or out-of-plane (yaw and/or pitch) rotation. In-plane rotations can be corrected for with simple 2D transformations. However, when the head is rotated out-of-plane, certain regions of the face become "self-occluded," or no longer visible in the acquired face image (see Fig. 1.2(a)). This results in missing information and makes it difficult to determine correspondences between features of two faces at different poses.

• Illumination: For face images acquired in natural settings, ambient lighting can be drastically different depending on the setting (e.g., indoor vs. outdoor) and is affected by daily changes even in a specific environment (e.g., the amount of light coming in from windows on a particular day and time). The angle of the head with respect to the light source also causes changes in how the face is illuminated. Due to the three-dimensional structure of the face, certain angles of illumination can cause severe shadows across the face. Darkening or lightening of facial features causes them to appear very different in a 2D color or grayscale image. Some features may even diminish completely if the illumination is either too strong or too weak (see Fig. 1.2(b)).
• Expression: While a neutral or relaxed facial expression is probably the most frequent state of a person's face, face images are often captured mid-conversation, while viewing something surprising, upsetting, etc., or while simply "making a face." Such daily activities cause different expressions involving different facial regions and components (see Fig. 1.2(c)). As facial recognition technology became widely used by Departments of Motor Vehicles (DMVs) across the United States, some DMVs began enforcing a "no smiling" rule for new driver's license photos.10 However, recently DMVs (e.g., Delaware11) have started to upgrade their facial recognition technology to systems which are capable of matching face images with high accuracy, regardless of smiling or neutral expression, and have lifted the ban on smiling. Nevertheless, extreme expressions are still challenging for state-of-the-art face recognition systems.

• Occlusion: Eyeglasses and sunglasses are a common cause of errors in facial recognition systems because the eye region, which is often highly discriminative, gets occluded. Facial occlusions not only cause missing information, but also extraneous information, because it is difficult to detect and mask out occluded facial regions for matching. Even if a person consistently wears eyeglasses, specular reflections that change based on the light source still cause additional intra-person face variation. Other occluding facial accessories, such as baseball caps and hoods, can hide the forehead and eyes and cast shadows on the face. Besides facial accessories, faces can also be occluded by other objects or persons, which is typical of faces in a crowd; accurately identifying such "partial faces" in a crowd is an application of high interest for surveillance purposes.

• Resolution: The spatial resolution of a face, irrespective of image resolution, can be measured as the distance (i.e., number of pixels) between the two eyes, also termed the interpupillary distance (IPD). Smaller IPDs generally lead to lower face recognition accuracy, but there have also been studies (e.g., [53]) that show that the discrepancy between the IPDs of two face images being compared can cause more errors than the absolute IPD values.

10 http://usatoday30.usatoday.com/news/nation/2009-05-25-licenses_N.htm
11 http://www.delawareonline.com/story/news/traffic/burke/2015/01/28/dmv-lifts-ban-smiling-license-photos/22475061/

Figure 1.2 Sources of intra-class variability: (a) pose, (b) illumination, and (c) expression. Although all images shown here are of different people, such variations typically cause two images of the same person to appear very different.

The above factors are typically those assumed present when dealing with "unconstrained face recognition." However, another variation that is known to degrade the performance of face recognition systems is facial aging (see Fig. 1.3). Given two face images of the same person captured multiple years apart, a robust face recognition system should still be able to recognize the two photos as the same person. Unlike the above factors, facial aging cannot be controlled by either the subject or the imaging environment; it is a challenge that can be present in both constrained and unconstrained face recognition scenarios. Facial aging will be discussed later in this chapter.
While intra-class variations are a major challenge for face recognition systems, inter-class similarities can also cause errors. For example, it can be difficult (even for humans) to distinguish between persons with kinship relations (particularly twins, see Fig. 1.4(a)), and persons that are not related can exhibit strong similarities (see Fig. 1.4(b)-(d)).

Figure 1.3 A teacher who wears the same outfit for his school picture every year (yearbook photos spanning 1973 to 2011); while the outfit is the same, his face and eyeglasses change over time. The overall quality of the image also changes (i.e., improves over time). Such temporal aspects are additional sources of intra-class variation. [Images are from: http://fillthewell.com/yearbook-pictures/]

Figure 1.4 Sources of inter-class similarity: (a) kinship similarities (in this case, twins) and (b)-(d) different people with no kinship relation who happen to exhibit very similar facial characteristics. This is sometimes referred to as a doppelgänger; (b) shows, as an example, that President Barack Obama (left) has a doppelgänger (right) from Indonesia. [Images in (b) are from: http://www.theguardian.com/theguardian/2010/dec/05/barack-obama-doppelganger-ilham-anas]

Figure 1.5 A flowchart of automatic face recognition in identification mode. A probe face image (with unknown identity) is matched against all face images enrolled in a gallery database. The top-k most similar identities retrieved from the database are then manually adjudicated by human analysts to determine whether the top-k candidates contain the identity of the probe face image. In verification mode, the probe image would be accompanied by a claimed identity and then only compared to the gallery image with the same identity as that which is claimed by the probe.

1.1 Background

Automatic face recognition operates in different modes depending on the application. Regardless of the application, face images labeled with their identities are first enrolled in a database, referred to as the gallery. A face recognition system then takes a face image as input (i.e., the probe or query) and matches it against one or many face images in the database. Face verification involves a one-to-one comparison to verify that the probe face image is the identity that it claims to be (e.g., passport and passenger processing at airports, access control for buildings, and mobile phone authentication). Face identification involves one-to-many comparisons to retrieve (from the gallery) the identity of a probe face image whose identity is unknown. Because automatic face recognition systems are susceptible to errors, in practice, the identity of the probe face image is established by manual adjudication of the top-k most similar identities (see Fig. 1.5), where k is application dependent (e.g., de-duplication, watch list surveillance, tag suggestions). In some scenarios, the top-k candidate identities are always manually adjudicated (e.g., identification of a suspected criminal in forensics); this can be considered a closed-set identification scenario, where we assume that the identity of the probe is present in the gallery. However, open-set identification, where the identity of the probe may not be present in the gallery, is more representative of real-world scenarios. For open-set applications, the frequency of false alarms raised for subjects not in the gallery can be reduced by only returning the top candidate matches if they exceed a predetermined threshold (i.e., k is of variable length). This is useful for "lights out" applications where it is impractical for a human analyst to review candidates for every query to the database (e.g., watch list surveillance, especially in high-traffic areas).
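To make the closed-set vs. open-set distinction concrete, the following is a minimal sketch of the thresholded candidate-list logic described above. It is illustrative only: the gallery layout, the similarity function, and the threshold value are hypothetical placeholders and not the interface of any particular COTS matcher.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def open_set_identify(
    probe_feature: Sequence[float],
    gallery: Dict[str, Sequence[float]],      # subject ID -> enrolled feature vector (assumed)
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float,                          # operating point, e.g., chosen for a target FAR
    k: int = 20,                               # maximum candidate list length
) -> List[Tuple[str, float]]:
    """Return at most k gallery candidates whose similarity scores exceed the threshold.

    An empty list means the probe is treated as not enrolled (no alarm is raised),
    which is the desired behavior for "lights out" watch list applications.
    """
    scores = [(subject_id, similarity(probe_feature, feature))
              for subject_id, feature in gallery.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(sid, score) for sid, score in scores[:k] if score >= threshold]
```

Closed-set identification corresponds to dropping the threshold test and always returning the top-k list for manual adjudication.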
Whether verification or identification, the primary goal of an automatic face recognition system is to compute a measure of similarity between any two face images. Ideally, faces of the same individual should have higher similarity than faces of different individuals. However, there are multiple components in the face recognition pipeline that have a significant impact on the computation of similarity scores and the resulting recognition performance.

1.1.1 Automatic Face Recognition Pipeline

The automatic face recognition pipeline (shown in Fig. 1.6) typically consists of the following sequential components: (i) face detection, (ii) face normalization, (iii) feature extraction and face representation, and (iv) comparison. Each of these components is crucial for achieving accurate and robust face recognition systems, and a significant amount of research has been devoted to each component individually. While state-of-the-art systems perform these steps fully automatically with extremely high accuracy for controlled-environment and cooperative-subject scenarios (e.g., mugshot face images), open research problems still exist for unconstrained scenarios.

Figure 1.6 The automatic face recognition pipeline typically consists of (i) face detection, (ii) face normalization (to mitigate geometric and photometric variations), (iii) feature extraction, and (iv) comparison of resulting face representations.

Face Detection: Face detection is the process of automatically determining whether a face (or multiple faces) exists in an image, and subsequently outputting the locations of all detected faces. While it is a trivial task for humans to locate faces in an image, automatic extraction of face "sub-images" from arbitrary images is a challenging task for machines. This is because of large intra-class variability in the appearance of faces (due to location, scale, skin color, etc.), as well as the possible presence of other face-like objects. Research in face detection has been ongoing for more than two decades, but the seminal work of Viola and Jones [135] is credited with being the first real-time and accurate face detector, enabling many real-world applications.

The Viola-Jones algorithm is an appearance-based method that uses simple Haar-like features, which are sums of rectangular regions of pixels that respond to contrast differences between structures on the face (e.g., the two eyes are typically darker than the bridge of the nose). At the time it was introduced, the novelty of the Viola-Jones face detector was due to three key contributions: (i) fast feature computation using an "integral image", (ii) feature subset selection with AdaBoost [46], and (iii) fast and accurate rejection of non-faces using an attentional cascade structure [135]. Though the Haar-like features are simple, computing the over-complete set is expensive (e.g., 160,000 features for a 24×24 window); the integral image enables the sum of an arbitrary rectangle to be computed with just four lookups [135].
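As an illustration of the "four lookups" idea, the sketch below computes an integral image and uses it to sum an arbitrary rectangle; a two-rectangle Haar-like feature is then just the difference of two such sums. This is a simplified re-implementation for exposition, not code from [135].

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """ii[y, x] holds the sum of all pixels above and to the left of (y, x), inclusive."""
    return img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii: np.ndarray, top: int, left: int, height: int, width: int) -> int:
    """Sum of pixel values inside a rectangle, using four lookups into the integral image."""
    bottom, right = top + height - 1, left + width - 1
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return int(total)

# Example: a two-rectangle (top minus bottom) Haar-like feature over a 24x24 window.
window = np.random.default_rng(0).integers(0, 256, size=(24, 24))
ii = integral_image(window)
feature_value = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)
```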
The over-complete set of features is also too computationally expensive to use directly for classification, so multiple "weak" classifiers are trained sequentially with AdaBoost [46], where each weak classifier is based on a single feature. Because selection of a small number of features sacrifices accuracy for real-time processing, Viola and Jones further use a cascade of classifiers which quickly discards non-face regions and allocates more resources to possible faces [135]. Experimental results in [135] demonstrated that a single-stage classifier with 200 features and a cascade of 10 classifiers, each with 20 features, achieved similar detection rates, but the cascade was 10 times faster.

Since its publication in 2004, the Viola-Jones face detector has greatly influenced research in face detection and is still widely used. However, many other methods have since been proposed that stem from the techniques proposed by Viola and Jones [135] and aim to be more robust to variations in facial pose, illumination, expression, and occlusion. A survey of face detection approaches is provided in [150], and an in-depth evaluation of various detection algorithms on unconstrained faces is given in [36]. Figure 1.7 shows example face detection results from an implementation of the Viola-Jones algorithm and a detector from a COTS face recognition system. The COTS detector performs better than the Viola-Jones algorithm, but errors are still observed for faces with extreme facial pose and occlusions, for example.

Figure 1.7 Example face detection results. Faces were (a) detected and (b) not detected by an implementation of the Viola-Jones algorithm [135]. Face images in (b) can be better detected by (c) a COTS face recognition system. However, the COTS detector also encounters (d) errors due to occlusion and facial pose, in particular. The small and large rectangles in (c) and (d) show bounding boxes of face and head detections, respectively. The circles are detected eye locations. All face images are from the LFW database [62].

Face Normalization: Face normalization seeks to mitigate geometric and photometric variations that can greatly affect the subsequent modules of the face recognition pipeline. To normalize shape, face alignment is often performed to transform all faces to a canonical view. Face alignment aims at determining correspondences between face images based on any number of feature/landmark/fiducial points (e.g., eyes, nose, mouth, contour, etc.). The most common face alignment technique is a simple 2D rigid affine transformation based on the two eye locations to correct for size and in-plane head rotation. However, in unconstrained face recognition, face images may contain out-of-plane rotations, so a simple 2D rotation based on the eyes alone may not be sufficient.

Active Shape Models (ASMs) [38] and Active Appearance Models (AAMs) [37, 93] were some of the first statistical models proposed for object (e.g., face) alignment. At the time, ASMs and AAMs were state of the art, mainly due to the novelty of learning shape and/or texture variations of a face from labeled training data.
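As a concrete illustration of the simple two-eye alignment mentioned above, the sketch below builds the similarity transform (scale, in-plane rotation, translation) that maps detected eye centers to canonical positions; the crop size and canonical eye locations are arbitrary choices, and applying the matrix (e.g., with OpenCV's warpAffine) is left out:

```python
import numpy as np

def align_by_eyes(left_eye, right_eye, out_size=128, eye_y=0.4, eye_dist=0.5):
    """Similarity transform that maps the detected eye centers to fixed
    positions in an out_size x out_size crop. Returns a 2x3 matrix usable
    with, e.g., cv2.warpAffine."""
    left_eye, right_eye = np.asarray(left_eye, float), np.asarray(right_eye, float)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)                        # in-plane head rotation
    scale = (eye_dist * out_size) / np.hypot(dx, dy)  # normalize inter-eye distance
    cos, sin = scale * np.cos(-angle), scale * np.sin(-angle)
    M = np.array([[cos, -sin, 0.0], [sin, cos, 0.0]])
    # Translate so the midpoint between the eyes lands at the canonical location.
    center = (left_eye + right_eye) / 2.0
    target = np.array([out_size / 2.0, eye_y * out_size])
    M[:, 2] = target - M[:, :2] @ center
    return M

# Example with hypothetical eye coordinates from a landmark detector.
M = align_by_eyes(left_eye=(110, 150), right_eye=(180, 145))
```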
While ASMs and AAMs improved the specificity of model-based approaches, they did so at the cost of generalization; alignment performance suffers when ASMs or AAMs are trained on a large database and/or fitted to previously unseen instances. AAM-based methods were predominant for some time, but more robust solutions for landmark localization and alignment have since been proposed. For example, Zhu and Ramanan [153] propose a unified approach for detection, alignment, and landmark localization for faces "in the wild" that discriminatively encodes deformation and 3D structure as mixtures of trees with a shared pool of parts [153]. Face alignment can also be done in 3D, with 3D morphable models (3DMMs), for example [26, 27]. Jourabloo and Liu propose a 3DMM-based approach to estimate both 2D and 3D facial landmarks under full pose variations, which additionally allows for estimation of the visibility of 2D landmarks [66]. Additionally, some recent works have shown impressive results for "frontalization" of unconstrained 2D face images with 3D modeling techniques (e.g., [127, 128]), as well as 3D face reconstruction from a collection of unconstrained 2D face images [115].

Face alignment is also associated with "failure to enroll." If landmark points cannot be detected, features cannot be extracted, which can cause the entire enrollment process to fail. Figure 1.8 shows cropped face images from video frames in the YouTube Faces database where two COTS face matchers failed to enroll the face. Landmark localization and face alignment are difficult problems, and many face recognition methods are highly dependent on the accuracy of either one or both of these processes; hence, some "alignment-free" methods have been proposed (e.g., [84]).

Figure 1.8 Example images from eight face tracks in the YouTube Faces (YTF) database where none of the images in the track could be enrolled by one of the COTS matchers. These images display extreme pose and illumination conditions, low resolution, and motion blur.

Feature Extraction and Face Representation: Feature extraction and face representation go hand in hand. The simplest features are the raw pixel values of the face, where the representation is then a rasterized vector of raw pixel values. However, raw pixel values in vector form are not very informative; a significant amount of additional and relevant information exists in a face image that can be used to represent a face and enhance matching results. For example, high-level features, such as the distances between facial components and their relative locations and ratios, in addition to low-level features such as wrinkles and facial marks, can also be encoded to further discriminate between individuals. Using additional features seems like an obvious way to improve performance. However, the primary issue with adding more informative features is that the dimensionality of the feature vector becomes increasingly large (and likely redundant). Hence, the representation step typically focuses on compressing features so that they are both compact and highly discriminative. A vast amount of research has been devoted to these tasks (extraction and representation), some of which will be discussed in Section 1.2.

Comparison: Once a compact and discriminative representation of a face image has been obtained, the next step is to compare it to the representations of other face images to compute a measure of similarity. The Euclidean distance between feature vectors can be used; however, a more sophisticated choice of distance metric may significantly improve the recognition rate. Examples of other distance functions include cosine, Manhattan, Tchebyshev, and correlation distances, as well as histogram intersection, log-likelihood statistics, chi-square statistics, etc.
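The sketch below computes a few of the comparison measures listed above between two feature vectors; the 128-dimensional random vectors stand in for whatever representation a matcher produces, and the choice of measure in practice depends on how the features were derived:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two face feature vectors (higher = more similar)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a, b):
    """Euclidean distance (lower = more similar)."""
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    """Manhattan (city-block) distance."""
    return float(np.sum(np.abs(a - b)))

def tchebyshev(a, b):
    """Tchebyshev (maximum coordinate difference) distance."""
    return float(np.max(np.abs(a - b)))

# Two hypothetical 128-dimensional face representations.
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=128), rng.normal(size=128)
print(cosine_similarity(f1, f2), euclidean(f1, f2), manhattan(f1, f2), tchebyshev(f1, f2))
```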
Distance metric learning has also been applied to face recognition (e.g., [39, 61, 128]), where a distance metric is learned from training data to simultaneously minimize distances between instances of the same class and maximize distances between instances of different classes.

1.2 Research Progression

The concept of identifying individuals based on retained face images dates back to the 19th century, when Alphonse Bertillon developed a system for identifying criminals based on anthropometric measurements in 1879 [112]. The Bertillon system, or bertillonage, was introduced in the U.S. in 1887 as the primary method for identifying and tracking criminals (see http://www.nleomf.org/museum/news/newsletters/online-insider/november-2011/bertillon-system-criminal-identification.html). Although it was replaced by fingerprinting in the early 20th century, face images of criminals, now known as mugshots, are still used worldwide.

"...according to the method prescribed by Dr. Bertillon, the exact identity of any adult person can be established with so much definiteness that when signalized a second time he can be recognized with infallible certainty by a simple reference to the file in which the former signalment is kept. Even if this file represented the entire population in the country, the process of identifying two correctly-taken signalments by its means could be performed in most cases in a few minutes, without any assistance from a similarity of names." - From the publisher's preface to Signaletic Instructions Including the Theory and Practice of Anthropometrical Identification by Alphonse Bertillon, 1896

Partially automated recognition began in the mid-1960s when Woodrow W. Bledsoe developed a "man-machine" system for identification of individuals based on physiological measurements which were entered by hand (e.g., height, weight, interpupillary distance, etc.), stored in documents, and searched automatically [9]. Bledsoe understood that the results were highly dependent on the angle of the face images, so he learned a transformation from the actual 3D heads of seven individuals and applied this transformation to the measurements of any non-frontal faces, a concept that is still used in current state-of-the-art 3D face models.

Since Bledsoe's man-machine system, 50 years of research (see Jain et al. [65] for an overview) has been devoted to improving the robustness and efficiency of fully automated face recognition systems (albeit recognition results are often manually adjudicated). Every stage of the pipeline has received substantial research attention, and great progress has been made in face detection, alignment and normalization, feature extraction and representation, and comparison. The progression of face recognition from frontal, constrained face matching to unconstrained "in the wild" face matching can roughly be delineated by three face representation approaches: (i) holistic, (ii) local, and (iii) learned representations. This section briefly discusses a few methods related to these categories.

1.2.1 Face Databases

First of all, it would not be possible to discuss progress in face recognition research without reference to the standardized face image databases and evaluations that have paved the way for such success. While many researchers evaluate proposed methods on in-house databases, research progression in face recognition is primarily facilitated and motivated by the compilation and public release of face image databases.
Some of the first standardized databases on which the research community began to evaluate proposed methods are shown in Fig. 1.9. While databases such as FERET [97], FRGC [97], and AR [92] (example images shown in Fig. 1.9) greatly contributed to advancements in face recognition research, most of them were acquired under relatively controlled conditions and were compiled by research teams for studying specific subproblems of face recognition (e.g., illumination, expression, pose). Such databases allow researchers to directly evaluate performance on face images that exhibit certain variations, but they are not very representative of face images encountered in real-world scenarios.

As algorithms continued to mature in handling controlled/simulated variations in pose, illumination, expression, and occlusion, more challenging databases were needed. For this reason, Huang et al. released the Labeled Faces in the Wild (LFW) database, which was compiled by searching the internet for the names of public figures, athletes, actors/actresses, etc. [62]. The LFW database includes 13,233 face images of 5,749 different people. All face images were automatically detected by an implementation of the Viola-Jones face detector [135], so they are constrained in that respect, but the images typically exhibit multiple variations that are challenging for face recognition algorithms. Along with the database, Huang et al. released the LFW experimental protocol: 10-fold cross-validation on verification/classification of 300 same and 300 not-same face pairs per split.

1.2.2 Holistic Representation

Drawing upon Sirovich and Kirby's [119] discovery that face images can be reconstructed as projections onto a small set of eigenpictures, the Eigenfaces method, proposed in 1991 by Turk and Pentland [132], was one of the first fully automatic face recognition algorithms. A low-dimensional "face space" is calculated from a training set of N face images using principal component analysis (PCA). The face space is the set of M < N eigenvectors corresponding to the largest M eigenvalues of the covariance matrix of the training set. All faces are then represented as the weights associated with their linear projection onto the set of eigenfaces, and dissimilarity is defined as the Euclidean distance between two M-dimensional feature vectors. Turk and Pentland also use the distance to face space for automatic face detection; every pixel of an image is projected onto the face space to acquire a "face map" where low values (i.e., small distances to face space) indicate the presence of a face. Experiments conducted on 16 subjects, represented by 7 eigenfaces, showed that the Eigenface representation was fairly robust to lighting variations (96% identification accuracy) but suffered more errors with changes in head size and pose. Fisherfaces [13] is an extension of Eigenfaces that uses supervised dimensionality reduction, via linear discriminant analysis (LDA), to find the subspace that minimizes intra-person variance and maximizes extra-person variance.

These first fully automatic face recognition methods can be categorized as holistic representations, as they utilize all the facial pixels together to derive a representation.
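The Eigenfaces pipeline just described can be sketched in a few lines of NumPy; this is only an illustration (random data stands in for aligned, rasterized training faces, and the number of eigenfaces is arbitrary):

```python
import numpy as np

def train_eigenfaces(train_imgs, m):
    """train_imgs: (N, H*W) rasterized face images; m: number of eigenfaces (m < N).
    Returns the mean face and the top-m eigenfaces (principal components)."""
    mean_face = train_imgs.mean(axis=0)
    centered = train_imgs - mean_face
    # SVD of the centered data; rows of vt are eigenvectors of the covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:m]

def project(img, mean_face, eigenfaces):
    """Represent a face as its weights on the eigenfaces."""
    return eigenfaces @ (img - mean_face)

# Dissimilarity between two faces = Euclidean distance in the m-dimensional face space.
rng = np.random.default_rng(2)
train = rng.random((100, 32 * 32))          # synthetic stand-in for aligned face images
mean_face, eigfaces = train_eigenfaces(train, m=20)
d = np.linalg.norm(project(rng.random(32 * 32), mean_face, eigfaces)
                   - project(rng.random(32 * 32), mean_face, eigfaces))
```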
Figure 1.9 Example face images from different databases: (a) FERET [97], (b) FRGC [97], (c) AR [92], (d) LFW [62], and (e) IJB-A [72]. Databases (a)-(c) contain variations such as illumination, expression, and occlusion to challenge face recognition research, but they were acquired under relatively controlled conditions because such variations are simulated/staged (subjects are typically students and members of research groups). Databases (d) and (e) contain more unconstrained face images (e.g., collected from the internet).

Table 1.1 Face image databases in the public domain (P = pose, I = illumination, E = expression, O = occlusion)
Database               Year  Num. Subj. (Num. Imgs.)  Acquisition Conditions
NIST Mugshot Id [140]  1994  1,573 (3,248)            constrained, operational
FERET [97]             1996  1,199 (14,126)           simulated/staged PIE
Yale [13]              1997  15 (165)                 simulated/staged IE
AR [92]                1999  126 (4,000)              simulated/staged IEO
Yale B [48]            2001  10 (5,760)               simulated/staged PIE
CMU PIE [117]          2003  68 (41,368)              simulated/staged PIE
FRGC [97]              2005  >466 (>20,000)           simulated/staged IE
LFW [62]               2007  5,749 (13,233)           unconstrained, web-collected
CMU Multi-PIE [51]     2008  337 (>750,000)           simulated/staged PIE
MEDS [45]              2011  518 (1,219)              constrained, operational
CASIA-WebFace [146]    2014  10,575 (494,414)         unconstrained, web-collected
IJB-A [72]             2015  500 (5,712)              unconstrained, web-collected

Holistic methods heavily rely on accurate alignment (typically based on eye locations), which becomes difficult when faces are encountered that are non-frontal or contain expression variations, etc. Holistic methods also do not generalize well to new databases and have difficulty with variations not present in the training set (e.g., presence/absence of eyeglasses in training/testing).

1.2.3 Local Representation

Local representations typically perform a dense sampling of features at overlapping patches in the face image and at multiple scales. To incorporate global information, geometric relationships between features are often encoded by concatenating features extracted from either a common set of landmark points or from a grid overlaid on the face. Hence, local representations can also be sensitive to face alignment. Because the resulting set of features is often over-complete with high dimensionality, feature selection (e.g., boosting) or subspace methods (e.g., PCA, LDA) are adopted to achieve a compact face representation.

Liu et al. presented a novel augmented Gabor feature vector for face representation and proposed the Gabor-Fisher classifier (GFC) for face recognition [88]. Gabor wavelets had been used for face representation in prior works (e.g., Lades et al. [77]), but the novelty of the Liu et al. Gabor feature was the concatenation of Gabor filter responses (using five scales and eight orientations) and the subsequent application of PCA to compress the high-dimensional feature vector. They showed that the Gabor face representation with PCA performed better than both Eigenfaces and Fisherfaces (which use the original image intensity values as features). Furthermore, the GFC, which applied the Enhanced Fisher linear discriminant Model (EFM) to the compressed augmented Gabor feature vector, achieved better performance than both PCA and LDA with the Gabor feature vector. The use of EFM helps improve discrimination and generalization and alleviates the small sample size problem of FLD/LDA.

Ahonen et al. presented the first application of the local binary pattern (LBP) texture descriptor to face recognition [4]. Specifically, they used the uniform patterns extension of LBP (i.e., every circular pattern with at most two bitwise transitions contributes to its own bin in the histogram, and all other non-uniform patterns contribute to a single bin).
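A minimal sketch of the basic 8-neighbor LBP code and uniform-pattern histogram just described is given below (the neighbor ordering and bin layout are one common convention, not necessarily the exact implementation in [4]); the grid-based concatenation discussed next builds the full face representation from such region histograms.

```python
import numpy as np

def lbp_code(patch):
    """8-neighbor LBP code for the center pixel of a 3x3 patch: each neighbor
    contributes one bit, set when the neighbor is >= the center value."""
    center = patch[1, 1]
    # Neighbors in a fixed circular order (clockwise from top-left).
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                 patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 << i) for i, n in enumerate(neighbors) if n >= center)

def is_uniform(code):
    """A pattern is 'uniform' if its circular bit string has at most two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

def lbp_histogram(region):
    """Histogram of uniform LBP codes over an image region; all non-uniform
    patterns share a single bin (the last one)."""
    uniform_codes = sorted(c for c in range(256) if is_uniform(c))
    index = {c: i for i, c in enumerate(uniform_codes)}     # 58 uniform patterns
    hist = np.zeros(len(uniform_codes) + 1)
    for y in range(1, region.shape[0] - 1):
        for x in range(1, region.shape[1] - 1):
            c = lbp_code(region[y - 1:y + 2, x - 1:x + 2])
            hist[index.get(c, len(uniform_codes))] += 1
    return hist / max(hist.sum(), 1)
```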
One major contribution of Ahonen et al.'s LBP face representation was the spatially enhanced histogram. To incorporate regional and global properties in combination with the local features from the LBPs themselves, they placed a grid over the face, extracted a histogram of LBP codes for each grid location, and concatenated the results to form the final feature vector of the face. Because of this representation, different weights can be assigned to the grid locations for use with the weighted chi-square distance measure; patches that contribute more to discriminating between identities (e.g., the eyes) can be given more weight. In comparison with other local descriptors, Ahonen et al. [4] provided experiments showing that the LBP representation typically demonstrated the best performance on subsets of the FERET database, likely due to the monotonic gray-scale invariance of LBP compared to the other local descriptors.

Recently, a few extremely high-dimensional local representations (with efficient dimensionality reduction techniques) have shown impressive performance on the LFW database. For example, high-dimensional features (sampled at multiple scales on dense landmarks detected by Cao et al. [29]) with Joint Bayesian classification [33] achieve 93–95% accuracy on the LFW database for LBP, Gabor, HOG, SIFT, and LE descriptors trained on the WDRef database (99,773 images of 2,995 subjects) [34]. Most of the initial local representation methods have now been categorized as methods based on "handcrafted" or "engineered" features because the image filters are pre-defined and performance typically depends on fine tuning of the sampling radii and scales.

1.2.4 Learned Representation

Motivated by the drawbacks of handcrafted local descriptors such as LBP, Gabor, and SIFT, Cao et al. proposed a learning-based (LE) descriptor [30]. LE descriptors are extracted by sampling a ring-based pattern from the neighborhood of each pixel to form a low-level feature vector, which is then normalized to unit length. Cao et al. applied unsupervised learning methods (K-means, PCA tree, or random-projection tree) to encode the feature vectors into discrete codes. Face images are represented as "code images," and histograms of LE codes can be extracted from grid locations and concatenated to form the final face representation. Cao et al. showed that the distribution of the LE descriptors is more uniform across face images than LBP and histogram of oriented gradients (HOG), and the descriptors are therefore more informative, discriminative, and compact [30]. The LE descriptors are combined with a pose-adaptive matching method which aligns and matches nine components of the face separately, combines their similarity scores, and delegates the verification decision to a linear SVM classifier that has been trained on the two poses most similar to the input face images. Experimental results on the LFW (84.45% accuracy) and Multi-PIE (95.19% accuracy) databases show that the LE descriptors with pose-adaptive matching perform better than other methods trained in the same manner and are competitive with methods trained using additional information [30].

1.2.4.1 Deep ConvNets

More recently, deep neural networks have achieved impressive results for many visual recognition tasks [75], including face recognition.
Neural networks are not new (e.g., perceptrons were first developed in the 1950s); however, network models with many hidden layers (deep structures) can now be trained due to better regularization strategies and the availability of large face databases and greater processing capabilities. Again, rather than relying on handcrafted features, face representations are learned by deep convolutional neural networks (ConvNets) trained to classify identities (or verify pairs of face images) from large-scale training sets of face images. The dimensionality of the feature representation is hierarchically reduced due to the structure of convolutional and pooling layers; both low-level and global features are learned in a cascaded manner. Commonly, the output of the last hidden layer (prior to the classification layer) has been shown to provide a highly robust face representation for new face images at test time [123, 128]. While the specific architectures of the networks in [123, 128, 146] are all different, their high recognition performance can generally be attributed to a few common properties: better regularization strategies for learning very deep structures (4-11 layers), availability of large-scale training databases (e.g., > 4 million images [128]), and access to faster and cheaper computational resources.

However, the success of these deep ConvNet approaches is not due to sophisticated learning and large-scale training sets alone; many of these methods also include additional preprocessing and/or post-processing steps that further boost performance. For example, Taigman et al. directly feed raw RGB pixel values as input to their deep ConvNet under the assumption that their 3D face frontalization is successful [128]. This strong assumption would not have been reasonable a few years ago, when 3D frontalization from unconstrained 2D images was not accurate and robust. Sun et al. train multiple deep ConvNets on various face patches at different scales [121], and DeepID [123] (and its variants) utilize the Joint Bayesian classification method [33] on their deep representation; Joint Bayesian [33] is a supervised subspace learning approach that has achieved high accuracies with other face representations as well (e.g., [34, 85]).

1.3 Video Face Recognition

Face recognition in video is becoming increasingly important due to the abundance of video data captured by surveillance cameras and mobile devices, uploaded to the Internet, etc. Given the aggregate of facial information contained in a video (i.e., a sequence of face images or frames), video-based face recognition solutions can potentially alleviate classic challenges caused by variations in pose, illumination, and expression.

A summary of the common public-domain databases used to evaluate video-based face recognition algorithms can be found in Table 1.2. Of particular interest for these databases is the number of subjects available, and whether the activities of the subjects were constrained or unconstrained (e.g., subjects were directed to move in certain ways vs. subjects acting naturally in an environment).
Table 1.2 Characteristics of popular face video databases in the public domain
Database                    Acquisition Conditions                                           Subjects  Videos  Accuracy
Motion of Body (MoBo) [52]  Treadmill walking: slowly, quickly, on incline, or with a ball   25        150     98.8% [89]
Face in Action (FIA) [49]   Variations in expressions and orientations; indoor/outdoor       221       n/a     99% [101]a
1st Honda/UCSD [79]         Staged head rotations and expressions                            20        75      99% [131]
MBGC [103]                  Walking, activity, conversation; standard and high resolutions   821       3,764   see [103]
YouTube Celebrity [69]      Unconstrained, many same-subject tracks from the same video      47        1,910   78.9% [145]
YouTube Faces [141]         Unconstrained                                                    1,595     3,425   54.8% [128]b
IJB-A [72]                  Unconstrained                                                    500       2,085   40.6% [72]b
a Authors used an indoor subset of FIA. b TAR @ 1.0% FAR.

Figure 1.10 Example images from face tracks of two subjects in the YouTube Faces (YTF) database. The top two and bottom two rows are face tracks from the same subject.

Notably, the YouTube Faces (YTF) database [141] contains the largest number of subjects, and the faces in the video tracks are relatively more unconstrained than in other face video databases. The MBGC video data [103] also has strong relevance to unconstrained faces in video, but the YTF database is more widely used for the following reasons: (i) it contains the largest number of subjects, (ii) the actions of the subjects are naturally varied (as opposed to performing prescribed actions), (iii) the YTF database is easier to acquire (thus allowing the baselines to be used by the research community at large), and (iv) all subjects in the YTF database also have still images available in the LFW database [62] (thus allowing baselines to be compared to the video-to-still image matching scenario). The IJB-A database [72] also contains unconstrained face videos, but with fuller pose variations and lower quality faces than the YTF database (faces in YTF were detected by the Viola-Jones detector, which can miss faces at extreme poses, while faces in IJB-A were manually annotated by humans).

Video-based face recognition approaches have been organized into the following two categories [10] based on how they leverage the multitude of information available in a video sequence: (i) sequence-based and (ii) set-based. At a high level, what most distinguishes these two approaches is whether or not they utilize temporal information. Sequence-based approaches consider all detected faces based on their temporal ordering. For example, Zhou et al. combined both face tracking and face recognition into a single framework, which allowed the inter-frame dynamics to be exploited during the recognition process [152]. See [10] for more details about sequence-based methods.

Set-based approaches to video-based face recognition consider all the available frames of a subject's face as an unordered set. Such methods have been further organized into approaches that fuse the available information prior to matching and those that fuse information after performing matching [10]. Methods that fuse information prior to matching generally output either a feature vector representation or a single face image. For example, manifold-based methods project the set of face images onto a manifold within a feature space, which in turn facilitates matching within the feature space [80, 138]. Manifold methods are similar to sequence-based methods in that they require specialized matching algorithms. Both super-resolution methods [6] and 3D modeling-based methods [101] output a single face image that in turn can be matched with an existing face recognition system.
Thus, while such synthesis-based methods attempt to solve a difficult generative modeling task, they are compatible with existing face recognition engines. A few commercial solutions are available for such synthesis methods, though they are only semi-automated and hence more relevant to forensic applications. Finally, set-based methods that fuse information after the face matching process seek to combine the comparison scores from static face matchers into a single similarity score. For example, Taigman et al. [128] randomly selected 100 pairs of frames from two videos and used the mean of the pairwise similarity scores as the similarity score between the two videos; this simple extension of their static image-based method (i.e., DeepFace) achieves 91.4% accuracy on the YTF database. Yi et al. also applied their deep ConvNet approach to video data in a similar manner (randomly selecting 15 frames from each video) and likewise achieve high accuracy (92.2%) on the YTF database. As on the LFW database, deep ConvNet methods currently outperform all other methods on the YTF database.

1.4 Face Image Quality

Face recognition system errors are often due to quality issues at the time of acquisition of the face image. In constrained and controlled capture environments (e.g., passport and mugshot photos), low quality face images are typically due to operator issues or uncooperative users. Many users of face recognition systems are unaware of the sensitivity of automatic face recognition systems to illumination, facial pose, expression, eyeglasses, etc., or subjects may be uncooperative (e.g., for mugshot photos). In unconstrained scenarios, ranging from surveillance imagery to face images available on the internet, low quality face images are unavoidable due to the nature of the applications. Available face images are either not collected for use with identification documents and face recognition purposes, or face images are captured covertly, where subjects are unaware of the acquisition or purposely do not want a good quality face image to be acquired.

Following the accepted definition of biometric sample quality, a face quality measure should be predictive of automatic face recognition performance [5, 23, 56]. Hence, a face image determined to be of low (poor) quality should result in low genuine and high impostor similarity scores, and a high (good) quality face image should result in high genuine and low impostor similarity scores. The benefits of an automatic measure of face quality are similar to the benefits of automatic quality measures for any other biometric trait (e.g., fingerprint or iris) [56]. Some examples include the following:

• To assist with the integrity of enrollment face databases, automatic quality measures could be integrated into face image acquisition protocols, where the process cannot be completed until a face image of the desired quality has been acquired. The quality measures could also be applied retroactively to legacy face databases to "flag" low quality images which have been previously enrolled.

• Similarly, an automatic quality check could be incorporated at the time of verification or identification in controlled and constrained scenarios where capture of additional face images is possible if necessary.
Rather than returning a false match or false non-match, where the operator (or user) would need to decide whether to attempt the process again, the system could have a "reject option" where no decision is given unless the query face image is of sufficient quality. If the acquired face image does not pass a quality check, the user can be prompted to provide a better quality face image.

• A face quality measure can be used to weight face image samples for fusion of different biometric traits (e.g., face and iris) or of multiple face images (and/or video frames) in media collection scenarios such as those explored in Chapter 2.

• Automatic invocation of adaptive recognition systems based on quality (e.g., fusion of multiple matchers may boost performance when face quality is poor, but fusion could be avoided for high quality samples, where the additional computation is unnecessary or fusion may even degrade performance). Hence, it may be useful to have both matcher-independent and matcher-dependent quality measures.

Bharadwaj et al. [23] and Alonso-Fernandez [5] provide recent reviews of biometric sample quality for fingerprint, face, and iris. The most widely used biometric sample quality measure has without a doubt been NIST Fingerprint Image Quality (NFIQ v1.0 [125] and NFIQ v2.0 [124]); with wide acceptance, it is now the de facto standard for assessing fingerprint image quality in many important applications, including the US-VISIT program [96]. NFIQ is an integer value from 1 to 5 (1 being the highest quality) that predicts the expected performance of fingerprint matching algorithms on a given fingerprint image. In comparison, face image quality has been studied in the literature (e.g., [22, 35, 42]), but to the best of our knowledge, no satisfactory solutions are yet available from either the research community or commercial vendors.

1.5 Facial Aging

Because of the natural process of aging, appearance changes that affect both facial shape and texture are inevitable. Hence, the permanence/persistence of the face as a biometric tends to be lower than that of the other primary biometric traits (i.e., fingerprint and iris). Unlike other factors such as pose and illumination, aging variations cannot be controlled; facial aging is a challenge that spans both constrained and unconstrained applications of face recognition.

A common approach to handle the issue of faces changing over time is "template update," where subjects' enrolled samples are periodically updated. For example, driver's licenses and passport photos must be renewed every few years. While template update is effective, there are many applications where it is not a viable solution (e.g., de-duplication, identification of missing persons, surveillance and watch-list scenarios). To address this problem, there have been two primary approaches in the literature: (i) age simulation/progression of face images prior to feature extraction and matching (e.g., [78, 102]) and (ii) recognition methods which utilize "age-invariant" features and/or subspaces (e.g., [67, 83, 87]). Ramanathan et al. provide a survey of approaches related to facial aging [111].

1.6 Benchmarking State of the Art

Progress in face recognition research has largely been driven by systematic large-scale evaluations of current methods, which not only encourage competition but also help to identify future areas of research.
While the research community attempts to benchmark published methods against each other, public access to large operational databases has been limited. Hence, third-party evaluations conducted by the National Institute of Standards and Technology (NIST) are invaluable for knowledge of current state-of-the-art algorithms; NIST has access to large operational databases and conducts extensive testing of multiple algorithms on protocols that mimic operational scenarios (see http://www.nist.gov/itl/iad/ig/face.cfm). In particular, commercial vendors, whose algorithms are typically proprietary, submit their algorithms for the NIST evaluations. The research community should pay close attention to the results of these tests; actual state-of-the-art methods are different from "home-brewed" algorithms evaluated on small "in-house" or lab-collected databases.

To measure progress in face recognition, we can track the results of the various NIST evaluations, which began in September 1993 with the FERET program [107]. At that time, face recognition systems were limited to prototypes from research labs and universities, and few were fully automatic. Commercial systems have since been evaluated in multiple Face Recognition Vendor Tests (FRVTs). Table 1.3 shows that the FRVTs (and the MBE [57]) have documented continuously increasing TARs at 0.1% FAR on frontal, constrained face images from 2000 to 2013; a decrease in error rates of approximately three orders of magnitude has been observed.

Table 1.3 Face recognition performance on frontal, constrained face images as reported over the years in NIST evaluations
Evaluation       Year     Rank-1 Accuracy  Gallery Size  TAR @ 0.1% FAR
FERET [107]      1993/94  78%              316           21%
FERET [107]      1996/97  95%              831           46%
FRVT [106]       2002     73%              37,437        80%
FRVT [108]       2006     n.a.             n.a.          99%
FRGC v2.0 [105]  2005     n.a.             16,028        99%
MBE [57]         2010     92%              1.6M          >99%
FRVT [55]        2014     96%              1.6M          n.a.

The most recent NIST evaluation, FRVT 2013, focused on large-scale face identification, both closed-set and open-set [55]. While closed-set accuracies of the top six commercial vendors were quite high (the best was a 4.1% rank-1 miss rate), open-set accuracies decreased significantly at a FAR of 0.2% (the best was a 7.5% rank-1 detection and identification miss rate). These evaluations have also experimented with face images captured in less ideal conditions (i.e., non-uniform lighting, lower resolution, non-frontal); the FRGC and FRVT evaluations identified pose, illumination, and outdoor imagery as especially challenging for algorithms. However, with the exception of the webcam face images used in FRVT 2013, most of these evaluations of other factors have been on databases with staged variations (i.e., lab collected).

1.6.1 Unconstrained Face Recognition

Current state-of-the-art methods for unconstrained face recognition have been benchmarked by the LFW database protocol since its release in 2007 [62]. Numerous methods have been evaluated on the LFW protocol (almost 60 publications listed on the LFW results website, http://vis-www.cs.umass.edu/lfw/results.html, at the time of writing). Recently, the LFW protocol has been dominated by convolutional neural network approaches with reported accuracies of 97–99% (e.g., [121, 128, 146]).
As previously discussed, these high accuracies are largely due to the use of large-scale training databases external to LFW; methods which leverage outside training data (the ConvNet methods already mentioned, as well as, e.g., [30, 34, 90]) have proven to achieve much higher accuracies than methods that only train on LFW face images (current best accuracies are 95.89% [7] and 88.97% [81]). The public availability of the LFW database has greatly contributed to advancements in the development of face recognition techniques that are robust to variations in pose, illumination, expression, etc., by facilitating competition amongst research teams, as well as the goal of outperforming humans [76]. However, there are a few limitations of the LFW protocol, as discussed in the next section.

1.6.1.1 Drawbacks of the LFW Protocol

The LFW database protocol was designed for the classification task of determining whether a pair of face images is of the same person (i.e., genuine) or different persons (i.e., impostor). Hence, the LFW protocol is specifically an evaluation of face verification. While face verification is a real-world biometric scenario, the LFW protocol suffers from the following limitations:

• Many methods that use the LFW protocol only report the accuracy of their final classifier that determines same vs. not-same face pairs. However, in a biometric verification system, we typically do not require a classifier to make a binary decision. A face recognition system will be deployed, and the system administrators will determine the threshold at which they wish to operate the system (depending on the security and usability requirements of the application domain). Hence, a full receiver operating characteristic (ROC) curve should be reported to demonstrate performance across different thresholds.

• Because of the above point, biometric systems should especially be tested at low false accept rates (FARs), as this is typically where most applications operate (e.g., FARs well below 1%). The LFW protocol, which contains only 300 impostor scores per cross-validation fold, does not allow for evaluation at these low FARs (the lowest measurable FAR is 1/300 ≈ 0.3%, and even that estimate is not statistically reliable).

• Many unconstrained face recognition scenarios require face identification, rather than verification. While verification and identification are related, DeCann and Ross show that a good verification system does not necessarily imply a good identification system (and vice versa) [40]. Hence, unconstrained face recognition methods should also be evaluated in identification modes (both closed-set and open-set).

Because of these drawbacks, the LFW protocol design has received recent criticism [85, 129, 151], and research focus is beginning to shift towards evaluation in more realistic biometric settings. In 2014, Liao et al. released a new unconstrained face recognition protocol: the Benchmark of Large-scale Unconstrained Face Recognition (BLUFR) [85]. The protocol is still 10-fold cross-validation but exploits the large number of face images available in the LFW database; BLUFR has both verification and open-set identification protocols consisting of about 157,000 genuine scores and 47 million impostor scores per fold (see http://www.cbsr.ia.ac.cn/users/scliao/projects/blufr/).
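A minimal sketch of how an operating point such as TAR at a fixed FAR is estimated from genuine and impostor score sets is shown below (the synthetic Gaussian scores are purely illustrative); it also makes concrete why a fold with only 300 impostor scores cannot resolve FARs much below 0.3%:

```python
import numpy as np

def tar_at_far(genuine, impostor, far=0.001):
    """True accept rate at a fixed false accept rate: the threshold is the
    (1 - far) quantile of the impostor scores, and TAR is the fraction of
    genuine scores at or above that threshold."""
    threshold = np.quantile(impostor, 1.0 - far)
    return float(np.mean(np.asarray(genuine) >= threshold))

# With only 300 impostor scores (one LFW fold), the empirical FAR is quantized
# in steps of 1/300 ~= 0.33%, so operating points such as 0.1% FAR cannot be
# estimated reliably; large impostor sets (e.g., BLUFR) are needed for that.
rng = np.random.default_rng(3)
genuine = rng.normal(0.7, 0.1, size=3000)    # synthetic similarity scores
impostor = rng.normal(0.3, 0.1, size=300)
print(tar_at_far(genuine, impostor, far=0.01))
```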
Liao et al. [85] provide results on the BLUFR protocol for some benchmark methods, including Chen et al.'s high-dimensional LBP with Joint Bayesian classification [33, 34]; while Chen et al.'s approach achieves 95% accuracy on the LFW protocol, the accuracies drop significantly for the more challenging BLUFR protocol (see Table 1.4). Similarly, deep neural network approaches (e.g., Yi et al. [146] and Wang et al. [136]) achieve ∼98% accuracy on the LFW protocol, but only about 90% and 56%, respectively, on the BLUFR verification and open-set identification protocols (see Table 1.4). These results demonstrate that accuracies of ∼99% on the LFW protocol are misleading; there is still room for improvement in scenarios more representative of real-world (i.e., large-scale) biometric applications.

Table 1.4 Comparison of performance on the LFW [62] vs. BLUFR [85] protocols
Method                         LFW Accuracy (%)  BLUFR TAR @ 0.1% FAR  BLUFR DIR @ 1% FAR
HighDimLBP + JointBayes [34]*  95.17             41.66                 18.07
Yi et al. [146]                97.73             80.26                 28.90
Wang et al. [136]              98.20             89.80                 55.90
* Performance for [34] on the BLUFR protocol was reported by [85].

The YTF database protocol (i.e., 10-fold cross-validation of 250 same and 250 not-same pairs per fold [141]) is the video equivalent of the LFW protocol and shares the same drawbacks. An additional issue with these two databases is that web-collected data can easily contain labeling errors. Because the LFW and YTF protocols contain so few face pairs, these errors may be significant. By studying human performance (via crowdsourcing on Amazon Mechanical Turk), we discovered 111 errors out of the 2,500 genuine pairs in the YTF protocol [16]. Some of the errors were due to the difficulty of verifying ground truth given the temporal aspect of videos; the person of interest may not appear in the video until a few or many frames into the sequence. Databases that have been reliably annotated with ground truth labels prior to release, such as the IJB-A database [72], are invaluable to the research community.

An additional limitation of the LFW and YTF unconstrained face databases is that they were both compiled using a commodity face detector, namely an implementation of the Viola-Jones algorithm [135]. While automatic face detection facilitates the collection of large-scale face databases, this property immediately places a constraint on the collected face images, which are supposed to be unconstrained: Viola-Jones-based algorithms (and most other existing face detectors) perform best on near-frontal face images [36]. Additionally, poor illumination, extreme expression, and occlusions can also cause face detection to fail. Hence, current research efforts in unconstrained face recognition have been optimizing automatic face recognition only for those faces which can be detected by these commodity detectors.

For the reasons mentioned above, the IARPA Janus program released a new unconstrained face database, IARPA Janus Benchmark A (IJB-A), which is a joint face detection and recognition database [72]. IJB-A contains 500 subjects with an average of 11.4 face images and 4.2 videos per subject. All faces in both images and video frames were annotated manually via sophisticated crowdsourcing methods using Amazon Mechanical Turk [126].
Because all faces are detected by humans (rather than automatically detected by a Viola-Jones face detector), the IJB-A database contains larger ranges of variation (particularly facial pose) that degrade the performance of current face detection and recognition approaches. The IJB-A face recognition challenge managed by NIST (https://www.nist.gov/programs-projects/face-challenges) is a "template-based" matching scenario where each sample is a composite of still images and video frames of the same subject; the goal is to leverage complementary information that may be available in multiple unconstrained faces. The current leaderboard accuracies for the IJB-A challenge (reports are periodically updated; the results here are from Nov. 2016) are the following: 82% TAR @ 1% FAR (1:1 verification), 88% rank-1 accuracy (1:N closed-set), and 53% TPIR @ 1% FPIR (1:N open-set).

1.6.2 Age-Invariant Face Recognition

State-of-the-art age-invariant face recognition systems in the literature are currently benchmarked on the FG-NET [78] and MORPH [113] databases; a number of methods claim to improve the "age-invariance" of face recognition by reporting overall performance on FG-NET and/or MORPH. For example, [50] reports rank-1 identification accuracies of 69.0% and 91.1% on the FG-NET and MORPH-II databases, respectively. Using the periocular region, Xu et al. [67] reported 100% rank-1 accuracy and 98% TAR at 0.1% FAR on FG-NET. However, an overall performance improvement on a specific database does not necessarily indicate a good solution to the facial aging problem. Klare and Jain demonstrate that methods developed (trained) for age-invariance may actually decrease performance in non-aging scenarios [70]. Furthermore, simply stating accuracies on an entire longitudinal database does not provide any quantification of facial aging as a covariate of face recognition (i.e., how much impact specific ages or time lapses have on comparison scores and/or accuracies).

To further study facial aging, most researchers divide the database into partitions (of age groups or elapsed times) and report performance for each partition. Performance trends across increasing age group or elapsed time are then evaluated. While this approach provides empirical notions of how facial aging affects the performance of systems, covariate analysis is needed to account for the effects of other factors (e.g., pose, image quality) that also play a role in performance. In particular, the FG-NET database contains a number of other variations that can make recognition difficult, in addition to those related to facial aging (see Fig. 1.11).

Longitudinal databases are difficult to acquire because images of the same subjects need to be collected over time. A database for studying facial aging should consist of both a large number of subjects and a large number of images per subject collected over time. While the FG-NET and MORPH databases have been essentially the only publicly available databases for studying facial aging, they are not ideal for longitudinal study for the following reasons:

• FG-NET contains only 82 subjects in total, and 48% of the 1,002 total face images are of subjects younger than 13 years. Even with small elapsed times, face recognition of children is still an open research problem; the FRVT 2013 [55] reported that all of the top six commercial algorithms suffered an especially large decrease in performance for all age groups under 13 years.
• While the largest commercial version of the MORPH database has about 20,000 subjects, there is an average of only 4 face images per subject. Additionally, there are only 317 subjects with more than 5 face images collected over at least 5 years.

Hence, if we wish to study how facial changes of individuals affect face recognition performance over time, we need to leverage a database that is both fairly constrained with respect to other covariates and contains a large number of images per subject, acquired over periods of time long enough for facial changes due to aging to occur.

Figure 1.11 Face images of two example subjects from the FG-NET database [78]: (a) female at ages 3–38 years and (b) male at ages 19–63 years. As shown in these examples, the FG-NET database contains a significant amount of variation (pose, illumination, inter-pupillary distance, image quality, etc.), in addition to intrinsic variations due to facial aging.

Figure 1.12 Face images and corresponding ages (in years) of three example subjects from the MORPH database [113]. The largest commercial version of MORPH has 78,207 face images of 20,569 subjects. However, there are only 317 subjects with at least 5 images acquired over at least 5 years (these are three of the 317).

1.7 Contributions

Automatic face recognition has been an extensively studied topic for more than two decades. Significant advancements in the technology have been realized in numerous subtasks needed for robust recognition (face detection, alignment, feature extraction, matching). However, as the technology moves from research problems to real-world deployments, it is imperative that the research be driven by the requirements of these real-world scenarios. In summary, this introduction has highlighted a few limitations of current research in unconstrained face recognition and studies on facial aging, particularly with respect to how these two challenging problems are benchmarked and evaluated. The contributions of this thesis are the following:

1. Experimental protocols are developed for identification of unconstrained face images. Baseline results using a state-of-the-art COTS face matcher and a separate 3D face modeler are provided for both closed-set and open-set scenarios.

2. A framework is provided for matching a collection of face media (image(s), video(s), 3D model(s), demographic data, and sketch) to mitigate the challenges associated with unconstrained face recognition (uncooperative subjects, unconstrained imaging conditions) and to boost recognition accuracy in scenarios where multiple instances of the face may be available (e.g., persons of interest on a watch list).

3. An automatic measure of face image quality is proposed which can be used to reject low-quality face images prior to matching and to rank a collection of face images in order of quality (e.g., to determine which face image to put in the gallery or which face images to use to build a 3D face model).

4. The largest (to date) longitudinal study of face recognition performance is conducted to determine the state-of-the-art robustness to facial aging. The study involves two operational mugshot databases consisting of (i) 147,784 images of 18,007 subjects and (ii) 31,852 images of 5,636 subjects; each subject has a minimum of 4 mugshots collected over an average of 8.5 and 5.8 years for the two databases, respectively.
Mixed-effects regression models are used to analyze trends in genuine scores over time (i.e., as subjects age) and to quantify subject-specific variability. This provides estimates of how many years of aging are tolerated by face matchers, e.g., before 95% of the population's genuine scores drop below the threshold at 0.1% FAR. The effects of demographics (age, gender, race) and face image quality are also analyzed.

1.8 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 focuses on utilizing a face media collection to improve unconstrained face recognition accuracy. Chapter 3 investigates human assessments of the quality of a large database of unconstrained face images and proposes an automatic measure of face image quality. Chapter 4 provides a longitudinal study of automatic face recognition which utilizes multilevel statistical models for a covariate analysis of elapsed time and other factors. Chapter 5 concludes the thesis with a summary of contributions and future work.

Chapter 2 Face Recognition with Media Collection

2.1 Introduction

As face recognition applications progress from constrained imaging and cooperative subjects (e.g., identity card de-duplication) to unconstrained imaging scenarios with uncooperative subjects (e.g., watch list monitoring), a lack of guidance exists with respect to optimal approaches for integrating face recognition algorithms into large-scale applications of interest. In this work we explore the problem of identifying a person of interest given a variety of information sources about the person (face image, surveillance video, face sketch, 3D face model, and demographic information) in both closed-set and open-set identification modes.

Identifying a person based on unconstrained face images is an increasingly prevalent task for law enforcement and intelligence agencies. In general, these applications seek to determine the identity of a subject based on one or more probe images or videos, where a top-100 ranked list retrieved from the gallery (for example) may suffice for analysts (or forensic examiners) to identify the subject [64].

Figure 2.1 A collection of face media for a particular subject may consist of (a) multiple still images, (b) a face track from a video, (c) a forensic sketch, (d) a 3D face model of the subject derived from (a) and/or (b), and demographic information (e.g., gender, race, and age). The images and video track shown here are from [62, 141]. The sketch was drawn by a forensic sketch artist after viewing the face video. In other applications, sketches could be drawn by an artist based on a verbal description of the person of interest.
High profile crimes such as the Boston Marathon bombings often rely on data extracted by significant manual effort to identify the person of interest: "It’s our intention to go through every frame of every video [from the marathon bombings]," Boston Police Commissioner Ed Davis1 1 http://www.washingtonpost.com/world/national-security/boston-marathon-bombings-investigatorssifting-through-images-debris-for-clues/2013/04/16/1cabb4d4-a6c4-11e2-b029-8fb7e977ef71 story.html 41 While other routine, but high value, crimes such as armed robberies, kidnappings, and acts of violence require similar identifications, only a fraction of the manual resources are available to solve these crimes. Thus, it is paramount for face recognition researchers and practitioners to have a firm understanding of optimal strategies for combining multiple sources of face information, collectively called face media, available to identify the person of interest. While forensic identification is focused on human-driven queries, several emerging applications of face recognition technology exist where it is neither practical nor economical for a human to have a high degree of intervention with the automatic face recognition system. One such example is watch list identification from surveillance cameras, where a list of persons of interest are continuously searched against streaming videos. Termed as open-set recognition, these challenging applications will likely have better success as unconstrained face recognition algorithms continue to develop and mature [28]. While a closed-set identification system deals with the scenario where the person of interest is assumed to be present in the gallery, and always returns a non-empty candidate list, an open-set identification system allows for the scenario where the person of interest is not enrolled in the gallery, and so can return a possibly empty candidate list [82]. We provide experimental protocols, recognition accuracies on these protocols using COTS face recognition and 3D face modeling algorithms, and an analysis of the integration strategies to improve operational scenarios involving open-set recognition. 2.1.1 Overview In forensic investigations, manual examination of a suspect’s face image against a mug shot database with millions of face images is prohibitive. Thus, automatic face recognition techniques are utilized to generate a candidate suspect list. As shown in Fig. 2.2, forensic investigations using face images typically involve six stages: obtaining face media, preprocessing, automatic face matching, generating a suspect list, human or forensic analysis, and 42 Feedback Obtain Preprocessing face media Single media is used as input each time Automatic face matching Suspect list Human analysis Suspect Identification Figure 2.2 Forensic investigations by law enforcement agencies using face images typically involve six main stages: obtaining face media, preprocessing, automatic face matching, generating a suspect list, human analysis, and suspect identification. Feedback occurs after human analysis reveals that, for example, additional preprocessing of the input image (e.g., illumination correction and/or manual eye locations), demographic filtering of the gallery, and/or a different face sample from the media collection is necessary. suspect identification.2 The available forensic data or media of the suspect may include still face image(s), video track(s), a face sketch, and demographic information (e.g., age, gender, and race) as shown in Fig. 2.3. 
While traditional face matching methods take a single media (i.e., a still face image, video track, or face sketch) as probe to generate a suspect list, a media collection is expected to provide more identifiable information about a suspect. The proposed approach contributes to forensic investigations by taking into account the entire media collection of the suspect to perform face matching. This approach generates a single candidate suspect list (rather than a separate list for each face sample in the collection), thereby reducing the amount of human analysis needed.

In this work, we examine the use of commercial off the shelf (COTS) face recognition systems with respect to the aforementioned challenges in large-scale unconstrained face recognition scenarios. First, the efficacy of forensic identification is explored by combining two public-domain unconstrained face databases, Labeled Faces in the Wild (LFW) [62] and YouTube Faces (YTF) [141], to create sets of multiple probe images and videos to be matched against a gallery consisting of a single image for each subject. To replicate forensic identification scenarios, we further populate our gallery with one million operational mug shot images from the Pinellas County Sheriff's Office (PCSO) database (http://biometrics.org/bc2010/presentations/DHS/mccallum-DHS-Future-Opportunities.pdf). Using this data, we are able to examine how to boost the likelihood of face identification through different fusion schemes, incorporation of 3D face models and hand drawn sketches, and methods for selecting the highest quality video frames. Researchers interested in improving forensic identification accuracy can use this competitive baseline (on the public-domain databases LFW and YTF) to provide more objectivity towards such goals.

Figure 2.3 Schematic diagram of a person identification task given a face media collection as input.

Most of the work on unconstrained face recognition using the LFW and YTF databases has been reported in verification scenarios [98, 137]. However, in forensic investigations, it is the identification mode that is of interest, especially the open-set identification scenario where the person of interest may not be present in legacy face databases. The contributions of this work are summarized as follows:

• We show, for the first time, how a collection of face media (image(s), video(s), 3D model(s), demographic data, and sketch) can be used to mitigate the challenges associated with unconstrained face recognition (uncooperative subjects, unconstrained imaging conditions) and boost recognition accuracy.

• Unlike previous studies that report results in verification mode, we present results for both open-set and closed-set identification, which are the norm in identifying persons of interest in forensic and watch list scenarios.

• We present effective face quality measures to determine when the fusion of information sources will help boost identification accuracy. The quality measures are also used to assign weights to different media sources in fusion schemes.
• To demonstrate the effectiveness of media-as-input for the difficult problem of unconstrained face recognition, we utilize a state of the art COTS face matcher and a separate COTS 3D face modeler, namely the Aureus 3D SDK provided by CyberExtruder (http://cyberextruder.com/products/aureus-3d-sdk/). Face sketches were drawn by forensic sketch artists after viewing low quality videos. In the absence of demographic data for the LFW and YTF databases, we used crowdsourcing to obtain estimates of gender and race. The above strategy allows us to show the contribution of the various media components as we incrementally add them as input to the face matching system.

• Pose-corrected versions of all face images in the LFW database, pose-corrected video frames from the YTF database, forensic sketches, and experimental protocols used in this work have been made publicly available at http://biometrics.cse.msu.edu/pubs/databases.html.

The remainder of this chapter is organized as follows. In Section 2.2, we briefly review published methods related to unconstrained face recognition. We detail the proposed face media collection as input and the media fusion method in Sections 2.3 and 2.4, respectively. Experimental setup and protocols are given in Section 2.5, and experimental results are presented in Section 2.6. We conclude this work in Section 2.7.

2.2 Related Work

The release of the public-domain database Labeled Faces in the Wild (LFW) (http://vis-www.cs.umass.edu/lfw/) in 2007 spurred interest and progress in unconstrained face recognition. The LFW database is a collection of 13,233 face images, downloaded from the Internet, of 5,749 different individuals such as celebrities, public figures, etc. [62]. These images were selected because they meet the criterion that the faces can be successfully detected by the Viola-Jones face detector [135]. Despite this property, the LFW database contains significant variations in facial pose, illumination, and expression, and many of the face images are occluded. The LFW protocol consists of face verification based on ten-fold cross-validation, each fold containing 300 "same face" and 300 "not-same face" image pairs.

The YouTube Faces (YTF) database (http://www.cs.tau.ac.il/~wolf/ytfaces/), released in 2011, is the video equivalent of LFW for unconstrained face matching in videos. The YTF database contains 3,425 videos of 1,595 individuals. The individuals in the YTF database are a subset of those in the LFW database. Faces in the YTF database were also detected with the Viola-Jones face detector at 24 fps, and face tracks were included in the database if there were at least 48 consecutive frames of that individual's face. Similar to the LFW protocol, the YTF face verification protocol consists of ten-fold cross-validation, each fold containing 250 "same face" and 250 "not-same face" track pairs.

Figure 2.4 Example (a) face images from the LFW database and (b) face video tracks from the YTF database. All faces shown are of the same subject.

Figure 2.4 shows example face images and video tracks from the LFW and YTF databases for one particular subject. In this work, we combine these two databases to evaluate the performance of face recognition on unconstrained face media collections. We provide a summary of related work on unconstrained face recognition, focusing on various face media matching scenarios, in Table 2.1. We emphasize that most prior work has evaluated unconstrained face recognition methods in the verification mode.
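Throughout this chapter, verification performance is summarized as the True Accept Rate (TAR) at a fixed False Accept Rate (FAR), estimated per fold of the protocols above. The following minimal Python sketch (with synthetic placeholder scores rather than output of the COTS matcher) illustrates how TAR at a fixed FAR can be computed from genuine and impostor score sets and averaged over ten folds.

```python
import numpy as np

def tar_at_far(genuine, impostor, far=0.001):
    """True Accept Rate at a fixed False Accept Rate.

    The decision threshold is set at the (1 - far) quantile of the
    impostor scores; TAR is the fraction of genuine scores at or above it.
    """
    threshold = np.quantile(np.asarray(impostor, dtype=float), 1.0 - far)
    return float(np.mean(np.asarray(genuine, dtype=float) >= threshold))

# Hypothetical ten-fold protocol: each fold has its own genuine and impostor
# pairs (300 each per fold in LFW, 250 each per fold in YTF).
rng = np.random.default_rng(0)
folds = [(rng.normal(2.0, 1.0, 300), rng.normal(0.0, 1.0, 300)) for _ in range(10)]
fold_tars = [tar_at_far(gen, imp, far=0.01) for gen, imp in folds]
print("TAR @ 1%% FAR over 10 folds: %.3f +/- %.3f"
      % (np.mean(fold_tars), np.std(fold_tars)))
```

With only a few hundred impostor pairs per fold, the threshold at 1% FAR rests on just a handful of impostor scores, which is one motivation for the stricter protocols discussed below.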
While fully automated face recognition systems are able to achieve ∼99% True Accept Rate (TAR) at 0.1% False Accept Rate (FAR) under constrained imaging and cooperative subject conditions, face recognition in unconstrained environments remains a challenging problem [97]. However, face verification accuracies on the LFW protocol have recently seen drastic improvements. When utilizing outside training data, recent works have achieved TARs greater than 94% at 1% FAR and classification accuracies over 97% (e.g., [122, 128]). However, at 1% FAR, the LFW protocol only contains three impostor scores per fold, so these saturated accuracies may overestimate the abilities of face recognition systems on unconstrained faces. Liao et al. propose a new benchmark for LFW which allows for evaluation at lower FARs; out of three features and seven learning algorithms, they find the best performance to be 42% and 66% TAR at 0.1% and 1% FAR, respectively [85]. Open-set identification performance is even lower, at 18% Rank-1 detection and identification rate at 1% FAR [85].

Unconstrained face recognition methods can be grouped into two main categories: single face media based methods and face media collection based methods. Single media based methods focus on the scenario where both the query and target instances contain only one type of face media, such as still image(s), video track(s), or 3D image(s) or model(s). However, the query and target instances can be of different media types, such as single image vs. single video. These methods can be effective for unconstrained illumination and expression variations but can only handle limited pose variations. For example, while ∼97% TAR at 0.1% FAR has been reported for MBGCv2.0 unconstrained vs. unconstrained face matching, under large pose variations this performance drops to ∼17% TAR for MBGCv2.0 non-frontal vs. frontal face matching (see Table 2.1). Such challenges were also observed in single image vs. single image face matching in LFW, and single video vs. single video face matching in YTF and the MBGCv2.0 walking vs. walking databases. These observations suggest that in unconstrained scenarios, a single face media probe, especially of "low quality", may not be able to provide a sufficient description of a face.

This motivates the use of a face media collection which utilizes any source of information that is available for a probe (or query) instance of a face. One preliminary study in this direction is FRGCv2.0 Exp. 3, where (i) a single 3D face image and (ii) a collection of a single 3D image and a single 2D face image were used as queries. Results show that combining the 2D face image and 3D face image did improve the face matching performance (79% TAR for 3D face and 2D face vs. 53% TAR for just the 3D face at 0.1% FAR) in unconstrained conditions. It is, therefore, important to determine how we can improve the face matching accuracy when presented with a collection of face media of different types, albeit of different qualities, as probe.

Table 2.1 A summary of published methods on unconstrained face recognition (UFR). Performance is reported as True Accept Rate (TAR) at a fixed False Accept Rate (FAR) of 0.1% or 1%, unless otherwise noted.

UFR on Single Media
Dataset | Query Type (size) vs. Target Type (size) | Accuracy (TAR @ FAR) | Source
FRGC v2.0 Exp. 4 unconstrained vs. constrained | Single image (8,014) vs. single image (16,028) | 12% @ 0.1% | [97]
MBGC v2.0 unconstrained vs. unconstrained | Single image (10,687) vs. single image (8,014) | 97% @ 0.1% | [97]
MBGC v2.0 non-frontal vs. frontal | Single image (3,097) vs. single image (16,028) | 17% @ 0.1% | [97]
MBGC v2.0 unconstrained vs. HD video | Single image (1,785) vs. single HD video (512) | 94% @ 0.1% | [97]
MBGC v2.0 walking vs. walking | Notre Dame: single video (976) vs. single video (976); UT Dallas: single video (487) vs. single video (487) | Notre Dame: 46% @ 0.1%; UT Dallas: 65% @ 0.1% | [97]
FRGC v2.0 Exp. 3 3D vs. 3D | Single 3D image (4,007) vs. single 3D image (4,007) | 53% @ 0.1% | [97]
LFW Image-Unrestricted Protocol (w/ outside training data) | 300 genuine and 300 impostor pairs per fold | 88% @ 1%; 94% @ 1%; 95% @ 1% | [34]; [128]; [122]
LFW BLUFR Protocol | 4,249 subjects and 9,708 images per fold | 90% @ 0.1% | [136]
YouTube Celebrities | 1,500 video clips of 35 celebrities | 79% Rank-1 Acc. | [128]
YouTube Faces | 250 genuine and 250 impostor pairs per fold | 55% @ 1%; 63% @ 1% | [19]; [145]

UFR on Media Collection
Dataset | Query Type (size) vs. Target Type (size) | Accuracy (TAR @ FAR) | Source
FRGC v2.0 Exp. 3 | Single image & single 3D image (8,014) vs. single 3D image (943) | 79% @ 0.1% | [97]
MBGC v2.0 unconstrained face and iris vs. NIR & HD videos | Single face & iris (14,115) vs. single NIR & single HD (562) | 97% @ 0.1% | [97]
LFW & YouTube Faces (plus 3D face models & demographic information) | Single image vs. single image: 56.7%; Multiple images vs. single image: 72.0%; Single video vs. single image: 31.3%; Multiple videos vs. single image: 44.0%; Multiple images & multiple videos vs. single image: 77.5%; Multiple images, multiple videos, & 3D model vs. single image: 83.0%; Multiple images, multiple videos, 3D model, & demographics vs. single image: 84.9% (Rank-1 identification accuracies) | This work

2.3 Media-as-Input

A face media collection can consist of still images, video tracks, a 3D model, a forensic sketch, and demographic information. In this section, we discuss how we use face "media-as-input" as probe and our approach to media fusion.

2.3.1 Still Image and Video Track

Still images and video tracks are two of the most widely used sources of media in face recognition systems [82]. Given multiple still images and videos, we use the method reported in [19] to match all still images and video frames available for a subject of interest to the gallery mugshot (frontal pose) images using a COTS face matcher. The resulting match scores are then fused to get a single match score for either multiple probe images or video(s).

2.3.2 3D Face Models

One of the main challenges in unconstrained face recognition is large variations in facial pose [47, 94]. In particular, out-of-plane rotations drastically change the 2D appearance of a face, as they cause portions of the face to be occluded. A common approach to mitigate the effects of pose variations is to build a 3D face model from a 2D image(s) so that synthetic 2D face images can then be rendered at designated poses (e.g., [8, 63, 86]). In this work, we use a state of the art COTS 3D face modeling SDK, namely CyberExtruder's Aureus 3D SDK (http://www.cyberextruder.com/aureus-3d-sdk), to build 3D models from 2D unconstrained face images. We input eye locations (extracted automatically by [34] for LFW images and by the COTS face matcher
We also pose correct “frontal” gallery images because even the gallery images can have variations in pose as well. Experimental results show that including pose corrected gallery images indeed improves the identification performance. Given the original and pose corrected probe and gallery images, there are four matching scores that can be computed between any pair of probe and gallery face images (see Fig. 2.5). We use the score s1 as the baseline to determine whether including scores s2 , s3 , s4 , or their fusion can improve the performance of a COTS face matcher. A face in a video frame can be pose corrected in the same manner. The Aureus SDK also summarizes faces from multiple frames in a video track as a “consolidated” 3D face model (see Fig. 2.6). 2.3.3 Demographic Attributes In many law enforcement and government applications, it is customary to collect ancillary information like age, gender, race, height, and eye color from the subjects during enrollment. We explore how to best utilize demographic data to boost the recognition accuracy. Demographic information such as age, gender and race becomes even more important in complementing identity information provided by face images and videos in unconstrained face recognition due to the difficulty of the face matching task. In this work, we take gender and race attributes of each subject in the LFW and YTF face databases as one type of media. Since this demographic information is not available for the subjects in the LFW and YTF face databases, we utilized the Amazon Mechanical Turk (MTurk) crowdsourcing service9 to obtain the “ground-truth” gender, and race of the 596 subjects that are common in LFW and YTF datasets. Most studies on automatic 9 www.mturk.com/mturk/ 51 Probe Gallery s1 Original Original s2 s3 Pose Corrected s4 Pose Corrected Figure 2.5 Pose correction of probe (left) and gallery (right) face images using CyberExtruder’s Aureus 3D SDK. We consider the fusion of four different match scores (s1 , s2 , s3 , and s4 ) between the original probe and gallery images (top) and synthetic pose corrected probe and gallery images (bottom). Figure 2.6 Pose corrected faces (b) in a video track (a) and the resulting “consolidated” 3D face model (c). The consolidated 3D face model is a summarization of all frames in the video track. 52 demographic estimation are limited to frontal face images [59]; demographic estimation from unconstrained face images (e.g., the LFW database) is challenging [76]. For gender and race estimation tasks, we submitted 5, 749 (i.e., the number of subjects in LFW) Human Intelligence Tasks (HITs), with ten human workers per HIT, at a cost of 2 cents per HIT. Finally, a majority voting scheme (among the responses) was utilized to determine the gender (Female or Male) and race (Black, White, Asian or Unknown) of each subject. We did not consider age in this work due to large variations in age estimates by crowd workers. 2.3.4 Forensic Sketches Face sketch based identification dates back to the 19th century [130], where the paradigm for identifying subjects using face sketches relied on human examination. Recent studies on automated sketch based identification systems show that sketches can also be helpful to law-enforcement agencies to identify the person of interest from mugshot databases [58, 74]. In situations where the suspect’s photo or video is not available, expertise of forensic sketch artists are utilized to draw a suspect’s sketch based on a verbal description provided by an eyewitness or victim. 
In some situations, even when a photo or video of a suspect is available, the quality of this media can be poor. In this situation also, a forensic sketch artist can be called in to draw a face sketch based on the low-quality face photo or video. For this reason, we also include the face sketch in a face media collection. We manually selected 21 low-quality (large pose variations, shadow, blur, etc.) videos (one video per subject) from the YTF database (for three subjects, we also included a low quality still image from LFW). We then asked two forensic sketch artists to draw a face sketch for each subject in these videos (10 subjects were drawn by one forensic sketch artist, and 11 subjects by the other). Our current experiments are limited to sketches of 21 subjects due to the high cost of hiring a sketch artist. Examples of these sketches and their corresponding low-quality videos are shown in Figs. 2.7 and 2.15.

Figure 2.7 An example of a sketch drawn by a forensic artist by looking at a low-quality video. (a) Video shown to the forensic artists, (b) facial region cropped from the video frames, and (c) sketch drawn by the forensic artist. Here, no verbal description of the person of interest is available.

2.4 Media Fusion

Given a face media collection as probe, there are various schemes to integrate the identity information provided by each individual media component, such as score level, rank level, and decision level fusion [114]. Among these approaches, score level fusion is the most commonly adopted. Some COTS matchers do not output a meaningful match score (to prevent hill-climbing attacks [133]); in these situations, rank level or decision level fusion is typically adopted. In this work, we match each face media sample (image, video, 3D model, sketch, or demographic information) of a probe collection to the gallery and combine the scores using score level fusion. Specifically, score level fusion takes place in two different layers: (i) fusion within one type of media, and (ii) fusion across different types of media.

The first fusion layer generates a single score from each media type if multiple instances are available. For example, matching scores from multiple images or multiple video frames can be fused to get a single score. Additionally, if multiple video clips are available, matching scores of the individual video clips can also be fused. Score fusion within the ith face media type can generally be formulated as

s_i = \mathcal{F}(s_{i,1}, s_{i,2}, \cdots, s_{i,n}),    (2.1)

where s_i is a single match score based on n instances of the ith face media type and \mathcal{F}(\cdot) is a score level fusion rule. We use the sum rule, e.g., s_i = \frac{1}{n} \sum_{j=1}^{n} s_{i,j}, which has been found to be quite effective in practice [19]. Note that the sum and mean rules are equivalent in that they induce the same ranking of the gallery; we use the terms mean and sum for situations where normalization by the number of scores is and is not necessary, respectively.

Given a match score for each face media type, the next fusion step involves fusing the scores across the different types of face media. Again, the sum rule is used and found to work very well in our experiments; however, as shown in Fig. 2.8, the face media for a person of interest can be of different quality. For example, a 3D face model can be corrupted due to inaccurate localization of facial landmarks. As a result, match scores calculated from individual media sources may have different degrees of confidence.
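As a small illustration of this two-layer scheme, the following Python sketch implements the within-media fusion of Eq. (2.1) with the sum/mean rule; the score values are hypothetical placeholders rather than output of the COTS matcher.

```python
import numpy as np

def fuse_within_media(scores, rule="mean"):
    """Eq. (2.1) with F(.) taken as the sum or mean rule.

    'mean' divides by the number of instances; 'sum' does not. For a fixed
    probe, both induce the same ranking of the gallery.
    """
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean() if rule == "mean" else scores.sum())

# Hypothetical match scores of one probe subject against a single gallery entry:
image_scores = [0.61, 0.72, 0.55]        # three still images
frame_scores = [0.32, 0.41, 0.38, 0.29]  # frames of one video track
s_image = fuse_within_media(image_scores)  # single score for the image media type
s_video = fuse_within_media(frame_scores)  # single score for the video media type
print(round(s_image, 3), round(s_video, 3))
```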
We take into account the quality of each individual media type by designing a quality based fusion. Specifically, let S = [s_1, s_2, \cdots, s_m]^T be a vector of the match scores between the m different media types in a probe collection and a gallery image, and Q = [q_1, q_2, \cdots, q_m]^T be a vector of quality values for the corresponding input media. Match scores from the COTS matcher are normalized with z-score normalization, and the quality values are normalized to the range [0, 1]. The final match score between a probe and a gallery image is calculated by weighted sum rule fusion,

s = \frac{1}{m} \sum_{i=1}^{m} q_i s_i = \frac{1}{m} Q^T S.    (2.2)

Note that the quality based across-media fusion in (2.2) can also be applied to score level fusion within a particular face media type (e.g., 2D video frames). In this work, we have considered five types of media in a collection: 2D face image, video, 3D face model, sketch, and demographic information. However, since sketches are available for only 21 persons (out of the 596 persons that are common to the LFW and YTF databases), in most of the experiments we perform the quality-based fusion in (2.2) based on only four types of media (m = 4). The quality measures for the individual media types are defined as follows.

• Image and video: For a probe image, the COTS matcher assigns a face confidence value in the range [0, 1], which is used as the quality value. For each video frame, the same face confidence measure is used; the average face confidence value across all frames is used as the quality value for a video track.

• 3D face model: The Aureus 3D SDK used to build a 3D face model from image(s) or video frame(s) does not output a confidence score. We define the quality of a 3D face model based on the pose corrected 2D face image generated from it. Given a pose corrected face image, we calculate its structural similarity (SSIM) [139] to a set of predefined reference images (manually selected frontal face images). Let I_{PC} be a pose corrected face image (from the 3D model), and R = \{R_1, R_2, \cdots, R_t\} be the set of t reference face images. The quality value of a 3D model based on SSIM is defined as

q(I_{PC}) = \frac{1}{t} \sum_{i=1}^{t} \mathrm{SSIM}(I_{PC}, R_i) = \frac{1}{t} \sum_{i=1}^{t} l(I_{PC}, R_i)^{\alpha} \cdot c(I_{PC}, R_i)^{\beta} \cdot s(I_{PC}, R_i)^{\gamma},    (2.3)

where l(\cdot), c(\cdot), and s(\cdot) are the luminance, contrast, and structure comparison functions [139], respectively, and \alpha, \beta, and \gamma are parameters used to adjust the relative importance of the three components. We use the recommended parameters \alpha = \beta = \gamma = 1 from [139]. The quality value is in the range [0, 1].

• Demographic information: As stated earlier, we collected demographic attributes (gender and race) of each face image using the MTurk crowdsourcing service with ten MTurk workers per task. Hence, the quality of the demographic information can be measured by the degree of consistency among the ten MTurk workers. Let E = [e_1, e_2, \cdots, e_k]^T be the collection of estimates of one specific demographic attribute (gender or race) by k (here, k = 10) MTurk workers. The quality value of this demographic attribute can be calculated as

q(E) = \frac{1}{k} \max_{i=1,2,\cdots,c} \#(E == i),    (2.4)

where c is the total number of classes for the demographic attribute (c = 2 for gender: Male and Female; c = 4 for race: Black, White, Asian, and Unknown) and \#(E == i) denotes the number of estimates that are labeled as class i. The quality value in (2.4) is in the range [0, 1].

Figure 2.8 Examples of different face media types with varying quality values (QV) of one subject: (a) images, (b) video frames, (c) 3D face models, and (d) demographic information. The range of QV is [0, 1].

Quality values for the different face media of one subject are shown in Fig. 2.8. We note that the proposed quality measures give reasonable quality assessments for the different input media.
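A minimal Python sketch of these quality measures and the weighted fusion in (2.2) is given below. It is an illustration under simplifying assumptions: SSIM is taken from scikit-image as a stand-in for the formulation in [139], the matcher scores and reference images are random placeholders, and z-score normalization is applied over the placeholder scores only (in practice it is computed over the matcher's full score distribution).

```python
import numpy as np
from skimage.metrics import structural_similarity  # stand-in for SSIM of Eq. (2.3)

def zscore(scores):
    """z-score normalization of raw matcher scores."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-12)

def model_quality(pose_corrected, references):
    """Eq. (2.3): mean SSIM of a pose-corrected image to reference frontal images."""
    return float(np.mean([structural_similarity(pose_corrected, r, data_range=1.0)
                          for r in references]))

def demographic_quality(labels):
    """Eq. (2.4): agreement among crowd workers, e.g., ['Male'] * 9 + ['Female']."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    return float(counts.max()) / len(labels)

def quality_weighted_fusion(scores, qualities):
    """Eq. (2.2): s = (1/m) * Q^T S."""
    s, q = np.asarray(scores, dtype=float), np.asarray(qualities, dtype=float)
    return float(q @ s) / len(s)

# Hypothetical example with four media types (image, video, 3D model, demographics):
rng = np.random.default_rng(1)
refs = [rng.random((64, 64)) for _ in range(3)]       # placeholders for reference frontals
q = [0.96,                                            # image: matcher face confidence
     0.94,                                            # video: mean frame confidence
     model_quality(rng.random((64, 64)), refs),       # 3D model: SSIM-based quality
     demographic_quality(['Male'] * 9 + ['Female'])]  # demographics: worker agreement
s = zscore(rng.normal(size=4))                        # placeholder per-media match scores
print(round(quality_weighted_fusion(s, q), 3))
```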
2.5 Experimental Setup

The 596 subjects who have at least two images in the LFW database and at least one video track in the YTF database (subjects in YTF are a subset of those in LFW) are used to evaluate the performance of face identification on media-as-input in both closed-set and open-set scenarios. The state of the art COTS face matcher used in our experiments was one of the top performers in the 2010 NIST Multi-Biometric Evaluation [97]. Though the COTS face matcher is designed for matching still images, we apply it to video-to-still face matching via multi-frame fusion to obtain a single score for a video track [19]. In all cases where video tracks are part of the face media collection, we use the mean rule for multi-frame fusion (the max fusion rule performed comparably [19]).

Table 2.2 Number of probe face images (from the LFW database) and video tracks (from the YTF database) available for the 596 subjects that are common to the two databases.
# images/videos per subj. | 1 | 2 | 3 | 4 | 5 | 6 | 7+
# subjects (LFW images) | 238 | 110 | 78 | 57 | 25 | 12 | 76
# subjects (YTF videos) | 204 | 190 | 122 | 60 | 18 | 2 | 0

2.5.1 Closed Set Identification

In the closed-set identification experiments, one frontal LFW image per subject is placed in the gallery (the one with the highest frontal score from the COTS matcher), and the remaining LFW images are used as probes. All YTF video tracks for the 596 subjects are used as probes. Table 2.2 shows the distribution of the number of probe images and videos per subject. The average number of images, video tracks, and total media instances per subject is 5.3, 2.2, and 7.4, respectively. We further extend the gallery with an additional 3,653 LFW images (of subjects with only a single image in LFW). In total, the size of the gallery is 4,249. We evaluate five different scenarios depending on the contents of the probe set: (i) single image as probe, (ii) single video track as probe, (iii) multiple images as probe, (iv) multiple video tracks as probe, and (v) multiple images and video tracks as probe. We also take into account the 3D face models and demographic information in the five scenarios. To better simulate the scenarios in real-world forensic investigations, we also provide a case study on one of the Boston Marathon bombers to determine the efficacy of using media, as well as the generalization ability of our system to a large gallery with one million background face images.

For all closed-set experiments involving still images from LFW, we input automatically extracted eye locations (from [34]) to the COTS face matcher to help with enrollment, because the COTS matcher sometimes enrolls a background face in the LFW image that is not the subject of interest. Against a gallery of approximately 5,000 LFW frontal images, we observed a 2–3% increase in accuracy for Rank-20 and higher by inputting the automatically extracted eye locations from [34]. Note that for the YTF video tracks, there are no available ground-truth eye locations for the faces in each frame.
Recall from Section 2.3.2 that we input eye locations from [34] and from the COTS face matcher to build the 3D models for LFW images and YTF video frames, respectively; hence, the entire 3D face modeling process is fully automatic. We report closed-set identification results as Cumulative Match Characteristic (CMC) curves.

2.5.2 Open Set Identification

Here, we consider the case where the person of interest in the probe image or video track may not have a true mate in the gallery. This is representative of a watch list scenario. The gallery (watch list) consists of the 596 subjects with at least two images in the LFW database and at least one video in the YTF database. To evaluate performance in the open-set scenario, we construct two probe sets: (i) a genuine probe set that contains faces matching gallery subjects, and (ii) an impostor probe set that does not contain faces matching gallery subjects. We conduct two separate experiments: (i) randomly select one LFW image per watch list subject as the genuine probe set and use the remaining LFW images of subjects not on the watch list as the impostor probe set (596 gallery subjects, 596 genuine probe images, and 9,494 impostor probe images), and (ii) use one YTF video per watch list subject as the genuine probe set and the remaining YTF videos which do not contain watch list subjects as the impostor probe set (596 gallery subjects, 596 genuine probe videos, and 2,064 impostor probe videos). For each of these experiments, we evaluate three scenarios for the gallery: (i) single image, (ii) multiple images, and (iii) multiple images and videos.

Open-set identification can be considered a two-step process: (i) decide whether or not to reject a probe as not being in the watch list, and (ii) if the probe is in the watch list, recognize the person. Hence, performance is evaluated based on (i) the Rank-1 detection and identification rate (DIR), which is the fraction of genuine probes matched correctly at Rank-1 and not rejected at a given threshold, and (ii) the false alarm rate (FAR) of the rejection step (i.e., the fraction of impostor probes which are not rejected). We report the DIR vs. FAR curve describing the tradeoff between true Rank-1 identifications and false alarms.

2.6 Experimental Results

2.6.1 Pose Correction

We first investigate whether using a COTS 3D face modeling SDK to pose correct a 2D face image prior to matching improves the identification accuracy. The closed-set experiments in this section consist of a gallery of 4,249 frontal LFW images and a probe set of 3,143 LFW images or 1,292 YTF videos. Table 2.3 (a) shows that the COTS face matcher performs better on face images that have been pose corrected using the Aureus 3D SDK. Matching the original gallery images to the pose corrected probe images (i.e., match score s3) performs the best out of all four match scores, achieving a 7.25% improvement in Rank-1 accuracy over the baseline (i.e., match score s1). Furthermore, fusion of all four scores (s1, s2, s3, and s4) with the simple sum rule provides an additional 2.6% improvement at Rank-1. Consistent with the results for still images, match scores s3 and sum(s1, s2, s3, s4) also provide significant increases in identification accuracy over using match score s1 alone for matching the frames of a video track (Table 2.3 (b)). We note that s4 likely performs worse than s3 because the gallery images are already fairly frontal; if both the gallery and the probe face images are unconstrained, then s4 may perform better.
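To make the preceding discussion concrete, the sketch below shows how the four scores of Fig. 2.5 and their sum-rule fusion can be computed. The match() function is a hypothetical stand-in (cosine similarity on raw pixels) for the COTS matcher, whose API cannot be reproduced here.

```python
import itertools
import numpy as np

def match(img_a, img_b):
    """Hypothetical similarity score; the actual COTS matcher is a black box."""
    a, b = img_a.ravel(), img_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pose_correction_scores(probe, probe_pc, gallery, gallery_pc):
    """Return [s1, s2, s3, s4] of Fig. 2.5 and their sum-rule fusion.

    s1: original probe vs. original gallery (baseline)
    s2: original probe vs. pose-corrected gallery
    s3: pose-corrected probe vs. original gallery
    s4: pose-corrected probe vs. pose-corrected gallery
    """
    scores = [match(p, g) for p, g in itertools.product((probe, probe_pc),
                                                        (gallery, gallery_pc))]
    return scores, sum(scores)

# Placeholder arrays standing in for original and pose-corrected face crops:
rng = np.random.default_rng(2)
probe, probe_pc, gallery, gallery_pc = (rng.random((100, 100)) for _ in range(4))
scores, fused = pose_correction_scores(probe, probe_pc, gallery, gallery_pc)
print("s1..s4:", np.round(scores, 3), " sum fusion:", round(fused, 3))
```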
Next, we investigate whether the Aureus SDK consolidated 3D models (i.e., the n frames of a video track summarized as a single 3D face model rendered at frontal pose) can achieve accuracy comparable to matching all n frames. Table 2.4 (a) shows that sum(s3, s4) (i.e., consolidated 3D models matched to both the original and the pose corrected gallery images) provides the same accuracy as matching all n original frames (i.e., score s1 in Table 2.3 (b)). However, the accuracy of the consolidated 3D model is slightly lower (∼5%) than mean fusion over all n pose corrected frames (i.e., score s3 in Table 2.3 (b)). Hence, the consolidated 3D model built from a video track is not able to retain all of the discriminatory information contained in the collection of n pose-corrected frames.

Table 2.3 Closed-set identification accuracies (%) for pose corrected gallery and/or probe face images using the 3D model. The gallery consists of 4,249 LFW frontal images and the probe sets are (a) 3,143 LFW images and (b) 1,292 YTF video tracks. Performance is shown as rank retrieval results at Rank-1, 20, 100, and 200. Computation of match scores s1, s2, s3, and s4 is shown in Fig. 2.5.

(a) LFW Images
Rank | s1 | s2 | s3 | s4 | sum
R-1 | 56.7 | 57.7 | 63.9 | 55.6 | 66.5
R-20 | 78.1 | 77.6 | 83.4 | 78.8 | 85.9
R-100 | 87.1 | 86.0 | 90.7 | 88.0 | 92.4
R-200 | 90.2 | 89.9 | 93.6 | 91.9 | 95.1

(b) YTF Video Tracks
Rank | s1 | s2 | s3 | s4 | sum
R-1 | 31.3 | 32.3 | 36.3 | 31.7 | 38.8
R-20 | 54.2 | 55.3 | 58.8 | 54.4 | 61.4
R-100 | 68.0 | 67.8 | 71.3 | 68.7 | 73.6
R-200 | 74.5 | 73.9 | 77.2 | 76.5 | 79.0

Table 2.4 Closed-set identification accuracies (%) for matching consolidated 3D face models built from (a) all frames of a video track or (b) a subset of high quality (HQ) video frames.

(a) Consolidated 3D Model: All Frames
Rank | s3 | s4 | sum
R-1 | 33.1 | 29.4 | 34.6
R-20 | 54.1 | 51.7 | 56.4
R-100 | 67.3 | 64.8 | 68.2
R-200 | 72.8 | 71.1 | 74.1

(b) Consolidated 3D Model: Frame Selection
Rank | s3 | s4 | sum
R-1 | 34.4 | 29.8 | 35.9
R-20 | 56.6 | 52.4 | 58.3
R-100 | 67.8 | 66.5 | 69.9
R-200 | 73.4 | 72.7 | 75.1

2.6.2 Forensic Identification: Media-as-Input

A summary of results for the various media-as-input scenarios is shown in Fig. 2.9. For all scenarios that involved multiple probe instances (i.e., multiple images and/or videos), the
However, we note that though multiple videos perform poorly compared to still images, there are still cases where the fusion of multiple videos with the still images does improve the identification performance. This is shown in Fig. 2.9(c); the best result for multiple images is plotted as a baseline to show that the addition of videos to the media collection improves identification accuracy. An example of this is shown in Fig. 2.10. For this particular subject, there is only a single probe image available that exhibits extreme pose. The additional information provided by the 3D model and video track improves the true match from Rank-438 to Rank8. In fact, the performance improvement of media (i.e., multiple images and videos) over multiple images alone can mostly be attributed to cases where there is only a single probe image with large pose, illumination, and expression variations. While Fig. 2.9 shows that including additional media to a probe collection improves identification accuracies on average, there are cases where matching the entire media collection can degrade the matching performance. An example is shown in Fig. 2.11. Due to the 62 1 0.85 0.95 0.8 0.75 0.85 Accuracy Accuracy 0.9 0.8 0.75 0.6 0.55 0.5 0.7 0.65 0.6 0.7 0.65 0 50 Single Image (sum(s1,s2,s3,s4)) Multiple Images (s1) Multiple Images (s3) Multiple Images (sum(s1,s2,s3,s4)) 100 150 200 0.45 0.4 0.35 0 Single Video Track: All Frames (s3) Multiple Video Tracks: All Frames (s1) Multiple Video Tracks: All Frames (s3) Multiple Video Tracks: All Frames (s1+s3) Multiple Video Tracks: Cons. 3D Models (s3) 50 100 150 200 Rank Rank (a) Images (b) Video Tracks 1 Accuracy 0.95 0.9 0.85 0.8 0.75 0 Images (sum(s1,s2,s3,s4)) Images (s1) and Video Tracks (s1) Images (sum(s1,s2,s3,s4)) and Video Tracks (s1) Images (sum(s1,s2,s3,s4)) and Video Tracks (s3) Images (s1) and Cons. 3D Models (s3) 50 100 150 200 Rank (c) Media Collection Figure 2.9 Closed-set identification results for different probe sets: (a) multiple still face images, (b) multiple face video tracks, and (c) face media collection (images, videos and 3D face models). Single face image and video track results are plotted in (a) and (b) for comparison. Note that the ordinate scales are different in (a), (b), and (c) to accentuate the difference among the plots. 63 (a) Probe media collection (b) Gallery true mate Figure 2.10 A collection of face media for a subject (a) consisting of a single still image, 3D model, and video track improves the retrieval rank of the true mate in the gallery (b). Against a gallery of 4,249 frontal images, the single still image was matched at Rank-438 with the true mate. Including the 3D model along with the still image improved the match to Rank-118, while the entire probe media collection was matched to the true mate at Rank-8. (a) Probe image and 3D model (b) Gallery image and 3D model (c) Probe video tracks Figure 2.11 Additional face media does not always improve the identification accuracy. In this example, the probe image with its 3D model (a) was matched at Rank-5 against a gallery of 4,249 frontal images. Inclusion of three video tracks of the subject (c) to the probe set degraded the true match to Rank-216. 64 fairly low quality of the video tracks, the entire media collection for this subject is matched at Rank-216 against the gallery of 4, 249 images, while the single probe image and pose corrected image (from the 3D model) are matched at Rank-5. 
This necessitates the use of quality measures to assign a degree of confidence to each media type.

We evaluated the face verification performance (see Fig. 2.12) using the same database as the closed-set identification protocol (i.e., a gallery (target) of 4,249 images and probe (query) media collections of 596 subjects). We found that score s3 still outperforms s1, s2, and s4 for still images and video frames. In investigating why s3 performs better than s4, we found that s4 provides a better genuine score distribution than s3, but the impostor distribution of s4 has a longer tail. We believe this is partially due to similarities in the contours of two pose-corrected images. However, we find that multiple images with their 3D models (sum(s1, s2, s3, s4)) perform better than a media collection of multiple images (s1) and video frames (s1 or consolidated 3D model), whereas in closed-set identification, these media collections perform better than the multiple images and 3D models alone. In both identification and verification modes, the best performance is obtained by a collection of images with their 3D models and video frames. Image and video scores were normalized with z-score normalization.

Figure 2.12 Face verification performance for a gallery of 4,249 frontal LFW images and probe media collections of 596 subjects.

2.6.3 Quality-based Media Fusion

In this section, we evaluate the proposed quality measures and quality-based face media fusion. As discussed in Section 2.4, quality measures and quality-based face media fusion can be applied at both the within-media layer and the across-media layer. Tables 2.5 (a) and (b) show the closed-set identification accuracies of quality-based fusion of the match scores (s1, ..., s4) for a single image per probe and multiple images per probe, respectively. The performance with sum rule fusion is also provided for comparison. Our results indicate that the proposed quality measures and quality based fusion are able to improve the matching accuracies in both scenarios.

Table 2.5 Closed-set identification accuracies (%) for quality based fusion (QBF) (a) within a single image, and (b) across multiple images.

(a) QBF within a single image
Fusion | R-1 | R-20 | R-100 | R-200
sum | 65.7 | 83.2 | 90.1 | 93.5
QBF | 66.5 | 85.9 | 92.6 | 95.3

(b) QBF across multiple images
Fusion | R-1 | R-20 | R-100 | R-200
sum | 79.4 | 91.1 | 94.5 | 96.5
QBF | 80.0 | 91.8 | 94.5 | 96.5

Examples where the quality-based fusion performs better than sum rule fusion are shown in Fig. 2.13 (a). Although in some cases the quality-based fusion may perform worse than sum rule fusion (see Fig. 2.13 (b)), overall it still improves the matching performance (see Table 2.5).

We have also applied the proposed quality measure for the 3D face model to select high-quality frames that are used to build the consolidated 3D face model for a video clip. Figure 2.14 (a) shows two examples where the consolidated 3D models using frame selection with the SSIM quality measure (see Sec. 2.4) achieve better retrieval ranks than using all frames. Although a single value, e.g., the SSIM based quality measure, may not always be reliable for describing the quality of a face image (see Fig.
2.14 (b)), frame selection still slightly improves the identification accuracy of the consolidated 3D face models at low ranks (see Table 2.4).

Figure 2.13 A comparison of quality based fusion (QBF) vs. simple sum rule fusion (SUM). (a) Examples where quality based fusion provides better identification accuracy than sum fusion; (b) examples where quality based fusion leads to lower identification accuracy compared with sum fusion.

Figure 2.14 Retrieval ranks using consolidated 3D face models (built from video tracks). Frame selection with the SSIM quality measure (see Sec. 2.4) prior to building the consolidated 3D face model (a) improves and (b) degrades the identification accuracy. However, overall, frame selection using the proposed quality measure based on SSIM improves the COTS matcher's performance by an average of 1.43% for low ranks 1 to 50.

2.6.4 Forensic Sketch Experiments

In this experiment, we study the effectiveness of forensic sketches in a media collection. For each subject with a forensic sketch, we input the forensic sketch to the COTS matcher to obtain a retrieval rank. Among the 21 subjects for whom we have a sketch, the sketches of 12 subjects are observed to perform significantly better than the corresponding low-quality videos. Additionally, when demographic filtering using gender and race is applied, we can further improve the retrieval ranks. Figure 2.15 shows three examples where the face sketches significantly improved the retrieval ranks compared to the low quality videos. The retrieval ranks of sketch and low-quality video fusion are also reported in Fig. 2.15.

Figure 2.15 Three examples where the face sketches drawn by a forensic artist after viewing the low-quality videos improve the retrieval rank. The retrieval ranks without and with combining the demographic information (gender and race) are given in the form of #(#).

To further demonstrate the efficacy of the forensic sketch, we focus on the identification of Tamerlan Tsarnaev, the older brother involved in the 2013 Boston Marathon bombing. In an earlier study, Klontz and Jain [73] showed that while the younger brother, Dzhokhar Tsarnaev, could be identified at Rank-1 based on his probe images released by the authorities, the older brother could only be identified at Rank-12,446 (from a gallery of one million images with no demographic filtering).
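Demographic filtering, which is applied repeatedly in the remainder of this case study, simply restricts the gallery to entries consistent with the known attributes of the person of interest before matching. The sketch below illustrates the idea; the gallery record structure and field names are hypothetical, not those of the PCSO database.

```python
from dataclasses import dataclass

@dataclass
class GalleryEntry:
    subject_id: str
    gender: str   # e.g., 'Male' or 'Female'
    race: str     # e.g., 'White', 'Black', 'Asian', 'Unknown'
    age: int

def demographic_filter(gallery, gender=None, race=None, age_range=None):
    """Keep only entries consistent with the known attributes of the person of
    interest; attributes set to None are not used for filtering."""
    kept = []
    for entry in gallery:
        if gender is not None and entry.gender != gender:
            continue
        if race is not None and entry.race != race:
            continue
        if age_range is not None and not (age_range[0] <= entry.age <= age_range[1]):
            continue
        kept.append(entry)
    return kept

# A "white male, age 20-30" filter on a toy gallery keeps only the first entry:
gallery = [GalleryEntry("A", "Male", "White", 25),
           GalleryEntry("B", "Female", "White", 27),
           GalleryEntry("C", "Male", "Black", 40)]
print([e.subject_id for e in demographic_filter(gallery, "Male", "White", (20, 30))])
```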
Figure 2.16 shows three gallery face images of Tamerlan Tsarnaev (1x, 1y, and 1z [73]) and two probe face images (1a and 1b) which were released by the FBI during the investigation (http://www.fbi.gov/news/updates-on-investigation-into-multiple-explosions-in-boston). Because the probe images of Tamerlan Tsarnaev are of poor quality, particularly due to the wearing of sunglasses and a hat, we also asked a sketch artist to draw a sketch of Tamerlan Tsarnaev (1c in Fig. 2.16) while viewing the two probe images. ("I was living in Costa Rica at the time that event took place and while I saw some news coverage, I didn't see much and I don't know what he actually looks like. The composite I am working on is 100% derived from what I am able to see and draw from the images you sent. I can't make up information that I can't see, so I left his hat on and I can only hint at eye placement." – Jane Wankmiller, forensic sketch artist, Michigan State Police.)

Figure 2.16 Face images used in our case study on the identification of Tamerlan Tsarnaev, one of the two suspects of the 2013 Boston Marathon bombings. Probe (1a, 1b) and gallery (1x, 1y, and 1z) face images are shown. 1c is a face sketch drawn by a forensic sketch artist after viewing 1a and 1b, and a low quality video frame from a surveillance video.

To simulate a large-scale forensic investigation, the three gallery images of Tamerlan Tsarnaev were added to a background set of one million mugshot images of 324,696 unique subjects from the PCSO database. Particularly due to the occlusion of the eyes, the probe images are difficult for the COTS face matcher to identify (though they can be enrolled with manually marked eye locations), as shown in Table 2.6. However, the retrieval rank for the sketch (1c in Fig. 2.16) is much better compared to the two probe images (1a and 1b in Fig. 2.16), with the best match at Rank-6,259 for max fusion of the multiple images of Tamerlan Tsarnaev (1x, 1y, and 1z) in the gallery. With demographic filtering [71] (white male in the age range of 20 to 30 filters the gallery to 54,638 images of 13,884 subjects), the sketch is identified with gallery image 1x (a mugshot; http://usnews.nbcnews.com/_news/2013/05/06/18086503-funeral-director-in-boston-bombing-case-used-to-serving-the-unwanted?lite) in Fig. 2.16 at Rank-112. Again, score fusion of the multiple images per subject in the gallery further lowers the retrieval to Rank-71. The entire media collection (here, 1a, 1b, and 1c in Fig. 2.16) is matched at Rank-82 against the demographic-filtered and multiple image-fused gallery.

Table 2.6 Retrieval ranks for probe images (1a, 1b) and sketch (1c) matched against gallery images 1x, 1y, and 1z with an extended set of one million mug shots (a) without and (b) with demographic filtering. Rows max and mean denote score fusion of the multiple images of this suspect in the gallery; columns max and sum are score fusion of the three probes.

(a) Without Demographic Filtering
Gallery | 1a | 1b | 1c | max | sum
1x | 117,322 | 475,769 | 8,285 | 18,710 | 27,673
1y | 12,444 | 440,870 | 63,313 | 38,298 | 28,169
1z | 87,803 | 237,704 | 53,771 | 143,389 | 55,712
max | 9,409 | 117,623 | 6,259 | 14,977 | 6,281
mean | 13,658 | 125,117 | 8,019 | 20,614 | 8,986

(b) With Demographic Filtering (white male, 20–30)
Gallery | 1a | 1b | 1c | max | sum
1x | 5,432 | 27,617 | 112 | 114 | 353
1y | 518 | 25,780 | 1,409 | 1,656 | 686
1z | 3,958 | 14,670 | 1,142 | 2,627 | 1,416
max | 374 | 6,153 | 94 | 109 | 106
mean | 424 | 5,790 | 71 | 109 | 82

2.6.5 Watch List Scenario: Open Set Identification

We report the DIR vs. FAR curves for open-set identification in Figs. 2.17 (a) and (b).
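For clarity, the quantities plotted in these curves can be computed from a probe-by-gallery score matrix as in the following sketch. Random placeholder scores are used; the protocol sizes match Section 2.5.2, but the numbers produced are illustrative only.

```python
import numpy as np

def dir_far(genuine_scores, true_idx, impostor_scores, threshold):
    """Open-set metrics at one threshold (Section 2.5.2).

    genuine_scores:  (num_genuine_probes, num_gallery) score matrix
    true_idx:        index of each genuine probe's true mate in the gallery
    impostor_scores: (num_impostor_probes, num_gallery) score matrix
    DIR: fraction of genuine probes whose top match is the true mate and
         whose top score is at or above the threshold (Rank-1).
    FAR: fraction of impostor probes whose top score is at or above the threshold.
    """
    top_idx = genuine_scores.argmax(axis=1)
    top_score = genuine_scores.max(axis=1)
    dir_rate = np.mean((top_idx == true_idx) & (top_score >= threshold))
    far_rate = np.mean(impostor_scores.max(axis=1) >= threshold)
    return float(dir_rate), float(far_rate)

# Placeholder scores: 596 genuine probes, 2,064 impostor probes, 596 gallery subjects.
rng = np.random.default_rng(3)
n_gallery, n_genuine, n_impostor = 596, 596, 2064
genuine = rng.normal(0.0, 1.0, (n_genuine, n_gallery))
true_idx = np.arange(n_genuine)
genuine[true_idx, true_idx] += 2.0   # make true-mate scores higher on average
impostor = rng.normal(0.0, 1.0, (n_impostor, n_gallery))
for t in (2.0, 3.0, 4.0):
    d, f = dir_far(genuine, true_idx, impostor, t)
    print("threshold %.1f -> DIR %.3f at FAR %.3f" % (t, d, f))
```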
With a single image or a single video per subject in the gallery, the DIR values at 1% FAR are about 25% and 10% for still image probes and video clip probes, respectively. This suggests that a large percentage of probe images and video clips that are matched to their gallery true mates at a low rank in the closed-set identification scenario can no longer be successfully matched in the open-set scenario. Of course, this comes with the benefit of far fewer false alarms than in closed-set identification. The proposed face media collection based matching still shows improvement over single media based matching. For example, at 1% FAR, face media collection based matching leads to about 20% and 15% higher DIRs for still image and video clip probes, respectively.

Figure 2.17 Scenarios of open-set and closed-set identification. Open-set identification with (a) a single face image as the probe and various media collections as the gallery and (b) a single face video track as the probe and various media collections as the gallery; the legend denotes the gallery media collection in (a) and (b). Closed-set identification of (c) various media collections as probe against a large gallery set with one million background face images from the PCSO database; the legend denotes the probe media collection; the black curve denoted with "D.F." indicates that demographic information (gender and race) is also fused with the other face media. Note that the ordinate scales are different in (a) and (b) to accentuate the difference among the plots.

2.6.6 Large Gallery Results

In order to simulate the large-scale nature of operational face identification, we extend the size of our gallery by including one million face images from the PCSO database. We acknowledge that there may be a bias towards matching between LFW probe and LFW gallery images versus matching LFW probes with PCSO gallery images. This bias is likely due to the fact that the gallery face images in LFW are not necessarily frontal with controlled illumination, expression, etc., while the background face images from PCSO are mugshots of generally cooperative subjects. The extended gallery set with 1M face images makes the face identification problem more challenging. Figure 2.17(c) gives the media collection based face identification accuracies with 1M background face images. A comparison between Fig. 2.17 (c) and Fig.
2.9 shows that the proposed face media collection based matching generalizes well to a large gallery set.

2.7 Conclusions

We studied face identification of persons of interest in unconstrained imaging scenarios with uncooperative subjects. Given a face media collection of a person of interest (i.e., face images and video clips, 3D face models built from image(s) or video frame(s), a face sketch, and demographic information), we have demonstrated an incremental improvement in the identification accuracy of a COTS face matching system. We believe this is of great value to forensic investigations and "lights out" watch list operations, as matching the entire probe collection outputs a single ranked list of candidate identities, rather than a ranked list for each face media sample. Evaluations are provided for the scenarios of closed-set identification, open-set identification, closed-set identification with a large gallery, and verification. Our contributions can be summarized as follows:

1. A collection of face media, such as image, video, 3D face model, face sketch, and demographic information, on a person of interest improves identification accuracies, on average, particularly when individual face samples are of low quality.

2. Pose correction of unconstrained 2D face images and video frames (via 3D face modeling) prior to matching improves the accuracy of a state of the art COTS face matcher. This improvement is especially significant when match scores from rendered pose corrected images are fused with match scores from the original face imagery.

3. A single consolidated 3D face model summarizes an entire video track into a single representation, but score-level fusion of the multiple pose corrected frames from the video track performs better than the consolidated model.

4. Quality based fusion of match scores from different media types performs better than fusion without incorporating quality.

5. The value of a forensic sketch drawn from low quality videos or low quality images of the suspect is demonstrated in the context of one of the Boston bombing suspects and YTF video tracks.

While the LFW and YTF databases contain variations in pose, illumination, expression, occlusion, resolution, etc., matching a face media collection may not boost the performance if there is a long elapsed time between the probe face samples and the true mate in the gallery. Figure 2.18 shows an example of two age-separated face images of the same subject in the LFW database. This type of scenario is difficult to analyze because the LFW and YTF databases do not contain age information for the images.

Figure 2.18 An example of two face images of the same subject in the LFW database where facial aging has occurred.

Chapter 3 Automatic Face Image Quality

The performance of automatic face recognition systems largely depends on the quality of the face images acquired for comparison. Under controlled image acquisition conditions (e.g., mugshot photos) with uniform lighting, frontal pose, neutral expression, and standard image resolution, face recognition systems can achieve extremely high accuracies (e.g., >99% TAR at 0.1% FAR [57]). The system errors that are still present are often caused by a relatively small portion of "poor" quality face images. This could be due to uncooperative subjects or operator negligence during the acquisition of a mugshot, for example (see Fig. 3.1).
There are many emerging applications of face recognition which seek to operate on face images captured in less than ideal conditions (e.g., surveillance). In such cases, where large intra-subject facial variations are more prevalent, or even the norm, the accuracy of face recognition degrades. The 2014 large-scale evaluation conducted by NIST demonstrated that, relative to mugshot-to-mugshot recognition, error rates of the top six commercial algorithms more than doubled when a mugshot gallery was compared against lower quality webcam face images [55].

The performance of biometric recognition, in general, is driven by the quality of the biometric samples (e.g., fingerprint, iris, and face images) [5, 24, 56]. Biometric sample quality is defined as a measure of a sample's utility to automatic matching [5, 24, 56]. A desirable property of a biometric quality measure is that it should be indicative of recognition performance and be correlated with error rates such as the false non-match rate (FNMR), false match rate (FMR), or identification miss rates. If a system can automatically determine the quality of a biometric sample defined in this way, it can be useful for several practical applications:

• Negative identification systems, e.g., automated security checkpoints at airports that compare passengers against watch list photos. If passengers are purposely trying to evade detection, automatic face quality assessment can flag their attempt and/or deny entry through the checkpoint.

• Quality-based fusion: multiple face images (e.g., a sequence of video frames), multibiometric fusion [109] (e.g., face and fingerprint), or 3D face modeling from a collection of face images.

• Dynamic assignment of comparisons to different matching algorithms. High quality face images can be assigned to high-throughput algorithms, while low quality face images can be assigned to slower, but more robust, algorithms.

Figure 3.1 Examples of (a) high and (b) low quality mugshots from the PCSO database.

A biometric quality measure able to detect "bad" quality samples allows the system to process them accordingly (e.g., reject poor quality samples, request a better sample from the user, or employ a slower but more robust matching algorithm). Additionally, a quality measure can be used to rank a collection of biometric samples, which is particularly useful when multiple samples of a subject are available (e.g., frames from a video track, see Fig. 3.2).

Figure 3.2 (a) Video frames from a sample video in the IJB-A [72] unconstrained face database and (b) the corresponding cropped faces sorted from high to low quality by the proposed approach.

Because a biometric sample's quality is specific to automatic recognition performance, human visual perception of the sample's quality may not be well correlated with recognition performance [24, 56]. In particular, given a fingerprint or iris image, it is difficult for a human to assess the quality in the context of recognition because humans (excluding forensic experts) do not naturally use fingerprints or iris textures for person recognition. However, the human visual system is extremely advanced when it comes to recognizing the faces of individuals, a routine daily task. In fact, it was recently shown that humans surpass the performance of current state-of-the-art automated systems on the recognition of very challenging, low quality face images [25]. To the best of our knowledge, very few studies have actually investigated face image quality assessment by humans.
Adler and Dembinsky [2] found very low correlation between human and algorithm measurements of face image quality (98 mugshots of 29 subjects, 8 human evaluators), while Hsu et al. [60] found some consistency between human perception and recognition-based measures of face image quality (frontal and controlled illumination face images, 2 human evaluators). Face recognition performance is highly sensitive to factors such as pose, illumination, expression, occlusion, resolution, and other intrinsic or extrinsic properties of face images. The primary goal of face recognition research is to develop systems which are more robust to these factors. Recent works on automatic face recognition have devoted efforts towards recognition of unconstrained facial imagery [136] where facial variations of any kind can be simultaneously present (e.g., face images from surveillance cameras). While much prior work has been conducted in face image quality, it has primarily focused on the quality of lab-collected face image databases where facial variations such as pose and illumination are synthetic/staged/simulated in order to isolate and facilitate evaluation of the different factors. In this work, we focus on automatic face image quality of unconstrained face images using the Labeled Faces in the Wild (LFW) [62] and IARPA Janus Benchmark A (IJB-A) [72] unconstrained face datasets. The contributions of this work are summarized as follows: • Collection of human ratings of face image quality for a large database of unconstrained face images (namely, LFW [62]) by crowdsourcing a small set of pairwise comparisons of face images and inferring the complete ratings with matrix completion. • Investigation of the utility of face image quality assessment by humans in the context of automatic face recognition performance. This is the first study on human quality assessment of face images that exhibit a wide range of quality factors (i.e., unconstrained face images). • Comparison of two methods for “ground truth” labeling the quality of face images in 80 a database: (i) human quality ratings and (ii) quality labels computed from similarity scores from COTS matchers. The latter serves as an “oracle” for a face quality measure that is correlated with recognition performance. • Automatic prediction of the face image quality of an unseen image using image features from a deep neural network. Our experimental evaluation follows the methodology advocated by Grother and Tabassi [56] where a biometric quality measurement is tested by “relating quality values to empirical matching results.” Our evaluation focuses on two primary uses of the proposed face image quality measure: (i) for ranking a collection of face images, and (ii) to reject low quality face images to improve error rates (e.g., FNMR) of automatic face recognition systems. 3.1 Related Work A number of studies (e.g., [1, 20, 21]) have offered in depth analyses of the performance of automatic face recognition systems with respect to different covariates. These studies have identified key areas of research and have guided the community to develop algorithms that are more robust to the multitude of variations in face images. The covariates studied include image-based, such as pose, illumination, expression, resolution, and focus, as well as subjectbased, such as gender, race, age, and facial accessories (e.g., eyeglasses). In general, it is typically shown that face recognition performance degrades due to these different sources of variability. 
Intuitively, the magnitude of degradation is algorithm-specific.

Prior works have proposed face image quality as some measure of the similarity to reference face images (typically frontal pose, uniform illumination, neutral expression). For example, the method in [116] uses luminance distortion from a high quality reference image for adaptive fusion of two face representations. Wong et al. [142] propose probabilistic similarity to a reference model of “ideal” face images for selecting high quality frames in video-to-video verification, and Best-Rowden et al. [17] investigated structural similarity (SSIM) for quality-based fusion within a collection of face media. Reference-based approaches are dependent on the face images used as reference and may not generalize well to different databases or face images with multiple quality factors present.

Table 3.1 Summary of Related Work on Automatic Methods for Face Image Quality

Hsu et al. [60] (2006). Database: FRGC: 1,886 (n/a); passports: 2,000 (n/a); mugshots: 1,996 (n/a). Target quality value: continuous (genuine score). Learning approach: neural network to combine 27 quality measures (exposure, focus, pose, illumination, etc.) for prediction of genuine scores. Evaluation: ROC curves for different levels of quality (FaceIt algorithm by Identix).

Aggarwal et al. [3] (2011). Database: Multi-PIE: 6,740 (337)∗; FacePix: 1,830 (30). Target quality value: continuous (genuine score) or binary (algorithm success vs. failure, requires matching prior to quality). Learning approach: MDS to learn a mapping from illumination features to genuine scores; predicted genuine score compared to algorithm score to predict algorithm success or failure. Evaluation: prediction accuracy of algorithm success vs. failure; ROC curves for predicted, actual, 95% and 99% retained (SIFT-based and PittPatt algorithms).

Phillips et al. [104] (2013). Database: PaSC: 4,688 (n/a); GU†: 4,340 (437). Target quality value: binary (low vs. high). Learning approach: PCA + LDA classifier. Evaluation: Error vs. Reject curve for FNMR vs. percent of images removed.

Bharadwaj et al. [22] (2013). Database: CAS-PEAL: n/a (1,040); SCFace: n/a (130). Target quality value: quality bins (poor, fair, good, excellent). Learning approach: SVM on GIST and HOG features. Evaluation: ROC curves, rank-1 accuracy, EER, % histogram overlap (COTS algorithm).

Abaza et al. [1] (2014). Database: GU†: 4,340 (437). Target quality value: binary (good vs. ugly). Learning approach: neural network (1-layer) to combine contrast, brightness, sharpness, focus, and illumination measures. Evaluation: rank-1 identification for blind vs. quality-selective fusion.

Dutta et al. [42] (2014). Database: Multi-PIE: 3,370 (337)‡. Target quality value: continuous (false reject rate). Learning approach: probability density functions (PDFs) model the interaction between image quality (deviations from frontal pose and uniform lighting) and recognition performance. Evaluation: predicted vs. actual verification performance for different clusters of quality (FaceVACS algorithm).

Kim et al. [68]. Database: FRGC: 10,448 (322). Target quality value: binary (low vs. high) or continuous (confidence of the binary classifier). Learning approach: objective (pose, blurriness, brightness) and relative (color mismatch between train and test images) face image quality measures as features fed into an AdaBoost binary classifier. Evaluation: identification rate w.r.t. fraction of images removed; ROC curve with and without low quality images (SRC face recognition algorithm).

Chen et al. [35] (2015). Database: SCFace: 2,080 (130) (trained with FERET, FRGC, LFW, and non-face images). Target quality value: 0 to 100 (rank-based quality score). Learning approach: a ranking function is learned by assuming images from different databases are of different quality and images from the same database are of equal quality. Evaluation: visual quality-based rankings; identification rate.

Proposed approach. Database: LFW: 13,233 (5,749); IJB-A: 5,399 (500). Target quality value: continuous (human quality ratings or normalized comparison scores). Learning approach: support vector regression with image features from a deep convolutional neural network [136]. Evaluation: Error vs. Reject curves; visual quality-based ranking.

Note: n/a indicates that the authors did not report the number (an unknown subset of the database may have been used).
∗ Only the illumination subset of Multi-PIE.
† GU denotes the Good and Ugly partitions of the Good, Bad, and Ugly (GBU) face database.
‡ Only neutral expressions from Multi-PIE.

More recently, especially with the influx of unconstrained face images, interest has grown in automatic measures of face image quality that can encompass multiple quality factors and, hence, determine the degree of suitability of an arbitrary face image for automatic matching. Table 3.1 summarizes related works in automatic face image quality which are learning-based approaches. These methods are related in that they all define some target quality which is related to automatic recognition performance. The target quality value can be a prediction of the genuine score (e.g., [3, 60]), a bin indicating that an image is poor, fair, or good for matching (e.g., [22]), or a binary value of low vs. high quality image (e.g., [1, 68, 104]). For example, Bharadwaj et al. fuse similarity scores from two COTS matchers, define quality bins based on CDFs of images that were matched correctly and incorrectly, and use a support vector machine (SVM) trained on holistic image features to classify a test image as poor, fair, good, or excellent quality [22]. Rather than defining target quality values for a training database of face images, Chen et al. propose a “learning to rank” framework which assumes a rank-ordering of a set of databases (e.g., non-face images < unconstrained face images < ID card face images) where face images from the same database have equal quality; rank weights from multiple types of features are learned and then mapped to a quality score 0∼100 [35].

In our approach, we annotate a large database of unconstrained face images with target quality values (defined as either human quality ratings or score-based values from a COTS matcher), extract image features using a deep convNet [136], and learn a model for prediction of face quality from the deep convNet features using support vector regression. The target quality values in this work are continuous and allow for a fine-tuned quality-based ranking of a collection of face images.

Table 3.2 Performance of Face Recognition Algorithms on the BLUFR Protocol [85]

Algorithm | TAR @ 0.1% FAR | DIR @ 1% FAR
HDLBP + JointBayes [34]* | 41.66 | 18.07
Yi et al. [146] | 80.26 | 28.90
DCNN [136] | 89.80 | 55.90
COTS-A | 88.14 | 76.28
COTS-B | 76.01 | 53.21
* Performance here for [34] was reported by [85].

3.2 Face Image Databases and COTS Matchers

In this work, we utilize two unconstrained face databases: Labeled Faces in the Wild (LFW) [62] and IARPA Janus Benchmark A (IJB-A) [72]. Both LFW and IJB-A contain face images with unconstrained facial variations that affect the performance of face recognition systems (e.g., pose, expression, illumination, occlusion, resolution, etc.). The LFW database consists of 13,233 images of 5,749 subjects, while the IJB-A database consists of 5,712 images and 2,085 videos of 500 subjects.
Face images in the LFW database were detected by the Viola-Jones face detector [135] so the pose variations are limited by the pose tolerance of the Viola-Jones detector. Face images in IJB-A were manually located, so the database is considered more challenging than LFW due to full pose variations [72]. See Fig. 3.3 for sample face images from the two databases. Because face image quality needs to be evaluated in the context of automatic face recognition performance, we make use of two commercial face matchers, denoted as COTS-A and COTS-B. Table 3.2 shows that COTS-A and COTS-B are competitive algorithms on the BLUFR protocol [85] for the LFW database. Performance is also reported for the deep learning-based matcher proposed by Wang et al. [136] as DCNN. The feature representation from [136] is used in this work to predict face image quality. 84 (a) LFW (b) IJB-A Figure 3.3 Sample face images from the (a) LFW [62] and (b) IJB-A [72] unconstrained face databases. 85 3.3 Face Image Quality Labels Biometrics and computer vision heavily rely on supervised learning techniques when training sets of labeled data are available. When the aim is to develop an automatic method for face image quality, compiling a quality-labeled face image database is not straightforward. The definition of face image quality (i.e., a predictor of automatic matching performance) does not lend itself to explicit labels of face image quality, unlike labels of facial identity or face vs. non-face labels for face recognition and detection methods, respectively. Possible approaches for generating quality labels of face images include: 1. Combine various measurements of image quality factors into a single value which indicates the overall face quality. 2. Human annotations of perceived image quality. 3. Based on comparison scores (or performance measures) from automatic face recognition matchers. The issues with 1) are that it is an “ad-hoc”/heuristic approach and, thus far, has not achieved much success (e.g., [104]). The issue with 2) is that human perception of quality may not be indicative of automatic recognition performance; previous works [22, 56] have stated this consensus but, to our knowledge, the only studies to investigate these statements were conducted on constrained face images (e.g., mugshots) [2,60]. The issue with 3) is that comparison scores are obtained from a pair of images, so labeling single images based on comparison scores (or performance) can be problematic. However, this approach achieved some success for fingerprint [56, 125], and only few studies [22, 104] have considered it for face quality. In this work, we investigate both methods 2) and 3), detailed in the remainder of this section. 86 3.3.1 Human Ratings of Face Image Quality Because of the inherent ambiguity in the definition of face image quality, framing an appropriate prompt to request a human to label the quality of a face image is challenging. If asked to rate a face image on a scale of 1 to 5, for example, there are no notions as to the meaning of the different levels. Additionally, some prior exposure to the variability in the face images that the human will encounter may be necessary so that they know what kinds of “quality” to expect in face images (i.e., a baseline) before beginning the quality rating task. 
In this work, we choose to only collect quality labels for relative pairwise comparisons of face images by asking the following question: “Which face (left or right) has better quality?” Crowdsourcing literature [148] has demonstrated that ordinal (comparison-based) tasks are generally easier and take less time than cardinal (score-based) tasks. Ordinal tasks additionally avoid the calibration efforts needed for cardinal responses from raters who inherently use different ranges for decision making (i.e., biased ratings, inflated vs. conservative ratings, meaning of absolute ratings changes with exposure to more data). To obtain absolute quality ratings for individual face images, we make use of a matrix completion approach [148] to infer the quality rating matrix from the pairwise comparisons. Because it is infeasible to have multiple persons manually assess and label the qualities of all face images in a large database, this approach is desirable in that it only requires a small set of quality labels from each human rater in order to infer the quality ratings for the entire database. The details of data collection and the matrix completion approach are discussed in the remainder of this section.

3.3.1.1 Crowdsourcing Comparisons of Face Quality

Amazon Mechanical Turk (MTurk)1 was utilized to facilitate collection of pairwise comparisons of face image quality from multiple human raters (i.e., MTurk “workers”). Given a pair of face images, displayed side by side, our Human Intelligence Task (HIT) was to select a response to the prompt “Indicate which face has better quality” out of the following options: (i) left face is much better, (ii) left face is slightly better, (iii) both faces are similar, (iv) right face is slightly better, and (v) right face is much better. Fig. 3.4 shows the interface used to collect the responses.2

1 https://www.mturk.com

Our HIT requested each worker to provide responses to a total of 1,001 face image pairs, made up of 6 tutorial pairs, 974 random pairs, and 21 consistency check pairs. The tutorial pairs were pre-selected from the LFW database where the quality of one image was clearly better than the quality of the other (Fig. 3.5 shows the sets of images used). Because these pairs had “correct” responses, they allowed us to ensure that the worker had completed the tutorial introduction and understood the goal of the task. The next 974 pairs of images were chosen randomly from the LFW database, while the final 21 pairs were selected from the set of 974 as repeats to test the consistency of the worker’s responses. MTurk workers who attempted our HIT were only allowed to complete it if they passed the tutorial pairs, and we only accepted the submitted responses from workers who were consistent on at least 10 out of the 21 consistency check pairs.

In order to be eligible to attempt our HIT for assessment of face image quality, MTurk workers had to have previously completed at least 10,000 HITs from other MTurk “requesters” with an approval rate of at least 99%. These stringent qualifications helped to ensure that only experienced and reliable workers (in terms of MTurk standards) participated in our data collection.3 A total of 435 MTurk workers began our HIT. After removing 245 workers who did not complete the full set of 1,001 pairwise comparisons and 4 workers who failed the consistency check (inconsistent responses for 10 or more of the 21 repeated pairs), a total of 194 workers were each compensated US $5.00 through the MTurk crowdsourcing service.
2 The tool is available at http://cse.msu.edu/∼bestrow1/FaceOFF/.
3 The MTurk worker qualifications are managed by the MTurk website.

Figure 3.4 The interface used to collect responses for pairwise comparisons of face image quality from MTurk workers.

Figure 3.5 Face images (from the LFW database) used for the 6 tutorial pairs used to check whether MTurk workers understood the task before completing the pairwise comparisons used in our study of face image quality. For each of the tutorial pairs, one image was selected from the top row (high quality images) and one image was selected from the bottom row (low quality images), so the pairwise comparison of face quality had an unambiguous answer.

3.3.1.2 Matrix Completion

After collecting random sets of pairwise comparisons of face image quality from 194 workers via MTurk, we use the matrix completion approach proposed by Yi et al. [148] to infer a complete set of quality ratings for each worker on the entire LFW database (13,233 total face images). The aim is to infer $\hat{F} \in \mathbb{R}^{m \times n}$, the worker-rating matrix for face image qualities, where n is the number of workers and m is the number of face images. Yi et al. [148] show that only O(r log m) pairwise queries are needed to infer the full ranking list of a worker for all m items (face images), where r is the rank of the unknown rating matrix (r ≪ m). The maximum possible rank of the unknown rating matrix is r = n = 194 workers, and O(194 log 13,233) ≈ 800; hence, the 974 random pairs per worker collected in our study are sufficient for the matrix completion, especially since we expect r < n (i.e., the quality ratings from the n workers are not all independent).

While relative pairwise comparisons are often preferred in crowd-based tasks [148] because they avoid the biases from raters’ tendencies to give conservative or inflated responses when using an absolute scale (e.g., quality levels 1 to 5), we still observed a bias after the matrix completion, where the bias comes from a tendency to respond “Similar”. Fig. 3.6 shows an inverse relationship between the number of pairs that a worker marked “Similar” and the resulting range of quality ratings for that worker (after matrix completion). Note that this bias is not due to the coarse levels of left image is “much better” vs. “slightly better” because, prior to matrix completion, we combine these responses to simply “left is better”. Because of this observation, min-max normalization was performed on each worker’s quality ratings to transform them to the same range (0 to 1).

After matrix completion, there are face image quality ratings from 194 different workers for each face image in the LFW database. With the aim of obtaining a single quality rating per face image in the LFW database, we simply take the median value from all 194 workers to reduce the 194 × 13,233 matrix of quality ratings to a 1 × 13,233 vector of quality ratings (one per image in LFW). We empirically tested other heuristics (mean, min, max) but found that median seemed to result in the best quality ratings.

3.3.2 Recognition-based Face Image Quality Labels

Target quality labels acquired from similarity scores serve as an “oracle” for a quality measure that is highly correlated with automatic recognition performance. For example, if the goal is to detect and remove low-quality face images to improve the FNMR, then face images could be removed from a database in the order of their genuine comparison scores.
Previous works on biometric quality (fingerprint [56, 125] and face [22]) have defined “ground truth” or “target” quality labels as a measure of the separation between the sample’s genuine score and its impostor distribution when compared to a gallery of enrollment samples. A normalized comparison score for the jth query sample of subject i can be defined as

$$ z_{ij} = \frac{s^{G}_{ij} - \mu^{I}_{ij}}{\sigma^{I}_{ij}}, \qquad (3.1) $$

where $s^{G}_{ij}$ is the genuine score and $\mu^{I}_{ij}$ and $\sigma^{I}_{ij}$ are the mean and standard deviation, respectively, of the impostor scores for the query compared to the gallery. Previous works then bin the normalized comparison scores into quality bins based on the cumulative distribution functions (CDFs) of sets of correctly and incorrectly matched samples [22, 56, 125]. Instead, we propose to directly predict the zij for a given face image to obtain a continuous measure of face image quality.

Target quality values defined based on comparison scores are confounded by the fact that a comparison score is computed from two face images, but we are trying to label the quality of a single face image. A simplifying assumption can be made if the quality of the enrollment samples is at least as good as the quality of the probe samples; because comparison scores are typically governed by the low quality samples [56], the quality value can be assigned to the probe image.

Figure 3.6 The resulting range of the face quality values (after matrix completion) for a particular worker inversely depends on the number of pairs that the worker marked “Similar” quality. Although collection of relative responses avoids bias present when workers are asked to rate individual images on an absolute scale, bias is still present from a tendency to respond “Similar”. This indicates that normalization is required to transform the quality ratings from each worker to the same scale.

Figure 3.7 Histogram of rank correlations between the face image quality ratings of all pairs of MTurk workers (194 × 193 / 2 = 18,721 total pairs of workers). The quality ratings are those obtained after matrix completion. The degree of concordance between workers is 0.37, on average.

Figure 3.8 Illustration of the pairwise quality issue. Images in the left and right columns are individually of high and low qualities, respectively. However, when compared with the other images, they can produce both high and low similarity scores. (Similarity scores are from COTS-A with range of [0, 1].)

To allow for this simplifying assumption, we manually selected the best quality image for every subject in the LFW database. There are 1,680 subjects in LFW with at least two face images. The best image selected by us is placed in the gallery (1,680 images, one per subject), while the remaining 7,484 images of these subjects are used as the probe set. The additional 4,069 images in the LFW database (subjects with only a single image) are used to extend the size of the gallery. Normalized comparison scores are computed using Eqn. (3.1) for the 7,484 probe images for each of the face matchers (COTS-A, COTS-B, and DCNN) and are used as score-based target face quality values.
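To make the score-based labeling concrete, the following is a minimal sketch of computing zij from a matcher's similarity scores according to Eqn. (3.1). It is not the implementation used in this work; the score-matrix layout, the function name, and the assumption of one enrolled mate per probe subject are illustrative.

```python
import numpy as np

def score_based_quality(scores, probe_ids, gallery_ids):
    """Compute z_ij = (s^G_ij - mu^I_ij) / sigma^I_ij for each probe image.

    scores      : (num_probes, num_gallery) similarity matrix from a face matcher
    probe_ids   : subject label of each probe image
    gallery_ids : subject label of each gallery image
    """
    scores = np.asarray(scores, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)

    z = np.empty(scores.shape[0])
    for j in range(scores.shape[0]):
        genuine_mask = gallery_ids == probe_ids[j]        # the probe's enrolled mate(s)
        impostor_scores = scores[j, ~genuine_mask]        # all non-mated gallery entries
        s_genuine = scores[j, genuine_mask].max()         # genuine score vs. the gallery mate
        z[j] = (s_genuine - impostor_scores.mean()) / impostor_scores.std()
    return z
```

The resulting z values then serve as continuous, matcher-specific target labels for the prediction model described in the next section.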
3.4 Automatic Prediction of Face Quality

Given that we have obtained face image quality labels for the LFW database, we now wish to train a model to automatically predict the quality of an unseen face image. Ideally, we would compile a set of automatically extracted image features that are measurements of known quality factors that affect face recognition performance, such as pose, illumination, expression, occlusion, contrast, focus, etc. Rather than trying to handcraft a set of image features for our task of predicting face image quality, we make use of features extracted from a deep convolutional neural network which was trained for recognition purposes by Wang et al. [136]. The features are 320-dimensional, so we refer to them as Deep-320 features. The deep network in [136] was trained on the CASIA WebFaces database [147]. We additionally consider a 5-dimensional feature set, referred to as Vishnu-5, which includes a face alignment score, the number of occluded landmarks (out of 68 total), and measures of facial pose (i.e., yaw, pitch, and roll).

Using either the Deep-320 or Vishnu-5 image features, we then train a support vector regression (SVR) [31] model with a radial basis kernel function to predict either the normalized comparison scores (zij) from a commercial matcher or the human quality ratings. The parameters for SVR are determined via grid search on a validation set of face images.

3.5 Experimental Evaluation

The aim of this work is twofold:

1. Label the target, or “ground truth”, quality values of a face image database.

2. Train a model to automatically predict the target quality values using features automatically extracted from an unseen test face image (prior to matching).

Hence, in Sec. 3.5.1, we first evaluate the target quality values to determine their utility for automatic recognition. In Sec. 3.5.2 we then evaluate how well the target quality values can be predicted by the proposed model for automatic face image quality. Following the methodology advocated by Grother and Tabassi [56], we evaluate the face quality measures using the following performance metrics.

• Error versus Reject (EvR) curve evaluates how efficiently rejection of low quality samples results in decreased error rates. The EvR curve plots an error rate (FNMR or FMR) versus the fraction of images removed/rejected, where the error rates are recomputed using a fixed threshold (e.g., overall FMR = 0.01%) after a fraction of the images have been removed.

We additionally provide visual inspections of face images rank-ordered by the proposed face image quality.

3.5.1 Target Face Image Quality Values

First, the face images in the LFW database are “ground truth” labeled with the methods discussed in Section 3.3. We refer to these quality values as target quality values and the ones predicted by our model as predicted. Fig. 3.9 shows the distributions of the target labels for COTS-A zij, COTS-B zij, and the human ratings after matrix completion, as well as a measure of quality output by the COTS-B matcher (for comparison).

Figure 3.9 Rank correlations between the different target face quality values considered in this work: (a) Kendall's tau, (b) Spearman. COTS-B FQ is a face quality measure output by COTS-B (black-box method to us, included for comparison). Three red asterisks indicate that the correlations are statistically significant at α = 0.001. The score-based measures of face quality (zij) from COTS-A and COTS-B have the strongest correlation, while the human quality ratings have the weakest correlation with the other quality measures.
Fig. 3.9 also shows that the rank correlation is fairly low between the human ratings of quality and the score-based quality values, while the score-based quality values from the two matchers are highly correlated.

We evaluate the target quality values using the same gallery/probe setup of the LFW database that was used to compute the normalized comparison scores (zij). This allows for comparison of the human quality ratings and the score-based quality values. Fig. 3.10 plots EvR curves for both methods, evaluated for three different face matchers (COTS-A, COTS-B, and DCNN [136]). Fig. 3.10(a) shows that removing probe images in order of human quality ratings does decrease FNMR for all three matchers. So, human quality ratings are correlated with recognition performance; however, the score-based quality values are much more efficient in reducing FNMR. This is expected because the score-based target quality values are computed from the same comparison scores used to compute the FNMR. Again, the score-based quality values here somewhat serve as an “oracle” for a desirable quality measure.

The utility of the target quality values in terms of reducing FMR in Fig. 3.10(b) is not as apparent; in fact, removing low quality images based on human quality ratings clearly increases FMR for COTS-B (though the magnitude of the increase is quite small). The relation between face quality and impostor scores (i.e., FMR) is generally less of a concern. For biometric quality, in general, we desire high quality samples to produce low impostor similarity scores, but low quality samples may also produce low (or even lower) impostor scores. If this is the case, low quality face images may be beneficial to FMR for empirical evaluation, but still undesirable operationally.

Figure 3.10 Error vs. Reject curves for (a) FNMR and (b) FMR on the LFW database (5,749 gallery and 7,484 probe images). Probe images were rejected in order of target (i.e., “ground truth”) quality values of human quality ratings or score-based quality values (zij). Thresholds are fixed at (a) 0.2 FNMR and (b) 0.01 FMR for comparison of the three face matchers (COTS-A, COTS-B, and DCNN [136]).

3.5.2 Predicted Face Image Quality Values

The proposed framework for automatic prediction of face image quality (both human ratings and score-based quality values) is used to predict the quality of face images from the LFW [62] and IJB-A [72] databases. The prediction models for both databases are trained using LFW face images and the following experimental protocols.

3.5.2.1 Train, Validate, and Test on LFW: We first divide the 7,484 face images of the 1,680 subjects with two or more images in LFW into 10 random splits for training and testing data, where the subjects are randomly split into 2/3 and 1/3 for training and testing, respectively. For each split, we then conduct 5-fold cross-validation within the training set to determine the parameters (via grid search) for the support vector regression model.
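A minimal sketch of this per-split parameter search is given below, assuming the Deep-320 features and target quality values are already available as arrays; scikit-learn and the specific parameter grid are illustrative choices rather than those prescribed by this work.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.svm import SVR

def fit_quality_model(features, targets, subject_ids):
    """Grid-search an RBF-kernel SVR within a training split, then refit on all of it.

    features    : (n_images, 320) Deep-320 feature matrix
    targets     : (n_images,) target quality values (z_ij or human ratings)
    subject_ids : (n_images,) subject labels, used to group the inner folds
    """
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1.0]}  # illustrative grid
    search = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid,
        cv=GroupKFold(n_splits=5),            # 5-fold cross-validation within the training set
        scoring="neg_mean_squared_error",
    )
    search.fit(features, targets, groups=subject_ids)
    return search.best_estimator_             # refit on the full training split

# model = fit_quality_model(train_feats, train_targets, train_subjects)
# predicted_quality = model.predict(test_feats)
```

Grouping the inner folds by subject is one way to keep a subject's images together during validation; the dissertation does not specify whether the inner folds were constructed this way.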
The selected set of parameters is then applied to the full training set to result in a single model for each of the 10 splits, which are then used to predict the quality labels of the images in each of the 10 test sets. This framework ensures subject-disjoint training and testing sets, and parameter selection is conducted within a validation set, not optimized for the test sets.

Table 3.3 gives the rank correlation (mean and standard deviation over the 10 splits) between the target and predicted quality values for human quality ratings and score-based quality values (for COTS-A and COTS-B).

Table 3.3 Rank Correlation, (a) Kendall's tau and (b) Spearman, Between Target and Predicted Quality Labels (Mean ± Standard Deviation Over 10 Random Splits of LFW Images)

(a) Kendall's tau
Face Quality Label | Deep-320 | Vishnu-5
COTS-A zij | 0.395 ± 0.018 | 0.232 ± 0.031
COTS-B zij | 0.305 ± 0.019 | 0.202 ± 0.018
Human Rating | 0.412 ± 0.016 | 0.295 ± 0.018

(b) Spearman
Face Quality Label | Deep-320 | Vishnu-5
COTS-A zij | 0.558 ± 0.023 | 0.340 ± 0.042
COTS-B zij | 0.442 ± 0.026 | 0.297 ± 0.026
Human Rating | 0.585 ± 0.019 | 0.431 ± 0.025

The first observation is that the Deep-320 features better predict all three quality measures than the Vishnu-5 features. Additionally, prediction of human quality ratings is more accurate than prediction of score-based quality from either COTS-A or COTS-B, likely due to the difficulty in predicting particular nuances of each matcher.

To further investigate the resulting face quality predictions, we computed the Spearman rank correlation between the target and predicted values separately for the multiple images of each subject; i.e., given multiple face images of a subject, we rank them based on the target and the predicted quality values, and compute the correlation between the two ranking lists. Figs. 3.12 and 3.13 show examples of strong correlation between target and predicted human quality ratings, while Figs. 3.14 and 3.15 show examples of weak or even negative correlation. Figs. 3.16 and 3.17 show examples of negative correlation between target and predicted score-based quality for COTS-A. It appears that weak correlation is observed when the multiple images of a subject are of similar quality; it is difficult to achieve a consistent fine-tuned ranking of face images when all of the qualities are similar.

To evaluate the quality values in the context of automatic face recognition performance, error vs. reject curves (for FNMR) are plotted in Fig. 3.11 for both target and predicted quality values. The figures demonstrate that rejecting low quality face images based on predicted zij, predicted human ratings, or the COTS-B measure of face quality results in comparable efficiency in reducing FNMR (e.g., removal of 5% of probe images lowers FNMR by ∼2%).

Figure 3.11 Error vs. Reject curves for target and predicted face image quality values. The curves show the efficiency of rejecting low quality face images in reducing FNMR at a fixed FMR of 0.001%. The model used for the face quality predictions in (a)-(c) is support vector regression on the Deep-320 features from the deep convNet in [136]. Panels: (a) COTS-A, zij; (b) COTS-A, Human; (c) COTS-B, zij.
However, none of the methods are near as efficient as rejecting images based on the target zij values, which serve as an oracle for a predicted face quality measure that is highly correlated with the recognition performance.

3.5.2.2 Train and Validate on LFW, Test on IJB-A: In this framework, we conduct 5-fold cross-validation over the 7,484 LFW images (folds are subject-disjoint) to determine the parameters for the support vector regression model via grid search. We then apply the selected set of parameters to all of the LFW training images. This model trained on LFW face images is then used to predict the quality of face images in the IJB-A database. The Deep-320 image features [136] are used here.

We currently do not have any ground truth quality labels for IJB-A face images because we did not collect human annotations for this database, and we do not have a recognition protocol set up with a higher quality gallery. Initial efforts to construct a high quality gallery (faces with frontal pose, neutral expression, no occlusion, etc.) for IJB-A indicated that this is not possible for all the subjects. Hence, the current evaluation entails visual inspection of the rank-ordering of face images based on the predicted quality values. Figs. 3.18–3.20 show that the proposed automatic face quality measure does a fairly good job at sorting face images (and video frames in Fig. 3.21) in order of face quality. Figs. 3.18–3.20 also show face images sorted by the Rank-based Quality Score (RQS) of Chen et al. [35] for comparison. Though it is difficult to compare the two methods without recognition experiments, there are a few cases where the top highest quality faces predicted by our method appear to be better than the RQS ranking (e.g., top row in Fig. 3.20).

Figure 3.12 Face images from a subject in LFW are rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing face quality. The Spearman correlation between the target and predicted rank orderings for this subject is 0.72.

Figure 3.13 Face images from LFW are rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have positive rank correlation between target and predicted rankings. For each of the three example subjects, the Spearman correlations between the target and predicted rank orderings are 0.94, 0.90, and 0.50 (top to bottom).

Figure 3.14 Face images from LFW rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have negative (or zero) rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlations between the target and predicted rank orderings are -0.50 and 0.00 (top to bottom).
104 Ranked by Target Human Quality Ratings Ranked by Predicted Human Quality Ratings Figure 3.15 Face images from LFW rank-ordered by target (left) and predicted (right) human quality ratings, in order of increasing quality. Examples shown have strong negative rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlation between the target and predicted rank orderings are -0.90, and -0.70 (top to bottom). 105 Ranked by Predicted COTS-A zij Ranked by Target COTS-A zij Figure 3.16 Face images from LFW rank-ordered by target (left) and predicted (right) scorebased quality values (COTS-A zij ), in order of increasing quality. Examples shown have negative rank correlation between target and predicted rankings. For each of the example subjects, the Spearman correlation between the target and predicted rank orderings are -0.33 and -0.37 (top to bottom). 106 Ranked by Predicted COTS-A zij Ranked by Target COTS-A zij Figure 3.17 Face images from LFW rank-ordered by target (left) and predicted (right) scorebased quality values (COTS-A zij ), in order of increasing quality. Examples shown have negative rank correlation between target and predicted rankings. For each of the three example subjects, the Spearman correlation between the target and predicted rank orderings are -1.00, -0.20, and -0.31 (top to bottom). 107 Ranked by Predicted Human Rating Ranked by RQS [35] Figure 3.18 Face images from IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136]) and human quality ratings from the LFW database) and (right) Rank-based Quality Score (RQS) [35] for comparison. 108 Ranked by Predicted Human Rating Ranked by RQS [35] Figure 3.19 Face images from IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136]) and human quality ratings from the LFW database) and (right) Rank-based Quality Score (RQS) [35] for comparison. 109 Ranked by Predicted Human Rating Ranked by RQS [35] Figure 3.20 Face images from two subjects in IJB-A [72] sorted by face image quality (best to worst). The face image qualities were automatically predicted by (left) the proposed approach (SVR model on Deep-320 image features [136]) and human quality ratings from the LFW database) and (right) Rank-based Quality Score (RQS) [35] for comparison. 110 Figure 3.21 Face images from the videos of example subjects in IJB-A [72] sorted by face image quality (best to worst) which was automatically predicted by the proposed approach using a model (SVR on Deep-320 image features [136]) trained on human quality ratings from the LFW database. 111 3.6 Conclusion Automatic face image quality assessment is a challenging problem with important operational applications. Automatic detection of low quality face images would be beneficial in maintaining the integrity of enrollment databases, reacquisition prompts, quality-based fusion, and adaptive recognition approaches. In this work, we have investigated two methods for assigning target face image quality values to a large database of face images to be used for training, and proposed a model for automatic prediction of face image quality using only image features extracted prior to matching. 
The conclusions and contributions can be summarized as follows: • Human ratings of face image quality (obtained from crowdsourcing and matrix completion) are correlated with automatic recognition performance for unconstrained face images. Rejection of 5% of the lowest quality face images (based on human quality ratings) in the LFW database resulted in ∼ 2% reduction in FNMR. • Human quality ratings are not as correlated with recognition performance as are target face quality values obtained from similarity scores (matcher-specific). This was as expected since score-based quality serves as an oracle for an ideal quality measure (performance is directly computed from the same similarity scores), whereas human quality ratings are solely based on single images. • Automatic prediction of human quality ratings is more accurate than prediction of score-based face quality values. It is difficult to predict the score-based quality because of nuances of specific matchers and pairwise quality factors (i.e., comparison scores are a function of two face images, but we are using the scores to label the quality of a single face image). • Visual inspection of face images rank-ordered by the proposed automatic face quality measures (both human ratings and score-based quality) are promising, even for cross112 database prediction (i.e., model trained on LFW [62] and tested on IJB-A [72] face images). 113 Chapter 4 Longitudinal Study of Automatic Face Recognition 4.1 Introduction Technological advancements in automatic face recognition have progressively tackled challenges caused by variations in facial pose, illumination, and expression (collectively called PIE variations). Current efforts (e.g., [128,136]) are breaking ground on robustness to “faces in the wild” (e.g., images posted on the web) to account for PIE, occlusion, and partial face images. Comparatively, aging variations (i.e., large time lapse between pairs of images being compared) have received considerably less attention in the face recognition community. Published studies on facial aging in the context of automatic face recognition have primarily employed cross-sectional techniques where a population of individuals who differ in age are analyzed according to differences between age groups [15, 55, 70, 87, 99]. However, cross-sectional analysis cannot adequately explore age-related effects because assumptions of independent observations require that there be only one measurement per individual in the study (see Fig. 4.6. Past and future measurements are either not considered or are 114 (a) Ages 30.5 and 39.6 (0.423) (b) Ages 32.2 and 40.3 (0.433) (c) Ages 29.5 and 38.3 (0.498) (d) Ages 39.2 and 48.6 (0.500) Figure 4.1 Face image pairs of four subjects from the PCSO LS mugshot database which are age-separated by eight to ten years. Similarity scores from a state-of-the-art face matcher (COTS-A) are shown in parentheses (score range is [0.0, 1.0]). The thresholds at 0.01% and 0.1% FAR are 0.533 and 0.454, respectively. Hence, all of these genuine pairs would be falsely rejected at 0.01% FAR, while the two female subjects, (a) and (b), would also be rejected at 0.1% FAR. summarized into a single measurement which loses information; trends of individuals over time are not analyzed. Hypotheses about facial aging are, instead, longitudinal by nature and require multiple measurements of the same individuals over time to reveal trends in comparison scores with respect to facial aging. 
To what extent facial aging affects the performance of automatic face recognition systems is of more than academic concern. Because the appearance of the face changes throughout a person’s life, most identity documents containing face images expire after a designated period of time; U.S. passports are only valid for five years for minors and ten years for adults, while U.S. driver’s licenses typically require renewal every five years. Additionally, to our knowledge, ensuring that a new (more recent) photo has been submitted for renewal is not verified, especially for renewals by mail or online. Validity periods of such identity documents may be too long if these photos are to be used with state-of-the-art face matching 115 systems. Fig. 4.1 shows that elapsed times of eight to ten years between two face images can cause false non-match errors. Studying how the actual comparison scores change over time is important for understanding the implications of operating with a global threshold1 (e.g., de-duplication and other open-set scenarios) on face recognition accuracy. While longitudinal studies for automatic iris recognition [54] and fingerprint recognition [149] have been published, to our knowledge, no large-scale longitudinal study of automatic face recognition performance has been reported in the literature. We aim to fill this gap by addressing the following question: How robust are state-of-the-art automatic face recognition systems to facial aging? In this chapter, we conduct a longitudinal analysis of the performance of state-of-the-art COTS face matchers on two longitudinal face image databases consisting of repeat criminal offenders (mugshots) from two different law enforcement agencies (see Table 4.2). The COTS matchers used here are among the top-ranked performers in the FRVT 2013 face recognition evaluation [55]. The contributions of this chapter can be summarized as follows: 1. Longitudinal analysis of two of the largest longitudinal databases studied to date. LEO LS contains 31, 852 images of 5, 636 subjects, and PCSO LS contains 147, 784 images of 18, 007 subjects, where the average time span between a subject’s multiple image acquisitions is 6.1 and 8.5 years, respectively. Such large-scale databases allow for evaluation of performance at low FAR values (e.g., 0.01% and 0.1%). Previous studies (e.g., [70, 99]) evaluated at 1% FAR and higher. 2. Determine the age-invariant properties of current state-of-the-art face matchers. Rates of change over time in genuine comparison scores are analyzed using mixed-effects regression models, which are appropriate for longitudinal data. In doing so, we quantify (i) the population-mean rate of change in genuine scores over time and (ii) the variability in subject-specific longitudinal trends (i.e., how closely individuals in the 1 A biometric system operating with a global threshold uses the same decision threshold for all subjects across all comparisons. 116 population follow the population-mean trend). We also investigate the influence of age at enrollment, sex, race, and face image quality. 3. Methodology and analysis tools for advancing the development and evaluation of ageinvariant face recognition algorithms. The analysis conducted in this chapter can be applied to any matcher and any database. Periodic reevaluation will be necessary as face recognition technology evolves to better address facial aging.2 Our previous longitudinal analysis of automatic face recognition was first published in [18]. 
The present work extends and refines our previous study in significant ways. The primary differences are as follows. (i) We study longitudinal effects of both aging (elapsed time) and age (biological age); [18] only studied elapsed time. (ii) Genuine scores are computed to represent a scenario where the youngest image of each subject is enrolled in a gallery (a subject with ni total images has ni − 1 scores, whereas [18] computed all ni(ni − 1)/2 pairwise genuine scores). Comparing query images to an enrollment image (a fixed point in time) simplifies the complex correlation structure that is present for all pairwise comparisons. (iii) We analyze an additional longitudinal face database (namely, LEO LS) from a different law enforcement agency than the PCSO LS database used in [18], and a different COTS matcher is used to obtain genuine scores for LEO LS. Still, longitudinal analysis shows similar results for both databases and matchers.

The remainder of this chapter is organized as follows. Section 4.2 highlights related work on facial aging as it pertains to automatic face recognition. Section 4.3 details the two longitudinal face databases used in this study. Section 4.4 explains the methodology used for longitudinal analysis. Section 4.5 gives results for both the PCSO LS and LEO LS face databases. Section 4.6 summarizes our observations about the current longitudinal capabilities of automatic face recognition.

2 To facilitate longitudinal study on other face datasets and matchers, the code of our longitudinal analysis will be made publicly available at http://biometrics.cse.msu.edu/.

4.2 Related Work

Almost all of the published studies that investigate the effects of facial aging on automatic face recognition performance adopt the following approach: (i) divide the database (face pairs) into partitions depending on age group or time lapse, (ii) report summary performance measures (e.g., TAR at fixed FAR) for each partition independently, and then (iii) draw conclusions from the differences in performance across the partitions. Such an approach has led to the following general conjectures [91]: (i) Face recognition performance decreases as the time elapsed between two images of the same person increases (e.g., [70, 87, 99]). (ii) Faces of older individuals are easier to recognize/discriminate than faces of younger individuals (e.g., [55, 87]). See Table 4.1 for a summary of these studies.3

Partitioning of data (images or subjects) based on age group or time lapse is often arbitrary and varies from one study to another. Erbilek and Fairhurst show that different age group partitionings result in different performance trends for both iris and signature modalities [43]. Furthermore, this cohort-based analysis with summary statistics cannot address whether age-related performance trends are due to changes in genuine (same subject) comparison scores, impostor (different subjects) comparison scores, or both.

Multilevel (hierarchical or mixed-effects) statistical models have been used for determining important factors (covariates) to explain the performance of face recognition systems. Beveridge et al. [20] apply generalized linear mixed models to verification decisions (accept or reject) made by three algorithms in the FRGC Exp. 4 evaluation. In addition to eight levels of FAR as a covariate, they analyze gender, race, image focus, eye distances, age, and elapsed time.
The limitations of this study include (i) the maximum elapsed time between face images of the same subject is less than one year, and (ii) it only involves 351 subjects. Poh et al. [110] utilized regression models to estimate subject-specific biometric (face and speech) performance trends over time, but the database used only contains 150 subjects and the elapsed times are less than two years. The longitudinal study on face recognition in this work follows the general methodology of linear mixed-effects statistical models outlined in [54] for iris recognition and [149] for fingerprint recognition.

3 Studies that address developing age-invariant face recognition algorithms (e.g., [50, 67]) are beyond the scope of this work.

Table 4.1 Summary of related work on the effects of facial aging on face recognition performance.

Ling et al. [87]. Database: Passports (private); FG-NET. Age or elapsed time partitions: 4–11 years elapsed time; 0–8, 8–18, and 18+ years old. Summary of findings: degradation in EER saturates after 4 years elapsed time; verification accuracies increase with increasing age group.

Klare and Jain [70]. Database: PCSO (200,000 mugshots, 64,000 subjects). Age or elapsed time partitions: 0–1, 1–5, 5–10, 10+ years elapsed time. Summary of findings: TARs at 1% FAR are 96.3%, 94.3%, 88.6%, and 80.5% for the listed elapsed time partitions; training/testing on different aging partitions decreases performance in some non-aging scenarios.

Otto et al. [99]. Database: MORPH-II. Age or elapsed time partitions: 0–1, 1–5 years elapsed time. Summary of findings: TARs at 1% FAR are 97% and 95% for the listed elapsed time partitions; the nose is the most stable facial component over time.

Bereta et al. [15]. Database: FG-NET. Age or elapsed time partitions: 0–5, 6–10, 11–15, 16–20, 21–30, and 30+ years elapsed time; 23–30, 31–40, 41–50, and 50+ years old. Summary of findings: identification accuracies of local descriptors (e.g., variants of LBP) when combined with Gabor wavelet magnitudes become relatively consistent across absolute ages and age gap groups, but accuracies are still fairly low for a small gallery.

NIST FRVT [55]. Database: Visa images (19,972 subjects). Age or elapsed time partitions: baby, kid, pre-teen, teen, young, parents, older. Summary of findings: error rates (for open-set identification) are higher for younger age groups when the same threshold is used for all age groups.

EER = equal error rate; TAR = true accept rate; FAR = false accept rate.

The two main databases used for research on facial aging, including automatic age estimation, age progression, and age-invariant face recognition, are FG-NET [78] and MORPH [113]. Panis et al. [100] provide a recent overview of research that has utilized the FG-NET database. While the public release of these databases greatly encouraged progress in these areas, the databases are not suitable for longitudinal analysis because (i) FG-NET contains only 82 subjects in total, and (ii) MORPH contains only a small number of subjects with multiple images over time (only 317 subjects have at least 5 images over at least 5 years).4 The Cross-Age Celebrity Dataset (CACD) [32] was recently released, containing 163,446 images of 2,000 celebrities across 10 years. However, because the images were downloaded from the web (via Google search), the unconstrained quality makes it difficult to statistically model the effects of facial aging.

4 Images in FG-NET are relatively unconstrained (scanned from personal photo collections), while the MORPH databases are mugshots, similar to LEO LS and PCSO LS used in this work but with different database properties (see Table 4.2).
Table 4.2 Facial Aging Databases

Database | Num. Subjects | Num. Imgs | Num. Imgs per Subject | Age Range (years)
FG-NET [78] | 82 | 1,002 | 6–18 (avg. 12) | 0–69 (avg. 16)
MORPH-II [113] | 13,000 | 55,134 | 2–53 (avg. 4) | 16–77 (avg. 42)
MORPH-II commercial [113]a | 20,569 | 78,207 | 1–76 (avg. 4) | 15–77 (avg. 33)
CACD [32] | 2,000 | 163,446 | n.a. (avg. 81) | 16–62 (n.a.)
LEO LSb | 5,636 | 31,852 | 4–20 (avg. 6) | 12–69 (avg. 31)
PCSO LSb | 18,007 | 147,784 | 5–60 (avg. 8) | 18–83 (avg. 35)

a This largest version of MORPH-II only has 317 subjects with at least 5 images acquired over at least 5 years.
b The longitudinal face image databases used in this study (details in Sec. 4.3).

Variations in pose, illumination, expression, etc., may largely influence the trends in similarity scores. Such covariates are difficult to quantify in order to “tease out” these effects from the longitudinal effects, so standardized imaging (near-frontal, neutral expression, uniform illumination) is preferable for the longitudinal study conducted in this work. Relatively constrained images, such as mugshots, help to ensure that other effects, such as PIE variations, are captured in the noise term in the statistical models. For the above reasons, our longitudinal analysis utilizes two new longitudinal face databases, detailed in Section 4.3.

4.3 Longitudinal Face Databases

Operational face image datasets maintained by government and law enforcement agencies can contain longitudinal records of individuals of magnitudes that are infeasible to collect in laboratory settings (e.g., elapsed times over 10+ years). These agencies routinely collect face images of the same individuals over time and have been doing so for relatively long durations, primarily for applications involving driver’s licenses, visa and passport applications/renewals, frequent travelers, and multiple arrests of repeat criminal offenders.

The sources of face images in our longitudinal analysis are mugshot bookings. While we acknowledge that lifestyle factors (e.g., drug5 and alcohol use, trauma, etc.) may increase aging rates for some individuals in this population (adult repeat criminal offenders), these accelerated agers are expected to be outliers in the statistical models in our analysis; the overall trends should be relatively robust to this factor. Additionally, we were not able to access any other longitudinal face data. We did attempt to use longitudinal face images from the State Department visa databases. However, we discovered that roughly 5% of genuine face images were duplicate photo submissions (e.g., an individual reuses the same photo for a visa renewal application), so the corresponding inaccurate age information rendered it unsuitable for longitudinal study.

The two databases used in this longitudinal study (LS), denoted LEO LS and PCSO LS, are subsets of subjects and images from two larger mugshot databases initially consisting of 3.7 and 1.5 million images, respectively. The following criteria were used to compile the subsets: (i) each subject has at least 4 (LEO LS) or 5 (PCSO LS) face images that were (ii) acquired over at least a 5-year time span, and (iii) each pair of consecutive images is time-separated by at least one month. Database statistics are shown in Fig. 4.2.
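The subset selection criteria above amount to a simple filter over booking records. The sketch below is illustrative only; the DataFrame columns (subject_id, acquisition_date) are assumed, and dropping a subject whose consecutive images are ever closer than one month is one simple reading of criterion (iii) (in practice, closely spaced images could instead be thinned).

```python
import pandas as pd

def select_longitudinal_subjects(df, min_images=5, min_span_years=5.0, min_gap_days=30):
    """Keep subjects satisfying PCSO LS-style longitudinal criteria."""
    keep = []
    for subject_id, grp in df.sort_values("acquisition_date").groupby("subject_id"):
        dates = grp["acquisition_date"]
        span_years = (dates.iloc[-1] - dates.iloc[0]).days / 365.25   # youngest to oldest image
        gaps_ok = (dates.diff().dropna().dt.days >= min_gap_days).all()
        if len(grp) >= min_images and span_years >= min_span_years and gaps_ok:
            keep.append(subject_id)
    return df[df["subject_id"].isin(keep)]
```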
The facial variations in the PCSO LS and LEO LS databases are well-controlled because the mugshots adhere to standards similar to those detailed in the ANSI/NIST-ITL 2011 face image standards (see https://www.nist.gov/itl/iad/image-group/ansinist-itl-standard-history). The standards specify, for example, that mugshots should be captured at frontal pose, with neutral expression, uniform illumination, and a background set to 18% gray. Because these databases are both from operational sources, some confounding factors are still present, such as minor pose and expression variations (see Fig. 4.5). We also observed rare occurrences of facial occlusions or injury, as shown in Fig. 4.4, but have retained such images in this study.

Figure 4.2 Statistics of the two longitudinal face image databases (PCSO LS and LEO LS) used in this study: PCSO LS contains 147,784 mugshots of 18,007 subjects (avg. of 8 mugshots per subject) and LEO LS contains 31,852 mugshots of 5,636 subjects (avg. of 6 mugshots per subject). (a) and (e) Number of face images per subject, (b) and (f) the time span of each subject (i.e., the number of years between a subject’s youngest and oldest face image acquisitions), (c) and (g) demographic distributions of sex (male, female) and race (white, black, Asian, Indian, unknown), and (d) and (h) the age of the youngest image of each subject (in years).

For both databases, we only include white and black race subjects in this study because there are too few subjects of other races to do a meaningful statistical analysis. Since human labeling errors pertaining to demographic attributes and subject ID can be inadvertently introduced in large-scale legacy databases, we determine the sex, race, and date of birth of a subject as the majority vote from each subject’s records to ensure consistent labels within each subject. Identifying all such errors was not feasible due to the large size of these databases, but a cursory examination of the PCSO LS database revealed 134 subject records that contained multiple identities (Fig. 4.3). These subject records were removed from our study.

Figure 4.3 Three examples of labeling errors in the PCSO LS face database. All pairs show two different subjects who are labeled with the same subject ID number in the database.

Figure 4.4 Examples of facial occlusions (sunglasses, bandages, and bruises) in the PCSO LS face database.

4.3.1 LEO LS Face Database

The LEO LS database contains 31,852 images of 5,636 subjects from an operational dataset of law enforcement images. Each subject has an average of 6 images over an average time span of 5.8 years (maximum of 8 years). Demographic makeup of the LEO LS database includes 2,009 white and 3,627 black subjects where 4,922 subjects are males and 714 are females.
Subjects in LEO LS are primarily adults, but there are 656 images of 369 subjects that are younger than 18 years old; these may be juvenile arrests or they could be data entry errors. (In the United States, a juvenile is typically under the age of 17.) Due to privacy considerations, we only have access to the comparison scores (both genuine and impostor), so we cannot show face images from this database.

4.3.2 PCSO LS Face Database

The PCSO LS database consists of 147,784 operational mugshots of 18,007 repeat criminal offenders booked by the Pinellas County Sheriff’s Office (PCSO) from 1994 to 2010. Each subject has an average of 8 images over an average time span of 8.5 years (maximum of 16 years). Demographic makeup of the PCSO LS database includes 11,002 white and 7,004 black subjects where 14,882 subjects are males and 3,124 are females. Example face images from PCSO LS are shown in Fig. 4.5. Each booking record in PCSO LS contains both the date of birth and the date of arrest (actual dates were unavailable for LEO LS, only the ages were provided to us).

4.3.3 Face Comparison Scores

Face comparison scores (similarities) were obtained from various commercial face matchers with the aim of evaluating current state-of-the-art longitudinal performance. Two matchers were applied to the PCSO LS database, and comparison scores were obtained from four different matchers for the LEO LS database. (Comparison scores and ancillary information (sex, race, age) for the LEO LS face image database were provided by the Image Group, National Institute of Standards and Technology (NIST), http://www.nist.gov/itl/iad/ig/.)

Table 4.3 Overall true accept rates (TARs) at fixed false accept rates (FARs) for various face matchers on the PCSO LS and LEO LS databases.

Database | Matcher | 0.01% FAR | 0.1% FAR | 1% FAR
PCSO LS | COTS-A | 94.98 | 97.83 | 99.14
PCSO LS | PittPatt | 41.54 | 58.65 | 78.30
LEO LS | COTS-B | 99.35 | 99.66 | 99.84
LEO LS | COTS-2 | 90.62 | 94.96 | 97.92
LEO LS | COTS-3 | 78.97 | 86.87 | 93.49
LEO LS | COTS-4 | 96.68 | 98.47 | 99.31

As shown in Table 4.3, COTS-A and COTS-B were the overall most accurate matchers. Due to space limitations, longitudinal results are only reported for COTS-A and COTS-B throughout the remainder of the chapter. COTS-A and COTS-B were both among the top-3 performers in the FRVT 2013 [55]. The original mugshot images were input to each COTS matcher, and a total of 26,216 and 129,773 genuine scores were computed for the LEO LS and PCSO LS databases, respectively, under the scenario where each subject’s set of face images are compared to his/her enrollment image. Genuine comparison scores, sij, between the enrollment and jth face images of subject i were standardized as yij = (sij − µ)/σ, where µ and σ are the mean and standard deviation of the genuine scores from all subjects. This standardized response, yij, is in terms of standard deviations from the mean of the genuine distribution, which allows interpretation of coefficients from mixed-effects regression models as quantifying the change in genuine scores as β standard deviations per year. Fig. 4.8 shows the distributions of COTS-A and COTS-B standardized genuine scores. The response variable for all mixed-effects models in this study is the standardized genuine comparison score. However, to evaluate face recognition performance, trends in genuine scores should be considered in context with an impostor distribution.
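For concreteness, the standardization just described, and the placement of a fixed-FAR threshold on the same standardized scale, could be computed roughly as follows. This is only a sketch: genuine and impostor are assumed to be numeric vectors of raw similarity scores from a single matcher, and the actual thresholds in this study come from the full impostor comparisons described next.

```r
# Sketch: z-score genuine similarity scores and express a fixed-FAR threshold on the
# same standardized scale. 'genuine' and 'impostor' are assumed numeric score vectors.
mu    <- mean(genuine)
sigma <- sd(genuine)
y     <- (genuine - mu) / sigma                 # standardized genuine scores y_ij

thr_raw <- quantile(impostor, probs = 1 - 1e-4) # raw score exceeded by 0.01% of impostors
thr_std <- (thr_raw - mu) / sigma               # threshold at 0.01% FAR, standardized
```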
For both the LEO LS and PCSO LS databases, we computed all possible impostor scores (5.5 million and 11.1 billion, respectively) to calculate thresholds at different fixed FAR values. The threshold at 0.01% FAR, for example, is used to determine when genuine scores drop below the threshold, causing false rejection errors.

4.4 Mixed-Effects Models

Mixed-effects models (also known as random-effects, multilevel, and hierarchical models) are widely used in various scientific disciplines for studying data that is hierarchically structured, including longitudinal data of repeated observations over time [44, 118]. In our case, face images are grouped by subject because we have repeated observations of each individual in our study. When data is structured in such a manner, responses from the same cluster/group/individual are correlated with each other and across time (for longitudinal data). Mixed-effects models enable analysis of variation in the response (here, standardized face comparison scores) that occurs at different levels of the data hierarchy.

Figure 4.5 Face images of six example subjects from the PCSO LS database. The enrollment face image (leftmost column) is the youngest image of each subject, and all query images are in order of increasing age. In this study, genuine similarity scores are computed by comparing the query images of each subject to his/her enrollment image.

Figure 4.6 An example of cross-sectional vs. longitudinal analysis. In (a), a cross-sectional approach (ordinary least squares (OLS) linear regression) is applied, which incorrectly assumes that all the scores are independent. In (b), OLS is instead applied six times, separately to each subject’s set of scores (subjects shown in Fig. 4.5). The slope estimated by cross-sectional analysis (black dotted line) is much flatter than the slopes of the subject-specific trends (solid colored lines in (b)). The longitudinal analysis in this work utilizes mixed-effects models, which provide “shrunken” OLS estimates for each subject, where the OLS trends shrink towards a population-mean trend [44, 118], further accounting for the correlation that exists between scores from the same subject.

Figure 4.7 Age distribution of a random sample of 200 subjects from the PCSO LS database. Each line denotes the age span of a subject (i.e., age of youngest image to age of the oldest image), separated along the y-axis by the elapsed time for each subject (i.e., the length of the age span).

Ideally, longitudinal data collection would observe all individuals in the study following the exact same schedule over the entire duration of interest. However, longitudinal data is typically not this nicely structured because it is difficult (and expensive) to collect, or it must be analyzed retrospectively, as is the case with the mugshot databases used in this study. Instead, longitudinal data is most often time-unstructured and unbalanced, meaning individuals in the study population are observed at different schedules and have different numbers of observations.
For the mugshot databases, this translates to different rates of recidivism for each subject. Fig. 4.2 shows that subjects in the LEO LS and PCSO LS databases have anywhere from 4 to more than 20 mugshots, and Fig. 4.7 shows that the age spans of the subjects are highly unstructured. Mixed-effects models can handle imbalanced and time-unstructured data and are preferable over other approaches because they model both the mean response (fixed effects define the population-mean trend), as well as the covariance structure (random effects allow deviations of individuals from the population-mean). In longitudinal data, this covariance structure has a complicated form which stems from the fact that error terms are not independent (as is assumed in standard linear regression). The remainder of this section provides details of the models and covariates of interest.

Table 4.4 Mixed-Effects Model Formulations

Model A:
Level-1: yij = ϕ0i + εij
Level-2 (intercept): ϕ0i = β00 + b0i

Model BT:
Level-1: yij = ϕ0i + ϕ1i Tij + εij
Level-2 (intercept): ϕ0i = β00 + b0i
Level-2 (slope): ϕ1i = β10 + b1i

Model CT:
Level-1: yij = ϕ0i + ϕ1i Tij + εij
Level-2 (intercept): ϕ0i = β00 + β01 AGEie + b0i
Level-2 (slope): ϕ1i = β10 + b1i

Model CA:
Level-1: yij = ϕ0i + ϕ1i AGEij + εij
Level-2 (intercept): ϕ0i = β00 + β01 AGEie + b0i
Level-2 (slope): ϕ1i = β10 + b1i

Model D:
Level-1: yij = ϕ0i + ϕ1i Tij + εij
Level-2 (intercept): ϕ0i = β00 + β01 AGEie + β02 AGEie² + b0i
Level-2 (slope): ϕ1i = β10 + β11 AGEie + b1i

Model E:
Level-1: yij = ϕ0i + ϕ1i Tij + εij
Level-2 (intercept): ϕ0i = β00 + β01 AGEie + β02 AGEie² + β03 Mi + β04 Bi + b0i
Level-2 (slope): ϕ1i = β10 + β11 AGEie + β12 Mi + β13 Bi + b1i

Model Q:
Level-1: yij = ϕ0i + ϕ1i Tij + ϕ2i Qij + ϕ3i Qij Tij + εij
Level-2: ϕ0i = β00 + β01 Qie + b0i, ϕ1i = β10 + β11 Qie + b1i, ϕ2i = β20 + β21 Qie + b2i, ϕ3i = β30

Tij: elapsed time (years) between the enrollment and jth face image of subject i; AGEie: age (years) of subject i in her enrollment face image; AGEij: age (years) of subject i in her jth face image; Mi: binary indicator of subject sex (Mi = 1 if male, 0 if female); Bi: binary indicator of subject race (Bi = 1 if black, 0 if white); Qie: quality (e.g., frontalness or interpupillary distance) of the enrollment image of subject i; Qij: quality (e.g., frontalness or interpupillary distance) of the jth query image of subject i.

4.4.1 Model Formulations

Given ni face images of subject i, let AGEij denote the absolute age of the ith individual for the jth face image, where AGEij < AGEik for j = 0, . . . , ni − 2 and k = j + 1, . . . , ni − 1 (i.e., the ni images are ordered by increasing age). To begin with, assume that the youngest image (first acquisition) of each subject is enrolled in the gallery, and let AGEie = AGEi0 denote the age of individual i at enrollment, where AGEie < AGEij for j = 1, . . . , ni − 1. We can compute mi = ni − 1 genuine comparison scores by comparing every other image to the enrollment image. Hence, in this scenario, yij (j = 1, . . . , mi) is the comparison score between the jth face image of individual i and his/her enrollment image. AGEij is the age of the jth query/probe image of subject i, so the elapsed time between enrollment and query image is Tij = AGEij − AGEie. When studying age-related effects on automatic face recognition performance, there are two different, albeit closely related, time-varying covariates which are of primary interest: (i) the elapsed time between image acquisitions and (ii) the absolute ages of the subject in the two face images being compared. Below, we discuss mixed-effects models which include these and other covariates.
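For readers who want to reproduce this style of analysis, the formulations in Table 4.4 map roughly onto lme4 model formulas as shown below. This is a sketch under assumed column names (score, elapsed for Tij, age_q for AGEij, age_e for AGEie, male, black, and subject), not the exact code behind the reported fits; Model Q is omitted because its quality covariates are matcher-specific.

```r
# Sketch: approximate lme4 specifications of the models in Table 4.4 (REML = FALSE
# corresponds to the full maximum likelihood estimation used in this chapter).
library(lme4)

mA  <- lmer(score ~ 1 + (1 | subject), data = d, REML = FALSE)                      # Model A
mBT <- lmer(score ~ elapsed + (elapsed | subject), data = d, REML = FALSE)          # Model BT
mCT <- lmer(score ~ elapsed + age_e + (elapsed | subject), data = d, REML = FALSE)  # Model CT
mCA <- lmer(score ~ age_q + age_e + (age_q | subject), data = d, REML = FALSE)      # Model CA
mD  <- lmer(score ~ elapsed * age_e + I(age_e^2) + (elapsed | subject),
            data = d, REML = FALSE)                                                 # Model D
mE  <- lmer(score ~ elapsed * (age_e + male + black) + I(age_e^2) + (elapsed | subject),
            data = d, REML = FALSE)                                                 # Model E
```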
4.4.1.1 Function of Elapsed Time

The simplest notion of face recognition performance over time is a function of the elapsed time between a subject’s enrollment and query face images, f(Tij). A linear mixed-effects model with two levels (to account for subject-specific trends) and a single covariate for elapsed time can be formulated as follows. At level-1, the comparison score yij between the enrollment and jth query image of subject i can be modeled as a linear function of Tij:

yij = ϕ0i + ϕ1i Tij + εij,  (4.1)

where the ith individual’s intercept, ϕ0i, and slope, ϕ1i, are

ϕ0i = β00 + b0i,
ϕ1i = β10 + b1i.  (4.2)

The level-1 equation in (4.1) models within-subject longitudinal change in yij where a subject’s scores can vary around his/her linear trend by εij (level-1 residual variation). The level-2 model in (4.2) accounts for between-subject variation in comparison scores because each subject’s intercept and slope parameters, ϕ0i and ϕ1i, respectively, are modeled as a combination of fixed and random effects. The fixed effects, β00 and β10, are the grand means of the population intercepts and slopes, respectively, and define the overall population-mean trend, while the random effects, b0i and b1i, are subject-specific deviations from the population-mean parameters. Since each subject can have his/her own intercept and slope parameters, mixed-effects models are flexible in handling/allowing for biometric zoo effects [41, 144] (some subjects generally have higher or lower scores). Fig. 4.5 shows six example subjects from the PCSO LS database at different ages, with their subject-specific trends in genuine scores over time shown in Fig. 4.6(b). The random structure of the above two-level model includes the level-1 residuals, {εij}, as well as the random effects, b0i and b1i, which can be thought of as level-2 residuals. The distributional assumptions of these two error terms are:

εij ∼ N(0, σε²)  (4.3)

and

(b0i, b1i)ᵀ ∼ N( (0, 0)ᵀ, [σ0² σ01; σ10 σ1²] ),  (4.4)

where N(·, ·) denotes a Gaussian distribution. Substituting the level-2 equations for subject-specific intercepts and slopes into the level-1 model in (4.1), the composite form of the two-level mixed-effects model is:

yij = [β00 + b0i] + [β10 + b1i] Tij + εij.  (4.5)

Here, the model terms inside the two brackets in (4.5) correspond to all coefficients for the intercept and slope terms. When the error terms are equal to their assumed means of zero, (4.5) reduces to the population-mean trend of yij = β00 + β10 Tij. The grand mean intercept β00 quantifies the expected marginal mean comparison score when Tij = 0. Note that this intercept is not particularly meaningful, as our data does not contain any same-day comparisons. However, interpretation of β00 does give us some notion of differences in subjects’ comparison scores at a projected baseline of zero years elapsed time. The primary coefficient we are interested in is β10, which quantifies the expected change in mean comparison score per one-year increase in elapsed time since enrollment. Because this model, as well as all others considered in this work, includes random terms for both intercepts and slopes (b0i and b1i), we can also analyze the variation in the population parameters (i.e., differences in the trends of individuals in the population).
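As a sketch of how these quantities could be read off a fitted Model BT (assuming the lme4 object mBT from the listing after Table 4.4, and a hypothetical standardized threshold thr_std that is not specified here):

```r
# Sketch: extracting Model BT's estimates and the implied population-mean trend (4.5).
beta <- fixef(mBT)                    # beta_00 (intercept) and beta_10 (slope per year)
vc   <- as.data.frame(VarCorr(mBT))   # sigma_0^2, sigma_01, sigma_1^2, and sigma_eps^2
b_i  <- ranef(mBT)$subject            # subject-specific deviations b_0i and b_1i

trend <- function(elapsed) beta[1] + beta[2] * elapsed   # population-mean trend
years_to_threshold <- (thr_std - beta[1]) / beta[2]      # where the mean trend crosses thr_std
```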
4.4.1.2 Function of Elapsed Time and Age at Enrollment

If rates of change in comparison scores are steeper or flatter throughout an individual’s lifetime, then face recognition performance may also be a function of absolute age. If we add the age of the enrollment image to (4.5):

yij = [β00 + β01 AGEie + b0i] + [β10 + b1i] Tij + εij.  (4.6)

Because AGEie is a fixed effect for each subject (time-invariant), the above composite model actually has a two-level specification with the same level-1 model in (4.1). Hence, AGEie cannot improve the model fit at level-1 (within-subject); it can only influence the level-2 subject-specific variations. (Comparing all images of a given subject to her fixed enrollment image means that AGEij and Tij are perfectly correlated at level-1 (within-subject) of the model. Hence, we cannot include both of these covariates; the effect of age must be added as a level-2 covariate.) The population-mean trend for (4.6) is:

E(yij) = β00 + β01 AGEie + β10 Tij  (4.7)
       = β00 + β01 AGEie + β10 (AGEij − AGEie).

By definition, Tij is a centered version of AGEij, where the centering term (AGEie) is subject-specific. Hence, the model for aging as a function of elapsed time and age at enrollment, f(Tij, AGEie), is mathematically equivalent to a model for aging as a function of the age of the query image and age at enrollment, f(AGEij, AGEie):

E(yij) = β00 + β01 AGEie + β10 AGEij.  (4.8)

The two models in (4.7) and (4.8) will result in the same estimate for longitudinal change, β10. What distinguishes them is the interpretation of the coefficient β01 quantifying the effect of AGEie. Note the relationship between the two models: β01^(4.8) = β01^(4.7) − β10^(4.7). Hence, β01^(4.8) is the “contextual” effect that models the difference between the within- and between-subject effects of aging [14]. (The equality β01^(4.8) = β01^(4.7) − β10^(4.7) holds for mixed-effects models with random intercepts, and is approximately true for models with both random intercepts and random slopes.) The significance of subject age at enrollment in (4.8) is tested with the null hypothesis of H0: β01 = 0, whereas restricted inference is needed to test significance in (4.7) because the null hypothesis must instead be H0: β01 = β10. The relationship between these two models (CT and CA) is similar to common approaches for decoupling the longitudinal and cross-sectional effects of a time-varying covariate. A time-varying covariate at level-1 (e.g., age or elapsed time) exhibits variability within, but also between, individuals; models which assume that the within- and between-individual effects are equal do not properly estimate either of these effects [12, 14, 44, 95]. Typically, the time-varying covariate is “centered” on subject-specific means, so as to remove between-subject variation at level-1 of the model.

4.4.2 Model Comparison and Evaluation

The goal of statistical modeling is to find a model that includes substantive predictors and excludes unnecessary ones (parsimony). A common approach is to fit increasingly complex models to successively evaluate the impact of adding different covariates [118]. Models can be compared using goodness-of-fit measures based on log-likelihood statistics: deviance, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). Deviance quantifies how much worse the current model is compared to the (hypothetical) saturated model that includes all possible covariates to perfectly fit the data. Because the log-likelihood (LL) of the saturated model is zero,

Deviance = −2[LLcurrent − LLsaturated] = −2 LLcurrent.  (4.9)
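As a hedged sketch (assuming the ML fits mA, mBT, and mCT from the earlier listing), these comparisons are commonly carried out in lme4/R as:

```r
# Sketch: goodness-of-fit comparisons for models fit with full ML.
anova(mA, mBT)   # nested comparison: does elapsed time improve on the means-only model?
anova(mBT, mCT)  # nested comparison: does age at enrollment improve on Model BT?

AIC(mBT); BIC(mBT); -2 * as.numeric(logLik(mBT))   # AIC, BIC, and deviance for one model
```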
Deviance can be used to compare nested models (i.e., the more complex model can be reduced to the simpler model by placing constraints on its parameters) that are fit to the same data. To compare non-nested models, AIC and BIC penalize the log-likelihood based on the complexity of the models (for full ML estimation, the number of parameters includes both the fixed effects and the variance components) and the sample size. Smaller values indicate better fit for all three goodness-of-fit measures. (For AIC and BIC, the magnitude of the reduction in model fit is difficult to interpret.) Further comparisons of models depend on whether the successive model has added a time-invariant (e.g., sex, race) or time-varying (e.g., face image quality) covariate to the baseline model. For both cases, pseudo-R² statistics can be used to measure the proportional reduction in level-2 variance (σ0², σ1²) and level-1 residual variance (σε²) attributable to inclusion of time-invariant and time-varying covariates, respectively.

4.5 Results

We first focus on analysis of the PCSO LS database, starting with simpler models (i.e., Models A and BT) and progressing to more complex models including covariates for subject sex/race and face image quality. We then present results for the LEO LS database. Recall that models are discussed in Section 4.4 and equations are provided in Table 4.4. All models in our analysis are fit with full maximum likelihood (ML) estimation via iterative generalized least-squares (GLS) using the lme4 package (v1.1-9) [11] for R (v3.2.2).

4.5.1 Model Assumptions

While mixed-effects models are capable of handling non-Gaussian response distributions (e.g., COTS-A genuine scores in Fig. 4.8(a)), the error terms must follow a Gaussian distribution.

Figure 4.8 Distributions of standardized genuine comparison scores from the two longitudinal face databases used in this study: (a) COTS-A on PCSO LS and (b) COTS-B on LEO LS. There are a total of 129,773 and 26,216 genuine scores in (a) and (b), respectively.

Fig. 4.9(a) shows normal probability plots of the level-1 residuals, εij, from fitting Model BT to genuine scores from the PCSO LS database. Since significant departure from linearity is observed at the tails, we cannot verify that the model assumptions hold; normal probability plots of random effects, b0i and b1i, also depart from linearity (Figs. 4.9(b), 4.9(c)). This behavior was observed for other models as well, precluding the use of standard errors for formal hypothesis tests of parameters [134]. When parametric model assumptions are violated, it is common to resort to nonparametric bootstrap to establish confidence intervals for the parameter estimates, as followed in Yoon and Jain [149]. Hence, for the PCSO LS database, we conduct a nonparametric bootstrap by case resampling [134]; 1,000 bootstrap replicates are generated by sampling 18,007 subjects with replacement. Multilevel models are fit to each bootstrap replicate, and the mean parameter estimates over all 1,000 bootstraps are reported.
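A minimal sketch of this case-resampling bootstrap, assuming the data frame d and column names used in the earlier listings (1,000 replicates over all 18,007 subjects were used for the reported results; 100 replicates are shown here only to keep the sketch short):

```r
# Sketch: nonparametric bootstrap by resampling whole subjects with replacement,
# refitting Model BT on each replicate, and summarizing the fixed-effects estimates.
set.seed(1)
ids <- unique(d$subject)

boot_fixef <- replicate(100, {
  sampled <- sample(ids, length(ids), replace = TRUE)
  # A subject drawn k times must appear as k distinct "subjects" in the replicate.
  d_boot <- do.call(rbind, lapply(seq_along(sampled), function(k) {
    di <- d[d$subject == sampled[k], ]
    di$subject <- k
    di
  }))
  fixef(lmer(score ~ elapsed + (elapsed | subject), data = d_boot, REML = FALSE))
})

rowMeans(boot_fixef)                              # mean bootstrap estimates
apply(boot_fixef, 1, quantile, c(0.025, 0.975))   # 95% bootstrap confidence intervals
```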
Tests for fixed effects parameters can be conducted by examining the bootstrap confidence intervals. (The null hypothesis of the parameter being equal to 0 can be rejected at a significance level of 0.05 if the 95% confidence interval does not contain 0.) Table 4.5 gives the bootstrap parameter estimates (with 95% confidence intervals), variance components, and goodness-of-fit for the models in Table 4.4.

Figure 4.9 Normal probability plots of ((a) and (d)) level-1 residuals, εij, and level-2 random effects for ((b) and (e)) intercepts, b0i, and ((c) and (f)) slopes, b1i, from Model BT on the PCSO LS and LEO LS databases (top and bottom rows, respectively). Departure from normality at the tails of the distributions is likely due to low quality face images or errors in subject IDs.

Table 4.5 Bootstrap results for mixed-effects models on the PCSO LS database and COTS-A genuine scores.

Fixed Effects (95% confidence intervals):
 | Model A | Model BT | Model CT | Model D
Intercept β00 | 0.0274 (0.0171, 0.0376) | 0.6734 (0.6624, 0.6849) | 0.7226 (0.6905, 0.7556) | 0.5158 (0.4073, 0.6239)
Time β10 | | −0.1364 (−0.1379, −0.1349) | −0.1364 (−0.1379, −0.1349) | −0.1372 (−0.1426, −0.1316)
Age Group β01 | | | −0.0016 (−0.0027, −0.0006) | 0.0120 (0.0047, 0.0189)
Age Group × Time β11 | | | | 0.0000 (−0.0002, 0.0002)
Age Group² β02 | | | | −0.0002 (−0.0003, −0.0001)

Variance Components (a):
Level-1 Residual σε² | 0.6076 | 0.3912 | 0.3912 | 0.3912
Random Intercepts σ0² | 0.3841 | 0.3243 | 0.3239 | 0.3231
Random Slopes σ1² | | 0.0028 | 0.0028 | 0.0028
Covariance σ01 | | −0.0039 | −0.0039 | −0.0038

Goodness-of-Fit (b):
AIC | 333433 | 287016 | 287006 | 286985
BIC | 333462 | 287074 | 287075 | 287073
Deviance | 333427 | 287004 | 286992 | 286967

(a) Confidence intervals for variance components have been omitted due to space limitations.
(b) Goodness-of-fit values are the mean values of the 1,000 bootstrap samples.

4.5.2 Unconditional Means Model (Model A)

The simplest mixed-effects model is the unconditional means model, which partitions the total variation in comparison scores by subject. Denoted Model A in Table 4.4, and with composite form yij = β00 + b0i + εij, b0i is the subject-specific mean and β00 is the grand mean. Similar to analysis of variance (ANOVA), Model A provides initial estimates of the within-subject variance σε² (i.e., deviations around each subject’s own mean comparison score) and the between-subject variance σ0² (i.e., deviations of subject-specific means around the grand mean). The intraclass correlation coefficient (ICC) quantifies the proportion of between-subject variation in the response, ρ = σ0²/(σ0² + σε²). Variance components for Model A shown in Table 4.5 indicate that between-subject differences in genuine scores (i.e., biometric zoo) account for 38.7% (ρ = 0.3873) of the total variation in genuine scores from the PCSO LS database. Baseline goodness-of-fit measures are also shown in Table 4.5.

4.5.3 Unconditional Growth Model (Model BT)

The next model to consider in longitudinal analysis is the unconditional growth model that includes the time-related covariate. In our case, we add elapsed time, Tij, as well as random effects for slopes, b1i, to Model A, resulting in Model BT. Table 4.5 shows that Model BT estimates that PCSO LS genuine scores decrease by 0.1364 standard deviations per one-year increase in elapsed time (see solid black line in Fig. 4.10).
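The ICC defined in Section 4.5.2, and the pseudo-R² used in the next paragraph, are simple ratios of estimated variance components; a sketch assuming the fitted objects mA and mBT from the earlier listings:

```r
# Sketch: ICC from Model A and the pseudo-R^2 comparing level-1 residual variances
# of Models A and BT.
vA  <- as.data.frame(VarCorr(mA))
vBT <- as.data.frame(VarCorr(mBT))

sigma2_0_A   <- vA$vcov[vA$grp == "subject"][1]   # between-subject variance (intercepts)
sigma2_eps_A <- vA$vcov[vA$grp == "Residual"]     # within-subject (residual) variance
icc <- sigma2_0_A / (sigma2_0_A + sigma2_eps_A)

sigma2_eps_BT <- vBT$vcov[vBT$grp == "Residual"]
pseudo_R2 <- (sigma2_eps_A - sigma2_eps_BT) / sigma2_eps_A
```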
Comparing the level-1 residual variation of Models A and BT, elapsed time explains 35.6% of the variation in a given subject’s genuine scores around his/her own average genuine score (using pseudo-R² = (σε²(A) − σε²(BT))/σε²(A)). Longitudinal change estimated by Model BT implies that the population-mean trend will drop below the thresholds for 0.01% and 0.1% FAR after 19.1 and 24.0 years elapsed time, respectively, but this only provides insight into performance on subjects in the population with average (or higher) genuine scores over time. A reliable face recognition system must be able to recognize much more than just 50% of the population it encounters, so we are also interested in the spread of the population around the population-mean trend. Do all subjects closely follow the population-mean trend, or is there large variability between subjects? Do biometric zoo effects extend to rates of change over time? Using the estimated variance components for slopes and intercepts (σ0², σ1², and σ01), we compute a 2D confidence ellipse (random effects are assumed to be 2D Gaussian distributed) to define a region that contains, for example, 95% of the estimated subject-specific parameters. In order to translate from the 2D space of intercepts and slopes to obtain a confidence region for genuine scores versus elapsed time, we sample 100 combinations of intercept and slope parameters along the contour of the confidence ellipse, compute the predicted genuine scores for each of the 100 trends, and define the confidence region as between the minimum and maximum predicted scores for different values of elapsed time. Results are shown in Fig. 4.10.

Figure 4.10 Results from Model BT on COTS-A genuine scores from the PCSO LS database. The bootstrap-estimated population-mean trend is shown in black (bootstrap confidence intervals are too small to be visible). The blue and green bands plot regions of 95% and 99% confidence, respectively, for subject-specific variations around the population-mean trend. Grey dotted lines additionally add one standard deviation of estimated residual variation, σε. Hence, Model BT estimates that 95% and 99% of the subject trends fall within the blue and green bands, but scores can vary around their trends, extending to the grey dotted lines. Thresholds at 0.01% and 0.1% FAR for COTS-A are shown as dashed red lines.

From the confidence bands of subject variations in Fig. 4.10, we infer that genuine scores for 99% of the population will remain above the threshold at 0.01% FAR for up to approximately 5.5 years elapsed time, which reduces to 95% of the population after 7 years (i.e., false reject errors would occur, on average, for 5% of subjects after 7 years since enrollment). Similarly, at a higher FAR of 0.1%, 99% of subjects can be recognized up to 8.5 years elapsed time, which reduces to 95% after 10.5 years. Fig. 4.11 shows face images from six example outlier subjects whose estimated trends lie outside the 99% region of confidence due to extreme intercepts and/or slopes; subjects significantly deviate from the population spread due to alignment errors, face quality issues (illumination, facial occlusion), and changes to facial hair, for example.

Figure 4.11 Example outlier subjects, i.e., subjects whose subject-specific trends, estimated by Model BT, significantly deviate from the spread of the population in the PCSO LS database. All images were aligned using COTS-A eye locations.

4.5.4 Age at Enrollment (Models CT and D)

We next investigate whether the population-mean trends in genuine scores over time depend on a subject’s absolute age (i.e., whether variation in subject-specific trends observed in Model BT can be explained by differences in subject age). The significance of the AGEie
term in Model CT suggests a negative linear relationship between age at enrollment and genuine scores, but the magnitude of β01 is relatively small. To further test the complexity of the effects of age at enrollment, we add additional terms associated with AGEie, resulting in Model D (see Table 4.4). The hypotheses of interest are 1) older subjects are easier to recognize than younger subjects, and 2) younger subjects age at faster rates than older subjects. These two hypotheses manifest in younger subjects having lower genuine scores, on average, and steeper negative rates of change. Table 4.5 shows that the interaction term AGEie × Tij in Model D is not significantly different from zero because the 95% confidence interval for β11 contains zero; hence, we cannot conclude that subject enrollment age has a linear effect on rates of change in COTS-A genuine scores. The statistically significant β02 coefficient indicates a quadratic relationship between subject enrollment age and intercepts, and goodness-of-fit measures are lower compared to Model BT. However, further comparing to Model BT, level-2 variation in random effects for intercepts (σ0²) is only reduced by 0.4% after including AGEie terms. The differences between scores for different ages at enrollment are marginal compared to the change in scores due to elapsed time; the change in score between a 20 year-old and a 30 or 50 year-old (at enrollment) is equivalent to only 7 and 5 months of elapsed time (within-subject longitudinal change), respectively.

4.5.5 Sex and Race (Model E)

Model E in Table 4.4 is used to test the effects of subject sex and race. First, we observed that Model E results in better model fit than Model D (deviance for Model E is 285,712 compared to 286,967 for Model D). The main effect of subject sex is statistically non-zero at a significance level of 0.05, but the main effect of subject race is not (the 95% bootstrap confidence interval contains 0). Male genuine scores at baseline (Tij = 0 years) are 0.3987 standard deviations higher than female scores. Significant interactions with elapsed time indicate that rates of change in genuine scores depend on both sex and race; population-mean slopes are
−0.0113 and −0.0267 standard deviations steeper for males and black subjects, respectively. Population-mean trends separated by subject demographics are shown in Fig. 4.12 for different ages at enrollment.

Figure 4.12 Model E fit to COTS-A genuine scores from the PCSO LS database. Population-mean trends are plotted by subject demographics of sex and race. Each trend line represents seven years of elapsed time since enrollment at five different ages (20–60 years old). For example, the solid blue line beginning at AGEij = 20 years represents the average decrease in genuine scores for white males enrolled at age 20 with query images until age 27.

While male genuine scores decrease at slightly faster rates than female scores, males are clearly easier to recognize with higher genuine scores overall. Fig. 4.12 also shows that the differences between subject race are minor compared to differences between males and females.

4.5.6 Face Image Quality (Model Q)

Adding level-2 covariates (i.e., time-invariant values for each subject, such as AGEie) cannot improve the fit of the model at level-1 (within-subject). Table 4.5 shows that the level-1 residual variation σε² (i.e., deviation of scores around each subject’s own linear trend) is quite large when time is the only level-1 covariate for all models considered thus far. One standard deviation of level-1 residual variation estimated by Model BT (and similarly Models CT and D) is equivalent to 4.6 years of elapsed time (calculated as √σε²/|β10| = √0.3912/0.1372). This is visually shown by the dotted grey lines in Fig. 4.10. Level-1 residual variation can only be reduced by level-1 time-varying covariates (i.e., image-specific); in this section we investigate whether face image quality measures can be used to improve the model fit. The quality measures considered are interpupillary distance (IPD) and a “frontal” score, both of which are output by COTS-A. While higher frontalness indicates better quality, the range of the frontal score has little meaning, since its computation is proprietary. We standardize (z-score) the frontalness score so we can interpret model parameters as standard deviations from the mean of the frontalness scores from all images in PCSO LS. After finding that neither of the quality measures alone explains variation in genuine scores as well as Model BT with only elapsed time as covariate (details are omitted due to space limitations), we then added the quality measures to Model BT, resulting in Model Q in Table 4.4. Table 4.6 gives estimated level-1 residual variation and goodness-of-fit for models with frontalness, IPD, and both frontalness and IPD (Models QF, QI, and QFI, respectively). Model QF has a better overall fit than Model QI. Table 4.7 gives the elapsed times for when population-mean scores cross thresholds at 0.001% and 0.01% FAR for different values of frontalness and IPD. Note how changing frontalness has a greater impact on when population-mean genuine scores cross the thresholds than changes in IPD. Model QFI with both measures of quality further reduces both the level-1 residual variation and the goodness-of-fit values. The values of 100 and 120 pixels for IPD in Table 4.7 were chosen because we observed systematic changes in IPDs over time (see Fig. 4.13); in particular, mean IPD varies around 100 pixels from 1994–2002 but increases to a consistent ∼120 pixels starting in 2003. This observation, along with correspondence with the Pinellas County Sheriff’s Office, suggests that booking agencies began to adhere to imaging standards around this time.
To investigate whether this aspect of the data confounds the estimation of longitudinal effects (face images in later years may be of higher quality), we also tested for a difference in slope prior to 2003 versus after 2003 by using a piecewise linear formulation for the mixed-effects model (with a breakpoint at 2003). We found that the slope after 2003 was significantly flatter (less negative). Additional face quality factors known to cause changes in face recognition performance are illumination, expression, and occlusions. However, there are no widely accepted methods for quantifying such variations in face images and doing so is beyond the scope of this work.

Figure 4.13 A boxplot of interpupillary distances (IPDs) versus year of acquisition shows that mean IPDs systematically changed over time for the PCSO LS database, likely due to booking stations adhering to face imaging standards only in more recent years.

Table 4.6 Bootstrap results for mixed-effects models with elapsed time and face quality covariates for the PCSO LS database and COTS-A genuine scores.

 | Model QF | Model QI | Model QFI
σε² | 0.3302 | 0.3539 | 0.3218
AIC | 275108 | 281296 | 273643
BIC | 275283 | 281471 | 273848
Deviance | 275072 | 281260 | 273601

Table 4.7 Elapsed times (in years) for when population-mean trends in genuine scores drop below the decision thresholds at 0.001% and 0.01% FAR for different measures related to face quality (frontalness and IPD) of the enrollment image Qie and the query image Qij.

Quality | Qie | Qij | 0.001% FAR | 0.01% FAR
Frontal | −1σ | −1σ | 10.9 | 15.6
Frontal | µ | µ | 13.0 | 18.4
Frontal | 1σ | 1σ | 16.8 | 23.0
IPD | 100 pixels | 100 pixels | 13.8 | 19.4
IPD | 100 pixels | 120 pixels | 14.0 | 20.0
IPD | 120 pixels | 120 pixels | 13.0 | 18.4

4.5.7 LEO LS Database

Table 4.8 gives results for the models in Table 4.4 fit to COTS-B genuine scores from the LEO LS database. Fixed-effects parameter estimates are given with standard errors; bootstrapping was not conducted for LEO LS models because the error terms better follow Gaussian distributions (see Fig. 4.9).

Table 4.8 Mixed-effects model results for the LEO LS database and COTS-B genuine scores.

Fixed Effects (standard errors):
 | Model A | Model BT | Model CT | Model D
(Intercept) β00 | 0.0037 (0.0098) | 0.5395 (0.0127) | 0.5468 (0.0325) | 0.0894 (0.1057)
Time β10 | | −0.1699 (0.0023) | −0.1699 (0.0023) | −0.1980 (0.0076)
Age Group β01 | | | −0.0003 (0.0011) | 0.0346 (0.0068)
Age Group × Time β11 | | | | 0.0010 (0.0003)
Age Group² β02 | | | | −0.0006 (0.0001)

Variance Components:
Level-1 Residual σε² | 0.5985 | 0.4276 | 0.4276 | 0.4275
Intercepts σ0² | 0.4009 | 0.5543 | 0.5542 | 0.5516
Slopes σ1² | | 0.0059 | 0.0058 | 0.0058
Covariance σ01 | | −0.0317 | −0.0317 | −0.0316

Goodness-of-Fit:
AIC | 68705 | 62647 | 62649 | 62606
BIC | 68730 | 62697 | 62707 | 62679
Deviance | 68699 | 62635 | 62635 | 62588

Model results are summarized as follows. Model A estimates that 40% of the total variation in genuine scores is due to between-subject differences. The longitudinal change in genuine scores estimated by both Model BT and Model CT indicates that a one-year increase in elapsed time decreases genuine scores by β10 = −0.1699 standard deviations. From the confidence bands of subject variations in Fig. 4.14 (estimated by Model BT), we infer that genuine scores for 99% of the population will remain above the threshold at 0.01% FAR for up to approximately 6.5 years elapsed time, which reduces to 95% of the population after 8.5 years (i.e., false reject errors would occur, on average, for 5% of subjects after 8.5 years since enrollment).
Similarly, at a higher FAR of 0.1%, 99% of subjects can be recognized up to 8.0 years, which reduces to 95% after 9.5 years elapsed time. Although the between-subject effect of age at enrollment (β01 ) is significantly different from β10 in Model CT, the effect is not significantly different from zero, indicating that there is no linear relationship between subject enrollment age and average genuine scores. However, additional terms involving AGEie result in significant effects of enrollment age in Model D. The significant β02 coefficient indicates a downward quadratic relationship between age at enrollment and average genuine scores (similar to COTS-A on PCSO LS). Furthermore, the significant interaction term AGEie × Tij indicates that longitudinal change in scores tends to vary with subject’s age at enrollment; a 10-year increase in subject age results in a longitudinal slope that is β11 = −0.0098 standard deviations steeper. Population-mean rates of change range from −0.1784 to −0.1490 standard deviations per year for subjects with age at enrollment of 20 to 50 years (calculated as β10 + β11 AGEie ). Recall that age at enrollment had no effect on rates of change for COTS-A on PCSO LS. Model E results indicate that intercepts are 0.0565 and 0.4238 standard deviations higher for black and male subjects, respectively (so, black-male subjects have intercepts that are 0.4803 standard deviations higher than white-female subjects). Slopes are not statistically different for black and white subjects, but the population-mean slope for males is steeper 147 (i.e., more negative) than for females. These population-mean trends are shown in Fig. 4.15 for different ages at enrollment. Fig. 4.15 also shows that the differences between subject race are minor compared to differences between males and females, as was also the case for COTS-A on the PCSO LS database. 4.6 Conclusions We presented a longitudinal study of automatic face recognition, utilizing two large operational databases of mugshots, PCSO LS (147, 784 images of 18, 007 subjects, avg. 8 images per subject over avg. 8.5 years) and LEO LS (31, 852 images of 5, 636 subjects, avg. 6 images per subject over avg. 5.8 years), where each subject has at least four face images acquired over at least a five-year time span. Linear mixed-effects regression models were used to analyze variation in genuine scores due to elapsed time, age, sex, and race, as well as subject-specific differences in scores (i.e., biometric zoo effects). Face similarity scores were obtained from state-of-the-art COTS matchers for both the PCSO LS and LEO LS databases. Based on our analysis, we make the following observations (statements apply to both databases and matchers): ✦ Population-mean trends indicate that genuine scores significantly decrease with increasing elapsed time between enrollment (gallery) and query (probe) images, as expected. However, population-mean trends (average genuine scores) do not fall below thresholds at 0.01% FAR until after 15 years elapsed time. This suggests that in a practical application, an average individual’s genuine scores decrease at a rate that will not affect the recognition accuracy at 0.01% FAR until more than 15 years since enrollment. ✦ Significant subject-specific variability around the population-mean trends is observed; genuine scores for some subjects decline at much faster rates than the population-mean. 
Analysis of the estimated variance in subject-specific parameters (intercepts and slopes) allowed for estimation of subject-based accuracies (i.e., how many subjects are estimated to be falsely rejected, rather than standard image-based accuracy calculations). For example, the models estimate that genuine scores for 99% of the population will remain above the threshold at 0.01% FAR until 5.5 years elapsed time for PCSO LS and 6.5 years for LEO LS. Other calculations (e.g., 95% of the population) are also within approximately one year for both databases.

Figure 4.14 Results from Model BT on COTS-B genuine scores from the LEO LS database. The population-mean trend is shown in black. The blue and green bands plot regions of 95% and 99% confidence, respectively, for subject-specific variations around the population-mean trend. Grey dotted lines additionally add one standard deviation of estimated residual variation, σε. Hence, Model BT estimates that 95% and 99% of the subject trends fall within the blue and green bands, but scores can vary around their trends, extending to the grey dotted lines. Thresholds at 0.01% and 0.1% FAR for COTS-B are shown as dashed red lines.

Figure 4.15 Model E for COTS-B genuine scores from the LEO LS database. Population-mean trends are plotted by subject demographics of sex and race, in addition to five different ages at enrollment (20 to 60 years). Each trend line represents seven years of elapsed time since enrollment. For example, the solid blue line beginning at AGEij = 20 years represents the average decrease in genuine scores for white males enrolled at age 20 with query images until age 27.

✦ Subject-specific variance in rates of change (i.e., linear slopes) is only marginally attributable to subject age at enrollment, sex, and race. Subject sex was the most significant factor for between-subject differences in genuine scores, with males having significantly higher genuine scores than females. The magnitude of the difference suggests that false reject errors may occur approximately two years earlier for females than for males (assuming that a global threshold is used operationally).

✦ While the model fit improved for more complex models incorporating simple measures of face quality (for the PCSO LS database), the models are still limited for prediction purposes. The within-subject variability (i.e., level-1 residual variance) is still quite large. All models considered in this study indicate that one standard deviation in genuine scores due to short-term variations (e.g., illumination, hairstyle, etc.) is approximately equivalent to the change in genuine scores due to ±4 years of elapsed time (for these particular databases and matchers).

Longitudinal analysis, in general, is an important, yet very difficult, problem. To the best of our knowledge, no proper statistical analysis has yet been conducted for studying face recognition performance on a large population over periods of time longer than five years. In this work, we attempted to analyze the covariates of interest that were available to us (elapsed time, age, sex, race, some measures of quality), but there are additional covariates that cannot be accounted for because we do not have the information (e.g., camera characteristics, IPD for the LEO LS database, expression variations, etc.).
Despite this, the longitudinal study on automatic face recognition presented here utilizes two of the largest, deepest, and longest (in terms of number of subjects, number of images per subject, and time spans of subject images, respectively) face image databases studied to date, and the COTS matchers are representative of the current state-of-the-art. Given that the performance of face recognition systems continues to improve, longitudinal analysis should be conducted periodically to reevaluate robustness to facial aging (and other covariates).

Chapter 5 Summary and Future Work

This thesis has addressed some of the important challenges associated with automatic face recognition. The primary contributions involve the role of quality covariates present in unconstrained face images and the effect of facial aging on face recognition performance.

5.1 Contributions

In Chapter 2, we studied operational scenarios for recognition of unconstrained face media. The contributions include:

• A framework for matching a collection of face media (i.e., images, videos, 3D models, demographics) was provided for scenarios where multiple instances of a subject’s face are available (e.g., to identify a person of interest). This is particularly of value to forensic investigations, as matching the collection of face media outputs a single candidate list for a human operator to review, rather than multiple candidate lists (one for each of the face samples available on the person of interest). This work is one of the first baselines provided for “template-based” matching, which is rapidly gaining interest (e.g., the NIST IJB-A protocol [72]).

Table 5.1 Published works which have reported results using the experimental protocols introduced in Chapter 2 for the LFW database [62] (single-image matching). COTS results were reported in Chapter 2.

Method | Rank-1 Accuracy (%) | DIR (%) @ 1% FAR
COTS-A | 56.7 | 27.0
COTS-A (s1+s4) | 66.5 | 36.0
DeepFace [128] | 64.9 | 44.5
WST Fusion [129] | 82.5 | 61.9
DeepID2+ [123] | 95.0 | 80.7
DeepID3+ [120] | 96.0 | 81.4

• Evaluation protocols introduced in Chapter 2 were publicly released (available at http://biometrics.cse.msu.edu/pub/databases.html) for closed-set and open-set identification of unconstrained face images and videos in the LFW [62] and YTF [141] databases. While our work focused on matching a collection of unconstrained face media, we also reported baseline results for single-image matching. At the time of release, identification protocols for unconstrained face images were lacking in the research community, as efforts were focused on maximizing performance on the LFW verification protocol [62] (which has some limitations; see Chapter 1). Table 5.1 shows that our evaluation protocols introduced in Chapter 2 have since been adopted by other published works for comparisons and have encouraged competition, particularly for the more challenging open-set identification problem. (The BLUFR protocol [85] was released around the same time and is also a valuable benchmark for unconstrained face recognition algorithms.)

Chapter 3 focused on the important and challenging problem of automatic face image quality. This chapter offers the following contributions:

• The first study on human assessments of unconstrained face image quality. To the best of our knowledge, there has been no other work on human assessment of face quality since preliminary studies on mugshot quality by Adler et al. [2] and Hsu et al. [60]
Relative pairwise comparisons of face image quality (i.e., “Which face image has better quality?”) were collected via crowdsourcing on Amazon Mechanical Turk3 . With a relatively small number of pairwise responses per “worker” (<1,000 pairs), a matrix completion approach [148] was utilized to obtain face quality ratings from each worker for all 13,233 images in the LFW database. The resulting human quality ratings were shown to be correlated with automatic face recognition performance. • An automatic method was proposed to predict either (i) human face quality rating or (ii) similarity score-based face quality value. The proposed method uses image features extracted prior to matching and does not require any comparisons to reference high quality images. • Evaluation of the proposed automatic face image quality measure showed efficiency in reducing false non-match errors by removing low quality face images from a database (i.e., operational reject option). • Visual inspections of face images rank-ordered by the predicted quality values demonstrated the effectiveness of the approach in separating high quality face images (e.g., frontal, uniform illumination, no occlusion) from low quality face images (e.g., out-of-plane rotation, low resolution, occluded facial regions). Lastly, the contributions of the longitudinal study on automatic face recognition in Chapter 4 are summarized as follows: • First large-scale statistical analysis of the longitudinal effects of facial aging on the performance of automatic face recognition. The study involved two operational mugshot databases consisting of (i) 147,784 images of 18,007 subjects and (ii) 31,852 images of 5,636 subjects with a minimum of 4 mugshots per subject collected over an average of 8.5 and 5.8 years for the two databases, respectively. 3 https://www.mturk.com/mturk/ 154 • Mixed-effects regression models were used to analyze trends in genuine scores over time (i.e., as subjects age) and quantify the subject-specific variability in the longitudinal trends of a large population of subjects. As such, estimates were provided for how many years of aging are tolerated by commercial face matchers before recognition errors are attributable to be expected. For example, we showed that a state-of-the-art face matcher operating at a threshold of 0.1% FAR can recognize 95% of the population until 10.5 years elapsed time between enrollment and query face images. • Demographics (age, gender, race) and face image quality were shown to only marginally affect the longitudinal trends in genuine scores. • A methodology for the longitudinal evaluation of face recognition performance was detailed which will ideally be conducted periodically to reevaluate state-of-the-art systems as robustness to facial aging continues to evolve. 5.2 Future Work In conducting the studies on face recognition included in this dissertation, a number of areas for future work have been realized. This section concludes the dissertation by suggesting extensions to the work presented in the previous chapters that can be explored by researchers in automatic face recognition. Template-based matching is still an open research problem, indicated by the recognition accuracies reported in Chapter 2 for matching collections of face media, as well as the current leaderboard4 accuracies for the IJB-A face challenge [72]. Score-level fusion of all face samples in the collection, as was explored in Chapter 2, is not computationally efficient, especially for 1:N matching scenarios. 
Face representations which can extract information from multiple face samples to result in a single template are preferable. This template-totemplate matching reduces comparisons to the same complexity as image-to-image matching 4 https://www.nist.gov/programs-projects/face-challenges 155 while still leveraging the multiple face samples of a subject. Research in automatic face image quality assessment is still in its infancy. While a very challenging problem due to the large facial variations that are possible, particularly in unconstrained scenarios, face image quality has many important operational applications. The work presented in Chapter 3 suggests the following next steps for face image quality. • Face quality may need to be distinguished as three scenarios: (i) determining face vs. non-face (flagging face detection failures), (ii) assessment of the accuracy of face alignment, and (iii) given an aligned face image, now what is the quality? These three modules of a face image quality algorithm may allow for the integration of face matcher-dependent properties (e.g., IPD, alignment errors) with more generalizable face image quality measures. • A hierarchical prediction approach may improve the prediction accuracy. For example, face quality of an image could first be classified as low, medium, or high (where the bins are defined to be highly correlated with recognition performance), followed by regression within each bin for a fine-tuned ranking (useful for visual purposes and other ranking applications). • The current image features extracted from a deep convNet [136] show promising results for face image quality. However, the deep convNet in [136] was trained for face recognition purposes, so the representation should ideally be robust to face quality factors. It would be desirable to retrain a deep convNet for prediction of face image quality, rather than identity. • More extensive evaluation of face image quality measures in the context of face recognition performance are needed. A methodical evaluation of the pairwise quality factor may offer new insights. Persistence (or permanence) of a biometric trait is one of two fundamental premises of biometrics (uniqueness being the other) [82]. Our systematic longitudinal study in Chapter 4 156 offered significant insights about the persistence property of automatic face recognition systems. The following related avenues of research could be pursued in future. (i) Development of a single face quality measure for mugshot images would be beneficial for longitudinal study. Incorporating individual face quality factors (e.g., IPD and pose) into the mixedeffects regression model quickly increases the complexity and interpretation of the results. (ii) Longitudinal analysis could be conducted on different face cropping (particularly, precropped images to exclude most of the hair region) to investigate the impact of changing hairstyle over time. (iii) The longitudinal capabilities of face recognition for children (0–18 years old) is still relatively unknown. Operational mugshot databases do not contain this population, and longitudinal face images of young children are difficult to obtain. Recognition of child face images is an important application for law enforcement agencies seeking to analyze digital media containing faces of exploited children (see the Child Exploitation Image Analysis (CHEXIA) face challenge5 ). 
Research in automatic face image quality assessment is still in its infancy. Although it is a very challenging problem due to the large facial variations that are possible, particularly in unconstrained scenarios, face image quality has many important operational applications. The work presented in Chapter 3 suggests the following next steps for face image quality.
• Face quality may need to be distinguished into three scenarios: (i) determining face vs. non-face (flagging face detection failures), (ii) assessing the accuracy of face alignment, and (iii) given an aligned face image, assessing its quality. These three modules of a face image quality algorithm may allow for the integration of face matcher-dependent properties (e.g., IPD and alignment errors) with more generalizable face image quality measures.
• A hierarchical prediction approach may improve prediction accuracy. For example, the face quality of an image could first be classified as low, medium, or high (where the bins are defined to be highly correlated with recognition performance), followed by regression within each bin for a fine-grained ranking (useful for visualization and other ranking applications); a minimal sketch of this idea follows this list.
• The image features currently extracted from a deep ConvNet [136] show promising results for face image quality. However, the deep ConvNet in [136] was trained for face recognition, so its representation should ideally be robust to face quality factors. It would be desirable to retrain a deep ConvNet to predict face image quality, rather than identity.
• More extensive evaluation of face image quality measures in the context of face recognition performance is needed. A methodical evaluation of the pairwise quality factor may offer new insights.
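As a minimal sketch of the hierarchical idea noted above (hypothetical features, quality targets, and bin thresholds; not an implementation from Chapter 3), the snippet below trains a coarse low/medium/high classifier followed by a separate regressor within each bin.

    # Sketch of hierarchical quality prediction: first classify a face image
    # into a coarse quality bin, then regress within that bin for a
    # fine-grained ranking. Features and target quality values are assumed to
    # be precomputed (e.g., deep ConvNet features and score-based targets).
    import numpy as np
    from sklearn.svm import SVC, SVR

    def train_hierarchical(features, quality, bin_edges=(0.33, 0.66)):
        """Train a coarse bin classifier plus one regressor per quality bin."""
        bins = np.digitize(quality, bin_edges)  # 0 = low, 1 = medium, 2 = high
        classifier = SVC().fit(features, bins)
        regressors = {
            b: SVR().fit(features[bins == b], quality[bins == b])
            for b in np.unique(bins)
        }
        return classifier, regressors

    def predict_hierarchical(features, classifier, regressors):
        """Route each image to the regressor of its predicted bin."""
        bins = classifier.predict(features)
        preds = np.empty(len(features))
        for b in np.unique(bins):
            idx = bins == b
            preds[idx] = regressors[b].predict(features[idx])
        return bins, preds

    # Hypothetical usage with random 320-D features and quality targets in [0, 1]:
    X, y = np.random.rand(600, 320), np.random.rand(600)
    clf, regs = train_hierarchical(X[:500], y[:500])
    coarse_bins, fine_scores = predict_hierarchical(X[500:], clf, regs)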
Persistence (or permanence) of a biometric trait is one of the two fundamental premises of biometrics (uniqueness being the other) [82]. Our systematic longitudinal study in Chapter 4 offered significant insights into the persistence property of automatic face recognition systems. The following related avenues of research could be pursued in the future.
(i) Development of a single face quality measure for mugshot images would be beneficial for longitudinal study. Incorporating individual face quality factors (e.g., IPD and pose) into the mixed-effects regression model quickly increases the complexity of the model and complicates interpretation of the results.
(ii) Longitudinal analysis could be conducted on different face croppings (in particular, images pre-cropped to exclude most of the hair region) to investigate the impact of changing hairstyles over time.
(iii) The longitudinal capabilities of face recognition for children (0–18 years old) are still relatively unknown. Operational mugshot databases do not contain this population, and longitudinal face images of young children are difficult to obtain. Recognition of child face images is an important application for law enforcement agencies seeking to analyze digital media containing faces of exploited children (see the Child Exploitation Image Analysis (CHEXIA) face challenge, https://www.nist.gov/programs-projects/chexia-face-recognition).
(iv) The stability of impostor scores should be investigated, as recognition errors can also manifest as increased impostor similarity scores. A longitudinal study of impostor scores over time will help to quantitatively address questions related to the second fundamental premise of uniqueness, such as: Does the probability of false acceptance depend on the ages of the two subjects in question? The hypothesis is that younger individuals are more likely to falsely match to other younger individuals because distinctive characteristics such as wrinkles and spots have not yet formed. Mixed-effects regression models applied to impostor scores may additionally be useful for locating duplicate identities in a large operational database of subjects.
Lastly, the methodology detailed in Chapter 4 can and should be used to periodically reevaluate the longitudinal robustness of state-of-the-art face recognition systems.

Bibliography

[1] A. Abaza, M. A. Harrison, T. Bourlai, and A. Ross. Design and evaluation of photometric image quality measures for effective face recognition. IET Biometrics, 3(4):314–324, Dec. 2014.
[2] A. Adler and T. Dembinsky. Human vs. automatic measurement of biometric sample quality. In Canadian Conf. on Electrical and Computer Engineering (CCECE), 2006.
[3] G. Aggarwal, S. Biswas, P. J. Flynn, and K. W. Bowyer. Predicting performance of face recognition systems: An image characterization approach. In Proc. CVPR Workshops, 2011.
[4] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 28(12):2037–2041, Dec. 2006.
[5] F. Alonso-Fernandez, J. Fierrez, and J. Ortega-Garcia. Quality measures in biometric systems. IEEE Security & Privacy, 10(6):52–62, Nov. 2012.
[6] O. Arandjelovic and R. Cipolla. A manifold approach to face recognition from low quality video across illumination and pose using implicit super-resolution. In Proc. ICCV, 2007.
[7] S. R. Arashloo and J. Kittler. Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarised statistical image features. IEEE Trans. Information Forensics and Security (TIFS), 9:2100–2109, Dec. 2014.
[8] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, and M. Rohith. Fully automatic pose-invariant face recognition via 3D pose normalization. In Proc. ICCV, 2011.
[9] M. Ballantyne, R. S. Boyer, and L. Hines. Woody Bledsoe: His life and legacy. AI Magazine, 17(1):7–20, Spring 1996.
[10] J. H. Barr, K. W. Bowyer, P. J. Flynn, and S. Biswas. Face recognition from video: A review. Int. Journal of Pattern Recognition and Artificial Intelligence, 26(05), 2012.
[11] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
[12] M. D. Begg and M. K. Parides. Separation of individual-level and cluster-level covariate effects in regression analysis of correlated data. Statistics in Medicine, 22(16):2591–2602, Aug. 2003.
[13] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 19(7):711–720, Jul. 1997.
[14] A. Bell and K. Jones. Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data. Political Science Research and Methods, 3(1):133–153, Jan. 2015.
[15] M. Bereta, P. Karczmarek, W. Pedrycz, and M. Reformat. Local descriptors in application to the aging problem in face recognition. Pattern Recognition, 46(10):2634–2646, Oct. 2013.
[16] L. Best-Rowden, S. Bisht, J. Klontz, and A. K. Jain. Unconstrained face recognition: Establishing baseline human performance via crowdsourcing. In Proc. IJCB, 2014.
[17] L. Best-Rowden, H. Han, C. Otto, B. Klare, and A. K. Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. IEEE Trans. Information Forensics and Security (TIFS), 9(12):2144–2157, Dec. 2014.
[18] L. Best-Rowden and A. K. Jain. A longitudinal study of automatic face recognition. In Proc. ICB, 2015.
[19] L. Best-Rowden, B. Klare, J. Klontz, and A. K. Jain. Video-to-video face matching: Establishing a baseline for unconstrained face recognition. In Proc. BTAS, 2013.
[20] J. R. Beveridge, G. H. Givens, P. J. Phillips, and B. A. Draper. Factors that influence algorithm performance in the face recognition grand challenge. Computer Vision and Image Understanding (CVIU), 113:750–762, 2009.
[21] J. R. Beveridge, G. H. Givens, P. J. Phillips, B. A. Draper, D. S. Bolme, and Y. M. Lui. FRVT 2006: Quo vadis face quality. Image and Vision Computing, 28(5):732–743, May 2010.
[22] S. Bharadwaj, M. Vatsa, and R. Singh. Can holistic representations be used for face biometric quality assessment? In Proc. ICIP, 2013.
[23] S. Bharadwaj, M. Vatsa, and R. Singh. Biometric quality: A review of fingerprint, iris, and face. EURASIP Journal on Image and Video Processing, 34, Jul. 2014.
[24] S. Bharadwaj, M. Vatsa, and R. Singh. Biometric quality: A review of fingerprint, iris, and face. EURASIP Journal on Image and Video Processing, 34(1), 2014.
[25] A. Blanton, K. C. Allen, T. Miller, N. D. Kalka, and A. K. Jain. A comparison of human and automated face verification accuracy on unconstrained image sets. In Proc. CVPR Workshops, 2016.
[26] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proc. SIGGRAPH, 1999.
[27] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 25:1063–1074, Sep. 2003.
[28] M. Burge. IARPA Broad Agency Announcement: BAA-13-07, Janus Program. http://www.iarpa.gov/index.php/research-programs/janus/baa, Nov. 2013.
[29] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, 2012.
[30] Z. Cao, Q. Yin, X. Tang, and J. Sun. Face recognition with learning-based descriptor. In Proc. CVPR, 2010.
[31] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[32] B.-C. Chen, C.-S. Chen, and W. H. Hsu. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. Multimedia, 17(6):804–815, Apr. 2015.
[33] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In Proc. ECCV, 2012.
[34] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In Proc. CVPR, 2013.
[35] J. Chen, Y. Deng, G. Bai, and G. Su. Face image quality assessment based on learning to rank. IEEE Signal Processing Letters, 22(1):90–94, 2015.
[36] J. Cheney, B. Klein, A. K. Jain, and B. F. Klare. Unconstrained face detection: State of the art baseline and challenges. In Proc. ICB, 2015.
[37] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 23(6):681–685, Jun. 2000.
[38] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models - their training and application. Computer Vision and Image Understanding (CVIU), 61(1):38–59, Jan. 1995.
[39] Z. Cui, W. Li, D. Xu, S. Shan, and X. Chen. Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In Proc. CVPR, 2013.
[40] B. DeCann and A. Ross. Can a “poor” verification system be a “good” identification system? A preliminary study. In Proc. WIFS, 2012.
[41] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds. Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. In Proc. ICSLP, 1998.
[42] A. Dutta, R. Veldhuis, and L. Spreeuwers. A Bayesian model for predicting face recognition performance using image quality. In Proc. IJCB, 2014.
[43] M. Erbilek and M. Fairhurst. A methodological framework for investigating age factors on the performance of biometric systems. In Proc. Multimedia and Security, 2012.
[44] G. M. Fitzmaurice, N. M. Laird, and J. H. Ware. Applied Longitudinal Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition, 2011.
[45] A. P. Founds, N. Orlans, W. Genevieve, and C. I. Watson. NIST special database 32 - multiple encounter dataset II (MEDS-II). NIST Interagency Report 7807, Jul. 2011.
[46] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, Aug. 1997.
[47] X. Ge, J. Yang, Z. Zheng, and F. Li. Multi-view based face chin contour extraction. Eng. Appl. Artif. Intel., 19(5):545–555, Aug. 2006.
[48] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 23(6):643–660, 2001.
[49] R. Goh, L. Liu, X. Liu, and T. Chen. The CMU face in action (FIA) database. In Proc. AMFG, pages 255–263, 2005.
[50] D. Gong, Z. Li, D. Lin, J. Liu, and X. Tang. Hidden factor analysis for age invariant face recognition. In Proc. ICCV, 2013.
[51] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. In Proc. FGR, 2008.
[52] R. Gross and J. Shi. The CMU motion of body (MoBo) database. Technical Report CMU-RI-TR-01-18, Robotics Institute, Pittsburgh, PA, June 2001.
[53] P. Grother. Face recognition vendor test 2002: Supplemental report. NIST Interagency Report 7083, Feb. 2004.
[54] P. Grother, J. R. Matey, E. Tabassi, G. W. Quinn, and M. Chumakov. IREX VI: Temporal stability of iris recognition accuracy. NIST Interagency Report 7948, Jul. 2013.
[55] P. Grother and M. Ngan. FRVT: Performance of face identification algorithms. NIST Interagency Report 8009, May 2014.
[56] P. Grother and E. Tabassi. Performance of biometric quality measures. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 29(4):531–543, Apr. 2007.
[57] P. J. Grother, G. W. Quinn, and P. J. Phillips. Multiple biometric evaluation (MBE) 2010: Report on the evaluation of 2D still-image face recognition algorithms. NIST Interagency Report 7709, 2010.
[58] H. Han, B. F. Klare, K. Bonnen, and A. K. Jain. Matching composite sketches to face photos: A component-based approach. IEEE Trans. Information Forensics and Security (TIFS), 8(1):191–204, Jan. 2013.
[59] H. Han, C. Otto, and A. K. Jain. Age estimation from face images: Human vs. machine performance. In Proc. ICB, 2013.
[60] R.-L. Hsu, J. Shah, and B. Martin. Quality assessment of facial images. In Biometrics Symposium: Special Issue on Research at the Biometric Consortium Conference (BCC), 2006.
[61] J. Hu, J. Lu, and Y.-P. Tan. Discriminative deep metric learning for face verification in the wild. In Proc. CVPR, 2014.
[62] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Tech. Report 07-49, Univ. of Mass., Amherst, Oct. 2007.
[63] C. P. Huynh, A. Robles-Kelly, and E. R. Hancock. Shape and refractive index from single-view spectro-polarimetric images. Int. J. Comput. Vis., 101(1):64–94, 2013.
[64] A. K. Jain, B. Klare, and U. Park. Face matching and retrieval in forensics applications. IEEE Multimedia, 19(1):20–28, Jan. 2012.
[65] A. K. Jain, K. Nandakumar, and A. Ross. 50 years of biometric research: Accomplishments, challenges, and opportunities. Pattern Recognition Letters, 79:80–105, Jan. 2016.
[66] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. In Proc. ICCV, 2015.
[67] F. Juefei-Xu, K. Luu, M. Savvides, T. D. Bui, and C. Y. Suen. Investigating age invariant face recognition based on periocular biometrics. In Proc. IJCB, 2011.
[68] H. I. Kim, S. H. Lee, and Y. M. Ro. Face image assessment learned with objective and relative face image qualities for improved face recognition. In IEEE International Conference on Image Processing (ICIP), Sep. 2015.
[69] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley. Face tracking and recognition with visual constraints in real-world videos. In Proc. CVPR, 2008.
[70] B. Klare and A. K. Jain. Face recognition across time lapse: On learning feature subspaces. In Proc. IJCB, 2011.
[71] B. F. Klare, M. J. Burge, J. C. Klontz, R. W. Vorder Bruegge, and A. K. Jain. Face recognition performance: Role of demographic information. IEEE Trans. Information Forensics and Security (TIFS), 7(6):1789–1801, Dec. 2012.
[72] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus benchmark A. In Proc. CVPR, 2015.
[73] J. C. Klontz and A. K. Jain. A case study on unconstrained facial recognition using the Boston Marathon bombing suspects. Tech. Report MSU-CSE-13-4, Michigan State Univ., May 2013.
[74] S. Klum, H. Han, A. K. Jain, and B. Klare. Sketch based face recognition: Forensic vs. composite sketches. In Proc. ICB, 2013.
[75] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[76] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In Proc. ICCV, 2009.
[77] M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Wurtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42:300–311, 1993.
[78] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 24(4), Apr. 2002.
[79] K.-C. Lee, J. Ho, M. H. Yang, and D. Kriegman. Visual tracking and recognition using probabilistic appearance manifolds. Computer Vision and Image Understanding (CVIU), 99(3):303–331, 2005.
[80] K.-C. Lee and D. Kriegman. Online learning of probabilistic appearance manifolds for video-based recognition and tracking. In Proc. CVPR, volume 1, pages 852–859, 2005.
[81] H. Li, G. Hua, X. Shen, Z. Lin, and J. Brandt. Eigen-PEP for video face recognition. In Proc. ACCV, 2014.
[82] S. Z. Li and A. K. Jain, editors. Handbook of Face Recognition. New York: Springer, 2nd edition, 2011.
[83] Z. Li, U. Park, and A. K. Jain. A discriminative model for age invariant face recognition. IEEE Trans. Information Forensics and Security (TIFS), 6(3):1028–1037, Sep. 2011.
[84] S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35:1193–1205, May 2013.
[85] S. Liao, Z. Lei, D. Yi, and S. Z. Li. A benchmark study on large-scale unconstrained face recognition. In Proc. IJCB, 2014.
[86] Y. Lin, G. Medioni, and J. Choi. Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours. In Proc. CVPR, 2010.
[87] H. Ling, S. Soatto, N. Ramanathan, and D. W. Jacobs. Face verification across age progression using discriminative methods. IEEE Trans. Information Forensics and Security (TIFS), 5(1):82–91, Mar. 2010.
[88] C. Liu and H. Wechsler. Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Trans. Image Processing, 11(4):467–476, Aug. 2002.
[89] X. Liu and T. Chen. Video-based face recognition using adaptive hidden Markov models. In Proc. CVPR, pages 340–345, 2003.
[90] C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. http://arxiv.org/abs/1404.3840, Apr. 2014.
[91] Y. M. Lui, D. Bolme, B. A. Draper, J. R. Beveridge, G. Givens, and P. J. Phillips. A meta-analysis of face recognition covariates. In Proc. BTAS, 2009.
[92] A. M. Martinez and R. Benavente. The AR face database. Technical Report 24, Computer Vision Center, University of Barcelona, 1998.
[93] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):123–164, Nov. 2004.
[94] E. Mostafa, A. Ali, N. Alajlan, and A. Farag. Pose invariant approach for face recognition at distance. In Proc. ECCV, 2012.
[95] J. M. Neuhaus and J. D. Kalbfleisch. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics, 54(2):638–645, Jun. 1998.
[96] U.S. Department of Homeland Security. Biometric standards requirements for US-VISIT: Version 1.0. https://www.dhs.gov/xlibrary/assets/usvisit/usvisit_biometric_standards.pdf, Mar. 2010.
[97] National Institute of Standards and Technology (NIST). Face homepage. http://face.nist.gov, Jun. 2013.
[98] E. G. Ortiz and B. C. Becker. Face recognition for web-scale datasets. Comput. Vis. Image Und., 118(0):153–170, Jan. 2013.
[99] C. Otto, H. Han, and A. K. Jain. How does aging affect facial components? In ECCV WIAF Workshop, 2012.
[100] G. Panis, A. Lanitis, N. Tsapatsoulis, and T. F. Cootes. An overview of research on facial aging using the FG-NET aging database. IET Biometrics, May 2015.
[101] U. Park and A. K. Jain. Face recognition in video: Adaptive fusion of multiple matchers. In Proc. CVPR, pages 1–8, 2007.
[102] U. Park, Y. Tong, and A. K. Jain. Age-invariant face recognition. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 32(5), May 2010.
[103] J. Phillips. Video challenge problem: Multiple biometric grand challenge preliminary results of version 2. In MBGC 3rd Workshop, December 2009.
[104] P. J. Phillips, J. R. Beveridge, D. S. Bolme, B. A. Draper, G. H. Givens, Y. M. Lui, S. Cheng, M. N. Teli, and H. Zhang. On the existence of face quality measures. In Proc. BTAS, pages 1–8, Sep. 2013.
[105] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, and W. Worek. Preliminary face recognition grand challenge results. In Proc. FG, 2006.
[106] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and M. Bone. Face recognition vendor test 2002: Evaluation report. NIST Interagency Report 6965, Mar. 2003.
[107] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 22(10), Oct. 2000.
[108] P. J. Phillips, W. T. Scruggs, A. J. O’Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale experimental results. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 32:831–846, 2010.
[109] N. Poh and J. Kittler. A unified framework for biometric expert fusion incorporating quality measures. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 34(1):3–18, Jan. 2012.
[110] N. Poh, J. Kittler, C.-H. Chan, and M. Pandit. Algorithm to estimate biometric performance change over time. IET Biometrics, 4(4):236–245, Dec. 2015.
[111] N. Ramanathan, R. Chellappa, and S. Biswas. Computational methods for modeling facial aging: A survey. Journal of Visual Languages and Computing, 20:131–144, 2009.
[112] H. T. F. Rhodes. Alphonse Bertillon, Father of Scientific Detection. London: George G. Harrap & Co., 1956.
[113] K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Proc. FGR, 2006.
[114] A. Ross, K. Nandakumar, and A. K. Jain. Handbook of Multibiometrics. New York: Springer, 2006.
[115] J. Roth, Y. Tong, and X. Liu. Unconstrained 3D face reconstruction. In Proc. CVPR, 2015.
[116] H. Sellahewa and S. A. Jassim. Image-quality-based adaptive face recognition. IEEE Transactions on Instrumentation and Measurement, 59(4):805–813, Apr. 2010.
[117] T. Sim, S. Baker, and M. Bsat. The CMU pose, illumination, and expression database. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 25(12):1615–1618, Dec. 2003.
[118] J. D. Singer and J. B. Willett, editors. Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. New York: Oxford Univ. Press, Inc., 2003.
[119] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524, Mar. 1987.
[120] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. https://arxiv.org/abs/1502.00873, Feb. 2015.
[121] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proc. CVPR, 2014.
[122] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In Proc. CVPR, 2014.
[123] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In Proc. CVPR, 2015.
[124] E. Tabassi, M. A. Olsen, A. Makarov, and C. Busch. Towards NFIQ II lite: Self-organizing maps for fingerprint image quality assessment. NIST Interagency Report 7973, Dec. 2013.
[125] E. Tabassi and C. L. Wilson. A novel approach to fingerprint image quality. In IEEE International Conference on Image Processing (ICIP), 2005.
[126] E. Taborsky, K. Allen, A. Blanton, A. K. Jain, and B. F. Klare. Annotating unconstrained face imagery: A scalable approach. In Proc. ICB, 2015.
[127] Y. Taigman and L. Wolf. Leveraging billions of faces to overcome performance barriers in unconstrained face recognition. http://arxiv.org/abs/1108.1122, Aug. 2011.
[128] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proc. CVPR, 2014.
[129] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In Proc. CVPR, 2015.
[130] K. T. Taylor. Forensic Art and Illustration. Boca Raton, FL: CRC Press, 2000.
[131] D. Thomas, K. W. Bowyer, and P. J. Flynn. Multi-frame approaches to improve face recognition. In IEEE Workshop on Motion and Video Computing, pages 19–19, 2007.
[132] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, Winter 1991.
[133] U. Uludag and A. K. Jain. Attacks on biometric systems: A case study in fingerprints. In Proc. SPIE, 2004.
[134] R. van der Leeden, F. Busing, and E. Meijer. Bootstrap methods for two-level models. In Multilevel Conf., 1997.
[135] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Computer Vision, 57(2):137–154, May 2004.
[136] D. Wang, C. Otto, and A. K. Jain. Face search at scale: 80 million gallery. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), PP(99), Jun. 2016.
[137] H. Wang, B. Kang, and D. Kim. PFW: A face database in the wild for studying face identification and verification in uncontrolled environment. In Proc. ACPR, 2013.
[138] R. Wang, S. Shan, X. Chen, and W. Gao. Manifold-manifold distance with application to face recognition based on image set. In Proc. CVPR, pages 1–8, 2008.
[139] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Processing, 13(4):600–612, Apr. 2004.
[140] C. I. Watson. NIST mugshot identification database. http://www.nist.gov/srd/nistsd18.htm, 1994.
[141] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In Proc. CVPR, 2011.
[142] Y. Wong, S. Chen, S. Mau, C. Sanderson, and B. C. Lovell. Patch-based probabilistic image quality assessment for face selection and improved video-based face recognition. In Proc. CVPR Workshops, pages 74–81, Jun. 2011.
[143] D. Yadav, N. Kohli, P. Pandey, R. Singh, M. Vatsa, and A. Noore. Effect of illicit drug abuse on face recognition. In Proc. WACV, 2016.
[144] N. Yager and T. Dunstone. The biometric menagerie. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 32(2):220–230, Feb. 2010.
[145] M. Yang, P. Zhu, L. Van Gool, and L. Zhang. Face recognition based on regularized nearest points between image sets. In Proc. FG, 2013.
[146] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. http://arxiv.org/abs/1411.7923, Nov. 2014.
[147] D. Yi, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv:1411.7923v1, 2014.
[148] J. Yi, R. Jin, S. Jain, and A. K. Jain. Inferring users’ preferences from crowdsourced pairwise comparisons: A matrix completion approach. In First AAAI Conference on Human Computation and Crowdsourcing (HCOMP), 2013.
[149] S. Yoon and A. K. Jain. Longitudinal study of fingerprint recognition. Proc. National Academy of Sciences (PNAS), 112(28):8555–8560, Jul. 2015.
[150] C. Zhang and Z. Zhang. A survey of recent advances in face detection. Tech. Report MSR-TR-2010-66, Microsoft Research, Jun. 2010.
[151] E. Zhou, Z. Cao, and Q. Yin. Naive-deep face recognition: Touching the limit of LFW benchmark or not? Tech. Report, Face++, Megvii Inc., Jan. 2015.
[152] S. Zhou, R. Chellappa, and B. Moghaddam. Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Trans. Image Processing, 13(11):1491–1506, 2004.
[153] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Proc. CVPR, 2012.