FACE RECOGNITION: FACE IN VIDEO, AGE INVARIANCE, AND FACIAL MARKS

By

Unsang Park

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2009

ABSTRACT

FACE RECOGNITION: FACE IN VIDEO, AGE INVARIANCE, AND FACIAL MARKS

By Unsang Park

Automatic face recognition has been extensively studied over the past decades in various domains (e.g., 2D, 3D, and video), resulting in dramatic improvements. However, face recognition performance severely degrades under pose, lighting and expression variations, occlusion, and aging. Pose and lighting variations, along with low image resolution, are the major sources of degradation of face recognition performance in surveillance video. We propose a video-based face recognition framework using 3D face modeling and Pan-Tilt-Zoom (PTZ) cameras to overcome the pose/lighting variation and low resolution problems. We propose a 3D aging modeling technique and show how it can be used to compensate for age variations and improve face recognition performance. The aging modeling technique adapts view invariant 3D face models to the given 2D face aging database. We also propose an automatic facial mark detection method and a fusion scheme that combines the facial mark matching with a commercial face recognition matcher. The proposed approach can be used i) as an indexing scheme for a face image retrieval system and ii) to augment global facial features to improve recognition performance. Experimental results show i) high recognition accuracy (>99%) on large scale video data (>200 subjects), ii) ~10% improvement in recognition accuracy using the proposed aging model, and iii) ~0.94% improvement in recognition accuracy by utilizing facial marks.

Dedicated to my sweetheart, son, and parents.

ACKNOWLEDGMENTS

This is one of the most important and pleasing moments in my life: reaching a milestone in one of my longest projects, the PhD thesis. It has been a really long time and finally I am on the verge of graduating. This could not have been possible without all the academic and personal support around me. I would like to thank Dr. Anil K. Jain for giving me the research opportunity in face recognition. He has always inspired me with interesting and challenging problems. He showed me not only which problems we need to solve but also how to solve them effectively and efficiently. I am still working with him as a postdoc and still learning valuable academic practices. I thank Dr. George C. Stockman for raising a number of interesting questions in my research in face recognition. I also thank him as my former MS advisor. I thank Dr. Rong Jin for his guidance on the machine learning aspects used to improve some of the approaches in face recognition. I thank Dr.
Yiying Tong for his advice and collaboration in all the meetings for the age invariant face recognition work. I thank Dr. Lalita Udpa for joining the committee at the last moment and reviewing my PhD work. I thank Greggorie P. Michaud, who provided the Michigan mugshot database that greatly helped in the facial mark study. I thank Dr. Tsuhan Chen for providing us the Face In Action database for my work in video based face recognition. I thank Dr. Mario Savvides for providing the efficient AAM tool for fast landmark detection. I thank Dr. Karl Ricanek Jr. for providing us the MORPH database in a timely manner. I would also like to thank the organizations and individuals who were responsible for making the FERET and FG-NET databases available. I would like to thank former graduates of the PRIP lab: Dr. Arun Ross, Dr. Umut Uludag, Dr. Hiu Chung Law, Dr. Karthik Nandakumar, Dr. Hong Chen, Dr. Xiaoguang Lu, Dr. Dirk Joel Luchini Colbry, Meltem Demirkus, Stephen Krawczyk, and Yi Chen. I would like to thank all current Prippies: Abhishek Nagar, Pavan Kumar Mallapragada, Brendan Klare, Serhat Bucak, Soweon Yoon, Alessandra Paulino, Kien Nguyen, Rayshawn Holbrook, and Nick Gregg. They have helped me with computer and programming problems, provided their own biometric data, and shared jokes and each other's company. I finally thank my parents for supporting my studies. I thank my mother-in-law and father-in-law. I thank my wife, Jung-Bun Lee, for supporting me over the last seven years of marriage. I thank my son, Andrew Chan-Jong Park, for giving me happy smiles all the time.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
  1.1 Face Detection
  1.2 Face Recognition
  1.3 Face Recognition in 2D
    1.3.1 Challenges in 2D Face Recognition
    1.3.2 Pose Variation
    1.3.3 Lighting Variation
    1.3.4 Occlusion
    1.3.5 Expression
    1.3.6 Age Variation
    1.3.7 Face Representation
  1.4 Face Recognition in Video
    1.4.1 Surveillance Video
    1.4.2 Challenges
  1.5 Face Recognition in 3D Domain
    1.5.1 Challenges
  1.6 Summary
  1.7 Thesis Contributions
2 Video-based Face Recognition
  2.1 View-based Recognition
    2.1.1 Fusion Scheme
    2.1.2 Face Matchers and Database
    2.1.3 Tracking Feature Points
    2.1.4 Active Appearance Model (AAM)
    2.1.5 AAM Training
    2.1.6 Structure from Motion
    2.1.7 3D Shape Reconstruction
    2.1.8 3D Facial Pose Estimation
    2.1.9 Motion Blur
    2.1.10 Experimental Results
  2.2 View-synthetic Face Recognition in Video
    2.2.1 Texture Mapping
    2.2.2 Experimental Results
  2.3 Video Surveillance
    2.3.1 Moving Object Detection
    2.3.2 Object Tracking
    2.3.3 Experimental Results
  2.4 Face Recognition in Video at a Distance
    2.4.1 Image Acquisition System
    2.4.2 Calibration of Static and PTZ Cameras
    2.4.3 Face Video Database with PTZ Camera
    2.4.4 Motion Blur in PTZ Camera
    2.4.5 Parallel vs. Perspective Projection
    2.4.6 Experimental Results
  2.5 Summary
3 Age Invariant Face Recognition
  3.1 Introduction
  3.2 Aging Model
    3.2.1 2D Facial Feature Point Detection
    3.2.2 3D Model Fitting
    3.2.3 3D Aging Model
  3.3 Aging Simulation
  3.4 Experimental Results
    3.4.1 Face Recognition Tests
    3.4.2 Effects of Different Cropping Methods
    3.4.3 Effects of Different Strategies in Employing Shape and Texture
    3.4.4 Effects of Different Filling Methods in Model Construction
  3.5 Summary
4 Facial Marks
  4.1 Introduction
  4.2 Applications of Facial Marks
  4.3 Categories of Facial Marks
  4.4 Facial Mark Detection
    4.4.1 Primary Facial Feature Detection
    4.4.2 Mapping to Mean Shape
    4.4.3 Generic and User Specific Mask Construction
    4.4.4 Blob Detection
    4.4.5 Facial Mark Based Matching
  4.5 Experimental Results
  4.6 Summary
5 Conclusions and Future Directions
  5.1 Conclusions
  5.2 Future Directions
APPENDICES
A Databases
BIBLIOGRAPHY

LIST OF TABLES

1.1 Face recognition scenarios in 2D domain.
1.2 Face recognition scenarios across 2D and 3D domain.
2.1 A comparison of video based face recognition methods.
2.2 Face recognition performance according to gallery, probe, and matcher for video example 1.
2.3 Face recognition performance according to gallery, probe, and matcher for video example 2.
3.1 A comparison of methods for modeling aging for face recognition.
3.2 Databases used in aging modeling.
3.3 Probe and gallery data used in age invariant face recognition tests.
4.1 Face recognition accuracy using FaceVACS matcher, proposed facial marks matcher, and fusion of the two matchers.
A.1 Databases used for various problems addressed in the thesis.
LIST OF FIGURES

Images in this dissertation are presented in color.

1.1 Example applications using face biometric: (a) ID cards (from [1]), (b) face matching and retrieval (from [2]), (c) access control (from [2]), and (d) DynaVox EyeMax system (controlled by eye gaze and blinking, from [3]).
1.2 Example face detection result (from [123]).
1.3 Face recognition performance in FRVT 2002 (snapshot of computer screen) [90].
1.4 Reduction in face recognition error rates from 1993 to 2006 (snapshot of computer screen) [92].
1.5 Example images showing pose, lighting, and expression variations.
1.6 Example images showing occlusions.
1.7 Images of the same subject at age (a) 5, (b) 10, (c) 16, (d) 19, and (e) 29 [4].
1.8 Four frames from a video: number of pixels between the eyes is ~45 [37].
1.9 Example face images from a surveillance video: number of pixels between eyes is less than ten and the facial pose is severely off-frontal [5].
1.10 A 3D face model and its 2D projections.
2.1 Schematic of the proposed face recognition system in video.
2.2 Pose variations in probe images and the pose values where matching succeeds at rank-one: red circles represent pose values where FaceVACS succeeds.
2.3 Pose variations in probe images and the pose values where matching succeeds at rank-one: red circles represent pose values where PCA succeeds.
2.4 Example images from the Face In Action (FIA) database. Six different cameras record the face images at the same time. Six images at three time instances are shown here. The frontal view at a close distance (fifth image from top to bottom, left to right) is used in the experiments.
2.5 Example of face image cropping based on the feature points. (a) Face images with AAM feature points and (b) corresponding cropped face images.
2.6 Pose estimation scheme.
2.7 Pose distribution in yaw-pitch space in (a) gallery and (b) probe data.
2.8 Face recognition performance on two different gallery data sets: (i) random gallery: random selection of pose and motion blur, (ii) composed gallery: frames selected based on specific pose and with no motion blur.
2.9 Cumulative matching scores using dynamic information (pose and motion blur) for Correlation matcher.
2.10 Cumulative matching scores using dynamic information (pose and motion blur) for PCA matcher.
2.11 Cumulative matching scores using dynamic information (pose and motion blur) for FaceVACS matcher.
2.12 Cumulative Matching Characteristic curves with the effect of pitch for correlation matcher.
2.13 Cumulative Matching Characteristic curves with the effect of pitch for PCA matcher.
2.14 Cumulative Matching Characteristic curves with the effect of pitch for FaceVACS matcher.
2.15 Cumulative Matching Characteristic curves with the effect of pitch for correlation matcher.
2.16 Cumulative Matching Characteristic curves with the effect of pitch for PCA matcher.
2.17 Cumulative Matching Characteristic curves with the effect of pitch for FaceVACS matcher.
2.18 Cumulative matching scores by fusing multiple face matchers and multiple frames in near-frontal pose range (−20° ≤ (yaw & pitch) < 20°).
2.19 Proposed face recognition system with 3D model reconstruction and frontal view synthesis.
2.20 Texture mapping. (a) typical video sequence used for the 3D reconstruction; (b) single frame with triangular meshes; (c) two frames with triangular meshes; (d) reconstructed 3D face model with one texture mapping from (b); (e) reconstructed 3D face model with two texture mappings from (c). The two frontal poses in (d) and (e) are correctly identified in the matching experiment.
2.21 RMS error between the reconstructed shape and true model.
2.22 RMS error between the reconstructed and ideal rotation matrix, M.
2.23 Examples where 3D face reconstruction failed. (a), (b), (c), and (d) show the failure of feature point detection using AAM; (e), (f), and (g) show failures due to deficiency of motion cue. The resulting reconstruction of the 3D face model is shown in (h).
2.24 Face recognition performance with 3D face modeling.
2.25 3D model-based face recognition results on six subjects (Subject IDs in the FIA database are 47, 56, 85, 133, 198, and 208). (a) Input video frames; (b), (c) and (d) reconstructed 3D face models at right view, left view, and frontal view, respectively; (e) frontal images enrolled in the gallery database. None of the frames in (a) is correctly identified, while the synthetic frontal views in (d) obtained from the reconstructed 3D models are correctly identified for the first five subjects, but not for the last subject (#208). The reconstructed 3D model of the last subject appears very different from the gallery image, resulting in the recognition failure.
2.26 Proposed surveillance system. The ViSE is a bridge between the human operator and a surveillance camera system.
2.27 Background subtraction.
2.28 Intra- and inter-camera variations of observed color values. (a) original color values, (b) observed color values from camera 1, (c) camera 2, and (d) camera 3 at three different time instances.
2.29 Schematic retrieval result using ViSE.
2.30 Schematic of face image capture system at a distance.
2.31 Schematic of camera calibration.
2.32 Calibration between static and PTZ cameras.
2.33 Example of motion blur. Example close-up image: (a) without motion blur and (b) with motion blur.
2.34 Parallel vs. perspective projection. (a) face image captured at a distance of ~10m, (b) parallel projection of the 3D model, (c) face images captured at a distance of ~1m, (d) perspective projection of the 3D model.
2.35 Effect of projection model on face recognition performance.
2.36 Face recognition performance with static and closeup views.
2.37 Face recognition performance using real and synthetic gallery images and multiple frames.
3.1 Example images in (a) FG-NET and (b) MORPH databases. Multiple images of one subject in each of the two databases are shown at different ages. The age value is given below each image.
3.2 3D model fitting process using the reduced morphable model.
3.3 Four example images with manually labeled 68 points (blue) and the automatically recovered 13 points (red) for the forehead region.
3.4 3D aging model construction.
3.5 Aging simulation from age x to y.
3.6 An example aging simulation in FG-NET database.
3.7 Example aging simulation process in MORPH database.
3.8 Example images showing different face cropping methods: (a) original image, (b) no-forehead and no pose correction, (c) forehead and no pose correction, (d) forehead and pose correction.
3.9 Cumulative Match Characteristic (CMC) curves with different methods of face cropping and shape & texture modeling.
3.10 Cumulative Match Characteristic (CMC) curves showing the performance gain based on the proposed aging model.
3.11 Rank-one identification accuracies for each probe and gallery age group: (a) before aging simulation, (b) after aging simulation, and (c) the amount of improvement after aging simulation.
3.12 Example matching results before and after aging simulation for seven different subjects: (a) probe, (b) pose-corrected probe, (c) age-adjusted probe, (d) pose-corrected gallery and (e) gallery. All the images in (b) failed to match with the corresponding images in (d), but images in (c) were successfully matched to the corresponding images in (d) for the first five subjects. Matching for the last two subjects failed both before and after aging simulation. The ages of (probe, gallery) pairs are (0,18), (0,9), (4,14), (3,20), (30,54), (0,7), and (23,31), respectively, from the top to bottom row.
3.13 Example matching results before and after aging simulation for four different subjects: (a) probe, (b) pose-corrected probe, (c) age-adjusted probe, (d) pose-corrected gallery and (e) gallery. All the images in (b) succeeded to match with the corresponding images in (d), but images in (c) failed to match to the corresponding images in (d). The ages of (probe, gallery) pairs are (2,7), (4,9), (7,18), and (24,45), respectively, from the top to bottom row.
4.1 Facial marks: freckle (spot), mole, and scar.
4.2 Two face images of the same person. A leading commercial face recognition engine failed to match these images at rank-1. There are a few prominent facial marks that can be used to make a better decision.
4.3 Three different types of example queries and retrieval results: (a) full face, (b) partial face, and (c) non-frontal face (from video). The mark that is used in the retrieval is enclosed with a red circle.
4.4 Examples of distinctive marks.
4.5 Statistics of facial marks based on a database of 426 images in the FERET database. Distributions of facial mark types on the mean face and the percentage of each mark type are shown.
4.6 Effects of generic and user specific masks on facial mark detection. TP increases and both FN and FP decrease by using the user specific mask.
4.7 Schematic of automatic facial mark extraction process.
4.8 Ground truth and automatically detected facial marks for four images in our database.
4.9 Schematic of the definitions of precision and recall.
4.10 Precision and recall curve of the proposed facial mark detection method.
4.11 An example face image pair that did not match correctly at rank-1 using FaceVACS but matched correctly after fusion for the ground truth (probe) to automatic marks (gallery) matching. Colored (black) boxes represent matched (unmatched) marks.
4.12 First three rows show three example face image pairs that did not match correctly at rank-1 using FaceVACS but matched correctly after fusion for the ground truth (probe) to ground truth (gallery) matching. Colored (black) boxes represent matched (unmatched) marks. Fourth row shows an example that matched correctly with FaceVACS but failed to match after fusion. The failed case shows zero matching score in mark based matching due to the error in facial landmark detection.

Chapter 1

Introduction

Face recognition is the ability to establish a subject's identity based on facial characteristics. Automated face recognition requires various techniques from different research fields, including computer vision, image processing, pattern recognition, and machine learning. In a typical face recognition system, face images from a number of subjects are enrolled into the system as gallery data, and the face image of a test subject (probe image) is matched to the gallery data using a one-to-one or one-to-many scheme. The one-to-one and one-to-many matchings are called verification and identification, respectively. Face recognition is one of the fundamental methods used by human beings to interact with each other. Attempts to match faces using a pair of photographs date back to 1871 in a British court [96]. Techniques for automatic face recognition have been developed over the past three decades for the purpose of automatic person recognition with still and video images.

Face recognition has a wide range of applications, including law enforcement, civil applications, and surveillance systems. Face recognition applications have also been extended to smart home systems, where the recognition of the human face and expression is used for better interactive communications between humans and machines [63]. Fig. 1.1 shows some biometric applications using the face.

Figure 1.1. Example applications using face biometric: (a) ID cards (from [1]), (b) face matching and retrieval (from [2]), (c) access control (from [2]), and (d) DynaVox EyeMax system (controlled by eye gaze and blinking, from [3]).

The face has several advantages that make it one of the most preferred biometric traits. First, the face biometric is easy to capture even at a long distance. Second, the face conveys not only the identity but also the internal feelings (emotion) of the subject (e.g., happiness or sadness) and the person's age. This makes face recognition an important topic in human computer interaction as well as person recognition. The face biometric is affected by a number of intrinsic (e.g., expression and age) and extrinsic (e.g., pose and lighting) variations. While there has been a significant improvement in face recognition performance during the past decade, it is still below acceptable levels for use in many applications [63] [90]. Recent efforts have focused on using 3D models, video input, and different features (e.g., skin texture) to overcome the performance bottleneck in 2D still face recognition.
This chapter begins with a survey of face recognition in the 2D, 3D, and video domains and presents the challenges in face recognition problems. We also introduce problems in face recognition due to subject aging. The relevance of facial marks or micro features (e.g., scars, birthmarks) to face recognition is also presented.

1.1 Face Detection

The first problem that needs to be addressed in face recognition is face detection [131] [21] [132]. Some of the well-known face detection approaches can be categorized as: i) color based [47], ii) template based [28], and iii) feature based [103] [123] [44] [61] [127]. Color based approaches learn a statistical model of skin color and use it to segment face candidates in an image. Template based approaches use templates that represent the general face appearance and use cross correlation based methods to find face candidates. State-of-the-art face detection methods are based on local features and machine learning based binary classification (e.g., face versus non-face), following the seminal work by Viola et al. [123]. The face detector proposed by Viola et al. has been widely used in various studies involving face recognition because of its real-time capability, high accuracy, and availability in the Open Computer Vision Library (OpenCV) [6]. Fig. 1.2 shows an example face detection result using the method in [123].

Figure 1.2. Example face detection result (from [123]).

1.2 Face Recognition

In a typical face recognition scenario, face images from a number of subjects are enrolled into the system as gallery data, and the face image of a test subject (probe image) is matched to the gallery data using a one-to-one or one-to-many scheme. There are three different modalities that are used in face recognition applications: 2D, 3D, and video. We will review the face recognition problems in these domains in the following sections.

Table 1.1. Face recognition scenarios in 2D domain.

  Probe \ Gallery       Single still image    Many still images
  Single still image    one-to-one            one-to-many
  Many still images     many-to-one           many-to-many

1.3 Face Recognition in 2D

Face recognition has been well studied using 2D still images for over a decade [118] [55] [135]. In 2D still image based face recognition systems, a snapshot of a user is acquired and compared with a gallery of snapshots to establish the person's identity. In this procedure, the user is expected to be cooperative and provide a frontal face image under uniform lighting conditions with a simple background to enable the capture and segmentation of a high quality face image. However, it is now well known that small variations in pose and lighting can drastically degrade the performance of single-shot 2D image based face recognition systems [63]. 2D face recognition is usually categorized according to the number of images used in matching, as shown in Table 1.1. Some of the well-known algorithms for 2D face recognition are based on Principal Component Analysis (PCA) [118] [55], Linear Discriminant Analysis (LDA) [33], Elastic Bunch Graph Matching (EBGM) [126], and correlation based matching [62]. 2D face recognition technology is evolving continuously. In the Face Recognition Vendor Test (FRVT) 2002 [90], identification accuracy of around 70% was achieved given near frontal pose and normal lighting conditions on a large database (121,589 images from 37,437 subjects, Fig. 1.3).
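For concreteness, the sketch below shows one way rank-one identification accuracy and a Cumulative Match Characteristic (CMC) curve can be computed from a precomputed probe-versus-gallery similarity matrix in the closed-set, one-to-many setting of Table 1.1. This is an illustrative sketch only, not the FRVT protocol or the implementation used in this thesis; the function names, array layout, and toy scores are assumptions.

```python
import numpy as np

def rank_one_accuracy(scores, probe_ids, gallery_ids):
    """Closed-set identification: each probe takes the identity of the gallery
    entry with the highest similarity score (one-to-many matching).

    scores      : (num_probes, num_gallery) similarity matrix
    probe_ids   : true identity label of each probe
    gallery_ids : identity label of each gallery entry
    """
    best = np.argmax(scores, axis=1)                 # best-matching gallery index per probe
    predicted = np.asarray(gallery_ids)[best]
    return np.mean(predicted == np.asarray(probe_ids))

def cmc_curve(scores, probe_ids, gallery_ids, max_rank=20):
    """CMC: fraction of probes whose true identity appears among the top-k
    ranked gallery matches, for k = 1..max_rank (closed set assumed)."""
    order = np.argsort(-scores, axis=1)              # gallery indices sorted by descending score
    ranked_ids = np.asarray(gallery_ids)[order]
    hits = ranked_ids == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)                  # 0-based rank of the correct identity
    return np.array([np.mean(first_hit < k) for k in range(1, max_rank + 1)])

# Toy example: 3 probes against a 4-subject gallery (hypothetical scores).
scores = np.array([[0.9, 0.2, 0.1, 0.3],
                   [0.2, 0.1, 0.7, 0.4],
                   [0.3, 0.6, 0.2, 0.5]])
print(rank_one_accuracy(scores, probe_ids=[0, 2, 3], gallery_ids=[0, 1, 2, 3]))
```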
The FRVT 2006 evaluations were performed in a verification scenario, and the best performing system showed a 0.01 False Reject Rate (FRR) at a False Accept Rate (FAR) of 0.001 (Fig. 1.4), given high resolution (400 pixels between the eyes¹) 2D images (by Neven Vision [7]) or 3D images (by Viisage² [8]). Face recognition can be performed in open set or closed set scenarios. Closed set recognition tests always have the probe subject enrolled in the gallery data, but open set recognition considers the possibility that the probe subject is not enrolled in the gallery. Therefore, a threshold value (on match score) is typically used in retrieving candidate matches in open set recognition tests [108].

Figure 1.3. Face recognition performance in FRVT 2002 (snapshot of computer screen) [90].

Figure 1.4. Reduction in face recognition error rates from 1993 to 2006 (snapshot of computer screen) [92].

¹Even though image resolution is also defined in dots-per-inch (dpi) or pixels-per-inch (ppi), these measures are meaningful only when the image is printed. Since a digital image is represented as a set of pixels when processed by a computer, we use the number of pixels as the measure of image resolution. The number of pixels between the centers of the eyes was used as the measure of image resolution in FRVT 2006 [92].
²Now L-1 Identity Solutions.

1.3.1 Challenges in 2D Face Recognition

More effort has been devoted to 2D face recognition because of the availability of commodity 2D cameras and deployment opportunities in many security scenarios. However, 2D face recognition is susceptible to a variety of factors encountered in practice, such as pose and lighting variations, expression variations, age variations, and facial occlusions. Fig. 1.5 and Fig. 1.6 show examples of the pose and lighting variations and occlusion. Local feature based recognition has been proposed to overcome the global variations from pose and lighting changes [113] [133] [14]. The use of multiple frames with temporal coherence in a video [136] [10] and 3D face models [17] [71] have also been proposed to improve the recognition rate.

Figure 1.5. Example images showing pose, lighting, and expression variations: (a) frontal, (b) non-frontal, (c) lighting, (d) expression.

Figure 1.6. Example images showing occlusions: (a) glasses, (b) sunglasses, (c) hat, (d) scarf.

1.3.2 Pose Variation

Pose variation is one of the major sources of performance degradation in face recognition [63]. The face is a 3D object that appears different depending on the direction from which it is imaged. Thus, it is possible that images taken at two different view points of the same subject (intra-user variation) may appear more different than two images taken from the same view point for two different subjects (inter-user variation).

1.3.3 Lighting Variation

It has been shown that the difference in face images of the same person due to severe lighting variation can be more significant than the difference in face images of different persons [134]. Since the face is a 3D object, different lighting sources can generate various illumination conditions and shadings.
There have been studies to develop invariant facial features that are robust against lighting variations, and to learn and compensate for the lighting variations using prior knowledge of lighting sources based on training data [134] [22] [97]. These methods provide visually enhanced face images after lighting normalization and show improved recognition accuracy of up to 100%.

1.3.4 Occlusion

Face images often appear occluded by other objects or by the face itself (i.e., self-occlusion), especially in surveillance videos. Most of the commercial face recognition engines reject an input image when the eyes cannot be detected. Local feature based methods have been proposed to overcome the occlusion problem [73] [46].

1.3.5 Expression

Facial expression is an internal variation that causes large intra-class variation. There are some local feature based approaches [73] and 3D model based approaches [52] [70] designed to handle the expression problem. On the other hand, the recognition of facial expressions is an active research area in human computer interaction and communications [23].

1.3.6 Age Variation

The effect of aging on face recognition performance has not been substantially studied. There are a number of reasons that explain the lack of studies on aging effects:

• Pose and lighting variations are more critical factors degrading face recognition performance.
• Template update¹ can be used as an easy work-around for aging variation.
• There has been no public domain database for studying aging until recently.

Aging related changes on the face appear in a number of different ways: i) wrinkles and speckles, ii) weight loss and gain, and iii) change in shape of face primitives (e.g., sagged eyes, cheeks, or mouth). All these aging related variations degrade face recognition performance. These variations could be learned and artificially introduced or removed in a face image to improve face recognition performance. Even though it is possible to update the template images as the subject ages, template updating is not always possible in cases of i) missing children, ii) screening, and iii) multiple enrollments, where subjects are either not available or purposely trying to hide their identity. Therefore, facial aging has become an important research problem in face recognition. Fig. 1.7 shows five different images of the same subject taken at different ages from the FG-NET database [4].

Figure 1.7. Images of the same subject at age (a) 5, (b) 10, (c) 16, (d) 19, and (e) 29 [4].

¹Template update represents updating the enrolled biometric template to reduce the error rate caused by template aging.

1.3.7 Face Representation

Most face recognition techniques use one of two representation approaches: i) local feature based [72] [87] [126] [113] [133] [14] or ii) holistic [119] [55] [79] [122]. Local feature based approaches identify local features (e.g., eyes, nose, mouth, or skin irregularities) in a face and generate a representation based on their geometric configuration. Holistic approaches localize the face and use the entire face region in the sensed image to generate a representation. A dimensionality reduction technique (e.g., PCA) is used for this purpose. Discriminating information present in micro facial features (e.g., moles or speckles) is usually ignored and considered as noise. Applying further transformations on the holistic representation is also a common technique (e.g., LDA).
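As a minimal illustration of the holistic representation just described, the sketch below projects vectorized, aligned face images onto a PCA subspace in the spirit of the eigenfaces approach [118] [55] and compares two faces by distance in that subspace. It is a hedged sketch under assumed array shapes and names, not the specific implementation evaluated in this thesis.

```python
import numpy as np

def fit_pca(faces, num_components=50):
    """faces: (num_images, num_pixels) matrix of vectorized, aligned face crops.
    Returns the mean face and the top principal directions ("eigenfaces")."""
    mean_face = faces.mean(axis=0)
    centered = faces - mean_face
    # SVD of the centered data; rows of vt are principal directions in pixel space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:num_components]

def project(face, mean_face, eigenfaces):
    """Holistic PCA representation: coefficients of a face in the eigenface basis."""
    return eigenfaces @ (face - mean_face)

def match_score(probe_coeffs, gallery_coeffs):
    """Simple similarity: negative Euclidean distance in the PCA subspace."""
    return -np.linalg.norm(probe_coeffs - gallery_coeffs)

# Toy usage with random data standing in for aligned 32x32 face crops.
rng = np.random.default_rng(0)
train = rng.random((100, 32 * 32))
mean_face, eigenfaces = fit_pca(train, num_components=20)
probe = project(rng.random(32 * 32), mean_face, eigenfaces)
gallery = project(rng.random(32 * 32), mean_face, eigenfaces)
print(match_score(probe, gallery))
```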
A combination of local and holistic representations has also been studied [41] [56]. Local feature based approaches can be further categorized as: i) component based [43] [54], ii) modular [38] [73] [88] [114] [34], and iii) skin detail based [64] [93]. Component based approaches try to identify local facial primitives (e.g., eyes, nose, and mouth) and use either all or a subset of them to generate features for matching. Modular approaches subdivide the face region, irrespective of the facial primitives, to generate a representation. Skin detail based approaches have recently gained attention due to the availability of high resolution (more than 400 pixels between the eyes) face images. Facial irregularities (e.g., freckles, moles, scars) can be explicitly or implicitly captured and used for matching in high resolution face images.

1.4 Face Recognition in Video

While conventional face recognition systems mostly rely upon still shot images, there is significant interest in developing robust face recognition systems that accept video as input. Face recognition in video has attracted interest due to the widespread deployment of surveillance cameras. The ability to automatically recognize faces in real time from video will facilitate, among other things, covert human identification using an existing network of surveillance cameras. However, face images in video are often in off-frontal poses and can undergo substantial lighting changes, thereby degrading the performance of most commercial face recognition systems. Two distinctive characteristics of video are the availability of: i) multiple frames of the same subject and ii) temporal information. Multiple frames ensure a variation of poses, allowing a proper selection of a good quality frame (e.g., a high quality face image in near-frontal pose) for high recognition performance. The temporal information in video is regarded as the information embedded in the dynamic facial motion. However, it is difficult to determine whether there is any identity-related information in the facial motion; more work needs to be done to utilize the temporal information. Some of the work on video based face recognition is summarized in Table 2.1. By taking advantage of the characteristics of video, the performance of a face recognition system can be enhanced. Fig. 1.8 shows four frames in a typical video captured for face recognition studies [37].

Figure 1.8. Four frames from a video: number of pixels between the eyes is ~45 [37].

1.4.1 Surveillance Video

The general concept of video based face recognition covers all types of face recognition in any video data. However, face recognition in surveillance video is more challenging than typical video based face recognition for the following reasons:

• Pose variations: The subject's cooperation cannot be assumed because of the covert characteristics of surveillance applications. Also, the cameras are installed at elevated positions, resulting in a low probability of capturing frontal face images.
• Lighting variations: Surveillance systems are often installed in outdoor locations
Due to the difficulties in simultaneously handling all of the above variations in a surveillance video, for research purposes it is customary to use a set of video data with a limited number of variations (e.g., pose or lighting variations) [80] [81]. Fig. 1.9 shows a typical surveillance video captured at a security check point at an airport. There are severe degradations in quality in terms of pose and resolution compared to Fig. 1.8. Figure 1.9. Example face images from a surveillance video: number of pixels between eyes is less than ten and the facial pose is severely off-frontal [5] 1 .4.2 Challenges The difficulty of face recognition in video depends on the quality of face images in terms of pose, lighting variations, occlusion, and resolution. The large number of frames in video also increases the computational burden. Unlike the still shot 2D image, surveillance video usually contains multiple subjects in a sequence of frames. Most of the real-time face detectors [123] are able to detect multiple faces in the given image. A simultaneous detection and recognition can be performed by associ- ating each face in current frame with face images observed in previous frames. Low resolution problems have been addressed by adapting super resolution based image enhancement [53]. 1.5 Face Recognition in 3D Domain 3D face recognition methods use the surface geometry of the face [71]. Unlike 2D face recognition, 3D face recognition is robust against pose and lighting variations due to the invariance of the 3D shape against these variations. A 3D image captured from a face by a 3D sensor covers about 120° from right end to left end and this is called a 2.5D image. A full 3D model covering 360° of a face is constructed by combining 14 Table 1.2. Face recognition scenarios across 2D and 3D domain. Probe Gallery 2D images 3D models or 2.5D images 2D images 2D to 2D 2D to 3D 2.5D images 3D to 2D 3D to 3D multiple (3 to 5) 2.5D scans. The probe is usually a 2.5D image and the gallery can be either a 2.5D image or a 3D model. Identification can be performed between two range (depth) images [71] or between a 2D image and the 3D face model [60]. Table 1.2 extends Table 1.1 across 2D and 3D face models. There also have been many approaches that are based on reconstructed 3D models from a set of 2D images [36] [17]. The reconstructed 3D model is used to obtain multiple 2D projection images that are matched with probe images [60]. Alternatively, the reconstructed 3D model can be used to generate a frontal view of the probe image with arbitrary pose and lighting conditions; the recognition is performed by matching the synthesized probe in frontal pose. Fig. 1.10 shows a 3D face model and its corresponding 2D projection images under different pose and lighting conditions. 1.5.1 Challenges 3D face models are usually represented as a polygonal (e.g., triangular or rectangular) mesh structure for computational efficiency [83]. The 3D mesh structure changes depending on the preprocessing (e.g., smoothing, filling holes, etc.) , mesh construction process, and imaging process (scanning with laser sensor). Even though the 3D geometry of a face model changes depending on the pose, this change is very small and the model is generally regarded as pose invariant. Similarly, the model is also robust against lighting variations. However, 3D face recognition is not invariant against variations in expression, aging, and occlusion. There have been several studies on 15 Figure 1.10. A 3D face model and its 2D projections. 
There have been several studies on expression invariant 3D face recognition [52] [70], and studies on age variation are beginning to appear [82] [104]. The drawbacks of 3D face recognition are the large size of the 3D model, which requires a high computation cost in matching, and the expensive price of 3D imaging sensors.

1.6 Summary

We have reviewed various face recognition schemes with respect to different data modalities: 2D, video, and 3D. Even though there have been steady improvements in face recognition performance over the past decade, several challenges remain due to the large intra-class variations and small inter-class variations. These variations are mostly due to pose and lighting variations, expression, occlusion, aging, and non-robust representations of face image data.

While 3D face recognition has been studied to overcome pose and lighting problems, a number of factors have prevented its practical application; these include computation cost, sensor cost, and the large legacy data in the 2D domain. Video based face recognition is important for its need in surveillance. However, the video domain has its own challenging set of problems related to severe pose and lighting variations and poor resolution. Taking advantage of the rich temporal information in video and using 3D modeling techniques to assist video based face recognition has been regarded as a promising approach.

This thesis focuses on three major problems in face recognition. First, we utilize 3D modeling techniques, temporal information in video, and a surveillance camera setup to improve the performance of video based face recognition. Second, we develop a framework for age invariant face recognition. Third, we develop a framework for utilizing secondary local features (e.g., facial marks) as a means of complementing the primary facial features to improve face matching and retrieval performance.

1.7 Thesis Contributions

We have developed methods to improve face recognition performance in three ways: i) using temporal information in video, ii) modeling facial variations due to aging, and iii) utilizing secondary features (e.g., facial marks). The contributions of the thesis are summarized below.

• A systematic method of gallery and probe construction using video data is proposed. Pose and motion blur significantly affect face recognition accuracy. We perform face recognition in video by selectively using a subset of frames that are in near frontal pose with small blur. Fusion across multiple frames and multiple matchers on the selected frames results in high identification accuracy.

• We use 3D modeling techniques to overcome the pose variation problem. The Factorization algorithm [116] is adapted for 3D model reconstruction to synthesize frontal views and improve the matching accuracy. The synthesized frontal face images substantially improve the face recognition performance.

• A multi-camera surveillance system that captures soft-biometric features (e.g., height and clothing color) has been developed. The soft biometric information is coupled with face biometrics to provide robust tracking and identification capabilities to conventional surveillance systems.

• We propose a pair of static and Pan-Tilt-Zoom (PTZ) cameras to overcome the low image resolution problem. The static camera is used to locate the face and the PTZ camera is used to zoom in and track the face image. The close-up view of the face provides a high resolution face image that substantially improves the recognition accuracy.
• To address age invariant face recognition, we use the Principal Component Analysis (PCA) technique to model the shape and texture separately. The PCA coefficients are estimated from a training database containing multiple images at different ages from a number of subjects to construct an aging pattern space. The aging pattern space is used to correct for aging and narrow down the age separation between probe and gallery images.

• An automatic facial mark detection system is developed that can be used in face matching and retrieval. We have used an Active Appearance Model (AAM) [26] to localize and mask primary facial features (e.g., eyes, eyebrows, nose, and mouth). A Laplacian of Gaussian (LoG) operator is then applied on the rest of the face area to detect facial mark candidates. A fusion of mark based matching and a commercial matcher shows that the recognition performance can be improved.

Chapter 2

Video-based Face Recognition

Deciding a person's identity based on a sequence of face images appearing in a video is called video based face recognition. Unlike still-shot 2D images, video data contains rich information in multiple frames. However, the pose and lighting variations in a video are more severe compared with still-shot 2D images. This is mostly because human subjects are more cooperative in the still-shot image capture process. On the contrary, video data is most often captured in covert applications and the subject's cooperation is not usually expected. Therefore, video based face recognition presents some additional challenges.

There have been a number of studies that perform face recognition specifically on video streams. Chowdhury et al. [24] estimate the pose and lighting of face images contained in video frames and compare them against synthetic 3D face models exhibiting similar pose and lighting. However, in their approach the 3D face models are registered manually with the face image in the video. Lee et al. [59] propose an appearance manifold based approach where each gallery image is matched against the appearance manifold obtained from the video. The manifolds are obtained from each sequence of pose variations. Zhou et al. [136] proposed to obtain statistical models from video using low level features (e.g., by PCA) contained in sample images. The matching is performed between a single frame and the video or between two video streams using the statistical models. Liu et al. [68] and Aggarwal et al. [10] use HMM and ARMA models, respectively, for direct video level matching.

Table 2.1. A comparison of video based face recognition methods.

  Approach               Method                                          No. of subjects in database   Recognition accuracy
  Chowdhury et al. [24]  Frame level matching with synthesized gallery   32                            90%
                         from 3D model
  Lee et al. [59]        Matching frames with appearance manifolds       20                            92.1%
                         obtained from video
  Zhou et al. [136]      Frame to video and video to video matching      25 (video to video)           88~100%¹
                         using statistical models
  Liu et al. [68]        Video level matching using HMM                  24                            99.8%
  Aggarwal et al. [10]   Video level matching using AutoRegressive and   45                            90%
                         Moving Average (ARMA) model

  ¹Four videos are prepared for each subject as the subject is walking slowly, walking fast, inclining, and carrying an object. Different performances are shown depending on the different selections of probe and gallery video.
Most of these direct video based approaches provide good performance on small databases, but need to be evaluated on large databases. Table 2.1 summarizes some of the major video based recognition methods presented in the literature. We propose an approach to face recognition in video that utilizes 3D modeling technique. The proposed method focuses on the utilization of 3D models than the temporal information embedded in the 2D video; the effectiveness of the proposed approach is evaluated on a large database (>200 subjects). We utilize the modeling techniques in view-based and view synthetic approaches to improve the recognition performance [81] [80]. View-based and view synthesis methods are two well known approaches to over- come the problem of pose and lighting variations in video based face recognition. View-based methods enroll multiple face images under various pose and lighting con- ditions and match the probe image to the gallery image with the most similar pose and lighting conditions [88] [20]. View synthesis methods generate synthetic views from the input probe images with pose and lighting conditions similar to those in the gallery to improve the matching performance. The desired view can be synthe- sized by learning the mapping function between pairs of training images [15] or by using 3D face models [17] [102]. The parameters of the 3D face model in the view synthesis process can also be used for face recognition [17]. These view—based and view—synthetic approaches are also applicable to still images. However, considering the large pose and lighting variations in multiple 2D images taken at different times, it is more suitable to use these techniques for video data. The view synthesis approach has the following two advantages over the view based method: i) it does not require the tedious process of collecting multiple face images under various pose and light- ing conditions, and ii) it can generate frontal facial images under favorable lighting conditions on which state-of—the-art face recognition systems can perform very well. 21 However, the view synthesis approach needs to be applied carefully, so that it does not introduce noise that may further degrade the original face image. 2.1 View-based Recognition If we assume that a video is available for a subject both for gallery construction at enrollment and as probe, we can take advantage of video for improving the face recog- nition performance. In this section we explore (a) the adaptive use of multiple face matchers in order to enhance the performance of face recognition in video, and (b) the possibility of appropriately populating the database (gallery) in order to succinctly capture intra-class variations. To extract the dynamic information in video, the pose in various flames is explicitly estimated using Active Appearance Model (AAM) and a Factorization based 3D face reconstruction technique [116]. We also estimate the mo- ti on blur using the Discrete Cosine Transformation (DCT). Our experimental results on 204 subjects in CMU’S Face-In-Action (FIA) database show that the proposed recognition method provides consistent improvements in the matching performance using three different face matchers (e.g., FaceVACS, PCA, and correlation matcher). 2. 1 - 1 Fusion Scheme Consider a video stream with r flames and assume that the individual flames have been processed in order to extract the faces present in them. Let T1, T2, . . . ,T,» be the feature sets computed flom the faces localized in the r flames. 
Further, let $W_1, W_2, \ldots, W_n$ be the $n$ identities enrolled in the authentication system and $G_1, G_2, \ldots, G_n$, respectively, be the corresponding feature templates associated with these identities. The first goal is to determine the identity of the face present in the $i$th frame as assessed by the $k$th matcher. This can be accomplished by comparing the extracted feature set with all the templates in the database in order to determine the best match and the associated identity. Thus,

$$ID_i = \arg\max_{j=1,2,\ldots,n} S_k(T_i, G_j), \qquad (2.1)$$

where $ID_i$ is the identity of the face in the $i$th frame and $S_k(\cdot,\cdot)$ represents the similarity function employed by the $k$th matcher to compute the match score between the feature sets $T_i$ and $G_j$. If there are $m$ matchers, then a fusion rule may be employed to consolidate the $m$ match scores. While there are several fusion rules, we employ the simple sum rule (with min-max normalization of scores) [50] to consolidate the match scores, i.e.,

$$ID_i = \arg\max_{j=1,2,\ldots,n} \sum_{k=1}^{m} S_k(T_i, G_j). \qquad (2.2)$$

In practice, simple fusion rules work as well as complicated fusion rules such as the likelihood ratio [77].

Now the identity of a subject in the given video stream can be obtained by accumulating the evidence across the $r$ frames. In frame level fusion, we assume each frame is equally reliable, so the score sum is used. In matcher level fusion, the commercial matchers usually outperform the public domain matchers. Therefore, we take the maximum rule, which favorably takes the matching score with the highest confidence (e.g., from the commercial matcher). In the maximum rule, the identity that exhibits the highest match score over the $r$ frames is deemed to be the final identity. Therefore,

$$ID = \arg\max_{j=1,2,\ldots,n} \left( \max_{i=1,2,\ldots,r} \sum_{k=1}^{m} S_k(T_i, G_j) \right). \qquad (2.3)$$

In the above formulation, it must be noted that the feature sets $T_i$ and $G_j$ are impacted by several different factors such as facial pose, ambient lighting, motion blur, etc. If the parameter vector $\theta$ denotes a compilation of these factors, then the feature sets are dependent on this vector, i.e., $T_i \equiv T_i(\theta)$ and $G_j \equiv G_j(\theta)$. In this work, $m$ is 3 since three different face matchers are used, and the vector $\theta$ represents facial pose and motion blur in video. The dynamic nature of the fusion rule is explained in the subsequent sections. Fig. 2.1 shows the overall schematic of the proposed view-based face recognition system using video.

Figure 2.1. Schematic of the proposed face recognition system in video.

2.1.2 Face Matchers and Database

Given the large pose variations in video data, it is expected that using multiple matchers will cover larger pose variations for improved accuracy. Most of the commercial face matchers reject the input image when the facial pose is severely off-frontal (> 40°) and both eyes cannot be detected. We use two public domain matchers that
can generate a matching score even with failed eye detection to compensate for failures of enrollment in the commercial face matcher. We selected the state-of-the-art commercial face matcher FaceVACS from Cognitec [9] and two public domain matchers: PCA [119] and a correlation based matcher [62]. FaceVACS, which performed very well in the Face Recognition Vendor Test (FRVT) 2002 and FRVT 2006 competitions [90] [92], is known to use a variation of the Principal Component Analysis (PCA) technique. However, this matcher has a limited operating range in terms of facial pose. To overcome this limitation and facilitate a continuous decision on the subject's identity across multiple poses, the conventional PCA based matcher [118] [55] and a cross correlation based matcher [62] were also considered. The PCA engine calculates the similarity between probe and gallery images after applying the Karhunen-Loeve transformation to both the probe and gallery images. The cross correlation based matcher calculates the normalized cross correlation between the probe and gallery images to obtain the matching score. Figs. 2.2 and 2.3 show the difference in the success of face recognition at different facial poses for the two face matchers, FaceVACS and PCA; the two matchers succeed at different pose values.

Figure 2.2. Pose variations in probe images and the pose values where matching succeeds at rank-one: red circles represent pose values where FaceVACS succeeds.

Figure 2.3. Pose variations in probe images and the pose values where matching succeeds at rank-one: red circles represent pose values where PCA succeeds.

We use CMU's Face In Action (FIA) database [37], which includes up to 221 subjects with data collected in three indoor sessions and three outdoor sessions. Each subject was recorded by six different cameras simultaneously, at two different distances and three different angles. The number of subjects varies across these sessions. We use the first indoor session in our experiments because it i) has the largest number of subjects (221), ii) contains a significant number of both frontal and non-frontal poses, and iii) has relatively small lighting variations. Each video of a subject consists of 600 frames. We partition the video data into two halves; the first half was used as gallery data and the second half as probe data. Fig. 2.4 shows example images from the FIA database. In the FIA database, the images captured from the six cameras are stored as separate images. While FIA is now available in the public domain, we have not found any other face recognition study using this database.

Figure 2.4. Example images from the Face In Action (FIA) database. Six different cameras record the face images at the same time. Six images at three time instances are shown here. The frontal view at a close distance (fifth image, from top to bottom and left to right) is used in the experiments.
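Before moving on to feature tracking, the sketch below makes the fusion of Section 2.1.1 concrete: raw scores from each matcher are min-max normalized, summed across matchers as in Eq. (2.2), and the maximum over frames is taken before the final argmax over gallery identities as in Eq. (2.3). The tensor layout, names, and toy scores are assumptions for illustration, not the thesis implementation.

```python
import numpy as np

def min_max_normalize(scores):
    """Map one matcher's scores to [0, 1] so that different matchers are comparable."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def identify_from_video(score_tensor):
    """score_tensor: (m_matchers, r_frames, n_gallery) raw match scores S_k(T_i, G_j).

    Applies the sum rule over matchers (Eq. 2.2) and the maximum rule over
    frames (Eq. 2.3); returns the index of the predicted gallery identity."""
    normalized = np.stack([min_max_normalize(s) for s in score_tensor])
    fused_per_frame = normalized.sum(axis=0)        # sum over the m matchers -> (r, n)
    best_over_frames = fused_per_frame.max(axis=0)  # max over the r frames   -> (n,)
    return int(np.argmax(best_over_frames))

# Toy example: 2 matchers, 3 frames, 4 gallery identities (hypothetical scores).
rng = np.random.default_rng(1)
scores = rng.random((2, 3, 4))
print(identify_from_video(scores))
```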
Constructing an AAM requires a set of training data X = {X_1, X_2, ..., X_n} with annotations, where X_i represents a set of points marked on image i. Exact correspondences are required in X across all the n training images. By applying PCA to X, any X_i can be approximated by

X_i = \bar{X} + P_s \, b_{s_i},   (2.4)

where \bar{X} is the mean shape, P_s is a set of the orthogonal modes of variation obtained by applying PCA to X, and b_{s_i} is a set of shape parameters. To build a texture model, each example image is warped so that its control points match the mean shape. Then the face texture g (gray values) is obtained from the region covered by the mean shape.

Figure 2.4. Example images from the Face In Action (FIA) database. Six different cameras record the face images at the same time; six images at three time instances are shown here. The frontal view at a close distance (fifth image from top to bottom, left to right) is used in the experiments.

Figure 2.5. Example of face image cropping based on the feature points. (a) Face images with AAM feature points and (b) corresponding cropped face images.

The texture model is defined similar to the shape model as

g_i = \bar{g} + P_g \, b_{g_i},   (2.5)

where \bar{g} is the mean texture, P_g is a set of orthogonal modes of variation obtained by applying PCA to g, and b_{g_i} is a set of texture parameters. The shape and texture parameters are combined as b = (b_s, b_g) and any new face image is approximated by b. Now the problem becomes finding the best shape and texture parameter vector b_i that achieves the minimum difference between the test image I_i and the image I_m generated by the current model defined by b_i. More details about an efficient way of searching for the best model parameter b_i can be found in [25]. There are enhanced versions of AAM that have real time capability [130] using 2D and 3D information, and that are robust against occlusions [40]. A user-specific AAM has also been studied for more robust feature point detection when the user specific model is available [98].
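As an illustration of the linear shape model in Eq. (2.4), the sketch below (a simplified example, not the AAM implementation used here) learns the mean shape and the principal modes of variation from a set of annotated landmark sets and projects a new shape into the model; the shapes are assumed to be pre-aligned, and all names are illustrative.

    import numpy as np

    def train_shape_model(shapes, num_modes=10):
        # shapes: (n_images, n_points * 2) array of aligned landmark coordinates.
        mean_shape = shapes.mean(axis=0)
        # PCA via SVD of the centered data; rows of vt are the orthogonal modes P_s.
        _, _, vt = np.linalg.svd(shapes - mean_shape, full_matrices=False)
        return mean_shape, vt[:num_modes]        # (mean shape, modes)

    def project(shape, mean_shape, modes):
        # Shape parameters b_s for a new shape, as in Eq. (2.4).
        return modes @ (shape - mean_shape)

    def reconstruct(b_s, mean_shape, modes):
        # Approximate the shape from its parameters: X ~ mean + modes^T b_s.
        return mean_shape + modes.T @ b_s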
2.1.5 AAM Training

Instead of using a single AAM for multiple poses as in Section 2.1.3, we constructed multiple AAMs, each for a different range of pose variations [27], to cover a larger set of variations in facial pose. In this way, each model is expected to find better facial feature points for its designated pose. Moreover, the number of feature points in each AAM can be different according to the pose (e.g., frontal vs. profile). We chose seven different AAMs corresponding to frontal, left half profile, left profile, right half profile, right profile, lower profile, and upper profile poses to cover the observed pose variations appearing in our video data. Assuming facial symmetry, the right half and right profile models are obtained from the left half and left profile models, respectively.

The off-line manual labeling of feature points for each training face image is a time consuming task. Therefore, we used a semi-automatic training process to build the AAMs. The training commenced with about 5% of the training data that had been manually labeled, and the AAM search process was initiated for the unlabeled data. Training faces with robust feature points were included into the AAM after manually adjusting the points, if necessary. The AAM facial feature search process was then initiated again. This process was repeated until all the images in the training set had been labeled with feature points. Our proposed scheme uses a generic AAM where the test subject is not included in the trained AAM. To simulate this scenario, we generated two sets of AAMs and used them in a cross validation manner to ensure the separation between AAM training and testing.

2.1.6 Structure from Motion

Let a set of points P_i = {p_{i1}, p_{i2}, ..., p_{iP}} denote the 2D shape of a 3D object S observed in an image I_i. Given a video with F frames, O = {I_1, I_2, ..., I_F}, containing the 2D projections of the 3D object S, we obtain a sequence of points Π = {P_1, P_2, ..., P_F}. The relationship between S and P_i can be described as

P_i = C \cdot (R_i \cdot S + T_i),   (2.6)

where C, R, and T are the camera projection matrix, rotation matrix, and translation matrix, respectively. The Structure from Motion (SfM) problem can be stated as estimating S from the observed set of points {P_1, P_2, ..., P_F}. The challenge in the SfM problem is to find sets of P_i that correspond in a sequence of video frames. Due to object and camera motion, some parts of the object are occluded, resulting in missing and spurious feature points. A solution to SfM involves using the least squared error method, which tolerates error in feature point detection to a certain degree.

2.1.7 3D Shape Reconstruction

The Factorization method [116] is a well known solution for the Structure from Motion problem. There are different factorization methods to recover the detailed 3D shape depending on the rigidity of the object [129] [19]. We regard the face as a rigid object and treat small changes in facial expression as noise in feature point detection. This helps us recover only the most dominant shape from video data. Under the orthographic projection model, the relationship between 2D feature points and 3D shape is given by

W = M \cdot S,   (2.7)

where

W = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1P} \\ u_{21} & u_{22} & \cdots & u_{2P} \\ \vdots & & & \vdots \\ u_{F1} & u_{F2} & \cdots & u_{FP} \\ v_{11} & v_{12} & \cdots & v_{1P} \\ v_{21} & v_{22} & \cdots & v_{2P} \\ \vdots & & & \vdots \\ v_{F1} & v_{F2} & \cdots & v_{FP} \end{bmatrix}, \quad
M = \begin{bmatrix} i_{1x} & i_{1y} & i_{1z} \\ i_{2x} & i_{2y} & i_{2z} \\ \vdots & \vdots & \vdots \\ i_{Fx} & i_{Fy} & i_{Fz} \\ j_{1x} & j_{1y} & j_{1z} \\ j_{2x} & j_{2y} & j_{2z} \\ \vdots & \vdots & \vdots \\ j_{Fx} & j_{Fy} & j_{Fz} \end{bmatrix}, \quad
S = \begin{bmatrix} s_{x1} & s_{x2} & \cdots & s_{xP} \\ s_{y1} & s_{y2} & \cdots & s_{yP} \\ s_{z1} & s_{z2} & \cdots & s_{zP} \end{bmatrix},   (2.8)

where u_{fp} and v_{fp} in W represent the row and column pixel coordinates of the pth point in the fth frame, each pair of i_f = [i_{fx} i_{fy} i_{fz}] and j_f = [j_{fx} j_{fy} j_{fz}] in M represents the rotation matrix with respect to the fth frame, and S represents the 3D shape. The translation term is omitted in Eq. (2.7) because all 2D coordinates are centered at the origin. The rank of W in Eq. (2.8) is 3 in an ideal noise-free case.

The solution of Eq. (2.7) is obtained by a two-step process: (i) find an initial estimate of M and S by singular value decomposition, and (ii) apply metric constraints on the initial estimates. By a singular value decomposition of W, we obtain

W = U \cdot D \cdot V^T \approx U' \cdot D' \cdot V'^T,   (2.9)

where U and V are unitary matrices of size 2F x 2F and P x P, respectively, and D is a matrix of size 2F x P for F frames and P tracked points. Given U, D, and V, U' represents the first three columns of U, D' is the first three columns and first three rows of D, and V'^T is the first three rows of V^T, to impose the rank 3 constraint on W. Then, M' and S' (the initial estimates of M and S) are obtained as

M' = U' \cdot D'^{1/2}, \qquad S' = D'^{1/2} \cdot V'^T.   (2.10)

To impose the metric constraints on M, a 3x3 correction matrix A is defined such that

([i_f \; j_f]^T A) \cdot (A^T [i_f \; j_f]) = E,   (2.11)

where i_f is the fth i vector in the upper half rows of M, j_f is the fth j vector in the lower half rows of M, and E is a 2x2 identity matrix. The constraints in Eq. (2.11) need to be imposed across all frames. There is one i_f and one j_f vector in each frame, which generates three constraints. Since A \cdot A^T is a 3x3 symmetric matrix, there are 6 unknown variables.
Therefore, at least two frames are required to solve Eq. (2.11). In practice, to obtain a robust solution, we need more than two frames and the solution is obtained by the least squared error method. The 3x3 symmetric matrix L = A \cdot A^T with 6 unknown variables is solved first and then L^{1/2} is calculated to obtain A. The conditions under which factorization fails are: (i) the number of frames F is less than 2, (ii) the singular value decomposition fails, or (iii) L is not positive definite. Usually, conditions (i) and (ii) are not of concern in processing a video with a large number of frames. Most of the failures occur due to condition (iii). Therefore, the failure condition of the factorization process can be determined by observing the positive definiteness of L through eigenvalue decomposition. The final solution is obtained as

M = M' \cdot A, \qquad S = A^{-1} \cdot S',   (2.12)

where M contains the rotation information between each frame and the 3D object and S contains the 3D shape information. We will provide the lower bound on the performance of the Factorization method on synthetic data and real data in Sec. 2.2.2.
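The rank-3 factorization step of Eqs. (2.7)-(2.10) can be sketched as follows (a simplified illustration under an orthographic camera, without the metric-constraint refinement of Eq. (2.11); the function name and interface are assumptions).

    import numpy as np

    def factorize(W):
        # W: (2F, P) measurement matrix of u and v coordinates (Eq. 2.7).
        # Center each row so that the translation term vanishes.
        W = W - W.mean(axis=1, keepdims=True)
        U, d, Vt = np.linalg.svd(W, full_matrices=False)
        # Impose the rank-3 constraint (Eq. 2.9).
        U3, D3, Vt3 = U[:, :3], np.diag(d[:3]), Vt[:3, :]
        # Initial estimates of motion and shape (Eq. 2.10).
        M_init = U3 @ np.sqrt(D3)
        S_init = np.sqrt(D3) @ Vt3
        return M_init, S_init

The metric upgrade then solves for the 3x3 correction matrix A by least squares over the per-frame constraints and checks that L = A A^T is positive definite, as described in the text above.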
2.1.8 3D Facial Pose Estimation

We estimate the facial pose in a video frame to select the best pose to use in recognition. There are many facial pose estimation methods in the 2D and 3D domains [117]. Because the head motion occurs in the 3D domain, 3D information is necessary for accurate pose estimation. We estimate the facial pose in [yaw, pitch, roll] (YPR) values as shown in Fig. 2.6. Even though all the rotational relationships between the 3D shape and the 2D feature points in each frame are already established through the matrix M in the factorization process, M reveals only the first two rows of the rotation matrix for each frame, which generates inaccurate solutions when obtaining YPR values, especially for noisy data. Moreover, the direct solution cannot be obtained in cases where the factorization fails. Therefore, we use the gradient descent method to iteratively fit the reconstructed 3D shape to the 2D facial feature points. The reconstructed 3D shape is first initialized to zero yaw, pitch, and roll, and the iterative gradient descent process is applied to minimize the objective function

E = \| P_f - C \cdot R \cdot S \|,   (2.13)

where P_f is the set of 2D facial feature points in the fth frame, C is an orthogonal camera projection matrix, R is the full 3x3 rotation matrix, and S is the 3D shape. The overall process of pose estimation is depicted in Fig. 2.6. The proposed pose estimation scheme is evaluated on synthetic data consisting of 66 frames obtained from a 3D model with known poses. The pose variations in the synthetic data are in the range [-45°, 45°] in yaw and pitch. The pose estimation error on the synthetic data is less than 6° on average. However, this error increases in real face images because of the noise in the feature point detection process.

Figure 2.6. Pose estimation scheme.

Figure 2.7. Pose distribution in yaw-pitch space in (a) gallery and (b) probe data.

2.1.9 Motion Blur

Unlike still shot images of the face, motion blur is often present in segmented face images in video. The blurred face images can confound the recognition engine, resulting in matching errors. Therefore, frames with motion blur need to be identified, and they either need to be enhanced or rejected in the face recognition process. The degree of motion blur in a given image can be evaluated based on a frequency domain analysis: motion blur decreases the fraction of sharp edges, which are high frequency components. Any spatial to frequency domain transformation method can be used to detect the degree of high frequency components (e.g., the Fourier transformation (FT) [39] or the Discrete Cosine Transformation (DCT) [11]). We used DCT to evaluate the degree of high frequency components for its simplicity compared to FT. DCT is a similar operation to FT, but it uses only real numbers. The N_1 x N_2 real numbers x_{0,0}, ..., x_{N_1-1,N_2-1} are transformed into the N_1 x N_2 real numbers X_{0,0}, ..., X_{N_1-1,N_2-1} by the DCT defined as

X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} \cos\left[\frac{\pi (n_1 + 1/2) k_1}{N_1}\right] \cos\left[\frac{\pi (n_2 + 1/2) k_2}{N_2}\right],   (2.14)

where k_1 = 0, ..., N_1 - 1 and k_2 = 0, ..., N_2 - 1. We determined the presence of motion blur by observing the DCT coefficients of the top 10% of high frequency components; frames with motion blur were not considered in the adaptive fusion scheme.
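A minimal sketch of this blur measure is given below (an illustration using SciPy's DCT rather than the implementation in this work; the definition of the top 10% band and the decision threshold are placeholder assumptions).

    import numpy as np
    from scipy.fft import dctn

    def high_frequency_energy(face, band=0.10):
        # 2D DCT-II of the (grayscale) face image, as in Eq. (2.14).
        coeffs = np.abs(dctn(face.astype(float), norm='ortho'))
        n1, n2 = coeffs.shape
        # Treat coefficients whose index sum falls in the top `band` fraction as high frequency.
        k1, k2 = np.meshgrid(np.arange(n1), np.arange(n2), indexing='ij')
        mask = (k1 + k2) >= (1.0 - band) * (n1 + n2 - 2)
        return coeffs[mask].sum() / (coeffs.sum() + 1e-12)

    def is_blurred(face, threshold=0.01):
        # Low relative high-frequency energy suggests motion blur (threshold is illustrative).
        return high_frequency_energy(face) < threshold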
2.1.10 Experimental Results

We performed three different experiments to analyze the effect of i) gallery data, ii) probe data, and iii) adaptive fusion of multiple matchers on the face recognition performance in video. We first report the experimental results as CMC curves at the frame level. The subject level matching performance is also provided along with the overall system performance.

To study the effect of gallery composition, we constructed two different gallery data sets. The first gallery set, A, was constructed by selecting 7 frames per subject with pitch and yaw values as (-40°, 0°), (-20°, 0°), (0°, 0°), (0°, 20°), (0°, 40°), (0°, -20°), and (0°, -40°). These frames were also selected not to have any motion blur. The second gallery set, B, also has the same number of frames per subject, but it is constructed by considering a random selection of yaw and pitch values, and these may contain motion blur. The effect of the gallery data set on the matching performance is shown in Fig. 2.8.

Figure 2.8. Face recognition performance on two different gallery data sets: (i) random gallery: random selection of pose and motion blur, (ii) composed gallery: frames selected based on specific pose and with no motion blur.

The gallery database composed by using pose and motion blur information (set A) shows significantly better performance for all three matchers. This is because the composed gallery covers the larger pose variations appearing in the probe data. Removing images with motion blur also positively affects the performance.

Figure 2.9. Cumulative matching scores using dynamic information (pose and motion blur) for the Correlation matcher.

Next, we separate the probe data according to the facial pose in three different ranges: i) between -20° and 20°, ii) between -40° and -20° or 20° and 40°, and iii) between -60° and -40° or 40° and 60° for yaw and pitch values. We computed the CMC curves for these three different probe sets as shown in Figs. 2.9~2.11. Figs. 2.9~2.11 indicate that, for all three matchers, the face recognition performance is best in the near frontal view and decreases as the pose deviates from the frontal view. Figs. 2.12~2.17 show the same results as Figs. 2.9~2.11, but using separate pitch and yaw information. The overall performance is observed to be slightly lower with pitch variation compared to yaw. In Figs. 2.12~2.14, the performance does not strictly become lower as the pose variation increases. We believe this is due to the noise in the pose estimation process.

Figure 2.10. Cumulative matching scores using dynamic information (pose and motion blur) for the PCA matcher.

Figure 2.11. Cumulative matching scores using dynamic information (pose and motion blur) for the FaceVACS matcher.
Finally, Fig. 2.18 shows the effects of the fusion of multiple matchers and multiple frames using dynamic information of facial pose and motion blur. We used the score-sum with min-max score normalization and the max-sum rule as described in Sec. 2.1.1. The best rank-1 accuracy obtained by combining all three matchers was 96%. The frame level fusion result (subject level matching accuracy) exhibited over 99% accuracy. Tables 2.2 and 2.3 show examples of matching results for two of the subjects in the database according to the choice of gallery, probe, and matcher, where the final fusion with the composed gallery shows the best results.

Figure 2.18. Cumulative matching scores by fusing multiple face matchers and multiple frames in the near-frontal pose range (-20° <= (yaw & pitch) < 20°).

Table 2.2. Face recognition performance according to gallery, probe, and matcher for video example 1.

Table 2.3. Face recognition performance according to gallery, probe, and matcher for video example 2.

Experiments on view-based face recognition were performed assuming that a large number of images are available both in the probe and gallery data and that a subset of the probe data contains poses that are similar to those in the gallery data. In this case, it is more important to select gallery and probe images that are close to each other. However, when none of the probe and gallery data is similar in pose, we need to synthetically generate probe face images that are close to the gallery images. For this purpose, we introduce a view-synthetic approach in the following section.

2.2 View-synthetic Face Recognition in Video

The face images of subjects enrolled in a face recognition system are typically in frontal pose, while the face images observed at recognition time are often non-frontal. We propose a view-synthetic method to generate face images that are similar to the enrolled face images in terms of facial pose. We propose to automatically (i) reconstruct a 3D face model from multiple non-frontal frames in a video, (ii) generate a frontal view from the derived 3D model, and (iii) use a commercial 2D face recognition engine to recognize the synthesized frontal view. A factorization-based structure from motion algorithm [116] is used for 3D face reconstruction.
Obtaining a 3D face model from a sequence of 2D images is an active research problem. Morphable models (MM) [17], stereography [74], and Structure from Motion (SfM) [120] [116] are well known methods for 3D face model construction from 2D images or video. While morphable models have been shown to provide accurate reconstruction performance, the processing time is overwhelming (4.5 minutes [17]), which precludes their use in real-time systems. Stereography also provides good performance and has been used in commercial applications [17], but it requires a pair of calibrated cameras, which limits its use in many surveillance applications. Structure from motion gives reasonable performance, has the ability to process in real-time, and does not require a calibration process, making it suitable for surveillance applications. Since we are focusing on face recognition in surveillance video, we propose to use the SfM technique to reconstruct the 3D face models as described in Sec. 2.1.6. The overall schematic of the system is depicted in Fig. 2.19.

Figure 2.19. Proposed face recognition system with 3D model reconstruction and frontal view synthesis.

2.2.1 Texture Mapping

We define the 3D face model as a set of triangles and generate a Virtual Reality Modeling Language (VRML) object. Given the 72 feature points obtained from the reconstruction process, 124 triangles are generated. While the triangles can be obtained automatically by the Delaunay triangulation process [120], we use a predefined set of triangles for the sake of efficiency because the number and configuration of the feature points are fixed. The corresponding set of triangles can be obtained from the video frames with a similar process. Then, the VRML object is generated by mapping the triangulated texture to the 3D shape. The best frame to be used in texture mapping is selected based on the pose estimation scheme described in Sec. 2.1.8. When all the available frames deviate significantly from the frontal pose, two frames are used in the texture mapping as described in Fig. 2.20. Even though both the synthetic frontal views in Figs. 2.20 (d) and (e) are correctly recognized, the view in (e) looks more realistic. When more than one texture is used for texture mapping, a sharp boundary is often observed across the line where two different textures are combined because of the differences in illumination. However, the synthetic frontal views are correctly recognized in most cases regardless of this artifact.

Figure 2.20. Texture mapping. (a) typical video sequence used for the 3D reconstruction; (b) single frame with triangular meshes; (c) two frames with triangular meshes; (d) reconstructed 3D face model with one texture mapping from (b); (e) reconstructed 3D face model with two texture mappings from (c). The two frontal poses in (d) and (e) are correctly identified in the matching experiment.

2.2.2 Experimental Results

We performed the following set of three experiments: i) evaluation of the minimum requirement on rotation angle and number of frames for the factorization algorithm on both synthetic and real data, ii) 3D face modeling on a public domain video database, and iii) face recognition using the reconstructed 3D face models.

3D FACE RECONSTRUCTION WITH SYNTHETIC DATA

A set of 72 facial feature points were obtained from the true 3D face model, which was constructed from the 3D range sensor data.
A sequence of 2D coordinates of the facial feature points was directly obtained from this true model. We took the angular values for the rotation in steps of 0.1° in the range (0.1°, 1°) and in steps of 1.0° in the range (1°, 10°). The number of frames used was 2, 3, 4, and 5. The Root Mean Squared (RMS) error between the ground truth and the reconstructed shape is shown in Fig. 2.21. While the number of frames required for the reconstruction in the noiseless case is two (see Sec. 2.1.7), in practice more frames are needed to keep the error small. As long as the number of frames was more than two, the reconstruction errors were observed to be negligible (≈ 0).

Figure 2.21. RMS error between the reconstructed shape and the true model.

3D FACE RECONSTRUCTION WITH REAL DATA

For real data, noise is present in both the facial feature point detection and the correspondences between detected points across frames. This noise is not random and its effect is more pronounced at points of self-occlusion and on the facial boundary, as observed in Fig. 2.5. Since the AAM does use feature points on the facial boundary, the point correspondences are not very accurate in the presence of self-occlusion. Reconstruction experiments were performed on real data with face rotation from -45° to +45° across 61 frames. Example frames from a real video sequence are shown in Fig. 2.5. We estimated the rotation between successive frames as 1.5° (61 frames varying from -45° to +45°) and obtained the reconstruction error with rotation in steps of 1.5° in the range (1.5°, 15°). The number of frames used varied from 2 to 61. A direct comparison between the true model and the reconstructed shape is not possible for real data because the ground truth is not known. The original database was collected only as 2D video and 3D models of the corresponding subjects were not available. Therefore, we measured the orthogonality of M to estimate the reconstruction accuracy. Let M be a 2F x 3 matrix as shown in Eq. (2.7) and let M(a:b, c:d) represent the sub-matrix of M from rows a to b and columns c to d. Then, M_s = M \cdot M^T is a 2F x 2F matrix where all elements in M_s(1:F, 1:F) and M_s(F+1:2F, F+1:2F) are equal to 1 and all elements in M_s(1:F, F+1:2F) and M_s(F+1:2F, 1:F) are equal to 0 if M is truly an orthogonal matrix. We measured the RMS difference between the ideal M_s and the calculated M_s as the reconstruction error. The reconstruction error for real data is shown in Fig. 2.22. Our experiments show that the number of frames needed for reconstruction from real data is larger than for the synthetic data, but the error decreases quickly as the number of frames increases. The increase in error with larger pose difference is due to errors in point correspondences from self-occlusion.

Figure 2.22. RMS error between the reconstructed and ideal rotation matrix, M_s.
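The orthogonality-based error described above can be computed as in the sketch below (an illustrative reading of the measure, with the block layout assumed as stated; the names are placeholders).

    import numpy as np

    def orthogonality_error(M):
        # M: (2F, 3) motion matrix from the factorization (rows i_1..i_F, then j_1..j_F).
        F = M.shape[0] // 2
        Ms = M @ M.T                              # 2F x 2F
        ideal = np.zeros((2 * F, 2 * F))
        ideal[:F, :F] = 1.0                       # upper-left block of ones
        ideal[F:, F:] = 1.0                       # lower-right block of ones
        # RMS difference between the ideal and computed M_s.
        return np.sqrt(np.mean((Ms - ideal) ** 2))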
FACE RECOGNITION WITH POSE CORRECTION

We used CMU's Face In Action (FIA) video database [37] (see Sec. 2.1.2) for our matching experiments. We used selected frames of the FIA database to simulate the video observed in a surveillance scenario. To demonstrate the advantage of using reconstructed 3D face models for recognition, we were primarily interested in video sequences that contained mostly non-frontal views for each subject. Since the reconstruction with SfM performs better when there are large motion differences between frames, both left and right non-frontal views were collected for each subject, if available, resulting, on average, in about 10 frames per subject (a total of 221 subjects). When there was sufficient motion difference between frames and the feature point detector performed well, it was possible to obtain the 3D face model from only 3 different frames, which is consistent with the results shown in Fig. 2.21. The number of frames required for the reconstruction can be determined based on the orthogonality of M. Typical frames from video sequences are shown in Fig. 2.20 (a). We successfully reconstructed 3D face models for 207 subjects out of the 221 subjects. The reconstruction process failed for 14 subjects either due to poor facial feature point detection in the AAM process or due to a deficiency of motion cue, which caused a degenerate equation in the factorization algorithm. Example images where AAM or SfM failed are shown in Fig. 2.23.

Figure 2.23. Examples where 3D face reconstruction failed. (a), (b), (c), and (d) show the failure of feature point detection using AAM; (e), (f), and (g) show failures due to deficiency of motion cue. The resulting reconstruction of the 3D face model is shown in (h).

The reconstructed 3D face models were corrected in their pose to make all yaw, pitch, and roll values equal to zero. The frontal face image can then be obtained by projecting the 3D model onto the 2D plane. Once the frontal view was synthesized, the FaceVACS face recognition engine from Cognitec [9] was used to generate the matching score. The face recognition results for frontal face video, non-frontal face video, and non-frontal face video with 3D face modeling are shown in Fig. 2.24. These results are based on the 207 subjects for which the 3D face reconstruction was successful. The CMC curves show that the FaceVACS engine does extremely well for frontal pose frames but its performance drops drastically for non-frontal pose frames. By using the proposed 3D face modeling, the rank-1 performance in the non-frontal scenario is improved by 40%.

Figure 2.24. Face recognition performance with 3D face modeling.

Example 3D face models and the synthesized frontal views from six different subjects (subject IDs: 47, 56, 85, 133, 198, and 208 in the FIA database) are shown in Fig. 2.25. All these input video frames were incorrectly recognized by the FaceVACS engine. However, after 3D model reconstruction, the synthetic frontal views were correctly recognized except for the last subject. The synthetic frontal view of this subject appears sufficiently different from the gallery image, resulting in a false match.

Figure 2.25. 3D model-based face recognition results on six subjects (subject IDs in the FIA database are 47, 56, 85, 133, 198, and 208). (a) Input video frames; (b), (c), and (d) reconstructed 3D face models at right view, left view, and frontal view, respectively; (e) frontal images enrolled in the gallery database. None of the frames in (a) is correctly identified, while the synthetic frontal views in (d) obtained from the reconstructed 3D models are correctly identified for the first five subjects, but not for the last subject (#208). The reconstructed 3D model of the last subject appears very different from the gallery image, resulting in the recognition failure.
2.3 Video Surveillance

With increasing security concerns, surveillance cameras have become almost ubiquitous in public and private places. However, most of these surveillance cameras were initially installed with the limited functionality of providing video streams to human operators for review after a security breach. In order to assess security threats in real time and identify subjects in video, the development of automated video surveillance systems is needed. Many studies on automated surveillance systems that utilize computer vision and image processing techniques have been reported [30] [128] [51]. The automation of surveillance systems will not only increase the number of manageable cameras per operator, but it will also remove the necessity of video recording by identifying critical events in real-time. Given the difficulties encountered in fully automating surveillance systems, semi-automatic surveillance systems that can effectively utilize human intelligence through interaction with the surveillance system are becoming mainstream.

Another trend in developing surveillance camera systems is the use of networked cameras. Networked cameras are installed with built-in ethernet cards and send the captured video to Digital Video Recording (DVR) systems through the ethernet cable. Networked cameras simplify the installation and maintenance processes and, in turn, enable monitoring large areas, especially when a wireless ethernet is used. Networked cameras use compressed images due to the limited bandwidth available in many applications. The noise introduced by networked cameras due to image compression will be addressed in our image processing algorithms.

We have developed a Visual Search Engine (ViSE) as a semi-automatic component in a surveillance system using networked cameras.

Figure 2.26. Proposed surveillance system. The ViSE is a bridge between the human operator and a surveillance camera system.

The ViSE aims to assist the monitoring of huge amounts of captured video streams; these operations find and track people in the video based on their primitive features with the interaction of a human operator. In contrast to the conventional viewpoint of partitioning surveillance systems into human intervention and camera systems, we decompose the surveillance system into three different parts: (i) human intervention, (ii) Visual Search Engine (ViSE), and (iii) conventional camera system. A human operator translates high-level queries such as "Is this a suspicious person?" or "Where is person X?" into low-level queries with primitive (image) features that ViSE can easily understand. The translation of the query is performed based on the knowledge of the operator.
For example, the best visual features of a missing child can be provided by his parents. Examples of low-level queries with primitive features are "Show all subjects wearing a blue shirt," or "Show all subjects that passed location Y." For the purpose of finding a person, ViSE narrows down the candidates and the human operator then chooses the final target. To be able to interact with human operators, ViSE processes input video streams and stores primitive features for all the objects of interest in the video. The block diagram of the proposed system is shown in Fig. 2.26. We address the issues of object detection and tracking, shadow suppression, and color-based recognition for the proposed system. The experimental results on a set of video data with ten subjects showed that ViSE retrieves correct candidates with 83% recall at 83% precision.

2.3.1 Moving Object Detection

The first step in processing video input is detecting objects of interest. A well-known method of object detection is based on the inter-frame subtraction of the current frame against a reference frame [29] [86] or a few adjacent frames [13]. Moving object detection, or motion segmentation, is one of the most important tasks in video surveillance, object tracking, and video compression [76]. The simplest method of motion extraction is background subtraction [106] [109]. Each image is compared with the reference background image and the difference between the two images is extracted. This method is used when the background is static over a relatively long period of time. In video surveillance, the reference background is periodically updated. Background subtraction is very simple to implement with low computational cost, which makes it ideal for real-time processing. However, keeping the reference image static is not trivial in many situations. The reference is easily corrupted by a small camera oscillation. In addition, even when the background is stationary with respect to noisy motion, it is usually not static with respect to illumination. The illumination change is detected in the background subtraction and estimated as motion.

Frame subtraction against past images is used for motion detection when the reference image is not obtainable. Most frame subtraction techniques use two or three consecutive frames [13] [48] [105]. The background image is assumed not to change much across the frames. The detected change in two or three frames is used to retrieve the outline of the moving object. The moving object is segmented from the outline, or further processing is performed to refine the segmentation. The frame subtraction technique can be used in more general cases than background subtraction because it does not need a reference image. However, the frame subtraction method depends on the velocity of the moving object in the image. If the velocity is low, the object cannot be detected, while the segmentation is overestimated if the velocity is high.

Horn and Schunck [45] used a motion constraint equation to analyze motion in a sequence of images: each pixel in the image is evaluated for its magnitude and direction of motion. A set of pixels that are correlated with the optical flow is segmented as one region. The set of pixels with a large magnitude of motion corresponds to the moving object. Optical flow provides more information about the motion in an image by estimating the direction of motion, which allows for more detailed analysis than background subtraction or frame subtraction.
However, the motion between consecutive frames is assumed to be small in optical flow. The main disadvantage of optical flow is its high computational cost.

We used the background subtraction method for moving object detection because of its capability of real-time processing and detection of slowly moving objects.

BACKGROUND SUBTRACTION

Conventional background subtraction methods are very sensitive to noise, so they are not useful in our system because of the additional noise in networked cameras. We propose a slight variation of the Gaussian background modeling method. Our approach estimates the background model both at the pixel level and at the image level. Let I_t(x,y) denote the image captured at time t, and B_N = {I_t(x,y) | t = 1, 2, ..., N} denote the set of N images used in the background modeling. In a recursive fashion, the mean μ(x,y) and standard deviation σ(x,y) for each pixel (x,y) can be calculated as

\mu_t = \frac{t-1}{t}\,\mu_{t-1} + \frac{I_t(x,y)}{t},   (2.15)

\sigma_t^2 = \frac{t-1}{t}\,\sigma_{t-1}^2 + \frac{(\mu_t - I_t(x,y))^2}{t-1},   (2.16)

where t = 1, 2, ..., N. Once μ and σ are estimated, a pixel is declared to be in the background if

|I_t(x,y) - \mu(x,y)| < k \cdot \sigma(x,y),   (2.17)

and foreground otherwise. Setting the value of k to 3 implies that we expect the model to include 99.73% of the background pixels if the distribution of pixel values is Gaussian. However, due to the violation of the Gaussian assumption in practice, additional noise due to image compression, and the limited data available for building the background model, the rule in Eq. (2.17) results in many false classifications of background pixels as foreground. These false classifications can be suppressed by introducing an additional threshold in Eq. (2.17), which can be obtained from the secondary mean and standard deviation computed as

\mu' = \frac{1}{n_x n_y} \sum_{x,y} \sigma(x,y),   (2.18)

\sigma' = \sqrt{\frac{1}{n_x n_y} \sum_{x,y} (\sigma(x,y) - \mu')^2},   (2.19)

where n_x and n_y denote the number of rows and columns, respectively. The parameters μ(x,y) and σ(x,y) and μ' and σ' account for the background model at the pixel level and the image level, respectively. The modified criterion to decide that a pixel belongs to the background is

|I_t(x,y) - \mu(x,y)| < k \cdot \sigma(x,y) + (\mu' + k \cdot \sigma').   (2.20)

As the camera captures a new frame, the new image I_new(x,y) replaces the oldest image I_old(x,y) in B_N and the background model is updated by recursively updating μ(x,y) and σ(x,y) as

\mu_{new} = \mu_{old} + \frac{new - old}{N},   (2.21)

\sigma_{new}^2 = \sigma_{old}^2 + \frac{(new - \mu_n)^2 - (old - \mu_o)^2}{N} + \frac{(\mu_n - \mu_o)\{(new - \mu_n) + (old - \mu_o)\}}{N},   (2.22)

where the subscript new (n) denotes the newly added pixel values and the subscript old (o) denotes the pixel values to be removed. The parameters μ' and σ' are also updated from the new I(x,y). By keeping only N images in the buffer and updating the background model, no false detections due to background changes persist over more than N frames, an improvement over other background modeling methods. However, since part of an object that is static for N frames can be misclassified as background, user intervention is also allowed to initialize and update the background model.
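The two-level background test of Eqs. (2.17)-(2.20) can be sketched as follows (a simplified, vectorized illustration in which the recursive updates are replaced by a batch computation over the buffer; the variable names and the k = 3 default are assumptions).

    import numpy as np

    def fit_background_model(frames):
        # frames: (N, H, W) grayscale buffer B_N used to model the background.
        mu = frames.mean(axis=0)                  # per-pixel mean, cf. Eq. (2.15)
        sigma = frames.std(axis=0)                # per-pixel std, cf. Eq. (2.16)
        mu_img = sigma.mean()                     # image-level mean of sigma, Eq. (2.18)
        sigma_img = sigma.std()                   # image-level std of sigma, Eq. (2.19)
        return mu, sigma, mu_img, sigma_img

    def foreground_mask(frame, mu, sigma, mu_img, sigma_img, k=3.0):
        # A pixel is background if it satisfies the relaxed test of Eq. (2.20).
        background = np.abs(frame - mu) < k * sigma + (mu_img + k * sigma_img)
        return ~background                        # True where the pixel is foreground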
SUPPRESSION OF SHADOWS

We first perform the proposed background subtraction in the RGB space and then remove shadows via a second pass of background subtraction in the Hue-Saturation-Intensity (HSI) space. The RGB space is better at detecting salient objects, but suffers from many false positives due to shadows. Therefore, object segmentation using the combination of the RGB and HSI spaces is expected to be more robust in terms of both salient region detection and shadow suppression. To save computation, the first pass subtraction is performed in the I space and the second pass subtraction is performed in the H and S spaces. The second pass subtraction is also performed only on the foreground pixels whose I value is decreased from the background. The difference between background subtraction in the I and HS spaces is shown in Fig. 2.27, where the I space subtraction shows a clear background but includes the shadow. The HS space subtraction, on the other hand, shows the advantage of shadow suppression.

Figure 2.27. Background subtraction: (a) V space, (b) HS space, (c) V-HS space.

2.3.2 Object Tracking

HOMOGRAPHY-BASED LOCATION ESTIMATION

A homography is a mapping function between two different 2D projection images of a 3D scene [42]. It is well known that the homography between two images requires four corresponding points. Using this, we transform the 2D motion segmented image into the 2D representation of the floor (foot) print (i.e., the surveillance area) to obtain the location of the subjects from the top view. Two base homographic transformations H_0 and H_h are calculated from four observed points in the image at two different heights, 0 meters and h meters, from the ground. Then, the homographic transformation H_y at a height of y meters, 0 <= y <= h, is calculated based on the following interpolation:

H_y = \frac{h-y}{h} H_0 + \frac{y}{h} H_h.   (2.23)

The location of a person can be estimated by integrating the multiple transformed planes and detecting the peak value of the integration as

location = \arg\max_{x,z} \left( \int_0^h H_y * S \, dy \right),   (2.24)

where S is a cylindrical object for the convolution operation. This location estimation method can suppress segmentation errors, such as cracks and holes, that are caused by the additional sources of noise in networked cameras.

KALMAN FILTER

In accumulating the moving path of a subject, a conventional linear Kalman filter [125] is used for prediction and smoothing. The Kalman filter can be formulated in the prediction stage as

\hat{x}_k^- = A \hat{x}_{k-1} + B u_{k-1},   (2.25)
P_k^- = A P_{k-1} A^T + Q,   (2.26)

and in the correction stage as

K_k = P_k^- H^T (H P_k^- H^T + R)^{-1},   (2.27)
\hat{x}_k = \hat{x}_k^- + K_k (z_k - H \hat{x}_k^-),   (2.28)
P_k = (I - K_k H) P_k^-,   (2.29)

where \hat{x}_k^- is the predicted state, \hat{x}_k is the corrected state given the measurement, z_k is the measurement, Q is the process noise covariance matrix, R is the measurement noise covariance matrix, A denotes the parameters that relate the state at step k-1 to the state at step k, B relates the control input u to the state x, H relates x to z, P is the estimation error covariance matrix, and K is the Kalman gain.
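A minimal linear Kalman filter implementing Eqs. (2.25)-(2.29) is sketched below for a constant-velocity track of the floor-plane location (the state layout and noise settings are illustrative assumptions, not the parameters used in this work).

    import numpy as np

    class KalmanTracker:
        # State x = [px, pz, vx, vz]; measurement z = [px, pz] on the floor plane.
        def __init__(self, dt=0.1, q=1e-2, r=1e-1):
            self.A = np.eye(4); self.A[0, 2] = self.A[1, 3] = dt   # constant-velocity model
            self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
            self.Q = q * np.eye(4)        # process noise covariance
            self.R = r * np.eye(2)        # measurement noise covariance
            self.x = np.zeros(4)
            self.P = np.eye(4)

        def predict(self):
            # Eqs. (2.25)-(2.26); no control input (B u = 0).
            self.x = self.A @ self.x
            self.P = self.A @ self.P @ self.A.T + self.Q
            return self.x

        def correct(self, z):
            # Eqs. (2.27)-(2.29).
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (z - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P
            return self.x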
PRIMITIVE FEATURE EXTRACTION

To enable communication between the human operator and the surveillance system, it is critical to select descriptive features that can be understood and processed at both ends. We chose clothing color, height, and build of the subjects as three features that can be easily computed at a distance using networked cameras. In addition, these three features are easy for human operators to recall when describing a subject because they are commonly used in the real world. Below we describe how we compute each of these three features.

CLOTHING COLOR

The detected blob corresponding to a person in the video was divided into three parts from top to bottom (at 1/5th and 3/5th of the person's height). A combination of the color values of the middle and bottom parts was considered to describe the color feature. One problem in color matching is that the observed color values in the RGB space from different cameras vary as much as those observed in different instances from the same camera, as shown in Fig. 2.28.

Figure 2.28. Intra- and inter-camera variations of observed color values. (a) original color values, (b) observed color values from camera 1, (c) camera 2, and (d) camera 3 at three different time instances.

By removing the lightness component (I component) in the HSI color space, the color variation can be greatly reduced. Saturation also causes variations in color values from pure to dark, but within the same color label. We propose a color-matching scheme that uses hue as the main component with the assistance of saturation and intensity. The color is decided mainly according to the hue, and the possibility of the color being white, black, or gray is decided by the S and V components. A histogram with ten bins (red, brown, yellow, green, blue, violet, pink, white, black, and gray) is constructed from every pixel in the segmented object. The decision threshold for each color is made from the boundary values in standard color charts. The final color is decided as the bin with the largest count.

HEIGHT

The height of the person is estimated as the y-value at the location of a subject in Eq. (2.24).
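A sketch of the ten-bin color assignment is given below (the hue boundaries and the S/V thresholds are illustrative placeholders, not the chart-derived values used in this work).

    import colorsys
    import numpy as np

    # Illustrative hue boundaries (degrees) for the chromatic bins.
    HUE_BINS = [('red', 0, 20), ('brown', 20, 45), ('yellow', 45, 70), ('green', 70, 170),
                ('blue', 170, 260), ('violet', 260, 300), ('pink', 300, 345), ('red', 345, 361)]

    def pixel_color(r, g, b, s_gray=0.2, v_black=0.2, v_white=0.85):
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if v < v_black:
            return 'black'
        if s < s_gray:
            return 'white' if v > v_white else 'gray'
        hue = h * 360.0
        for name, lo, hi in HUE_BINS:
            if lo <= hue < hi:
                return name
        return 'red'

    def dominant_color(pixels):
        # pixels: iterable of (r, g, b) values from the segmented blob.
        names, counts = np.unique([pixel_color(*p) for p in pixels], return_counts=True)
        return names[np.argmax(counts)]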
2.3.3 Experimental Results

We collected three instances of video recordings of five different subjects and one instance of video recording of a different set of five subjects using three networked cameras installed in a hallway. Durations of the video clips ranged between 25 and 30 seconds with ten frames per second. Since one instance of video recording generates three video clips from the three different cameras, the total number of video clips was sixty.

We evaluated the performance of ViSE in terms of the accuracy of feature extraction and the precision and recall for subject retrieval. Given the set of pre-recorded video data with ten different subjects, ViSE showed 93% overall accuracy in color feature extraction and about 2 cm average deviation in height measurement. At 79% precision, the recall for subject retrieval was 85% using only the color features. Using both color and height, the recall decreased to 83% with an increased precision of 83%. Some example search results using ViSE are shown in Fig. 2.29. It can be seen that ViSE is able to retrieve correct candidates and, in turn, significantly reduce the operator's burden. The resolution of the face images in the video data is very low (~10 pixels between the eyes), which made it infeasible to perform face recognition tasks. However, the extracted soft biometric features can help in improving person identification accuracy as shown in [49]. To address the low resolution problem in video surveillance, we propose a face recognition method at a distance in the next section.

Figure 2.29. Schematic retrieval result using ViSE.

2.4 Face Recognition in Video at a Distance

In typical surveillance application scenarios, the distance between the subject and the camera is large (> 10 m) and the resolution of the face image is relatively poor (no. of pixels between the eyes < 10), resulting in low recognition performance of face recognition systems. We propose to use a pair of static and PTZ cameras to obtain higher resolution face images (no. of pixels between the eyes > 100) at a distance of 10 or more meters. There have been a few studies on face recognition at a distance using a camera system consisting of static and PTZ cameras [111] [67]. However, most of these studies are limited in the sense that only tracking is enabled and face recognition performance is evaluated on a small number of subjects (35).

2.4.1 Image Acquisition System

To obtain high resolution face images at a distance (>10 m), we used a pair of static and PTZ cameras. The static camera detects the human subject and estimates the head position using the coordinates in the global view. The coordinate of the head location is passed to the PTZ camera, which zooms into the face area to capture the high resolution (no. of pixels between the eyes > 100) face image. The schematic of the proposed system is shown in Fig. 2.30.

Figure 2.30. Schematic of face image capture system at a distance.

2.4.2 Calibration of Static and PTZ Cameras

The static camera and the PTZ camera need to be calibrated into a common coordinate system to communicate with each other. The calibration is performed between the pixel coordinates of the static camera and the pan and tilt values of the PTZ camera. Let (r_1, c_1), (r_2, c_2), ..., (r_n, c_n) be the sampled pixel coordinates in an image captured by the static camera and (p_1, t_1), (p_2, t_2), ..., (p_n, t_n) be the corresponding pan and tilt values of the PTZ camera. The relationship between the pixel coordinates and the pan and tilt values can be obtained by the following linear model:

P = \alpha_0 + \alpha_1 r + \alpha_2 c,   (2.30)
T = \beta_0 + \beta_1 r + \beta_2 c.   (2.31)

Fig. 2.31 shows the schematic of the relationship between the static and PTZ cameras in terms of a static view. The relationship in Eqs. (2.30) and (2.31) is affected by the distance between the camera and the subject. However, when the distance between the camera and the subject is sufficiently long (d2 << d1, d4 << d1), the error in the estimated pan and tilt values is negligible as shown in Fig. 2.32.

Figure 2.31. Schematic of camera calibration.

Figure 2.32. Calibration between static and PTZ cameras.
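Fitting the linear model of Eqs. (2.30) and (2.31) is a small least-squares problem, sketched below (an illustration only; the sample arrays and function names are placeholders).

    import numpy as np

    def fit_pan_tilt_model(pixels, pan_tilt):
        # pixels: (n, 2) array of (row, col) coordinates in the static camera view.
        # pan_tilt: (n, 2) array of corresponding (pan, tilt) values of the PTZ camera.
        design = np.column_stack([np.ones(len(pixels)), pixels])   # columns [1, r, c]
        coeffs, *_ = np.linalg.lstsq(design, pan_tilt, rcond=None)
        return coeffs       # column 0: (a0, a1, a2) for pan; column 1: (b0, b1, b2) for tilt

    def predict_pan_tilt(coeffs, row, col):
        return np.array([1.0, row, col]) @ coeffs                  # (pan, tilt) for a head location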
2.4.3 Face Video Database with PTZ Camera

We collected video data with both static and close-up views from 12 different subjects. We also captured the 3D face models using a 3D range sensor for the same subjects. Example images of a subject in static and close-up views are shown in Fig. 2.30.

2.4.4 Motion Blur in the PTZ Camera

We estimated the motion blur as explained in Sec. 2.1.9 and removed the frames with large blur from the face recognition process to reduce erroneous matching results. A frame with motion blur is shown in Fig. 2.33.

Figure 2.33. Example of motion blur. Example close-up images: (a) without motion blur and (b) with motion blur.

2.4.5 Parallel vs. Perspective Projection

Another important aspect of face recognition at a distance is the projection model used. The difference between perspective and parallel projection is shown in Fig. 2.34. The difference in face recognition performance due to the different projections is shown in Fig. 2.35. If the image is captured at a distance, the 2D projection image of a 3D model with parallel projection shows a higher matching score than that of the 2D face image captured at a close distance.

Figure 2.34. Parallel vs. perspective projection. (a) face image captured at a distance of ~10 m, (b) parallel projection of the 3D model, (c) face image captured at a distance of ~1 m, and (d) perspective projection of the 3D model.

2.4.6 Experimental Results

We used both real images and 2D projection (synthetic) images from a 3D model to construct the gallery data. We compared the face recognition performance between (i) the static and close-up views, (ii) real and synthetic galleries, (iii) two different matchers, and (iv) different numbers of frames. The experimental results are shown in Figs. 2.36 and 2.37. These figures demonstrate that (i) a close-up view shows better performance than a static view, (ii) having both real and synthetic galleries provides better performance, and (iii) using multiple frames provides better performance.

Figure 2.35. Effect of projection model on face recognition performance.

Figure 2.36. Face recognition performance with static and close-up views.

Figure 2.37. Face recognition performance using real and synthetic gallery images and multiple frames.

2.5 Summary

We have shown that the performance of video based face recognition can be improved by using 3D face models obtained either directly from a 3D range sensor or via 2D to 3D reconstruction based on structure from motion. Fusing multiple matchers and multiple frames in an adaptive manner by utilizing dynamic information of facial pose and motion blur also provides performance improvements. A systematic use of the temporal information in video is crucial to obtain the desired recognition performance. The current implementation processes at a rate of 2 frames per second, on average. A more efficient implementation and integration of the various modules is necessary.

We have developed a semi-automatic surveillance system with the concept of a Visual Search Engine (ViSE) using multiple networked cameras. A robust background modeling method that can handle the images from networked cameras, a shadow suppression method, and a number of descriptive feature extraction methods were developed. The system has been tested with pre-recorded video data and shows promising results. The proposed feature extraction method can be used in automatic single or cross camera tracking as well, where robust and invariant feature extraction is important. Since our system is targeted at surveillance applications, we also developed a prototype high resolution face image acquisition system and demonstrated its performance in face recognition at a distance using a pair of static and PTZ cameras. The crucial aspect of face recognition at a distance is to properly utilize the advantage of 3D models, if available, and the temporal information in the video (e.g., facial pose and motion blur as mentioned in Sections 2.1.8 and 2.1.9).

Chapter 3

Age Invariant Face Recognition

3.1 Introduction

Face recognition accuracy is usually limited by the large intra-class variations caused by factors such as pose, lighting, expression, and age [92]. Therefore, most of the current work on face recognition is focused on compensating for the variations that degrade face recognition performance. However, facial aging has not received adequate attention compared with other sources of variation such as pose, lighting, and expression. Facial aging is a complex process that affects both the shape and texture (e.g., skin tone or wrinkles) of a face. This aging process also appears in different manifestations in different age groups.
While facial aging is mostly represented by facial growth in younger age groups (i.e., <=18 years old), it is represented by relatively large texture changes and minor shape changes (e.g., due to the change of weight or stiffness of skin) in older age groups (i.e., >18). Therefore, an age correction scheme needs to be able to compensate for both types of aging processes.

Some of the face recognition applications where age compensation is required include (i) identifying missing children, (ii) screening, and (iii) detection of multiple enrollments. These three scenarios have two common characteristics: (i) a significant age difference between probe and gallery images (images obtained at enrollment and verification stages) and (ii) an inability to obtain a subject's face image to update the template (gallery). Identifying missing children is one of the most apparent applications where age compensation is needed to improve the recognition performance. In screening applications, aging is a major source of difficulty in identifying suspects in a watch list. Repeat offenders commit crimes at different time periods in their lives, often starting as a juvenile and continuing throughout their lives. It is not unusual to encounter a time lapse of ten to twenty years between the first (enrollment) and subsequent (verification) arrests. Multiple enrollment detection for issuing government documents such as driver licenses and passports is a major problem that various government and law enforcement agencies face in the facial databases that they maintain. Face or some other type of biometric trait (e.g., fingerprint or iris) is the only way to detect multiple enrollments, i.e., to detect a person enrolled in a database under different names.

Ling et al. [66] studied how age differences affect face recognition performance in a real passport photo verification task. Their results show that the aging process does increase the recognition difficulty, but it is less severe than the effects of illumination or expression. Studies on face verification across age progression [99] have shown that: (i) simulation of the shape and texture variations caused by aging is a challenging task, as factors like lifestyle and environment also contribute to facial changes in addition to biological factors, (ii) the aging effects can be best understood using 3D scans of the human head, and (iii) the available databases to study facial aging are not only small but also contain uncontrolled external and internal variations (e.g., pose, lighting, and expression). It is due to these reasons that the effect of aging in facial recognition has not been as extensively investigated as other factors of intra-class variation in facial appearance.

Some biological and cognitive studies on the aging process have also been conducted, e.g., in [115] [95]. These studies have shown that cardioidal strain is a major factor in the aging of facial outlines.
PCA coefficients of 3?:th321230bls FG~NET * 14 4 38 1 (2007) [35] shape and texture PC A ’ (10,10) ' ' across a series of ages Build an aging Wang et a1. function in terms of Private database (2006) [124] PCA coefficients of PCA (NA,2000) 52'0 63'0 shape and texture Build an aging :lattezzragt): function in terms of P C A MORPH + 11 0 33 0 [84] PCA coefficients of (9,36) ° ' shape and texture Pro 05 ed Learn aging pattern €§-;?T 26.4 37.4 p based on PCA FaceVACS ’ method ' coefficients in ++ separated 3D shape MORPH 57.8 66.4 (612,612) and texture "‘ Used only a subset of the FG-NET database that contains 82 subjects + Used only a subset of the MORPH-Albuml database that contains 625 subjects " Used all the subjects in FG-NET 'H' Used all the subjects in MORPH-Albuml factor in the aging of facial outlines. Such results have also been used in psychological studies, e.g. by introducing aging as caricatures generated by controlling 3D model parameters [78]. Patterson et a1. [85] compared automatic aging simulation results with forensic sketches and showed that further studies in aging are needed to improve 78 face recognition techniques. A few seminal studies [100] [112] have demonstrated the feasibility of improving face recognition accuracy by simulated aging. There has also been some work done in the related area of age estimation using statistical models, e.g. [58] [57]. Geng et al. [35] learned a subspace of aging pattern based on the as- sumption that similar faces age in similar ways. Their representation is composed of face texture and thei2D facial shape; the shape is represented by the coordinates of the feature points as in the Active Appearance Model. Table 3.1 gives a brief comparison of various methods for modeling aging proposed in the literature. The performance of these models is evaluated in terms of the improvement in the identification accuracy. When multiple accuracies were reported in any of the studies under the same experimental setup, their average value is listed in Table 3.1. If multiple accuracies are reported under different approaches, the best performance is reported in Table 3.1. The identification accuracies of various studies in Table 3.1 cannot be directly compared due to the differences in the databases used, number of subjects used and the underlying face recognition methods used for evaluation. Usually, the larger the number of subjects, and the larger the database variations in terms of age, pose, lighting, and expression, the smaller the recognition performance improvement due to the aging model. The identification accuracy for each approach in Table 3.1 before aging simulation indicates the difficulty of the experimental setup for the face recognition test as well as the capability of the face matcher. There are two well known public domain databases that are used to evaluate facial aging models; FG-NET [4] and MORPH [101]. The FG-NET database contains 1,002 face images of 82 subjects (~12 images/subject) at different ages, with the minimum age being 0 (< 12 months) and the maximum age being 69. There are two separate databases in MORPH: Albuml and Album2. MORPH-Albuml contains 1,690 images from 625 different subjects (~2.7 images/subject). MORHP-Album2 79 (b) MORPH Figure 3.1. Example images in (a) FG-NET and (b) MORPH databases. Multiple images of one subject in each of the two databases are shown at different ages. The age value is given below each image. 80 contains 15,204 images from 4,039 different subjects (~3.8 images/subject). 
Since it is desirable to have as many subjects and as many images at different ages per subject as possible, the FG-NET database is more useful for aging modeling than MORPH. The age separation observed in MORPH-Album1 is in the range 0~30, while that in MORPH-Album2 is less than 5. Therefore, MORPH-Album1 is more useful in evaluating the aging model than MORPH-Album2. We have used 1,655 images of all the 612 subjects whose images at different ages are available in MORPH-Album1 in our experiments. We have used the complete FG-NET database for model construction and then evaluated it on FG-NET (in leave-one-person-out fashion) and MORPH-Album1. Fig. 3.1 shows multiple sample images of one subject from each of the two databases. The number of subjects, number of images, and average number of images at different ages per subject for the two databases used in our aging study are summarized in Table 3.2.

Table 3.2. Databases used in aging modeling.

Database          #subjects   #images   average #images at different ages per subject
FG-NET            82          1,002     12
MORPH Album1      625         1,690     2.7
MORPH Album2      4,039       15,204    3.8

Compared with the other published approaches, the proposed method for aging modeling has the following features:

• 3D aging modeling: We use a pose correction stage and model the aging pattern more realistically in the 3D domain. Considering that aging is a process occurring in the 3D domain, 3D modeling is better suited to capture the aging patterns. We have shown how to build a 3D aging model given a 2D face aging database. The proposed method is the only viable alternative to building a 3D aging model directly, because there is no 3D aging database currently available.

• Separate modeling of shape and texture changes: The effectiveness of different combinations of shape and texture in an aging model has not yet been systematically studied. We have compared three different modeling methods, namely, shape modeling only, separate shape and texture modeling, and combined shape and texture modeling (e.g., applying PCA to remove the correlation between shape and texture after concatenating the two types of feature vectors). We have shown that separate modeling of shape and texture (or shape modeling only) is better than the combined shape and texture modeling method, given the FG-NET database as the training data.

• All the previous studies on facial aging have used PCA-based matchers. We have used a state-of-the-art face matcher, FaceVACS from Cognitec [9], to evaluate our aging model. The proposed method can be useful in practical applications requiring age correction processes. Even though we have evaluated the proposed method on only one particular face matcher, it can be used directly in conjunction with any other face matcher.

• Diverse databases: We have used FG-NET for aging modeling and evaluated the aging model on two different databases, FG-NET (in leave-one-person-out fashion) and MORPH. We have observed substantial performance improvements on the two databases. This demonstrates the effectiveness of the proposed aging modeling method.

3.2 Aging Model

We propose to use a set of 3D face images to learn the model for recognition, because the true craniofacial aging model [95] can be appropriately formulated only in 3D. However, since only 2D aging databases are available, it is necessary to first convert these 2D face images into 3D. The methods for detecting salient feature points in face images, and for using them to convert the images into 3D models, are discussed in Sec. 3.2.1 and Sec. 3.2.2, respectively.
These 3D face models from a number of subjects at different ages are then used for building the aging model through both shape and texture. A combination of the shape and texture gives the aging simulation capability, which will be used to compensate for age variations, thereby improving the face recognition performance. A detailed explanation of the aging model is given in Sec. 3.2.3.

We first define the notation that is used in the subsequent sections.

• $S_{mm} = \{S_{mm,1}, S_{mm,2}, \ldots, S_{mm,n_{mm}}\}$: a set of 3D face models used in constructing the reduced morphable model.
• $S_\alpha$: reduced morphable model represented with model parameter $\alpha$.
• $S^j_{i,2d} = \{x_1, y_1, \ldots, x_{n_{2d}}, y_{n_{2d}}\}$: 2D facial feature points for the $i$th subject at age $j$; $n_{2d}$ is the number of points in the 2D shape.
• $S^j_i = \{x_1, y_1, z_1, \ldots, x_{n_{3d}}, y_{n_{3d}}, z_{n_{3d}}\}$: 3D facial feature points for the $i$th subject at age $j$; $n_{3d}$ is the number of points in the 3D shape.
• $T^j_i$: facial texture for the $i$th subject at age $j$.
• $s^j_i$: reduced shape of $S^j_i$ after applying PCA on $S^j_i$.
• $t^j_i$: reduced texture of $T^j_i$ after applying PCA on $T^j_i$.
• $V_s$: largest $L_s$ principal components of $S^j_i$.
• $V_t$: largest $L_t$ principal components of $T^j_i$.
• $S^j_{w_s}$: synthesized 3D facial feature points at age $j$ represented with weight $w_s$.
• $T^j_{w_t}$: synthesized texture at age $j$ represented with weight $w_t$.
• $n_{mm} = 100$, $n_{2d} = 68$, $n_{3d} = 81$, $L_s = 20$, and $L_t = 180$.

In the following subsections we first transform $S^j_{i,2d}$ to $S^j_i$ using the reduced morphable model $S_\alpha$. Then, the 3D shape aging pattern space $\{S_{w_s}\}$ and the texture aging pattern space $\{T_{w_t}\}$ are constructed using $S^j_i$ and $T^j_i$.

3.2.1 2D Facial Feature Point Detection

We use manually marked facial feature points in aging model construction. However, in the test stage we detect the feature points automatically. The feature points on 2D face images are detected using the conventional Active Appearance Model (AAM) [110] [26]. We train separate AAM models for the two databases, the details of which are given below.

FG-NET: Face images in the FG-NET database have already been (manually) marked by the database provider with 68 feature points. We use these feature points to build the aging model. We also automatically detect the feature points and compare the face recognition performance based on manual and automatic feature point detection methods. We perform training and feature point detection in cross-validation fashion.

MORPH: Unlike the FG-NET database, a majority of face images in the MORPH database belong to African-Americans. These images are not well represented by the AAM model trained on the FG-NET database due to the differences in the cranial structure between the Caucasian and African-American populations. Therefore, we labeled a subset of images (80) in the MORPH database as a training set for the automatic feature point detector in the MORPH database.

3.2.2 3D Model Fitting

As mentioned earlier, the current face aging databases contain only 2D images. Further, some of the images in these databases were taken several decades back, and hence are of poor quality. This poses a significant challenge in creating an aging model. Thus, we begin by building a coarse 3D model for each subject at different ages before analyzing the 3D aging pattern, by fitting a generic 3D face model to the images based on feature correspondences. The 3D model enables us to perform pose correction and to build the 3D aging model.
We use a simplified deformable model based on Blanz and Vetter's model [16]. The geometric part of their deformable model is essentially a linear combination (weighted average) of a set of sample 3D face shapes, each with ~75,000 vertices. The vector that describes the 3D face shape is expressed in the Principal Component Analysis (PCA) basis. For efficiency, we drastically reduced the number of vertices in the 3D morphable model to 81 (from ~75,000); 68 of these points correspond to the features already present in the FG-NET database, while the other 13 delineate the forehead region. Following [16], we performed a PCA on the simplified shape sample set, $\{S_{mm}\}$. We obtained the mean shape $\bar{S}_{mm}$, the eigenvalues $\lambda_l$ and the eigenvectors $W_l$ of the shape covariance matrix. The top $L$ (= 30) eigenvectors were used, which accounted for 98% of the total variance, again for efficiency and stability of the subsequent fitting algorithm performed on the possibly noisy data set. A 3D face shape can then be represented using the eigenvectors as

$S_\alpha = \bar{S}_{mm} + \sum_{l=1}^{L} \alpha_l W_l$,   (3.1)

where the parameter $\alpha = [\alpha_l]$ controls the shape, and the covariance of the $\alpha$'s is the diagonal matrix with $\lambda_l$ as the diagonal elements. The fitting process can be performed in a Bayesian framework where the prior shape and the posterior observations of the fitting results are unified to reach the final result. However, we follow the direct fitting process for its simplicity. We now describe the transformation of the given 2D feature points $S^j_{i,2d}$ into the corresponding 3D points $S^j_i$ using the reduced morphable model $S_\alpha$.

OBJECTIVE FUNCTION

To fit the 3D shape $S_\alpha$ to a 2D shape, we find the value of $\alpha$ that minimizes the sum of the squared distances between each 2D feature point and the projection of its corresponding 3D point. We follow an iterative procedure similar to [94] to optimize this objective function. However, some modifications to the algorithm are necessary, since the deformable models we use are different from those in [94], and we are not tracking the motion of the face but fitting a generic model to the feature set of a 3D face projected to 2D. Our goal is to find a shape descriptor $\alpha$, a projection matrix $P$, a rotation matrix $R$, a translation vector $t$, and a scaling factor $c$, such that the difference between the given 2D shape $S^j_{i,2d}$ and the projection of the 3D shape $S_\alpha$ is minimized. Let $E(\cdot)$ be the overall error in fitting the 3D model of one face to its corresponding 2D feature points, where

$E(P, R, t, c, \{\alpha_l\}_{l=1}^{L}) = \| S^j_{i,2d} - T_{P,R,t,c}(S_\alpha) \|^2$.   (3.2)

Here $T(\cdot)$ represents a transformation operator performing a sequence of operations, i.e., rotation, translation, scaling, projection, and selecting the $n_{2d}$ points out of the $n_{3d}$ that have correspondences. To simplify the procedure, we use an orthogonal projection for $P$. In practice, the 2D feature points that are either manually labeled or generated by AAM are noisy, which means overfitting these feature points may produce undesirable 3D shapes. We address this issue by introducing a Tikhonov regularization term to control the Mahalanobis distance of the shape from the mean shape. Let $\sigma$ be the empirically estimated standard deviation of the energy $E$ induced by the noise in the locations of the 2D feature points. We define the regularized energy as

$E' = E/\sigma^2 + \sum_{l=1}^{L} \alpha_l^2/\lambda_l$.   (3.3)
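As a concrete illustration of Eqs. (3.1)-(3.3), the following minimal NumPy sketch evaluates the regularized fitting energy; it is not the thesis implementation. It assumes an orthographic camera whose rotation and scaling are folded into a single 2x3 matrix P, a one-to-one correspondence between the observed 2D points and the model vertices, and purely illustrative function and variable names:

import numpy as np

def regularized_fitting_energy(alpha, P, t, pts_2d, S_mean, W, lam, sigma2):
    # alpha  : (L,) shape coefficients of the reduced morphable model
    # P      : (2, 3) scaled orthographic projection (rotation and scale folded in)
    # t      : (2,) 2D translation
    # pts_2d : (n, 2) observed 2D feature points, assumed to correspond
    #          one-to-one with the n model vertices used here
    # S_mean : (n, 3) mean reduced shape;  W : (n, 3, L) shape eigenvectors
    # lam    : (L,) eigenvalues of the shape covariance;  sigma2 : noise variance
    S_alpha = S_mean + W @ alpha                   # Eq. (3.1): shape for these coefficients
    proj = S_alpha @ P.T + t                       # orthographic projection to 2D
    E = np.sum((pts_2d - proj) ** 2)               # Eq. (3.2): squared reprojection error
    return E / sigma2 + np.sum(alpha ** 2 / lam)   # Eq. (3.3): Tikhonov-regularized energy

The alternating procedure described next minimizes this energy by updating the pose parameters and the shape coefficients in turn.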
OPTIMIZATION PROCEDURE

To minimize the energy defined in Eq. (3.3), we use the following alternating optimization procedure:

(i) Initialize all the $\alpha_l$'s to 0, set the rotation matrix $R$ to the identity matrix and the translation vector $t$ to 0, and set the scaling factor $c$ to match the overall size of the 2D and 3D shapes.

(ii) Minimize $E'$ by varying $R$ and $t$ with $\alpha$ fixed. There are multiple ways to find the optimal pose given the current $\alpha$. In our tests, we found that first estimating the best $2 \times 3$ affine transformation ($PR$) followed by a QR decomposition to get the rotation works better than running a quaternion-based optimization using Rodrigues' formula [94]. Note that $t_z$ is fixed to 0, as we use an orthogonal projection.

(iii) Minimize $E'$ by varying $\alpha$ with $R$ and $t$ fixed. Note that when both $R$ and $t$ are fixed, the target function $E'$ is a simple quadratic energy.

(iv) Repeat (ii) and (iii) until convergence (i.e., the decrease in energy between successive iterations is below a threshold, or the iteration count exceeds the maximum number).

Figure 3.2. 3D model fitting process using the reduced morphable model.

Fig. 3.2 illustrates the 3D model fitting process to acquire the 3D shape. Fig. 3.3 shows the manually labeled 68 points and the automatically recovered 13 points that delineate the forehead region. The associated texture is then retrieved by warping the 2D image.

3.2.3 3D Aging Model

Following [35], we define the aging pattern as an array of face models from a single subject indexed by age. We assume that any aging pattern can be approximated by a weighted average of the aging patterns in the training set. Our model construction differs from [35] mainly in that we model shape and texture separately at different ages using the shape (aging) pattern space and the texture (aging) pattern space, respectively. This is because the 3D shape and the texture images are less correlated than 2D shape and texture. We also adjust the 3D shape as explained below. Separating shape aging patterns and texture aging patterns can also help alleviate the problem of a relatively small number of available training samples for different shape and texture combinations, as is the case in FG-NET, which has only 82 subjects with ~12 images/subject. The two pattern spaces are described below.

Figure 3.3. Four example images with manually labeled 68 points (blue) and the automatically recovered 13 points (red) for the forehead region.

SHAPE AGING PATTERN

The shape pattern space captures the variations in the internal shape changes and the size of the face. The pose-corrected 3D models obtained from the preprocessing phase are used for constructing the shape pattern space. Under age 19, the key effects of aging are driven by the increase in cranial size, while for adults the facial growth in height and width is very small [12]. To incorporate the growth pattern of the cranium for ages under 19, we rescale the overall size of the 3D shapes according to the anthropometric head width reported in [32]. We perform a PCA over all the 3D shapes $S^j_i$ in the database, irrespective of age $j$ and subject $i$. We project all the mean-subtracted $S^j_i$ onto the subspace spanned by the columns of $V_s$ to obtain $s^j_i$ as

$s^j_i = V_s^T (S^j_i - \bar{S})$,   (3.4)
which is an $L_s \times 1$ vector. The basis of the shape pattern space is then assembled as an $m \times n$ matrix with vector entries $s^j_i$ (or, alternatively, as an $m \times n \times L_s$ tensor), where the $i$-th row corresponds to age $i$ and the $j$-th column corresponds to subject $j$. The shape pattern basis is initially filled with the projected shapes $s^j_i$ from the face database. We tested three different methods for the filling process: linear, Radial Basis Function (RBF), and a variant of RBF (v-RBF). Given the available ages $a_i$ and the corresponding shape feature vectors $s_i$, a missing feature value $s_x$ at age $a_x$ can be estimated in linear interpolation as $s_x = l_1 s_1 + l_2 s_2$, where $s_1$ and $s_2$ are the shape features corresponding to the ages $a_1$ and $a_2$ that are closest to $a_x$, and $l_1$ and $l_2$ are weights inversely proportional to the distances from $a_x$ to $a_1$ and $a_2$. In the v-RBF process, each feature is replaced by a weighted sum of all the available features as $s_x = \sum_i \phi(a_x - a_i)\, s_i / \sum_i \phi(a_x - a_i)$, where $\phi(\cdot)$ is an RBF defined by a Gaussian function. In the RBF method, the mapping function from age to shape feature vector is calculated as $s_x = \sum_i r_i \phi(a_x - a_i) / \sum_i \phi(a_x - a_i)$ for each available age and feature vector $a_i$ and $s_i$, where the $r_i$'s are estimated based on the known scattered data. Any missing feature vector $s_x$ at age $x$ can thus be obtained.

Figure 3.4. 3D aging model construction.

The shape aging pattern space is defined as the space containing all the linear combinations of the patterns of the following type (expressed in the PCA basis):

$s^j_{w_s} = \bar{s}^j + \sum_{i=1}^{n} (s^j_i - \bar{s}^j)\, w_{s,i}$,   $0 \le j \le 69$.   (3.5)

The weight $w_s$ in Eq. (3.5) is not unique for the same aging pattern. We take care of this by the regularization term in the aging simulation described below. Given a complete shape pattern space, the mean shape $\bar{S}$, and the transformation matrix $V_s$, the shape aging model with weight $w_s$ is defined as

$S^j_{w_s} = \bar{S} + V_s s^j_{w_s}$,   $0 \le j \le 69$.   (3.6)

TEXTURE AGING PATTERN

The texture pattern $T^j_i$ for subject $i$ at age $j$ is obtained by mapping the original face image to the frontal projection of the mean shape $\bar{S}$, followed by column-wise concatenation of the image pixels. The texture mapping is performed using the Barycentric coordinate system [18]. After applying PCA on $T^j_i$, we calculate the transformation matrix $V_t$ and the projected texture $t^j_i$. We follow the same filling procedure as in the shape pattern space to construct the complete basis for the texture pattern space using $t^j_i$. A new texture $T^j_{w_t}$ can be similarly obtained, given an age $j$ and a set of weights $w_t$, as

$t^j_{w_t} = \bar{t}^j + \sum_{i=1}^{n} (t^j_i - \bar{t}^j)\, w_{t,i}$,   (3.7)

$T^j_{w_t} = \bar{T} + V_t t^j_{w_t}$,   $0 \le j \le 69$.   (3.8)

Figure 3.5. Aging simulation from age x to y.

Fig. 3.4 illustrates the aging model construction process for the shape and texture pattern spaces.
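To make the pattern-space assembly and the linear filling option concrete, the following minimal NumPy sketch shows one way it could be organized; it is not the thesis code. The array layout and function names are illustrative assumptions, and the per-age mean is used as the reference pattern in the synthesis step, following Eq. (3.5):

import numpy as np

def build_shape_pattern_space(shapes, V_s, S_mean, n_ages=70):
    # shapes : dict mapping (subject_id, age) -> flattened 3D shape S_i^j
    # V_s    : (d, L_s) top principal components of the 3D shapes
    # S_mean : (d,) mean 3D shape
    # Returns an (n_ages, n_subjects, L_s) array with every age filled;
    # assumes each subject has at least one image.
    subjects = sorted({i for (i, _) in shapes})
    L_s = V_s.shape[1]
    space = np.full((n_ages, len(subjects), L_s), np.nan)

    # Project each available shape onto the PCA basis (Eq. 3.4): s = V_s^T (S - S_mean).
    for (i, age), S in shapes.items():
        space[age, subjects.index(i)] = V_s.T @ (S - S_mean)

    # Fill the missing ages of each subject by linear interpolation along the age axis.
    ages = np.arange(n_ages)
    for col in range(len(subjects)):
        for f in range(L_s):
            known = ~np.isnan(space[:, col, f])
            space[:, col, f] = np.interp(ages, ages[known], space[known, col, f])
    return space

def synthesize_reduced_shape(space, w_s, age):
    # Weighted deviation from the per-age mean pattern (Eq. 3.5).
    s_bar = space[age].mean(axis=0)
    return s_bar + (space[age] - s_bar).T @ w_s

def synthesize_shape(space, w_s, V_s, S_mean, age):
    # Map the reduced shape back to 3D coordinates (Eq. 3.6).
    return S_mean + V_s @ synthesize_reduced_shape(space, w_s, age)

The texture pattern space is filled and synthesized in exactly the same way using the projected textures and V_t.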
3.3 Aging Simulation

Given a face image of a subject, say at age $x$, aging simulation involves the construction of the face image of that subject adjusted to a different age, say $y$. The purpose of the aging simulation is to generate synthetically aged ($y > x$) or de-aged ($y < x$) face images to eliminate or reduce the age gap between the probe and gallery face images.¹ The aging simulation process can be accomplished using the above aging model. Given an image at age $x$, we first produce the 3D shape $S^x_{new}$ and the texture $T^x_{new}$ by following the preprocessing steps described in Sec. 3.2, and then project them to the reduced space to get $s^x_{new}$ and $t^x_{new}$. Given a reduced 3D shape $s^x_{new}$ at age $x$, we can obtain a weighting vector $\hat{w}_s$ that generates the closest possible weighted sum of the shapes at age $x$ as

$\hat{w}_s = \arg\min_{c_- \le w_s \le c_+} \| s^x_{new} - s^x_{w_s} \|^2 + r_s \| w_s \|^2$,   (3.9)

where $r_s$ is a regularizer to handle the cases when multiple solutions are obtained or when the linear system used to obtain the solution has a large condition number. We constrain each element $w_{s,i}$ of the weight vector to lie within the interval $[c_-, c_+]$ to avoid strong domination by a few shape basis vectors. Given $\hat{w}_s$, we can obtain the age-adjusted shape at age $y$ by carrying $\hat{w}_s$ over to the shapes at age $y$ and transforming the shape descriptor back to the original shape space as

$S^y_{new} = S^y_{\hat{w}_s} = \bar{S} + V_s s^y_{\hat{w}_s}$.   (3.10)

The texture simulation process is performed similarly by first estimating $\hat{w}_t$ as

$\hat{w}_t = \arg\min_{c_- \le w_t \le c_+} \| t^x_{new} - t^x_{w_t} \|^2 + r_t \| w_t \|^2$,   (3.11)

and then propagating $\hat{w}_t$ to the target age $y$, followed by the back projection, to get

$T^y_{new} = T^y_{\hat{w}_t} = \bar{T} + V_t t^y_{\hat{w}_t}$.   (3.12)

¹Note that the term de-aging is used when the new age at which the images need to be simulated by the aging model is lower than the age of the given image.

The aging simulation process is illustrated in Fig. 3.5. Fig. 3.6 shows an example of aging simulated face images from a subject at age two in the FG-NET database. Fig. 3.7 exhibits example input images, feature point detection, pose-corrected, and age-simulated images from a subject in the MORPH database. The pseudocode for shape aging pattern space construction and simulation is given in Algorithms 3.5.1, 3.5.2, 3.5.3, and 3.5.4.

3.4 Experimental Results

3.4.1 Face Recognition Tests

We evaluate the performance of the proposed aging model by comparing the face recognition accuracy of a state-of-the-art matcher before and after aging simulation. We construct the probe set, $P = \{p^{x_1}_1, \ldots, p^{x_n}_n\}$, by selecting one image $p^{x_i}_i$ for each subject $i$ at age $x_i$ in each database, $i \in \{1, \ldots, n\}$, $x_i \in \{0, \ldots, 69\}$. The gallery set $G = \{g^{y_1}_1, \ldots, g^{y_n}_n\}$ is similarly constructed. We also created a number of different probe and gallery age groups from the two databases to demonstrate our model's effectiveness in different periods of the aging process (e.g., youth growth or adult aging). In FG-NET, we selected 7 different age groups, $x \in \{0, 5, 10, \ldots, 30\}$, as probes and 6 different age gaps, $\Delta_{age} \in \{5, 10, \ldots, 30\}$, to set up the gallery age $y = x + \Delta_{age}$. In this way, 42 different combinations of probe-gallery groups were constructed for FG-NET. In MORPH, there are no photos with ages under 15, so we only used 24 different groups, with all probe ages ≥ 15.

Figure 3.6. An example of aging simulation in the FG-NET database, including face images at five different poses generated from the aging-simulated image at age 20.
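Eq. (3.9) is a box-constrained ridge least-squares problem in the weights. The sketch below shows one way it could be solved; it is an assumption rather than the thesis implementation. It uses SciPy's lsq_linear after augmenting the design matrix with a sqrt(r_s) identity block so the regularizer enters the objective, and the regularizer value and bounds shown are placeholders:

import numpy as np
from scipy.optimize import lsq_linear

def estimate_aging_weights(s_new, patterns_at_x, r_s=0.1, c_minus=-0.5, c_plus=1.5):
    # s_new         : (L_s,) reduced shape of the input face at age x
    # patterns_at_x : (n_subjects, L_s) row i holds s_i^x from the filled pattern space
    # r_s, c_minus, c_plus : regularizer and box constraints (placeholder values)
    s_bar = patterns_at_x.mean(axis=0)
    A = (patterns_at_x - s_bar).T                     # deviations from the mean pattern
    b = s_new - s_bar
    n = A.shape[1]
    # Append sqrt(r_s) * I rows so that r_s * ||w||^2 is added to the squared residual.
    A_aug = np.vstack([A, np.sqrt(r_s) * np.eye(n)])
    b_aug = np.concatenate([b, np.zeros(n)])
    res = lsq_linear(A_aug, b_aug, bounds=(c_minus, c_plus))
    return res.x                                      # estimated weight vector w_s

The estimated weights are then carried over to the target age y and mapped back through V_s exactly as in Eq. (3.10); the texture weights of Eq. (3.11) are obtained in the same way.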
Since not all subjects have images at the chosen ages in the database, we pick the photo of subject $i$ at the age $x_i$ that is closest to $x$ for the probe set, and pick the photo at age $y_i$ ($\ne x_i$) closest to $y$ for the gallery set. The numbers of subjects in the probe and gallery sets are 82 and 612 in evaluating FG-NET and MORPH, respectively. Aging simulation is performed in both the aging and de-aging directions for each subject $i$ in the probe and each subject $j$ in the gallery, i.e., ($x_i \rightarrow y_j$) and ($y_j \rightarrow x_i$). Table 3.3 summarizes the probe and gallery data sets used in our face recognition test.

Table 3.3. Probe and gallery data used in age invariant face recognition tests.

Database   #images probe (gallery)   #subjects probe (gallery)   Age group: probe; gallery
FG-NET     82 (82)                   82 (82)                     {0, 5, ..., 30}; x* + {5, ..., 30}
MORPH      612 (612)                 612 (612)                   {15, 20, ..., 30}; x* + {5, ..., 30}

x* is the age group of the probe.

Let $P$, $P_f$, and $P_a$ denote the probe, the pose-corrected probe, and the age-adjusted probe set, respectively. Let $G$, $G_f$, and $G_a$ denote the gallery, the pose-corrected gallery, and the age-adjusted gallery set, respectively. All age-adjusted images are generated (in leave-one-person-out fashion for FG-NET) using the shape and texture pattern spaces. The face recognition test is performed on the following probe-gallery pairs: $P$-$G$, $P$-$G_f$, $P_f$-$G$, $P_f$-$G_f$, $P_a$-$G_f$, and $P_f$-$G_a$. The identification rate for the probe-gallery pair $P$-$G$ is the performance on the original images, without applying the aging model. The accuracy obtained by fusing the $P$-$G$, $P$-$G_f$, $P_f$-$G$, and $P_f$-$G_f$ matchings is regarded as the performance after pose correction. The accuracy obtained by fusing all the pairs $P$-$G$, $P$-$G_f$, $P_f$-$G$, $P_f$-$G_f$, $P_a$-$G_f$, and $P_f$-$G_a$ represents the performance after aging simulation. A simple score-sum based fusion is used in all the experiments. All matching scores are obtained by FaceVACS and are distributed in the range 0~1; therefore, score normalization is not applied in the fusion process.

3.4.2 Effects of Different Cropping Methods

Recall that a morphable model with 81 3D vertices is used, including the 68 feature points already marked in FG-NET for aging modeling. The additional 13 feature points (shown in Fig. 3.2) are used to delineate the contour of the forehead, which is inside the region used to generate the feature sets and the reference sets in the commercial matcher FaceVACS.

We study the performance of the face recognition system with different face cropping methods. A comparison of the cropping results obtained by the different approaches is shown in Fig. 3.8. The first column shows the input face image, and the second column shows the cropped face obtained using the 68 feature points provided in the FG-NET database, without pose correction. The third column shows the cropped face obtained with the additional 13 points (a total of 81 feature points) for forehead inclusion, without any pose correction. The last column shows the cropping obtained by the 81 feature points, with pose correction.

Fig. 3.9 (a) shows the face recognition performance on FG-NET using only shape modeling, based on the different face cropping methods and feature point detection methods. Face images with pose correction that include the forehead lead to the best performance.
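Since all FaceVACS scores already lie in [0, 1], the fusion used here is a plain element-wise sum of the score matrices of the selected probe-gallery matchings. A minimal NumPy sketch is given below; the function and variable names are illustrative, and which matrices are summed (the pose-correction pairs only, or all six pairs) determines whether the result corresponds to the pose-corrected or the aging-simulated performance:

import numpy as np

def score_sum_rank1(score_mats, true_gallery_idx):
    # score_mats       : list of (n_probe, n_gallery) score matrices, e.g. the
    #                    P-G, P-Gf, Pf-G, Pf-Gf, Pa-Gf and Pf-Ga matchings
    # true_gallery_idx : (n_probe,) index of the mated gallery entry per probe
    fused = np.sum(score_mats, axis=0)                 # simple score-sum fusion
    return float(np.mean(np.argmax(fused, axis=1) == true_gallery_idx))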
This result shows that the forehead does influence the face recognition performance, although it has been a common practice to remove the forehead in AAM-based feature point detection and subsequent face modeling [58] [124] [26]. We, therefore, evaluate our aging simulation with the model that contains the forehead region, with pose correction. Note that the performance difference between non-frontal and frontal poses is as expected, and that the performance using automatically detected feature points is lower than that of manually labeled feature points. However, the performance with automatic feature point detection is still better than that of matching the original images before applying the aging modeling. We have also tried enforcing facial symmetry in the 3D model fitting process, but it did not help in achieving better recognition accuracy.

Figure 3.8. Example images showing different face cropping methods: (a) original image, (b) no forehead and no pose correction, (c) forehead and no pose correction, (d) forehead and pose correction.

3.4.3 Effects of Different Strategies in Employing Shape and Texture

Most of the existing face aging modeling techniques use either only shape or a combination of shape and texture [100] [58] [35] [124] [84]. We have tested our aging model with shape modeling only, separate shape and texture modeling, and combined shape and texture modeling. In our test of the combined scheme, the shape and texture are concatenated and a second stage of principal component analysis is applied to remove the possible correlation between shape and texture, as in the AAM face modeling technique.

Figure 3.9. Cumulative Match Characteristic (CMC) curves with different methods of (a) face cropping and (b) shape & texture modeling.
3.9 (b), shape+texture modeling represents separate modeling of shape and texture, shape+.5xtexture represents the same procedure but with the blending of the simulated texture with the original tex- ture. We use the fusion of shape and shape+.5xtexture strategy for the following aging modeling experiments. 3.4.4 Effects of different filling methods in model construction We tried a few different methods of filling missing values in the aging pattern space construction (see Sec. 3.2.3): linear, 'v-RBF, and RBF. The rank-one accuracies are obtained as 36.12%, 35.19%, and 36.35% in shape+texturex.5 modeling method for linear, v-RBF, and RBF methods, respectively. We chose the linear interpolation method in the rest of the experiments for the following reasons: i) its performance difference with other approaches is minor, ii) linear interpolation is computationally efficient, and iii) the calculation of the RBF based mapping function can be ill-posed. Fig. 3.10 provides the Cumulative Match Characteristic (CMC) curves with origi- nal, pose-corrected, and aging simulated images in FG-NET and MORPH databases, respectively. It can be seen that there is a significant performance improvement after 101 Cumulative accuracy (%) --A- aging ------- pose correction . -*~ original .3 > O «I ‘5 50- . § o 40- .g g 30— E 3 o 20' +aging 10 _ ------- pose correction -*~ original 0 1 1 r 5 10 15 Rank (b) MORPH Figure 3.10. Cumulative Match Characteristic (CMC) curves showing the performance gain based on the proposed aging model. 102 20 ”ram Elwin-3 'WTE méhnw 3",,“ ._ a. . ~~_ FR“ r Rank-1 accuracy (%) 888 improved rank-1 accuracy (%) (c) Amount of improvement Figure 3.11. Rank-one identification accuracies for each probe and gallery age groups: (a) before aging simulation, (b) after aging simulation, and (c) the amount of improvement after aging simulation. 103 aging modeling and simulation in both databases. The amount of improvement due to aging simulation is more or less the same with those of other studies as shown in Table 3.1. However, we have used FaceVACS, a state-of—the—art face matcher, which is known to be more robust against internal and external facial variations (e. g., pose, lighting, expression, etc) than simple PCA based matchers. We argue that the perfor- mance gain using FaceVACS is more realistic than the performance improvement of a PCA matcher reported in earlier studies. Further, unlike other studies, we have used the entire FG—NET and MORPH-Albuml in our experiments. Another attribute of our study is that the model is built on FG-NET and then independently evaluated on MORPH. Fig. 3.11 presents the rank-one identification accuracies for each of the 42 different age pair groups of probe and gallery in the FG~NET database. The aging process can be separated as growth and development (age_<_18) and adult aging process (age>18). While, our aging process provides performance improvements in both the age groups, “less than 18” and “greater than 18”, the performance improvement is somewhat lower in the growth process where more changes occur in the facial appearance. The average recognition result for the age groups “less than 18” is improved from 17.3% to 24.8% and for the age groups “greater than 18” performance is improved from 38.5% to 54.2%. Matching results for seven subjects in the FG—NET database are demonstrated in Fig. 3.12. The face recognition fails without aging simulation but succeeds with aging simulations for the first five of these seven subjects. 
The aging simulation fails to provide correct matchings for the last two subjects, possibly due to poor texture quality (for the sixth subject) and large pose and illumination variation (for the seventh subject). Fig. 3.13 shows four example matching results where the original images succeeded in matching but failed after the aging simulation. The original probe and gallery images appear similar even though there are age gaps, but become 104 more different after aging simulation in these examples. In any event, the overall matching accuracy improves after the aging simulation. The proposed aging model construction takes about 44 secs. The aging model is constructed off-line, therefore its computation time is not a major concern. In the recognition stage, the entire process, including automatic feature point detection, aging simulation, template generation and matching takes about 12 secs. per probe image. Note that the gallery images are preprocessed off-line. All computation times are measured on a Pentium 4, 3.2GHz, 3G-Byte RAM machine. The feature point detection using AAM takes about 10 secs, which is the major bottleneck. We have noticed that our aging correction method is capable of improving the recognition performance even with noisy feature points. Therefore, a simpler and faster feature point localization method should be explored to reduce the computation time while keeping the performance gain to a similar level. L 3.5 Summary We have proposed a 3D facial aging model and simulation method for age-invariant face recognition. The extension of shape modeling from 2D to 3D domain gives additional capability of compensating for pose and, potentially, lighting variations. Moreover, we believe that the use of a 3D model provides more powerful modeling capabilities than the 2D age modeling methods proposed earlier because the changes in human face configuration occur in the 3D domain. We have evaluated our approach using a state-of—the-art commercial face recognition engine (FaceVACS), and we have shown improvements in face recognition performance on two different publicly avail- able aging databases. We have shown that our method is capable of handling face aging effects in both growth and developmental stages. 105 Algorithm 3.5.1: 3D SHAPE AGING PATTERN CONSTRUCTION( ) Input : 52d = {Sizwnqszféwnqsmzd} Outputzszflz 1,...,n, j= 1,...,m 1'. «— 1, j «— 1 whilei<=n&j= False Negative . (miss) Figure 4.9. Schematic of the definitions of precision and recall. 122 true positive precision = . . . . true posztwe + false posztwe true positive recall = . . . true posztzue + false negative 43 W V 35 - E C .9 .2 0 9 a. 30 - - 25 I A 4L I l l I 10 15 20 30 35 40 45 25 Recall (%) Figure 4.10. Precision and recall curve of the proposed facial mark detection method. For the mark based matching, three different matching schemes are tested based on whether the ground truth or the automatic method was used to extract the marks in the probe and gallery: 1) ground truth (probe) to ground truth (gallery), ii) au- tomatic (probe) to automatic (gallery), and iii) ground truth (probe) to automatic (gallery). Constructing the ground truth for a large gallery database with millions of images is very time consuming and not feasible in practice. Therefore, using au- 123 tomatically detected marks on the gallery database and the automatic or manually labeled marks on the individual probe images is more practical. 
The score-level fusion of the commercial face matcher FaceVACS [9] and the mark-based matcher is carried out using the weighted sum method after min-max normalization of the scores. The weights of the two matchers were selected empirically as 0.6 for FaceVACS and 0.4 for the facial mark matcher. The precision and recall values of the mark detector over a series of brightness contrast thresholds $t_b$ (see Sec. 4.4.4) vary from (32, 41) to (38, 16), as shown in Fig. 4.10. The rank-1 identification accuracies for FaceVACS only and for the fusion of FaceVACS and marks are shown in Table 4.1, using $t_b = 200$ and $t_d = 30$. The parameter values tried to obtain the best recognition accuracy are 200, 400, 600, 800, and 1,000 for $t_b$ and 10, 30, and 50 for $t_d$.

Among the 213 probe images, there are 15 cases that fail to match at rank-1 using FaceVACS. After fusion, three out of these 15 failed probes are correctly matched at rank-1 for the ground truth (probe) to ground truth (gallery) matching. There is one case that was successfully matched before fusion but failed after fusion. Only one out of the 15 failed probes is correctly matched at rank-1 for the ground truth (probe) to automatic marks (gallery) matching (Fig. 4.11). The three example face image pairs that failed with FaceVACS but matched correctly at rank-1 after fusion are shown in Fig. 4.12. The 15 image pairs where FaceVACS failed to match at rank-1 contain relatively large pose variations. The three examples in Fig. 4.12 contain at least four matching marks, which increases the final matching score after fusion enough to match them successfully at rank-1. The proposed mark extraction method is implemented in Matlab and takes about 15 sec. per face image. The mark-based matching time is negligible.

Table 4.1. Face recognition accuracy using the FaceVACS matcher, the proposed facial marks matcher, and the fusion of the two matchers.

Matcher                                                        Rank-1     Rank-10
FaceVACS only                                                  92.96%
Ground truth mark + FaceVACS                                   93.90%     97.18%
Automatic mark + FaceVACS                                      93.43%     97.18%
Ground truth (probe) & auto. mark (gallery) + FaceVACS         93.43%     96.71%

Figure 4.11. An example face image pair that did not match correctly at rank-1 using FaceVACS but matched correctly after fusion for the ground truth (probe) to automatic marks (gallery) matching. Colored (black) boxes represent matched (unmatched) marks.

Figure 4.12. The first three rows show three example face image pairs that did not match correctly at rank-1 using FaceVACS but matched correctly after fusion for the ground truth (probe) to ground truth (gallery) matching. Colored (black) boxes represent matched (unmatched) marks. The fourth row shows an example that matched correctly with FaceVACS but failed to match after fusion; the failed case shows a zero matching score in mark-based matching due to an error in facial landmark detection.

4.6 Summary

Facial marks (e.g., freckles, moles, and scars) are salient localized regions appearing on the face that have been shown to be useful in face recognition. An automatic facial mark extraction method has been developed that shows good performance in terms of recall and precision. The fusion of facial marks with a state-of-the-art face matcher (FaceVACS) improves the rank-1 face recognition performance on an operational database. This demonstrates that micro-level features such as facial marks do offer some discriminating information. Most of the facial marks detected are semantically meaningful, so users can issue queries to retrieve images of interest from a large database.
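The fusion step described above can be sketched in a few lines of NumPy; the matrix shapes and function names are illustrative, while the min-max normalization and the 0.6/0.4 weights follow the text:

import numpy as np

def fuse_face_and_mark_scores(face_scores, mark_scores, w_face=0.6, w_mark=0.4):
    # face_scores, mark_scores : (n_probe, n_gallery) raw score matrices
    def min_max(s):
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s, dtype=float)
    # Weighted sum of the min-max normalized scores (0.6 for FaceVACS, 0.4 for marks).
    return w_face * min_max(face_scores) + w_mark * min_max(mark_scores)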
The absolute coordinates of the mark locations defined in the mean shape space, the relative geometry, or the morphology of each mark can be used as queries for the retrieval. For example, a query could be "retrieve all face images with a mole on the left side of the lip".

Chapter 5

Conclusions and Future Directions

5.1 Conclusions

The conclusions of this thesis are summarized below.

• We have shown that a 3D model based approach can be used to improve face recognition performance under pose variations up to ~99% on a video database containing 221 subjects. The 3D model is used for pose estimation to measure the quality of face images. The pose information is used in conjunction with image blur for gallery and probe construction for robust face recognition in video. A 3D model reconstruction technique based on the Factorization method is used to generate a synthetic frontal view from a non-frontal sequence of images to improve the recognition performance. A system of static and PTZ cameras is used as a means of resolving the poor resolution problems that are typically encountered in surveillance scenarios. A prototype semi-supervised surveillance system that tracks a person and computes soft biometric features (e.g., height and clothing color) has been developed.

• We have developed a 3D aging modeling and simulation technique that is robust against age-related variations in face recognition. PCA is applied on the shape and texture components separately to model the facial aging variations. The learned model is used for age correction to improve the face recognition performance. We have built the aging model on the FG-NET database and applied the learned model to FG-NET (in a leave-one-out fashion) and MORPH. Different face cropping methods and modeling techniques using shape only, texture only, and shape and texture, with and without a second-level PCA, have been tested. Consistent improvements of ~10% are observed in the face recognition performance for both databases. Separate modeling of the shape and texture components with score-level fusion shows the best performance.

• We have developed an automatic facial mark detection system. Facial marks provide a performance improvement when combined with state-of-the-art face matchers. Primary facial features are first detected using AAM and then excluded from the facial mark detection process. All face images are mapped to the mean shape and a LoG operator is applied to detect blob-like facial marks. Facial mark based matching is carried out based on an absolute coordinate system defined on the mean shape space. Fusion of facial mark based matching with FaceVACS shows about a 0.94% performance improvement.

Future Directions

Based on the contributions of this thesis, the following research directions appear promising.

• The proposed 3D model reconstruction is susceptible to the noisy feature point detection process. By combining a generic 3D model with the Factorization method, the success rate of 3D model reconstruction will increase, leading to better recognition performance. The 3D face model can also be used to estimate and compensate for lighting variations for robust face recognition in surveillance scenarios.
• It would be desirable to explore different (non-linear) methods for building an aging pattern space given noisy 2D or 3D shape and texture data, by cross validating the aging pattern space and the aging simulation results in terms of face recognition performance. The aging modeling technique can also be used for age estimation. For a fully automatic age invariant face recognition system, one also needs a method for automatic age estimation.

• Additional features such as morphology or color for the facial mark based matching should be considered. This will improve the matching accuracy with facial marks and enable more reliable face image retrieval. The face image retrieval system can be combined with other robust face matchers for faster search. Since each facial mark is locally defined, marks can easily be used in matching and retrieval given partial faces.

• The proposed aging correction, facial mark detection, and matching system should be evaluated in a video based recognition system. The pose correction, quality based frame selection, aging correction, and mark based matching techniques can be combined to build a unified system for video based face recognition.

APPENDICES

Appendix A

Databases

We have used a number of public domain and private databases for our experiments. The databases used for each problem we addressed are listed in Table A.1. The Face In Action database [37] was collected at Carnegie Mellon University in both indoor and outdoor settings, each in three different sessions, including 221 subjects, for the purpose of face recognition in video. Each subject was recorded by six different cameras simultaneously, at two different distances and three different angles. The MSU-ATR database was collected in a collaborative effort between Michigan State University and the Advanced Telecommunications Research Institute International (ATR), Kyoto, Japan, using three networked cameras. The MSU 2D-3D face database was collected at Michigan State University using the proposed "face image capture system at a distance" and the 3D Minolta laser scanner. FG-NET [4] and MORPH [101] are databases for studying facial aging. The FERET database [89] [91] was collected by NIST and includes 14,126 images from 1,199 subjects; it is used for the facial mark study. We used both public domain face matchers [119] [62] and a commercial face recognition engine [9] to demonstrate that face recognition performance is improved by the approaches developed in this thesis.

Table A.1. Databases used for various problems addressed in the thesis.

Problem                           Database                      #Subjects   Image size
View based face recognition       Face in Action (FIA) [37]     221         640x480
View synthetic face recognition   Face in Action (FIA) [37]     221         640x480
ViSE surveillance                 MSU-ATR database              10          320x240
Face recognition at a distance    MSU 2D-3D face database       12          640x480
Facial aging                      FG-NET [4]                    82          311x377 ~ 639x772
Facial aging                      MORPH [101]                   612         400x500
Facial aging                      3D morphable faces [16]       100¹        not applicable²
Facial marks                      FERET                         1199        512x768

¹ The morphable model is constructed based on 100 subjects.
² The 3D morphable model can be captured as a 2D image in various sizes depending on the camera projection matrix, the distance between the camera and the face, and the zooming option.

BIBLIOGRAPHY

Bibliography

[1] ZDNET Definition, http://dictionary.zdnet.com/definition/information+security.html.
[2] O’REILLY Online Catalog, http://oreilly.com/catalog/dbnationtp/ chapter/ch03.htm1. [3] DynaVox Technology, http://m.dynavoxtech. com. [4] FG—NET Aging Database, http://www.fgnet.rsunit.com. [5] http : //www . youtube . com/vat ch?v=uLqu pHVPhM. [6] Open Computer Vision Library, http://sourceforge.net/projects/ opencvlibrary. [7] Neven Vision, fR SDK, http://neven-vision-s-fr-sdk.software. informer. com/. [8] L—1 Identity Solutions, httpz/ /www.1iid.com. [9] FaceVACS Software Developer Kit, Cognitec Systems GmbH, httpzl/www. cognitec-systems.de. [10] G. Aggarwal, A. K. Roy-Chowdhury, and R. Chellappa. A system identification approach for video-based face recognition. In Proc. International Conference on Pattern Recognition, volume 4, pages 175—178, 2004. [11] N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. IEEE Transactions on Computers, pages 90—93, 1974. [12] A. M. Albert, K. Ricanek, and E. K. Patterson. The aging adult skull and face: A review of the literature and report on factors and processes of change. university of north carolina at wilrnington, Technical Report, WRG FSC—A, 2004. [13] C. Anderson, P. Burt, and G. van der Wal. Change detection and tracking using pyramid transformation techniques. In Proc. SPIE - Intelligent Robots and Computer Vision, volume 579, pages 72-78, 1985. 134 [14] S. Area, P. Campadelli, and R. Lanzarotti. A face recognition system based on local feature analysis. In Proc. Audio- and Video-Based Biometric Person Authentication, pages 182—189, 2003. [15] D. Beymer and T. Poggio. Face recognition from one example view. In Proc. IEEE International Conference on Computer Vision, pages 500—507, 1995. [16] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proc. Computer Graphics and Interactive Techniques, pages 187—194, 1999. [17] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063— 1074, 2003. [18] C. J. Bradley. The Algebra of Geometry: Cartesian, Areal and Projective Co- ordinates. Bath: Highperception, 2007. [19] M. Brand. A direct method for 3d factorization of nonrigid motion observation in 2d. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 122-128, 2005. [20] X. Chai, S. Shan, X. Chen, and W. Gao. Local linear regression (LLR) for pose invariant face recognition. In Proc. Automatic Modeling of Face and Gesture, pages 631—636, 2006. [21] R. Chellappa, C. L. Wilson, and S. Sirohey. Human and machine recognition of faces: A survey. Proc. IEEE, 83(5):705—740, 1995. [22] T. Chen, Y. Wotac, S. Z. Xiang, D. Comaniciu, and T. S. Huang. Total variation models for variable lighting face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1519—1524, 2006. [23] C. C. Chibelushi and F. Bourel. Facial expression recognition: A brief tutorial overview. Pattern Recognition, 25(1):65—77, 2002. [24] A. Roy Chowdhury and R. Chellappa. Face reconstruction from monocular video using uncertainty analysis and a generic model. Computer Vision and Image Understanding, 91(1-2):188—213, 2003. [25] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conference on Computer Vision, volume 2, pages 484—498, 1998. [26] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681— 685, 2001. [27] T. F. Cootes, K. Walker, and C. J. Taylor. 
View-based active appearance models. In Proc. Automatic Face and Gesture Recognition, pages 227—232, 2000. 135 2“: ‘15» '3..’. F". r’ :,=..:." ! [28] I. Craw, D. Tock, and A. Bennett. Finding face features. In Proc. European Conference on Computer Vision, pages 92—96, 1992. [29] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati. Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1337—1343, 2003. [30] A. R. Dick and M. J. Brooks. Issues in automated visual surveillance. In Proc. VIIth Digital Image Comp. Tech. and App, pages 195—204, Dec. 2003. [31] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis (2nd ed.) John Wiley and Sons, 1995. [32] L. G. Farkas, editor. Anthropometry of the Head and Face. Lippincott Williams & Wilkins, 1994. [33] R. A. Fisher. The statistical utilization of multiple measurements. Annals of Eugenics, 8:376-386, 1938. [34] X. Geng and Z.-H. Zhou. Image region selection and ensemble for face recogni— tion. Journal of Computer Science Technology, 21(1):116—125, 2006. [35] X. Geng, Z.—H. Zhou, and K. Smith-Miles. Automatic age estimation based on facial aging patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2234—2240, 2007. [36] A. Georghiadas, P. N. Belhumeur, and D. Kriegman. From few to many: Gen- erative models for recognition under variable pose and illumination. In Proc. Automatic Face and Gesture Recognition, pages 277—284, 2000. [37] J. Rodney Goh, L. Liu, X. Liu, and T. Chen. The CMU face in action (FIA) database. In Proc. Automatic Modeling of Face and Gesture, pages 255—263, 2005. [38] R. Gottumukkal and V. K. Asari. An improved face recognition technique based on modular pca approach. Pattern Recognition Letters, 25(4):429—436, 2004. [39] L. Grafakos. Classical and Modern Fourier Analysis. Prentice—Hall, 2004. [40] Ralph Gross, Iain Matthews, and Simon Baker. Active Appearance Models with Occlusion, 24(1):593—604, 2006. [41] M. Grudin. On internal representation in face recognition systems. Pattern Recognition, 33(7):1161—-1177, 2000. [42] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. 136 [43] B. Heisele, P. Ho, J. Wu, and T. Poggio. Face recognition: component-based versus global approaches. Computer Vision and Image Understanding, 91(1):6— 21, 2003. [44] B. Heisele, T. Serre, M. Pontil, and T. Poggio. Component-based face detec- tion. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 657—662, 2001. [45] B.K.P Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981. [46] K. Hotta. Robust face recognition under partial occlusion based on support vector machine with local gaussian summation kernel. Image and Vision Com- puting, 26(11):1490—1498, 2008. [47] R.-L. Hsu, Mohamed Abdel-Mottaleb, and A. K. Jain. Face detection in color images IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):696—706, 2002. [48] Y. Z. Hsu, H. H. Nagel, and G. Rekers. New likelihood test methods for change detection in image sequences. Computer Vision, Graphics and Image Process- ing, 26(1):73—106, 1984. ' [49] A. K. Jain, S. C. Dass, and K. Nandakumar. Soft biometric traits for personal recognition systems. In Proc. International Conference on Biometric Authen- tication, pages 731—738, 2004. [50] A. K. Jain, K. Nandakumar, and A. Ross. Score normalization in multimodal biometric systems. 
Pattern Recognition, 38(12):2270—2285, December 2005. [51] O. Javed, Z. Rasheed, O. Alatas, and M. Shah. Knightm: A real-time surveil- lance system for multiple overlapping and non-overlapping cameras. In Proc. International Conference on Multimedia and Expo, pages 649—652, July 2003. [52] I. A. Kakadiaris, G. Passalis, G. Toderici, N. Murtuza, Y. Lu, N. Karampatzi- akis, and T. Theoharis. Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 29(4):640—649, 2007. [53] D. Keren, S. Peleg, and R. Brada. Image sequence enhancement for super- resolution image sequence enhancement. In Proc. IEEE Conference on Com- puter Vision and Pattern Recognition, pages 742—746, 1988. [54] T.-K. Kim, H. Kim, W. Hwang, and J. Kittler. Component-based LDA face description for image retrieval and MPEG-7 standardisation. Image and Vision Computing, 23(7):631—642, 2005. 137 ... WWW-(“WVGV =“ 3:5ng [55] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103—108, 1990. [56] K. M. Lam and H. Yan. An analytic-to—holistic approach for face recognition based on a single frontal view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(7):673—686, 1998. [57] A. Lanitis, C. Draganova, and C. Christodoulou. Comparing different clas- sifiers for automatic age estimation. IEEE Transactions Systems, Man, and Cybernetics, Part B, SMC-B, 34(1):621—628, February 2004. [58] A. Lanitis, C. J. Taylor, and T. F. Cootes. Toward automatic simulation of aging effects on face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):442—455, 2002. [59] K. Lee, J. Ho, M. Yang, and D. Kriegman. Video-based face recognition using probabilistic appearance manifolds. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 313—320, 2003. [60] M. W. Lee and S. Ranganath. Pose-invariant face recognition using a 3d de- formable model. Pattern Recognition, 36:1835—1846, 2003. [61] K. Levi and Y. Weiss. Learning object detection from a small number of exam- ples: the importance of good features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 53—60, 2002. [62] J. P. Lewis. Fast normalized cross-correlation. Vision Interface, pages 120—123, 1995. [63] S. Z. Li and A. K. Jain (eds). Handbook of Face Recognition. Springer-Verlag, Secaucus, NJ, 2005. [64] D. Lin and X. Tang. From macrocosm to microcosm. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1355—1362, 2006. [65] T. Lindberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79—116, 1998. [66] H. Ling, S. Soatto, N. Ramanathan, and D. Jacobs. A study of face recognition as people age. In Proc. IEEE International Conference on Computer Vision, pages 1—8, 2007. [67] R. Liu, X. Gao, R. Chu, X. Zhu, and S. Z. Li. Tracking and recognition of multiple faces at distances. In Proc. International Conference on Biometrics, pages 513-522, 2007. [68] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov models. In Proc. IEEE Conference on Computer Vision and Pattern Recogni- tion, volume 1, pages 340—345, 2003. 138 [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] D. G. Lowe. Distinctive image features from scale invariant keypoints. 
International Journal of Computer Vision, 60(2):91-110, 2004.
[70] X. Lu and A. K. Jain. Deformation modeling for robust 3D face matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1346-1357, 2008.
[71] X. Lu, A. K. Jain, and D. Colbry. Matching 2.5D face scans to 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(1):31-43, 2006.
[72] B. S. Manjunath and R. Chellappa. A feature based approach to face recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 373-378, 1992.
[73] A. M. Martinez. Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):748-763, 2002.
[74] T. Maurer, D. Guigonis, I. Maslov, B. Pesenti, A. Tsaregorodtsev, D. West, and G. Medioni. Performance of Geometrix ActiveID™ 3D face recognition engine on the FRGC data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 154-160, 2005.
[75] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre. XM2VTSDB: The extended M2VTS database. In Proc. Audio- and Video-Based Biometric Person Authentication, pages 72-77, 1999.
[76] H. H. Nagel. Image sequence - ten (octal) years - from phenomenology towards a theoretical foundation. In Proc. International Conference on Pattern Recognition, pages 1174-1185, 1987.
[77] K. Nandakumar, Y. Chen, S. C. Dass, and A. K. Jain. Likelihood ratio based biometric score fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):342-347, 2008.
[78] A. J. O'Toole, T. Vetter, H. Volz, and E. M. Salter. Three-dimensional caricatures of human heads: distinctiveness and the perception of facial age. Perception, 26:719-732, 1997.
[79] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[80] U. Park and A. K. Jain. 3D model-based face recognition in video. In Proc. International Conference on Biometrics, pages 1085-1094, 2007.
[81] U. Park, A. K. Jain, and A. Ross. Face recognition in video: Adaptive fusion of multiple matchers. In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshop on Biometrics, pages 1-8, 2007.
[82] U. Park, Y. Tong, and A. K. Jain. Face recognition with temporal invariance: A 3D aging model. In Proc. Automatic Face and Gesture Recognition, pages 1-7, 2008.
[83] F. I. Parke. Computer generated animation of faces. In Proc. ACM Annual Conference, pages 451-457, 1972.
[84] E. Patterson, K. Ricanek, M. Albert, and E. Boone. Automatic representation of adult aging in facial images. In Proc. 6th International Conference on Visualization, Imaging, and Image Processing, IASTED, pages 171-176, 2006.
[85] E. Patterson, A. Sethuram, M. Albert, K. Ricanek, and M. King. Aspects of age variation in facial morphology affecting biometrics. In Proc. IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS), pages 1-6, 2007.
[86] I. Pavlidis, V. Morellas, P. Tsiamyrtzis, and S. Harp. Urban surveillance systems: from the laboratory to the commercial world. In Proc. IEEE, volume 89(10), pages 1478-1497, 2001.
[87] P. S. Penev and J. J. Atick. Local feature analysis: a general statistical theory for object representation. Network: Computation in Neural Systems, 7:477-500, 1996.
[88] A. Pentland, B. Moghaddam, and T. Starner.
View-based and modular eigenspace for face recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 84-91, 1994.
[89] J. Phillips, H. Wechsler, J. S. Huang, and P. J. Rauss. The FERET database and evaluation procedure for face recognition algorithms. Image and Vision Computing, 16(5):295-306, 1998.
[90] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and M. Bone. Face Recognition Vendor Test 2002: Evaluation Report. Tech. Report NISTIR 6965, NIST, 2003.
[91] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1090-1104, 2000.
[92] P. J. Phillips, W. T. Scruggs, A. J. O'Toole, P. J. Flynn, K. W. Bowyer, C. L. Schott, and M. Sharpe. Face Recognition Vendor Test 2006: FRVT 2006 and ICE 2006 Large-Scale Results. Tech. Report NISTIR 7408, NIST, 2007.
[93] J. S. Pierrard and T. Vetter. Skin detail analysis for face recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.
[94] F. Pighin, R. Szeliski, and D. H. Salesin. Modeling and animating realistic faces from images. International Journal of Computer Vision, 50(2):143-169, 2002.
[95] J. B. Pittenger and R. E. Shaw. Aging faces as viscal-elastic events: Implications for a theory of nonrigid shape perception. Journal of Experimental Psychology: Human Perception and Performance, 1:374-382, 1975.
[96] G. Porter and G. Doran. An anatomical and photographic technique for forensic facial identification. Forensic Science International, 114:97-105, 2000.
[97] L. Qing, S. Shan, X. Chen, and W. Gao. Face recognition under varying lighting based on the probabilistic model of Gabor phase. In Proc. International Conference on Pattern Recognition, pages 1139-1142, 2006.
[98] R. Gross, I. Matthews, and S. Baker. Generic vs. person specific active appearance models. Image and Vision Computing, 23(1):1080-1093, 2005.
[99] N. Ramanathan and R. Chellappa. Face verification across age progression. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 462-469, 2005.
[100] N. Ramanathan and R. Chellappa. Modeling age progression in young faces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 387-394, 2006.
[101] K. J. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Proc. Automatic Face and Gesture Recognition, pages 341-345, 2006.
[102] S. Romdhani, T. Vetter, J. Ho, and D. J. Kriegman. Face recognition using 3-D models: Pose and illumination. In Proc. IEEE, volume 94, pages 1977-1999, 2006.
[103] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.
[104] K. Scherbaum, M. Sunkel, H.-P. Seidel, and V. Blanz. Prediction of individual non-linear aging trajectories of faces. Computer Graphics Forum, 26(3):285-294, 2007.
[105] L. G. Shapiro and G. C. Stockman. Computer Vision. New Jersey: Prentice Hall, 2001.
[106] A. Shio and J. Sklansky. Segmentation of people in motion. In Proc. IEEE Workshop on Visual Motion, pages 325-332, 1991.
[107] N. A. Spaun. Forensic biometrics from images and video at the Federal Bureau of Investigation. In Proc. IEEE International Conference on Biometrics: Theory, Applications, and Systems, pages 1-3, 2007.
[108] J. Stallkamp, H. K. Ekenel, and R. Stiefelhagen.
Video-based face recognition on real-world data. In Proc. IEEE International Conference on Computer Vision, pages 1-8, 2007.
[109] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 246-252, 1999.
[110] M. B. Stegmann. The AAM-API: An open source active appearance model implementation. In Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 951-952, 2003.
[111] S. Stillman, R. Tanawongsuwan, and I. Essa. A system for tracking and recognizing multiple people with multiple cameras. In Proc. International Conference on Audio and Video-Based Biometric Person Authentication, pages 96-101, 1999.
[112] J. Suo, F. Min, S. Zhu, S. Shan, and X. Chen. A multi-resolution dynamic model for face aging simulation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.
[113] T. Ahonen, A. Hadid, and M. Pietikainen. Face recognition with local binary patterns. In Proc. European Conference on Computer Vision, pages 469-481, 2004.
[114] K. Tan and S. Chen. Adaptively weighted sub-pattern PCA for face recognition. Neurocomputing, 64:505-511, 2005.
[115] D. W. Thompson. On Growth and Form. New York: Dover, 1992.
[116] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2):137-154, 1992.
[117] J. Tu, T. Huang, and H. Tao. Accurate head pose tracking in low resolution video. In Proc. Automatic Face and Gesture Recognition, pages 573-578, 2006.
[118] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3:72-86, 1991.
[119] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 586-591, 1991.
[120] S. Ullman. The Interpretation of Visual Motion. MIT Press, Cambridge, 1979.
[121] C. J. van Rijsbergen. Information Retrieval (2nd ed.). London: Butterworths, 1979.
[122] N. Vaswani and R. Chellappa. Principal components null space analysis for image and video classification. IEEE Transactions on Image Processing, 15(7):1816-1830, 2006.
[123] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.
[124] J. Wang, Y. Shang, G. Su, and X. Lin. Age simulation for face recognition. In Proc. International Conference on Pattern Recognition, pages 913-916, 2006.
[125] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, Department of Computer Science, Univ. of North Carolina, 2003.
[126] L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):775-779, 1997.
[127] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg. Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):369-382, 2008.
[128] Y.-L. Wu, L. Jiao, G. Wu, E. Y. Chang, and Y.-F. Wang. Invariant feature extraction and biased statistical inference for video surveillance. In Proc. IEEE Conference on Advanced Video and Signal Based Surveillance, pages 284-289, 2003.
[129] J. Xiao, J. Chai, and T. Kanade. A closed-form solution to non-rigid shape and motion recovery. In Proc.
European Conference on Computer Vision, pages 668-675, 2004.
[130] J. Xiao, S. Baker, I. Matthews, and T. Kanade. Real-time combined 2D+3D active appearance models. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 535-542, 2004.
[131] G. Yang and T. S. Huang. Human face detection in a scene. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 453-458, 1993.
[132] M.-H. Yang, D. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002.
[133] W. Zhang, S. Shan, W. Gao, X. Chen, and H. Zhang. Local Gabor binary pattern histogram sequence (LGBPHS): A novel non-statistical model for face representation and recognition. In Proc. IEEE International Conference on Computer Vision, pages 786-791, 2005.
[134] W. Zhao and R. Chellappa. Robust face recognition using symmetric shape-from-shading. Technical Report, Center for Automation Research, University of Maryland, 1999.
[135] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399-458, 2003.
[136] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recognition of human faces from video. Computer Vision and Image Understanding, 91:214-245, 2003.