This is to certify that the dissertation entitled Human Face Verification by Robust 3D Surface Alignment, presented by Dirk Joel Luchini Colbry, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science and Engineering.

Major Professor's Signature

Date

HUMAN FACE VERIFICATION BY ROBUST 3D SURFACE ALIGNMENT

By

Dirk Joel Luchini Colbry

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Computer Science

2006

ABSTRACT

HUMAN FACE VERIFICATION BY ROBUST 3D SURFACE ALIGNMENT

By Dirk Joel Luchini Colbry

Traditional 2D face recognition systems are not tolerant to changes in pose, lighting and expression. This dissertation explores the use of 3D data to improve face recognition by accounting for these variations. A two-step, fully automatic, 3D surface alignment algorithm is developed to correlate the surfaces of two 3D face scans. In the first step, key anchor points such as the tip of the nose are used to coarsely align two face scans. In the second step, the Iterative Closest Point (ICP) algorithm is used to finely align the scans. The quality of the face alignment is studied in depth using a Surface Alignment Measure (SAM). The SAM is the root mean squared error over all the control points used in the ICP algorithm, after trimming to account for noise in the data. This alignment algorithm is fast (<2 seconds on a 3.2 GHz P4) and robust to noise in the data (<10% spike noise). Extensive experiments were conducted to show that the alignment algorithm can tolerate up to 15° of variation in pose due to roll and pitch, and 30° of variation in yaw. It is shown that this level of pose tolerance easily covers the normal pose variation of a database of over 275 cooperative subjects. By using the SAM as an initial matching score and automatically rejecting poor quality scans, an equal error rate of 1.2% is achieved on a database of 943 scans from 275 subjects. This surface alignment verification system is fast, fully automatic, and requires no additional training. By using the 3D face surface alignment algorithm, a Canonical Face Depth Map (CFDM) is defined to allow for automatic preprocessing of 3D face scans. The CFDM is shown to help in anchor point localization, reduce model memory requirements, and enable other algorithms, such as PCA or correlation, to be applied.

To Kate, my wife and my best friend.

ACKNOWLEDGMENTS

Thanks to George Stockman for always finding time to provide advice, encouragement and support as I navigated the Ph.D. Thanks to Anil Jain for inspiring and funding my initial research in 3D faces, and for his guidance throughout the dissertation process. Thanks to Frank Biocca and Hayder Radha for providing feedback and perspective on the dissertation as members of my committee.
Thanks to all of the members of the PRIP lab for participating in my research projects and for your friendship and support. Thanks to the students who helped with programming and data collection, in- cluding Brian Hasselbeck, Rayshawn Holbrook, Greg Heil, Charles Otto, Jason Payne, Jermil Sadler, and Thaddeus Walker. Thanks to the NSF and the MSU IGERT program for funding my research, and to my fellow IGERT students for their friendship and support. Thanks to my extended family in East Lansing - Karl, Angie, Sienna, Krissy, Steve, Teresa and all of the folks at St. John - for the dinners, games and loving friendship over the years. Thanks to my wife, Katy, who worked as hard on this dissertation as I did, and who will deliver our first child about the same time I receive this degree. Thanks to my entire family for their love and support, especially Alberta, Norm, Tricia, Michael, Alonzo, Zachary, John, Tamara, Katy (and Baby), Mark, Dolly, Nan, Tim, and Frank. iv TABLE OF CONTENTS LIST OF TABLES .............................. viii LIST OF FIGURES ............................. xi 1 Introduction to the Problem ...................... 1 1.1 Background and Motivation for the Face Biometric ............ 3 1.2 Project and Document Overview ...................... 7 2 Background ................................ 10 2.1 Existing Work ................................. 10 2.1.1 Face Recognition .............................. 11 2.1.2 3D Face Processing ............................. 16 2.2 Coordinate System Transformations ..................... 21 2.3 Sensor Selection ................................ 27 2.4 Existing 3D Modeling Methods ....................... 28 2.4.1 Point Clouds ................................ 31 2.4.2 Polygonal Models .............................. 31 2.4.3 2.5D Point Matrix ............................. 33 2.4.4 Depth Map ................................. 34 2.4.5 Structured Polygonal Models ....................... 34 2.4.6 Face Mesh .................................. 35 2.4.7 Cylindrical Point Matrix .......................... 36 2.5 Data Collection ................................ 41 2.5.1 Michigan State Dataset .......................... 42 2.5.2 Face Recognition Grand Challenge Datasets ............... 44 2.5.3 Ongoing Work in 3D Scanners ....................... 44 3 Solution Approach ............................ 48 3.1 Surface Fitting ................................ 48 3.1.1 Curvature Calculation ........................... 50 3.1.2 Surface Curvature and Normal Directions ................ 52 3.2 Shape Index .................................. 54 3.3 Scanner Resolution and Scale ........................ 55 3.4 Image Cleaning ................................ 59 3.5 Surface Alignment .............................. 60 3.5.1 Single Point Rigid Alignment ....................... 60 3.5.2 Three Point Rigid Alignment ....................... 61 3.5.3 Iterative Methods .............................. 63 3.6 Orthogonal Depth Maps ........................... 65 3.6.1 Virtual Scans ................................ 67 3.6.2 Super Sampling ............................... 68 3.7 Summary ................................... 68 4 3D Anchor Point Detection ....................... 70 4.1 Problem Description ............................. 71 4.2 Background .................................. 74 4.3 Frontal Anchor Point Detection in 3D ................... 77 4.3.1 Experiments and Results .......................... 79 4.4 General Pose Anchor Point Detection in 3D ................ 
83 4.4.1 Candidate Point Selection ......................... 84 4.4.2 Relaxation / Search Algorithm ...................... 87 4.4.3 Experiments and Results .......................... 94 5 3D Face Alignment ............................ 100 5.1 Background .................................. 102 5.2 Face Alignment Algorithm ......................... 103 5.2.1 Control Point Region Selection ...................... 104 5.2.2 Number of Iterations ............................ 107 5.2.3 Distance Threshold ............................. 108 5.2.4 Trimming Study .............................. 108 5.2.5 Scan Swap ................................. 110 5.2.6 Automatic Reject Option ......................... 112 5.3 System Error Analysis ............................ 113 5.3.1 Color and Lighting ............................. 114 5.3.2 Optimal Conditions Baseline SAM .................... 115 5.3.3 Pose Variation ............................... 117 5.3.4 Change in Stand Off Distance ....................... 120 5.4 Large Dataset Experiments ......................... 121 5.5 Arbitrary Pose ................................ 125 6 Toward a Canonical Face Format ................... 131 6.1 Background .................................. 133 6.2 F ace—Based Coordinate System ....................... 135 6.2.1 Anchor Point Alignment .......................... 136 6.2.2 Face Model Normalization ......................... 137 6.2.3 Defining a Canonical Coordinate System with Global Features ..... 139 6.3 Canonical Face Depth Map ......................... 146 6.4 Robustness Analysis ............................. 148 6.5 Properties of Canonical Face Depth Maps ................. 154 6.6 Preprocessing using Canonical Face Depth Maps .............. 159 6.6.1 Correlation ................................. 160 6.6.2 PCA Analysis ................................ 164 vi 7 Research Platform ............................ 168 7.1 Program History ............................... 168 7.2 Demo Interface ................................ 171 7.2.1 VIVID 910 Operation ........................... 174 7.2.2 Matching Algorithm ............................ 175 7.2.3 Data Visualization ............................. 176 7.3 Performance .................................. 179 8 Conclusions, Discussion and Future Work .............. 180 8.1 Discussion and Synthesis ........................... 185 8.1.1 Anchor Point Detection and Evaluation ................. 185 8.1.2 Practical 3D Verification System ..................... 192 8.1.3 Matching Score Evaluation for Face Verification ............. 196 8.2 Directions for Future Work ......................... 200 APPENDICES ................................ 203 A Programing Interface .......................... 204 A.1 File Formats .................................. 204 A2 Programming Tools .............................. 206 A3 Algorithm Components and Required SDKS ................ 210 AA Cross Platform Computing .......................... 212 BIBLIOGRAPHY .............................. 215 vii 1.1 1.2 2.1 2.2 2.3 2.4 2.5 2.6 3.1 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 LIST OF TABLES Outline of major thesis contributions ............ ‘ ......... Utilizing 3D information ........................... Design space for 2D and 3D face recognition systems ............ Calculating the Euler angles from a rotation matrix. ........... 3D model data structure comparison ..................... Operation of the Vivid 910 scanner (time in seconds). .......... F RGC database. ............................... 
Comparison between the F RGC and MSU datasets. ........... Step values for standard databases ...................... Automatic face anchor point detection literature overview. ........ Frontal pose anchor point detection algorithm ................ Statistical model of the face. Each set of distances are calculated from point A to point B only in the Specified direction. ........... Experimental results for frontal anchor point detector. .......... Experimental results for successful frontal anchor point detector ...... Methods for finding candidate nose points .................. Initial relaxation matrix ............................ Relaxation matrix after first relaxation .................... Relaxation matrix after selecting the first point ............... 4.10 Relaxation matrix after the second relaxation step. ............ 4.11 Relaxation matrix after the second point selection. ............ viii 13 26 30 57 75 77 78 80 83 87 89 91 92 4.12 Arbitrary pose anchor point detection algorithm. ............. 4.13 Attribute population size success rate. ................... 5.1 5.2 5.3 5.4 5.5 6.1 6.2 6.3 6.4 6.5 7.1 7.2 List of design parameters used by the two step face alignment algorithm described in this chapter. The default values were initially determined by program evaluation and then verified using the experiments dis- cussed in the listed sections. ...................... Differences between valid scans in the database. .............. Number of failed tests out of 63 scans due to recognition in the first test. The number of iterations (10,10) represents a total of 10 iterations for Besl’s ICP scheme and 10 iterations for Chen’s scheme. ....... Second test matching error rates. ...................... Distribution of matching error from a total of 113 test scans in the second experiment. ................................ Improvement of mid-line normalization over database roll, pitch, and yaw differences. In the first row, the original roll, pitch, yaw and distance statistics represent the variations between different scans of the same subjects within the original database (this first row is taken from Table 5.2). Each subsequent row represents the variation after the different normalization techniques. The Plane, Parabolic Cylinder, Quadratic, and Average are all applied after the baseline symmetry is established. Steps for constructing a Canonical Face Depth Map. ........... Error with respect to noise .......................... Experimental Results for Frontal Anchor Point Detector before and after appling the CF DM ............................. Results of file format memory comparison. A standard smart card has 512 kilobytes of memory [66]. This is not enough memory to store the raw image of a face scan. Because the CFDM requires only one matrix to represent the 3D surface there are many formats of the CFDM that can fit onto a smart card. It is even possible to fit more than one template on a card. All file formats under the 512 kilobyte memory goal are shown in bold ................................ Algorithm processing times in seconds .................... Complete process time for the current demo of a practical verification system system. .............................. ix 104 118 128 129 129 159 179 8.1 8.2 8.3 8.4 8.5 8.6 A.1 A.2 Dissertation overview. ............................ 181 Major contributions. ............................. 184 Publications based on research in this dissertation. ............ 184 Methods for evaluating anchor points. ................... 
188 Questions that need to be addressed in order for a 3D face verification system to be considered practical ..................... 193 Matching Scores. ............................... 196 File formats used by the 3DID system. A yes or no indicates whether the format can store a particular data type. A file extension is include if the data type is stored in a separate file. ................ 207 Cross plat form functionality .......................... 213 1.1 1.2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 2.12 2.13 2.14 2.15 2.16 LIST OF FIGURES Images in this dissertation are presented in color. Reasonable variations in face images (lighting, pose, facial hair, hairstyle). 1 People do not have trouble recognizing exaggerated distinguishing fea- tures, such as Jay Leno’s chin. (Caricature published with permission from [129].) ................................ 5 A database of 3D scans was used to try to clarify the identity of the sailor in this photo. ............................... 13 Early anthropometric work by Leonardo Da Vinci. ............ 14 Example locations for anthropometric face measurements (Figure from [16]). 14 Scanner-based coordinate system ....................... 23 Object-based coordinate system. ...................... 24 Point cloud (surface only). ......................... 31 Polygonal mesh. ............................... 32 2.5D point matrix representation. Lighter colors represent larger values of x, y and z and darker colors represent smaller values. ......... 33 Single matrix depth map. .......................... 35 Structured polygonal model (image from [92]) ................ 36 Mesh models in different levels of detail (Figure from [144]). ....... 36 Example cylindrical data structure. ..................... 37 Example cylindrical data structure. ..................... 37 Model construction. Five 2.5D test scans used to construct a full 3D model. 38 Algorithm for computing the cylindrical coordinates from an arbitrary point cloud. ................................ 40 Minolta Vivid 910 2.5D scanner ........................ 41 xi 2.17 Example 2.5D image (a) 2D texture image (b) depth map (c) rendered 3D 2.18 2.19 2.20 2.21 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 model. ................................... Scanning time. ................................ FRGC version 2.0 demographics (ethnicity, age, gender). (Figure from [118]). ................................... F RGC version 2.0 sittings. (Figure from [118].) .............. Example prototype setup for the PhotonX multi-image Shape from Shading camera. .................................. Example surfaces with surface normals represented by arrows. ...... Shape index scale. A value of zero represents a spherical cup, while a value of one represents a spherical cap. This index is independent of the coordinate system. .......................... Example of the shape index over faces in different poses. Note the consis- tency of the values around areas such as the eyes and nose regardless of pose. .................................. T wo scans taken with different camera Offsets resulting in different scales for the face. ................................ Fixed window (9 x 9) shape index calculation window. Note that the shape index values are dependent on the resolution of the scan. The detail of the curvature in (a) is much more prominent than the detail shown in (b). ................................... Fixed distance shape index calculation. 
Notice that the values are more similar across scale than in the fixed window mode (i.e., notice the smoother regions on the larger face). .................. Profile view of a frontal scan from the F RGC1.0 database. Notice the large spike noise errors around the eye. ................. Rigid transformation between two sets of three corresponding points; (a) the original set of points (the red triangle is constructed from the a. points, the blue triangle is constructed from the p points); (b) The set of points after the rigid transformation of points a onto points p. . . . Visual representation of an orthogonal projection .............. Example of virtual yaw rotation. ...................... Depth maps sampled at two different resolutions. ............. xii 41 43 45 47 53 54 55 55 58 59 60 62 66 4.1 Example anchor point locations. rEye - Inside of the right eye. orEye - Outside of the right eye. lEye - Inside of the left eye. olEye - Outside of the left eye. nose - Nose tip ....................... 4.2 Simple and efficient 2D face detector using a cascade of dark-light-dark anchor point detectors [138] ........................ 4.3 Example bounding boxes produced by the statistical face model and the location of the nose tip. ......................... 4.4 Detected anchor points for example images from dataset A. ....... 4.5 Detected anchor points for example images from dataset C (these images violate the specs of the verification system). .............. 4.6 Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. ............... 4.7 Correlation of the shape space with a simple vertical mask to find the inside points of the eyes. (a) frontal view (b) a semi-profile View. The bright Spots in the scan represent the locations of the inside of the eye next to the bridge of the nose ....................... 4.8 An example shape—space scan with key anchor points identified. ..... 4.9 Example nose candidate points for three different poses. Notice that many of the points are errors, but in all of the scans at least one of the candidate points is actually the nose tip. ................ 4.10 Error histograms for each of the surface candidate points. The error repre- sents the distance from the final extracted anchor point to the manually selected point. The dotted vertical line represents the 20 mm cutoff, which 87% of the scans fall under ..................... 4.11 Example anchor point locations. The plus (+) represents the automati- cally located anchor points and the circle (0) represents the manually selected anchor points. .......................... 4.12 Percent of correctly selected anchor points versus the pose direction and facial expression. The success rate for frontal neutral faces is 99.0%. . 4.13 Trade off between the error rate and the false rejection rate as a function of modifying the rejection angle ...................... 5.1 Face alignment algorithm. .......................... 5.2 Alignment algorithm outline. ........................ 70 76 85 86 87 96 97 98 99 100 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 5.13 5.14 5.15 5.16 5.17 5.18 5.19 5.20 5.21 5.22 5.23 5.24 5.25 5.26 Control point selection regions. ....................... Results when varying control point ranges; the numbers correspond to regions I, II and II from Figure 5.3 .................... Results when varying the number of control points. ............ 
Results when varying the maximum number of iterations .......... Results when varying the distance threshold ................. Results when varying the number of points trimmed. ........... Symmetric matching performance on MSU dataset. ............ Symmetric matching performance on FRGC version 1.0 dataset ...... Symmetric matching performance on FRGC version 2.0 dataset ...... Rejection examples ............................... Example mannequin data with anchor points detected ........... Nose was not captured due to the focusing range of the Minolta scanner. Scans that are closer than 0.77m can cause problems. ......... Example scans with extreme lighting and color variations. ........ Gaussian approximation of SAM distributions ................ Pitch, yaw and roll ............................... Change in roll vs. SAM. ........................... Change in pitch vs. SAM. .......................... Change in yaw vs. SAM ............................ Change in stand off vs. SAM. ........................ ROC curves for the MSU dataset ....................... Resulting ROC curves for the FRGC 1.0 dataset. ............. Performance on FRGC 2.0 dataset when the training and testing data are both from the same semester. ...................... Performance on F RGC 2.0 dataset when the training and testing data are taken from two different semesters. ................... Performance on F RGC 2.0 dataset when the training data is from one semester and the testing data is from a later semester. ........ xiv 105 106 107 108 109 111 111 111 113 113 114 117 119 119 119 120 122 5.27 5.28 5.29 5.31 5.32 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 6.10 6.11 6.12 6.13 6.14 Example false accept SAM scores ....................... Results for matching two sets of twins. The genuine scores are in green (thin bars) and the impostors are in red (wide bars). ......... Grid of control points used by the fine alignment ICP algorithm. The grid is adapted to the pose of the test scan. The location of the grid points is determined by the location of the extracted anchor points. ..... Matching results after fine alignment. From left to right: the 2.5D test scan, the 3D model, the 3D textured model overlaid by the wire-frame of the test scan after fine alignment .................... Examples of correctly matched test scans. Notice the changes in lighting, pose and facial expression. ........................ Examples of incorrectly matched test scans. ................ Example error where the generic model and the test scan subject do not have the same sized head, resulting in multiple local minima ...... 3D model of an average face generated using a Canonical Face Depth Map. This model can be used as a generalized target model for new scans. . Profile ridge. Each point represents the largest 7. value for every row in the scan ................................... Projection profile examples. ......................... Example mirror transformation ........................ Example location of the mid-line plane relative to a face. ......... Parametric surface fitting using forced symmetrical surface fitting. . . . . Comparison of pitch standard deviation over different fitting techniques. Process for generating a Canonical Face Depth Map. ........... Example noise on the surface of the face ................... Example of holes on the surface of the face and how the angles in the canonical format change .......................... Comparison of how the holes in the face affect the CFDM angles. 
Comparison of how the holes in the face affect the CF DM origin. Comparison of how the holes in the face affect the CFDM origin relative to the :r, y and z distances. ....................... XV 124 127 127 130 138 139 141 141 142 143 145 146 148 150 151 152 153 153 6.15 6.16 6.17 6.18 6.19 6.20 6.21 6.22 6.23 6.24 6.25 7.1 7.2 7.3 (a) and (c) are scans with different camera offset and resolution, (b) and (d) show how using a depth map normalizes the resolution ....... Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represents relative the anchor point locations using the raw data and the right 155 images represents the anchor point locations after applying the CFDM. 156 Example of symmetric hole filling. Note that the color is maintained with the copied hole points. .......................... Depth maps sampled at two different resolutions using the MSU database. The higher resolution data improves ICP performace by allowing for small incremental changes during each iteration of the algorithm. The increased surface matching performs slightly better, as seen in (c). . . Example of the correlation point algorithm. ................ Points used in the correlation experiment: nose tip, nose bridge, nose base, inside of right eye, inside of left eye, chin ................. Correlation applied to the depth channel of a Canonical Face Depth Map using a large mask window. ....................... Correlation applied to the shape channel of a canonical depth map using a small mask window. .......................... Point localization using correlation. Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represents relative the anchor point locations using the raw data and the right images represents the anchor point locations after applying the CFDM .................... Preprocessing the scans for use with the FRGC PCA algorithm is a two step process. First the scan is processed into its CFDM format and then the FRGC normalization algorithm called “face2norm” masks out unwanted regions and normalizes the depth and color .......... Comparing automatic PCA with manual PCA. .............. Standard operation of the VIVID demo system. .............. Flow chart representing the operation of the prototype 3D face verification system .................................... VIVID 910 interface ............................. xvi 157 162 163 165 166 167 172 173 7.4 7.5 7.6 7.7 7.8 8.1 8.2 8.3 8.4 8.5 A.1 Matching algorithm demo visualization. .................. 175 Main image window Showing anchor point location with and without flag values. ................................... 176 Matching algorithm demo visualization. .................. 177 Surface normals with the x,y,z directions represented as red, green, and blue. .................................... 178 3D VRML viewer with color, depth, Shape and normal data representations. 178 Eye corner localization. Average face image with each of the manually selected anchor points (green) and the relative location of the auto- matically generated points (white) from the MSU 330 image dataset. 
The left images represent relative anchor point locations using the raw data, the middle images represent the anchor point locations after applying the CFDM, and the right images represent the location after correlation localization ............ 190
Point localization using correlation. Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represent the relative anchor point locations using the raw data, the middle images represent the anchor point locations after applying the CFDM, and the right images represent the anchor point locations after applying correlation. ............ 191
SAM score vs. PCA distance score on the MSU dataset. Red (dark) points represent impostors and green (light) points represent genuine ............ 198
SAM score vs. PCA distance score on the MSU dataset. Red (dark) points represent impostors, green (light) points represent genuine and blue (darkest) points represent scores with either the model or the template scan rejected based on the symmetry criteria. ............ 199
Fusion example. By fusing the nose correlation, bridge correlation and SAM score, significantly higher performance is achieved over any of the three matching scores by themselves. ............ 200
Results of experiments to compare time taken to read and write common file types. ............ 206

Chapter 1

Introduction to the Problem

This dissertation describes methods for modeling and matching the human face in three dimensions, and explores how such 3D models can improve the performance of face processing algorithms that have traditionally focused on 2D texture images. Traditional 2D face recognition systems are used to compare two images and determine whether the faces in the images are from the same person. Such 2D face recognition systems have been used successfully in limited applications such as security access [77], computer login/logout programs [43, 7] and search engines [111]. However, traditional 2D face recognition systems rely on texture (color) images that are not tolerant to changes in pose, lighting or expression (see Figure 1.1).

Figure 1.1: Reasonable variations in face images (lighting, pose, facial hair, hairstyle).

Adding 3D information can improve the performance of face recognition algorithms by accommodating some variations in lighting, pose and expression. A 3D scanner measures the global coordinates on the surface of an object shown in the 2D image. These global coordinates can be used to easily rotate and manipulate the surface in 3D, while in a 2D image the relative locations must be estimated in order to change pose, and this estimation may cause errors [93]. Researchers have successfully used 3D face data to develop algorithms for communications applications [117] and face recognition [65, 2], yet there are a number of situations where 2D and 3D facial processing algorithms are not yet sufficiently robust. This dissertation explores and improves upon the use of 3D data to make several contributions to face processing research, including:

- Designing and optimizing a fast, fully automatic method for accurately aligning faces from two different scans. This method is shown to be tolerant to changes in pose, lighting and expression.
- Experimental analysis and study of a practical face verification system for use with cooperative subjects.
- Developing a Canonical Face Depth Map (CFDM) to normalize 3D face data by creating a uniform, face-based coordinate system that allows different scans to be readily compared. The CFDM also performs well as a preprocessor for other algorithms.

The first contribution is the design of a fast, fully automatic face alignment algorithm. This automated alignment algorithm fills a gap in existing research, which is dominated by methods that use pre-aligned or manually aligned data. By developing methods for automatically aligning two face scans in real time, this dissertation research provides a foundation for building fully automatic face processing algorithms.

The second contribution is the experimental analysis and study of a practical face verification system. This system achieves ~1% Equal Error Rate with cooperative subjects and is tested thoroughly on datasets containing 943 scans (from 275 subjects) taken under different pose, lighting and expression conditions. Experiments show that this 3D face verification system is robust to many types of variations that would be expected in a real world application.

A third contribution is the development of the Canonical Face Depth Map (CFDM). The CFDM offers a standard format for all 3D face processing algorithms and can compress the raw 3D data by 90% (down to approximately 160 kilobytes). The face-based coordinate system used by the CFDM can be applied to all types of face data (including 2D texture images and 3D scans) and is designed to be identifiable across different poses and many expressions. The format of the CFDM is easily transformed into a data structure similar to the output of a 3D scanner. This allows the CFDM to be used as a preprocessor to other matching algorithms, such as a 3D version of PCA and a 3D version of correlation.

Table 1.1 summarizes the major contributions of this dissertation and indicates the portions of this dissertation that describe the contributions in detail.

Table 1.1: Outline of major thesis contributions.

1.1 Background and Motivation for the Face Biometric

The goal of biometrics is to develop methods that can take bodily measurements from two people and return a similarity score indicating whether the data are from the same person or from different people. Normally a similarity score can be used for verification (one-to-one) or for recognition (one-to-many). Examples of human biometrics include fingerprints, iris scans, voice patterns, etc. Current interest in homeland security, such as security surveillance systems [47] and biometrics [94], has rejuvenated an interest in face biometrics. Faces have advantages over other biometric modalities. Some modalities are considered socially intrusive or uncomfortable; for example, some
people are reluctant to enroll in systems that utilize fingerprints because fingerprints are commonly associated with law enforcement [52]. Other systems, such as iris scanners, are highly accurate but physically intrusive: you must place your eye close to a scanner. In contrast, faces are one of the most commonly used (and readily accepted) biometrics in the world. The majority of identification documents (such as passports, driver's licenses, etc.) use face images for identification, and almost everyone uses faces as a primary source of identification in daily life.

Humans are often able to accurately identify faces under different pose, expression or lighting conditions. For example, humans have no problem identifying people in distorted caricatures, which is a particularly hard task for computers. A cartoon artist finds distinguishing features and exaggerates them to an extreme (see Figure 1.2). This exaggeration allows people to more easily identify the subject of the cartoon, so it seems reasonable that humans are somehow encoding these differences as part of their identification metric. However, it is possible that people do not actually recognize each other as easily as is assumed. For example, most humans have difficulty recognizing or distinguishing between individuals who are not part of their familiar circle [114]. In other words, most humans find it easier to identify individuals with whom they are already familiar.

Figure 1.2: People do not have trouble recognizing exaggerated distinguishing features, such as Jay Leno's chin. (Caricature published with permission from [129].)

Another advantage of using faces is that everyone has a face, while not everyone has viable fingerprints (2% of subjects cannot be easily fingerprinted using current technology [109]). Also, if a face-based automatic biometric system breaks down due to a power outage or other hazard, security personnel can always be used as a stop-gap replacement to confirm identity by comparing the subject to a photo ID.

Although face-based biometrics have advantages, individual human faces are very similar and are difficult to distinguish via automatic face recognition algorithms. While biometric systems based on fingerprints or iris scans can assume that each subject has a different set of salient minutiae (true even for identical twins [80]), the overall structure of the face is largely the same for everyone. Although moles, birthmarks, piercings, tattoos, facial hair, eyeglasses, etc., may be considered distinguishing marks, there is no guarantee that a particular person will have one of these distinguishing marks or that the marks will not change over time. This small between-subject variability makes face recognition a particularly difficult problem in computer vision research.

The Face Recognition Vendor Test (FRVT) was designed to provide a common database for computer vision researchers to compare their face recognition algorithms. One of the findings of FRVT is that 2D face recognition systems can achieve good performance in constrained environments [119]. However, 2D approaches encounter difficulties in handling variations due to changes in head pose, lighting conditions and facial expressions [119]. Since the human face is a three-dimensional object whose 2D projection (image) is sensitive to changes in lighting, pose and expression, utilizing 3D data may improve face recognition performance.
Table 1.2 summarizes how this dissertation utilizes 3D information to partially account for changes in pose, lighting and expression. The next section overviews the structure of this dissertation and outlines the key contributions presented in each chapter.

Table 1.2: Utilizing 3D information.

Pose (Section 5.3.3): The 2D projection of a 3D object loses information, while a 3D scan of a 3D object maintains relative position information for each point in 3D space. Pose is corrected for in 3D without loss of information.

Lighting (Section 5.3.1): 2D face images depend entirely on how light reflects off the surface of the face. Thus, changes in ambient lighting can significantly alter the 2D texture (color) image. 3D structured lighting methods project a light source, and only require that the light is seen by the camera to generate valid 3D data. 3D data are demonstrated to be robust to changes in ambient lighting.

Expression (Section 5.2.1): Expression is an issue in both 2D and 3D. However, it is easier to register scans in 3D, which makes it easier to select regions of the face that do not change with expression.

1.2 Project and Document Overview

This dissertation begins with an introduction to 3D notation, detailed in Chapter 2. The discussion of notation includes a description of various coordinate systems and the different data structures used to represent 3D information. A face surface s can be represented by the relationship f(p) : R^3 → R, where p ∈ R^3 and s = {p | f(p) = 0}, together with a texture mapping g(p) : R^3 → R^3, where g(x, y, z) = [R, G, B]. A new 3D discrete data structure based on cylindrical coordinates is introduced, and Chapter 2 concludes with an overview of the face image databases collected for this dissertation.

Chapter 3 presents the mathematical approaches used in this dissertation to process 3D data. The algorithms presented in Chapter 3 include curvature calculations, issues with resolution and scale, image cleaning algorithms, basic surface alignment algorithms, orthogonal depth maps, and 3D surface fitting. A fast 2.5D ICP algorithm, linear in the number of rows of the scan matrix, is also presented in Chapter 3.

Chapter 4 describes methods for detecting anchor points on faces. Anchor point detection is a key requirement for generic face recognition, for calculating the coarse transformation for registration, and for transforming a test scan into the Canonical Face Depth Map format. A 99% success rate is achieved in finding anchor points in frontal images, and an 86% success rate is achieved in scans with variations in pose and expression. These results demonstrate the challenges in 3D face recognition under conditions of arbitrary pose and expression.

Chapter 5 discusses the two-step method used to register 3D surfaces. First, the anchor points found in Chapter 4 are used to coarsely align the face scans. Then, the Iterative Closest Point (ICP) algorithm [17] is used to finely register the two scans. Mathematically, this alignment algorithm searches for a rigid transform R that aligns two face surfaces (s1 and s2) so that they are equivalent (s1 ~ s2 if there exists a rigid transform R such that for all p ∈ s1, f2(R(p)) = 0). Detailed analysis of this two-step alignment algorithm shows it to be tolerant to variations in pose and lighting, as well as robust to changes in the algorithm parameters.
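To make the fine-registration step and its alignment error concrete, the sketch below shows one ICP-style iteration over a set of control points: establish closest-point correspondences, fit a rigid transform, apply it, and compute a trimmed root-mean-squared error of the kind the SAM is built on. This is only a minimal sketch under simplifying assumptions (a brute-force nearest-neighbor search, an SVD-based transform estimate, and a fixed 10% trim fraction); it is not the exact algorithm or parameter set evaluated in Chapter 5.

    import numpy as np

    def nearest_neighbors(src, dst):
        # Brute-force closest-point search: for each source point, the index
        # of the closest destination point under Euclidean distance.
        d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def rigid_fit(src, dst):
        # Least-squares rotation + translation mapping src onto dst (SVD method).
        mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
        H = (src - mu_s).T @ (dst - mu_d)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:      # guard against a reflection; keep a proper rotation
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = mu_d - R @ mu_s
        return R, t

    def icp_iteration(control_pts, model_pts, trim_frac=0.10):
        # One refinement step: match, fit, transform, then score the alignment.
        idx = nearest_neighbors(control_pts, model_pts)
        R, t = rigid_fit(control_pts, model_pts[idx])
        aligned = control_pts @ R.T + t
        resid = np.linalg.norm(aligned - model_pts[idx], axis=1)
        resid = np.sort(resid)[: int(len(resid) * (1.0 - trim_frac))]  # trim outliers
        sam_like_score = np.sqrt((resid ** 2).mean())  # trimmed RMS alignment error
        return aligned, sam_like_score

In the full system this step is repeated for a bounded number of iterations, and the control points are drawn only from regions of the face that are relatively stable under expression (Sections 5.2.1 and 5.2.2).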
The Surface Alignment Measure (SAM) is introduced as a measurement of the quality of the two-step alignment process, and experimental results demonstrate that the SAM score is a viable solution for a practical face verification system used with cooperative subjects.

Chapter 6 introduces face-based coordinate systems and defines the Canonical Face Depth Map (CFDM). The CFDM is defined with a face-based coordinate system that can be identified easily using different detection methods, and that generalizes over all types of human faces. Once the face-based coordinate system is identified in the 3D scan, the data are normalized (for scale and resolution) using an orthogonal depth map. This normalization enables faster and more accurate alignment and a reduction in memory requirements, and the CFDM conversion can function as a preprocessing step to a 3D version of PCA and a 3D version of correlation (note that in this dissertation the words "correlation" and "cross-correlation" are used interchangeably). The correlation algorithm is experimentally shown to achieve viable matching performance and anchor point localization.

Chapter 7 (and the Appendix) gives an overview of the software research platform developed and used in this dissertation. The research platform includes a real-time demo of a practical 3D face verification system, as well as a toolbox of 3D processing algorithms that can be easily used on multiple operating systems and platforms. The research platform presented in Chapter 7 is designed as a foundation for future research in 3D face recognition.

Chapter 8 discusses the conclusions of this dissertation and presents directions for future work.

Chapter 2

Background

Chapter 1 introduced face recognition and verification using three dimensional data. This chapter provides additional background information on face recognition research and applications, and discusses how three dimensional data are collected and represented in this dissertation. Section 2.1 overviews existing work, and Section 2.2 outlines the basic notation used in this dissertation to represent three dimensional coordinate systems. Section 2.3 describes the various sensors used to generate three dimensional data, while Section 2.4 describes the different data structures used to represent and store three dimensional data. Finally, Section 2.5 describes the datasets used in this dissertation.

2.1 Existing Work

Biometrics is literally the measurement and analysis of biological data, and in medicine the term is used to describe an analysis of biological systems [140]. In computer science, particularly in security applications, the term biometrics is commonly used to refer to a semi- or fully-automatic computer measurement of the human body for authentication [125]. The most common application of biometrics in computer science is the measurement of human data specifically for the purpose of person recognition. There are generally two modes of recognition:

- Verification - Comparing a person's biometric to a single "claimed" identity.
- Recognition - Comparing a person's biometric to a large database of biometrics to find the closest match.

Both of these modes generally use the same underlying algorithms. However, there are some subtle differences. In verification mode, it is common to have a threshold that determines whether the two biometric measurements are similar enough to be from the same person.
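The two modes of recognition can be summarized in a few lines. The sketch below is generic rather than a description of the system built in this dissertation: similarity() is a placeholder for whatever matching score is used, larger scores are assumed to mean more similar (for an error measure such as the SAM the comparison would be reversed), and the threshold would normally be established from a training database as described next.

    def verify(probe, claimed_template, similarity, threshold):
        # Verification (one-to-one): accept or reject a single claimed identity.
        return similarity(probe, claimed_template) >= threshold

    def recognize(probe, gallery, similarity):
        # Recognition (one-to-many): search the whole gallery (a dict mapping
        # subject id to enrolled template) and return the best-scoring identity.
        # No threshold or claimed identity is required, but the cost grows with
        # the size of the gallery.
        return max(gallery, key=lambda subject_id: similarity(probe, gallery[subject_id]))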
A database of training examples is generally used to establish this threshold, and in some cases a different threshold may be calculated for every person in the database [112]. The recognition mode, on the other hand, does not necessarily require a threshold or a "claimed" identity. Instead, the best match is selected from the entire database. This process can be quite time consuming, depending on the size of the database and the speed of the recognition algorithm.

There are a number of different biometrics that researchers have explored for recognition. Many of these are direct measurements of the human body and include fingerprints [79], faces [94], iris scans [38], palm prints [56], ears [145] and hand geometry [82]. Another class of measurements is indirect, based on a subject's actions, such as speech [81], signature [85], or gait [95]. Biometrics are used to control access to resources [135], for security surveillance [77] and encryption [83], and for postmortem identification [36].

2.1.1 Face Recognition

The key goal for face recognition algorithms is to determine whether two facial images are from the same individual. Normally, there are two main steps in a face recognition system: the first step is to detect the face in the image, and the second step is to process the image for a particular application. Face recognition systems are used to improve public safety; for example, current researchers are developing systems to detect human faces in video [146], to identify pedestrians at crosswalks [115], and to monitor and care for the elderly and impaired [122]. Human face recognition systems are also of interest in communications and entertainment [113], where models of the face are used to drive remote avatars and to improve video compression. All of these applications benefit from algorithms that are designed to understand and process the face efficiently.

Table 2.1 outlines the design space for 2D and 3D face recognition systems. Currently, the most common approach to face recognition uses a database of 2D faces to recognize 2D test images (upper left box in Table 2.1). Advances in 3D imaging technology make it possible to consider face recognition in three dimensions as well. 3D information inherently makes a face recognition system more robust to changes in pose and expression, and this dissertation focuses on systems that match 3D input to 3D data (lower right box in Table 2.1). Because 2D cameras are so affordable, future research may consider matching 2D surveillance images to a 3D database (upper right box in Table 2.1). Another, more limited, application is to have a database of 2D images and use a 3D scan to help verify a person within the images [3] (lower left box in Table 2.1). For example, the Mitsubishi Electric Research Laboratories [89] used 3D scanner technology to try to identify the subject of the famous image of a sailor celebrating his return from World War II with a kiss (see Figure 2.1). In this example, each person who claimed to be the sailor had a 3D scan taken, which was then measured against the 2D photo.

There are three general approaches to doing face recognition in both 2D and 3D [94]:

1. Analytic methods
2. Holistic methods
3. Hybrid of analytic and holistic methods

Table 2.1: Design space for 2D and 3D face recognition systems.
Training (Enrollment) 2D / Testing (Verification or Identification) 2D: Solution 1 (most common)
Training (Enrollment) 2D / Testing (Verification or Identification) 3D: Solution 3
Training (Enrollment) 3D / Testing (Verification or Identification) 2D: Solution 2
Training (Enrollment) 3D / Testing (Verification or Identification) 3D: Solution 4 (this dissertation)

Figure 2.1: A database of 3D scans was used to try to clarify the identity of the sailor in this photo.

Analytic methods use knowledge about the face to measure important features on the face. For example, there is a long history of face recognition research based on the distance between sets of known anchor points on the face [102]. Researchers have explored face measurements in work that goes back to Leonardo Da Vinci (see Figure 2.2), and anthropometric face proportions are commonly studied in medicine [57, 53]. In anthropometric research, measurements are made on the surface of the face (see Figure 2.3) and compared across age, gender and ethnicity.

Figure 2.2: Early anthropometric work by Leonardo Da Vinci.

Figure 2.3: Example locations for anthropometric face measurements (Figure from [16]).

Early computer science research used anchor points for face recognition by manually selecting anchor points on 2D images, with various levels of successful recognition [86], and some recent 3D work is also based on manually selected anchor points [124]. Current research is exploring the use of automatically detected anchor points, but this work is limited to small datasets and the recognition results are susceptible to changes in pose and to noise in the data [91].

Another example of an analytic method for face recognition takes advantage of features that can be found in profile images. The measurement of profile faces for identification started with Sir Francis Galton in 1888 [63]. Galton studied different component measurements of the face and defined key feature points on the profile. Mathematically, Galton developed a method for describing irregular curves. Early applications of profile images in face recognition research relied on images taken with 2D cameras and compared using various techniques [126]. However, the addition of 3D scans allows the profile to be extracted from a wide range of scan poses. Results of these 3D experiments are mixed, and many of the existing approaches still require near frontal images in order to accurately extract the profile pose [19].

Holistic methods treat the input vector as an unknown entity and apply learning algorithms to the entire vector, regardless of any heuristic knowledge of the objects being recognized. For example, Eigenfaces [133] is a holistic approach that generates the principal components of a set of training face images and maps new face images onto these vectors to identify individuals in real time. The Eigenface algorithm is frequently cited because of its simplicity and speed. However, this algorithm requires a high level of correlation between the faces in the database in order to achieve good performance, and any changes from the training conditions (including pose, lighting and expression) will hinder performance [93].

Analytic methods are powerful because they have the potential to be pose invariant as well as flexible to minor global changes (such as lighting). However, analytic methods will break down when features are inaccurately detected. A hybrid approach combines analytic and holistic features. For example, Manjunath et al. [102] use Gabor filters to identify non-specific features.
These non-specific features tend to relate to higher-level features, such as the eyes, nose and mouth, but this result is due to the fact that these higher-level features are also good Gabor features. Hybrid methods are more trainable than analytic methods, making the hybrid methods a good compromise for face recognition research.

2.1.2 3D Face Processing

Much recent research has focused on the development of 3D face recognition technology. Some investigators have used the 3D processing component to normalize the pose and lighting of the input image in order to match the pose and lighting variations in the 2D gallery images. These mappings include simple 3D rotations, as well as more advanced morphing models that include variations in expression [21].

One of the findings of the Face Recognition Vendor Test (FRVT 2002) is that 2D face recognition systems can achieve good performance in constrained environments; however, both analytic and holistic approaches still encounter difficulties in handling variations due to head pose, lighting conditions and facial expressions [119]. Because the human face is a three-dimensional (3D) object whose 2D projection (image) is sensitive to changes in lighting, pose and expression, utilizing 3D information can improve face recognition performance [21, 33, 34, 35]. Range images captured explicitly by a 3D sensor [51, 108, 1] present both the face surface shape and texture. The 3D shape of facial surfaces represents the face structure, which is a function of anatomy rather than an artifact of external factors such as lighting, pose or makeup.

The Face Recognition Grand Challenge (FRGC) [118] was developed in part to introduce 3D face data to researchers to explore possible improvements over FRVT results. The 3D database for the FRGC was collected using structured lighting technology and has two channels of information. The first channel is a normal color image, and the second channel is a depth map with an x, y and z value for every pixel in the color image. The FRGC database contains only frontal scans taken in a controlled lighting situation and does not present enough variation to adequately test for tolerance to lighting and pose changes. This database does, however, offer some indication of the pose variation that may be expected in a real world application.

In order to take advantage of the additional information provided by 3D scans for face recognition, researchers generally focus on four tasks:

1. 3D data collection methods.
2. Face normalization and input correction.
3. Developing features for 3D face recognition.
4. Methods for merging 2D and 3D face recognition algorithms.

The first approach focuses on methods for collecting 3D data. Most 3D research relies on commercially available 3D scanners that use either structured lighting or stereo imaging. However, researchers are also looking at fitting models to image pairs. For example, Ansari and Abdel-Mottaleb [8] and Ip and Yin [72] use two orthogonal 2D images to generate a 3D model from common anchor points found in both images. Another method for generating a 3D model is to use two orthogonal 2D images [148], where one image is frontal and the second image is a profile. This approach allows the xy plane to be defined by the frontal image and the zy plane to be defined by the profile image. The key to this 3D modeling technique is that the exact image point location is needed in order to fit the general 3D model onto the specific high resolution data.
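The structured-light scans described above are 2.5D: a single co-registered pair of channels, with a color value and an (x, y, z) measurement for every pixel, plus an indication of which pixels returned no valid range reading. The following container is a minimal sketch of that structure; the class and field names are illustrative only and do not correspond to any particular scanner SDK or to the file formats discussed in the Appendix.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Scan25D:
        # One 2.5D face scan: co-registered color and range channels.
        color: np.ndarray   # (rows, cols, 3) RGB texture image
        xyz: np.ndarray     # (rows, cols, 3) 3D point measured at each pixel (mm)
        valid: np.ndarray   # (rows, cols) boolean mask, True where range data exist

        def depth_map(self):
            # The z channel alone, with invalid pixels masked out.
            z = self.xyz[:, :, 2].astype(float)
            z[~self.valid] = np.nan
            return z

        def point_cloud(self):
            # The valid pixels flattened into an (N, 3) list of 3D points.
            return self.xyz[self.valid]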
Recent methods have tried to generate a 3D model from a single 2D image by using a generic 3D model of the face and fitting the model to the 2D image. Blanz and Vetter [21] iteratively fit a 2D image to a complex 3D face model that accounts for lighting, expression and pose, but the fitting takes a long time to process. These 2D-to-3D methods are also available in commercial products, such as Animetrics [7] and CyberExtruder [50].

The second approach is to use 3D information to normalize the face data; this normalization includes pose, lighting/color and expression correction. In some algorithms [14, 124], the pose correction is determined by a set of manually selected anchor points, and the pose-corrected 2D texture and shape images are fed into an off-the-shelf recognition algorithm. These anchor points are used to put the face into a partially canonical format. Current research in the use of automatically detected anchor points is limited to small datasets, and the results are susceptible to changes in pose and to noise in the data [91]. One problem with using anchor points is that even minor errors in their detection can cause large errors in the transformations. Cartoux et al. [31] used facial profiles extracted from 3D range images for face recognition. Malassiotis and Strintzis [101] used the 3D channel to compensate for illumination in the color channel and show some improvement on a small dataset. This algorithm is fully automatic, but assumes that the nose tip is the closest point to the scanner - an assumption that cannot reliably be made in a real world application. Lu and Jain [98] and Bronstein et al. [27] present methods for modeling and normalizing expressions. However, methods that correct for expression are currently too slow for real world verification problems.

A third approach to using 3D information is to rely solely on the 3D channel and recognize the face without using the color component. This is desirable because it avoids problems with changes in lighting and color (e.g., makeup). Some feature spaces, such as curvature [131] and 3D Gabor filters [139], are at least partially pose tolerant. Other approaches assume normalized frontal scans and then apply existing 2D face detection algorithms, such as PCA [144, 25]. Xu et al. [144] use a standard sized mesh grid over the nose point to even out the feature space. Points are evenly sampled along the vertices of the grid so that there is a standard-sized feature vector, and 3D PCA is used to identify the faces in a 120 subject dataset, achieving at best around 93% accuracy.

The following list describes different feature spaces and approaches that use 3D data:

- Lee and Milios [90] segmented the range image to obtain the convex regions based on the sign of the mean and Gaussian curvatures at each point. These convex regions correspond to distinct facial features. The Extended Gaussian Image (EGI) is used to represent each convex region. A similarity metric between two regions is defined to match the facial features of the two face images.
- Gordon [67] explored face feature extraction for recognition based on depth and curvature features.
- Achermann and Bunke [4] introduced a 3D version of the partial Hausdorff distance. Experiments on a database with 240 images of 24 persons yielded nearly perfect results.
- Beumier and Acheroy [19] extracted the profiles (curves) both from depth and gray-scale images for face verification.
- Hesher et al.
[70] applied PCA to the range image, and estimated probability models for the coefficients.

• Bronstein et al. [26, 27] proposed an algorithm based on geometric invariants in an attempt to deal with facial expression variations for face recognition.

• Bellon et al. [13] proposed a Surface Inter-penetration Measure (SIM). For every point in the scan, a tangent plane is determined and a local neighborhood is analyzed. If there are local points both above and below the plane, then the point has a high SIM score. This measurement avoids many of the problems that are encountered by RMS errors. However, this calculation can take time, and these experiments were just a proof of concept, using a small number of subjects.

Each of these feature spaces has different advantages and disadvantages, and no one feature space has been shown to significantly outperform the others. Bowyer et al. [24] present a good overview of current approaches and challenges in 3D face recognition. However, it is difficult to directly compare these results since different systems are evaluated under different conditions and use datasets of varying quality.

The fourth approach to utilizing 3D data is to combine 2D and 3D recognition channels into a single classifier. Combining 2D and 3D face recognition systems has been shown to achieve better performance than each individual algorithm alone [25]. However, much of the testing of these multimodal systems does not consider the variations in pose and lighting that were the original motivation for utilizing the 3D information. Work by Chang et al. [35] demonstrated that face recognition systems based on either two-dimensional texture information or 2.5D range information have similar performance characteristics. However, they went on to show that significant improvements can be made if a system uses a combination of texture and shape information, such as applying PCA to both 2D and 3D face data.

Although some of these approaches to 3D face recognition are promising, none develop a complete, viable recognition system and all fail in at least one of the following areas:

1. The system is not fully automatic and/or is not practical for real world applications.
2. Testing is only done with small datasets and results do not generalize.
3. Testing is done on a dataset with only minor variations in lighting and pose and results do not generalize.
4. Extensive training sets are required to achieve good performance, which makes the system impractical and difficult to maintain because new training sets are required every time the application environment changes.

2.2 Coordinate System Transformations

This dissertation deals with points in 3D space (p ∈ R^3), where p is a triple with x, y, z coordinates (x, y, z ∈ R). T is the set of all R^3 → R^3 transformations and is defined as follows:

    T ≡ { T | T : R^3 → R^3 }                                            (2.1)

ℛ is the set of rigid transforms, a subset of T (ℛ ⊂ T). A rigid transform has two properties. First, it conserves distances over the standard Euclidean metric d:

    D ≡ { T ∈ T | ∀ p1, p2 : d(p1, p2) = d(T(p1), T(p2)) }               (2.2)

A rigid transform also maintains the relative handedness of the points. This means that the rigid transform can be represented by a translation and a rotation about a single axis. The class of rigid transforms does not include reflections.
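Both properties can be checked numerically. The sketch below is illustrative only (it is not code from this dissertation); it assumes NumPy, represents a candidate transform by a 3 x 3 matrix R and a translation vector t, and tests distance preservation together with the handedness property that is formalized next.

    import numpy as np

    def is_rigid(R, t, n_samples=100, tol=1e-9):
        """Numerically check the two rigid-transform properties:
        (1) pairwise distances are preserved, and
        (2) handedness is preserved (no reflection)."""
        rng = np.random.default_rng(0)
        pts = rng.normal(size=(n_samples, 3))           # random test points
        mapped = pts @ R.T + t                          # apply the candidate transform

        # Property 1: distance preservation for every pair of points.
        d_before = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        d_after = np.linalg.norm(mapped[:, None] - mapped[None, :], axis=-1)
        preserves_distance = np.allclose(d_before, d_after, atol=tol)

        # Property 2: handedness preservation, i.e. det(R) = +1 (a reflection has det = -1).
        preserves_handedness = np.isclose(np.linalg.det(R), 1.0, atol=tol)

        return preserves_distance and preserves_handedness

For example, a pure rotation about the z axis passes both checks, while flipping the sign of one axis preserves distances but fails the handedness test.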
This second property can be described using the following cross product constraint:

    H ≡ { T ∈ T | ∀ p1, p2, p3 : (p2 - p1) × (p3 - p1) = (T(p2) - T(p1)) × (T(p3) - T(p1)) }    (2.3)

All rigid transformations are the combination of both constraints represented by the sets D and H:

    ℛ ≡ D ∩ H                                                            (2.4)

An object in R^3 can be defined by a relationship of the form f : R^3 → R, where a point p is a part of object b if the following surface relationship holds:

    b ≡ { p | f(p) = 0 }                                                 (2.5)

This definition could represent any set of points in R^3; however, for this dissertation it will be assumed that this set of points falls on a surface s ∈ S that obeys the following constraint:

    S ≡ { s | ∃ f : R^3 → R }                                            (2.6)

Two surfaces (s1, s2 with relationships f1, f2) are equivalent if there exists a rigid transformation R ∈ ℛ such that for every point in s1 the transformed point is also in s2:

    s1 ~ s2  if  ∃ R1 ∈ ℛ, ∀ p ∈ s1 : f2(R1(p)) = 0                      (2.7)

Every rigid transformation has an inverse denoted by R^-1. Since s1 and s2 are equivalent, the inverse is also equivalent, i.e., s2 ~ s1 and R2 ∈ ℛ is the same as R1^-1:

    if s1 ~ s2 then R1 = R2^-1 and R2 = R1^-1                            (2.8)

This equivalence relationship will break down in real world situations. For example, faces are not rigid objects; people can grow and change their expression. Also, sensors do not provide a continuous ideal surface. Instead, 3D scanners provide a 2.5D discrete depth map estimation of the ideal face surface and may have noise or holes in the data. However, this relationship does provide some insight into the paradigm of surface alignment explored in this dissertation. The goal is to find a rigid transformation R ∈ ℛ such that the two surfaces are aligned as closely as possible. The surface relationship constraint f is represented as a distance measure. If a point is on the surface then its distance is zero; if a point is off the surface then the shortest distance from the point to the surface is returned by the surface relationship constraint. Ideally, if the two surfaces are the same then all distances will be zero. However, perfect alignment in a real world system is unlikely because of surface changes, discretization errors, sensor errors, etc.

Any three non-collinear points in R^3 can define a coordinate system. Two types of coordinate systems are considered in this dissertation: sensor-based and object-based. Any coordinate system that is based on the location of the camera and is independent of objects in the camera's field of view is a sensor-based coordinate system. For example, the scans produced by the Minolta Vivid 910 have sensor-based coordinates. The origin of the scanner-based coordinate system is located on the front plane of the scanner, with the coordinates shown in Figure 2.4.

Figure 2.4: Scanner-based coordinate system.

Within any object-based coordinate system, the coordinates are invariant with object motion and are grounded in features related to the object. For example, the coordinate system shown in Figure 2.5 is attached to the face.

Figure 2.5: Object-based coordinate system.

Normally, 3D data produced by a sensor are presented in a sensor-based coordinate system because sensors are designed independently of the objects being sensed. This object independence allows the sensors to be used with a wide range of objects. However, object-based coordinates make assumptions about the object being represented.
These assumptions can limit the variety of objects that can use a specific object-based coordinate system; however, the object-based coordinate system also enables correlation between objects of the same type. Good correlation is one of the most important prerequisites for many pattern recognition algorithms.

The mathematical representation of a surface assumes that the surface is continuous almost everywhere. However, the 3D sensor only uses a discretized sampling of this "true" surface. This sampling can have errors due to sensor noise or quantization in the data collection. A discretized 3D object is represented by a 4 x n matrix, where n is the number of points, and each point has a location (x, y, z) and a flag (with a value of one or zero). The flag value is a place holder denoting points that are deemed invalid due to scanner error, occlusion, or any other form of noise or error. Any rigid transformation R ∈ ℛ is represented by a 4 x 4 transformation matrix:

    P = [ x1  x2  ...  xn ]        R = [ rxx  rxy  rxz  tx ]
        [ y1  y2  ...  yn ]            [ ryx  ryy  ryz  ty ]
        [ z1  z2  ...  zn ]            [ rzx  rzy  rzz  tz ]
        [ f1  f2  ...  fn ]            [  0    0    0    1 ]

By multiplying the point matrix P by the transformation matrix R, the points are transformed into the new coordinate frame:

    P' = R P                                                             (2.9)

A 4 x 4 transformation matrix is used because it is simple to calculate and manipulate, and because it can represent any rigid transformation. The upper left 3 x 3 matrix R is the standard rotation matrix, and the vector t = [tx, ty, tz]^T is a translation vector. The rotation matrix R is an orthogonal rotation matrix where each row represents the new vector axes in the new coordinate system:

    R = [ rxx  rxy  rxz ]
        [ ryx  ryy  ryz ]
        [ rzx  rzy  rzz ]

The following equations give the simple rotations of angle α about the x, y and z axes:

    Rx(α) = [ 1      0        0     ]
            [ 0    cos(α)   sin(α)  ]                                    (2.10)
            [ 0   -sin(α)   cos(α)  ]

    Ry(α) = [ cos(α)   0   -sin(α) ]
            [   0      1      0    ]                                     (2.11)
            [ sin(α)   0    cos(α) ]

    Rz(α) = [  cos(α)   sin(α)   0 ]
            [ -sin(α)   cos(α)   0 ]                                     (2.12)
            [    0        0      1 ]

Any rotation can be represented as a single rotation angle α about a specific axis A. Given an arbitrary rotation matrix, this axis and rotation angle can be calculated as follows:

    α = arccos( (rxx + ryy + rzz - 1) / 2 )                              (2.13)

    A = 1/(2 sin(α)) * [ rzy - ryz ]
                       [ rxz - rzx ]
                       [ ryx - rxy ]

Also, the Euler angles (roll, pitch and yaw) can be calculated directly from a rotation matrix. These calculations are dependent on the order in which the roll, pitch and yaw are defined. For this dissertation, the angles are defined as in Table 2.2.

Table 2.2: Calculating the Euler angles from a rotation matrix.

    If Rxx = 0 and Ryx = 0 then:
        roll  = 0.0
        pitch = atan2(Rxy, Ryy)
        yaw   = π/2
    else:
        roll  = atan2(Ryx, Rxx)
        pitch = atan2(Rzy, Rzz)
        yaw   = atan2(-Rzx, sqrt(Rxx^2 + Ryx^2))

2.3 Sensor Selection

The most common sensor used in face recognition is a standard color camera, which captures a 2D (x, y) projection of an illuminated object residing in 3D space. During this 2D projection, the 3D information of the surface is mostly lost. However, 3D sensors have the potential to retain the 3D surface information (x, y, and z) of the face, which has been shown to improve recognition rates [35]. 3D systems can compare the depths of face anchor points in different scans, and the presence of depth information also allows for calculation of surface curvatures and surface normals that assist in face anchor point detection [46]. In addition to enhanced anchor point detection, 3D systems are more tolerant to adverse scanning conditions such as changes in lighting, pose, makeup, etc.
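Before turning to sensor technologies, the transformation conventions of Section 2.2 can be made concrete with a small numerical sketch. This is not code from the dissertation; it assumes NumPy, uses the homogeneous 4 x 4 form of Equation 2.9, and follows the roll/pitch/yaw ordering of Table 2.2.

    import numpy as np

    def apply_rigid(T, points):
        """Apply a 4x4 rigid transform T to an (n, 3) array of points (Equation 2.9)."""
        homog = np.hstack([points, np.ones((points.shape[0], 1))])   # n x 4 homogeneous points
        return (homog @ T.T)[:, :3]

    def euler_from_rotation(R):
        """Extract (roll, pitch, yaw) in radians from a 3x3 rotation matrix,
        using the ordering of Table 2.2 (R is indexed [row, column])."""
        if np.isclose(R[0, 0], 0.0) and np.isclose(R[1, 0], 0.0):
            roll = 0.0
            pitch = np.arctan2(R[0, 1], R[1, 1])
            yaw = np.pi / 2
        else:
            roll = np.arctan2(R[1, 0], R[0, 0])
            pitch = np.arctan2(R[2, 1], R[2, 2])
            yaw = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
        return roll, pitch, yaw

A transform built from one of the simple rotations of Equations 2.10-2.12 plus a translation round-trips through these helpers, which is a useful sanity check when implementing the alignment algorithms of Chapter 3.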
There are four major sensor technologies for calculating 3D information:

• Shape from Shading uses a standard color camera to observe the natural shading of an object. Shading provides surface normal constraints and ultimately can be used to interpolate the 3D surface. Few systems take this approach because it is difficult to generalize and has low accuracy. The most successful applications of this method use artificial lighting to simplify the lighting model that is used to determine the surface shape [151]. Section 2.5.3 discusses a shape from shading system that has been used in connection with the algorithms described in this dissertation.

• Stereo systems use two cameras that are calibrated so that depth can be calculated by triangulating disparities between the images from the two cameras [28]. Correspondence between pixels is required to perform the triangulation. Calculating this disparity can be time consuming, and the results may be ambiguous if a reference point in one image matches more than one point in the counterpart image. In the most successful stereo systems, the cameras are rigidly aligned to each other and the distance between the cameras is accurately calculated.

• Structure from Motion is similar to stereo methods. However, in structure from motion the disparity is calculated between two different frames of a video input from a single camera [132]. This method requires that there be motion between frames in order to create a disparity in the images, but the motion must be small enough for the system to find the correct correspondence between points in the frames [40]. If too much of the scene changes during motion, then the correspondence between points on a moving object may be inaccurate.

• Structured Lighting is similar to stereo, but replaces one of the cameras in the stereo rig with a known light source. The light source is projected into the scene and the camera picks up the location of the light to establish correspondence. The light can be structured so that correspondence calculations are simple and unambiguous. Most industrial scanning hardware uses some sort of structured lighting because it allows the system to be more accurate and more robust to environmental changes. The disadvantage of a structured lighting system is that it is not passive: it requires a light source that alters the environment. An example of a structured lighting system is the Minolta Vivid 910 scanner, which is the primary scanner used for data collection in this dissertation (see Section 2.5).

2.4 Existing 3D Modeling Methods

Many data structures have been developed to represent 3D information. This section examines some of the most common 3D data representations and categorizes their advantages and disadvantages using the following criteria:

• Surface / Volume Representation - This dissertation deals primarily with surface representations of the face. However, some data representations are volumetric and can represent more than just the surface of an object, such as scanners used in medical imaging (e.g., CAT scan).

• Occlusion and Blind Spots - Some surface representations are limited to specific directions or structures. These limitations can make algorithms more efficient, but can also cause holes in the data due to occlusions.

• Neighborhood and Localized Registration - Most 3D scanning hardware deals with unordered raw surface data. However, the neighborhood relationships between the points can also be useful.
The neighbors of a point are the other points in the representation that are nearby in space. A representation that has access to this extra information can improve the speed of many algorithms, including correlation (see Section 6.6.1) and surface fitting (see Section 3.1).

• Face Registration / Orientation - Most 3D scanning hardware deals with raw surface data and does not include any contextual knowledge of the objects being modeled. However, by adding contextual information about the face (such as anchor points or axis registration) to the data structure, the processing can be specialized for dealing with this contextual data.

• Memory Allocation - Data formats can also be compared based on their required memory capacity. Many modern 3D scanners measure more points than are necessary for most applications. These extra points are very close together in 3D space and may not add additional information. Storing these extra points can cause problems in applications where bandwidth or memory storage limitations are an issue.

• Extra, Invalid Data Storage - Another problem with some data formats is that they can have a rigid structure that requires information even when none is available. These data formats require extra flags to show when data are invalid.

Table 2.3 lists some common 3D data structures and outlines their pros and cons using the criteria just described. The remainder of this section describes each of these data structures in more detail, with the exception of the Canonical Face Depth Map (CFDM), which is described in Chapter 6. All of the data formats described in this section have advantages and disadvantages depending on the intended application of the data. For communications, a polygonal model is usually adequate to display a face in a realistic way. In addition, a structured polygonal model can be used to greatly reduce the size of the model and enable transmission over low bandwidths. For identification and recognition, it is important for a data format to be small enough to fit on a smart-card (512 KB [66]), and contain enough information to disambiguate individuals within a database.

Table 2.3: 3D model data structure comparison.

    Data Type                     Surface/Volume   Blind Spots   Neighbor Info   Face Reg   Memory Allocation   Invalid Data
    Point Cloud                   Volumetric       No            No              No         Small               No
    Polygonal Models              Surface          No            Yes             No         Small               No
    2.5D Point Matrix             Surface          Yes           Yes             No         Large               Yes
    Depth Map                     Surface          Yes           Yes             No         Small               Yes
    Structured Polygonal Models   Surface          No            Yes             Yes        Very Small          No
    Face Mesh                     Surface          Yes           Yes             Some       Small               Yes
    Cylindrical Point Matrix      Surface          Some          Yes             Some       Large               Yes
    CFDM (Chap. 6)                Surface          Yes           Yes             Yes        Small               Yes

2.4.1 Point Clouds

A point cloud is an unordered set of n points in space. Each point p_i is a vector that contains information about the point, such as location (x, y, z) and color (r, g, b):

    p_i = { x_i, y_i, z_i, r_i, g_i, b_i }    where 1 ≤ i ≤ n

Each point p_i is a sampling from an object in 3D space (see Figure 2.6). This set of sampled points is normally assumed to be dense enough to adequately represent the surface information, or even the volumetric information. The disadvantages of point clouds are a lack of neighborhood information and a lack of information needed to determine which points are connected.

Figure 2.6: Point cloud (surface only).

2.4.2 Polygonal Models

The polygonal model is an extension of a simple point cloud that also includes neighborhood information (see Figure 2.7). This representation of 3D objects includes two lists: a list of points and a list of polygons.
    p_i = { x_i, y_i, z_i, r_i, g_i, b_i }    where 1 ≤ i ≤ n
    poly_j = { v_1, ..., v_cj }               where 1 ≤ j ≤ m and c_j = number of vertices

The list of points p_i is the same as in the point cloud representation. In addition to this list, the polygon list of size m stores all of the polygons within the model. Each polygon (poly_j) is a list of vertices that correspond back to the original point list. The number of vertices (c_j) in a polygon can vary to represent polygons with different numbers of sides. The advantage of this structure is that the polygons represent the entire surface between the vertices. Piecewise planes can be calculated for each polygon (poly_j) by decomposing the polygon into triangles. Each triangle represents a plane with coefficients (a, b, c, d), and an infinite number of points (x, y, z) can be sampled from these planes, which allows the entire surface to be represented at any level of detail desired:

    ax + by + cz + d = 0

Many graphics cards are optimized for data in a polygonal format, and algorithms such as texture mapping and ray tracing have been optimized to work with this data format [110, 78, 44, 62]. Polygonal models require fewer points than the point cloud format because the surface planes can represent many coplanar surface points instead of needing a large sampling of points.

Figure 2.7: Polygonal mesh.

2.4.3 2.5D Point Matrix

The 2.5D point matrix is commonly used by 3D scanner manufacturers, such as in the Minolta Vivid 910. The 2.5D point matrix is similar to a point cloud. However, instead of storing the points as an unordered vector, the points are stored as a set of 2D matrices with n rows and m columns:

    point(i, j) = { x(i, j), y(i, j), z(i, j), r(i, j), g(i, j), b(i, j), flag(i, j) }    where 1 ≤ i ≤ n and 1 ≤ j ≤ m

The advantage of a 2.5D point matrix over a normal point cloud is that the point matrix is ordered such that neighboring points in the matrix are neighboring points in 3D space. The points are a 2.5D projection of the surface of the object in 3D space. For example, the Minolta Vivid 910 outputs a standard 2D 320 x 240 pixel color image. In addition, the range data is stored in a set of 320 x 240 x, y and z matrices that allow exact registration with the original color image (see Figure 2.8). One limitation of the 2.5D format is that the system only represents surface information and must include an (n x m) flag matrix to account for missing data. A flag is zero if the point is invalid and one if the point is valid. Another problem with any 2.5D representation is that the format does not allow full 3D representation of the surface due to occlusions.

Figure 2.8: 2.5D point matrix representation: (a) color image (r, g, b), followed by the x, y and z matrices. Lighter colors represent larger values of x, y and z and darker colors represent smaller values.

2.4.4 Depth Map

Instead of requiring three matrices to encode the 3D coordinates of a point cloud, the information can also be stored in a single matrix with a known scale factor. In this model, only the z matrix is used, but it is re-sampled as an orthogonal projection of the data onto a flat plane. In this way, the index values of the points encode the x and y information. The x and y resolution (step_x, step_y) and starting point (xmin, ymin) must also be stored, but if the x and y resolutions are known then the x, y, z location of a point can be determined using only a single matrix. In addition, an extremely large or extremely small value (zthresh) can be used to represent the flag:

    flag(i, j) = ( z(i, j) > zthresh )
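As an illustration of how the index values can stand in for the x and y matrices, the sketch below converts a 2.5D point matrix into a single depth-map matrix. It is not the dissertation's implementation; it assumes NumPy, a nearest-neighbor re-sampling, and hypothetical names for the stored metadata (xmin, ymin, step_x, step_y and the sentinel ZTHRESH).

    import numpy as np

    ZTHRESH = 1e6  # sentinel marking invalid cells (an assumption, not a value from the text)

    def point_matrix_to_depth_map(x, y, z, flag, step_x, step_y):
        """Re-sample a 2.5D point matrix (x, y, z, flag) onto an evenly spaced depth map.
        Returns the depth matrix plus the (xmin, ymin) origin needed to recover x and y."""
        valid = flag == 1
        xmin, ymin = x[valid].min(), y[valid].min()
        ncols = int(np.ceil((x[valid].max() - xmin) / step_x)) + 1
        nrows = int(np.ceil((y[valid].max() - ymin) / step_y)) + 1

        depth = np.full((nrows, ncols), ZTHRESH)           # start with every cell invalid
        cols = np.round((x - xmin) / step_x).astype(int)   # nearest-neighbor re-sampling
        rows = np.round((y - ymin) / step_y).astype(int)
        depth[rows[valid], cols[valid]] = z[valid]
        return depth, xmin, ymin

    def depth_map_point(depth, i, j, xmin, ymin, step_x, step_y):
        """Recover the (x, y, z) location of cell (i, j); the indices encode x and y."""
        return xmin + j * step_x, ymin + i * step_y, depth[i, j]

Only the depth matrix, the two step values and the (xmin, ymin) origin need to be stored, which is the reduction described next.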
So, information that requires four matrices in the 2.5D point matrix format can be reduced to a single matrix in the Depth Map format (see Figure 2.9). However, information will be lost because the Depth Map is even more rigid than a 2.5D Point Matrix and will only represent an orthogonal projection of one side of the surface. This is because the orthogonal projection requires re-sampling of the data. This re-sampling is not invertible and may lose some information from the original scan.

Figure 2.9: Single matrix depth map.

2.4.5 Structured Polygonal Models

Modifications to the standard polygonal model can also be made. One limitation of all of the previously described models is that none of them take into consideration the natural shape and pose of the human head. In fact, each of the representations can represent any 3D object, and the same face can be represented in an infinite number of ways, making comparisons of two different faces difficult. In computer graphics, structured polygonal models of the face are used to standardize the location of the polygons on the face (see Figure 2.10). This allows for simpler models that make it easier to map a texture onto the surface. The only information that needs to be stored is the location of the vertex points in the model. The polygonal nature of this representation makes rendering face images extremely fast in modern graphics processors. However, automatically extracting this information from an arbitrary scan can be difficult (see Chapter 4). The structured polygonal model may also provide less information about the surface structure of the face, therefore less information is available when disambiguating between subjects for identification.

Figure 2.10: Structured polygonal model (image from [92]).

2.4.6 Face Mesh

The face mesh [144] is specifically designed for registering raw scanner data. The face mesh assumes that the scans are frontal poses and registers the face using the nose tip. After registration, an evenly spaced mesh is applied to the face, and at each intersection of the mesh the surface distance is re-sampled in such a way that the result is an evenly spaced data structure. The mesh can be adjusted for different levels of detail (see Figure 2.11). An advantage of this mesh is that it standardizes the face feature vectors and closely aligns all the faces into a face-based coordinate system.

Figure 2.11: Mesh models at different levels of detail (figure from [144]).

2.4.7 Cylindrical Point Matrix

When dealing with full 3D models of the face, the data are transformed into a Cylindrical Point Matrix to represent the entire 3D surface of a face model. This format stores the Xc, Yc, Zc information in three 300 x 1000 matrices (see Figure 2.12), similar to the 2.5D format. However, these new matrices are indexed based on cylindrical coordinates instead of Cartesian coordinates, and the points are re-sampled based on the centroid of the face model (see Figure 2.13). The cylindrical coordinate system allows for a richer modeling of the face structure than the 2.5D representation because all viewpoints of the face can be encoded.

Figure 2.12: Example cylindrical data structure.

Figure 2.13: Example cylindrical data structure.

A major limitation of many 3D data structures is their inability to account for occlusions. The 2.5D scans generated by many 3D scanners only look at one side of a 3D object, and cannot represent a full 3D model of the head's surface.
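The cylindrical indexing itself is compact: each model point is expressed as an angle about the vertical axis through the model centroid and a height along that axis. The sketch below is an illustrative NumPy rendering of that indexing and is not the dissertation's code (its own pseudocode appears in Figure 2.15 below); the nose-at-π convention and the 300 x 1000 matrix size follow the text.

    import numpy as np

    def to_cylindrical_matrix(points, nrows=300, ncols=1000):
        """Resample an unordered (k, 3) point set into cylindrical matrices Xc, Yc, Zc.
        Rows index height along the vertical axis; columns index the angle about it."""
        centered = points - points.mean(axis=0)             # vertical axis through the centroid
        x, y, z = centered[:, 0], centered[:, 1], centered[:, 2]

        theta = np.mod(np.arctan2(x, -z), 2 * np.pi)        # angle about the vertical axis;
                                                            # the nose tip falls near theta = pi
        miny, sizey = y.min(), y.max() - y.min()
        rows = np.clip(((y - miny) / sizey * (nrows - 1)).astype(int), 0, nrows - 1)
        cols = np.clip((theta / (2 * np.pi) * (ncols - 1)).astype(int), 0, ncols - 1)

        Xc = np.zeros((nrows, ncols)); Yc = np.zeros_like(Xc); Zc = np.zeros_like(Xc)
        flag = np.zeros((nrows, ncols), dtype=bool)
        Xc[rows, cols], Yc[rows, cols], Zc[rows, cols] = x, y, z
        flag[rows, cols] = True                             # cells never hit remain holes
        return Xc, Yc, Zc, flag

Cells that no model point maps to are the occluded regions that the fill-in pass of Figure 2.15 interpolates.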
One notable exception to this occlusion limitation is a scanner developed by Cyberware [51]. This scanner moves a vertically mounted laser around the subject to generate a full model of the subject in a format similar to the cylindrical point matrix. The Cyberware scanner still produces occlusions around regions like the ears, nostrils and hair, but these regions are much smaller than the regions that are occluded in a 2.5D scan.

In the approach to arbitrary pose anchor point detection shown in Section 4.4, a full 3D model of the head is used as a reference frame in order to align any arbitrary pose scan with a face. Various methods were considered for producing a 3D model from 2.5D scans. Five test scans, generated with a Minolta Vivid 910 scanner, were combined. For each subject, the scans were stitched together using a commercial software package called Geomagic Studio [130]. The models were also cleaned up by filling holes, smoothing the surface, and deleting noisy points associated with hair and clothing. The resulting models had thousands of redundant points and polygons, so a decimate function was used to reduce the number of polygons in the model to a reasonable level (approximately 50,000). The end result was a smooth, full view of the face for each of the subjects (see Figure 2.14).

Figure 2.14: Model construction. Five 2.5D test scans used to construct a full 3D model.

Once the full 3D model of the face was generated, the models were output in 3D VRML (Virtual Reality Modeling Language) format. The 3D VRML format was chosen because most 3D editing programs can generate VRML output, so the system can be tested on face models generated using various types of scanners, software and algorithms. The VRML format is an unordered polygonal model of the face. The unordered nature of the data makes it difficult to estimate the local structure or to calculate the local surface curvature and the surface parameters (such as surface normals) at each point. The cylindrical coordinate system was developed to resample the 3D model and store the neighborhood information in a matrix similar to a 2.5D representation. However, since this representation stores the points in cylindrical coordinates, it can represent all viewpoints of the 3D model. There are places where occlusions are still a problem; however, these locations are much smaller than in the 2.5D format.

In order to generate the cylindrical model, the points in the polygonal (VRML) model need to be resampled. The polygonal model is represented as a set of k points that are vectors p_i = (x_i, y_i, z_i). The transformation is made using the algorithm shown in Figure 2.15. It is assumed that the orientation of the model is vertical, and that there is a vertical cylindrical axis passing through the centroid of the model (x', y', z'). After the vertical axis is determined, a rotational orientation about the axis is needed. The tip of the nose was chosen to be located at an angle π. Now, the matrix can be sampled such that there are nrows rows and ncols columns. For this research, nrows = 300 and ncols = 1000 were chosen, with the vertical range approximately equal to the vertical range calculated by taking the minimum and maximum y values from the model.

// Initialize Variables
ncols = 1000; nrows = 300; distanceLimit = 6;
miny = min_{i=1..k}(y_i); maxy = max_{i=1..k}(y_i); sizey = maxy - miny;
// Calculate the center of the model.
x' = (1/k) * sum(x_i); y' = (1/k) * sum(y_i); z' = (1/k) * sum(z_i)
// Transform all model points onto the Cylindrical Matrix
Loop from i = 1 to k
    x_i = x_i - x'; y_i = y_i - y'; z_i = z_i - z'
    θ = atan2(x_i, -z_i)
    h = (y_i - miny) / sizey * nrows;  a = θ / (2π) * ncols
    Xc(h, a) = x_i; Yc(h, a) = y_i; Zc(h, a) = z_i
    imc(h, a) = im_i
    Flagc(h, a) = true
end Loop
// Fill in the missing data in the Cylindrical Matrix
Loop from h = 1 to nrows
    Loop from a = 1 to ncols
        If Flagc(h, a) == false
            θ = 2π * a / ncols
            rayvector = [sin(θ), ...

Solving for the eigenvalues of Equation 3.15 returns the maximum and minimum curvatures:

    κ_{1,2} = e + g ± sqrt(e^2 + 2eg + g^2 - 4eg + f^2)                  (3.16)

This solution is equivalent to the solution presented in [58], which solves the following equation:

    κ_{1,2} = H ± sqrt(H^2 - K)                                          (3.17)

where the Gaussian curvature K is the determinant of Equation 3.15 and the mean curvature H is one-half the trace of Equation 3.15:

    κ_{1,2} = e + g ± sqrt((e - g)^2 + f^2)                              (3.18)

These eigenvalues represent the minimum and maximum curvatures on the surface:

    kmax = max(λ1, λ2)                                                   (3.19)
    kmin = min(λ1, λ2)                                                   (3.20)

3.1.2 Surface Curvature and Normal Directions

The direction of the maximum eigenvector is in the principal plane and is calculated from the Hessian matrix in Equation 3.9:

    eig1[x] = e - g + sqrt((e - g)^2 + f^2)   if e > g;   f   otherwise
    eig1[y] = f   if e < g;   -(e - g - sqrt((e - g)^2 + f^2))   otherwise          (3.21)
    eig1[z] = 0

Choose the first solution if e > g and the second solution if e < g [58]:

    eig1 = [ eig1[x] / sqrt(eig1[x]^2 + eig1[y]^2),  eig1[y] / sqrt(eig1[x]^2 + eig1[y]^2),  0 ]    (3.22)

The minimum eigenvector direction is orthogonal to the maximum eigenvector:

    eig2 = [ -eig1[y],  eig1[x],  0 ]                                    (3.23)

The direction of curvature in the original coordinate space is:

    kmax_dir = B^-1 eig1                                                 (3.24)
    kmin_dir = B^-1 eig2                                                 (3.25)

The surface normal is calculated by taking the cross product of the minimum and maximum curvature directions:

    n = kmax_dir × kmin_dir                                              (3.26)

There may be some (180°) ambiguity as to the direction of the surface normal. However, for 2.5D scans the z component of the surface normal should always be in the positive direction (if it is not, then the solution is flipped 180°). Figure 3.1 shows the surface normals calculated using the method described in this section for three surfaces.

Figure 3.1: Example surfaces with surface normals represented by arrows: (a) sphere, (b) cylinder, (c) face.

3.2 Shape Index

The local curvature information about a point is independent of the coordinate system. Dorai and Jain [54] proposed a local shape descriptor, called the Shape Index, to represent the surface curvature at each point within a 2.5D image. The Shape Index at point p is calculated using the maximum (kmax) and minimum (kmin) local curvatures. This calculation produces a shape scale between zero and one (see Figure 3.2):

    S(p) = 1/2 - (1/π) tan^-1( (kmax(p) + kmin(p)) / (kmax(p) - kmin(p)) )          (3.27)

Figure 3.2: Shape index scale, from spherical cup (0.0) through saddle (0.5) to spherical cap (1.0). A value of zero represents a spherical cup, while a value of one represents a spherical cap. This index is independent of the coordinate system.

Since the shape index is independent of the coordinate system, it is useful for finding similar points between scans from different poses. The faces shown in Figure 3.3 are examples of 2.5D face images with intensity representing the shape index. Notice that there are several consistencies between these scans of different faces; the sketch below shows how the index is computed from the curvature values above.
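This is an illustrative computation only (not the dissertation's code); it assumes NumPy and per-pixel principal curvature maps kmax and kmin obtained, for example, from the surface fitting of Section 3.1.

    import numpy as np

    def shape_index(kmax, kmin, eps=1e-12):
        """Shape index of Equation 3.27, producing a value between 0 and 1 per pixel
        along the cup-saddle-cap scale of Figure 3.2."""
        kmax = np.asarray(kmax, dtype=float)
        kmin = np.asarray(kmin, dtype=float)
        # Guard against flat (umbilic) points where kmax == kmin.
        denom = np.where(np.abs(kmax - kmin) < eps, eps, kmax - kmin)
        return 0.5 - (1.0 / np.pi) * np.arctan((kmax + kmin) / denom)

Because arctan is bounded by ±π/2, the result always stays inside the zero-to-one scale regardless of the curvature magnitudes.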
For example, the area between the eyes and the bridge of the nose is consistently rut shaped, as is the area around the mouth. These consistencies are used to help locate prominent anchor points within the face regardless of the orientation of the face in the scan.

Figure 3.3: Example of the shape index over faces in different poses. Note the consistency of the values around areas such as the eyes and nose regardless of pose.

3.3 Scanner Resolution and Scale

The resolution of the scan is the number of pixels per image, while the scale can be defined as the number of pixels per unit distance. For example, the Minolta Vivid 910 scanner produces images with a resolution of 320 x 240, while the scale depends on how close the subject is to the scanner. In the scan shown in Figure 3.4a, there are only 31 pixels between the centers of the eyes, yet the scan shown in Figure 3.4b has 70 pixels between the centers of the eyes.

Figure 3.4: Two scans taken with different camera offsets, resulting in different scales for the face: (a) far image, (b) close image.

While the resolution of a scan is constant, the scale varies depending on the distance between the subject and the scanner. The scale may not be constant within a particular scan; for example, the distance between two neighboring pixels may differ depending on how close the points represented by the pixels are to the scanner. Define the change in x between two horizontally neighboring points in a 2.5D depth map (see Section 2.4) to be the column step, and define the change in y between two vertically neighboring pixels to be the row step. Regions of the scan that have a larger offset from the scanner will have larger step values than regions that are closer to the scanner. The extra x and y coordinate matrices in 2.5D scans are needed in order to account for the variation in step within a particular scan. However, in an orthogonal depth map (see Section 3.6.1), the step size can be made constant, so there is no need to maintain the redundant data.

In the continuous form of the surface, step_x is the partial derivative of the function x(i, j) in the x direction and step_y is the partial derivative of the function y(i, j) in the y direction. In the discrete form, the step value is calculated as a matrix with size one less than the x and y matrices, and also requires a flag value (step_xf and step_yf) for areas where it is not possible to calculate the step. The step in the x direction is the difference between any two horizontally neighboring points on the 2.5D depth map in the x direction:

    1 ≤ u ≤ cols                                                         (3.28)
    1 ≤ v < rows                                                         (3.29)
    step_x(u, v) = x(u, v) - x(u, v+1)                                   (3.30)
    step_xf(u, v) = flag(u, v) | flag(u, v+1)                            (3.31)

The step in the y direction is the difference between any two vertically neighboring points on the 2.5D depth map in the y direction:

    1 ≤ u < cols                                                         (3.32)
    1 ≤ v ≤ rows                                                         (3.33)
    step_y(u, v) = y(u, v) - y(u+1, v)                                   (3.34)
    step_yf(u, v) = flag(u, v) | flag(u+1, v)                            (3.35)
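A small sketch of the step computation of Equations 3.28-3.35, and of how an average step converts a metric window size into pixels (the idea behind the fixed-distance windows discussed later in this section). It is illustrative rather than the dissertation's implementation; it assumes NumPy, matrices indexed [row, column], and the conservative reading that a step is usable only when both of its defining pixels are valid.

    import numpy as np

    def step_maps(x, y, flag):
        """Per-pixel column step (x direction) and row step (y direction) of a 2.5D scan."""
        step_x = x[:, :-1] - x[:, 1:]                  # difference between horizontal neighbors
        step_y = y[:-1, :] - y[1:, :]                  # difference between vertical neighbors
        step_xf = (flag[:, :-1] == 1) & (flag[:, 1:] == 1)   # both endpoints must be valid
        step_yf = (flag[:-1, :] == 1) & (flag[1:, :] == 1)
        return step_x, step_y, step_xf, step_yf

    def window_radius_pixels(x, y, flag, window_mm):
        """Convert a desired metric window size (in mm) into a pixel radius
        using the average valid step of the scan."""
        sx, sy, fx, fy = step_maps(x, y, flag)
        avg_step = np.mean(np.abs(np.concatenate([sx[fx], sy[fy]])))
        return max(1, int(round(window_mm / (2.0 * avg_step))))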
The average step size (both column and row) represents a general scaling factor that can be used to estimate the size of different regions in the scan. This scaling factor can speed up distance calculations on a scan when only an estimate is required. For example, the bounding box regions generated for anchor point detection in Chapter 4 are generated by using the minimum and maximum step values to ensure that the bounding boxes properly encompass the regions of interest. If the step value is not used, then all pixels would have to be searched to determine where the bounding boxes lie (which consumes processing time).

To get an idea of the range of step values that are generated, the average step values for the MSU frontal dataset and the FRGC1.0 dataset were calculated. Note that the MSU dataset was taken at half the resolution of the FRGC data, which accounts for the differences in the step values. The minimum, average, and maximum step differences represent the range of step values within a particular scan. The step difference for a scan is the difference between the minimum and maximum step values over an entire scan. The "step diff" values shown in Table 3.1 represent the minimum, average and maximum step difference over the entire database.

Table 3.1: Step values for standard databases.

                  Cols   Rows   Minimum    Average    Maximum
    MSU
      step         320    240   1.28 mm    1.29 mm    1.29 mm
      step diff                 0.00 mm    0.01 mm    0.04 mm
    FRGC1.0
      step         640    480   0.43 mm    0.80 mm    0.98 mm
      step diff                 0.00 mm    0.03 mm    0.06 mm

Scale is also a challenge when working with sliding window algorithms. When using a sliding window on images with different scales, it is important to note that the results may be different for different scans. An example is a sliding window algorithm used to calculate surface curvatures. In this algorithm, a cubic surface is fit to a set of data points in the window. In one version of the algorithm, the data points are separated by a fixed window size. In the second version, the data points are selected based on the relative scale of the images using the step value. This relative scale allows the fitting algorithm to be applied to points at the same scale for any image.

For example, consider a scan of a subject with a large camera offset (Figure 3.5a). In this case, the between-pixel distance will also be large, and a fixed window of 9 x 9 could represent 14 mm² of space. Now, consider a scan taken closer to the camera (Figure 3.5b). The same 9 x 9 window could now represent less than 9 mm² of space. This discrepancy in scale will cause variations in the curvature calculation and shape index, as can be seen in Figures 3.5 and 3.6.

Figure 3.5: Fixed window (9 x 9) shape index calculation: (a) far image, (b) close image. Note that the shape index values are dependent on the resolution of the scan. The detail of the curvature in (a) is much more prominent than the detail shown in (b).

Even though Figures 3.5a/3.6a and Figures 3.5b/3.6b are of the same scans, their curvature regions differ because the curvature calculation varies at different scales. When a fixed-size window is estimated using the average image step, the curvatures look smoother. Note that using the average image step is fairly reasonable given the results of Table 3.1; however, there will still be some slight variations in regions with abnormal step values. This is especially true in regions with high drop-off values, such as the sides of the face.

Figure 3.6: Fixed distance shape index calculation: (a) far image, (b) close image. Notice that the values are more similar across scale than in the fixed window mode (i.e., notice the smoother regions on the larger face).

3.4 Image Cleaning

Two types of errors are common to scanners such as the Vivid 910. The first type of error is holes in the scan (i.e., areas where the laser light is not seen by the camera). The Minolta handles holes gracefully with a flag matrix, which records zero if a hole is found and records one otherwise.
Another approach to dealing with holes is to fill in the missing data. For example, a surface can be fit to a neighborhood of points around a hole, as shown in Section 3.1. Ahn et al. [6] present a hole filling algorithm that uses a dynamically changing window size. This allows the algorithm to reduce its smoothing effects by using a smaller window in high density areas and larger windows in low density areas.

The second type of scanner error is spike noise, which is common around the eyes and hair due to the scanner's inability to properly pick up the laser light reflection from these surfaces (see Figure 3.7). The light gets reflected twice, causing inaccurate locations. Spike noise creates large outlying values that can cause problems in ICP algorithms, anchor point detectors, and correlation. A gradient filter can be used to eliminate spike noise by comparing the difference in z value for each point with all of its neighbors. If any of the gradients are above a certain threshold, then the flags for these noisy points are set to zero. A gradient filter set to around 5 mm/pixel will remove most of the spike noise in data collected with the Vivid 910.

Figure 3.7: Profile view of a frontal scan from the FRGC1.0 database. Notice the large spike noise errors around the eye.

3.5 Surface Alignment

When a scan is taken, all values are in a scanner-based coordinate system. Thus, two scans of the same object can have vastly different values if they are positioned differently relative to the camera. Surface alignment (also called registration) is defined as the calculation of a rigid transformation (R ∈ ℛ) that best transforms two scans of the same object taken at different poses or locations onto one another. The first scan is denoted by a set of points Ps and the second scan by the points Pm. The transformation is rigid and has six degrees of freedom: roll, pitch, yaw, and changes in x, y and z.

3.5.1 Single Point Rigid Alignment

If both the scan and the model are known to have similar poses relative to the scanner (i.e., both are frontal face scans with no rotation), then the transformation between the test scan and the model will only have a translation component and no rotation component. Thus, coarse alignment can be calculated using the difference between a single corresponding anchor point that is the same in the scan and the model. For example, the 3D location of the nose tip is found in both the test scan (sx, sy, sz) and the model (tx, ty, tz). These anchor points can be used to generate the transformation matrix (T) that transforms the test scan to the model:

    T = [ 1  0  0  (tx - sx) ]
        [ 0  1  0  (ty - sy) ]
        [ 0  0  1  (tz - sz) ]
        [ 0  0  0      1     ]

The problem with a single point transformation is that it is difficult to assume that there are no variations in the pose angles. In fact, it is quite difficult to maintain exactly the same pose between two scans.

3.5.2 Three Point Rigid Alignment

If the relative pose between the scan and the model is unknown, then a single common anchor point is not enough. Instead, a minimum of three corresponding anchor points is required to calculate the transformation from the test scan to the model. For example, assume that three anchor points (inside right eye, outside right eye, and nose tip) have been identified on both the scan and the model. Then, a rigid transform (R ∈ ℛ) can be easily calculated between the sets of anchor points [141] by a least squares fitting between the triangles formed from the two sets of three points.
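One common way to realize this least-squares fit is the SVD-based construction sketched below. It is an illustrative alternative, assuming NumPy; the dissertation instead composes the sequence of simple transformations described next.

    import numpy as np

    def rigid_from_points(a, p):
        """Least-squares rigid transform mapping the three (or more) anchor points
        in `a` onto the corresponding points in `p`; both are (n, 3) arrays.
        Returns the 4x4 matrix of Equation 2.9."""
        ca, cp = a.mean(axis=0), p.mean(axis=0)          # triangle centroids
        H = (a - ca).T @ (p - cp)                        # covariance of the centered points
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                         # guard against a reflection
            Vt[-1, :] *= -1
            R = Vt.T @ U.T
        t = cp - R @ ca
        T = np.eye(4)
        T[:3, :3], T[:3, 3] = R, t
        return T

Applying the returned matrix to the test-scan points places its anchor triangle onto the model's, which is exactly the coarse alignment used to seed the iterative refinement described below.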
The set of three anchor points on the test scan (a) is transformed into the same location as the second set of anchor points on the model template (p). The rigid transformation is composed of a series of simple transformations, as shown in Figure 3.8.

Figure 3.8: Rigid transformation between two sets of three corresponding points: (a) the original sets of points (the red triangle is constructed from the a points, the blue triangle is constructed from the p points); (b) the sets of points after the rigid transformation of points a onto points p.

    R = T_{c-p} · B_p · Θ · B_A · T_{c-a}

where:

    R        Total rigid transformation from the test-scan anchor points (a) to the model anchor points (p).
    T_{c-a}  Translate the points a so that the center of the triangle formed by their vertices is at the origin.
    B_A      Transform world coordinates into coordinates based on the basis vectors of set a.
    Θ        Optimum rotation to align the vertices onto the basis vectors of set p.
    B_p      Transform from the p basis vectors back into the original world coordinates.
    T_{c-p}  Translate from the center of the p vertices back into the original world origin.

3.5.3 Iterative Methods

The single point and the three point alignment algorithms are closed-form alignment methods. However, these methods are very sensitive to the accuracy of the points used in the alignment. Even small errors in the detected point location or variations in pose can cause the resulting alignment to have errors. These errors tend to get larger for points that are farther away from the anchor points used in alignment.

The Iterative Closest Point (ICP) algorithm [17, 37, 152] is a fine alignment method that minimizes surface distance errors. Starting with an initial estimate of the rigid transformation, ICP iteratively refines the transformation by choosing corresponding (control) points on the 3D surface of a probe object and calculating the best translation and rotation (those that minimize an error function) based on the distance to the surface of the base object.

This dissertation primarily uses the Besl ICP algorithm [17], which uses point-to-point distance and a closed-form solution when calculating the transformation matrix during each iteration. The point-to-point distances will have errors proportional to the distance between the sample points on the objects. The closer together the points are, the smaller the error. If the points are far apart, then the algorithm has difficulty making small incremental changes and there is a higher probability of getting stuck in a local minimum.

The point-to-plane distance used by Chen [37] makes the ICP algorithm less susceptible to local minima, compared to the point-to-point metric [64]. Instead of finding the closest point in a discrete sample, the point-to-plane algorithm interpolates a plane between the points on the surface. The problem with the point-to-plane algorithm is that it is more processor intensive. These two classical ICP algorithms [17, 37] can also be run in an interleaved style, called the hybrid ICP algorithm [100, 99]. Each iteration consists of two steps, starting with Besl's scheme [17] to compute an estimate of the alignment, followed by Chen's scheme [37] for a refinement. The two different distance metrics are utilized together, which has the potential for a better registration result.

The iterative nature of all of these ICP algorithms can make them slow. The most time consuming component of the algorithms is the search for the closest point on the surface of the base object for each of the control points.
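A minimal point-to-point ICP iteration in the spirit of the Besl scheme is sketched below. It is not the dissertation's implementation: it assumes NumPy and SciPy's cKDTree for the closest-point search, reuses the hypothetical rigid_from_points helper from the previous sketch, and omits the trimming and control-point selection discussed in Chapter 5.

    import numpy as np
    from scipy.spatial import cKDTree

    def icp_point_to_point(control_pts, base_pts, T_init, iters=30, tol=1e-6):
        """Refine an initial 4x4 transform so the probe control points align with
        the base surface; returns the transform and the final RMS distance."""
        tree = cKDTree(base_pts)                        # closest-point queries on the base scan
        T = T_init.copy()
        prev_rms = np.inf
        rms = prev_rms
        for _ in range(iters):
            moved = control_pts @ T[:3, :3].T + T[:3, 3]
            dists, idx = tree.query(moved)              # nearest base point for each control point
            T = rigid_from_points(control_pts, base_pts[idx])   # closed-form update (previous sketch)
            rms = np.sqrt(np.mean(dists ** 2))
            if abs(prev_rms - rms) < tol:               # stop when the alignment stops improving
                break
            prev_rms = rms
        return T, rms

The tree.query call is the closest-point search whose cost dominates the run time; reducing that cost is the motivation for the KD-tree and the row/column lookup tables discussed next.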
If every possible point on the base object is searched, then this is an O(np) algorithm, where p is the number of points in the base object (normally p = rc, where r and c are the number of rows and columns in a 2.5D scan) and n is the number of control points sampled from the probe object. A KD-tree based [15, 61] ICP algorithm can improve the time it takes to calculate the point distances by ordering the points in a lookup tree. This has a worst case running time of O(n ln(rc)), which is better than ICP alone; however, the construction of the KD-tree can take time.

Another possibility for speeding up the point distance calculation is to use the matrix structure of the 2.5D scan to help guide the search. The ordering of the points in xy space is related to the ordering of the point index in the matrix. This relationship can help speed up the point-to-point ICP algorithm by using row and column lookup tables. For each row, the maximum and minimum y values are recorded. Similarly, for each column the maximum and minimum x values are also stored. Then the search for the nearest point (x, y, z) involves searching these maximum and minimum tables for all possible rows and columns that could contain this point, within an error distance threshold. This search table lookup takes O(r + c) processing time to execute, and these rows and columns represent a much smaller search window. The search speed depends on the size of the window, which depends on the size of the resolution error threshold. For this dissertation, an error threshold of 2 mm is chosen (see Section 5.2.3). Since the standard resolution between pixels is around 0.5 mm, it is possible to have a search window with a radius as large as 4 pixels for every row or column (a 9 x 9 window). Typically, the window's radius ranges from zero (i.e., the point is off the scan) to 8 pixels (a 17 x 17 window), with an average window size around 5 x 5 pixels. The size of the window depends on the chosen error threshold and on the resolution of the scan. In any case, a search window is much smaller than searching every point in the scan and is similar to the number of comparisons required for the KD-tree. The overall speed of this algorithm is estimated to be O(n(r + c)w), where w is the window size. In these scans, w is much smaller than any of the other components, and can be considered constant. Thus this algorithm has a speed of O(n(r + c)). In practice, the speed increase can be quite significant, especially as the xy deviations of the rows and columns become smaller (such as with an orthogonal projection). However, an obvious limitation of this method is that it will only work on matrix-style 2.5D depth maps.

3.6 Orthogonal Depth Maps

The scans produced by the Vivid 910 require three matrices (x, y, and z) to describe the location of the 3D points. As described in Section 3.3, the x and y matrices are
For each cell in the matrix, a ray is traced onto the object perpendicular to the image plane (see Figure 3.9). The distance between the cell and the first point at which the ray intersects the object is recorded in the depth matrix. The depth map cannot represent all the points on a sphere, but rather only a hemisphere. Projection / Object Surface Rays / if A _1 Evenly Spaced Depth Map / Figure 3.9: Visual representation of an orthogonal projection. In computer graphics the z-buffer algorithm is commonly used to generate an 66 orthogonal depth map. For each polygon in a mesh, the coordinates of the polygon are projected onto the orthogonal matrix. If a new projection overlaps an existing point on the matrix, then the minimum of the two points is stored (this overlap happens during occlusion). Orthogonal depth maps are easy to calculate and they normalize the data in the a: and y directions. Another advantage of orthogonal depth maps is that many graphics processors are optimized to execute the z-buffer algorithm and this optimization could be used to speed up depth map calculations. 3.6. 1 Virtual Scans Since it is not generally practical to scan a subject at all possible rotations and angles, orthogonal depth maps are used to generate virtual scans of an object. The virtual rotation process rotates the points of a face scan in 3D space and then re—projects the points onto an orthogonal plane that is parallel to the my plane. The orthogonal projection produces a virtual scan that includes a depth map, flag map and color image similar to the data produced by the Minolta scanner. Exact pose angles can be produced and control can be maintained over the sampling rate. The frontal image in Figure 3.10(c) is the true scan, while the other four images in Figure 3.10 are artificially rotated and re—sampled images. Figure 3.10: Example of virtual yaw rotation. 67 3.6.2 Super Sampling Another advantage of the orthogonal depth map is that it can be easily sampled at a higher rate than the original scan (see Figure 3.11). This super sampling is achieved by interpolating values between the points on the original scan. Super sampling allows for a much richer and more dense point space and can improve the performance of the ICP algorithm by avoiding local minimums. Super sampling the data allows it to achieve faster ICP performance with a point-to—point algorithm while achieving the quality of a point-to—plane ICP algorithm (see Section 3.5.3). (a) Original (320 x 240) (b) (200 x 350) (c) (400 x 700) Figure 3.11: Depth maps sampled at two different resolutions. 3.7 Summary This chapter developed and presented various tools for processing 3D data. These approaches are fundamental to the methods described in the rest of the dissertation. Section 3.1 described methods for fitting a cubic surface to a sample of 3D points. Surface fitting is used in Section 6.2.3 to help establish a robust face based coordinate 68 system. This surface fitting method is also used to to calculate the surface curvatures, surface normals and shape index. The shape index described in Section 3.2 is a pose invariant feature space used in anchor point detection in Chapter 4. The step values described in Section 3.3 are used to establish the relative scale of a 3D scan and are used as a standard way to estimate point locations and distances throughout the dissertation. 
For example, in Section 4.3 step values are used to estimate the size of the anchor point search windows (bounding boxes), and in Section 5.2.1 the step size is used to help bound the location of the ICP control point region. Section 3.4 described methods for cleaning 3D scans, which are necessary to eliminate spike noise in the data. Section 3.5 discussed simple methods for establishing surface alignment. These general 3D alignment algorithms form the foundation of the 3D face alignment methods used throughout this dissertation and are described in detail in Chapter 5. This chapter concluded with a description of the orthogonal depth map, which is used in Chapter 6 as a way to standardize the resolution of the Canonical Face Depth Map.

Chapter 4

3D Anchor Point Detection

Chapters 2 and 3 introduced methods for collecting and working with 3D face data. Chapter 2 described 3D notation, modeling and scanners. Chapter 3 described some foundational algorithms for working in 3D, such as surface fitting, noise cleaning and orthogonal depth maps. This chapter discusses anchor points, which are used in many algorithms to ground the location of key facial components. In the context of this dissertation, anchor points are points on the face that are common to all faces and that can be detected reliably in different facial scans. Some examples of anchor points are shown in Figure 4.1.

Figure 4.1: Example anchor point locations. rEye - inside of the right eye. orEye - outside of the right eye. lEye - inside of the left eye. olEye - outside of the left eye. nose - nose tip.

This chapter outlines methods for detecting key anchor points in 3D face scanner data. These anchor points can be used to estimate pose and match the test image to a 3D face model. Two algorithms are presented for detecting face anchor points in the context of face verification: one for frontal images and one for arbitrary pose. A 99% success rate is achieved when finding anchor points in frontal images with neutral expressions, and 86% success is achieved for scans with variations in pose and expression. These results demonstrate the challenges in 3D face verification when dealing with arbitrary pose and expression.

Anchor points are used by many algorithms to align and manipulate 3D data from different scanners. For example, Chapter 5 presents a method for using common anchor points (such as the tip of the nose) to coarsely align faces. Anchor points such as the nose tip and eye corners are also used to localize regions of the face that do not vary much with expression. These regions are used to select control points for the ICP algorithm. Section 4.1 provides a general description of the anchor point detection problem. Section 4.2 examines existing 2D anchor point detection systems, while the remainder of this chapter discusses the detection of anchor points in 3D for both frontal face scans (Section 4.3) and scans of faces with arbitrary pose (Section 4.4).

4.1 Problem Description

Viola and Jones [138] automatically learn anchor points from manually labeled data. However, this method tends to refer to regions on the face (rather than actual points) and does not meet the accuracy requirements for many algorithms, such as alignment, which require the location of specific anchor points rather than regions. Using saliency, robustness, and utility as the three main criteria, the core anchor labels shown in Figure 4.1 were selected for study in this dissertation.
Saliency means that an anchor point is prominent and easily identifiable in most face scans. For example, the eye corners and nose tip can be used as anchor points on almost every face, and they fall within the accuracy required for coarse alignment in the 3D face alignment system (see Chapter 5).

In addition to being salient, it is also important for anchor points (or sets of anchor points) to be robustly identifiable across all types of images. For example, the center of the pupil is a commonly used anchor point because it is salient and can be quickly identified within an image [153]. However, depending on the pose, it is possible for a scan to have only one visible pupil (profile view) or even zero visible pupils (eyes closed). Thus, it is desirable to find anchor points that are robust across variations in lighting, facial pose, and facial expression. Currently, the MSU database (described in Section 2.5.1) contains variations in neutral or smiling expressions and open or closed eyes.

A third consideration for selecting anchor points is their utility, which is based on the algorithm that will be using the anchor points. For the coarse registration problem (see Chapter 5), anchor points that are farther apart tend to work better than ones closer together. The ICP algorithm (see Section 3.5.3) works best with non-deformable points (i.e., points that do not change with respect to each other). Thus, points such as the mouth corners are not a good choice for use with ICP because they deform as expression changes.

The five core anchor points labeled in Figure 4.1 are easy to find within the image, are stationary with respect to the skull (and therefore do not move much with respect to each other), span the width of most images, and finding them does not require that the eyes be open. There is still the problem of occlusion of specific points due to pose, but the coarse alignment algorithm discussed in Chapter 5 only requires three of the five possible points in order to achieve rigid alignment. Due to this redundancy, the five anchor points in Figure 4.1 are sufficient for most images, including profiles. These core points are also sufficient to calculate the location of the grid of control points used by ICP (see Figure 5.29) for surface alignment in Chapter 5.

If fewer than three of the five core anchor points are present, then the set will not be sufficient to calculate the transformation between two face scans. A common example of this problem is in a full profile scan where the eyeball occludes the inside corner point of the eye. In such cases, additional labeled anchor points are needed to increase the likelihood of finding three usable points. For this reason, four additional anchor point labels are used: the two corners of the mouth, and a pair of non-surface points that represents the central axis of the scan. Depending on the pose of the scan, these secondary points may or may not be used as the three transformation points. The corners of the mouth are normally avoided because they can change with changes in expression. However, they are included in the secondary point list because they are salient and can help in cases where the other anchor points are unavailable. The other secondary points, the centroid and the top-centroid, represent the 3D vertical axes of both the scan and the model. The centroid is calculated as the average location of all the points in a scan or model, and the top-centroid is the same point moved 100 mm in the z-direction.
In most scans these points do not align well with the models. However, due to their similar structure and orientation, they tend to be useful points, because it is reasonable to assume that most faces are vertical and that most of the transformations are simply rotations about the vertical axis (with a small translation to account for slight variations in the vertical axis).

More candidate anchor points could be added in the future; for example, a point on the ear was considered. However, it is difficult to detect a reliable point on the ear due to self-occlusion and occlusion due to hairstyle. Although adding more anchor points increases the chances of finding at least three good anchor points, it also requires more computational resources to find and label such additional points.

4.2 Background

Identifying anchor points for 2D face recognition is a well studied problem, and Table 4.1 overviews several automatic anchor point detection systems that work with 2D faces. Most of these systems fall under two areas of research: face detection and pose estimation. Face detection is the problem of determining whether one or more faces exist within an image, which is a necessary precursor to any vision system that works with faces. Pose detection analyzes an image to determine in which direction a face is looking [10, 147].

Many types of algorithms have been developed to find faces in 2D images. Some of the faster methods learn statistical anchor points in the face region and exploit these anchor points for detection [138]. For example, the area around the eyes can be found by using a differentiation filter that looks for a dark-light-dark region, indicating the dark eyes and a light area around the bridge of the nose (see Figure 4.2). This approach does not identify a particular anchor point, but instead determines a pattern of relative intensities. This is reasonable for 2D face detection algorithms because they do not need to be constrained to find exact anchor points in real space. This method of face detection is used in [12] to identify the locations of faces in real time. A set of Gabor filters is then trained using Adaboost to determine up to seven expressions, and these "mood" measures are fed into robots to enhance their interaction and performance. For a good review of the 2D face detection literature, see the work by Yang et al. [146] or Hjelmas [71].

The pose detection literature is mostly related to the problem of anchor point detection in 2D face images. This problem naturally lends itself to identifying specific anchor points, such as the eyes and the nose [10, 147]. Pose information is commonly used with algorithms that attempt to fit a face image texture to a wire frame model of the face [21, 143]. In these algorithms, anchor points are more useful than regions.

Table 4.1: Overview of automatic anchor point detection systems that work with 2D faces.

Figure 4.2: Simple and efficient 2D face detector using a cascade of dark-light-dark anchor point detectors [138].

Another approach to anchor point detection is to collect heuristics about common
facial anchor point locations, and then systematically exploit the heuristics to find the anchor points. For example, face color is a common attribute used to segment the face region and then identify the dark regions within the face that correspond to eyes, eyebrows and an open mouth. Simple models of the face are also used to guide heuristic searches for anchor points [29, 9]. For instance, knowledge about the shape of a face can be used with Snakes to find the outline of the head and the eyes. Hsu and Jain [75] applied Snakes to frontal 2.5D images to accurately determine the outline of the head. However, as with many face detection and pose estimation algorithms, some assumptions were made about the pose angle (frontal faces only) in order to reliably detect the face anchor points.

In addition to the work presented in this dissertation on 2.5D anchor point detection [45], there are other frontal anchor point detectors [74, 139, 23]. These frontal anchor point detection algorithms each have different requirements, such as the eyes being open or the face having a neutral expression. Although ideally no assumptions about the pose should be made, imposing some simple restrictions on pose greatly reduces the computational requirements for anchor point detection in 2D and 2.5D.

4.3 Frontal Anchor Point Detection in 3D

This section describes the frontal anchor point detection algorithm outlined in Table 4.2. The first step is to remove spike noise from the scan using a gradient filter, as described in Section 3.4. The second step is to find a base reference point, such as the top of the head. (Actually, any point near the top of the head should do because this point is only used to establish the vertical location of the head.) Once the top of the head is found, a bounding box for the location of the nose can be produced based on the inter-subject distance model shown in Table 4.3. The frontal anchor point detection algorithm then uses other models to generate bounding boxes (step 4) and uses local feature-based methods to find the remaining anchor points (step 5).

Table 4.2: Frontal pose anchor point detection algorithm.
1. Use the gradient filter to filter out depth spikes.
2. Find the top of the head and generate the nose bounding box.
3. Find the tip of the nose as the closest point of the scan.
4. Calculate the bounding boxes for the inside corners of the eyes, mouth, and chin.
5. Calculate the location of each individual anchor point using the shape index.

The first step in the 3D frontal anchor point detection algorithm relies on a gradient filter.
When dealing with frontal pose faces, the z gradient is defined as the change in the z distance with respect to the x and y directions. Typically there is a large change in z when moving from the base of the nose to the tip of the nose, and at the boundary between the face and hair. However, in most other locations on the face the z gradient is small. It is common for spike noise to appear in the scans (especially in the FRGC dataset) due to errors in reading the reflected laser light. This spike noise is easily removed by filtering out points that have a gradient larger than the gradients of points around the nose (see Section 3.4). Filtering the scan gradients to eliminate spike noise before selecting the nose point increases the success rate of the nose detector from 95% to 97%. If the search is limited to areas that may reasonably contain the nose (see Table 4.3), the success rate can be increased to 99%.

This success rate is sufficient for a frontal-based identification system since, when the orientation of the head is known, one anchor point is sufficient to initially align the head with the model (see Chapter 5). However, in order to allow for minor variations in the pose angle (±5°), other anchor points on the face are needed.

For the second step in the anchor point detection algorithm, the top of the head is detected by simply searching from the top down on the scan. This is a naive detector that works quite well, as long as there are no other objects in the range of the scan located above the head. The top of the head is used as a reference point to reduce the search area for the nose point detector by creating a bounding box to define the search area. The size of the bounding box is determined based on a statistical model of the face. The statistical model identifies where the anchor points may be located on the face relative to other anchor points. To create this statistical model, the distance between each pair of anchor points was measured in a database of over 2000 test scans. The minimum, maximum and average distances were included in the statistical model and used to bound the search region for each of the anchor points (see Table 4.3).

Table 4.3: Statistical model of the face. Each set of distances is calculated from point A to point B only in the specified direction.

Point A       Point B       Direction    Minimum (mm)   Average (mm)   Maximum (mm)
Top of Head   Nose Tip      vertical     84.0           121.7          181.0
Nose Tip      Mouth         vertical     16.9           34.4           46.3
Nose Tip      Chin          vertical     49.7           70.6           95.4
Nose Tip      Nose Bridge   vertical     21.2           32.2           48.4
Nose Tip      Inside Eye    horizontal   13.2           17.5           23.6
Inside Eye    Outside Eye   horizontal   15.9           30.6           39.7

Creating bounding boxes can greatly reduce the time needed to search for a particular anchor point. For example, once the tip of the nose is found, bounding boxes can be generated to locate many of the surrounding anchor points, such as the eye corners or chin (see Figure 4.3). In addition to increasing the efficiency of the anchor point detection algorithm, bounding boxes can also improve the chances of accurately detecting an anchor point because the search is now limited to a region where the anchor point is most likely to be found. Reducing the search area reduces the chances of finding a local minimum.

Figure 4.3: Example bounding boxes produced by the statistical face model and the location of the nose tip.
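The first three steps of Table 4.2 can be sketched in a few lines of code. The following is a minimal, hedged illustration only: it assumes the scan is stored as a 2.5D depth map (rows from the top of the image downward, values in millimeters toward the scanner), and the gradient cutoff and row spacing are hypothetical parameters rather than values taken from the dissertation. The head-to-nose search window does use the 84.0-181.0 mm range from Table 4.3.

```python
import numpy as np

def remove_spikes(depth, max_grad=20.0):
    """Invalidate pixels whose depth gradient exceeds max_grad (mm per pixel).
    NaN marks invalid points; max_grad is an assumed cutoff."""
    z = depth.astype(float).copy()
    gy, gx = np.gradient(z)
    z[np.sqrt(gx ** 2 + gy ** 2) > max_grad] = np.nan
    return z

def find_nose_tip(depth, row_mm=1.5, head_to_nose=(84.0, 181.0)):
    """Locate the nose tip in a frontal 2.5D depth map.

    row_mm is the (assumed) vertical spacing between rows in millimeters.
    The search window below the top of the head uses the minimum/maximum
    head-to-nose distances from the statistical face model (Table 4.3).
    """
    z = remove_spikes(depth)
    valid_rows = np.where(np.any(np.isfinite(z), axis=1))[0]
    top_row = valid_rows[0]                        # naive top-of-head detector
    lo = top_row + int(head_to_nose[0] / row_mm)   # nose bounding box (rows)
    hi = top_row + int(head_to_nose[1] / row_mm)
    window = z[lo:hi + 1, :]
    r, c = np.unravel_index(np.nanargmin(window), window.shape)
    return lo + r, c                               # closest valid point to the scanner
```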
4.3.1 Experiments and Results

The frontal anchor point detection algorithm described in the previous section was tested on the following datasets:

- Dataset A - The primary dataset, composed of 111 test subjects with approximately three frontal scans each (two with neutral expression and one smiling) for a total of 330 test scans. Scans have little spike noise and most faces have a neutral expression. See examples in Figure 4.4.
- Dataset B - 953 scans of 275 subjects taken from a database provided by the University of Notre Dame. These scans are all frontal with neutral expression and have higher levels of noise than Set A due to different lighting conditions [118].
- Dataset C - A limited dataset containing 49 images from a handful of subjects. It explores significant variations in the data, such as background changes, rotations, occlusions, expression, lighting and noise. Most of these images do not satisfy the specifications of the 3DID face verification system (see examples in Figure 4.5).

Figure 4.4: Detected anchor points for example images from dataset A.

Figure 4.5: Detected anchor points for example images from dataset C (these images violate the specifications of the verification system).

Table 4.4: Mean and standard deviation (mm) of the error between automatically and manually selected anchor points for the frontal anchor point detector.

Table 4.4 lists the mean and standard deviation of the distance between the automatically selected anchor points and the manually selected anchor points. Figure 4.6 plots a set of manually selected anchor points onto the average face to represent the type of errors found in the dataset. Each of the white dots corresponds to the location of an automatically selected anchor point relative to the manually selected point. Figure 4.6d represents the nose point precision. There are three outliers in the entire dataset: two of them are due to limitations in the nose point detector (the head is pitched back and the subject has a large lip) and the third outlier is due to head motion during scanning.

The anchor point detection algorithm described in Table 4.2 was tested on all three datasets to see how accurately the algorithm found anchor points on the face. The anchor point detector was considered successful if it correctly located three of the five anchor points shown in Figure 4.1. Success was a qualitative judgment made by the investigator and verified by checking the convergence of the ICP algorithm. These results are summarized in Table 4.5. Of the five anchor points detected, the worst performance was with the mouth corners due to variations in expression; therefore the success rate for the mouth anchor points is also reported as a reference. Figure 4.6 plots the relative difference between the manually selected anchor points and the automatically generated anchor points.

(a) Right Eye (b) Nose Bridge (c) Left Eye (d) Nose (e) Nose Base (f) Chin (g) Right Mouth (h) Mouth (i) Left Mouth
Figure 4.6: Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset.

Table 4.5: Experimental results for the frontal anchor point detector.

Data Set   Number of Scans   Approx. Number of Subjects   Detection Rate   Detection Rate (Mouth Only)   Average Algorithm Time (sec)
A          330               110                          99.1%            97.0%                         0.33
B          953               275                          99.8%            97.4%                         0.30
C          49                10                           85.7%            83.7%                         0.33

The anchor point detector was run on a Pentium 4 CPU with two 3.20GHz processors and 1.00GB of RAM.
All of the times reported are wall clock times and will vary depending on the load of the cpu, this load variation being most noticeable during file loading. 4.4 General Pose Anchor Point Detection in 3D The previous section described a frontal anchor point detection algorithm sufficient for dealing with cooperative subjects in a constrained environment. This section examines the more difficult problem of detecting anchor points under conditions of arbitrary pose. Instead of developing specific heuristics to determine the anchor points, redundant methods are used to find candidate anchor points and then these anchor points are examined to identify the best ones. The goal of the arbitrary pose anchor point detection algorithm is to find three labeled points on the test scan that correspond to three labeled points on the 3D face model. These three pairs of points are then used to calculate a transformation from the test scan to the model in the database. This transformation coarsely aligns the test scan to the 3D model, and then the ICP algorithm is used to calculate the best matching distance between the scan and the model using a set of control points defined by the anchor points (see Figure 5.29). Once the 3D model and the test scan are aligned, the remaining anchor points can be found on the test scan by back projecting them from the 3D model. 83 Sections 4.4.1 and 4.4.2 describe the three major steps in aligning a test scan to a model with significantly different pose. Section 4.4.3 presents data from experiments with the arbitrary pose anchor point. detection algorithm. 4.4.1 Candidate Point Selection When pose is unknown, it cannot be guaranteed that the user is facing the scanner. Since anchor points need to be detected robustly regardless of the pose of the face, the arbitrary pose anchor point detection system uses a set of model-based methods to locate each candidate point. Surface curvatures (shape index), as described in Section 3.2, are used as a pose invariant feature to help guide the search for anchor points. A similar approach was applied by Beumier and Acheroy [18] to find anchor points on the profile ridge of a 2D profile image. However, it can be difficult to identify specific anchor points based on the shape index alone; for example, finding a region of high curvature does not necessarily result in a single point. Thus, the general locations of the anchor points are determined by combining information about the shape index with information from a model of the face. Once the general location of the anchor point is found, the results are further refined by finding a more specific point. This is normally done using an edge map generated from the 2D color component of the scan. The edge map is used to find a specific edge point (such as a corner point) closest to the general anchor point location. Because only the three best anchor points are chosen for use in the arbitrary pose alignment algorithm, spurious candidate points are acceptable. This simplifies the development of individual anchor point detectors because possible anchor points can be selected by simple criteria and eliminated later by the constraint relaxation algorithm [76]. This makes the anchor point detection system flexible in special cases, such as when the face is occluded, because anchor point detectors can be added to account for special situations. 84 Inner Eye Detection The easiest anchor point to identify in scans with arbitrary pose is the inside edge of an eye next to the bridge of the nose. 
This inner eye point has a shape index value that is close to zero, and the area around this point has a consistent shape index value across all face images and poses. An averaging mask (size 20 high x 10 wide) can be used to identify this area in the shape space (see Figure 4.7). More complex masks were considered, but the averaging mask proved to be robust across face scans. The averaging mask locates a point with the largest neighborhood average cup-shaped curvature. The inside edge of the eye is then used as an anchor in order to identify other anchor points within the face image. Because there are two eyes, the two largest regions of high cup-shaped curvature are selected. Often, this results in a point within the ear or hair being classified as a possible eye (see Figure 4.7b); this error is tolerable as long as the constraints can eliminate these points later.

Figure 4.7: Correlation of the shape space with a simple vertical mask to find the inside points of the eyes. (a) Frontal view. (b) A semi-profile view. The bright spots in the scan represent the locations of the inside of the eye next to the bridge of the nose.

Outer Eye Detection

Once the inner eye candidate points are found, the outside of the eye is detected by following the rut (defined by the shape space) that consistently runs along the bottom of the eye (see Figure 4.8). This calculation is made from all possible eye candidates. This particular detection method is less reliable than the other methods because the outside of the eye is flatter (in the shape space) and is not as salient as the inside eye.

Figure 4.8: An example shape-space scan with key anchor points identified (the outside eye point, the rut that runs along the bottom of the eye, and the mouth tip).

Nose Tip Detection

Like the inner eye point, the nose tip is generally easy to detect. For most people, the nose tip sticks out beyond other points of the face and is salient in most poses. For example, Figure 4.9 shows a face scan where up to six different simple detection methods (outlined in Table 4.6) were used to find the nose tip. The expectation is that one of the points produced by these six methods is in fact the tip of the nose. In order to evaluate the quality of these detection methods, over 1000 face scans were analyzed and the minimum distance to the actual nose tip was recorded. The average minimum distance was 5.1mm, with a standard deviation of 8.3mm. Experiments have shown that any point on the nose within a 15mm radius of the tip is an adequate starting point for the ICP algorithm. A 15mm radius may seem like a large tolerance, but the goal of the arbitrary pose anchor point detection algorithm is to come up with a coarse alignment that can then be passed to ICP for fine alignment. Most of the nose points detected by the algorithm had close to zero error; however, there are still a few cases where the algorithm missed the nose tip completely. In these cases, there is still a possibility that the relaxation algorithm (discussed in Section 4.4.2) will find a better set of three anchor points that does not include the nose tip.

Figure 4.9: Example nose candidate points for three different poses. Notice that many of the points are errors, but in all of the scans at least one of the candidate points is actually the nose tip.

Table 4.6: Methods for finding candidate nose points.
1. Point closest to the scanner (minimum z direction point).
2. Point farthest to the left (minimum x direction point).
3. Point farthest to the right (maximum x direction point).
4. Point farthest from the vertical plane formed by points 1 and 2.
5. Point farthest from the vertical plane formed by points 1 and 3.
6. Point with the largest shape index.
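The six candidate-point detectors in Table 4.6 are simple enough to express directly on a point cloud. The sketch below is illustrative, assuming z points toward the scanner and y is the vertical axis (both assumptions, since the coordinate convention is not restated here), and it simply returns every candidate so the later relaxation step can discard the spurious ones.

```python
import numpy as np

def candidate_nose_points(points, shape_index):
    """Return candidate nose-tip indices using the six detectors of Table 4.6.
    points is an (N, 3) array (x, y, z in mm); shape_index has one value per point."""
    pts = np.asarray(points, float)
    si = np.asarray(shape_index, float)
    vertical = np.array([0.0, 1.0, 0.0])      # assumed vertical axis

    i_close = int(np.argmin(pts[:, 2]))        # 1. closest to the scanner (min z)
    i_left = int(np.argmin(pts[:, 0]))         # 2. farthest left (min x)
    i_right = int(np.argmax(pts[:, 0]))        # 3. farthest right (max x)
    cands = [i_close, i_left, i_right]

    # 4./5. point farthest from the vertical plane through candidates (1,2) and (1,3)
    for j in (i_left, i_right):
        normal = np.cross(pts[i_close] - pts[j], vertical)
        n = np.linalg.norm(normal)
        if n > 0:
            dist = np.abs((pts - pts[j]) @ (normal / n))
            cands.append(int(np.argmax(dist)))

    cands.append(int(np.argmax(si)))           # 6. largest shape index
    return sorted(set(cands))
```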
4.4.2 Relaxation / Search Algorithm

Section 4.4.1 described how model-based detection methods are used to identify candidate points that may prove to be anchor points for the scan. This section describes the next step for identifying anchor points, which is to examine the set of candidate points to find possible sets of three anchor points and their associated labels. For example, even if only three candidate points are found, each of these three points could be assigned up to nine labels (Nose Tip, Inside Right Eye, Inside Left Eye, Outside Right Eye, Outside Left Eye, Left Mouth Corner, Right Mouth Corner, Centroid, and Top Centroid), which results in 504 possible labellings. When more candidate points are available, the total number of labellings goes up exponentially. Therefore, an exhaustive search of all candidate points with all possible labellings is not feasible. Instead, some simple constraints are developed to help filter the labeling possibilities.

There are many methods for solving constraint satisfaction problems [11]. Relaxation filtering was chosen because it is easy to implement and is efficient given that there is a limited number of candidate points as well as a limited number of possible labellings. The algorithm presented here is called discrete relaxation [76]. There are two components to discrete relaxation: the relaxation matrix and the relaxation constraints. Using a well-formulated combination of the relaxation constraints, the relaxation algorithm is able to iteratively eliminate impossible labellings using the following rule.

Relaxation Rule - If a particular label assignment for any point fails ALL the constraints, then this labeling is not possible and the corresponding entry (row and column position) in the relaxation matrix should be changed to zero.

This relaxation rule follows the philosophy of least commitment, where only the obviously wrong labels or points are eliminated.

Relaxation Matrix

The relationship between the candidate anchor points and the possible anchor point labels is represented as a Relaxation Matrix (see Table 4.7). A value of one in the relaxation matrix represents the possibility of a point having a particular label. A value of zero means the point cannot have that particular label. The null label is used for points that do not match up with any label. The matrix is initialized with all ones because, without any knowledge from the constraints, every point could have every possible label (note that this is before the unary constraints are applied).

Table 4.7: Initial relaxation matrix. Each of the nine candidate points is a row, each possible label (nose, rEye, lEye, orEye, olEye, rMouth, lMouth, centroid, topCentroid, null) is a column, and every entry starts at one.

Relaxation Constraints

For the 2.5D face arbitrary pose anchor point detection problem, the following four constraints are used: shape constraint, absolute distance constraint, relative distance constraint and relative cross-product constraint.

Shape Constraint - The shape range of a candidate point with a specific label must fall between a minimum and maximum value. Note that the shape index is a unary constraint and only needs to be applied once when the relaxation matrix is being initialized.
S_{min} < S_L < S_{max}    (4.1)

Example: 0.7 < S_{nose} < 1.0

Absolute Distance Constraint - The distance between any two labeled points must be between a minimum and maximum value.

D_{min} < |P_{L1} - P_{L2}| < D_{max}    (4.2)

Example: 30mm < \sqrt{(P^x_{nose} - P^x_{lEye})^2 + (P^y_{nose} - P^y_{lEye})^2 + (P^z_{nose} - P^z_{lEye})^2} < 140mm

Relative Distance Constraint - The difference of a single component (x, y, or z) of any two labeled points must be between a minimum and maximum value.

D_{min} < (P^x_{L1} - P^x_{L2}) < D_{max}, or D_{min} < (P^y_{L1} - P^y_{L2}) < D_{max}, or D_{min} < (P^z_{L1} - P^z_{L2}) < D_{max}    (4.3)

Example: 10mm < (P^x_{rEye} - P^x_{lEye}) < 100mm

Relative Cross-Product Constraint - The cross product between the vectors formed by (P_{L1}, P_{L2}) and (P_{L1}, P_{L3}) has a particular direction within a single (x, y, or z) component.

[(P_{L1} - P_{L2}) \times (P_{L1} - P_{L3})]_x \cdot Dir > 0, or [(P_{L1} - P_{L2}) \times (P_{L1} - P_{L3})]_y \cdot Dir > 0, or [(P_{L1} - P_{L2}) \times (P_{L1} - P_{L3})]_z \cdot Dir > 0, where Dir = \pm 1    (4.4)

Example: [(P_{L1} - P_{L2}) \times (P_{L1} - P_{L3})]_z \cdot Dir > 0

These four constraints were derived by qualitatively looking at example scans and models and determining the relative distances between anchor points. Similar constraints can be added to speed up the algorithm; however, the four constraints described in this section proved to be sufficient for identifying possible labellings. One challenge in identifying these constraints was to not make too many assumptions about the relative locations of the anchor points. For example, the left eye points are not always located to the left of the nose; in some profiles, the left eye is actually to the right of the nose.
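The four constraint types map directly onto small predicate functions. The following sketch assumes NumPy and uses the constraint examples quoted above; the three candidate coordinates at the bottom are purely hypothetical values used to show how a (nose, rEye, lEye) labeling would be screened, not data from the dissertation.

```python
import numpy as np

def shape_ok(s, lo, hi):
    """Shape constraint (4.1): e.g. 0.7 < S_nose < 1.0."""
    return lo < s < hi

def abs_dist_ok(p1, p2, lo, hi):
    """Absolute distance constraint (4.2): Euclidean distance between two labeled points."""
    return lo < np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float)) < hi

def rel_dist_ok(p1, p2, axis, lo, hi):
    """Relative distance constraint (4.3): one coordinate difference must lie in range."""
    return lo < (p1[axis] - p2[axis]) < hi

def cross_ok(p1, p2, p3, axis, direction):
    """Relative cross-product constraint (4.4): the cross product of (P_L1 - P_L2)
    and (P_L1 - P_L3) must have a given sign (direction = +1 or -1) along one axis."""
    p1, p2, p3 = (np.asarray(p, float) for p in (p1, p2, p3))
    return np.cross(p1 - p2, p1 - p3)[axis] * direction > 0

# Hypothetical candidate coordinates (mm) labeled nose / rEye / lEye:
nose = np.array([0.0, 0.0, 40.0])
r_eye = np.array([18.0, 30.0, 10.0])
l_eye = np.array([-18.0, 31.0, 11.0])
feasible = (abs_dist_ok(nose, l_eye, 30.0, 140.0)      # 30 mm < |P_nose - P_lEye| < 140 mm
            and rel_dist_ok(r_eye, l_eye, 0, 10.0, 100.0)  # 10 mm < (x_rEye - x_lEye) < 100 mm
            and cross_ok(nose, r_eye, l_eye, 2, +1))
print(feasible)
```

A labeling that fails every applicable constraint would have its entry in the relaxation matrix set to zero, following the relaxation rule above.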
Relaxation Example

The relaxation algorithm iteratively applies all of the constraints to the Relaxation Matrix until the Relaxation Matrix achieves a steady state. The resulting Relaxation Matrix should contain fewer ones than the original initial state (see Table 4.8).

Table 4.8: Relaxation matrix after the first relaxation.

Note that in this example, candidate Points 2 and 9 have all zeros across their rows (except for null). This means that they cannot be any of the anchor points and are eliminated from further consideration. Also note that none of the points can be labeled as an outside left eye point, right mouth or left mouth. The algorithm has significantly reduced the number of possible labels. However, there are still several possible labellings represented in this Relaxation Matrix. The next step is to search through the possible assignments of points to find the best labeling. The points are now searched in a depth-first manner, applying more than pair-wise constraints. Three labels need to be assigned, so the first point is picked from the matrix. In the example from Table 4.8, Point 1 is assigned the nose label. If Point 1 is the nose, then none of the other points can be the nose and Point 1 cannot take another label; so the Relaxation Matrix changes to the one shown in Table 4.9. This change in the Relaxation Matrix allows the constraints to be reapplied to relax the matrix further (see Table 4.10). This relaxation results in finding that Point 5 cannot be labeled as rEye.

Table 4.9: Relaxation matrix after selecting the first point.

Table 4.10: Relaxation matrix after the second relaxation step.

Now, the second point needs to be selected from the resulting list in Table 4.10. In this example Point 3 is selected to be the rEye and the relaxation algorithm is applied again. However, this time relaxing the matrix does not change its values (see Table 4.11).

Table 4.11: Relaxation matrix after the second point selection.

The final step of the search algorithm is to select the last labeling. In this example there are four points left, so each point is selected individually and the fitness function is evaluated for all four sets of points:

- Point 1 = nose, Point 3 = rEye, Point 4 = orEye
- Point 1 = nose, Point 3 = rEye, Point 5 = lEye
- Point 1 = nose, Point 3 = rEye, Point 7 = topCentroid
- Point 1 = nose, Point 3 = rEye, Point 8 = centroid

After evaluating the fitness function for these labellings, the algorithm backtracks and selects a different second point (in this example Point 3 would be an lEye). When all the possible second points have been searched, the algorithm backtracks again and searches through all of the first points. The output of this entire algorithm is a list of possible labellings with their fitness values. The labeling with the best fitness is output to the next stage of the system. The feature point system could also output the top k labellings and have the next stage determine the best fit.

Fitness Functions

In order to evaluate the quality of the set of anchor points, the matching score generated by the Iterative Closest Point (ICP) algorithm is used as a fitness function (see Section 3.5.3). The key to using ICP is to have a set of consistent control points from within the scan. Because the pose of the scan is unknown, the control points related to the pose cannot be used. Instead, all noisy data (in the z direction) are filtered out of the scan and then an evenly spaced 20 x 20 grid of points is placed on the remaining scan. Because a symmetrical grid is used, not all of these points are valid due to holes in the scan. The invalid points are discarded. The same final set of control points is used in all of the fitness calculations so that the distance errors generated by the ICP algorithm are comparable. The downside of using ICP as a fitness function is that it requires the use of a 3D face model (see Section 2.4.7) to calculate the distance value. This model is available in the case of verification, but perhaps not for recognition.

4.4.3 Experiments and Results

In order to evaluate the quality of the anchor points selected using the 3-step algorithm outlined in Table 4.12, each detected anchor point was compared to a manually selected point. However, many times the manually selected points were not as accurate as the automatically selected points.
This can occur when the experimenter selects a point that looks correct in the texture image but may actually be incorrect in the registered range image due to movement of the subject's head. (This was a common problem when points were near the edges of the scan, where the range data can be invalid.)

Table 4.12: Arbitrary pose anchor point detection algorithm.
1. Candidate point selection (Section 4.4.1).
2. Relaxation/search algorithm (Section 4.4.2).
3. Fitness function evaluation (Section 4.4.2).

The arbitrary pose anchor point detection system was tested on a set of 600 scans from 100 subjects (6 scans per subject). The scans include frontal pose, full profiles and smiling expressions. Each test scan in the dataset was compared to 3D models of the same subjects generated from an additional 5 scans stitched together to produce a 3D surface model (see Section 2.4.7).

The histograms of the error for all five of the core surface anchor points are shown in Figure 4.10. The median error of the five points is around 10mm, and approximately 90% of the scans are below 20mm error for each anchor point. Extensive testing shows that the ICP algorithm can tolerate these errors for matching. Some examples of the detected points are shown in Figure 4.11.

Figure 4.10: Error histograms for each of the surface candidate points (inside right eye, inside left eye, outside right eye, outside left eye, and nose). The error represents the distance from the final extracted anchor point to the manually selected point. The dotted vertical line represents the 20 mm cutoff, which 87% of the scans fall under.

Figure 4.11: Example anchor point locations. The plus (+) represents the automatically located anchor points and the circle (o) represents the manually selected anchor points.

To quantify the success rate of the entire transformation, all of the coarse transformations were evaluated by hand. Each test scan produced a transformation of the test scan onto the 3D model. If this transformation was judged to be good enough for ICP to converge, then the anchor point detection was labeled as successful for this scan. Of the approximately 600 test scans, there was an 85.6% success rate when matching a subject's test scan to the same subject's 3D model.

To fully understand where the errors occurred, the test scans were also separated into groups based on subject attributes (see Table 4.13). From this table it can be seen that facial hair and dark skin made it more difficult to identify key facial anchor points. This is an understandable result because both of these attributes increase the noise produced by the scanner (due to problems reading the reflected laser light). It is also interesting to note that it is easier to identify anchor points in scans with eyes closed than with the eyes open. This is probably also due to the increase in surface noise that occurs with the eyes open, since the laser does not reflect off the pupil and causes a "hole" in the scan where the eyes should be. Figure 4.12 shows how the success rate of the arbitrary pose anchor point detection system varies with pose and facial expression.
As Figure 4.12 demonstrates, the anchor point detection system works 99.0% of the time on neutral expression, frontal pose scans. The success rate goes down a little with a smiling expression and it degrades significantly for profile views.

Table 4.13: Success rate by subject attribute.

Figure 4.12: Percent of correctly selected anchor points versus the pose direction and facial expression (neutral and smile). The success rate for frontal neutral faces is 99.0%.

To improve the system performance, a reject option could be added to reject incoming scans that contain too much noise to work well with the anchor point detection system. In other words, if the system is unable to process a scan properly, then that scan is rejected. The goal is to find a rejection criterion that will consistently reject the bad scans and minimize the number of good scans that are rejected. Experimental results showed that extremely unsuccessful scans were often abnormally rotated. So, a rejection criterion was developed that calculates the angle of head pitch by calculating the change in angle of a line (drawn through the eye points) with the horizontal. In the MSU database there are no head pitches of greater than 35 degrees. If the rejection criterion is set to over 35 degrees, the system success rate increases to 87.0% without any false rejects. As Figure 4.13 shows, adjusting the rejection criterion allows for a design trade off between the error rate of the system and the false rejection rate.

Figure 4.13: Trade off between the error rate and the false rejection rate as a function of modifying the rejection angle.

Chapter 5

3D Face Alignment

This chapter describes, in detail, a two step face alignment algorithm that accurately aligns the surfaces of two face scans (a probe scan that is to be aligned to a target scan). Figure 5.1 shows a system diagram of this two step face alignment algorithm.

Figure 5.1: Face alignment algorithm (anchor point detection on the target and probe scans, followed by ICP alignment and a final surface comparison).

The first step in this alignment process is coarse alignment using anchor points (which were described in Chapter 4). These anchor points include the nose tip, eye corners, the mouth, chin, etc., and are detected by using a model of the structure of the face, statistics on the population of human faces, and the curvature (second derivative) at various points. Anchor points detected in both the probe and target
The Surface Alignment Measure (SAM) is defined as the root mean squared error over all the control points, after trimming. The SAM is used as a matching measure that allows the surface alignment algorithm to be evaluated by calculating the Receiver Operating Characteristic (ROC) and Equal Error Rate (EER). The ROC and EER allow for simple comparison and evaluation of the surface alignment algorithm without the need for manual evaluation. The ROC and EER are relative measures used to evaluate how different input parameters affect the quality of the alignment.

The remainder of this chapter describes 3D face alignment in more detail. Section 5.1 describes existing work in 3D face alignment, and Section 5.2 describes the specific face registration algorithm used in this dissertation and the experiments used to evaluate the quality and speed of the algorithm. Section 5.3 presents experiments for testing the reasonable operating range of the SAM score by testing the surface alignment algorithm on a mannequin head. Finally, Section 5.4 discusses using the SAM value as a matching score, and applies this technique to the Face Recognition Grand Challenge (FRGC) dataset. Section 5.5 presents methods for extending the face alignment algorithm to account for arbitrary face poses.

5.1 Background

Image registration is "the process of determining the point-to-point correspondence between two images of a scene" [68]. Registration is one of the most difficult obstacles for biometric recognition algorithms, which must first process (or normalize) the feature space in such a way as to make the biometric data from two sources comparable. In most 2D face recognition systems, this registration/normalization process involves identifying key anchor points or areas and then transforming the data sources so that these areas align. The first step in a traditional 2D face identification system is to detect the face in the image. Once the face is detected, the image is usually cropped and the size is standardized to match the faces in the database; this standardization may include rotation and translation components. For 3D face scans, face detection is a much simpler problem than in the 2D world. 3D faces are normally distinct blobs in 3D space, so detecting a face is not a difficult problem. In 2D face recognition systems pose is a factor that is usually ignored (by assuming a frontal pose) because the number of different projections of a 3D face object in 2D is arbitrarily large. For 3D, however, pose is less of an issue.

The alignment algorithm presented in this chapter uses a two step procedure. The first step is coarse alignment, and the second step is to finely align the scans. Similar two step procedures can be found throughout the face recognition literature. Chua et al. [41] found anchor points using point signatures based on the local neighborhood of surface normals, while Conde and Cipolla [48] found anchor points by using spin images. In both of these approaches [41, 48], three anchor points are chosen to make a coarse transformation and then ICP is used for fine alignment. An advantage of these two methods of anchor point detection is that both anchor point detectors have been shown to work on general 3D objects [42, 84]. However, both spin image and point signature algorithms are prohibitively slow and have not been tested thoroughly on faces. In addition, the results presented by Chua et al. [41] offer no evidence to show that this method would work on images with arbitrary pose.

5.2 Face Alignment Algorithm

The face alignment algorithm is outlined in Figure 5.2. The first step is to detect common anchor points on each scan; these anchor points are then used to coarsely align the scans to each other. The second step is to use the ICP algorithm (described in Section 3.5.3) to finely align the scans.

Figure 5.2: Alignment algorithm outline. The target and probe scans are coarsely aligned using anchor points; control points are then selected, the closest points and their distance errors are calculated, points with large distance errors are trimmed, and a new alignment that minimizes the distance error is computed, repeating until the average distance error is small. The result is passed to the final surface comparison.

The first step of the alignment algorithm uses automatically identified anchor points (see Chapter 4) to coarsely align the probe scan to the target scan. Coarse alignment is an estimation of a three dimensional transformation from the probe scan to the target scan. The alignment does not need to be exact, but it must be close enough to then allow fine alignment algorithms to converge to an optimal solution. If it can be assumed that the faces have the same pose (i.e., frontal), then it is reasonable to only use one point for coarse alignment (see Section 3.5.1).

After coarse alignment, the second step is to determine a better estimate of the alignment. The fine alignment process follows the Iterative Closest Point (ICP) algorithm (described in Section 3.5.3). Starting with an initial estimate of the rigid transformation, ICP iteratively refines the transformation by alternately choosing corresponding (control) points in the target and probe scans and calculating the best translation and rotation to minimize an error function based on the distance between the target and probe scans.

Table 5.1: List of design parameters used by the two step face alignment algorithm described in this chapter. The default values were initially determined by program evaluation and then verified using the experiments discussed in the listed sections.

Parameter                           Default Value   Section
Control point selection region      Region I        5.2.1
Number of control points selected   100             5.2.1
Maximum number of iterations        10              5.2.2
Distance threshold                  10mm            5.2.3
Error trim value                    10%             5.2.4
Swap compare                        true            5.2.5
Reject option                       7.5mm           5.2.6

Table 5.1 lists all of the adjustable design parameters used to develop the two step alignment algorithm. The remainder of this section discusses a series of experiments to study these design parameters. All of the experiments were run on the MSU dataset (described in Section 2.5.1) unless stated otherwise, and the time results are estimated wall times for a single processor on the MSU High Performance System.

5.2.1 Control Point Region Selection

The ICP algorithm selects a set of control points from the probe scan and uses them to calculate the best alignment to the target surface. The quality of the alignment depends on the control points chosen. There are two parameters to consider when selecting the control points: the region where points are selected and the number of points selected within this region. This section presents two experiments, one for each of these parameters.

In the first experiment, three control point selection regions are compared. Figure 5.3 shows the three regions chosen for the study, which used face scans that varied in expression. The quality of the surface alignment was determined by comparing the equal error rate generated by the SAM scores; Figure 5.4 shows the EER and time changes for the different control point regions.
The results in Figure 5.4 demonstrate that the control region around the eyes performs the best, which is not surprising since this area of the face does not change much with expression. There was no significant difference in the execution times for these control ranges.

Figure 5.3: Control point selection regions. (a) Region I (b) Region II (c) Region III

Figure 5.4: Results when varying control point ranges; the numbers correspond to regions I, II and III from Figure 5.3.

The second factor considered when selecting control points was the number of control points used. Figure 5.5 shows experimental results when varying the number of control points in a grid within the selection region. The results in Figure 5.5a indicate that the number of control points within the region is not a factor, as long as there are enough control points (a minimum of about 25) to ensure that the EER stays fairly constant. Figure 5.5b confirms that increasing the number of control points within a region increases the processing time linearly.

Figure 5.5: Results when varying the number of control points. (a) EER (b) Time

5.2.2 Number of Iterations

It is possible that the ICP algorithm will oscillate between a number of values instead of converging during the fine alignment stage. In such cases, an iteration limit is needed to ensure that the ICP algorithm will terminate. If the iteration limit is too high, ICP will take longer than necessary to run; however, if the iteration limit is too low, ICP may not have enough time to converge.

Experiments were conducted to determine the optimal number of iterations. Figure 5.6a shows that increasing the iteration limit above 5 does not improve the EER, while Figure 5.6b demonstrates that an iteration limit higher than 40 allows all the scans to converge (error spikes are due to inconsistent system load).

Figure 5.6: Results when varying the maximum number of iterations. (a) EER (b) Time

5.2.3 Distance Threshold

The distance threshold is the maximum distance at which the algorithm will look for the closest point. This is a required parameter for the fast 2.5D matrix ICP algorithm introduced in Section 3.5.3. A high distance threshold allows ICP to make large jumps in its alignment; however, small distance thresholds reduce the point search area, help remove outliers, and speed up the algorithm.

Figure 5.7: Results when varying the distance threshold. (a) EER (b) Time
Figure 5.7 gives the results of varying the distance threshold. The minimum threshold value shown in Figure 5.7 is the minimum of the scan resolution. Figure 5.7a indicates that increasing the distance threshold above this minimum does not affect the quality of the matching error, and Figure 5.7b demonstrates that the smaller the distance threshold, the faster the algorithm.

5.2.4 Trimming Study

The two step registration algorithm (coarse and then fine) described in this chapter works well in many cases. However, errors may exist in the final transformation matrix due to noise in the data or changes in the surface shape of the object (such as a change in expression). Trimming is used as a robust method for improving the calculation of the alignment to better tolerate noise and variations in expression. An example of a more robust algorithm is Trim ICP [39]. This is a simple variation on classical ICP that incorporates a trim threshold, which is a fraction from zero to one that represents the percentage of control points that will be used to calculate the transformation matrix. For example, if one hundred control points are used and the trim threshold is set to 0.8, only the 80 best points (those with the smallest distance errors) will be used to calculate the matrix. This improves the algorithm's performance for data where up to 20% of the information is spike noise.

Figure 5.8: Results when varying the number of points trimmed. (a) EER (b) Time

Figure 5.8 shows that, in general, keeping more control points results in better performance. However, there is a small drop off around 95% due to errors in the scans (such as spike noise and holes). The results in Figure 5.8a show that the trim parameter is robust to minor variations in its value, and a trim value as low as 60% still produces good results while improving execution time (see Figure 5.8b).

5.2.5 Scan Swap

The alignment algorithm presented in Section 5.2 is asymmetrical, meaning that it is possible to get different answers depending on which scan is used as the target and which one is used as the probe. These discrepancies occur because different sets of control points are selected on each scan. Also, it is possible that control points will fall into holes or data spikes that can cause large matching errors if the scan is used as a target. By swapping the target and the probe, two different SAM values are calculated and the lowest matching score (SAM) is chosen as the surface matching score. Figures 5.9, 5.10, and 5.11 compare the asymmetric and symmetric (swap) algorithms on various datasets.

As Figures 5.9, 5.10 and 5.11 demonstrate, there is no significant difference between the asymmetric and symmetric matching results. This suggests that it may be better to run the algorithm in an asymmetric mode because the alignment algorithm will only have to run once. However, during exploratory test runs of the system it was found that swapping the scans and taking the minimum SAM improved results, especially if there were large pose variations in one of the scans. These types of large pose variations are not well represented in the dataset.
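The swap strategy, together with the trimmed SAM score itself, can be sketched compactly. The code below is a simplified illustration only: `match(probe, target)` is a hypothetical function standing in for the full coarse-plus-ICP alignment and is assumed to return the per-control-point closest-point distances, and the 1 mm early-accept threshold anticipates the early-termination option discussed in the following paragraphs rather than being a fixed system parameter.

```python
import numpy as np

def sam(control_errors, trim=0.9):
    """Surface Alignment Measure: RMS closest-point error over the control
    points after trimming the largest errors (trim = fraction of points kept)."""
    e = np.sort(np.asarray(control_errors, float))
    kept = e[:max(1, int(round(trim * len(e))))]
    return float(np.sqrt(np.mean(kept ** 2)))

def symmetric_sam(match, probe, target, accept_threshold=1.0):
    """Run the asymmetric alignment both ways and keep the lower SAM.
    If the first SAM is already below accept_threshold (mm), skip the swap."""
    s1 = sam(match(probe, target))
    if s1 < accept_threshold:
        return s1
    s2 = sam(match(target, probe))
    return min(s1, s2)
```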
Instead of using the swap as a way to avoid bad target data, another option is to set some standards on the quality of the target scan. For example, during enrollment target scans should be evaluated for the size and number of holes and spikes. Surface cleaning techniques could also be used to improve the quality of the target model (see Section 3.4). With these standards in place there is less need to run a symmetric (swap) alignment algorithm. A third option is to speed up the swap algorithm by terminating early (before the swap) if the first match is below the matching threshold. For example, in both FRGC 2.0 and FRGC 1.0 at least 92% of the pairs of genuine matching scores are both under 1mm SAM. If a subject is automatically accepted when the first SAM value is below this threshold, then the second ICP match (i.e., the swap) could easily be skipped.

Figure 5.9: Symmetric matching performance on the MSU dataset.

Figure 5.10: Symmetric matching performance on the FRGC version 1.0 dataset.

Figure 5.11: Symmetric matching performance on the FRGC version 2.0 dataset.

5.2.6 Automatic Reject Option

There are times when the alignment algorithm fails due to poor quality scans. This can occur if there is noise in the scan, if the subject moves during scanning, or if the pose varies too far from frontal. In these cases it may be better to have the algorithm detect that there is a problem with the scan and notify the subject. A rejection algorithm was developed based on the symmetry of the face, using ICP to align the current probe scan with a mirror image of itself. If the anchor points were not found correctly, then the SAM score between a probe scan and its mirror image will likely be high. However, if the anchor points are found correctly, the SAM score is normally quite low when matching a scan to its mirror image. Using this assumption, 14 (1.5%) of the 948 scans in the FRGC 1.0 database were rejected due to poor symmetry in the scan. Figure 5.12 shows some examples of the rejected scans. Errors such as spike noise in the hair (Figure 5.12a), spike noise around the eyes (Figure 5.12b) and scans with no valid depth points (Figure 5.12c) are typical types of errors that the anchor point algorithm cannot handle. Only 1 good scan was incorrectly rejected and 2 bad scans were not rejected using the symmetry based reject criterion. After implementing the reject option, the ROC curves were recalculated and a new EER of 1.2% was achieved (see Section 5.4).

Figure 5.12: Rejection examples.
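The symmetry-based reject option can be outlined as follows. This is a hedged sketch rather than the system's implementation: `run_icp(src, dst)` is a hypothetical stand-in for the alignment routine (assumed to return per-control-point errors), the choice of mirroring axis is an assumption, and only the 7.5 mm reject value is taken from Table 5.1.

```python
import numpy as np

def reject_by_symmetry(probe_points, run_icp, sam_threshold=7.5):
    """Reject a poor-quality scan by matching it against its own mirror image.
    probe_points is an (N, 3) array; returns True when the scan should be rejected."""
    mirrored = np.asarray(probe_points, float).copy()
    mirrored[:, 0] *= -1.0                      # reflect across the (assumed) x axis
    errors = np.sort(np.asarray(run_icp(probe_points, mirrored), float))
    kept = errors[:max(1, int(0.9 * len(errors)))]   # trimmed SAM, as in Section 5.2.4
    sam_value = float(np.sqrt(np.mean(kept ** 2)))
    return sam_value > sam_threshold
```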
5.3 System Error Analysis

When using human subjects, it is difficult to determine if variations in the SAM are associated with actual changes in the head (e.g., changes in expression, weight, hairstyle, etc.) or are just an artifact of sensor noise. Figure 5.13 shows a mannequin head (named Marie) bought from a local beauty salon. Data from this mannequin was used as ground truth for evaluating the errors associated with the scanner and the matching system developed for this dissertation.

Figure 5.13: Example mannequin data with anchor points detected.

Multiple scans of Marie were collected while varying the roll, pitch, and yaw of the head, as well as the distance to the camera, in order to evaluate the best operational distance for face matching. Marie's nose tip was initially set to 152cm from the camera, then moved forward (toward the scanner) in increments of 10cm. It was discovered that any measurement made closer than 77cm to the scanner could make nose detection difficult because the scanner's focus range did not include the nose (see Figure 5.14). This was an intermittent problem that occurred when the scanner missed the tip of the nose during the focusing procedure.

Figure 5.14: Nose was not captured due to the focusing range of the Minolta scanner. Scans that are closer than 0.77m can cause problems.

Five experiments were conducted to evaluate the performance of the surface matching system under various external conditions. The first examined the effects of facial color and lighting. The second experiment was a control experiment to determine the baseline SAM values for standard operation of the system when subjects are cooperative, maintain a neutral expression and are 1 meter from the scanner. The third experiment evaluated the performance of the system under changes in head pose (roll, pitch, yaw). The fourth experiment explored the changes in SAM when varying the distance between the camera and the subject. In the final experiment, the face matching system was used to calculate the SAM scores on the FRGC version 1.0 dataset (948 scans of 275 people with non-rigid features). The remainder of this section discusses these experiments in detail.

5.3.1 Color and Lighting

Changes in color (skin tone, makeup, lighting, etc.) can affect the Minolta scanner's ability to detect the laser, which leads to failure of the triangulation scheme and of computing the depth value. Washout from severe lighting inhibits the camera's ability to detect the laser beam, as do some dark colors. These types of color and lighting changes, however, usually do not prohibit the matching system from generating an acceptable SAM value. Figure 5.15 gives examples of extreme lighting and color changes that do not prevent the alignment system from functioning properly.

Figure 5.15: Example scans with extreme lighting and color variations. (a) Low lighting (0.23 mm SAM) (b) Camouflage (0.59 mm SAM)

5.3.2 Optimal Conditions Baseline SAM

20 scans (totalling 190 matching scores) were made of Marie in a stationary frontal pose with constant fluorescent lighting. The resulting SAM values had a mean of 0.37mm ± 0.15mm. These results indicate the range of expected variation due to sensor noise, and suggest that two scans that fall within this range indicate a proper alignment of the same face surface.

A histogram of those SAM values for Marie is shown in Figure 5.16. The dotted line represents the Gaussian approximation to the histogram with a mean at 0.37mm and a standard deviation of 0.15mm. As a comparison, two more Gaussian distributions are shown, one for intra-class (Genuine) SAM values and the other for inter-class (Imposters) SAM values. The "Genuine" curve represents scans taken from 111 people; in half of these scans the people were smiling. Notice that the mean of the genuine curve is slightly higher than that for Marie due to changes in expression (0.7mm ± 0.15mm). The "Imposters" curve was taken from the same 111 subjects. Notice that the SAM values for Imposters are much higher (1.5mm ± 0.3mm).
The overlap between the Genuine and Impostor curves represents the error region for the face matching system.

Figure 5.16: Gaussian approximation of SAM distributions.

This baseline experiment is similar to one reported by Boehnen and Flynn [22], which compared characteristics of different scanner technologies. In their study, scans were made of rigid masks produced from a milling machine and then compared to the original shape of the masks. Their study showed that the Minolta Vivid 910 scanner provided the most accurate surface measurements. However, the surface matching errors reported in [22] are much lower than many of the averages seen in the experiments reported in this dissertation. Four experimental differences account for this discrepancy. First, the study in [22] used the fine resolution mode on the Minolta scanner (640 x 480 pixels) instead of the fast resolution mode (320 x 240 pixels) used in this dissertation; this change in resolution affects the baseline distance errors for the closest point calculation. Second, Boehnen and Flynn preprocessed the scans to remove small holes and spike noise on the surface of the scan. Third, the faces were rigid and the entire face area was used, including the area around the mouth and chin, enabling more accurate pitch values. Finally, the masks used in [22] were manufactured with an optimal white surface to reflect the light from the Minolta scanner. Combined, these experimental differences account for the differences in error scores between the study reported here and the study described in [22].

5.3.3 Pose Variation

The third set of Marie experiments tested the effects of changing the relative pose angle between the face and the camera. There are three types of pose changes that a face can undergo in 3D space: yaw, pitch, and roll (see Figure 5.17).

Figure 5.17: Pitch, yaw and roll.

The yaw, pitch and roll were calculated on all of the datasets, and the average values are shown in Table 5.2. Also included in the table is the maximum roll, pitch and yaw between the two scans with the most variation. These values are used as an estimate of how much variation may be expected in a real-world application, because the subjects in these datasets were asked to maintain a frontal pose.

Table 5.2: Differences between valid scans in the database.

               Roll           Pitch          Yaw            Distance
  MSU          0.36° ± 0.49   1.62° ± 2.25   1.52° ± 1.99   63.69mm ± 66.08
  max          3.11°          16.03°         17.30°         431.97mm
  FRGC 1.0     0.73° ± 0.81   2.95° ± 2.50   2.54° ± 2.13   164.44mm ± 78.17
  max          6.12°          13.03°         17.58°         646.75mm

The results of the pose experiments can be found in Figures 5.18, 5.19 and 5.20. The Marie results represent an average of 5 scans, while the virtual rotation database results are averaged over 300 scans for each rotation angle. Any matching score above 5mm is considered to be a false match. In the yaw and pitch experiments, the increase in the SAM values at high rotations (> 10 degrees) is due to inaccurate anchor point detection. This is an expected result because the anchor point detection system was designed to find anchor points on strictly frontal pose scans. In experiments with roll, the nose is always detected because the nose tip always remains closest to the camera. The anchor point algorithm uses the nose location to identify other anchor points by calculating bounding boxes from statistics generated over a number of faces, under the assumption that the face is relatively vertical.
Thus, if the head is significantly rolled, the calculated bounding boxes will not contain the desired anchor points. ICP does not need all of the anchor points to correctly align two surfaces; therefore, the gradual increase in SAM shown in Figure 5.18 (10 ICP iterations) is not due simply to the missing anchor points. Instead, the increase is due to the number of iterations used by ICP. The roll experiments were re-run using fifty ICP iterations instead of ten; the results are also shown in Figure 5.18. The graph confirms this hypothesis, since a flattened, low-SAM region is now present between -30 and 30 degrees of roll for both the Marie and the virtual rotation data. With 50 iterations, the range of acceptable SAM greatly expands and almost all of the tests converged (the average number of iterations was 19, with a standard deviation of 13 iterations).

Figure 5.18: Change in roll vs. SAM (curves for 10 ICP iterations, 50 ICP iterations, and the virtual rotated scans).

Figure 5.19: Change in pitch vs. SAM (curves for Marie, the virtual rotated scans, and the baseline scans).

Figure 5.20: Change in yaw vs. SAM.

5.3.4 Change in Stand Off Distance

The Minolta Vivid 910's operating stand-off for the medium distance lens is reported to be between 0.6 and 2.5 meters [108]. In the Marie experiment, the system was tested with the distance between the camera and the subject varied from 0.35 to 3.05 meters. The system found no valid points on the face when the distance was below 0.5 meters or above 2.9 meters. The results for the valid scans can be seen in Figure 5.21. The closest viable distance from Marie to the scanner was found to be 0.65 meters. At this distance, the matching error level was measured to be 0.41mm. If Marie was closer to the camera, the computed SAM value was not reliable. As the distance increased, the SAM increased at an almost linear rate.

Figure 5.21: Change in stand-off distance vs. SAM.

5.4 Large Dataset Experiments

In the final experiment, the alignment algorithm was evaluated on the Face Recognition Grand Challenge (FRGC 1.0) database, containing 275 subjects with 948 subject scans. All the scans are frontal pose, neutral expression; however, there is some deviation from this due to natural human pose. The data were gathered using a different Minolta 910 scanner from the one used in the Marie study, and the 640 x 480 high resolution setting was used instead of the 320 x 240 fast resolution mode. The higher resolution can lower the sampling errors for ICP; however, the higher resolution scan also takes longer to capture, so there are more data errors due to motion of the subjects. A baseline 3D matching performance based on PCA (similar to [133]) is also available for this dataset from the University of Notre Dame.
In this baseline algorithm, both the color and depth components of the face scans are normalized to a standard width and height using manually selected points at the centers of the eyes. Once the images are normalized, the PCA algorithm is applied to both data channels independently and the matching results are reported. Figure 5.22 shows the ROC curves produced by the baseline algorithm and the SAM matching score on the MSU dataset. Figure 5.23 shows the same graphs on the larger FRGC 1.0 dataset, and Figures 5.24, 5.25 and 5.26 show the graphs for the FRGC 2.0 dataset. The Notre Dame baseline algorithm has an EER of 4.6%, and the SAM (as a matching score) has a baseline EER of 3.8% on the same database. Figure 5.27 shows the three face pairs with the lowest false positive SAM scores.

The SAM score performs much better for low values of the false positive rate. Many of the errors at the top end of the ROC curve are due to incorrect detection of the anchor points. In fact, the 3.8% EER includes only 23 False Reject errors. One of these scans is missing its depth data, and the other 22 errors are due to bad anchor points in the same 15 scans. These error rates demonstrate some of the difficulty with this testing method: a single error in anchor point detection on a single scan can propagate and cause misleading results. In a real application, it would be better to automatically identify and reject some of the bad scans and have the subject rescanned.

Figure 5.22: ROC curves for the MSU dataset (SAM with reject vs. baseline PCA).

Figure 5.23: Resulting ROC curves for the FRGC 1.0 dataset.

Figure 5.24: Performance on the FRGC 2.0 dataset when the training and testing data are both from the same semester.

Figure 5.25: Performance on the FRGC 2.0 dataset when the training and testing data are taken from two different semesters.

Figure 5.26: Performance on the FRGC 2.0 dataset when the training data is from one semester and the testing data is from a later semester.

Figure 5.27: Example false accept SAM scores (0.39, 0.41 and 0.42).

Similar experiments were conducted on the FRGC 2.0 dataset. As described in Section 2.5, this dataset is much larger and more difficult than the version 1.0 dataset. Results are reported for three matching conditions. Figure 5.24 shows the results with the target and query scans taken from the same semester. A more difficult situation is shown in Figure 5.25, where the target and query data span two different semesters. The most difficult situation is shown in Figure 5.26, where the target data is gathered in one semester and the query data is taken from a later semester. In all cases, the SAM score performs slightly better than the baseline PCA algorithm. However, quite a few research groups anonymously reported 100% performance on the FRGC 1.0 dataset. These algorithms tend to use manually selected points and train extensively on the dataset.
It is unlikely that a valid cross section of the human population will yield 100% accuracy using face recognition. Human faces vary too much, and there is at least a subsection of the population that looks alike. Figure 5.28 shows the SAM scores for matching two pairs of twins. Notice that even though the genuine scores are smaller than the impostor scores, most of the scores are well below the standard threshold of around 1mm. This means that the algorithm cannot tell whether the values come from the same subject or from different subjects. This does suggest that a recognition system may be able to tell twins apart, but a verification system that normally relies on a globally trained threshold would be unsuccessful.

Figure 5.28: Results for matching two sets of twins: (a) Twin Set A, (b) histogram of SAM for Twin Set A, (c) Twin Set B, (d) histogram of SAM for Twin Set B. The genuine scores are in green (thin bars) and the impostors are in red (wide bars).

5.5 Arbitrary Pose

Section 5.4 demonstrated that the frontal alignment algorithm performs quite well even with small variations in pose and expression. However, the experiments in Section 5.3.3 also demonstrated that large variations in pose can cause the frontal face alignment algorithm to fail. This section describes the modifications necessary to allow the frontal alignment algorithm to be used as an arbitrary pose alignment algorithm.

When designing an arbitrary pose alignment algorithm, some of the assumptions made in the frontal alignment system do not apply. Specific problems that need to be addressed when working with arbitrary poses include:

- A single anchor point is no longer sufficient for coarse alignment.
- There may not be enough points in the control point selection region for proper alignment.
- Two scans with different poses may not overlap at all.

To address these problems, the following modifications were made to the alignment algorithm:

- A full 3D face model (see Section 2.4.7) was used as a target scan to ensure enough overlap with the probe scan.
- The three point rigid alignment algorithm (described in Section 3.5.2) was used for coarse alignment.
- The control point selection regions change depending on the pose of the scan (see Figure 5.29).

The Besl ICP algorithm [17] uses point-to-point distance and a closed-form solution when calculating the transformation matrix during each iteration. The point-to-plane distance used by Chen [37] makes the ICP algorithm less susceptible to local minima than the point-to-point metric [17, 64]. Figure 5.30 shows the results of ICP matching of a full 3D model with a 2.5D test scan. The two classical ICP algorithms [17, 37] are also integrated in a zigzag running style called the hybrid ICP algorithm [100, 99]. Each iteration consists of two steps: Besl's scheme [17] is used to compute an estimate of the alignment, followed by Chen's scheme [37] for refinement. The two different distance metrics are thus utilized together, which has the potential for a better registration result.

Figure 5.29: Grid of control points used by the fine alignment ICP algorithm. The grid is adapted to the pose of the test scan. The location of the grid points is determined by the location of the extracted anchor points.

Figure 5.30: Matching results after fine alignment. From left to right: the 2.5D test scan, the 3D model, and the 3D textured model overlaid by the wire-frame of the test scan after fine alignment.
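The following is a minimal sketch of this zigzag hybrid scheme, assuming numpy and scipy are available and that the model surface normals have already been estimated. The function names and the small-angle linearization used in the point-to-plane step are illustrative assumptions, not the exact implementation from [100, 99].

# Sketch of the hybrid "zigzag" ICP idea: alternate a point-to-point
# (Besl-style) update with a point-to-plane (Chen-style) update.
import numpy as np
from scipy.spatial import cKDTree

def point_to_point_step(src, dst):
    """Closed-form rigid transform (SVD) minimizing point-to-point distance."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    U, _, Vt = np.linalg.svd((src - mu_s).T @ (dst - mu_d))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, mu_d - R @ mu_s

def point_to_plane_step(src, dst, normals):
    """Small-angle linearized transform minimizing point-to-plane distance."""
    b = np.einsum('ij,ij->i', dst - src, normals)
    A = np.hstack([np.cross(src, normals), normals])
    ax, ay, az, tx, ty, tz = np.linalg.lstsq(A, b, rcond=None)[0]
    R = np.array([[1, -az, ay], [az, 1, -ax], [-ay, ax, 1]])  # approx. rotation
    return R, np.array([tx, ty, tz])

def hybrid_icp(probe, model, model_normals, iterations=10):
    """Each iteration: one Besl-style step followed by one Chen-style step."""
    tree = cKDTree(model)
    src = probe.copy()
    for _ in range(iterations):
        _, idx = tree.query(src)                       # closest-point pairing
        R, t = point_to_point_step(src, model[idx])    # coarse update
        src = src @ R.T + t
        _, idx = tree.query(src)                       # re-pair after the move
        R, t = point_to_plane_step(src, model[idx], model_normals[idx])
        src = src @ R.T + t
    return src

The alternation is the design point: the closed-form point-to-point step is cheap and stable far from the solution, while the point-to-plane step refines the fit and is less prone to the local minima of the point-to-point metric.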
In order to properly evaluate the performance of the arbitrary pose alignment system developed in this dissertation, it was tested in two experimental modes: automatic and manual. In automatic mode, the algorithm selects the anchor points used in coarse registration, while in manual mode the anchor points are labeled by hand. The average error between the manually selected anchor points and the automatically selected anchor points was 18.12mm, with a standard deviation of 37.58mm. This error seems large because of the outliers in the automatic point selection. The error was calculated by taking the average distance between the manually labeled points and the automatically extracted points. Many of the points automatically selected by the system are repeatable, even though they do not match the manual points. In addition, there are some outliers where the automatic anchor point selection system failed completely (see Chapter 4).

The results reported here are from two experiments dealing with arbitrary pose matching. In the first experiment [100], the system was tested on a database of 63 scans of the same 10 people, which varied in pose (up to ±60 degrees from the frontal view), facial expression and lighting. In the second experiment [99], the database was extended to 18 subjects for a total of 113 scans. These studies give an idea of the success rate when matching 2.5D to 3D faces.

In the first experiment, the fine alignment and face recognition systems were tested on both manually selected anchor points and automatically selected anchor points. The recognition accuracy is given in Table 5.3. The hybrid ICP achieved the best performance among the three ICP schemes, and Table 5.3 includes the two classical ICP algorithms for comparison. In this test, a success rate of 1 error out of 15 (93.3% accuracy) was achieved for frontal test scans when comparing frontal 2.5D face images. This is noteworthy considering that in many of the test images people had different expressions than in the training images (i.e., some were smiling).

Table 5.3: Number of failed tests out of 63 scans due to recognition in the first test. The number of iterations (10, 10) represents a total of 10 iterations for Besl's ICP scheme and 10 iterations for Chen's scheme.

  Algorithm       Iterations   Anchor Point   Range image   Range + Texture +
                               Detection      only          Shape Index
  ICP_hybrid      (10, 10)     Manual         6             5
  ICP_hybrid      (10, 10)     Auto           13            10
  ICP_Chen [37]   20           Manual         10            6
  ICP_Chen [37]   20           Auto           16            14
  ICP_Besl [17]   20           Manual         19            5
  ICP_Besl [17]   20           Auto           33            15

When the testing database was extended to include semi-profile images, the error went up (92% accuracy). These results assume that the coarse alignment procedure is correctly done. The combined total error of the entire automatic system (which covers all changes in pose and expression) was 10 out of 63 (84% accuracy). The hybrid ICP scheme is still robust to the localization errors of the anchor points, and it outperforms the other two classical ICP algorithms on surface matching (range image only) [100]. The combination of surface, texture and shape index matching achieves better performance than surface matching based on the range image alone.

In the second experiment, the face recognition system was run with two different matching scores. The first matching score only used the distance measurement produced by the ICP algorithm. The second matching score used the distance measurement as well as the local shape index for each of the control points.
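A minimal sketch of such a combined score is shown below. The shape index values are assumed to be precomputed for each control point, and the equal weighting of the two terms is an illustrative assumption rather than the scheme used in the experiments.

# Sketch: combining ICP control-point distances with a local shape index
# difference into a single matching score (lower is a better match).
import numpy as np

def combined_score(distances_mm, shape_probe, shape_target, weight=1.0):
    """RMS distance term plus a weighted RMS shape-index difference term."""
    d_term = np.sqrt(np.mean(np.asarray(distances_mm) ** 2))
    s_term = np.sqrt(np.mean((np.asarray(shape_probe) -
                              np.asarray(shape_target)) ** 2))
    return d_term + weight * s_term

print(combined_score([0.4, 0.6, 0.5], [0.62, 0.41, 0.80], [0.60, 0.45, 0.78]))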
A summary of the experimental results is shown in Table 5.4. The best matching occurred when both surface matching and the shape index were used for the matching score. All the errors were due to changes in facial expression, which actually changed the 3D shape of the model. Table 5.5 summarizes the errors with respect to two variations: expression and pose.

Table 5.4: Second test matching error rates.

  ICP Only            4.4%
  ICP + Shape Index   3.5%

Table 5.5: Distribution of matching errors from a total of 113 test scans in the second experiment.

              Frontal   Profile   Total
  w/o smile   0         0         0
  w/ smile    1         3         4
  Total       1         3         4

Figure 5.31 shows two examples of difficult 2.5D scans that were correctly matched by the arbitrary pose algorithm to the 3D model. Figure 5.32 shows two examples of difficult 2.5D scans that were incorrectly matched.

Figure 5.31: Examples of correctly matched test scans. Notice the changes in lighting, pose and facial expression.

Figure 5.32: Examples of incorrectly matched test scans.

Chapter 6

Toward a Canonical Face Format

This chapter describes the development of a Canonical Face Depth Map (CFDM), which provides a normalized format for 3D face data. The CFDM is created by establishing a face-based coordinate system and then reprojecting the 3D data onto an orthogonal data structure that is fixed with regard to scale and resolution. The orthogonal depth map can be represented with z expressed as a function of x and y (z = f(x, y)), as described in Section 3.6. The normalized CFDM format is demonstrated to improve the performance of a SAM-based matching system, to reduce the memory required to store a scan, and to help clean up holes in the face scan data. The CFDM can also be used as a preprocessing step to provide normalized data for matching algorithms such as PCA and correlation. This chapter demonstrates that the CFDM can serve as an automated first step to the combined CFDM/PCA algorithm provided with the FRGC. Results indicate that this PCA algorithm produces very good matching results while eliminating the need for manual normalization (in the PCA algorithm). In addition, the CFDM performs well as a preprocessor for a correlation algorithm that requires the probe data to have the same scale and resolution as the model. Experimental results show that combining the CFDM with the correlation algorithm provides a reasonable matching system as well as an anchor point localization algorithm.

In Chapter 5, two faces are compared to each other by using a two step surface alignment algorithm. This alignment procedure searches for a rigid transformation R in the set of rigid transforms R that minimizes the distance between a set of control points on the query scan and the surface of the model. If the two scans have large variations in size, then the alignment algorithm may fall into different local minima, causing different values for the rigid transformation. In other words, using the ICP algorithm to align two arbitrary surfaces may result in more than one possible solution transformation. This is not a problem when the alignment algorithm is only used for matching, because the corresponding SAM scores are also large and the query scan is properly rejected. However, it may be useful to have an alignment algorithm that aligns faces in a standard (canonical) way.
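Once faces can be aligned in such a standard way, producing the fixed-scale depth map described at the start of this chapter reduces to reprojecting the aligned points onto a regular grid. A minimal sketch of this orthogonal re-sampling is given below; the grid extents, the 0.5mm step (chosen here so that the map has roughly 400 x 700 samples), and the flag value for empty cells are illustrative assumptions.

# Sketch: resampling an aligned 3D face scan into an orthogonal depth map,
# i.e. a fixed-size matrix storing z = f(x, y).
import numpy as np

def orthogonal_depth_map(points, x_range=(-100, 100), y_range=(-175, 175),
                         step=0.5, flag=-1e6):
    """points: Nx3 array already in the face-based coordinate system (mm)."""
    cols = int((x_range[1] - x_range[0]) / step)
    rows = int((y_range[1] - y_range[0]) / step)
    depth = np.full((rows, cols), flag)
    c = ((points[:, 0] - x_range[0]) / step).astype(int)
    r = ((points[:, 1] - y_range[0]) / step).astype(int)
    ok = (c >= 0) & (c < cols) & (r >= 0) & (r < rows)
    for ri, ci, z in zip(r[ok], c[ok], points[ok, 2]):
        # Keep the largest z when two points land in the same cell; which
        # extreme counts as "closer" depends on the chosen axis convention.
        if depth[ri, ci] == flag or z > depth[ri, ci]:
            depth[ri, ci] = z
    return depth

Linear interpolation over the remaining flagged cells (not shown) would then fill the small gaps, as described in the text.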
In developing a canonical face-based coordinate system, the goal is to find a function f_c : S -> R such that for all s_1, s_2 in S, where s_1 ~ s_2 are scans of the same surface related by a rigid transform R in R:

    R = f_c(s_1) f_c(s_2)^{-1}                                    (6.1)

A surface s in S is in its canonical format if f_c(s) = I.

The CFDM format is built on a face-based coordinate system that is tolerant to variations in pose and noise. A coordinate system transformation is defined by six degrees of freedom (x_0, y_0, z_0, roll, pitch, yaw). The canonical function f_c developed in this chapter is a two-step process. The first step is to establish the mid-line face plane by applying the 3D face alignment algorithm (described in Chapter 5) to a 3D face scan and aligning the scan to its mirror image. The 3D face alignment algorithm is robust to changes in pose and to noise, and will reliably establish the mid-line face plane. The mid-line plane is used to constrain two (roll and yaw) of the six degrees of freedom required to establish a face-based coordinate system. The second step is to calculate the third rotational degree of freedom (pitch) by fitting a parabolic cylinder to the surface of the face along the mid-line plane. The axis of this cylinder is used to define the pitch of the head.

Once the face-based coordinate system is properly defined, the next step in creating the CFDM is to establish an orthogonal depth map from the nose tip into the face scan. This depth map is perpendicular to the xy plane and allows for consistent re-sampling of the surface of the 3D face. Linear interpolation between the points in the scan is used to fill in the gaps. This re-sampling allows for the normalization of the face in terms of scale and resolution, providing the standardization for the CFDM.

Section 6.1 provides background on canonical faces. Section 6.2 describes how the face-based coordinate system required for the orthogonal projection normalization is developed. Section 6.5 discusses properties of the CFDM, while Section 6.6 demonstrates how the CFDM can be used as a preprocessing step for other algorithms.

6.1 Background

The canonical format (CFDM) described in this chapter has two main components: a face-based coordinate system and a normalized feature vector based on an orthogonal depth map. The face-based coordinate system developed in this dissertation is similar to the one presented by BenAbdelkader and Griffin [14]. However, their approach [14] relies on just seven manually selected anchor points to properly determine the location of the coordinate system. In contrast, this dissertation uses ICP to generate a robust bisection of the face and uses high sampling rates to develop a good representation of the mid-line plane. The CFDM developed here does not assume that the nose is the closest point to the camera, which is a weakness of other face-based coordinate systems, such as the approach offered by Malassiotis and Strintzis [101]. Assuming the nose tip is the closest point to the camera is particularly problematic when there is a possibility of spike noise in the data or large hairstyles.

The CFDM provides a consistent feature vector that can be used to apply space transformation algorithms such as PCA. The CFDM is similar to the mesh model presented by Xu et al. [144], which uses the nose tip as the origin and then conducts a rigid, triangular re-sampling of the data that can be done at different resolutions. PCA is also used in the base algorithm of the Face Recognition Grand Challenge (FRGC) [118].
In this case, 3D faces are normalized using techniques similar to those for resizing 2D faces, which creates images with the same between-eye-center distance.

The face-based coordinate system defined for the CFDM requires the calculation of the mid-line plane, which is assumed to be similar to the plane of symmetry. A simple method for determining the axis of symmetry using low frequency filters is suggested by Marola [103]. The centroid of the object in question is determined and all possible axes of symmetry that pass through this centroid are examined. A simple coefficient of symmetry is used to determine the best axis. One problem with using these filters to find symmetry is that low frequency filters tend to blur the salient regions that may be used to determine symmetry. Scognamillo et al. [128] first enhanced the low frequency salient regions using edge detection, and then convolved a vertically-oriented Gaussian filter to find the location of the axis of symmetry. The orientation of this Gaussian filter could be adjusted to find different axes of symmetry. This work was done primarily on gray scale faces, but the authors also demonstrated that the approach worked even on gray scale images of symmetric dots. A major assumption in [128] is that the axis of symmetry must pass through the centroid; however, this may not be true for the face mid-line, depending on pose and occlusion.

The general concept of object symmetry is also explored by Zabrodsky et al. [149], who examine various methods for determining the symmetry of an image. The algorithm proposed in [149] examines all possible axes of symmetry and determines the best axis based on a Symmetry Distance score. This approach is interesting because it looks for a 3D axis of symmetry on a 2D image, where the third dimension is the gray scale (depth) value. However, the algorithm in [149] is prohibitively slow even when the authors implement a multiple resolution scheme to increase the speed and decrease the search space.

There is also a body of research examining the symmetry of the brain in MRI images. This plane of symmetry is called the Mid-Sagittal plane and is important for registering different slice images of the MRI. Liu et al. [96] determined the symmetry of a human brain scan by convolving the mirror image of a (virtual) slice of the head and rotating it about the center of mass of the image; the best correlation gives the best match. Another brain symmetry algorithm, presented by Tuzikov, Colliot et al. [134], uses the ellipsoid of inertia to calculate the initial location of the symmetry plane and then optimizes the estimate with a search.

For this dissertation, the plane of symmetry is established by aligning a face scan with its mirror image using the ICP-based face alignment algorithm described in Chapter 5. Pan and Wu [116] and Zhang et al. [150] also found the axis of symmetry using the ICP algorithm, with similar results. However, these papers assume that the face is vertical, and no trimming is used to account for noise.

6.2 Face-Based Coordinate System

This section explores methods for identifying a face-based coordinate system, which requires accounting for six degrees of freedom: roll, pitch, yaw and the origin (x_0, y_0, z_0). Having a well defined face-based coordinate system makes it simple to transform data into whatever format or pose an algorithm requires. The goal of creating a face-based coordinate system is to establish a format that is easily calculated and repeatable for the same face.
In addition, face-based coordinate systems should be robust to noise in the data, and it should be possible to estimate the face-based coordinates even when some of the face is occluded or missing. Ideally, it should also be possible to visually estimate the face-based coordinate system from the 3D data, which would allow databases of different modalities to be registered with each other.

Section 6.2.1 introduces a basic approach for establishing a face-based coordinate system by using identified anchor points. Section 6.2.2 extends this method by using the surface alignment algorithm described in Chapter 5 to finely align a scan with a baseline model. Section 6.2.3 describes the canonical method for using more robust face-based features, such as the mid-line of the face, which do not require a generic 3D model.

6.2.1 Anchor Point Alignment

The majority of 3D face-based coordinate systems are based on the selection of anchor points. The simplest of these approaches is a single point transformation, where the 3D scan is assumed to already be in a known pose (normally frontal); this pose accounts for three degrees of freedom: roll, pitch and yaw. The remaining three degrees of freedom are accounted for by translating the scan using a single point, such as the tip of the nose or the centroid. Single point alignment is by far the easiest method for normalizing a scan, but it is also the most susceptible to minor variations in the data, such as changes in pose, specularities, etc. Even a seemingly frontal database can have inter-pose variations as large as 15 degrees (as described in Section 5.3.3). In addition, many single-point methods use heuristics (such as the closest point to the scanner) to find the tip of the nose. These types of heuristics are not robust to spike noise and fail completely if the nose is partially occluded or if there is a hole in the scanning data.

A multiple anchor point alignment method, which is somewhat more robust, requires a minimum of three known, non-collinear anchor points to establish all six degrees of freedom in a face-based coordinate system (see Section 3.5.2). In a three point system, the frontal pose assumption does not have to be made. However, the anchor points must be chosen such that any errors in selecting the points are minimized. For example, if the selected points are close together, the transformation will be biased toward a small section of the scan, which can cause errors in more distant areas of the scan. Additional points can be added to the calculation to minimize the error caused by a single point; BenAbdelkader and Griffin [14] used seven points to estimate a plane on the face.

6.2.2 Face Model Normalization

Although anchor points can be used to establish a face-based coordinate system (as described in the previous section), minor errors in selecting these points can translate into major variations in the coordinate axes. A more robust method for establishing a face-based coordinate system is to use the face alignment algorithm described in Chapter 5 to align every face in the database with a generic face or generalized model. The generic face could be randomly selected from the database, or even selected based on some similarity metric. A generalized face model such as in [5] could also be used. The face-based coordinate system for the generic face is then used to define the face-based coordinate system for the entire database.

The generic-face method for establishing a face-based coordinate system makes some fairly broad assumptions.
First, it is assumed that the alignment algorithm can align any face with the generic face model. Second, it is assumed that the generic face model will always align with a model from the same subject in the same way. However, with an arbitrary generic face model these two assumptions are not always true. For example, if the generic face model is of a large face, then a smaller face scan may not fit properly (see Figure 6.1). In cases where the generic face model and the test face scan are of different sized faces, it is common for the nose of the smaller face to align properly, but roll may be induced to make the smaller face fit with the wider one. It is also likely for more than one local minimum to exist in this situation, and if the SAM is high then there are likely several equivalent minima. An ideal generic face model would minimize these types of errors (roll and local minima).

Figure 6.1: Example error where the generic model and the test scan subject do not have the same sized head, resulting in multiple local minima (panels: generic face model; "random" test face scans; resulting orientation after alignment).

Two different experiments were used to explore normalized faces. The first used a randomly selected face model from the database and aligned all the other scans in the database to this model. The second used an average face model (see Figure 6.2), which was developed by transforming a set of training faces into a canonical format based on face symmetry (see Section 6.2.3) and using the average of these faces as the generic model. Overall, the average face did not fall into as many local minima, but there were still scans that did not fit well to the model.

Figure 6.2: 3D model of an average face generated using a Canonical Face Depth Map, shown in (a) frontal view and (b) side view. This model can be used as a generalized target model for new scans.

One advantage of using a generalized model is that the ICP alignment algorithm is simple and fast. However, misalignment can occur when there are differences in size between the general model and the test scans. Another approach for creating a generalized face is to vary the model to match the size of the face. This variable model could account for variations in size, but it increases the search space.

6.2.3 Defining a Canonical Coordinate System with Global Features

In addition to the use of anchor points and generalized face models, a third option for generating a face-based coordinate system is to identify robust and salient global features on the face that are consistent across all faces. One such feature is the mid-line plane of the face.

Defining the Mid-line Plane

There are two salient properties of the mid-line plane on a human face: the first is a high curvature profile ridge and the second is face symmetry. Since the mid-line has two salient properties, it is possible to estimate the mid-line location using two completely different methods. The symmetry property is very accurate but breaks down if there is not enough surface to determine the symmetry (i.e., the face is turned too much) or if the face is not very symmetrical. Perceived asymmetries can be introduced by sensor noise or lighting conditions, or the face itself may not be particularly symmetrical. Simple external variations, such as changing expression (a wink) or hairstyle, can cause asymmetries. Asymmetries can also occur due to injuries such as a broken nose or a scar.
Research has also shown that there are natural asymmetries that occur in humans during their development [88]. All of these factors make finding the plane of symmetry on the face more difficult. However, in a profile image the mid-line plane can be detected using the prominent profile points. It is important to note that the profile ridge, the plane of symmetry and the mid-line plane may all represent slightly different planes of the face. However, the planes should be similar enough that together these properties make the mid-line plane a useful feature as a baseline for establishing a face-based coordinate system.

One method for detecting the mid-line of the face relies on the high-curvature profile ridge that defines much of the mid-line. The profile ridge is defined as the points on the surface of the profile projection of the face (see Figure 6.3). These points are fairly salient, since much of the profile plane protrudes from the surface of the face (e.g., the nose and chin). Because of the high relative curvature of the nose in the y direction, changes in estimating the profile direction (yaw angle) will result in similar profile locations (see Figure 6.4). In fact, if the head is known to be vertical, meaning at minimum roll and pitch values, then these high curvatures can be used to generate an estimate of the location of the mid-line of the face. However, as Figure 6.4c shows, errors can occur if there are variations in the pose of the scan.

Figure 6.3: Profile ridge. Each point represents the largest z value for every row in the scan.

Figure 6.4: Projection profile examples: (a) non-frontal scan, (b) frontal view, (c) profile projection.

The second salient feature of the mid-line plane is face symmetry. For 3D faces, the mid-line plane can be calculated robustly by matching a face scan with a mirror image of itself (see Figure 6.5) using the surface alignment algorithm described in Chapter 5. For this dissertation, a mirror image of an object is formally defined as follows:

    P_M = M P                                                    (6.2)

where P represents the original 3D points in the scan, P_M indicates the mirror points and M is the mirror transform:

    M = [ -1  0  0  0 ]
        [  0  1  0  0 ]
        [  0  0  1  0 ]
        [  0  0  0  1 ]

Figure 6.5: Example mirror transformation: (a) original scan (P), (b) mirror transformed scan (P_M). The mirror image is a left-hand projection of the original scan.

The process for aligning the original and the mirror scans is the same as for aligning any two faces and uses the alignment algorithm described in Chapter 5. The resulting transformation between the two scans is defined as the mirror transform and is denoted by T_m. With the mirror transform, the mid-line plane can easily be calculated. First, an arbitrary point p is chosen. This point does not have to lie on either scan, but it must not lie on the mid-line plane. Using this point, the surface normal N of the mid-line plane is calculated as:

    p_m = T_m M p                                                (6.3)

    N = (p_m - p) / |p_m - p|                                    (6.4)

If a, b, and c are the x, y, and z components of the surface normal and d is the distance along the surface normal from the origin to the plane, then the following equation defines the plane:

    ax + by + cz + d = 0                                         (6.5)

Figure 6.6 shows an example of the face mid-line plane. The plane defined by Equation 6.5 accounts for two of the six degrees of freedom (yaw and roll).

Figure 6.6: Example location of the mid-line plane relative to a face.
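The sketch below illustrates Equations 6.2 through 6.5 in code: given the mirror transform T_m returned by aligning a scan with its mirror image, it recovers the mid-line plane. The toy T_m in the example is a made-up rigid transform; in practice it comes from the surface alignment algorithm of Chapter 5.

# Sketch: recovering the mid-line plane from the mirror transform (Eqs. 6.2-6.5).
import numpy as np

M = np.diag([-1.0, 1.0, 1.0, 1.0])        # mirror transform (Eq. 6.2)

def midline_plane(T_m, p=np.array([30.0, 0.0, 0.0])):
    """Return (a, b, c, d) of the plane ax + by + cz + d = 0 (Eq. 6.5)."""
    p_h = np.append(p, 1.0)                # homogeneous coordinates
    p_m = (T_m @ M @ p_h)[:3]              # mirrored-and-aligned point (Eq. 6.3)
    n = (p_m - p) / np.linalg.norm(p_m - p)    # plane normal (Eq. 6.4)
    mid = (p_m + p) / 2.0                  # midpoint lies on the mid-line plane
    d = -float(n @ mid)
    return (*n, d)

if __name__ == "__main__":
    # Toy example: T_m is a translation by +10 in x, so T_m M reflects
    # about the plane x = 5.
    T_m = np.eye(4)
    T_m[0, 3] = 10.0
    print(midline_plane(T_m))              # normal ~ (-1, 0, 0), d ~ 5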
By assuming that the pitch angle is negligible, the face scan can be normalized using only the mid-line plane while maintaining the existing pitch; this may produce a reliable face depth map. To test this hypothesis, the MSU database was transformed into the mid-line corrected depth map (at a resolution of 400 x 700) and then each scan was aligned to every other scan from the same subject. After alignment, the roll, pitch and yaw differences were calculated, as shown in Table 6.1. The results in Table 6.1 demonstrate that the mid-line normalization removes the effects of the roll and yaw rotation, but does not improve the pitch rotation.

Detecting Surface Pitch

The mid-line plane can be accurately estimated using the plane of symmetry of the face and the surface alignment algorithm. However, the mid-line plane only accounts for two of the six degrees of freedom. The origin can also be estimated by selecting a robust point, such as the tip of the nose. Yet, this still only accounts for five of the six degrees of freedom. The pitch (tilt) of the head is still problematic.

This section explores methods for robustly calculating head pitch. These methods assume that the mid-line plane has already been calculated and use this plane to restrict the direction of the pitch rotation axis, ensuring that it is orthogonal to the mid-line plane. The experiments described here fit a simple surface to the face and use the surface parameters to calculate the pitch. Figure 6.7 shows three parametric surfaces, a plane, a parabolic cylinder, and a quadratic surface, fit to a single face scan. Each of these surfaces was applied to the database of scans and the results are shown in Table 6.1. The goal of this surface fitting is to calculate a robust and repeatable measurement of the pitch of the surface of the face. All the rotations shown in Table 6.1 have values close to zero. Figure 6.8 shows the variance for each of the pitch correction methods described in this section. From the results in Table 6.1 and Figure 6.8, it seems that the plane model is too simple for accurate pitch detection, while the quadratic model is over-fitting the data. The parabolic cylinder seems to be the best surface fitting approach. A similar model has also been used in 2D pose correction for head tracking [32].

Figure 6.7: Parametric surface fitting using forced symmetrical surface fitting: (a) plane surface z = ax + by + c, (b) parabolic cylinder surface, (c) higher-order quadratic surface.

Table 6.1: Improvement of mid-line normalization over database roll, pitch, and yaw differences. In the first row, the original roll, pitch, yaw and distance statistics represent the variations between different scans of the same subjects within the original database (this first row is taken from Table 5.2). Each subsequent row represents the variation after the different normalization techniques. The Plane, Parabolic Cylinder, Quadratic, and Average fits are all applied after the baseline symmetry is established.
             Roll            Pitch           Yaw             Distance
  Original   0.36° ± 0.49°   1.62° ± 2.25°   1.52° ± 1.99°   63.69 ± 66.08mm
  max        3.11°           16.03°          17.30°          431.97mm
  Symmetry   0.01° ± 0.53°   0.00° ± 3.44°   0.00° ± 0.56°   3.10 ± 2.14mm
  max        2.49°           16.68°          3.20°           34.7mm
  Plane      0.01° ± 0.52°   0.10° ± 3.15°   0.02° ± 0.54°   5.12 ± 2.48mm
  max        2.80°           33.97°          1.81°           56.8mm
  Parabola   0.02° ± 0.53°   0.10° ± 2.59°   0.00° ± 0.50°   2.06 ± 3.33mm
  max        2.46°           32.50°          1.75°           42.20mm
  Quad       0.00° ± 0.53°   0.01° ± 3.42°   0.01° ± 0.51°   12.67 ± 5.31mm
  max        2.56°           16.07°          2.51°           75.85mm
  Average    0.01° ± 0.58°   0.01° ± 2.01°   0.00° ± 0.79°   2.90 ± 7.81mm
  max        2.38°           12.50°          2.85°           73.95mm

Figure 6.8: Comparison of the standard deviation of rotation in the pitch direction over the different fitting techniques.

As a comparison, the results for surface fitting and pitch normalization are also given for the average face model described in Section 6.2.2. Fitting a surface to an average face seems to be the most reliable pitch correction method, but this method is sure to break down if it encounters a face that is significantly larger or smaller than the average. It should also be noted that the average face is generated from the same dataset, so it is probably biased toward this data. Overall, the approaches shown in Figure 6.8 are accurate to within about ±2° in the pitch axis, which is much larger than the ±0.5° for the other two axes.

6.3 Canonical Face Depth Map

This section describes the use of face-based coordinate systems to establish a Normalized Face Depth Map (NFDM). Any face-based coordinate system can be used to develop a normalized format. However, if the face-based coordinate system is highly repeatable over faces and scans that differ in pose, lighting, scanning noise, holes, artifacts, and expression, then this face-based coordinate system can be defined as a canonical format. In other words, a Normalized Face Depth Map that uses a canonical, face-based coordinate system is called the Canonical Face Depth Map (CFDM).

Regardless of which face-based coordinate system is used, the establishment of a CFDM requires that the face data be re-sampled using an orthogonal projection, as described in Section 3.6. The orthogonal projection produces a depth map of constant size and scale, and the CFDM allows a database of face scans to be standardized and aligned. This standardization can act as a preprocessor to other 2D and 3D face processing algorithms and has the following advantages:

- Single matrix representation to reduce memory
- Registration of the faces
- Standard feature vector size
- Normalized pose

Taking advantage of these attributes provides the following benefits:

- Fast correlation between 3D face data
- Trivial feature space transformations (PCA, wavelets, Fourier transforms, etc.)
- Better data comparison for matching

The Canonical Face Depth Map consists primarily of a single matrix depth map, which is standardized so that each increment of the index moves a static distance in the x and y direction (i.e., a constant step value). This allows a depth map to store the x, y and z information in a single matrix (see Section 3.6). There is also a very large (or very small) value that is used to signify a bad point. In addition to the depth map structure, the Canonical Face Depth Map has a standardized face-based coordinate system. Table 6.2 and Figure 6.9 outline the algorithm for converting raw scanner data into the Canonical Face Depth Map. A face-based coordinate system (from Section 6.2) is established first.
Once the face-based coordinate system is properly detected, an orthogonal projection of the data is made onto the xy plane. Some difficulties may arise when this projection is made, and proper sub-sampling of the data may be necessary in order to fill the entire depth map.

Table 6.2: Steps for constructing a Canonical Face Depth Map.

  1. Find key anchor points on the face.
  2. Calculate the mid-line plane using face symmetry.
  3. Calculate the head pitch by fitting a parabolic cylinder to the face.
  4. Transform the data into the face-based coordinate system.
  5. Orthogonally project the depth values onto the depth matrix.

Figure 6.9: Process for generating a Canonical Face Depth Map: original face scan, feature point detection, axis estimation, orthogonal projection, and the resulting Canonical Face Depth Map with registered texture image.

One potential drawback of a Canonical Face Depth Map is that this format can only store information from one direction, so some information may be occluded or sub-sampled due to the angle. For example, points on the side of the nose will be less dense than points in the same area on the cheek, because the normal direction for the cheek is in the direction of the projection plane. This loss of information in areas of high z gradient is a problem with any depth map and is problematic for 2.5D storage formats.

6.4 Robustness Analysis

The CFDM described in the previous section was designed to be robust. This section tests the design to demonstrate its robustness to noise and holes in the data. In the first experiment, Gaussian noise is added to the z component of the MSU dataset. The noise is generated in Matlab using a normal distribution and then multiplied by one of three noise amplitudes (0.5mm, 1.0mm and 1.5mm). This range was chosen because the Minolta scanner accuracy is around 0.5mm and the SAM verification threshold is around 1.0mm. Any lower errors would not be significant, and any higher level of noise would result in inaccurate matching. Figure 6.10 shows some examples of these noise levels on a typical scan. The results for the noise experiments are shown in Table 6.3. These results show that the mean roll, pitch and yaw stay around zero degrees, but as the amount of noise increases the standard deviation of these values also increases. Fortunately this increase is very small, indicating that the CFDM is robust to Gaussian noise.

Table 6.3: Error with respect to noise.

             Roll            Pitch           Yaw             Distance
  Parabola   0.02° ± 0.53°   0.10° ± 2.59°   0.00° ± 0.50°   2.06 ± 3.33mm
  max        2.46°           32.50°          1.75°           42.20mm
  0.5mm      0.01° ± 0.56°   0.06° ± 2.37°   0.01° ± 0.89°   2.57 ± 4.84mm
  max        2.46°           23.98°          5.30°           63.95mm
  1.0mm      0.00° ± 0.62°   0.07° ± 2.49°   0.01° ± 1.00°   3.80 ± 5.99mm
  max        2.33°           31.47°          5.90°           65.73mm
  1.5mm      0.00° ± 0.67°   0.06° ± 2.32°   0.01° ± 1.03°   5.32 ± 6.24mm
  max        5.23°           27.69°          7.82°           62.65mm

The second experiment is designed to test the CFDM against holes in the data. This experiment was run on 36 subjects and 103 scans, with ten holes per scan. Each trial removes a Gaussian hole from the scan; this hole covers a circular area of approximately 5mm radius taken from the region of the scan used by the surface alignment algorithm. Each hole removes a different number of pixels depending on the scale of the scan and on whether the holes overlap with previous holes in the data. This process is repeated for a total of ten iterations, with each iteration adding another hole to the scan. Figure 6.11 shows an example of holes added to the scan.
The graph displays the percentage of valid pixels removed from the entire scan by the holes. The y axis shows the difference in the roll, pitch and yaw between the CFDM face processed without the holes and the CFDM face processed with the holes.

Figure 6.10: Example noise on the surface of the face at (b) 0.5mm, (c) 1.0mm and (d) 1.5mm noise amplitudes.

Figure 6.11: Example of holes on the surface of the face and how the angles in the canonical format change.

Figure 6.12 shows the results for the entire 1030 trials (103 scans with 10 iterations each). Notice that when less than 10% of the data is removed, the angle variation is no more than two degrees. As the number of holes increases, the system generates larger errors. In most cases, the angle difference is still well under 2 degrees of variation. However, there are cases where the errors spike, normally due to holes placed in key locations that interfere with anchor point detection or cause large asymmetries in the data.

Figure 6.12: Comparison of how the holes in the face affect the CFDM angles.

Figure 6.13 compares the percentage of pixels removed against the distance to the origin. In this case, the origin is not as robust as the angles of rotation. The majority of points are well within 10mm, but the origin can move around. This becomes clearer when looking at a breakdown of the x, y, and z components of the distance measure, as shown in Figure 6.14. In this case, the depth value, or z component, of the origin transform is fairly stable (under 10mm), and the x and y components have the most variation. Future work may need to focus on a method to improve the robustness of the origin to noise. However, for the current system this level of noise is not problematic, because having this many holes in the data is unlikely. In any case, it is easy to count the number of holes in a scan and reject it if there are too many pixels missing.

This section has demonstrated that the CFDM is robust to Gaussian noise and holes in the data. In addition to these studies, extensive testing of the system (as described throughout this dissertation) has shown that the algorithm performs quite well in everyday use.

Figure 6.13: Comparison of how the holes in the face affect the CFDM origin.

Figure 6.14: Comparison of how the holes in the face affect the CFDM origin relative to the x, y and z distances.

6.5 Properties of Canonical Face Depth Maps

Canonical Face Depth Maps have many advantages. The Canonical Face Depth Maps described in this dissertation are all sub-sampled using the orthogonal depth map, and this sub-sampling is at the same scale and dimension regardless of the input scan resolution. For example, the images in Figure 6.15 are at different camera offsets. The scans are normalized into the CFDM format (Figures 6.15b and 6.15d) and can then be directly compared. In contrast, 2D images have an unknown scale and must be normalized based on some common point distance, such as the distance between the eyes.
However, different people have different between-eye distances, so the effect is a comparison of faces that must be scaled up or down to match each other.

Because all of the scans use a canonical point of view, the anchor point algorithm can be reapplied to the CFDM faces. Figure 6.16 shows the offset difference between the manually selected points and the automatically generated points. Each image is paired with the same image from Figure 4.6 in Chapter 4. In each pair, the left image shows the anchor point locations using the raw data and the right image shows the anchor point locations after applying the CFDM. Table 6.4 summarizes the results and shows that there is not much improvement over the original anchor point detection, and in some cases the CFDM does not perform as well. However, overall there is more improvement using the CFDM, and it does not significantly deviate from the original point location. Therefore the CFDM can help in point localization.

The facial mid-line plane of symmetry can be used to fill in holes in 3D data, since any point on one side of the mid-line plane should have a corresponding point on the other side of the plane. If a close point or region is not found on the same side of the face, then a copy can be made from the other side to fill in the missing information. This method can be simplified by generating a depth map that is perpendicular to the plane of symmetry. For example, notice the hole around the eyebrow in the scan in Figure 6.17a. This hole can be filled using the mirror image (see Figure 6.17b). Notice that the projected color does not match; this hole-filling algorithm can also re-sample the color image from the original scan if these data points are available.

Figure 6.15: (a) and (c) are scans with different camera offset and resolution; (b) and (d) show how using a depth map normalizes the resolution.

Figure 6.16: Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represent the anchor point locations using the raw data and the right images represent the anchor point locations after applying the CFDM.

Table 6.4: Experimental results for the frontal anchor point detector before and after the CFDM (average offset and standard deviation in mm for each anchor point).

Figure 6.17: Example of symmetric hole filling: (a) scan with holes, (b) scan after hole filling. Note that the color is maintained with the copied hole points.

Another advantage of using a Canonical Face Depth Map is that the depth map can be sampled at any resolution (see Figure 6.18a and b). Lower resolutions can be used to speed up some algorithms, while higher resolutions can be used to increase accuracy. For example, the hybrid Besl and Chen ICP algorithm described in Section 5.5 is used as a way to combine the speed of the Besl algorithm with the performance of the Chen algorithm. Similar algorithm accuracy can be achieved by super sampling the CFDM and creating a highly dense target scan in which ICP can operate. Figure 6.18c shows two ROC curves for face depth maps sampled at different scales. The higher sampling rate slows down the matching algorithm, but it also increases the accuracy of the algorithm.
Figure 6.18: Depth maps sampled at two different resolutions using the MSU database: (a) low sample rate (200 x 350), (b) high sample rate (400 x 700), (c) ROC performance. The higher resolution data improves ICP performance by allowing for small incremental changes during each iteration of the algorithm. The increased surface matching performs slightly better, as seen in (c).

Because the CFDM uses an orthogonal projection, it is possible to reduce the memory requirements for storing the file. One goal for the CFDM is to be able to store it on a smart card. Table 6.5 compares the file sizes for the raw scans and the CFDM format, with the traditional smart card size of 512 KB [66] used as the goal. Note that even in its most compressed formats, it is unlikely that a smart card will be able to store a face scan in its raw format. However, the CFDM does not need to store the x, y, or even the flag component of the face scan. This allows the face to be stored in a format that is much smaller than the 512 kilobyte goal, even at the high 400 x 700 resolution. It is even possible that more than one scan could be stored on a single memory card, allowing for non-rigid variations such as smiles.

Table 6.5: Results of the file format memory comparison. A standard smart card has 512 kilobytes of memory [66]. This is not enough memory to store the raw image of a face scan. Because the CFDM requires only one matrix to represent the 3D surface, there are several CFDM formats that can fit onto a smart card (the compressed FRGC, compressed binary ppm, and Jpeg formats fall under the 512 kilobyte goal), and it is even possible to fit more than one template on a card.

                            Raw 320 x 240 Scans   400 x 700 CFDM
  Minolta (cdm)             1.58 MB               n/a
  FRGC (abs/ppm)            20.84 MB              4.39 MB
  Compressed FRGC           0.88 MB               0.37 MB
  Matlab (mat)              8.64 MB               3.08 MB
  Compressed Matlab         1.70 MB               1.02 MB
  3BX                       4.53 MB               n/a
  VRML (wrl/jpg)            1.70 MB               n/a
  Binary ppm                n/a                   1.68 MB
  Compressed binary ppm     n/a                   0.24 MB
  Jpeg                      n/a                   0.16 MB

6.6 Preprocessing using Canonical Face Depth Maps

The output of the Canonical Face Depth Map algorithm is trivially transformed into a data structure with the same format as the original face scan. Maintaining the data format allows the Canonical Face Depth Map to be used as a preprocessor to other algorithms. This section presents two common image processing algorithms, correlation and PCA, and applies the CFDM developed in this dissertation as a preprocessor for these algorithms. Correlation and PCA were chosen because they have potential for effective identification and they perform best if the input data is normalized and of fixed size and resolution. Section 6.6.1 describes the correlation algorithm in detail and shows how it can be used for face verification as well as anchor point localization. Section 6.6.2 describes how the CFDM preprocessor was used with the baseline PCA algorithm [20] implemented for the Face Recognition Grand Challenge.

6.6.1 Correlation

A normalized 3D face correlation algorithm has been developed for this dissertation. For each window region, the points are normalized to their average depth. The correlation algorithm is similar to traditional correlation and minimizes the L2-norm, but it takes into account missing data (due to flags) and trims errors in the z data to make the algorithm more robust to spike noise.
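A minimal sketch of this normalized, trimmed window comparison is given below. The function names, the flag convention, and the trim fraction are illustrative assumptions rather than the dissertation's exact implementation.

# Sketch: trimmed, missing-data-aware window comparison for the depth channel.
import numpy as np

FLAG = -1e6   # assumed value marking invalid depth pixels

def window_distance(win_a, win_b, trim_fraction=0.1):
    """Trimmed RMS depth difference between two equally sized windows."""
    valid = (win_a != FLAG) & (win_b != FLAG)   # ignore flagged (missing) data
    if not valid.any():
        return np.inf
    a = win_a[valid] - win_a[valid].mean()      # normalize to average depth
    b = win_b[valid] - win_b[valid].mean()
    resid = np.sort(np.abs(a - b))
    keep = max(1, int(len(resid) * (1.0 - trim_fraction)))  # trim spike noise
    return float(np.sqrt(np.mean(resid[:keep] ** 2)))

def correlate(mask, search, step=1):
    """Slide the mask over the search region; return the best score and offset."""
    mh, mw = mask.shape
    best = (np.inf, 0, 0)
    for r in range(0, search.shape[0] - mh + 1, step):
        for c in range(0, search.shape[1] - mw + 1, step):
            d = window_distance(mask, search[r:r + mh, c:c + mw])
            if d < best[0]:
                best = (d, r, c)
    return best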
Several experiments were conducted to compare feature regions on a probe scan to those on a target scan; these experiments used the following procedure:

1. Identify anchor points on the target scan.
2. Choose a region around each point as an independent mask. (See Figure 6.19a.)
3. Find the anchor points on the probe scan.
4. Choose a larger search region about each point on the probe scan. (See Figure 6.19b.)
5. Convolve the masked region within the search region for each corresponding point. (See Figure 6.19c and d.)
6. Record the minimum correlation score and the row and column of the points.

The correlation algorithm was tested on the depth channel of the MSU dataset. All scans were normalized using the vertical symmetry Canonical Face Depth Map algorithm (see Section 6.3). The L2-norm (average root mean squared error over the entire correlation window) for each anchor point (shown in Figure 6.20) was used as a similarity score. Figure 6.21 shows the ROC results for the correlation over every anchor point.

Figure 6.19: Example of the correlation point algorithm: (a) mask region, (b) search region, (c) point found on subject, (d) point found on different subject.

Figure 6.20: Points used in the correlation experiment: nose tip, nose bridge, nose base, inside of right eye, inside of left eye, chin.

Figure 6.21: Correlation applied to the depth channel of a Canonical Face Depth Map using a large mask window. (EER by region: nose 8.26%, bridge 8.72%, base 16.36%, right eye 7.19%, left eye 7.19%, mouth right 25.08%, mouth left 24.62%, chin 20.03%, total 8.26%.)

The results were not impressive compared to the SAM results in Chapter 5, suggesting that the mask regions alone were not sufficient to uniquely identify a single person within a database of 330 subjects. Also included in Figure 6.21 is a curve representing the sum total of all of the L2-norm scores. This total performs better than any of the individual points by itself. The correlation algorithm was also applied to the shape index, as shown in Figure 6.22. The bridge of the nose achieved slightly better performance than the depth channel by itself, yet the summation did not improve any of the scores.

Figure 6.22: Correlation applied to the shape channel of a canonical depth map using a small mask window. (EER by region: nose 9.94%, bridge 6.42%, base 21.56%, right eye 11.62%, left eye 11.01%, mouth right 28.59%, mouth left 29.36%, chin 21.25%, total 13.61%.)

The correlation experiment results suggest that any single anchor point is not sufficient to properly identify a subject. However, combining these feature regions may make verification possible, and correlation may be useful for localizing anchor points.

Another use for the correlation algorithm is point localization. The correlation algorithm uses a moving window to find the best match between regions around points in the scans. To achieve the best localization results, an operator would need to manually select anchor points on a scan during enrollment. When a new query scan is made, the anchor points on the template are used to help guide the point localization.
Figure 6.23 shows the results of point localization on a selection of scans. Notice that there is a larger number of outliers in this algorithm. However, the automatically localized points are better clustered around the manually selected points. Future work will include ways to minimize the number of outliers while maintaining the high quality localization.

Figure 6.23: Point localization using correlation. Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represent the relative anchor point locations using the raw data and the right images represent the anchor point locations after applying correlation localization.

6.6.2 PCA Analysis

In addition to providing a dataset, the Face Recognition Grand Challenge (FRGC) also provides a programming interface called the Biometrics Experimental Environment (BEE). The BEE system gives the FRGC competitors a common interface for running independently-evaluated algorithms. It also provides a version of a PCA matching algorithm to use for baseline comparison. PCA is a well-established 2D color face matching method. It is not state-of-the-art, but has been shown to perform acceptably when there are minor variations in pose and lighting.

In addition to using the FRGC PCA algorithm for alignment, it is also possible to use it as a second matching score to be combined with the SAM score. However, this requires some consideration. First, the FRGC PCA algorithm uses manually selected points, which is unreasonable for a real-world, fully automatic application. Second, the FRGC PCA algorithm requires some training data. This training data is already available in the FRGC database but needs to be added for the MSU database.

The diagram in Figure 6.24 shows the steps involved in using the CFDM alignment algorithm developed in this dissertation as a preprocessor for the FRGC baseline PCA algorithm. The FRGC baseline algorithm was chosen so that a direct comparison with existing results can be shown. For this experiment, the manual FRGC PCA preprocessor was replaced with the Canonical Face Depth Map algorithm. The original FRGC PCA algorithm uses manually selected anchor points for registration and image cropping. However, the Canonical Face Depth Map preprocessor is fully automatic. Results are shown in Figure 6.25 for the MSU database of 330 scans. As expected, the manually selected anchor points are more accurate than the automatically selected anchor points. However, the automatically selected system seems to be comparable to the manually selected system.

Figure 6.24: Preprocessing the scans for use with the FRGC PCA algorithm is a two step process. First the scan is processed into its CFDM format and then the FRGC normalization algorithm called "face2norm" masks out unwanted regions and normalizes the depth and color.

Figure 6.25: Comparing automatic PCA with manual PCA (curves: SAM, 2D PCA, 3D PCA, and 2D PCA + 3D PCA + SAM).

Figure 6.25 shows that the normalized PCA algorithm performs even better than the SAM, which is expected because the PCA algorithm is trained on a dataset. It is rather naive to add the PCA and the SAM scores together since they are not in the same units or scale.
However, this naive addition produces better results than either score alone, suggesting that a more intelligent, multi-modal combination of scores may produce even better results.

Chapter 7

Research Platform

The previous chapters described different algorithms and methods for manipulating and analyzing 3D faces. This chapter describes the 3D research platform developed for and used to run the experiments described in this dissertation. This platform is considered to be an important contribution of the dissertation work. Section 7.1 describes the history of the programming interface. Section 7.2 discusses the operation of the demo program and its integration with the Vivid 910 scanner. Finally, Section 7.3 details the performance of the different components of the system and compares this performance on databases of different sizes. Appendix A covers the design and use of the research platform in more detail.

7.1 Program History

Research programming must be more fluid than system programming, since it is not possible to design every aspect of the research platform before some of the experiments are conducted and the results analyzed. In the worst case, it may be necessary to completely rewrite portions of the code to account for some aspect of the software that was not initially anticipated. For these reasons, the programs developed for this dissertation have evolved over time to enable the examination of new problems that may not have been considered when the original code was developed. In developing the research platform, several design goals were pursued:

1. Developing software that was easy to use and enabled quick testing of experimental hypotheses without needing to redevelop redundant support software.
2. Creating a real time demo program capable of demonstrating the viability of 3D face verification.
3. Building a testing platform that was compatible with the NIST Biometric Experimental Environment (BEE).
4. Designing a testing platform that could be compiled and run on a High Performance Computing system.

These design goals occasionally conflicted, and care was taken to design a system that could balance these different requirements.

Initially, Matlab was used exclusively to develop a system prototype. There were three main reasons for using Matlab: fast algorithm development and testing; fast data analysis and visualization; and existing software for converting from the FRGC dataset into a Matlab format. The Matlab prototype system was able to demonstrate proof of concept; however, there were some limitations on how fast a program could execute in Matlab, and the prototype system took over two minutes to complete a face comparison.

In order to achieve the goal of a real time demo, the Matlab prototype was ported to C++. The C++ version of the research platform is much faster than Matlab and is able to control the VIVID 910 scanner directly. Because C++ is a more difficult environment to develop in, the first goal for the C++ program was to enable the loading and saving of the Matlab data format using the Matlab API. This allowed the C++ system to be built in stages and compared against the original Matlab prototype code, and allowed the design to take advantage of the benefits of both programming environments. The real time C++ system allowed for fast data collection and quick visualization of the core algorithms. However, the demo system used the Microsoft Foundation Classes (MFC) for the user interface, which added programming overhead.
The MFC classes are not cross-platform compatible, and the GUI interface made it difficult to run large batch comparisons (although the GUI could be run in batch mode) or to add new experiments. A stripped-down version of the program was developed to run completely from the command line, which simplified the development and improved the execution time of the experiments. The command line version also allowed the experimenter to write a new, independent program for each type of experiment instead of maintaining one large, bulky program that has to do everything. However, because Visual Studio was used to develop the initial C++ MFC project, it was difficult to add additional command line programs to the project. Consequently, the C++ program was ported into a CMake project that allowed for the simple addition of new programs and enabled the new command line programs to be compiled on different systems (such as Cygwin, Linux, Solaris, etc.). Another advantage of using CMake was that the entire project could be represented as text files, instead of the numerous binary files required by Visual Studio. A text-based project also allows for the use of simple project versioning tools such as CVS, which enables large numbers of programmers to work on the same project in completely different areas.

With the ability to write many command line executables, the 3D research platform expanded to include a series of 3D face command line tools. These tools enable quick format conversions, feature detection and data analysis. The command line format also allowed for the development of tools to work directly with the Biometrics Experiment Environment (BEE). The BEE is a software interface used in the Face Recognition Grand Challenge (FRGC) to normalize the input and output data formats. The BEE system is designed to standardize the method for preprocessing and analyzing biometric data. The BEE system is provided with the FRGC database and not only contains the necessary interface, but also provides a baseline matching algorithm for comparison. Working with the FRGC dataset is not a trivial task, as it takes some time to gain proficiency. The code is written exclusively for Linux, and administrator privileges are required in order to install it.

Even though the faster C++ system (with the command line option) can compare two faces in less than a second, analyzing the entire FRGC dataset requires 16 million comparisons, which would take around 185 days to complete on a standard Pentium 4 3 GHz PC; more complex trials that are designed to explore different research possibilities take even longer. To improve experimental speed, programs were developed to enable the FRGC algorithm to run on the MSU High Performance Computing (HPC) system. Because each of the 16 million comparisons is independent of the others, the FRGC experiments are considered "embarrassingly parallel," meaning that the system can take full advantage of the large number of processors available on the HPC system (128 nodes with 2 dual-core processors on each node, for a total of 512 cores).

The current 3D research platform runs under Windows, Linux, Solaris and HPC, and can integrate directly with Matlab. The research platform is also able to load and save many different file formats, allowing for maximum research and data options.

7.2 Demo Interface

The demo program is a specific aspect of the 3D Research Platform that runs on Windows.
The demo program is designed to gather data using the VIVID 910 and demonstrate the viability of using 3D face matching in a practical system. The three main features of the demo are:

1. Data are gathered using the VIVID 910 in fast mode.
2. The practicality of the baseline matching algorithm is demonstrated by using the SAM matching score and the symmetry reject option in real time.
3. Different visualization methods are provided to demonstrate how the data is processed and what the algorithm is doing.

A subject stands in front of the camera (as shown in Figure 7.1) and must hold still for about 2 seconds to capture a scan. Figure 7.2 is a flow chart representing the operation of the prototype system designed based on the demo system. Many additional features have been incorporated into the demo to increase functionality, such as keyboard shortcuts, batch process capabilities and the ability to turn different aspects of the matching algorithm "on" or "off".

Figure 7.1: Standard operation of the VIVID demo system (Minolta Vivid 910, subject at a 0.6-1.6 m offset, depth image output).

The remainder of this section describes the three main features of the demo in detail. Section 7.2.1 describes the different components of the VIVID 910 interface, Section 7.2.2 outlines the use of the main matching algorithm and Section 7.2.3 discusses the different methods used to visualize the data.

Figure 7.2: Flow chart representing the operation of the prototype 3D face verification system (scan, curvature calculation, anchor point detection, symmetry plane reject calculation, ICP alignment, SAM threshold, and the genuine/impostor decision).

7.2.1 VIVID 910 Operation

The VIVID 910 uses a SCSI interface to communicate with the host computer. The interface is programmed with a thread that continuously queries the scanner and updates the viewing window shown in Figure 7.3. The viewing window uses a double buffer system to maintain a smooth display.

Figure 7.3: VIVID 910 interface.

When the demo system is first started, the focus button must be pressed to initialize the scanner. As long as the subjects stay within the same distance window to the scanner, focusing should not need to be repeated. It is a simple matter to put a line on the ground indicating the location where a subject should stand. Once the scanner is properly focused, the scan itself will take about a second. When a scan is complete, the right window (in Figure 7.3) displays the depth map, while the left window displays a still image of the scan. The streaming video option can be selected if a moving image is desired. The scanner interface includes options to load and save raw VIVID data (cdm files) from the scanner menu. The user interface also includes a mirror scan option, which flips the output window from left to right and makes it easier for subjects to center themselves when viewing the output window.

Once a scan is made and the "OK" button is pressed, the demo system calculates the surface curvatures and anchor points and then measures the symmetry of the scan. If the face is not found to be sufficiently symmetric, the demo system reports a reject and the subject can be rescanned.
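The symmetry measurement behind this reject step can be summarized as comparing the face to its own mirror image. The dissertation's reject option (Section 6.2.3) aligns a scan with a mirrored copy of itself using ICP; the sketch below is a much simpler stand-in for that idea, assuming the scan is already in a canonical pose so that the depth map can simply be flipped about the vertical mid-line. The DepthGrid structure, the trimming fraction, and the threshold value are illustrative assumptions only.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Hypothetical CFDM-style depth grid with validity flags, row-major.
    struct DepthGrid {
        int rows = 0, cols = 0;
        std::vector<float> z;
        std::vector<unsigned char> valid;
        float at(int r, int c) const { return z[r * cols + c]; }
        bool  ok(int r, int c) const { return valid[r * cols + c] != 0; }
    };

    // Trimmed RMS difference between the depth map and its left-right mirror.
    // Large values indicate an asymmetric (poorly aligned or noisy) scan.
    float SymmetryError(const DepthGrid& face, float trimFraction = 0.10f) {
        std::vector<float> sq;
        for (int r = 0; r < face.rows; ++r)
            for (int c = 0; c < face.cols; ++c) {
                int mc = face.cols - 1 - c;          // mirrored column
                if (face.ok(r, c) && face.ok(r, mc)) {
                    float d = face.at(r, c) - face.at(r, mc);
                    sq.push_back(d * d);
                }
            }
        if (sq.empty()) return 1e9f;                  // nothing to compare: treat as reject
        std::sort(sq.begin(), sq.end());
        size_t keep = sq.size() - static_cast<size_t>(trimFraction * sq.size());
        double sum = 0.0;
        for (size_t i = 0; i < keep; ++i) sum += sq[i];
        return static_cast<float>(std::sqrt(sum / keep));
    }

    // Example reject decision with a purely illustrative threshold in millimeters.
    bool RejectScan(const DepthGrid& face, float thresholdMM = 3.0f) {
        return SymmetryError(face) > thresholdMM;
    }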
7.2.2 Matching Algorithm

The matching algorithm takes two scans (a model and a query) and aligns them using the 3D Face Alignment algorithm described in Chapter 5. Figure 7.4 shows the three windows that appear when executing the matching algorithm. Figure 7.4a shows the model scan with the 100 control points selected from the query scan. The red control points represent the trimmed points, and the blue circle around each control point represents the current error for each point (1 pixel in radius is approximately one millimeter). Another window (see Figure 7.4b) shows a histogram of these errors (all distances above 3 mm are put into the 3 mm bin). Figure 7.4c is a bar graph representing the current SAM score. The vertical line indicates the threshold value. A matching score to the right of the vertical line indicates an impostor, while a matching score to the left of the vertical line indicates a match.

Figure 7.4: Matching algorithm demo visualization: (a) model scan with control points, (b) point error histogram, (c) current SAM.

The matching algorithm compares the model to the query and the query to the model and reports the lower of the two scores. A final display will appear either accepting or rejecting the subject.

7.2.3 Data Visualization

Figure 7.5: Main image window showing anchor point locations with and without flag values: (a) main window, (b) main window with flag values showing.

Once a scan has been taken or loaded from a file, it is displayed on the main demo window (see Figure 7.5a). The main window has a tool bar for executing the main system commands and setup parameters. By default the image is displayed with the automatically detected anchor points, which can be toggled on and off. The flag values can also be toggled on to indicate where the scanner did not pick up any valid data.

In addition to the color image, the depth map (see Figure 7.6a), shape index (see Figure 7.6b) and surface normals (see Figure 7.7) can also be displayed as a 2D image. The depth map and shape index are displayed on a color space represented by the scale shown in Figure 7.6c. The surface normals are shown in red, green and blue. The red component represents the absolute value of the normal in the x direction, the green component represents the absolute value of the normal in the y direction, and the blue component represents the value of the normal in the z direction.

Figure 7.6: Matching algorithm demo visualization: (a) depth map, (b) shape index, (c) color space (blue to red).

Any of the main window visualization modes (shown in Figures 7.5, 7.6 and 7.7) can be exported to an image file. The system can also export the data directly into a VRML file, which will be displayed immediately using an external VRML viewer. This gives the system the ability to output models that can be rotated, in addition to the 2D images. The VRML code is programmed so that an image file is used to map the texture onto the shape. If the image file is overwritten with any of the above visualization modes, this mode will be used by the VRML viewer (see Figure 7.8).

Figure 7.7: Surface normals with the x, y, z directions represented as red, green, and blue: (a) normal image, (b) normals on a unit sphere.

Figure 7.8: 3D VRML viewer with color, depth, shape and normal data representations.
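The normal-to-color mapping described above is a simple per-pixel conversion. The sketch below shows one way it could be written, assuming unit-length surface normals are already available for each pixel; the structures and the clamping of the z component are assumptions made for the example rather than the demo's actual drawing code.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Normal { float x, y, z; };   // unit-length surface normal per pixel
    struct RGB    { uint8_t r, g, b; }; // display color per pixel

    // Map surface normals to colors: red = |nx|, green = |ny|, blue = nz,
    // each scaled into [0, 255] for display.
    std::vector<RGB> NormalsToImage(const std::vector<Normal>& normals) {
        std::vector<RGB> image(normals.size());
        for (size_t i = 0; i < normals.size(); ++i) {
            const Normal& n = normals[i];
            image[i].r = static_cast<uint8_t>(255.0f * std::fabs(n.x));
            image[i].g = static_cast<uint8_t>(255.0f * std::fabs(n.y));
            // For a depth map the z component points toward the viewer, so it is
            // clamped at zero here instead of taking the absolute value.
            image[i].b = static_cast<uint8_t>(255.0f * std::max(0.0f, n.z));
        }
        return image;
    }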
7.3 Performance

This section summarizes the run times for each of the algorithms presented in this chapter. (Previous chapters have presented similar results for individual components of the algorithm.) Table 7.1 lists the running times for the main test program running from Cygwin on a Windows computer with a 3.1 GHz dual processor and 1 GB of RAM.

Table 7.1: Algorithm processing times in seconds (mean and standard deviation).

             Cols  Rows  Read Model  Curve    Anchor Points  Symmetry  Write Model  Total
    MSU MAT  320   240   0.112       2.610    0.052          0.870     0.144        3.790
                         +/-0.017    +/-0.442 +/-0.009       +/-0.288  +/-0.025     +/-0.590
    ND 3BX   640   480   1.207       13.000   0.378          3.582     0.725        18.490
                         +/-0.074    +/-3.816 +/-0.037       +/-1.672  +/-0.056     +/-3.960

As can be seen in Table 7.1, the entire algorithm takes around 4 seconds to execute using the Matlab file format. The majority of this time is spent calculating the surface curvature. Table 7.2 shows the cycle time for the current demo matching algorithm. These times are all conservative and suggest that this algorithm can be used to create a viable practical system.

Table 7.2: Complete process time for the current demo of a practical verification system.

    Demo Step          Cycle Time (seconds)
    Load Model         0.2
    Scan               2.8
    CalcCurve          2.4
    Anchor Point       0.1
    Symmetry Reject    1.1
    Alignment          1.3
    Swap Alignment     1.3

Chapter 8

Conclusions, Discussion and Future Work

This dissertation studies the use of 3D scanners to improve face verification systems. Traditional 2D face recognition systems are not tolerant to changes in pose and lighting. However, 3D data can be used to reduce the effects of lighting and pose variations for face identification. Table 8.1 summarizes the chapters in this dissertation and the major ideas, objectives, and contributions of each chapter.

The data used in this dissertation was gathered with two different Minolta Vivid 910 scanners over a four year period. This scanner has been shown to have high accuracy (< 1 mm in depth), fast capture (under 2 seconds), and it produces data that is consistent across this model of scanners [22]. The preliminary experiments started with as few as 10 subjects and 63 scans, and have grown to include over 625 subjects with over 5500 scans. The combined testing for all of the experiments was run on over 35 computers, including two High Performance Computers (one with 32 processors and the other with 512 cores). These resources were used to conduct extensive experiments, which totaled over 6 years of computational time.

Table 8.1: Dissertation overview.

    Chapter                              Major Content and Contributions
    2. Background                        3D coordinate system notation; 3D sensor overview; 3D data file types; data collection and statistics
    3. Solution Approach                 Surface fitting; curvature calculations; scanner resolution, scale and orthogonal depth maps; image cleaning
    4. 3D Anchor Point Detection         Fully automatic frontal anchor point detection; fully automatic arbitrary pose anchor point detection
    5. 3D Face Alignment                 Algorithm robustness tests; system error analysis
    6. Toward a Canonical Face Format    Face-based coordinate systems; Canonical Face Depth Map (CFDM); preprocessing using CFDM
    7. Research Platform                 Real time demo; system performance study
    8. Conclusions                       Anchor point detection and evaluation; practical 3D face verification overview; feature space evaluation with fusion results
    Appendix                             File formats; cross platform capabilities

Chapter 5 presents an algorithm for 3D surface alignment, used to align the 3D surfaces of two faces.
The alignment algorithm is a two-step process. In the first step, key anchor points (found using the algorithms presented in Chapter 4) are automatically identified on the surface of the face and used to coarsely align two face scans. In the second step, an iterative hill climbing algorithm called ICP (Iterative Closest Point) is used to finely align the scans. The quality of this two-step alignment process is studied in detail using a Surface Alignment Measurement (SAM) (see Section 5.3). The SAM is the root mean squared error over all the control points used in the ICP algorithm, after trimming to account for noise in the data. Large SAM values could mean improper alignment between two scans, or they could mean that the surfaces are shaped differently.

Because the SAM value is a measure of the quality of the surface alignment, it is also used as a baseline face matching score. Section 5.4 demonstrates that using the SAM as a matching score can achieve a 1% equal error rate (EER) on a database of frontal scans restricted to neutral expressions. The performance decreases for faces with different poses and expressions. However, these experimental results still suggest that using the SAM score alone can provide a practical face verification solution in a market where subjects are cooperative (such as a security checkpoint).

The surface alignment algorithm requires a number of engineered input parameters, such as the number of control points, the control point selection regions, and the percentage of trimmed points. The experiments in Section 5.2 varied these parameters and compared the resulting variations in the speed of the algorithm and the EER. These experiments show that the surface alignment algorithm is robust to changes in algorithm parameters, and the results determine the optimal operating parameters for the alignment system.

One strength of the fast, fully automatic surface alignment algorithm developed in this dissertation is that it can be used as a general preprocessor for other 3D based face recognition algorithms. For example, the surface alignment algorithm is used in Section 6.2.3 to robustly identify the plane of symmetry on a face by aligning a face scan with a mirror projection of itself. This plane of symmetry calculation can be used as a reject option to increase the performance of the face verification algorithm, decreasing the EER on the FRGC 1.0 dataset to 1.2% with a 1.5% rejection rate. The plane of symmetry can also be used to help define a face-based coordinate system.

The Canonical Face Depth Map (CFDM) is defined in Chapter 6 as a face-based coordinate system projected onto an orthogonal depth map. The CFDM normalizes and re-samples the 3D data to align all the faces in a database which are taken at different poses and scales. This alignment does not require a generalized face model, and the CFDM can be used to automatically preprocess a database of 3D images. Experiments with a PCA-based algorithm in Section 6.6.2 show that the CFDM provides robust, automatic preprocessing, and performance results are close to those of a database that is normalized by hand. The high quality alignment and fixed resolution size of the CFDM allow for exploration of feature spaces and matching scores not normally available in arbitrary 3D scans. For example, the CFDM makes it possible to apply correlation to the 3D scans, as shown in Section 6.6.1.
The correlation algorithm can be used to localize anchor points, and also provides additional information for the matching process. Experimental results show that by combining simple correlations of the bridge of the nose with the SAM score, better matching performance is achieved than with either measurement alone.

Table 8.2 summarizes the major contributions of this dissertation. These findings have been disseminated through a number of publications, as described in Table 8.3. Section 8.1 combines results from different parts of the dissertation to synthesize and discuss some findings from this dissertation, while Section 8.2 discusses directions for future work.

Table 8.2: Major contributions.

    3D face alignment algorithm: Uses the automatic anchor point detector developed in Chapter 4 as the first step of an automatic surface alignment algorithm.
    Practical 3D face verification system: Using the SAM value as a matching score, Chapter 5 demonstrated that the face verification system can achieve over 99% accuracy, or less than 1% equal error rate, on a database of cooperative subjects.
    Canonical Face Design: Chapter 6 defines the robust Canonical Face Depth Map and an algorithm to robustly take a 3D face scan and convert it into this standard format.

Table 8.3: Publications based on research in this dissertation.

    1. George Stockman, Jayson Payne, Jermil Sadler and Dirk Colbry. Error Analysis of Sensor Input Variations in a 3D Face Surface Matching System. Sensor Review Journal, Volume 26, No. 2, pages 116-121, 2006.
    2. Xiaoguang Lu, Anil K. Jain and Dirk Colbry. Matching 2.5D Face Scans to 3D Models. IEEE Transactions on PAMI, 28(1):31-43, 2006.
    3. Dirk Colbry, George Stockman and Anil Jain. Detection of Anchor Points for 3D Face Verification. In IEEE Workshop on Advanced 3D Imaging for Safety and Security (A3DISS), San Diego, California, 2005.
    4. Xiaoguang Lu, Dirk Colbry and Anil K. Jain. Three-Dimensional Model Based Face Recognition. In 17th International Conference on Pattern Recognition, pages 362-365, Cambridge, United Kingdom, 2004.
    5. Xiaoguang Lu, Dirk Colbry and Anil K. Jain. Matching 2.5D Scans for Face Recognition. In International Conference on Biometric Authentication, pages 30-36, Hong Kong, 2004.

8.1 Discussion and Synthesis

This dissertation presents methods for aligning faces for face verification. The methods presented in this dissertation could also be used for face recognition tasks, where the system searches for a single subject in a database of scans. However, because the pair-wise match between two 3D scans can take as long as 4 seconds, searching for a match in a large database would take a long time. Also, a design constraint of many of the matching algorithms presented in this dissertation is that the subjects are being cooperative. Assuming cooperative subjects is reasonable in a verification application, because subjects need to cooperate to gain access to a resource. However, this assumption is not as reasonable in a recognition application, where subjects may not know they are being monitored (e.g., a surveillance application).

The ideas and methods presented in this dissertation encompass several major themes, specifically:

• Anchor point detection and evaluation.
• Requirements for a practical 3D face verification system.
• Face verification matching score performance and fusion methods.

These themes span the work presented in this dissertation, and the remainder of this section discusses them in detail.
8.1.1 Anchor Point Detection and Evaluation

Anchor points are used throughout this dissertation, and the quality of the anchor point detection is key to the success of many algorithms. In Chapter 5, anchor points are a necessary first step for coarse alignment of the faces. Section 5.2.1 uses anchor points to help establish the location of the control points used by the ICP algorithm. Proper anchor point detection helps to ensure that the control point regions selected avoid areas of the face that change most with expression. Reliable anchor point detection can be used to help model expression [98] or even as a feature for matching [124]. However, each of these applications requires anchor points of varying accuracy and repeatability, and it can be difficult to evaluate these qualities in detected anchor points. Some reasons for the difficulty in evaluating anchor point detection include:

• It is difficult to define a single, salient point on the face.
• The definition of a point (such as the tip of the nose) may be flexible, and using humans or a computer to label these points will produce different results.

The remainder of this section reviews the different methods used in this dissertation to calculate anchor points, and the methods used to evaluate the anchor points after detection.

Detection

In Chapter 4, two algorithms were presented to detect anchor points in 3D face data. The first approach is used for frontal poses and the second algorithm is used for arbitrary poses. The anchor points produced by these detection algorithms were shown to be sufficient to coarsely align two scans such that the ICP algorithm will converge properly. However, there is still some variance in these anchor points, especially around the corners of the mouth. In Chapter 6, the face is put into a canonical format before the anchor point detection algorithm is run, resulting in some minor improvements in point variance.

The anchor point algorithms presented in this dissertation are either fully automatic or they use prior knowledge of the anchor points provided by the template. The fully automatic system is desirable because it does not require that a subject be recorded prior to use. However, if a template is available then it is not unreasonable to use manually selected anchor points on the model, even though manual selection will make the enrollment procedure more time consuming. The manual anchor points can provide better results at the cost of algorithm flexibility and ease of use.

Evaluation

Table 8.4 lists methods that can be used to evaluate the quality of anchor point detection. One method is to manually examine the anchor points and evaluate their quality, as was done in Section 4.4.3. The human evaluator has limited choice (good or bad) and makes a Boolean decision. Although this approach is straightforward, manual evaluation is difficult and time consuming, and since only Boolean results are available the results of this approach are not easily quantified and compared.

A second option is to evaluate the quality of the ICP alignment and use this as an indirect measure of the anchor point quality, as was done in Section 4.3. If the anchor points are good enough to converge in ICP, then they are considered to be of acceptable quality. An advantage of the ICP approach is that it constrains the evaluation to a much simpler problem. In addition, it may be possible to develop a threshold for the SAM score, which would allow for automatic evaluation of the anchor points.
A third option is to compare the automatically selected points with manually selected points, as in Sections 4.3.1 and 6.5. However, this approach negatively biases the results, because the automatic system may be highly repeatable within a subject even if it is not finding the same anchor point as the manual (ground truth) point, or because the manually selected points also have errors. At best, a comparison to manually selected ground truth can give an estimate of the variance of the automatically selected point data.

A fully automatic method (i.e., one requiring no subjective input) to analyze the quality of the anchor points is to compare anchor points from different scans to one another using the ICP algorithm. Each scan is aligned to other scans of the same subject, and the difference between the automatically selected anchor points gives a measure of the variance within the dataset. The problem with this approach is that expressions and surface changes are not taken into consideration.

Table 8.4: Methods and strategies for evaluating anchor point detection.

Localization

Even though the accuracy of the anchor points is difficult to evaluate, this dissertation has shown that the anchor point detection system presented in Chapter 4 does a good enough job for the ICP alignment system. However, if more accurate anchor points are required, a point localization method could be considered. Section 6.5 demonstrated that by preprocessing the scans into the CFDM format, better anchor point localization could be achieved by simply re-applying the anchor point detection algorithm presented in Section 4.3.

Another option for point localization is the transfer localization method, which assumes that the manually selected anchor points on the model scan are correct and then transfers these points to the nearest neighbor on the query scan after ICP alignment. There are two problems with this point transfer localization method. The first problem is that point transfer localization does not take into account the motion of points within the face due to expression change, because the points are transferred directly. The second problem is that the transfer biases the location of the anchor points and makes it difficult to use them as a biometric measure.

A better option is to use the manually selected anchor points as a guide for point localization, as was done in Section 6.6.1 with point correlation.
In this case, a window region is selected around the manually selected anchor point on the template, and the same window is searched for on the query scan. This method accounts for minor changes in expression and can do a better job of localizing the points around the desired manual point. However, this localization method still produces a larger number of outliers, which need to be accounted for. Figures 8.1 and 8.2 show examples of all three localization methods.

Figure 8.1: Eye corner localization. Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represent relative anchor point locations using the raw data, the middle images represent the anchor point locations after applying the CFDM, and the right images represent the locations after correlation localization.

Figure 8.2: Point localization using correlation. Average face image with each of the manually selected anchor points (green) and the relative location of the automatically generated points (white) from the MSU 330 image dataset. The left images represent the relative anchor point locations using the raw data, the middle images represent the anchor point locations after applying the CFDM, and the right images represent the anchor point locations after applying correlation.

8.1.2 Practical 3D Verification System

Many of the methods presented in this dissertation were developed with a practical 3D face verification system in mind. An example application for such a system might be a security checkpoint, where users would have smart cards containing their 3D face data (template). The smart card would be read into the system, the user would get scanned, and the scan would be compared to the template on the smart card. If subjects are not cooperative (e.g., they do not maintain a frontal pose or remove eyeglasses), the quality of the scan may not be within the design specification and the scan may be rejected, which would require the subject to get scanned again. In order to claim that the system is practical, it must be evaluated based on the criteria presented in Table 8.5, which are discussed in the remainder of this section.

Subject Requirements

Most of the research reviewed and developed in this dissertation assumes the goal of matching frontal face scans from cooperative subjects. This is a reasonable assumption in many markets, which can assume that cooperative subjects are trying to gain access to restricted areas or resources, such as an airport, secured workplace or ATM. The cooperative subject assumption brings many additional restrictions. For example, if subjects are being cooperative it is reasonable for them to hold their position during the scan, maintain a neutral expression, and face the camera. However, other commonly made assumptions may not be reasonable for even a cooperative subject. For example, it is not unreasonable for minor pose or expression variations to occur as a subject changes mood. It is also not unreasonable for a subject to change hairstyle, grow or shave a beard, or wear makeup.

The experiments in Chapter 5 demonstrated that the frontal matching system developed in this dissertation can easily tolerate pose variations of up to 30 degrees in the yaw direction and 10 degrees in the roll and pitch directions.
These variations are well within the limits of "frontal" scans, as observed in the variations measured in the FRGC database (see Section 5.3.3). Chapter 5 also showed that makeup and beards do not affect detection rates, although the current system can have problems with hair that occludes the upper part of the face or if the hair style captured with the scanner is very large in comparison to the face. (Large hair can cause extra noise in the data, making it difficult to localize the face using the current algorithm.)

Table 8.5: Criteria for evaluating a practical 3D face verification system.

Cycle Time

A cooperative subject can be expected to remain still during the scanning process, which can take 2 seconds or more when the Minolta scanner is used in fast resolution mode. With a stereo sensor, such as 3DMD [1], scan time can be less than 0.1 seconds; however, almost a minute of additional processing time is required. A cooperative subject is also assumed to be facing the camera and maintaining a neutral or near neutral expression. For face recognition applications where a cooperative subject cannot be assumed, such as in a surveillance situation, a faster scanner is required and the matching system must be invariant to all poses and lighting conditions. An arbitrary pose system was explored in Sections 4.4 and 5.5.

The entire cycle time for the current demo system was shown in Table 7.2 to be just above 9 seconds. These times are conservative and could easily be sped up by implementing faster algorithms (such as pyramidal methods), faster scanning hardware, or faster computing hardware. The estimate in Table 7.2 also includes the swap alignment step, which was shown to be unnecessary in Section 5.2.5 if the subject is cooperative. Removing these extra 1.3 seconds would reduce the total cycle time to under 8 seconds, which is reasonable in many applications. For example, at an airport security checkpoint subjects need to wait for x-ray scanners to examine their luggage, so an 8 second identity check is not burdensome.

Enrollment

The enrollment procedure for 3DID simply requires a subject to be scanned once.
Improved results may be obtained by having an operator inspect the scan and add information, such as manual anchor points. However, as long as the subjects are cooperative, this additional work is not required in the current system. Section 6.5 showed that the super sampled CFDM template (< 250 kilobytes) can easily fit onto a smart card (512 KB [66]). There is room to store two or even three templates on a standard smart card, which could store different expressions or poses and help increase performance.

There is limited evidence to estimate how changes in the subject over time (aging effects) will impact the performance of the face verification system. In Section 5.4 the SAM score did not degrade in performance when the template and the target scan were taken approximately 1 year apart (this was not the case for the PCA algorithm, which degrades in performance). Informal studies in the MSU lab have shown that many subjects match well over a 3 year time period, and these experiments demonstrate that growing or shaving beards or changing hairstyle has almost no effect on the matching results.

Hardware Requirements and Cost

Probably the largest obstacle to practical use of a 3D face verification system is the cost of 3D scanners. The current high end scanners (Minolta and 3DMD) cost upwards of $50,000 each. This high price limits the possible applications (such as installing scanners in automatic teller machines). However, as described in Section 2.5.3, there are companies working on new 3D technologies that should be more affordable. If a scanner could be priced in the $2000 range, then 3D face verification may become affordable enough to overtake existing verification technologies, such as fingerprint or iris scanning.

8.1.3 Matching Score Evaluation for Face Verification

The experiments in this dissertation have presented several different matching score spaces that can be used to measure the similarity and difference between subjects. In Section 5.5, the shape index was fused with each control point, which slightly improved performance. In Chapter 6, PCA was applied to both the color and the shape channels of the CFDM modified scan. In each of these cases, the same input data is used but is processed in different ways in order to produce a matching score. Table 8.6 lists the various feature spaces explored in this dissertation.

Table 8.6: Matching scores.

    Feature Space   Description                                                         Sections Used
    Baseline SAM    Distance from 100 control points on the query scan to the          5
                    surface of the target model after trimming 10% of the noisy data.
    Shape Index     Distance measure between the shape index on the surface of the     5.4
                    query and target scans.
    Correlation     Root mean squared error over a windowed region on the target       6.6.1
                    model searched over the query scan. Window regions were
                    localized around the anchor points.
    PCA             FRGC baseline algorithm after preprocessing using the CFDM.        6.6.2

The baseline matching system presented in this dissertation uses the trimmed root mean square distance (SAM) minimized by the ICP algorithm as the primary matching score for face scans. When subjects are cooperative, the SAM score is shown to be sufficient for doing proper face identification. In most cases, the surface alignment algorithm does a very good job aligning different surfaces, even if their SAM scores are high.
However, using the SAM value as a matching measure is a fairly naive approach, since the SAM is a measurement of the quality of the surface alignment and the surface of the face changes drastically with expression. What the experiments in Section 5.4 have shown is that comparisons of the same person with minor changes in expression can produce larger SAM variations than comparisons of some impostors. This is true even when the control points are limited to the "non-deforming" parts of the face, as shown in Section 5.2.1 (although a smile can still push the cheek surface up into this region near the eyes). Bellon et al. [13] showed that the Surface Interpenetration Measure (SIM) value achieved better performance than the SAM value. Bronstein et al. [26] used the surface normals as a matching score, but with limited success. These approaches treat the face as a large, rigid structure. In Section 6.6.1, moving window regions are used to compare the local surface of the face in an approach similar to one developed by Manjunath et al. [102], which has been shown to be more tolerant to global changes (such as expression). However, the experiments conducted in Section 6.6.1 show that simply fusing the correlation matching scores is less effective than using the SAM by itself.

The color channel of the Vivid 910 adds additional information about the face that is not available in the 3D surface data alone. It was demonstrated by Chang et al. [35] that the information provided in the color space is partially independent of the 3D shape space, and can be used to augment or even improve the shape space recognition. Thus, using the Canonical Face Depth Map as a preprocessor for a 2D color algorithm may achieve better performance by exploring linear subspaces, such as PCA.

The Face Recognition Grand Challenge not only provided a database of faces, but also provided a baseline algorithm for evaluating the faces. The FRGC algorithm is a Principal Component Analysis (PCA) algorithm similar to eigenfaces [133]. The PCA algorithm is applied to the texture and shape channels separately, and the results are then fused together. The original PCA algorithm uses manually selected anchor points, which are not practical for an automatic system. In Section 6.6.2, the FRGC algorithm is made fully automatic by first preprocessing the scans using the Canonical Face Depth Map. First, the pose of each scan is corrected and each scan is converted to a Canonical Face Depth Map using the automatic 3D face preprocessor. To further normalize all of the scans, the FRGC masking algorithm is used to properly crop all of the scans. The resulting normalized and preprocessed scans are fed into the standard FRGC version of the PCA algorithm.

Section 6.6.2 shows that automatic PCA on the depth channel (73% at 0.01% FAR) performs better than the SAM score (68% at 0.01% FAR) or PCA on the color channel (48% at 0.01% FAR). However, by adding all three scores, a verification rate of 81% is achieved at 0.01% FAR. Figure 8.3 plots SAM against the PCA distance measure; notice that the genuine (green) and impostor (red) values are separating nicely. One area of future work is to develop a more advanced multimodal fusion system to combine these scores.

Figure 8.3: SAM score (mm) vs. Notre Dame normalized PCA distance measure on the MSU dataset. Red (dark) points represent impostors and green (light) points represent genuine scores.

Some of the outliers in Figure 8.3 represent bad data in the scans.
The rejection option described in Section 5.2.6 can be applied to these scans, and the blue scores in Figure 8.4 represent results where one of the two matching scans is rejected.

Figure 8.4: SAM score (mm) vs. normalized PCA distance measure on the MSU dataset, with rejects. Red (dark) points represent impostors, green (light) points represent genuine scores, and blue (darkest) points represent scores where either the model or the template scan was rejected based on the symmetry criteria.

In Section 6.6.1, correlation distance measures were also explored as matching scores. Each individual score provided limited data, but together they produced results similar to the SAM score. It is possible to construct a matching algorithm that performs even better than all the previous algorithms, as shown in Figure 8.5. In this multimodal matching algorithm, the SAM, bridge and nose correlation results are combined, producing much better results than any of the matching scores alone. Future work will extend these fusion methods to include other feature spaces and matching scores.

Figure 8.5: Fusion example. By fusing the nose correlation, bridge correlation and SAM score, significantly higher performance is achieved over any of the three matching scores by themselves. (SAM EER = 4.74%, shape bridge EER = 6.27%, depth nose EER = 8.26%, total EER = 5.05%.)

8.2 Directions for Future Work

This dissertation explores the use of 3D data to enhance and extend face modeling and verification methods, which are relevant for a number of security and communications applications. There are a number of areas for future work based on this dissertation research.

The existing 3D databases have been shown to be a reasonable starting point for 3D face verification research; however, in order to further advance the state of the art, additional 3D data should be collected. This new dataset should include many different poses as well as lighting conditions, and subjects should be encouraged to make reasonable variations to their appearance (such as growing beards, wearing makeup or changing their hairstyle). It is also important to include more pose and expression variations in the database to evaluate the feasibility of continuing the research into recognition and surveillance domains. No publicly available dataset truly explores all of these image variations. This dataset with large intra-subject variation should include scans that could occur in any real world situation (indoor, outdoor, minor expressions, pose variations, etc.) and should include scans that are invalid or noisy.

Future work in this area could examine ways to measure the noise in a scan and identify environmental conditions that affect the quality of the scan. For instance, an improved 3D scanning algorithm could detect changes in the subject's pose or appearance (such as movement or the presence of eyeglasses) that could impact the success of the face verification system. Researching ways to identify these factors would allow the development of a system that could give feedback to the subjects by asking them to adjust their pose, maintain a more neutral expression, or remove eyeglasses and hats, etc. With a larger, more realistic dataset, new feature spaces can be explored and compared to the existing systems.
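One concrete example of such exploration is the weighted-sum fusion of matching scores discussed in Section 8.1.3, which behaves better when each score is first normalized onto a common scale, since SAM, correlation, and PCA distances are not in the same units. The sketch below illustrates min-max normalization followed by a weighted sum; the normalization ranges, weights, and function names are illustrative assumptions and are not values derived from the experiments (in practice the ranges would be estimated from a development set).

    #include <algorithm>
    #include <vector>

    // Min-max normalize a raw distance score onto [0, 1] using an assumed range.
    float Normalize(float score, float loRange, float hiRange) {
        float t = (score - loRange) / (hiRange - loRange);
        return std::min(1.0f, std::max(0.0f, t));
    }

    // Weighted-sum fusion of several normalized distance scores.
    // Lower fused values indicate a better (more genuine) match.
    float FuseScores(const std::vector<float>& normalizedScores,
                     const std::vector<float>& weights) {
        float fused = 0.0f, wsum = 0.0f;
        for (size_t i = 0; i < normalizedScores.size() && i < weights.size(); ++i) {
            fused += weights[i] * normalizedScores[i];
            wsum  += weights[i];
        }
        return wsum > 0.0f ? fused / wsum : 0.0f;
    }

    // Example: combine SAM, bridge-correlation, and nose-correlation distances.
    // The ranges (in mm) and the equal weights are purely illustrative.
    float ExampleFusion(float sam, float bridgeCorr, float noseCorr) {
        std::vector<float> s = { Normalize(sam,        0.5f, 3.0f),
                                 Normalize(bridgeCorr, 0.5f, 3.0f),
                                 Normalize(noseCorr,   0.5f, 3.0f) };
        std::vector<float> w = { 1.0f, 1.0f, 1.0f };
        return FuseScores(s, w);
    }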
The CFDM has been shown to work well as an automatic preprocessor to existing face verification algorithms. However, the CFDM preprocessor has only been used for PCA and correlation, and there are many existing algorithms that may benefit from an automatic alignment system.

This dissertation explored the combination of various matching scores, such as SAM, correlation and PCA. These preliminary experiments suggest that combining the results of individual matching scores can improve the performance, yet there are nearly unlimited ways that new matching scores can be explored. One example would be to further explore the combination of matching scores using weighted sums, which is a naive approach that produced some surprisingly positive results in this dissertation research.

Anchor points are another area where work remains. Improved methods for localization should be explored using the CFDM as a starting point. It is also useful to explore methods for properly evaluating how well anchor point localization is working. One approach would be to evaluate the quality of anchor point detection by using ICP to first align the faces and then calculating the distances between the anchor points. Another method would be to refine the anchor point detection results using manually selected points from the template. Applications using anchor points, such as expression modeling, could also be explored.

The algorithms and experiments presented in this dissertation demonstrate the potential of 3D data to improve face verification. 3D systems are more tolerant to changes in lighting, expression and pose than are 2D approaches, which must rely on color and texture information. However, 3D scanners are still prohibitively expensive for many applications, whereas a 2D face detection or identification system can be created using inexpensive and readily available cameras. 3D verification systems may not become mainstream until the cost of purchasing and maintaining the scanner drops. However, the research presented in this dissertation demonstrates that 3D data can be used to build a robust face verification system capable of achieving more than 99% accuracy for cooperative subjects. In addition to 3D face verification, the methods presented in this dissertation can be used in many fields. For instance, the 3D CFDM could be used in communications and entertainment, while the surface alignment algorithm could be helpful in medical domains.

APPENDICES

Appendix A

Programming Interface

Chapter 7 introduced the operation and use of the Research Platform developed for this dissertation. This Appendix describes the different aspects of the research platform in detail. Section A.1 discusses the various file formats used by the program, and analyzes the load/save times and memory requirements for each file type. Section A.2 describes the different external programming tools used during system development, and Section A.3 lists the programming libraries and SDKs used by the 3D research platform. Section A.4 details the different operating systems and platforms on which the 3D research platform has been successfully used.

A.1 File Formats

Many standards and data formats can be used to store face data. The research platform developed for this dissertation easily loads and saves several different data formats and allows new data formats to be added. The current set of supported data formats includes:

• Matlab (mat) - The Matlab file format can store any type of data object file.
For the scanner data, the stored object is a struct. The base objects in the struct were first populated by a program developed by Notre Dame to convert *.abs files into Matlab. The base objects include the x, y, z, and flag matrices. In addition to the base objects, the Matlab format can also store the surface normals, shape index, surface curvatures, and anchor points. The Matlab format can be used both by Matlab and by the 3D research platform's C++ code.

- MSU 3BX (3BX) - One problem with the Matlab format is that it requires a working copy of Matlab to compile and run, which may not be available on some systems. The 3BX format is a binary format designed to replace the Matlab format. It can store all the same information and is expandable to include future data types.

- VRML (wrl) - The Virtual Reality Modeling Language (VRML) is a standard, web-based text format that can display all types of 3D information. This is a useful format because there are several VRML viewers available and many 3D applications can load VRML data. However, the VRML language is much more expressive than a 2.5D depth map, and therefore the research platform is only able to write VRML files and cannot read them.

- Vivid 910 File (cdm) - The Vivid 910 has a native file format called a cdm file. These files contain the raw data from the scanner as well as scanner properties such as mode (fine/fast), focal distance, lens type, etc. These files can only be loaded and saved using the Vivid 910 SDK.

- Dr. Flynn Format (abs) - This is a simple ASCII format that stores the x, y, z and flag values. This data, along with an associated image file, is the base format used by the Face Recognition Grand Challenge (FRGC).

- PhotonX (xyz) - This is a second ASCII format, similar to abs, developed to simplify data exchange with other researchers (PhotonX). In addition to the x, y, z coordinates, this format also allows surface normal files, which are the PhotonX scanners' intermediate raw data format.

Throughout this dissertation project all of these file formats have been used; Table A.1 summarizes the file formats available in the 3D research platform and highlights key contents of each format.

[Table A.1: Summary of the file formats supported by the 3D research platform and the key contents that each format can store.]

Figure A.1 displays the read and write times for the formats used most commonly in experiments presented in this dissertation. Notice that the ABS format is much slower than the binary formats; this is because the ABS format is ASCII and not designed for speed. Refer back to Figure 6.5 for a similar graph of the file memory requirements.

[Figure A.1 is a bar chart of time (seconds) for each file format: 3BX, abs/ppm, MAT, wrl and jpg.]

Figure A.1: Results of experiments to compare the time taken to read and write common file types.
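To make the shared content of these formats concrete, the following is a minimal sketch of the kind of 2.5D scan data they all carry: parallel x, y, z and flag matrices. The struct name, the loader, and the simplified one-sample-per-line ASCII layout are hypothetical illustrations only; the actual .abs and .xyz files use their own headers and field order, and the research platform's real facescan structure is richer than this.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Simplified 2.5D scan: parallel rows x cols matrices of x, y, z coordinates
// plus a validity flag, stored row-major in flat vectors. (Illustrative only;
// not the research platform's actual facescan structure.)
struct SimpleFaceScan {
    int rows = 0, cols = 0;
    std::vector<double> x, y, z;  // rows*cols entries each
    std::vector<int> flag;        // 1 = valid sample, 0 = missing

    double& X(int r, int c) { return x[r * cols + c]; }
    // Similar accessors for y, z, and flag could be added.
};

// Hypothetical loader for a plain whitespace-separated ASCII dump laid out as:
//   rows cols
//   x y z flag      (one sample per line, rows*cols lines)
SimpleFaceScan loadAscii(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw std::runtime_error("cannot open " + path);

    SimpleFaceScan s;
    in >> s.rows >> s.cols;
    const int n = s.rows * s.cols;
    s.x.resize(n); s.y.resize(n); s.z.resize(n); s.flag.resize(n);

    for (int i = 0; i < n; ++i)
        in >> s.x[i] >> s.y[i] >> s.z[i] >> s.flag[i];
    if (!in) throw std::runtime_error("truncated scan file: " + path);
    return s;
}
```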
A.2 Programming Tools

This section describes the programming tools used in the various components of the 3D research platform. These tools enable the project to be multi-platform, simplify program editing, and improve program documentation.

- CMake - CMake is an open source make utility for managing cross platform code trees [87]. CMake is a meta-level "make" program that takes configuration files and outputs working build environments based on the operating system. For example, in Windows CMake takes the raw source code and outputs a working Visual Studio solution file, while in Linux CMake identifies the system compiler (gcc, etc.) and outputs the makefiles required to build the system. The following list highlights key properties of the CMake system:

1. Organizes cross platform builds in a text file.
2. Works well with Windows Visual Studio, Cygwin, Linux, Solaris, HPC and GCC.
3. Organizes the code so that there is a clear separation between the source tree and the binary tree.
4. Identifies and links common code packages, such as Image Magick and Matlab.
5. Can be configured to work well with Doxygen.

One drawback of CMake is that the CMake program must be installed on each platform where it will be used. However, this is generally not problematic because CMake is a freely available, open source project that compiles easily on most systems.

- Visual Studio - The Visual Studio programming system is an excellent environment for tracking down programming bugs [105]. The 3D research platform can be compiled as a debug program, and Visual Studio tracks which line of code is being executed, as well as the function call stack and local variables. Also, within the Visual Studio editor each function is hyper-linked to its declaration, allowing for quick navigation within a large project. The downside of using Visual Studio is that it requires a specialized project solution file. This file is simple to generate using Visual Studio wizards, but it is difficult and time consuming to manipulate because all of the settings are controlled by specialized dialog boxes within Visual Studio. Fortunately, Visual Studio solution files can also be generated using CMake, which allows for all of the advantages of the Visual Studio interface while avoiding many of the shortcomings.

- Concurrent Versioning System - The Concurrent Versioning System (CVS) is an open source program used to synchronize text files among many developers [60]. CVS runs on a server, and client programs on each user's system download the current version of the source code. After a user makes changes, he/she updates the code on the server using the client program. The CVS server keeps track of all of the changes and notifies the user if changes conflict with those of other programmers. Even though CVS can handle binary files, it cannot keep track of the types of changes within these files, which makes CVS difficult to use with Visual Studio files. However, CMake files and source code files work ideally with CVS.

- Doxygen - Doxygen is an open source, automatic documentation generation tool [136]. Doxygen uses special tags to strip comments out of the code and format them into a human readable form, such as HTML, man pages or LaTeX. The advantage of using Doxygen is that code commenting and documentation are completed in the same step. Another advantage of Doxygen is that it creates an incentive to keep the code comments up to date.
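As a small illustration of the Doxygen workflow described above, the snippet below shows the style of tagged comment the tool extracts; the function is a hypothetical placeholder written for this example rather than code taken from the research platform.

```cpp
#include <cmath>
#include <vector>

/**
 * \brief Compute the root mean squared alignment error between two sets of
 *        corresponding control points.
 *
 * \param distances  Point-to-point distances (in mm) after alignment.
 * \return The RMS of the distances, or 0.0 for an empty input.
 *
 * Running doxygen over a source tree containing comments like this one
 * produces linked HTML, man, or LaTeX pages for every documented symbol.
 */
double rmsError(const std::vector<double>& distances) {
    if (distances.empty()) return 0.0;
    double sumSq = 0.0;
    for (double d : distances) sumSq += d * d;
    return std::sqrt(sumSq / distances.size());
}
```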
A.3 Algorithm Components and Required SDKs

At a minimum, the 3D research platform developed in this dissertation requires a running copy of CMake and a C++ compiler. This section describes the set of additional packages and SDKs that can be used by the research platform. If a package is optional, CMake searches the host computer to find the package and, if found, links in the new code.

- Microsoft Foundation Classes - The GUI in the demo program is designed as an MFC project in Microsoft Visual Studio [106]. This part of the system code will only run on Windows. CMake identifies the operating system and only builds the Demo program if the system is running Windows. The main reason for limiting the demo program to a Windows environment is that the VIVID SDK is only designed to work in a Windows environment.

- VIVID SDK - The VIVID SDK is a software interface library designed to work directly with the VIVID 910 scanner, and it comes freely available with a new scanner [107]. The VIVID 910 uses a SCSI interface to communicate with the computer; however, the required SCSI drivers are only available for Windows. The current software interface between the VIVID 910 and the 3D research platform works only with MFC because the software uses a special MFC class to convert from the native cdm data structure into the 3D research platform's facescan data structure. There is no reason that the MFC classes could not be eliminated, but because of the limitations of the VIVID SDK, eliminating MFC would only enable the scanner to be used through Cygwin and the command line. Command line operation of the scanner is not particularly useful because there is no GUI interface to show the user where he/she is standing.

- Template Numerical Toolkit - The Template Numerical Toolkit (TNT) is a collection of public domain interfaces useful for mathematical computations in C++ [123]. There are three primary data structures used throughout the code: Array2D, Array3D and Eigenvalue. Array2D and Array3D are used in the research platform to store all scanner matrices, while the Eigenvalue calculations are used for surface fitting. The TNT collection is required for almost all of the programs provided with the 3D research platform.

- Matlab SDK - Matlab comes installed with a Matlab SDK, which allows Matlab code to be executed from within other programs [104]. The SDK also allows programs to be developed that can load and save to the native Matlab format. This SDK was important for maintaining a database that could be loaded and run in both the Matlab and C++ versions of the 3D Research Platform.

- Image Magick SDK - Image Magick is a freely available, open source SDK used for loading and saving digital images [97]. Image Magick is available across many platforms and is useful mostly because it can load a wide variety of image formats. However, installing and linking Image Magick is not always simple, and most of the functionality of Image Magick is also available through the other programming interfaces. For example, both Matlab and MFC provide a class for loading and saving images; however, neither of these classes is open source or freely available.

- Biometrics Experiment Environment - The Biometrics Experiment Environment (BEE) was developed by the National Institute of Standards and Technology (NIST) for use within the Face Recognition Grand Challenge (FRGC) [49]. BEE is a system of programs, data types, and interfaces that allows all the FRGC participants to communicate in a common language and submit their results to NIST in a common format. FRGC includes many experiments, of which Experiment 3 is the only one to deal with 3D data.
There are two types of programs required to run with BEE: preprocessors and "BioBoxes." A preprocessor is a program, such as the CFDM, that preprocesses each file. Strictly speaking, the preprocessor is not needed, but it is commonly used to normalize the data. The "BioBox" is a program for comparing two sets of scans (target and query). The output of the "BioBox" is a signature set that contains the matching scores for the target and query sets. There are several types of files that must be read and used by the research platform to work with BEE, including:

1. Signature Sets - XML files that contain the list of data files used by both the preprocessor and the BioBox.
2. Workspace File - An XML file that tells BEE how to run the specific programs for specific datasets.

- Xerces - Xerces is an open source validating XML parser [59]. XML is extensively used by the Biometrics Experiment Environment (BEE), which also uses the Xerces parser. The current version of the 3D Research Platform only has the Xerces parser working in Cygwin and Linux. A BEE signature set XML reader was also developed independent of Xerces in order to allow this code to be run on HPC and Windows. If Xerces is not found, CMake will not include it in the project.

A.4 Cross Platform Computing

The 3D Research Platform developed for this dissertation is primarily designed for running scientific experiments. Its secondary functions are to demonstrate the algorithms and to provide a viable prototype for a practical 3D verification system. Not all system functionality is available on all platforms. For example, the Biometrics Experiment Environment (BEE) will only run on a Linux system with administrator privileges. Table A.2 lists the various platform features that are available on the different systems.

Table A.2: Cross platform functionality.

             GUI Demo   VIVID 910 Control   Command Line Programs   Scripting Tools   XML Tools (Xerces)   BEE
Windows      Yes        Yes                 Yes                     No                No                   No
Cygwin       No         No                  Yes                     Yes               Yes                  No
Linux        No         No                  Yes                     Yes               Yes                  Yes
HPC          No         No                  Yes                     Yes               Yes                  Some

Due to development and system restrictions, none of the platforms shown in Table A.2 has all of the features available, so it may be necessary to use more than one system when working with the 3D Research Platform. Key features of the 3D Research Platform on each system include:

- Windows - In Windows, the complete 3D Research Platform can be compiled in Visual Studio. Because the demo program requires the Microsoft Foundation Classes (MFC), Windows is the only platform able to compile and run the demo. This is also true for operation of the VIVID 910 scanner, which requires a SCSI driver file only available in Windows. Operating the camera also requires a limited GUI interface that was developed using MFC. In the future, the MFC library should be avoided if cross platform compatibility is desired; a library such as wxWidgets may be more useful in these circumstances [142].

- Cygwin - Another problem with Windows is that it cannot easily run the numerous scripting tools (such as grep, bash, perl, python, etc.) that are used to facilitate testing [69]. These tools are common on Linux systems but have to be installed independently on a Windows machine. The easiest way to install these programs is to install Cygwin. Cygwin is a Linux compatible interface written for Windows; thus, a Windows system with Cygwin installed has the most functionality for the 3D research platform. However, the combination of Windows and Cygwin does not allow BEE to be run.
- Linux - Installing the 3D Research Platform on Linux enables all of the research trial tools, including BEE, but it is not possible to run the Demo or gather data with the VIVID 910 scanner. Proper installation of BEE requires full administrator access on the host computer because the installation script for BEE assumes that some libraries and SDKs (such as Image Magick, Xerces, Java, etc.) will be installed in fixed paths. A new installation script was developed to work around this limitation by using a fake root path to install the required software. This method allowed the software to be installed on HPC without administrator privileges, but the script is a work in progress and does not always perform as expected.

- High Performance Computing - The 3D Research Platform was compiled on the two High Performance Computing (HPC) systems at MSU. The first system is a 64-processor 1.6GHz Itanium2 SGI Altix 3700 BX2 with 512 Gigabytes of RAM. The second system is a 128 node cluster from Western Scientific with dual AMD Opteron 27 processors (2.2GHz, 2MB cache, dual core) in each node, for a total of 512 processor cores. Both HPC computers run Linux, which works well with the 3D Research Platform. However, many of the command line programs in the 3D Research Platform were rewritten to take full advantage of the parallel system.

BIBLIOGRAPHY

Bibliography

[1] 3dMD. http://www.3dmd.com/, 2005.
[2] A4Vision. http://www.a4vision.com/, 2005.
[3] ABCNews. Scan suggests ex-sailor was WWII smoocher. http://abcnews.go.com/GMA/Technology/story?id=1036576. ABC News Internet Ventures, 2006.
[4] Bernard Achermann and Horst Bunke. Classifying range images of human faces with Hausdorff distance. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, page 2809, Barcelona, Spain, 2000.
[5] Jörgen Ahlberg. Candide-3 - an updated parameterized face. Technical Report LiTH-ISY-R-2326, Linköping University, Sweden, 2001.
[6] Soon-Jeong Ahn, Jaechil Yoo, Byung-Gook Lee, and Joon-Jae Lee. 3D surface reconstruction from scattered data using moving least square method. In Proceedings of the 13th International Conference on Image Analysis and Processing, pages 719-726, Cagliari, Italy, 2005.
[7] Animetrics. http://www.animetrics.com/, 2006.
[8] A-Nasser Ansari and Mohamed Abdel-Mottaleb. Automatic facial feature extraction and 3D face modeling using two orthogonal views with application to 3D face recognition. Pattern Recognition, 38(12):2549-2563, 2005.
[9] Vera Bakic and George Stockman. Menu selection by facial aspect. In Proceedings of Vision Interface '99, Trois-Rivières, Canada, 1999.
[10] Philippe Ballard and George Stockman. Controlling a computer via facial aspect. IEEE Transactions on Systems, Man and Cybernetics, 25(4):669-677, 1995.
[11] Roman Bartak. Theory and practice of constraint propagation. In Proceedings of the 3rd Workshop on Constraint Programming in Decision and Control, pages 7-14, Gliwice, Poland, 2001.
[12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] Marian Stewart Bartlett, Gwen Littlewort, Ian Fasel, and Javier R. Movellan. Real time face detection and facial expression recognition: Development and applications to human computer interaction. In Proceedings of CVPR Workshop on Computer Vision and Pattern Recognition for Human-Computer Interaction, volume 05, page 53, Los Alamitos, CA, USA, 2003. Olga Regina Pereira Bellon, Luciano Silva, and Chaua C. Queirolo. 3D face matching using the surface interpenetration measure.
In Proceedings of the 13th International Conference on Image Analysis and Processing, pages 10514058, Cagliari, Italy, 2005. Chiraz BenAbdelkader and Paul A. Griffin. Comparing and combining depth and texture cues for face recognition. Image and Vision Computing, 23:339 352, 2005. Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509~517, 1975. Barry K.B. Berkovitz and Bernard J. Moxham. A Textbook of Head and Neck Anatomy. Year Book Medical Publishers, Inc., 1988. Paul J. Besl and Neil D. McKay. A method for registration of 3—d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239- 256, 1992. Charles Beumier and Marc Acheroy. Automatic face identification. In Appli- cations of Digital Image Processing XVIII, SPIE, volume 2564, pages 311-323, 1995. Charles Beumier and Marc Acheroy. Automatic 3D face authentication. Image and Vision Computing, 18(4):315-321, 2000. J. Ross Beveridge, David Bolme, Bruce A. Draper, and Marcie Teixeira. The CSE face identification evaluation system. Machine Vision and Applications, 16(2):128w138, 2005. Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3D mor- phable model. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 25(9):1063 1074, 2003. Chris Boehnen and Patrick Flynn. Accuracy of 3D scanning technologies in a face scanning scenario. In 3DIM ’05: Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, pages 310 317, Ottawa, On- tario Canada, 2005. Chris Boehnen and Trina Russ. A fast multi-modal approach to facial feature detection. In Proceedings of 7th IEEE Workshops on Applications of Com- puter Vision ( WACV/MOTION ’05), volume 01, pages 135 142, Breckenridge, Colorado, 2004. 217 In. [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] Kevin W. Bowyer, Kyong Chang, and Patrick Flynn. A survey of approaches to three-dimensional face recognition. In Proceedings of the 17th International Conference on Pattern Recognition, volume 01, pages 358 361, Cambridge, United Kingdom, 2004. Kevin W. Bowyer, Kyong Chang, and Patrick Flynn. A survey of approaches and challenges in 3d and multi-modal 3d + 2d face recognition. Compututer Vision and Image Understanding, 101(1):1~15, 2006. Alexander M. Bronstein, Micheal M. Bronstein, and Ron Kimmel. Expression- invariant 3D face recognition. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFC 2003), pages 232-233, Nice, France, 2003. Alexander M. Bronstein, Micheal M. Bronstein, and Ron Kimmer. Three— dimensional face recognition. International Journal of Computer Vision, 64(1):5~—30, 2005. Myron Z. Brown, Darius Burschka, and Gregory D. Hager. Advances in com- putational stereo. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 25(8):993 1008, 2003. Roberto Brunelli and Tomaso Poggio. Face recognition: Features versus tem- plates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042 1052, 1993. Canesta. http: //www.canesta.com/ , 2005. J. Y. Cartoux, J .T. LaPreste, and M. Richetin. Face authentification or recog- nition by profile extraction from range images. In Proceedings of Workshop on Interpretation of 3D Scenes, pages 194 199, Austin, Texas, November 1989. Marco La Cascia and Stan Sclaroff. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. 
IEEE Transactions on Pattern Recognition and Image Processing, 22(2):322 336, 2000. Kyong I. Chang, Kevin W. Bowyer, and Patrick J. Flynn. Face recognition using 2D and 3D facial data. In Workshop on Multimodal User Authentication, pages 25 -32, Santa Barbara, California, 2003. Kyong I. Chang, Kevin W. Bowyer, and Patrick J. Flynn. Multi-modal 2D and 3D biometrics for face recognition. In IEEE International Workshop on Analysis and Modeling of Faces and Gestures, page 187, France, 2003. Kyong I. Chang, Kevin W. Bowyer, and Patrick J. Flynn. An evaluation of multimodal 2D+3D face biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):619624, 2005. 218 [36] [37] [38] [39] [40] [41] [45] [46] [47] Hong Chen and Anil K. Jain. Dental biometrics: Alignment and matching of dental radiographs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1319—1326, 2005. Yang Chen and Gérard Medioni. Object modeling by registration of mul- tiple range images. International Journal of Image and Vision Computing, 10(3):145—155, 1992. Yi Chen, Sarat C. Dass, and Anil K. Jain. Localized iris image quality using 2D wavelets. In Proceedings of the International Conference on Biometrics ([08), pages 373 381, Hong Kong, January 2006. Dmitry Chetverikov, Dmitry Svirko, Dmitry Stepanov, and Pavel Kresek. The trimmed iterative closest point algorithm. In Proceedings of the 16th Interna- tional Conference on Pattern Recognitionn, volume 3, pages 545—548, Quebec City, Canada, 2002. Amit K. Roy Chowdhury and Rama Chellappa. Statistical bias in 3D recon- struction from a monocular video. IEEE Transactions on Image Processing, 14:1057~1062, 2005. CS. Chua, F. Han, and Y. Ho. 3D human face recognition using point signature. In Proceedings of IEEE 4th International Conference on Automatic Face and Gesture Recognition, pages 233—238, Grenoble, France, 2000. CS. Chua and RA. Jarvis. Point signatures: A new representation for 3D object recognition. International Journal of Computer Vision, 21(1):63—85, 1997. Cognitec. FaceVACS face recognition software. http://www.cognitec— systems.de/, 2006. Philippe Colantoni, Nabil Bolukala, and Jér6me Da Rugna. Fast and accurate color image processing using 3D graphics cards. In In Proceedings of 8th Inter- national Fall Workshop: Vision Modeling and Visualization, Munich, Germany, 2003. Dirk Colbry, Xiaoguang Lu, Anil Jain, and George Stockman. 3D face feature extraction for recognition. Technical Report MSU-CSE—04-39, Michigan State University, September 2004. Dirk Colbry, George Stockman, and Anil Jain. Detection of anchor points for 3D face verification. In Proceedings of Workshop Advanced 30 Imaging for Safety and Security, San Diego California, 2005. Robert T. Collins, Alan J. Lipton, Takeo Kanade, Hironobu Fujiyoshi, David Duggins, Yanghai Tsin, David Tolliver, Nobuyoshi Enomoto, Osamu Hasegawa, 219 Peter Burt, and Lambert Wixson. A system for video surveillance and monitor- ing. Technical Report CMU-RI- TR-00-12, Carnegie Mellon University Robotics Institute, 2000. [48] Cristina Conde and Angel Serrano. 3D facial normalization with spin images and influence of range data calculation over face verification. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, page 115, 2005. [49] Science Applications International Corporation. Biometrics experimentation environment (BEE) http://www.bee-biometrics.org/, 2004. [50] CyberExtruder. http://www.cyberextruder.com/. 2005. [51] Cyberware. http://www.cyberware.com/, 2006. 
[52] Maltoni David, Dario Maio, Anil K. Jain, and Salil Prabhaker. Handbook of Fingerprint Recognition. Springer Verlag, 2003. [53] Douglas DeCarlo, Dimitris Metaxas, and Matthew Stone. An anthropometric face model using variational techniques. In Proceedings of International Con- ference on Computer Graphics and Intractive Techniques, pages 67-74, 1998. [54] Chitra Dorai and Anil K. Jain. Cosmos - a representation scheme for 3D free- form objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(10):1115--1130, 1997. [55] Ioannis Douros and Bernard F. Buxton. Three-dimensional surface curvature estimation using quadric surface patches. In Scanning 2002, Paris, 2002. [56] Nicolae Duta, Anil K. Jain, and Kanti V. Mardia. Matching of palmprints. Pattern Recognition Letters, 23(4):477 485, 2002. [57] Leslie G. Farkas and Ian R. Munro. Anthropometric Facial Proportions in Medicine. Charles C Thomas, Springfield, Illinois, 1987. [58] Patrick Joseph Flynn. CAD-Based computer vision: Modeling and Recognition Strategies. PhD thesis, Michigan State University, 1990. [59] Apache Software Foundation. Xerces xml parser http: / /xm1.apache.org/ xerces- c/,2006. [60] Free Software Foundation. CVS - concurrent versions system. http://www.nongnu.org/cvs/, 2006. [61] Jerome H. Freidman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software ( TOMS ), 3:209 226, September 1977. 220 [62] James Fung and Steve Mann. Computer vision signal processing on graphics processing units. In Proceedings of Acoustics, Speech, and Signal Processing, volume 5, pages 93-96, 2004. [63] Sir Francis Galton. Personal identification and description. Nature, pages 173 - 177, 1888. [64] Natasha Gelfand, Leslie Ikemoto, Szymon Rusinkiewicz, and Marc Levoy. Geo- metrically stable sampling for the icp algorithm. In Proceedings of International Conference on 3D Digital Imaging and Modeling, pages 260—267, Banff, 2003. [65] Geometrix. http://www.geometrix.com/. 2005. [66] Giesecke and Devrient. Big SIM cards from 512 kb to 512 mb. Technical report, http://www.gi-de.com/, 2006. [67] Gaile Gordon. Face recognition based on depth and curvature features. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 108410, 1992. [68] A. Ardeshir Goshtasby. 2—D and 3—D Image Registration for Medical, Remote Sensing, and Industrial Applications. Wiley, 2005. [69] Inc. Red Hat. Cygwin DLL and utilities http://www.cygwin.com/, 2005. [70] Curt Hesher, Anuj Srivastava, and Gordon Erlebacher. PCA of range images for facial recognition. In Proceedings of International Multiconference in Computer Science, Las Vegas, Nevada, 2002. [71] Erik Hjelmas. Face detection: A survey. Computer Vision and Image Under- standing, 83:236—--274, 2001. [72] Horace H.S.Ip and Lijun Yin. Constructing a 3D individualized head model from two orthogonal views. The International Journal in Computer Graphics, 12(5), 1996. [73] Rein-Lien Hsu. Face Detection and Modeling for Recognition. Ph.D. Disserta- tion, Michigan State University, 2002. [74] Rein-Lien Hsu and Anil K. Jain. Face modeling for recognition. In Proceedings of the International Conference on Image Processing, pages 693' 696, Greece, 2001. [75] Rein-Lien Hsu and Anil K. Jain. Generating discriminating cartoon faces us- ing interacting snakes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(11):]388-4398, 2003. [76] Robert A. Hummel and Steven W. Zucker. 
On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intel- ligence, 5:267 287, 1983. 221 [77] Identix. FaceIT surveillance SDK http://www.identix.com/, 2005. [78] ATI Technologies Inc. Visual technology. http: / / www.ati.com / technology / index.html, 2006. [79] Anil Jain, Yi Chen, and Meltem Demirkus. Pores and ridges: Fingerprint matching using level 3 features. In Proceedings of the 18th International Con- ference on Pattern Recognition, Hong Kong, August 2006. [80] Anil Jain, S. Prabhaker, and S. Pankanti. Twin test: On discriminability of fingerprints. In Proceedings of the 3rd International Conference on Audio- and Video-Based Person Authentication, pages 211—216, 2001. [81] Anil K. Jain, Lin Hong, and Yatin Kulkami. A multimodal biometric system [82] [83] [84] [85] [87] [88] [89] using fingerprint, face, and speech. In Proceedings of the 2nd International Conference on Audio- and Video-based Biometric Person Authentication, pages 182 —187, Washington DC, March 1999. Anil K. Jain, Arun Ross, and Sharath Pankanti. A prototype hand geometry- based verification system. In Proceedings of 2nd International Conference on Audio- and Video-based Biometric Person Authentication, pages 166 171, 1999. Anil K. Jain, Arun Ross, and Umut Uludag. Biometric template security: Challenges and soutions. In Proceedings of the 14th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, September 2005. Andrew E. Johnson and Martial Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5):433~449, 1999. Javier Ortega-Garcia Julian Fierrez-Aguilar, Stephen Krawczyk and Anil K. Jain. Phsion of local and region approaches for on-line signature verification. In Proceedings of the International Workshop on Biometric Recognition Systems {IWBRS}, pages 1887196, Beijing, China, October 2005. Takeo Kanade. Picture Processing System by Computer Complex and Recogni- tion of Human Faces. PhD thesis, Kyoto University, 1973. Insight Consortium Kitware Inc. Cmake cross-platform make. http: / / www.cmake.org / HTML / Index.html, 2002. Rotem Kowner. Psychological perspectives on human developmental stabil- ity and fluctuating asymmetry; sources, applications and implications. British Journal of Psychology, 92(3):4477469, 2001. Mitsubishi Electric Research Laboratories. http://www.merl.com/. Technical report, 2006. 222 [90] J. Lee and E. Milios. Matching range images of human faces. Proceedings of IEEE International Conference on Computer Vision, pages 722—726, 1990. [91] Yonguk Lee, Hwanjong Song, Ukil Yang, Hyungchul Shin, and Kwanghoon Sohn. Local feature based 3D face recognition. In Proceedings of the 5th Inter- national Conference on Audio- and Video—Based Biometric Person Authentica- tion, pages 909-918, 2005. [92] Yuencheng Lee, Demetri Terzopoulos, and Keith Walters. Realistic modeling for facial animation. In SI GGRAPH ’95: Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 55 —-62, New York, NY, USA, 1995. ACM Press. [93] Alexandre Lemieux and Marc Parizeau. Experiments on eigenfaces robustness. In Proceedings of the 16th International Conference on Pattern Recognition, Quebec City,QC, Canada, 2002. [94] Stan Z. Li and Anil K. Jain, editors. Handbook of Face Recognition. Springer Verlag, 2005. [95] James J. Little and Jeffrey E. Boyd. Recognizing people by their gait: The shape of motion. 
Videre: Journal of Computer Vision Research, 1(2), 1998. [96] Yanxi Liu, Robert T. Collins, and William E. Rothfus M.D. Automatic extrac- tion of the central symmetry (mid-sagittal) plane from neuroradiology images. In Proceedings of Medical Imaging Computing and Computer Assisted Interven- tion (MICCAI 2000), Pittsburgh, 2000. [97] ImageMagick Studio LLC. http://www.imagemagick.org/, 2006. [98] Xiaoguang Lu and Anil Jain. Deformation analysis for 3D face matching. In Proceedings of IEEE Workshops on Applications of Computer Vision, pages 99—104, Breckenridge, Colorado, 2005. [99] Xiaoguang Lu and Anil K. Jain. Ethnicity identification from face images. In Proceedings of SPIE International Symposium on Defense and Security : Biometric Technology for Human Identification, volume 5404, Orlando, FL,, 2004. [100] Ye Lu, Jason Z. Zhang, Q. M. Jonathan Wu, and Ze-Nian Li. A survey of motion-parallax-based 3-d reconstruction algorithms. IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, 34(4):532 548, 2004. [101] Sotiris Malassiotis and Michael G. Strintzis. Robust face recognition using 2D and 3D data: Pose and illumination compensation. Pattern Recognition, 38:2537—2548, 2005. 223 [102] RS. Manjunath, R. Chellappa, and C. von der Malsburg. A feature based approach to face recognition. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 373 378, 1992. [103] Giovianni Marola. On the detection of the axes of symmetry of symmetric and almost symmetric planar images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(1):104 108, 1989. [104] MathWorks. Matlab http://www.mathworks.com/, 2006. [105] Microsoft. Visual studio. http://msdn.microsoft.com/vstudio/, 2005. [106] Microsoft. MFC Microsoft Foundation Class Library. http: //msdn.microsoft.com/ , 2006. [107] Minolta. Vivid software development kit 11 version 2.03, 2003. [108] Minolta. Vivid 910 non-contact 3D laser scanner. http:/ / konicaminoltacomsg / products / industrial / instrument / vivid / vivid910.html, 2006. [109] N IST. Report to the United States Congress. summary of NIST standards for , biometric accuracy, tamper resistance, and interoperability. Technical report, National Institute of Standards and Technology, 2002. [110] nVidia. Programmable graphics processor technolgies. http://www.nvidia.com/page/technologies.html. Web, 2006. [111] Ojos. Internet photo search engine http://riya.com/, 2006. [112] Javier Ortega-Garcia, Joaquin Gonzalez-Rodriguez, Danilo Simn—Zorita, and Santiago Cruz-Llanas. From Biometrics Technology to Applications Regarding Face, Vocie, Signature and Fingerprint Recognition Systems, chapter 12, pages 289—337. Kluwer Academic Publishers, 2002. [113] J6rn Ostermann. Animation of synthetic faces in mpeg—4. In Computer Ani- mation, pages 49 51, Philadelphia, Pennsylvania, 1998. [114] Alice J. OToole, Fang Jiang, Dana Roark, and Hervé Abdi. Face processing: Advanced modeling and methods, chapter 9, pages 293 319. 2006. [115] Chia-Jung Pai, Hsiao—Rong Tyan, Yu-Ming Liang, Hong-Yuan Mark Liao, and Sci-Wang Chen. Pedestrian detection and tracking at crossroads. Pattern Recog- nition, 37:1025—1034, 2004. [116] Gang Pan and Zhaohui Wu. 3D face recognition from range data. International Journal of Image and Graphics, 5(3):573—593, 2005. [117] Igor S. Pandzic and Robert Forchheimer, editors. MPEG 4 Facial Animation: The Standard, Implementation and Applications. Wiley, 2002. 224 [118] P. Jonathon Phillips, Patrick J. Flynn, Todd Scruggs, Kevin W. 
Bowyer, Jin Chang, Kevin Hoffman, Joe Marques, Jaesik Min, and William Worek. Overview of the Face Recognition Grand Challenge. In Proceedings IEEE Con- ference on Computer Vision and Pattern Recognition, 2005. [119] P. Jonathon Phillips, Patrick Grother, Ross J. Micheals, Duane M. Black- burn, Elham Tabassi, and Mike Bone. Face Recognition Vendor Test (FRVT), overview and summary. Technical report, National Institute of Standards and Technology, 2003. [120] PhotonX. http://www.photon-x.com/, 2005. [121] PixelVelocity. http://www.pixelvelocity.com/, 2006. [122] Martha E. Pollack, Colleen E. McCarthy, Sailesh Ramakrishnan, Ioannis Tsamardinos, Laura Brown, Steven Carrion, Dirk Colbry, Cheryl Orosz, and Bart Peintner. Autominder: A planning, monitoring, and reminding assis- tive agent. In Proceedings of 7th International Conference on Intelligent Au- tonomous Systems, pages 265 272, 2002. [123] Roland Pozo. TNT - template numerical toolkit. http://math.nist.gov/tnt/. NIST, 2004. [124] Daniel Riccio and Jean-Lue Dugelay. Asymmetric 3D / 2D processing: A novel approach for face recognition. In Proceedings of the International Conference on Image Analysis and Processing, pages 986993, 2005. [125] Arun Ross, Karthik N andakumar, and Anil K. Jain. Handbook of Multibiomet- rics. Springer Verlag, 2006. [126] Ashok Samal and Prasana A. Iyengar. Automatic recognition and analysis of human faces and facial expressions: A survey. Pattern Recognition, pages 65-77, 1992. [127] Peter T. Sander and Steven Zucker. Tracing surfaces for surfacing traces. McGill University: McGill Research Centre for Intelligent Machines, 1988. [128] Renata Scognamillo, Gillian Rhodes, Concetta Morrone, and David Burr. A feature-based model of symmetry detection. Proceedings Biological Sciencees / Royal Society, 270(1525):1727*1733, 2003. [129] Dan Smith. Caricature of Jay Leno. http://www.dansmithartist.com/, 2006. [130] Geomagic Studio. http://www.geomagic.com/en/, 2004. [131] H. Tanaka, M. lkeda, and H. Chiaki. Curvature-based face surface recognition using spherical correlation. In Proceedings of IEEE 3rd International Conference on Automatic Face and Gesture Recognition, pages 372—377, 1998. 225 [132] Carlo Tomasi and Takeo Kanade. Shape and Motion from Image Streams: A Factorization Method Part 2. Detection and Trcking of Point Features. Tech- nical Report CMU—CS-91—132, April 1991. [133] Mattew Turk and Alex Pentland. Eigenfaces for recognition. Journal of Cog- nitive Neuroscience, 3(1):71—86, 1991. [134] Alexander V. Tuzikov, Olivier Colliot, and Iisabelle Bloch. Brain symmetry plane computation in MR images using inertia axes and optimization. In Pro- ceedings of the 16th International Conference on Pattern Recognition, Quebec City, Canada, 2002. [135] Umut Uludag, Sharath Pankanti, and Anil K. Jain. Fuzzy vault for fingerprints. In International Conference on Audio- and Video-Based Biometric Person Au- thentication, pages 577 581, 2005. [136] Dimitri van Heesch. Doxygen. http://www.doxygen.nl/, 2006. [137] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pages 511—518, 2001. [138] Paul Viola and Micheal Jones. Robust real-time object detection. In Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, Vancouver, Canada, 2001. [139] Yingjie Wang, Chin-Seng Chua, and Yeong—Khing Ho. 
Facial feature detection and face recognition from 2D and 3D images. Pattern Recognition Letters, 23:1191-1202, 2002.
[140] James Wayman, Anil Jain, Davide Maltoni, and Dario Maio, editors. Biometric Systems: Technology, Design and Performance Evaluation. Springer, 2005.
[141] David M. Weinstein. The analytic 3-d transform for the least-squared fit of three pairs of corresponding points. School of Computing Technical Report UUCS-98-005, University of Utah, March 1998.
[142] wxWidgets. Cross-platform GUI library. http://wxwidgets.org/. SourceForge Project, 2006.
[143] Jing Xiao, Simon Baker, Iain Matthews, and Takeo Kanade. Real-time combined 2D+3D active appearance models. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 535-542, Washington D.C., 2004.
[144] Chenghua Xu, Yunhong Wang, Tieniu Tan, and Long Quan. Face recognition based on 3D mesh models. In Proceedings of SPIE - The International Society for Optical Engineering, Denver, Colorado, 2004.
[145] [146] [147] [148] [149] [150] [151] [152] [153] Ping Yan and Kevin W. Bowyer. Ear biometrics using 2D and 3D images. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition, volume 3, pages 121-121, June 2005. Ming-Hsuan Yang, David J. Kriegman, and Narendra Ahuja. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002. Alper Yilmaz and Mubarak Shah. Automatic feature detection and pose recovery for faces. In Proceedings of the Fifth Asian Conference on Computer Vision, pages 284-289, 2002. Lijun Yin and Matt T. Yourst. 3D face recognition based on high-resolution 3D face modeling from frontal and profile views. In WBMA '03: Proceedings of the 2003 ACM SIGMM workshop on Biometrics methods and applications, pages 1-8, New York, NY, USA, 2003. ACM Press. Hagit Zabrodsky, Shmuel Peleg, and David Avnir. Symmetry as a continuous feature. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:1154-1166, 1995. Haihong Zhang, Zhiyong Huang, Weimin Huang, and Liyuan Li. Kernel-based method for tracking objects with rotation and translation. In Proceedings of the 17th International Conference on Pattern Recognition, pages 728-731, Cambridge, United Kingdom, 2004. Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape from shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):690-706, August 1999. Zhengyou Zhang. Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(1):119-152, 1994. Zhi-Hua Zhou and Xin Geng. Projection functions for eye detection. Pattern Recognition, 37:1049-1056, 2004.