REPRESENTATION LEARNING AND IMAGE SYNTHESIS FOR DEEP FACE RECOGNITION

By

Xi Yin

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science — Doctor of Philosophy

2018

ABSTRACT

REPRESENTATION LEARNING AND IMAGE SYNTHESIS FOR DEEP FACE RECOGNITION

By

Xi Yin

Face recognition has advanced greatly in recent years thanks to the development of deep neural networks. The large intra-class variations in pose, illumination, and expression are long-standing challenges, and learning a discriminative representation that is robust to these variations is the key. In scenarios with profile poses or long-tail training data, image-level or feature-level data augmentation is needed. This dissertation presents three methods to address these problems.

First, we explore a multi-task Convolutional Neural Network (CNN) that leverages side tasks to improve representation learning. A pose-directed multi-task CNN is introduced to better handle pose variation. The proposed framework is effective in pose-invariant face recognition. Second, we propose a Face Frontalization Generative Adversarial Network (FF-GAN) that can generate a frontal face even from an input image with an extreme profile pose. FF-GAN handles pose variation from the perspective of image-level data augmentation. Multiple loss functions are proposed to achieve large-pose face frontalization. The proposed approach is evaluated on various tasks including face reconstruction, landmark detection, face frontalization, and face recognition. Third, a feature transfer learning method is presented to solve the problem of insufficient intra-class variation via feature-level data augmentation. A Gaussian prior is assumed across all the regular classes, and the variance is transferred from regular classes to long-tail classes. Further, an alternating training regimen is proposed to simultaneously achieve less biased decision boundaries and more discriminative representations. Extensive experiments have demonstrated the effectiveness of the proposed feature transfer framework.

ACKNOWLEDGMENTS

This dissertation would not have been possible without the help of many people.

I am very honored to have Dr. Xiaoming Liu as my advisor. His expectation and encouragement have made me achieve more than I could ever have imagined. The time we spent debugging code, brainstorming, and polishing papers has refined my skills in critical thinking, presentation, and writing. By setting himself as an example, he has taught me what a good researcher should be like.

It is my great pleasure to have Dr. Anil K. Jain, Dr. Arun Ross, and Dr. Daniel Morris on my Ph.D. guidance committee. As a world-leading researcher, Dr. Jain has inspired many younger generations, including me, to pursue a Ph.D. Dr. Ross's patience and insightful comments at all presentations have shown me that every researcher desires to be heard. Thanks to Dr. Morris for his attention to detail and valuable insights in our collaboration. I would like to thank Dr. Xiang Yu, Dr. Kihyuk Sohn, and Dr. Manmohan Chandraker at NEC Labs for offering me an internship and the collaboration opportunity.

I am grateful for my labmates, Joseph Roth, Amin Jourabloo, Morteza Safdarnejad, Yousef Atoum, Luan Tran, Garrick Brazil, Yaojie Liu, Bangjie Yin, Joel Stehouwer, and Adam Terwilliger.
Their valuable comments in paper reviews, their willingness to help, their encouragement when I was in a bad mood, and the fun we had together have made this a very pleasant journey. I am lucky to have met Sandy and Mick, who have treated me like their daughter, which means a lot to me. Thanks to my friends at Michigan State University, Huiyun, Wei, Sahra, Panpan, Chelsea, Kailyn, Jiao, Shaohua, and Sorrachai, for their company that kept my mind refreshed.

Finally, I would like to thank my parents, who have taught me to be brave, positive, and kind-hearted. Thanks to my friends Dan, Zi, Bei, and Renjie for the precious friendship since childhood. Thanks to Dr. Yu Wang for his intelligence and humor that make my life colorful.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction and Contributions
  1.1 Face Recognition
  1.2 Dissertation Contributions
  1.3 Dissertation Organization

Chapter 2  Background and Related Work
  2.1 Basics
    2.1.1 Recognition Categories
    2.1.2 Face Recognition Pipeline
    2.1.3 Evaluation Metrics
  2.2 Challenges
    2.2.1 Data Variance
    2.2.2 Method Design
  2.3 Methodologies
    2.3.1 General Deep Face Recognition
      2.3.1.1 Probabilistic-based Models
      2.3.1.2 Energy-based Models
    2.3.2 Pose-Invariant Face Recognition
      2.3.2.1 Multi-View Subspace Learning
      2.3.2.2 Pose-Invariant Feature Extraction
      2.3.2.3 Face Synthesis

Chapter 3  Multi-Task Learning
  3.1 Introduction
  3.2 Proposed Method
    3.2.1 Multi-Task CNN
    3.2.2 Dynamic-Weighting Scheme
    3.2.3 Pose-Directed Multi-Task CNN
  3.3 Experimental Results
    3.3.1 Face Identification on Multi-PIE
    3.3.2 How does m-CNN work?
    3.3.3 Unconstrained Face Recognition
  3.4 Summary

Chapter 4  Large-Pose Face Frontalization
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 Reconstruction Module
    4.2.2 Generation Module
    4.2.3 Discrimination Module
    4.2.4 Recognition Module
  4.3 Implementation Details
    4.3.1 Network Structures
    4.3.2 Training Strategies
  4.4 Experimental Results
    4.4.1 Settings and Datasets
    4.4.2 3D Reconstruction
    4.4.3 Face Recognition
    4.4.4 Face Frontalization
    4.4.5 Ablation Study
  4.5 Summary

Chapter 5  Feature Transfer Learning
  5.1 Introduction
  5.2 Proposed Method
    5.2.1 Motivations
    5.2.2 Proposed Framework
    5.2.3 Long-Tail Class Feature Transfer
    5.2.4 Alternating Training Strategy
  5.3 Experimental Results
    5.3.1 Feature Center Estimation
    5.3.2 Effects of m-L2 Regularization
    5.3.3 Ablation Study
    5.3.4 One-Shot Face Recognition
    5.3.5 Large-Scale Face Recognition
    5.3.6 Qualitative Results
  5.4 Summary

Chapter 6  Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

BIBLIOGRAPHY
LIST OF TABLES

Table 3.1: Comparison of the experimental settings that are commonly used in prior work on Multi-PIE. (*The 20 images consist of 2 duplicates of non-flash images and 18 flash images. In total there are 19 different illuminations.)

Table 3.2: Performance comparison (%) of single-task learning (s-CNN), multi-task learning (m-CNN) with its variants, and pose-directed multi-task learning (p-CNN) on the entire Multi-PIE dataset.

Table 3.3: Multi-PIE performance comparison on setting III of Table 3.1.

Table 3.4: Multi-PIE performance comparison on setting V of Table 3.1.

Table 3.5: Performance comparison on the LFW dataset.

Table 3.6: Performance comparison on the CFP dataset. Results reported are the average ± standard deviation over the 10 folds.

Table 3.7: Performance comparison on IJB-A.

Table 4.1: Performance comparison on the LFW dataset with accuracy (ACC) and area-under-curve (AUC).

Table 4.2: Performance comparison on the IJB-A dataset.

Table 4.3: Performance comparison on the Multi-PIE dataset.

Table 4.4: Quantitative results of the ablation study.

Table 5.1: Results of the controlled experiments by varying the ratio between regular and long-tail classes in the training sets.

Table 5.2: Results of the controlled experiments by varying the number of images for each long-tail class in the training sets.

Table 5.3: Comparison on the one-shot MS-Celeb-1M challenge. Results on the base classes are reported as rank-1 accuracy and on novel classes are reported as Coverage@Precision = 0.99.

Table 5.4: Face recognition on LFW and IJB-A. "MP" represents media pooling and "TA" represents template adaptation. The best and second-best results are highlighted.

LIST OF FIGURES

Figure 1.1: Examples of face images showing the challenges of face recognition caused by pose, expression, occlusion, illumination, and image blur.

Figure 1.2: Examples of face images generated by GAN. Top row shows the images generated by [44] where the face resolution is less than 100×100. Bottom row shows the images generated by [74] where the face resolution is as large as 1024×1024.

Figure 1.3: Dataset distributions for (a) CASIA-Webface and (b) MS-Celeb-1M.

Figure 2.1: Face matching pipeline. It consists of four steps: face detection, face alignment, feature extraction, and feature matching.

Figure 2.2: Pitch, yaw, roll angles for pose variation representation [4].

Figure 2.3: The general framework for multi-view subspace learning [36].

Figure 3.1: We propose MTL to disentangle the PIE variations from learnt identity features. (a) For single-task learning, the main variance is captured in x-y, resulting in an inseparable region between these two subjects. (b) For multi-task learning, identity is separable in the x-axis by excluding the y-axis that models pose variation.

Figure 3.2: The proposed m-CNN and p-CNN for face recognition. Each block reduces the spatial dimensions and increases the channels of the feature maps. The formats for the convolution and pooling parameters are: filter size / stride / filter number and method / filter size / stride. The feature dimensions after each block operation are shown on the bottom. The dashed line represents the batch split operation as shown in Figure 3.3. The layers with the stripe pattern are the identity features used in the testing stage for face recognition.

Figure 3.3: Illustration of the batch split operation in p-CNN.

Figure 3.4: Dynamic weights and losses for m-CNN and p-CNN during the training process.

Figure 3.5: Analysis on the effects of MTL: (a) the sorted energy vectors for all tasks; (b) visualization of the weight matrix Wall where the red box in the top-left is a zoom-in view of the bottom-right content; (c) the face recognition performance with varying feature dimensions.

Figure 3.6: The mean and standard deviation of each energy vector during the training process.

Figure 3.7: Energy vectors of m-CNN models with different overall loss weights.

Figure 3.8: Yaw angle distribution on the CASIA-Webface dataset.

Figure 4.1: The proposed FF-GAN framework. Given a non-frontal face image as input, the generator produces a high-quality frontal face. Learned 3DMM coefficients provide global pose and low-frequency information, while the input image injects high-frequency local information. A discriminator distinguishes generated faces against real ones, where high-quality frontal faces are considered as real ones. A face recognition engine is used to preserve identity information. The output is a high-quality frontal face that retains identity.

Figure 4.2: The proposed framework of FF-GAN. R is the reconstruction module for 3DMM coefficient estimation. G is the generation module to synthesize a frontal face. D is the discrimination module to make real-or-generated decisions. C is the recognition module for identity classification.

Figure 4.3: 3D faces generated with identity, expression, and texture variations.

Figure 4.4: Image flip and mask generation process for the symmetry loss.

Figure 4.5: Detailed network structure of FF-GAN.

Figure 4.6: (a) Our landmark localization and face frontalization results; (b) our 3DMM estimation; (c) ground truth from [166].

Figure 4.7: Face frontalization results comparison on LFW. (a) Input; (b) LFW-3D [55]; (c) HPEN [167]; (d) FF-GAN.

Figure 4.8: Visual results on Multi-PIE. Each example shows 13 pose-variant inputs (top) and the generated frontal outputs (bottom). We clearly observe that the outputs consistently recover similar frontal faces across all the pose intervals.

Figure 4.9: Face frontalization results on AFLW. FF-GAN achieves very promising visual effects for faces with small (rows (a) and (b)), medium (row (c)), and large (row (d)) poses and under various lighting conditions and expressions (row (e)). We observe that the proposed FF-GAN achieves accurate frontalization, while recovering high-frequency facial details as well as identity, even for face images observed under extreme variations in pose, expression, or illumination.

Figure 4.10: Face frontalization results on IJB-A. Odd rows are all profile-view inputs and even rows are the frontalized results.

Figure 4.11: Face frontalization results on CFP. Odd rows are all profile-view inputs and even rows are the frontalized results.

Figure 4.12: Ablation study results. (a) Input images; (b) M (ours); (c) M\{C}; (d) M\{D}; (e) M\{R}; (f) M\{Gid}.

Figure 5.1: (a) The long-tail distribution of CASIA-WebFace [150]. (b) The weight norm of a classifier varies across classes in proportion to their volume. (c) The weight vector norm for head class ID 1008 is larger than that of tail class ID 10,449, causing a bias in the decision boundary (dashed line) towards ID 10,449. (d) Even after data re-sampling, the variance of ID 1008 is much larger than that of ID 10,449, causing the decision boundary to still be biased towards the tail class. We augment the feature space of the tail classes as the dashed ellipsoid and propose improved training strategies, leading to an improved classifier.

Figure 5.2: The proposed framework includes a feature extractor Enc, a decoder Dec, a feature filtering module R, and a fully connected layer as classifier FC. The proposed feature transfer module G generates a new feature ˜g from the original feature g. The network is trained with an alternating bi-stage strategy. At stage 1, we fix Enc and apply feature transfer G to generate new features (green triangle) that are more diverse and likely to violate the decision boundary. At stage 2, we fix the rectified classifier FC and update all the other models. As a result, the samples that are originally on or across the boundary are pushed towards their center (blue arrows in bottom right). Best viewed in color.

Figure 5.3: Visualization of samples closest to the feature center. (Left) We find that near-frontal, close-to-neutral faces are the nearest neighbors of the feature center for regular classes. (Right) Faces closest to the center are from classes with the least samples, which still contain pose and expression variance, as tail classes may severely lack neutral samples. Features are extracted by VGGFace [101] and samples are from CASIA-WebFace [150].

Figure 5.4: (a) Center estimation error comparison. (b) Two classes with intra-class and inter-class variance illustrated. Circles from small to large show the minimum, mean, and maximum distance from intra-class samples to the center. Distances are averaged across 1K classes.

Figure 5.5: Toy example on MNIST to show the effectiveness of our m-L2 regularization. (a) The training loss/accuracy comparison. (b) Feature distribution on the test set for the model trained without m-L2 regularization. (c) Feature distribution with m-L2 regularization.
Figure 5.6: Center visualization: (a) one sample image from the selected class; (b) the decoded image from the feature center.

Figure 5.7: Feature transfer visualization between two classes for every two columns. The first row shows the inputs, in which the odd column denotes class 1: x1 and the even column denotes class 2: x2. The second row shows the reconstructed images x′1 and x′2. In the third row, the odd-column image is the decoded image of the feature transferred from class 1 to class 2, and the even-column image is the decoded image of the feature transferred from class 2 to class 1. It is clear that the transferred features share the same identity as the target class while obtaining the source image's non-identity variance including pose, expression, illumination, etc.

Figure 5.8: Transition from the top-left image to the top-right image via feature interpolation. For each example, the first row shows traditional feature interpolation; the second row shows our transition of non-identity variance; the third row shows our transition of identity variance.

LIST OF ALGORITHMS

Algorithm 5.1  Alternating training scheme for feature transfer learning.
Algorithm 5.2  Functions that are called in Algorithm 5.1.

Chapter 1

Introduction and Contributions

1.1 Face Recognition

Face recognition is one of the most studied and long-standing research topics in computer vision. The key to face recognition is to extract discriminative identity features, or a representation, from a face image. The first face recognition system [170] dates back to the 1960s, when manual measurements of facial shape were used as the features to identify subjects. Later works focused on hand-crafted features such as LBP [2], HOG [34], and SIFT [91], and on controlled face images that are collected in the lab environment with cooperative subjects [45, 108]. In the last decade, in-the-wild face recognition has been the main focus. With the development of neural networks [78, 56, 61], deep features [115, 33, 139] have achieved impressive performance on several challenging face recognition benchmarks [62, 77, 116, 75], among which Labeled Faces in-the-Wild (LFW) [62] has been extensively evaluated on until human performance was surpassed [115].

Deep Neural Networks (DNNs) [78] have been successfully applied to many vision applications including face recognition [101, 115, 132], pedestrian detection [131, 158, 13], semantic segmentation [90, 98, 85], etc. This is attributed to the development of advanced network structures [127, 56, 61, 60], large labeled training data [49, 30], and powerful computing resources. Moreover, DNN frameworks make end-to-end learning possible, i.e., learning a mapping from the input image space to the target label space.

Figure 1.1: Examples of face images showing the challenges of face recognition caused by pose, expression, occlusion, illumination, and image blur.

In the context of face recognition, previous methods typically consist of two steps: high-dimensional feature extraction [19] and discriminative subspace learning [18]. With DNNs, the above two steps can be combined into a unified framework. Such a formulation is preferable because the loss function can directly supervise the representation learning process.
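To make this unified formulation concrete, below is a minimal PyTorch sketch (the architecture, dimensions, and names are illustrative assumptions, not the networks used in this dissertation): a small CNN maps an aligned face to a feature vector, a fully connected classifier supervises it with identity labels during training, and at test time the classifier is discarded and the penultimate feature serves as the face representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceCNNSketch(nn.Module):
    """Toy end-to-end face classifier: conv backbone -> feature -> identity logits."""
    def __init__(self, num_identities, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global pooling
        )
        self.embed = nn.Linear(64, feat_dim)         # identity representation
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, images):
        x = self.backbone(images).flatten(1)
        feat = self.embed(x)                         # used for matching at test time
        logits = self.classifier(feat)               # used only by the training loss
        return feat, logits

# Training step: the classification loss directly supervises the representation.
model = FaceCNNSketch(num_identities=1000)
images = torch.randn(8, 3, 112, 112)                 # a batch of aligned faces (dummy data)
labels = torch.randint(0, 1000, (8,))
feat, logits = model(images)
loss = F.cross_entropy(logits, labels)
loss.backward()
```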
How to design novel loss functions or new structures to learn a better representation is an essential topic that has been widely studied [132, 51, 89].

We study representation learning and image synthesis for deep face recognition, which is a challenging problem due to the large intra-class variations including Pose, Illumination, and Expression (collectively denoted as PIE), as well as occlusion and image blurring (as shown in Figure 1.1). General deep face recognition algorithms handle these challenges implicitly, i.e., without considering the source of variations, by imposing large inter-class distance and small intra-class distance regularizations [115, 141, 160]. On the contrary, supervised learning with auxiliary labels can be employed to study face recognition robust to specific variations [151, 71], among which pose is considered the largest challenge. Pose-Invariant Face Recognition (PIFR) has been studied extensively; a comprehensive survey on PIFR can be found in [36]. Existing algorithms can be mainly classified into learning-based and synthesis-based methods. Learning-based methods [153, 132] aim to design a novel loss function or framework to learn more discriminative features from a profile input face. Synthesis-based methods [154, 63] aim to generate a frontal-view face image of the same subject, which is easier for face recognition. Representation learning and image synthesis can be combined in a unified framework [151, 132].

In this dissertation, we first introduce a multi-task Convolutional Neural Network (CNN) with face recognition as the main task and PIE estimations as the side tasks to learn PIE-invariant representations. It may sound straightforward and intuitive to use PIE labels as auxiliary supervision, but we are actually the first to do so. We prove that the side tasks regularize the network to disentangle the PIE variations from the learnt identity representation. We overcome the problem of loss weight selection in multi-task learning by formulating a dynamic-weight scheme to learn the weights within the CNN framework. To better handle pose variation, we further propose a pose-directed multi-task CNN to learn pose-specific identity features for face images in different pose groups. These frameworks are extended to in-the-wild face recognition where no ground truth PIE labels are available. Instead, we use the estimated pose labels as soft labels for training, which is shown to be more effective than the single-task learning framework.

Besides representation learning-based methods for PIFR, image synthesis-based methods [154, 63, 8] have attracted more attention lately, owing to the success of the Generative Adversarial Network (GAN) [44] and its variants [96, 23, 46]. Goodfellow et al. introduce GAN to learn generative models via an adversarial training process. It consists of a generator and a discriminator that improve themselves through a minimax two-player game. GAN has improved dramatically since it was proposed. As shown in Figure 1.2, the original GAN can only generate low-quality face images. Recently, Karras et al. [74] propose to progressively train GAN from coarse to fine in order to improve the image resolution up to 1024 × 1024. While the generated images can be visually appealing, it is more important to enforce identity preservation in the generated images so that they can better contribute to face recognition.
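For reference, the adversarial training of [44] can be summarized by the standard two-player minimax objective, in which the generator G and the discriminator D compete:

\[
\min_{G}\max_{D}\; \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big]
\]

Here D learns to distinguish real samples x from generated samples G(z), while G learns to fool D; in the frontalization setting described next, high-quality frontal faces play the role of the real samples.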
Figure 1.2: Examples of face images generated by GAN. Top row shows the images generated by [44] where the face resolution is less than 100×100. Bottom row shows the images generated by [74] where the face resolution is as large as 1024×1024.

As the second work of this dissertation, we propose the Face Frontalization GAN (FF-GAN) to tackle PIFR from the perspective of face image synthesis. We aim to generate an identity-preserved frontal face from an input face image with arbitrary view, even up to an extreme profile. The generated face images can help boost face recognition performance. There are two challenges in large-pose face frontalization. First, frontal-profile face pairs are limited for deep neural network training. Second, preserving identity in the generated faces is difficult. To solve these issues, we utilize the 3D Morphable Model (3DMM) [12] to provide shape and texture priors for the CNN framework and accelerate training with limited data. Multiple loss functions, including a symmetry loss and a recognition loss, are proposed to ensure identity preservation. During the testing stage, we fuse the features of the generated image and the original image for face recognition.

FF-GAN can be considered an image-level data augmentation method for PIFR. From simple image transformations such as scaling, rotation, and noise, to generative models such as GAN and VAE [76], image-level data augmentation techniques are widely accepted and very popular. Meanwhile, feature-level data augmentation has attracted relatively less attention. Recently, Cao et al. [15] propose an equivariant mapping method that learns a residual to map a profile face to a frontal one in the feature space. Such feature-level data augmentation is effective for PIFR. Tan et al. [130] introduce a new technique named Feature Super-Resolution (FSR), which aims to improve the discriminatory power of low-resolution images by performing super-resolution in the feature space. These feature-level transformations are easier for identity preservation compared to image-level face generation.

Figure 1.3: Dataset distributions for (a) CASIA-Webface and (b) MS-Celeb-1M.

As the last part of this dissertation, we propose a feature transfer learning framework to perform feature-level data augmentation for long-tail classes during the training of a face recognition system. State-of-the-art face recognition algorithms have benefited from an increasing volume of training data [150, 49, 16], mostly in the data breadth (number of different identities) rather than depth (number of images per identity). In fact, most face recognition datasets are long-tailed. Figure 1.3 shows the dataset distribution of two of the most popular datasets, CASIA-Webface [150] and MS-Celeb-1M [49], where most subjects have a limited number of images. Training with a long-tailed dataset will result in a biased classifier and a degraded representation, as observed in [48, 160]. Our proposed framework aims to alleviate this problem by enriching the data distribution of the long-tail classes via center-based feature transfer. An alternating training strategy is introduced to achieve this goal. Both quantitative and qualitative results have validated that the proposed method can indeed augment the feature space of long-tail classes and thus learn a better representation. Our framework can be generalized to low-shot object recognition.
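As a rough illustration of the center-based feature transfer idea (a simplified sketch only; the actual transfer function, feature filtering, and training procedure are defined in Chapter 5), a long-tail class can borrow the intra-class variation that a regular-class sample exhibits around its own class center:

```python
import numpy as np

def transfer_feature(regular_feat, regular_center, tail_center):
    """Simplified center-based transfer: re-apply the offset of a regular-class
    sample from its own class center around the center of a long-tail class."""
    intra_class_offset = regular_feat - regular_center   # pose/expression/etc. variation
    return tail_center + intra_class_offset              # synthetic feature for the tail class

# Toy usage with random 256-D features (illustrative only).
rng = np.random.default_rng(0)
regular_center = rng.normal(size=256)
tail_center = rng.normal(size=256)
regular_feat = regular_center + 0.1 * rng.normal(size=256)
augmented_tail_feat = transfer_feature(regular_feat, regular_center, tail_center)
```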
1.2 Dissertation Contributions

This dissertation studies unconstrained face recognition using deep neural networks. Specifically, we explore three problems: pose-invariant face recognition, large-pose face frontalization, and training face recognition frameworks with long-tail data. This dissertation makes the following contributions in order to solve the above problems.

• A multi-task Convolutional Neural Network (m-CNN) is formulated to leverage side tasks for improved representation learning. A dynamic weighting technique is proposed that bypasses the tedious loss weight search in prior multi-task learning frameworks. We provide insights on how side tasks can help to disentangle the variations for robust face recognition.

• A pose-directed multi-task CNN is introduced on top of m-CNN by applying divide-and-conquer to the CNN framework. A stochastic routing scheme is proposed to calculate a reliable distance measurement that is robust to potential pose estimation errors during the testing stage. This approach is applicable to other modality-aware recognition tasks.

• A GAN-based end-to-end framework is proposed to achieve face frontalization even for extreme profile faces. We are the first to combine 3DMM with the deep learning framework for face frontalization. Effective loss functions, including a symmetry loss and a smoothness regularization, are introduced to guide the generation of high-quality images.

• The proposed FF-GAN can be used for various tasks including 3D reconstruction, landmark localization, face frontalization, and face recognition. State-of-the-art face frontalization and recognition results are achieved on multiple datasets.

• The problem of training a face recognition system with long-tailed data is identified and addressed via feature transfer learning. A novel framework is proposed to allow feature transfer and recognition in different feature spaces. The proposed framework allows the visualization of facial features, providing insights on what has been learnt in the features.

• An alternating training strategy is proposed to achieve an unbiased classifier and retain the discriminative power of the feature representation. This training scheme is generic and can be adapted to other feature transfer learning algorithms. Various visualization results have shown that our approach can indeed increase the intra-class variation for long-tail classes.

1.3 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 gives more background and reviews related work on face recognition. Chapter 3 develops the proposed multi-task CNN framework for PIE-invariant face recognition. Chapter 4 explores large-pose face frontalization using the prior of 3D Morphable Models. Chapter 5 presents a feature transfer learning approach for face recognition with long-tail data. Chapter 6 concludes this dissertation.

Chapter 2

Background and Related Work

Biometrics refers to the technologies that recognize humans by their physical or behavioral traits. Physical traits, including iris [32], fingerprint [94], face [135], etc., are believed to be more reliable than behavioral traits such as voice [103], typing behavior [109], gait [147], and so on. Among these physical traits, the face has the advantage of being non-invasive compared to iris and fingerprint. In the deep learning era, a large amount of training data is of essential importance to the development of a recognition algorithm.
Collecting face images online is a much easier task than collecting iris or fingerprint data from human subjects. As a result, face recognition has advanced greatly with the development of deep learning techniques. This section introduces the basic knowledge that is necessary to understand deep face recognition and reviews related work.

2.1 Basics

2.1.1 Recognition Categories

Depending on the application scenario, face recognition can be operated in two modes: face verification (or authentication) and face identification (or recognition) [65].

Face Verification. Face verification is a one-to-one matching problem. Given two face images, the problem is to answer whether they belong to the same identity or not. It involves comparing the feature similarity with a pre-defined threshold to make the decision. Typical application scenarios include unlocking personal devices or self-serviced check-in with an ID.

Face Identification. Face identification is a one-to-many matching problem. It includes a pre-registered gallery set with known identities and a query or probe image with an unknown identity. If the probe image belongs to one of the gallery identities, it is a close-set recognition problem: the probe image is compared to each of the gallery images, and the most similar face image is retrieved to identify the query image. In open-set face identification, a pre-defined threshold is additionally needed to accept the query as one of the gallery identities or reject it as not present in the gallery set. Typical application scenarios include security door unlocking or surveillance watch-list searching.

In either face verification or identification, the common problem is to compare the similarity between two face representations. The discriminative power of the representation is the essential component of successful face recognition.

2.1.2 Face Recognition Pipeline

As discussed above, face recognition can be simplified as the problem of comparing the similarity between a pair of face images. The recognition pipeline is shown in Figure 2.1. It consists of four steps: face detection, face alignment, feature extraction, and face matching.

Face Detection. Face detection is a necessary first step in face recognition systems [58]. It is the process of detecting the region of a face (blue box in Figure 2.1). State-of-the-art face detection algorithms [107, 67, 117] can usually achieve satisfactory pre-processing results on current face benchmarks. However, detecting faces with very low resolution is still challenging.

Face Alignment. Face alignment is the process of detecting facial landmarks located around the eyebrows, eyes, nose, mouth, and facial contour. The landmarks are used to align a face image to a canonical view. Even though large-pose face alignment [68, 69, 70] is challenging, it is not a big concern for
Training such a model is the major component of a face recognition system, which is reviewed in Section 2.3. Feature Matching Face matching is to compute the similarity, or distance between two face representations. Typical distance measurements are either Cosine distance between the original features or Euclidean distance between the L2-normalized features. 2.1.3 Evaluation Metrics Face recognition performance is reported on three standard tasks including face verification, close- set and open-set face identification [65]. We will give a brief introduction on the evaluation metrics that are used in this dissertation. 10 InputFace DetectionFace AlignmentCropped InputFeature ExtractionFeature Matching Face Verification As discussed in Section 2.1.2, the face matching process calculates a distance between a pair of face images. This distance is compared to a pre-defined threshold. For distance that is smaller than this threshold, the face pair is considered as the same subject. Otherwise, it is considered as different subjects. Correctly verified pairs are named true positive (same-person pair) or true negative (different person pair). There are two types of errors: false positive (or face acceptance) refers to the mistake of recognizing different subjects as the same one; false negative (or false rejection) refers to the mistake of recognizing the same subject as different ones. Based on these two types of errors, face verification performance is evaluated using the following metrics: • Accuracy: the percentage of truly recognized pairs (both true positive and true negative). • Equal Error Rate (EER): the error when false positive rate (FPR) and false negative rate (FNR) are the same, which is found by varying the threshold. • Receiver Operating Characteristic (ROC): the curve of true positive rate (TPR) against false positive rate (FPR) that are calculated by varying the threshold. • Area Under Curve (AUC): the percentage of area under the ROC curve. Close-Set Face Identification Close-set face identification involves comparing the query face with each of the gallery face image. The resulting distances are sorted and ranked. The top n sub- jects with the closest distances are retrieved. A true match can be defined as when the true identity is observed in the top n ranks. True Positive Identification Rate (TPIR) refers to the probability of observing the true identity in the top n ranks. Evaluation metrics for close-set face identification based on TPIR are defined as follows: • Rank-n Accuracy: TPIR at the rank of n. Typical values for n are 1, 5, 10. • Cumulative Match Characteristic (CMC): the curve of TPIR against ranks. 11 Open-Set Face Identification Open-set face identification is more complicated than the close- set identification. It adds an additional step to compare the smallest distance with a pre-defined threshold to determine whether the query exists in the gallery or not. This dissertation does not consider this scenario as most face benchmarks are prototyped for face verification and close-set face identification. The evaluation of open-set face identification can be referred to [65]. 2.2 Challenges Robust face recognition is a challenging problem. This challenge can be viewed from two perspec- tives: data variance and method design. 2.2.1 Data Variance The fact that the intra-subject difference caused by pose, illumination, expression (collectively represented as “PIE”), age, and etc. 
2.2 Challenges

Robust face recognition is a challenging problem. This challenge can be viewed from two perspectives: data variance and method design.

2.2.1 Data Variance

The fact that the intra-subject differences caused by pose, illumination, expression (collectively represented as "PIE"), age, etc. can often surpass the inter-subject differences challenges state-of-the-art face recognition systems. Additional difficulties come from various occlusions, low image quality, and so on. Generic face recognition algorithms [115, 139, 33] aim to design novel loss functions that handle the data variance implicitly. Other algorithms [132, 169, 153, 37, 29, 10] propose to explicitly handle one or multiple of the above challenges. Among these variations, pose has been studied the most for the following reasons. First, the self-occlusion caused by pose variation is difficult for conventional face recognition algorithms that only work well on frontal faces, which affects the progress of face recognition techniques. Second, human head pose is relatively easy to model with the help of 3D face models [12]. The estimated pose labels make supervised learning possible in the design of pose-invariant face recognition algorithms.

Figure 2.2: Pitch, yaw, roll angles for pose variation representation [4].

Pose Model. As shown in Figure 2.2, the human head has three degrees of freedom in rotation, which result in the pitch, yaw, and roll angles of pose variation. Among these angles, the in-plane rotation caused by the roll angle can easily be corrected in the face pre-processing step. The pitch angle is more widely observed in surveillance scenarios but not in the existing benchmark datasets such as LFW [62], IJB-A [77], or MS-Celeb-1M [49]. The yaw angle is the most studied because it is the most commonly observed. This dissertation mainly focuses on handling the yaw angle in face recognition.
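As an aside, the roll correction mentioned above typically amounts to a 2D rotation that levels the line connecting the two eye centers; a minimal OpenCV sketch is given below (the eye landmarks are assumed to come from any face alignment method):

```python
import cv2
import numpy as np

def correct_roll(image, left_eye, right_eye):
    """Remove in-plane (roll) rotation by making the eye line horizontal.
    left_eye / right_eye: (x, y) landmark coordinates in pixels."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    roll_deg = np.degrees(np.arctan2(dy, dx))        # current roll angle of the face
    center = ((left_eye[0] + right_eye[0]) / 2.0,    # rotate around the eye mid-point
              (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, roll_deg, 1.0)
    h, w = image.shape[:2]
    return cv2.warpAffine(image, rot, (w, h))
```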
A good feature distribution is believed to have small intra-class variance and large inter-class variance. This has been a default rule in designing recognition methods. While generic object recognition algorithms can be applied to train a face recognition system, it is more desirable if any face prior information can be incorporated in the feature learning process. This is the challenging part when designing a face recognition method. 14 2.3 Methodologies There are two major components in face recognition algorithms: feature extraction and loss design. Early non deep learning-based methods usually combine hand crafted features [2, 34, 91] with unsupervised or supervised subspace learning methods such as PCA [142], LDA [9] and Joint Bayesian [18]. With the development of deep neural networks, the above two-step procedure can be combined into an end-to-end learning framework. The main idea is to learn a mapping function from the input image space into a target space where a simple distance metric calculated on the features can approximate the semantic distance in the input space. This section reviews related work in deep face recognition. First, we review methods for gen- eral face recognition where no other label except identity is used during training. The goal is to design a framework with advanced loss functions that can handle different challenges implicitly. This is in contrast to other face recognition algorithms where pose, expression, age information is used explicitly during training for robust face recognition w.r.t. specific variations. Among these variabilities, we focus on the review of pose-invariant face recognition algorithms. 2.3.1 General Deep Face Recognition Based on how the loss function is designed, general deep face recognition algorithms can be classi- fied into two categories [28]: probabilistic-based models and energy-based models. Given an input face image, probabilistic models aim to assign a normalized probability to each subject. Energy- based models (or metric learning) assign an un-normalized energy to a pair of face images. In the form of deep neural networks, probabilistic models learn a linear classifier on the features with cross-entropy loss. Energy-based models associate an energy to a pair of input faces by compare the feature similarities, and thus bypass the need for a linear classifier. 15 2.3.1.1 Probabilistic-based Models Typical methods in this category consists of the softmax cross entropy loss and its variants. In the DNN framework, a fully connect layer is added after the features to output the logits, which is then converted to a probability distribution over all subjects with a softmax operation. Cross Entropy (CE) loss aims to maximize the probability of an input face belonging to the target label. For clarity, the softmax operation and the CE loss are together represented as softmax loss in the remaining of this dissertation. Softmax loss is widely used in different kinds of classification tasks due to its effectiveness and simplicity. However, it is also well known that the features learnt by softmax loss is not discriminative enough. Therefore, many methods aim to improve softmax loss from several different perspectives. Feature and Weight Normalization Prior work have observed several problems with face recog- nition algorithms trained with softmax loss. First, it is proved in [138] that softmax loss can be minimized by increasing the feature norm. This can be easily achieved especially for good samples. In fact, Ranjan et al. 
Feature and Weight Normalization. Prior work has observed several problems with face recognition algorithms trained with the softmax loss. First, it is proved in [138] that the softmax loss can be minimized by increasing the feature norm, which can be easily achieved especially for good samples. In fact, Ranjan et al. [106] have observed that features of good quality have higher norms than those of bad quality. Second, the softmax loss uses the inner product for classification during training, while face matching is done by calculating the cosine distance or normalized Euclidean distance; such a gap will affect the performance, as observed in [138]. Based on these observations, recent works explore the normalization of features, weights, or both when using the softmax loss. Ranjan et al. [106] propose an L2-constrained softmax loss where the features are constrained to a norm of α. The bound of α is also discussed w.r.t. the number of classes and the probability score. Such a loss has proved to work better than the baseline softmax loss for face verification. Wang et al. [138] propose NormFace to reformulate the softmax loss where both the features and the weights are normalized. Recently, the ring loss is proposed by Zheng et al. [165] to regularize the features to a target norm. This formulation converts the traditional normalization operation into a convex optimization problem. All the above studies have shown the effectiveness of performing normalization in the softmax loss.

Auxiliary Losses. The weight matrix in the final classification layer learns a template for each class. The softmax loss encourages each sample to be close (in the form of the inner product) to its template while away from other classes' templates, which can guarantee inter-class separation. However, it does not explicitly model the intra-class variation. Therefore, some works propose additional losses with the objective of increasing intra-class compactness. The well-known center loss [141] proposes to learn a center for each class and penalize the distance between the sample features and their corresponding class centers. With the joint supervision of the softmax loss and the center loss, the discriminative power of the features is significantly improved. Motivated by the center loss, Qi and Su propose the contrastive-center loss [104] that considers both inter-class separation and intra-class compactness. Similarly, He et al. [57] propose the triplet-center loss, which uses centers to formulate triplets for 3D object retrieval.

Margins in Softmax Loss. Besides introducing normalization or adding auxiliary losses, another category of methods adds margins to the softmax loss. Liu et al. [88] propose the large-margin softmax loss (L-Softmax) with an inter-class angular margin constraint that controls the learning difficulty. Later on, SphereFace [87] is proposed, which shares a similar idea with the L-Softmax loss but with the weight vectors normalized. It is now widely accepted that both the features and the weight vectors need to be normalized, and a scale layer is added after the feature normalization to make the optimization easier. Under such a framework, CosFace [139] and ArcFace [33] are proposed with margins added to the cosine value or the angle, respectively. However, one problem with these two loss functions is the tedious effort required for parameter tuning of the scale and margin.
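To illustrate, once both the features and the weight vectors are L2-normalized so that the logit of class j reduces to cos θ_j, these margin-based variants take roughly the following form for a sample with label y, where s is the scale and m is the margin:

\[
L_{\text{margin}} = -\log\frac{\exp\big(s(\cos\theta_{y} - m)\big)}{\exp\big(s(\cos\theta_{y} - m)\big) + \sum_{j\neq y}\exp\big(s\cos\theta_{j}\big)}
\]

CosFace [139] uses the additive cosine margin cos θ_y − m shown above, whereas ArcFace [33] instead uses the additive angular margin cos(θ_y + m).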
2.3.1.2 Energy-based Models

Different from probabilistic-based models, which estimate a normalized probability distribution over all classes, energy-based models assign an unnormalized energy to an input pattern. The key differences are threefold. First, maximizing the probability of one sample belonging to the target class will automatically minimize the probabilities of it belonging to other non-target classes. On the contrary, energy-based models need to sample same-person pairs and different-person pairs to form contrastive loss functions that minimize intra-class distances and maximize inter-class distances. Second, one weight vector is required for each class in probabilistic-based models, which is limited by memory when the number of classes is large. Energy-based models compare the energy between different input patterns and thus do not need to learn a weight vector for each class. Third, training on one sample in a probabilistic-based model amounts to making C (the number of classes) comparisons between the sample and all weight vectors, which is more efficient than energy-based models, which make only a few comparisons at a time. Therefore, there are three major components in designing an energy-based model: the contrastive term in the loss function, the sample selection, and the comparison scheme.

The contrastive loss [28] is the pioneering work of energy-based models for face recognition. It proposes to use a scalar energy function that is 0 for same-person pairs and 1 for different-person pairs. Ten years later, Schroff et al. propose FaceNet [115], which generalizes the contrastive loss to the triplet loss: it minimizes the distance between an anchor and a positive sample that belongs to the same identity and maximizes the distance between the anchor and a negative sample of a different identity. The triplet loss achieves a large performance gain on the LFW database. However, it is hard to train in practice, where the bottleneck is how to select meaningful triplet samples.

Other metric learning methods aim to generalize the triplet loss by improving the sample selection or the comparison scheme. For example, Song et al. [99] propose lifted structured feature embedding that considers all same- and different-person pairs of comparisons in each mini-batch. Sohn [122] proposes the N-pair loss that constructs a mini-batch with the comparisons of one positive pair and N − 1 negative pairs. Zhang et al. [160] propose the range loss that considers the hardest intra-class and inter-class distances in each mini-batch. Although the range loss considers both inter-class and intra-class variations, it still has to be combined with the softmax loss to achieve desirable performance. Different from the above approaches that generalize the triplet loss by exploring different sample selection methods, Ustinova and Lempitsky propose the histogram loss [136], which focuses on the comparison scheme. The histogram loss aims to separate the two distance distributions of positive pairs and negative pairs. It achieves favorable results and has no tunable parameters.
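As a concrete example of an energy-based objective, below is a minimal PyTorch sketch of the triplet loss described above (the margin value and the absence of any triplet mining are simplifying assumptions, not the exact setting of [115]):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on L2-normalized embeddings: push the anchor-positive distance
    below the anchor-negative distance by at least `margin`.
    anchor/positive/negative: (batch, dim) feature tensors."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    negative = F.normalize(negative, dim=1)
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # squared distance to same identity
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # squared distance to different identity
    return F.relu(d_pos - d_neg + margin).mean()
```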
The methods reviewed in this section are general and applicable to both generic objects and faces. Since they are not the focus of this dissertation, we use the softmax loss if not specified otherwise.

2.3.2 Pose-Invariant Face Recognition

Recognizing face images observed from extreme views is a problem of fundamental interest in both human and machine facial processing and recognition. Indeed, while humans are innately skilled at face recognition, newborns do not perform better than chance on recognition from profile views [134], although this ability seems to develop rapidly within a few months of birth [39]. Similarly, PIFR remains an enduring challenge in computer vision. As discussed in [36], current PIFR algorithms can be broadly grouped into three categories, while some works combine the three techniques.

• Multi-View Subspace Learning. Methods in this category aim to project the face representations from different viewpoints to the same subspace where face matching is meaningful.

• Pose-Invariant Feature Extraction. Methods in this category usually rely on a large labeled dataset for pose-invariant feature extraction and conventional classifiers for face matching.

• Face Synthesis. Methods in this category propose to synthesize face images with a target pose from a source pose so that faces under different views can be compared at the same viewpoint.

We will review some prior work in these three categories.

Figure 2.3: The general framework for multi-view subspace learning [36].

2.3.2.1 Multi-View Subspace Learning

Pose-variant face images are distributed on a high-dimensional non-linear manifold. As shown in Figure 2.3, multi-view subspace learning-based methods aim to learn the projections from different views to the same subspace where the comparison of face images with different poses makes sense. Li et al. [81] employ Canonical Correlation Analysis (CCA) for the subspace learning, where two projection matrices are learnt, one for each pose. The goal of CCA is to maximize the correlation of the projected samples of the same subject with two different poses. Rupnik and Shawe-Taylor [111] present Multi-View CCA (MCCA), which extends CCA to handle multiple pose variations in the training set instead of the two poses in the original CCA. CCA and MCCA are both linear models, which is limiting because the texture change caused by pose is non-linear. A natural extension of CCA is to include the kernel technique, as proposed in [3]. The multi-view subspace learning-based methods all require multiple pose-variant images for each subject in the training set, which is hard to achieve for in-the-wild datasets.

2.3.2.2 Pose-Invariant Feature Extraction

The rotation of the human head results in the loss of semantic correspondence between pose-variant face images. One solution in pose-invariant feature extraction is to rebuild the semantic correspondence. Specifically, previous works rely on facial landmark detection in order to extract features around each keypoint. For example, Biswas et al. [11] propose to extract SIFT [92] features at each landmark and concatenate the features of all landmarks as the pose-robust feature representation. Chen et al. [19] propose to extract a high-dimensional LBP representation from the patches around the facial landmarks for PIFR. Accurate landmark alignment is crucial for the above methods. However, face alignment in unconstrained images is still a challenging problem. Another solution for pose-invariant feature extraction is to build the semantic correspondence between two images based on extracted keypoints instead of fixed facial landmarks. For example, Liao et al. [84] develop a partial face recognition approach based on Multi-Keypoint Descriptors (MKD) that does not require face alignment. The probe face image is represented sparsely by a set of gallery descriptors.

Besides engineered features, learning-based features have become more popular recently. Neural networks are employed and shown to be better at handling large pose variations due to their high non-linearity and the huge amount of training data. For example, Zhu et al. [168] build a deep neural network for Face Identity-Preserving (FIP) feature extraction.
[161] propose an auto-encoder for pose-robust feature extraction where the pose-variant inputs are learnt to map to a random face. The rationality behind this approach is that if all pose-variant inputs can map to the same random face, the learnt representation should be pose-invariant. Deep models are better at extracting pose-invariant features. It also requires a large amount of training data and lots of efforts for network configuration. Recently there is an increasing interest in disentangled representation learning for PIFR. The pioneer work of DR-GAN [132] introduces an encoder-decoder structured GAN framework to learn disentangled representation via rotating face images. Peng et al. [102] propose a framework for feature disentanglement via side-task estimations and reconstruction. Zhao et al. [164] propose to a Pose Invariant Model (PIM) that combines a dual-path face frontalization branch and a dis- criminative learning branch. All the above methods have achieved good performance for PIFR as well as appealing visual results for image synthesis. 2.3.2.3 Face Synthesis Face synthesis approaches aim to generate face images at a different view. This is motivated by the fact that comparing faces with the same view is an easier task than comparing face images observed at different views. The majority of the methods in this category is proposed for face frontalization, which is defined as the process of generating a frontal-view face image from an input image with arbitrary pose. This is a challenging problem because recovering the missing information caused by self-occlusion is ambiguous and ill-posed. Early attempt for face frontalization is based on piece-wise affine warping for pose normaliza- 22 tion [41, 7]. Recent work on face frontalization rely on 3DMM for face rotation. For example, Zhu et al. [167] propose a High-fidelity Pose and Expression Normalization (HPEN) method to generate a natural face with frontal view and neutral expression. The pose and expression varia- tions are eliminated in the estimated 3DMM that is achieved via 2D landmark fitting. An improved face recognition rate is observed for Multi-PIE and LFW. Similarly, Hassner et al. [40] explores unconstrained face frontalization by using a single 3D surface model for all query images. It is shown to be an efficient and effective alternative for the personalized 3DMM fitting. Neural network-based methods are employed for face frontalization. The work of [151] pro- poses a deep neural network for face rotation. Given an input image, the proposed framework can generate a face image of the subject with the target pose specified by the remote code, which is concatenated to the image boundary. More recently, Disentangled-Representation learning Gen- erative Adversarial Network (DR-GAN) is proposed for PIFR. It can generate face images with a target pose specified by the pose code. By employing the GAN [44] framework, DR-GAN is very effective in both representation learning and face rotation. 23 Chapter 3 Multi-Task Learning 3.1 Introduction Multi-task learning (MTL) aims to learn several tasks simultaneously to boost the performance of the main task or all tasks. It has been successfully applied to face detection [20, 156], face align- ment [162], pedestrian detection [131], attribute estimation [1], and so on. Despite the success of MTL in various vision problems, there is a lack of comprehensive study of MTL for face recogni- tion. 
In this work, we study face recognition as a multi-task problem where identity classification is the main task with PIE estimations being the side tasks. We answer the questions of how and why PIE estimations can help face recognition. We incorporate MTL into the CNN framework for face recognition. It is widely assumed in MTL that different tasks share the same features. Traditional linear models can be applied where each task is parameterized by a weight vector. The weight vectors of all tasks form a weight matrix W, which is regularized by l2,1 norm [5] or trace norm [66] to encourage W to be a low-rank matrix. In our work, the shared features are learnt through several convolution and pooling layers. A fully connected layer is added to the shared features for the classification of each task. We observe that the side tasks serve as regularizations to learn more discriminative and disentangled identity features for PIE-invariant face recognition. As shown in Figure 3.1, when identity (x axis) is mixed with pose variation (y axis), single-task 24 Figure 3.1: We propose MTL to disentangle the PIE variations from learnt identity features. (a) For single-task learning, the main variance is captured in x-y, resulting in an inseparable region between these two subjects. (b) For multi-task learning, identity is separable in x-axis by excluding y-axis that models pose variation. learning (s-CNN) may learn a joint decision boundary along x-y, resulting in an inseparable region between the two different subjects. In contrary, with multi-task learning, the shared features are learnt to model identity and pose separately. The identity features can exclude pose variation by selecting only the key dimensions that are essential for face recognition, which leads to PIFR. One problem in MTL is how to weight the importance of different tasks. Prior work either treat different tasks equally [151] or obtain the weights via greedy search [131]. However, it is time consuming or practically impossible to find the optimal weights for all side tasks via brute-force search in a CNN framework. To solve this problem, we propose dynamic weights where we only determine the overall importance for all side tasks, and the CNN learns to dynamically assign a loss weight to each side task during training, which is efficient and effective. Since pose variation is the most challenging one among other non-identity variations, and the proposed multi-task CNN (m-CNN) already classifies all training images into different pose groups, we propose to apply divide-and-conquer to CNN learning. Specifically, we develop a novel pose-directed multi-task CNN (p-CNN) where the pose estimation can categorize the training data into three different pose groups (left, frontal, and right), direct them through different routes in the 25 pose identity pose identity separable (a) s-CNN (b) m-CNN } network to learn pose-specific identity features in addition to the generic identity features. Simi- larly, the loss weights for extracting these two types of features are determined dynamically. In the testing stage, a stochastic routing scheme is formulated for feature matching, which is effective in handling pose variation in face recognition. Multi-PIE [45] is an ideal dataset to study face recognition under PIE variations. It has been used to study face recognition robust to pose [82, 71], illumination [53, 52], and expression [29, 167]. 
Most prior work study the combined variations of pose and illumination [37, 157, 169] with the increase of pose variations from half-profile [157] to a full range [145]. This work utilizes all data in Multi-PIE, i.e., faces with the full range of PIE variations, as the main experimental dataset. To the best of our knowledge, there is no prior face recognition work that studies the full range of variations in Multi-PIE. Further, we apply our method to in-the-wild datasets where we only consider pose as the side task and the estimated labels serve as the ground truth labels for training. In summary, we make the following contributions. • We formulate face recognition as an MTL problem and explore how it works via an energy- based weight analysis. • We propose a dynamic-weighting scheme to learn the loss weights for each side task auto- matically in the CNN. • We develop a pose-directed multi-task CNN to learn pose-specific identity features and a stochastic routing scheme for feature fusion during the testing stage. • We perform a comprehensive and the first face recognition study on the entire Multi-PIE. We achieve comparable or superior performance to state-of-the-art methods on Multi-PIE, LFW [62], CFP [116], and IJB-A [77]. 26 Figure 3.2: The proposed m-CNN and p-CNN for face recognition. Each block reduces the spatial dimensions and increases the channels of the feature maps. The formats for the convolution and pooling parameters are: filter size / stride / filter number and method / filter size / stride. The feature dimensions after each block operation are shown on the bottom. The dashed line represents the batch split operation as shown in Figure 3.3. The layers with the stripe pattern are the identity features used in the testing stage for face recognition. 3.2 Proposed Method In this section, we demonstrate our methods on Multi-PIE dataset and extend it to unconstrained datasets in the experiments. First, we propose a multi-task CNN (m-CNN) with dynamic weights for face recognition (main task) and PIE estimations (side tasks). Second, we propose a pose- directed multi-task CNN (p-CNN) to tackle pose variation by separating all poses into different groups and jointly learning pose-specific identity features for each group. 3.2.1 Multi-Task CNN We adapt CASIS-Net [150] as our backbone network with three modifications. First, batch nor- malization (BN) [64] is applied to accelerate training. Second, the contrastive loss is removed to simplify the loss function. Third, the dimension of the fully connected layer is changed according to different tasks. Details of the layer parameters are shown in Figure 3.2. The network consists of five blocks each including two convolution layers and a pooling layer. BN and ReLU [97] are used after each convolution layer, which are omitted for clarity. Similar to [150], no ReLU is used after conv52, and a dropout layer with a ratio of 0.4 is applied after pool5. 
27 conv11:3x3/1/32 conv12: 3x3/1/64 pool1: max/2x2/2 conv21:3x3/1/64 conv22: 3x3/1/128 pool2: max/2x2/2 conv31:3x3/1/96 conv32: 3x3/1/192 pool3: max/2x2/2 conv41:3x3/1/128 conv42: 3x3/1/256 pool4: max/2x2/2 conv51:3x3/1/160 conv52: 3x3/1/320 pool5: avg/7x7/1 50x50x64 25x25x128 13x13x192 7x7x256 1x1x320 100x100x1 s-CNN: m-CNN: p-CNN: µsµmfc6_id fc6_pos fc6_exp fc6_illum fc: 200/13/6/19/3/2 block 1 block 2 block 3 block 4 block 5 fc6_left fc6_frontal fc6_right fc: 200/200/200 pose-directed branch batch split Given a set of N training images and their labels: D = {Ii,yi}N i=1, where each label yi is a vector consisting of the identity label yd i and the side task labels. In this work, we consider three side tasks including pose (yp i ). We eliminate the sample i ), illumination (yl i), and expression (ye index i for clarity. As shown in Figure 3.2, the proposed m-CNN extracts a high-level feature representation x ∈ RD×1: x = f (I;k,b,γ,β ), (3.1) where f (·) represents the non-linear mapping from the input image to the shared features. k and b are the sets of filters and bias of all convolution layers. γ and β are the sets of scales and shifts in all BN layers. Let Θ = {k,b,γ,β} denote all parameters to be learnt to extract the features. The extracted features x, which is pool5 in our model, are shared among all tasks. Suppose Wd ∈ RD×Dd and bd ∈ RDd×1 are the weight matrix and bias vector in the fully connected layer for identity classification, where Dd is the number of different identities in D. The generalized linear model is applied: yd = Wd(cid:124) x + bd. (3.2) yd is then fed to a softmax layer to compute the probability of x belonging to each subject in the training set. so f tmax(yd)n = p( ˆyd = n|x) = exp(yd n) ∑ j exp(yd j ) , (3.3) where yd j is the jth element in yd. The so f tmax(·) function converts the model output yd to a probability distribution over all subjects and the subscript selects the nth element. Finally, the estimated identity ˆyd is obtained via: ˆyd = argmax n so f tmax(yd)n. (3.4) 28 Then the cross-entropy loss can be employed: L(I,yd) = −log(p( ˆyd = yd|I,Θ,Wd,bd)). (3.5) Similarly, we formulate the losses of the side tasks. Let W = {Wd,Wp,Wl,We} represent the weight matrices for identity and PIE classifications. The bias terms are eliminated for simplicity. Given the training set D, our m-CNN aims to minimize the combined loss of all tasks: argmin Θ,W αd N ∑ i=1 L(Ii,yd i ) + αp N ∑ i=1 L(Ii,yp i ) + αl N ∑ i=1 L(Ii,yl i) + αe N ∑ i=1 L(Ii,ye i ), (3.6) where αd, αp, αl, αe control the importance of each task. It becomes a single-task model (s-CNN) when αp = αl = αe = 0. The loss drives the model to learn both the parameters Θ for extracting the shared features and W for the classification tasks. In the testing stage, the features before the softmax layer (yd) are used for face recognition by applying a face matching procedure based on cosine similarity. 3.2.2 Dynamic-Weighting Scheme In CNN-based MTL, it is an open question on how to set the loss weight for each task. Prior work either treat all tasks equally [151] or obtain the weights via brute-force search [131]. However, it is very time-consuming to search for all combinations. To solve this problem, we propose a dynamic-weighting scheme to automatically assign loss weights to each side task during training. First, the weight for the main task is set to 1, i.e. αd = 1. Second, instead of finding the loss weight for each task, we find the summed loss weight for all side tasks, i.e. 
ϕs = αp + αl + αe, via brute-force search in a validation set. Our m-CNN learns to allocate ϕs to the side tasks. As shown 29 in Figure 3.2, we add a fully connected layer and a softmax layer to the shared features x to learn the dynamic weights. Let ωs ∈ RD×3 and εs ∈ R3×1 denote the weight matrix and bias vector in the new added fully connected layer, µs = so f tmax(ωs (cid:124)x + εs), (3.7) where µs = [µp, µl, µe] (cid:124) are the dynamic loss weights for the side tasks with µp + µl + µe = 1. So (3.6) becomes: argmin Θ,W,ωs N ∑ i=1 L(Ii,yd i ) + ϕs (cid:104) µp N ∑ i=1 L(Ii,yp i ) + µl N ∑ i=1 L(Ii,yl i) + µe N ∑ i=1 (cid:105) L(Ii,ye i ) (3.8) s.t. µp + µl + µe = 1. We use mini-batch stochastic gradient descent to solve the above optimization problem where the dynamic weights are averaged over a batch of samples. Intuitively, we expect our m-CNN to behave in two different aspects in order to minimize the loss. First, since our main task contribute mostly to the final loss (usually ϕs < 1), the side task with the largest contribution to the main task should have the highest weight in order to reduce the loss of the main task. Second, our m-CNN should assign a higher weight for an easier task with a lower loss so as to reduce the overall loss. We have observed these effects as shown in Figure 3.4 (a). 3.2.3 Pose-Directed Multi-Task CNN Given the diverse variations in the data, it is very challenging to learn a non-linear mapping to estimate the correct identity from a face image with arbitrary PIE. This challenge has been en- countered in classic pattern recognition work. For example, in order to handle pose variation, [83] 30 Figure 3.3: Illustration of the batch split operation in p-CNN. proposes to construct several face detectors where each of them is in charge of one specific view. Such a divide-and-conquer scheme can be applied to CNN learning because the side tasks can “di- vide” the data and allow the CNN to better “conquer” them by learning tailored mapping functions. Therefore, we propose a novel task-directed multi-task CNN where the side task categorizes the training data into multiple groups and directs them to different routes in the network. Since pose is considered as the primary challenge in face recognition [145, 157, 169], we propose pose-directed multi-task CNN (p-CNN) to handle pose variation. However, it is applicable to any other variation. As shown in Figure 3.2, p-CNN is built on top of m-CNN by adding the pose-directed branch (PDB). The PDB groups face images with similar poses to learn pose-specific identity features via a batch split operation. We separate poses into three groups: left profile (Gl), frontal (G f ), and right profile (Gr). As shown in Figure 3.3, the goal of batch split is to separate a batch of N0 samples (X = {xi}N0 i=1) into three batches Xl, X f , and Xr, which are of the same size as X. During training, the ground truth pose is used to assign a face image into the correct group. Let us take the xi, 0, 31 frontal group as an example: X f i = if yp i ∈ G f otherwise, (3.9) 14732585........369410002000........300004030505........060400700080........0090 where 0 denotes a vector of all zeros with the same dimension as xi. The assignment of 0 is to guarantee valid input to the next layer when no sample is passed into one group. Therefore, X is separated into three batches where each batch consists of only the samples belonging to the corresponding pose group. 
Each group learns a pose-specific mapping to a joint space, resulting in three different sets of weights: {Wl,W f ,Wr}. Finally, the features from all groups are merged as the input to a softmax layer to perform robust identity classification jointly. Our p-CNN aims to learn two types of identity features: Wd are the weights to extract generic identity features that are robust to all poses; Wl, f ,r are the weights to extract pose-specific identity features that are robust within a small pose range. Both tasks are considered as our main tasks. Similar to the dynamic-weighting scheme in m-CNN, we use dynamic weights to combine our main tasks as well. The summed loss weight for these two tasks is ϕm = αd + αg. Let ωm ∈ RD×2 and εm ∈ R2×1 denote the weight matrix and bias vector for learning the dynamic weights, µm = so f tmax(ωm (cid:124)x + εm). (3.10) We have µm = [µd, µg] (cid:124) as the dynamic weights for generic identity classification and pose-specific identity classification. Finally, the loss of p-CNN is formulated as: (cid:104) argmin Θ,W,ω (cid:104) ϕm µd N ∑ i=1 L(Ii,yd i ) + µg G ∑ g=1 L(Ii,yd i ) ϕs µp N ∑ i=1 L(Ii,yp i ) + µl N ∑ i=1 L(Ii,yl i) + µe L(Ii,ye i ) (cid:105) + (cid:105) Ng∑ N i=1 ∑ i=1 (3.11) s.t. µd + µg = 1, µp + µl + µe = 1, where G is the number of pose groups, and Ng is the number of training images in the g-th group. ω = {ωm,ωs} is the set of parameters to learn the dynamic weights for both the main and side 32 tasks. We set ϕm = 1. Stochastic Routing During testing, we can use the estimated pose to direct an image to extract the pose-specific features. However, the pose estimation error may cause inferior feature extraction results, especially for unconstrained faces. To solve this problem, we propose a stochastic routing scheme. Specifically, given a test image I, we extract the generic features (yd) and the pose-specific features ({yg}G the input image I belonging to each pose group as ({pg}G g=1) by directing it to all paths in the PDB. We can also obtain the probabilities of g=1) using our pose estimation side task. The distance between a pair of face images (I1 and I2) is computed as: s = 1 2 dist(yd 1,yd 2) + 1 2 G ∑ i=1 G ∑ j=1 dist(yi 1,y j 2)· pi 1 · p j 2, (3.12) where dist(·) is the cosine distance between two feature vectors. The proposed stochastic routing accounts for all combinations of pose-specific features weighted by the probabilities. This is more robust to pose estimation errors. We treat the generic features and pose-specific features equally and fuse them for face recognition. 3.3 Experimental Results We evaluate the proposed m-CNN and p-CNN under two settings: (1) face identification on Multi- PIE with PIE estimations being the side tasks; (2) face verification/identification on in-the-wild datasets including LFW, CFP, and IJB-A, where pose estimation is the only side task. Further, we analyze the effect of MTL on Multi-PIE and discover that the side tasks regularize the network to learn a disentangled identity representation for PIE-invariant face recognition. 33 Table 3.1: Comparison of the experimental settings that are commonly used in prior work on Multi-PIE. (* The 20 images consist of 2 duplicates of non-flash images and 18 flash images. In total there are 19 different illuminations.) setting session pose I II III IV V ours 4 1 1 4 4 4 7 7 15 9 13 15 illum exp 1 1 1 1 1 6 1 20 20 20 20 20* train sub. 
/ images 200 / 5,383 100 / 14,000 150 / 45,000 200 / 138,420 200 / 199,940 200 / 498,900 gallery / probe 137 / 2,600 149 / 20,711 99 / 29,601 137 / 70,243 137 / 101,523 137 / 255,163 total 8,120 34,860 74,700 208,800 301,600 754,200 references [6, 82] [168, 151] [145] [132, 169] [154] 3.3.1 Face Identification on Multi-PIE Experimental Settings Multi-PIE dataset consists of 754,200 images of 337 subjects recorded in 4 sessions. Each subject was recorded with 15 different cameras where 13 at the head height spaced at 15◦ interval and 2 above the head to simulate a surveillance camera view, labeled as ±45◦ in our work. For each camera, a subject was imaged under 19 different illuminations. In each session, a subject was captured with 2 or 3 expressions, resulting in 6 different expressions across all sessions. Unlike previous work that uses a subset of Multi-PIE for experiments, we use the entire dataset in our work (as shown in Table 3.1). The first 200 subjects are used for training. The remaining 137 subjects are used for testing, where one image with frontal pose, neutral illumination and neutral expression for each subject is selected as the gallery set and the remaining images are selected as the probe set. We use the landmark annotations provided in [38] to align each face to a canonical view of size 100×100 with gray-scale. Similar to [141], we normalize the image by subtracting 127.5 and dividing by 128. We set momentum to 0.9 and weight decay to 0.0005. All models are trained for 20 epochs from scratch with a batch size of 4. The learning rate starts at 0.01 and reduces at 10th, 15th, and 19th epochs with a factor of 0.1. The output before the softmax layer is used as features for face matching based on cosine similarity. The rank-1 identification rate is reported for 34 face recognition. For the side tasks, the mean accuracy over all classes is reported. We randomly select 20 subjects from the training set to form a validation set to find the optimal overall loss weight for all side tasks. We obtain ϕs = 0.1 via brute-force search. For p-CNN model training, we split the training set into three groups based on the yaw angle of the image: right profile (−90◦,−75◦,−60◦, −45◦), frontal (−30◦, −15◦,0◦,15◦, 30◦), and left profile (45◦, 60◦,75◦,90◦). Effects of MTL Table 3.2 shows the performance comparison of single-task learning (s-CNN), multi-task learning (m-CNN), and pose-directed multi-task learning (p-CNN) on the entire Multi- PIE. First, we train four single-task models for identity (id), pose (pos), illumination (illum), and expression (exp) classification respectively. As shown in Table 3.2, the rank-1 identification rate of s-CNN is only 75.67%. The performance of the frontal pose group is much higher than those of the profile pose groups, indicating that pose variation is indeed a big challenge for face recognition. Among all side tasks, pose estimation is the easiest task, followed by illumination, and expression as the most difficult one. This is caused by two potential reasons: 1) discriminating expression is more challenging due to the non-rigid face deformation; 2) the data distribution over different expressions is unbalanced with insufficient training data for some expressions. Second, we train multiple m-CNN models by adding only one side task at a time in order to evaluate the influence of each side task. We use “id+pos”, “id+illum”, and “id+exp” to represent these variants and compare them to the performance of adding all side tasks denoted as “id+all”. 
To evaluate the effects of the dynamic-weighting scheme, we train a model with fixed loss weights for the side tasks as: αp = αl = αe = ϕs/3 = 0.033. The summation of the loss weights for all side tasks are equal to ϕs for all m-CNN variants in Table 3.2 for a fair comparison. Comparing the rank-1 identification rates of s-CNN and m-CNNs, it is obvious that adding the side tasks is always helpful for the main task. The improvement of face recognition is mostly on the 35 Table 3.2: Performance comparison (%) of single-task learning (s-CNN), multi-task learning (m-CNN) with its variants, and pose-directed multi-task learning (p-CNN) on the entire Multi-PIE dataset. model s-CNN: id s-CNN: pos s-CNN: exp s-CNN: illum s-CNN: id+L2 m-CNN: id+pos m-CNN: id+illum m-CNN: id+exp m-CNN: id+all m-CNN: id+all (dynamic) p-CNN loss weights αd = 1 αp = 1 αl = 1 αe = 1 αd = 1 αd = 1,αp = 0.1 αd = 1,αl = 0.1 αd = 1,αe = 0.1 αd = 1,αp,l,e = 0.033 αd = 1,ϕs = 0.1 ϕm = 1,ϕs = 0.1 rank-1 (all / left / frontal /right) pose illum exp 75.67 / 71.51 / 82.21 / 73.29 – – – 99.87 96.43 – – – – – 76.43 / 73.31 / 81.98 / 73.99 78.06 / 75.06 / 82.91 / 76.21 77.30 / 74.87 / 82.83 / 74.21 77.76 / 75.48 / 82.32 / 75.48 77.59 / 74.75 / 82.99 / 75.04 79.35 / 76.60 / 84.65 / 76.82 79.55 / 76.14 / 84.87 / 77.65 – – – – – – – – – 93.57 92.44 – – – 99.78 – 90.93 99.75 88.46 79.97 99.81 93.40 91.47 99.80 90.58 90.02 profile faces with MTL. The m-CNN “id+all” with dynamic weights shows superior performance to others not only in rank-1 identification rate, but also in the side task estimations. Further, the lower rank-1 identification rate of “id+all” w.r.t “id+pos” indicates that more side tasks do not necessarily lead to better performance without properly setting the loss weights. In contrast, the proposed dynamic-weighting scheme effectively improves the performance to 79.35% from the fixed weighting of 77.59%. As will be shown in Section 3.3.2, the side tasks in m-CNN help to inject PIE variations into the shared representation, similar to a regularization term. For example, an L2 regularization will encourage small weights. We add L2 regularization on the shared representation to s-CNN (“id+L2”), which improves over s-CNN without regularization. However, it is still much worse than the proposed m-CNN. Third, we train p-CNN by adding the PDB to m-CNN “id+all” with dynamic weights. The loss weights are ϕm = 1 for the main tasks and ϕs = 0.1 for the side tasks. The proposed dynamic- weighting scheme allocates the loss weights to both two main tasks and three side tasks. P-CNN further improves the rank-1 identification rate to 79.55%. Dynamic-Weighting Scheme Figure 3.4 shows the dynamic weights and losses during training 36 (a) m-CNN: dynamic weights (b) m-CNN: losses (c) p-CNN: dynamic weights (d) p-CNN: losses Figure 3.4: Dynamic weights and losses for m-CNN and p-CNN during the training process. for m-CNN and p-CNN. For m-CNN, the expression classification task has the largest weight in the first epoch because it has the highest chance to be correct with random guess with the least number of classes. As training goes on, pose classification takes over because it is the easiest task (highest accuracy in s-CNN) and also the most helpful for face recognition (compare “id+pos” to “id+exp” and “id+illum”). αp starts to decrease at the 11th epoch when pose classification is saturated. The increased αl and αe lead to a reduction in the losses of expression and illumination classifications. 
As we expected, the dynamic-weighting scheme assigns a higher loss weight for the easiest and/or the most helpful side task. For p-CNN, the loss weights and losses for the side tasks behave similarly to those of m-CNN. For the two main tasks, the dynamic-weighting scheme assigns a higher loss weight to the easier 37 epoch01020dynamic weights00.020.040.060.080.1,p,l,eepoch01020loss0123iddposillumexpepoch01020dynamic weights00.20.40.60.81,d,g,p,l,eepoch01020loss0123iddidgposillumexp task at the moment. At the beginning, learning the pose-specific identity features is an easier task than learning the generic identity features. Therefore, the loss weight αg is higher than αd. As training goes on, αd increases as it has a lower loss. Their losses reduce in a similar way, i.e., the error reduction in one task will also contribute to the other. Compare to Other Methods As shown in Table 3.1, no prior work uses the entire Multi-PIE for face recognition. To compare with state of the art, we choose to use setting III and V to evaluate our method since these are the most challenging settings with more pose variation. The network structures and parameter settings are kept the same as those of the full set except that the outputs of the last fully connected layers are changed according to the number of classes for each task. Only pose and illumination are used as the side tasks. The performance on setting III is shown in Table 3.3. Our s-CNN already outperforms c-CNN forest [145], which is an ensemble of three c-CNN models. This is attributed to the deep structure of CASIA-Net [150]. Moreover, m-CNN and p-CNN further outperform s-CNN with significant margins, especially for non-frontal faces. We want to stress the improvement margin between our method 91.27% and the prior work of 76.89% — a relative error reduction of 62%. The performance on setting V is shown in Table 3.4. For fair comparison with FF-GAN [154], where the models are finetuned from pre-trained in-the-wild models, we also finetune s-CNN, m-CNN, p-CNN models from the pre-trained models on CASIA-Webface for 10 epochs. Our performance is much better than previous work with a relative error reduction of 60%, especially on large-pose faces. The performance gap between Table 3.3 / 3.4 and 3.2 indicates the challenge of face recognition under various expressions, which is less studied than pose and illumination variations on Multi-PIE. 38 Table 3.3: Multi-PIE performance comparison on setting III of Table 3.1. Fisher Vector [118] FIP_20 [168] FIP_40 [168] c-CNN [145] c-CNN Forest [145] s-CNN (ours) m-CNN (ours) p-CNN (ours) ±15◦ ±30◦ ±45◦ ±60◦ ±75◦ ±90◦ 24.53 93.30 95.88 34.13 31.37 96.30 41.71 95.64 96.97 47.26 76.72 98.41 76.72 99.02 99.19 76.96 87.21 89.23 92.98 92.66 94.05 96.89 97.40 98.01 68.71 61.64 69.75 70.49 74.38 88.71 89.75 92.07 45.51 47.32 49.10 55.64 60.66 82.80 84.97 87.83 80.33 78.89 85.54 85.09 89.02 85.18 89.15 90.34 avg. 66.60 67.87 70.90 73.54 76.89 88.45 90.08 91.27 Table 3.4: Multi-PIE performance comparison on setting V of Table 3.1. 0◦ ±15◦ ±30◦ ±45◦ ±60◦ ±75◦ ±90◦ avg.[−60◦,60◦] avg.[−90◦,90◦] FIP [168] 94.3 90.7 80.7 64.1 45.9 Zhu et al. [169] 95.7 92.8 83.7 72.9 60.1 Yim et al. 
[151] 99.5 95.0 88.5 79.9 61.9 DR-GAN [132] 97.0 94.0 90.1 86.2 83.2 FF-GAN [154] 95.7 94.6 92.5 89.7 85.2 77.2 61.2 s-CNN (ours) 95.9 95.1 92.8 91.6 88.9 84.9 78.6 m-CNN (ours) 95.4 94.5 92.6 91.8 88.4 85.3 82.2 95.4 95.2 94.3 93.0 90.3 87.5 83.9 p-CNN (ours) 72.9 79.3 83.3 89.2 91.6 92.5 92.2 93.5 85.2 89.2 89.6 91.1 – – – – – – – – – – – – 3.3.2 How does m-CNN work? It is well known in both the computer vision and the machine learning communities that learning multiple tasks together allows each task to leverage each other and improves the generalization ability of the model. For CNN-based MTL, previous work [163] has found that CNN learns shared features for facial landmark localization and attribute classifications, e.g. smiling. This is under- standable because the smiling attribute is related to landmark localization as it involves the change of the mouth region. However, in our case, it is not obvious how the PIE estimations can share features with the main task. On the contrary, it is more desirable if the learnt identity features are disentangled from the PIE variations. Indeed, as we will show later, the PIE estimations regularize the CNN to learn PIE-invariant identity features. 39 (a) (b) (c) Figure 3.5: Analysis on the effects of MTL: (a) the sorted energy vectors for all tasks; (b) vi- sualization of the weight matrix Wall where the red box in the top-left is a zoom-in view of the bottom-right content; (c) the face recognition performance with varying feature dimensions. We investigate why PIE estimations are helpful for face recognition. The analysis is done on m-CNN model (“id+all” with dynamic weights) in Table 3.2. Recall that m-CNN learns a shared embedding x ∈ R320×1. Four fully connected layers with weight matrices Wd 320×13, 320×6 are connected to x to perform classification of each task (200 subjects, 13 poses, 19 illuminations, and 6 expressions). We analyze the importance of each dimension in x to each task. Taking the main task as an example, we calculate an energy vector sd ∈ R320×1 whose element 320×200, Wp Wl 320×19, We is computed as: sd i = 200 ∑ j=1 | Wd i j | . (3.13) A higher value of sd i indicates that the ith feature in x is more important to the identity classi- fication task. The energy vectors sp, sl, se for all side tasks are computed similarly. Each energy vector is sorted and shown in Figure 3.5 (a). For each curve, we observe that the energy distributes unevenly among all feature dimensions in x. Note that the indexes of the feature dimension do not correspond among them since each energy vector is sorted independently. To compare how each feature in x contributes to different tasks, we concatenate the weight matrix of all tasks as Wall 320×238 = [Wd,Wp,Wl,We] and compute its energy vector as sall. We sort the rows in Wall based on the descending order in energy and visualize the sorted Wall in Figure 3.5 40 n0100200300energy051015sdspslse50100150200501001502002503002040601020305010015020050100150200250300-0.4-0.200.20.40.6n100200300accuracy0.720.740.760.780.8xnydn (a) mean (b) std. Figure 3.6: The mean and standard deviation of each energy vector during the training process. (a) ϕs = 0.2 (b) ϕs = 0.3 Figure 3.7: Energy vectors of m-CNN models with different overall loss weights. (b). The first 200 columns represent the sorted Wd where most energy is distributed in the first ∼ 280 feature dimensions (rows), which are more crucial for face recognition and less important for PIE classifications. 
We observe that x are learnt to allocate a separate set of dimensions/features for each task, as shown in the block-wise effect in the zoom-in view. Each block shows the most essential features with high energy for PIE classifications respectively. Based on the above observation, we conclude that the PIE classification side tasks help to inject PIE variations into the shared features x. The weight matrix in the fully connected layer learns to select identity features and ignore the PIE features for PIE-invariant face recognition. To validate this observation quantitatively, we compare two types of features for face recognition: 41 epoch05101520energy mean0510152025idposillumexpepoch05101520energy std012345idposillumexpn0100200300energy05101520sdspslsen0100200300energy05101520sdspslse 1) xn : a subset of x with n largest energies in sd, which are more crucial in modeling identity variation; 2) yd 200×1 = Wd n×200 (cid:124) xn×1 + bd, which is the multiplication of the corresponding subset of Wd and xn. We vary n from 100 to 320 and compute the rank-1 face identification rate on the entire Multi-PIE testing set. The performance is shown in Figure 3.5 (c). When xn is used, the performance improves with increasing dimensions and drops when additional dimensions are included, which are learnt to model the PIE variations. In contrary, the identity features yd can eliminate the dimensions that are not helpful for identity classification through the weight matrix Wd, resulting in continuously improved performance w.r.t. n. We further analyze how the energy vectors evolve over time during training. Specifically, at each epoch, we compute the energy vectors for each task. Then we compute the mean and standard deviation of each energy vector, as shown in Figure 3.6. Despite some local fluctuations, the overall trend is that the mean is decreasing and standard deviation is increasing as training goes on. This is because in the early stage of training, the energy vectors are more evenly distributed among all feature dimensions, which leads to the higher mean values and lower standard deviations. In the later stage of training, the energy vectors are shaped in a way to focus on some key dimensions for each task, which leads to the lower mean values and higher standard deviations. The CNN learns to allocate a separate set of dimensions in the shared features to each task. The total number of dimensions assigned to each task depends on the loss weights. Recall that we obtain the overall loss weight for the side tasks as ϕs = 0.1 via brute-force search. Figure 3.7 shows the energy distributions with ϕs = 0.2 and ϕs = 0.3, which are compared to Figure 3.5 (a) where ϕs = 0.1. We have two observations. First, a larger loss weight for the side tasks leads to more dimensions being assigned to the side tasks. Second, the energies in sd increase in order to compensate the fact that the dimensions assigned to the main task decrease. Therefore, we conclude that the loss weights control the energy distribution between different tasks. 42 Method DeepID2 [125] DeepFace [129] CASIANet [150] Wang et al. [137] Littwin and Wolf [86] MultiBatch [128] VGG-DeepFace [101] Wen et al. [141] FaceNet [115] s-CNN (ours) m-CNN (ours) p-CNN (ours) Table 3.5: Performance comparison on LFW dataset. #Net Training Set Metric Acc. ± Std. 
(%) 1 1 1 1 1 1 1 1 1 1 1 1 202,599 images of 10,177 subjects, private Joint-Bayes 95.43 95.92± 0.29 4.4M images of 4,030 subjects, private cosine 96.13± 0.30 494,414 images of 10,575 subjects, public cosine 404,992 images of 10,553 subjects, public Joint-Bayes 96.2± 0.9 404,992 images of 10,553 subjects, public Joint-Bayes 98.14± 0.19 2.6M images of 12K subjects, private 2.6M images of 2,622 subjects, public 0.7M images of 17,189 subjects, public 260M images of 8M subjects, private 494,414 images of 10,575 subjects, public cosine 494,414 images of 10,575 subjects, public cosine 494,414 images of 10,575 subjects, public cosine 98.20 98.95 99.28 99.63± 0.09 97.87± 0.70 98.07± 0.57 98.27± 0.64 Euclidean Euclidean cosine L2 3.3.3 Unconstrained Face Recognition Experimental Settings We use CASIA-Webface [150] as our training set and evaluate on LFW, CFP, and IJB-A datasets. CASIA-Webface consists of 494,414 images of 10,575 subjects. LFW consists of 10 folders each with 300 same-person pairs and 300 different-person pairs. Given the saturated performance of LFW mainly due to its mostly frontal view faces, CFP and IJB-A are introduced for large-pose face recognition. CFP is composed of 500 subjects with 10 frontal and 4 profile images for each subject. Similar to LFW, CFP includes 10 folders, each with 350 same-person pairs and 350 different-person pairs, for both frontal-frontal (FF) and frontal-profile (FP) verification protocols. IJB-A dataset includes 5,396 images and 20,412 video frames of 500 subjects. It defines template-to-template matching for both face verification and identification. In order to apply the proposed m-CNN and p-CNN, we need to have the labels for the side tasks. However, it is not easy to manually label our training set. Instead, we only consider pose estimation as the side task and use the estimated pose as the label for training. We use PIFA [69] to estimate 34 landmarks and the yaw angle, which defines three groups: right profile [−90◦,−30◦), frontal [−30◦,30◦], and left profile (30◦,90◦]. Figure 3.8 shows the distribution of the yaw angle 43 EER AUC Frontal-Frontal Table 3.6: Performance comparison on CFP dataset. Results reported are the average ± standard deviation over the 10 folds. Method ↓ Metric (%) → Accuracy 96.40± 0.69 3.48± 0.67 99.43± 0.31 Sengupta et al. [116] Sankarana. et al. [114] 96.93± 0.61 2.51± 0.81 99.68± 0.16 98.67± 0.36 1.40± 0.37 99.90± 0.09 Chen, et al. [22] 97.84± 0.79 2.22± 0.09 99.72± 0.02 DR-GAN [132] 98.67 Peng, et al. [102] 96.24± 0.67 5.34± 1.79 98.19± 1.13 Human 97.34± 0.99 2.49± 0.09 99.69± 0.02 s-CNN (ours) 97.77± 0.39 2.31± 0.06 99.69± 0.02 m-CNN (ours) 97.79± 0.40 2.48± 0.07 99.71± 0.02 p-CNN (ours) Accuracy 84.91± 1.82 14.97± 1.98 93.00± 1.55 89.17± 2.35 8.85± 0.99 97.00± 0.53 91.97± 1.70 8.00± 1.68 97.70± 0.82 93.41± 1.17 6.45± 0.16 97.96± 0.06 93.76 94.57± 1.10 5.02± 1.07 98.92± 0.46 90.96± 1.31 8.79± 0.17 96.90± 0.08 91.39± 1.28 8.80± 0.17 97.04± 0.08 94.39± 1.17 5.94± 0.11 98.36± 0.05 Frontal-Profile AUC – – EER – – Table 3.7: Performance comparison on IJB-A. Method ↓ Metric (%) → OpenBR [77] GOTS [77] Wang et al. 
[137] PAM [95] DR-GAN [132] DCNN [21] s-CNN (ours) m-CNN (ours) p-CNN (ours) Verification Identification @FAR=0.01 @FAR=0.001 @Rank-1 @Rank-5 37.5± 0.8 23.6± 0.9 40.6± 1.4 59.5± 2.0 72.9± 3.5 93.1± 1.4 88.7± 0.9 73.3± 1.8 77.4± 2.7 94.7± 1.1 78.7± 4.3 93.7± 1.0 75.6± 3.5 93.0± 0.9 93.4± 0.7 75.6± 2.8 77.5± 2.5 93.8± 0.9 10.4± 1.4 19.8± 0.8 51.0± 6.1 55.2± 3.2 53.9± 4.3 52.0± 7.0 51.6± 4.5 53.9± 4.2 24.6± 1.1 44.3± 2.1 82.2± 2.3 77.1± 1.6 85.5± 1.5 85.2± 1.8 84.3± 1.3 84.7± 1.0 85.8± 1.4 – estimation and the average image of each pose group. CASIA-Webface is biased towards frontal faces with 88% faces belonging to the frontal pose group based on our pose estimation. The network structures are similar to those experiments on Multi-PIE. All models are trained from scratch for 15 epochs with a batch size of 8. The initial learning rate is set to 0.01 and reduced at the 10th and 14th epoch with a factor of 0.1. The other parameter settings and training process are the same as those on Multi-PIE. We use the same pre-processing as in [150] to align a face image. Each image is horizontally flipped for data augmentation in the training set. We also generate the mirror image of an input face in the testing stage. We use the average cosine distance 44 Figure 3.8: Yaw angle distribution on CASIA-Webface dataset. of all four comparisons between the image pair and its mirror images for face recognition. Performance on LFW Table 3.5 compares our face verification performance with state-of-the-art methods on LFW dataset. We follow the unrestricted with labeled outside data protocol. Although it is well-known that an ensemble of multiple networks can improve the performance [124, 126], we only compare CNN-based methods with one network for fair comparison. Our implementation of the CASIA-Net (s-CNN) with BN achieves much better results compared to the original per- formance [150]. Even with such a high baseline, m-CNN and p-CNN can still improve, achieving comparable results with state of the art, or better results if comparing to those methods trained with the same amount of data. Since LFW is biased towards frontal faces, we expect the improve- ment of our proposed m-CNN and p-CNN to the baseline s-CNN to be larger if they are tested on cross-pose face verification. Performance on CFP Table 3.6 shows our face verification performance comparison with state- of-the-art methods on CFP dataset. For FF setting, m-CNN and p-CNN improve the verification 45 rate of s-CNN slightly. This is expected, as there is little pose variation. For FP setting, p-CNN substantially outperforms s-CNN and prior work, reaching close-to-human performance (94.57%). Note our accuracy of 94.39% is 9% relative error reduction of the previous state of the art [102] with 93.76%. Therefore, the proposed divide-and-conquer scheme is effective for in-the-wild face verification with large pose variation. And the proposed stochastic routing scheme improves the robustness of the algorithm. Even with the estimated pose serving as the ground truth pose label for MTL, the models can still disentangle the pose variation from the learnt identity features for pose-invariant face verification. Performance on IJB-A We conduct close-set face identification and face verification on IJB-A dataset. First, we retrain our models after removing 26 overlapped subjects between CASIA- Webface and IJB-A. Second, we fine-tune the retrained models on the IJB-A training set of each fold for 50 epochs. 
Similar to [137], we separate all images into “well-aligned" and “poorly- aligned" faces based on the face alignment results and the provided annotations. In the testing stage, we only select images from the “well-aligned" faces for recognition. If all images in a tem- plate are “poorly-aligned" faces, we select the best aligned face among them. Table 3.7 shows the performance comparison on IJB-A. Similarly, we only compare to the methods with a sin- gle model. The proposed p-CNN achieves comparable performance in both face verification and identification. 3.4 Summary This work explores multi-task learning for face recognition with PIE estimations as the side tasks. We propose a dynamic-weighting scheme to automatically assign the loss weights for each side task during training. MTL helps to learn more discriminative identity features by disentangling 46 the PIE variations. We further propose a pose-directed multi-task CNN with stochastic routing scheme to direct different paths for face images with different poses. We make the first effort to study face identification on the entire Multi-PIE dataset with full PIE variations. Extensive experi- ments on Multi-PIE show that our m-CNN and p-CNN can dramatically improve face recognition performance, especially on large poses. The proposed method is applicable to in-the-wild datasets with the estimated poses serving as the labels for training. We have achieved state-of-the-art per- formance on LFW, CFP, and IJB-A, showing the value of MTL for pose-invariant face recognition in the wild. 47 Chapter 4 Lage-Pose Face Frontalization 4.1 Introduction This work presents Face Frontalization-Generative Adversarial Network (FF-GAN) to generate a frontal face from a face image with arbitrary pose while maintaining high quality and preserving identity. Face frontalization is the second category of methods for PIFR. It is motivated by the fact that comparing frontal faces is a much easier task than comparing faces under extreme profile views with self-occlusion. By filling the missing information, face frontalization has the potential to boost face recognition performance. Besides aiding recognition, frontalization of a face image is also a problem of independent interest, with potential applications such as face editing, accessorizing, and creation of models in virtual and augmented reality. Synthesizing a frontal face from a single image with large pose variation is a challenging prob- lem. A straight-forward method is to build a 3D model of the face and rotate the model to frontal view. Early work on face frontalization in computer vision rely on frameworks inspired by com- puter graphics. The well-known 3D Morphable Model (3DMM) [12] explicitly models facial shape and appearance to match an input image as close as possible. Subsequently, the recovered shape and appearance can be used to generate a face image under novel viewpoints. Many 3D face re- construction methods [110, 166] build upon this direction by improving speed or accuracy. Deep learning has made inroads into data-driven estimation of 3DMM too [169, 72], circumventing 48 Figure 4.1: The proposed FF-GAN framework. Given a non-frontal face image as input, the generator produces a high-quality frontal face. Learned 3DMM coefficients provide global pose and low frequency information, while the input image injects high frequency local information. A discriminator distinguishes generated faces against real ones, where high-quality frontal faces are considered as real ones. 
A face recognition engine is used to preserve identity information. The output is a high quality frontal face that retains identity. some drawbacks of early methods such as over-reliance on the accuracy of 2D landmark localiza- tion. Due to restricted Gaussian assumptions and nature of losses used, insufficient representation ability for facial appearance prevents such deep models from producing outputs of high quality. While inpainting methods such as [167] attempt to minimize the impact on quality due to self- occlusions, they still do not retain identity information. In contrast, the proposed FF-GAN incorporates elements from both deep 3DMM and face recognition CNNs to achieve high-quality and identity-preserving frontalization, using a single input image that can be a profile view up to 90◦. As shown in Figure 4.1, FF-GAN consists of four modules, the reconstructor, the generator, the discriminator, and the recognizer. The reconstruction module provides a useful prior to regularize the frontalization. However, it is well-known that deep 49 3DMM Coefficients Pose-Variant Input Recogni8on Engine Frontalized Output Generator FF-GAN D Discriminator Extreme Pose Input Frontalized Output 3DMM reconstruction is limited in the ability to retain high-frequency information. Therefore, the generation module combines both the 3DMM coefficients with the input image to generate a frontal face that maintains both global pose accuracy and retains local information present in the input image. In particular, the generator in FF-GAN produces a frontal image based on a reconstruction loss, a smoothness loss, and a novel symmetry-enforcing loss. The goal of the generator is to fool the discriminator into being unable to distinguish the generated frontal image from a real one. However, neither the 3DMM that loses high-frequency information, nor the GAN that only aligns domain-level distributions, suffice to preserve identity information in the generated image. To retain identity information, a recognition module is used to align the feature representation of the generated image with the input. A balanced training with all the above objectives results in high-quality frontalized faces that preserve identity. To summarize, our key contributions are: • A novel GAN-based end-to-end deep framework to achieve face frontalization even for ex- treme viewpoints. • A deep 3DMM reconstruction module provides shape and appearance regularization beyond the training data. • Effective symmetry-based loss and smoothness regularization that lead to the generation of high-quality images. • Use of a deep face recognition CNN to enforce that the generated faces satisfy identity- preservation, besides realism and frontalization. • Consistent improvements on several datasets across multiple tasks, such as face recognition, landmark localization, and 3D reconstruction. 50 Figure 4.2: The proposed framework of FF-GAN. R is the reconstruction module for 3DMM coef- ficients estimation. G is the generation module to synthesize a frontal face. D is the discrimination module to make real or generated decision. C is the recognition module for identity classification. 4.2 Proposed Method Figure 4.2 shows the framework of FF-GAN. The mainstay of FF-GAN is a generative adversarial network that consists of a generator G and a discriminator D. G takes a non-frontal face as input to generate a frontal output, while D attempts to classify it as a real frontal image or a generated one. 
Additionally, we include a face recognition engine C that regularizes the generator output to preserve identity features. A key component is a deep 3DMM module R that provides shape and appearance priors to the GAN. The reconstruction module R plays a crucial role in alleviating the difficulty of large pose face frontalization. Let D = {xi,xg i=1 be the training set with N samples, with each sample consisting of an i ,pg i ,yi}N input image xi with arbitrary pose, a corresponding ground truth frontal face xg 3DMM coefficients pg i and the identity label yi. We henceforth omit the sample index i for clarity. i , the ground truth 4.2.1 Reconstruction Module Frontalization from extreme pose is a challenging problem. While a purely data-driven approach might be possible given sufficient data and an appropriate training regimen, however it is non- 51 R 3DMM coefficients G + D C real/ fake id 3DMM Coefficients Real / Generated Identity Figure 4.3: 3D faces generated with identity, expression, and texture variations. trivial. Therefore, we propose to impose a prior on the generation process, in the form of a 3D Morphable Model (3DMM) [12]. This reduces the training complexity and leads to better empirical performance with limited data. Recall that 3DMM represents faces in the PCA space: S = ¯S + Aidαid + Aexpαexp, T = ¯T + Atexαtex, (4.1) where S represent the 3D shape coordinates computed as the linear combination of the mean shape ¯S, the shape basis Aid, and the expression basis Aexp, while T is the texture (RGB color values) that is the linear combination of the mean texture ¯T and the texture basis Atex. The coefficients {αid,αexp,αtex} defines a unique 3D face. Figure 4.3 shows some examples generated by varying the shape coefficients (αid and αexp) or the texture coefficients αtex. The identity coefficients change the structure of the face via the expression coefficients mainly change the mouth region of the face. The texture coefficients change the appearance of the face. 52 S,Tαid,αexp→αtex→ S = z1 z2 x1 x2 y1 y2  u1 u2 U = ... uQ v1 v2 ... vQ ... xQ ... yQ ... zQ  .  . (4.2) (4.3) The 3D shape S ∈ R3×Q records the x, y, z coordinates of Q vertexs on the 3D face model, Let U ∈ R2×Q denote the corresponding x, y coordinates on the 2D face image, In order to build the correspondence between the 2D shape U with the 3D shape S, the camera projection parameters are needed. Previous work [68, 166] have applied 3DMM for face alignment where a weak perspective projection model is used to project the 3D shape into the 2D space. Similar to [68], we calculate a projection matrix m ∈ R2×4 based on pitch, yaw, roll, scale and 2D translations so that U = m[S;1] (1 represents a vector of 1s being concatenated to S for matrix multiplication). Let p = {m,αid,αexp,αtex} denotes the 3DMM coefficients. The target of our reconstruction module R is to estimate p = R(x) given an input image x. Since the intent is for R to also be trainable with the rest of the framework, we use a CNN model based on CASIA-Net [150] for this regression task. We apply z-score normalization to each dimension of the parameters before training. A weighted parameter distance cost similar to [166] is used: LR = (p− pg) (cid:62)W(p− pg), min p (4.4) where W is the importance matrix whose diagonal is the weight of each parameter. The weight is 53 calculated based on the 2D landmark errors caused by the error in the estimation of each parameter. W is calculated once and kept the same during training. 
4.2.2 Generation Module The pose estimation obtained from module R is quite accurate. However, the shape and texture coefficients estimations lead to the loss of high frequency details presented in the original image. This is understandable since a low-dimensional PCA representation can preserve most of the en- ergy with lower frequency components. Thus, we use a generative module that relies on the 3DMM coefficients p and the input image x to recover a frontal face that preserves both the low and high frequency components. Our generator relies on multiple objectives for the frontalization task as described below respectively. In Figure 4.2, features from the two inputs to the generator G are fused through an encoder- decoder network to synthesize a frontal face x f = G(x,p). To penalize the generated output from the ground truth frontal face xg, one straight-forward objective is the reconstruction loss that aims at reconstructing the ground truth with minimal error: LGrec = (cid:107)G(x,p)− xg(cid:107)1. (4.5) Since an L2 loss empirically leads to blurry output, we use an L1 loss instead to better preserve high frequency component. At the beginning of training, the reconstruction loss harms the overall process since the generation is far from frontalized, so the reconstruction loss operates on a poor set of correspondences. Thus, the weight for the reconstruction loss should be set in accordance to the training stage. The details of tuning the weight are discussed in Section 4.3.2. To reduce block artifacts, we use a spatial total variation loss to encourage smoothness in the 54 generated output: (cid:90) Ω LGtv = 1 |Ω| |∇G(x,p)|du, (4.6) where |∇G| is the image gradient, u ∈ R2 is the two dimensional coordinate increment, Ω is the image region, and |Ω| is the area normalization factor. Based on the observation that human faces share self-similarity across left and right halves, we explicitly impose a symmetry loss. As shown in Figure 4.4, we recover a frontalized 2D projected mask M from the frontalized 3DMM coefficients indicating the visible parts of the face. The mask M is binary, with nonzero values indicating the visible regions and zero otherwise. By horizontally flipping the face, we can generate another mask M f lip indicating the visible region of the flipped input image. We demand that the generated frontal face for the original input image and its flipped version should be similar within their respective masks: LGsym = (cid:107)M (cid:12) G(x,p)− M (cid:12) G(x f lip,p f lip)(cid:107)2 +(cid:107)M f lip (cid:12) G(x,p)− M f lip (cid:12) G(x f lip,p f lip)(cid:107)2. (4.7) Here, x f lip is the horizontally flipped image for the input image x, p f lip (only the pose parameters m is changed the remaining is the same) are the 3DMM coefficients for x f lip and (cid:12) denotes the element-wise multiplication. We emphasize on the mask because those invisible parts during rota- tion may not be confident to contribute to the penalty, whereas the role of the mask is to focus on the visible parts for both the original image and the flipped image, rather than the background. 55 x mask from p frontal mask M x f lip mask from p f lip frontal mask M f lip Figure 4.4: Image flip and mask generation process for the symmetry loss. 4.2.3 Discrimination Module Generative Adversarial Network (GAN) [44], formulated as a two-player minimax game between a generator G and a discriminator D, has been widely used for image generation [35]. 
4.2.3 Discrimination Module

The Generative Adversarial Network (GAN) [44], formulated as a two-player minimax game between a generator G and a discriminator D, has been widely used for image generation [35]. In this work, G synthesizes a frontal face image x^f, and D distinguishes the generated face from the real frontal face x^g. Note that in a conventional GAN, all images used for training are considered real samples. However, we limit the definition of a “real” sample to face images in frontal view only. Therefore, G is trained to generate face images that are not only realistic but also frontal.

The discriminator D consists of five convolution layers and one linear layer that generates a 2D vector, with each dimension representing the probability of the input being real or generated. During training, D is updated with two batches of samples in each iteration. The following objective is maximized:

max_D  LD = E_{x^g ∈ R} log(D(x^g)) + E_{x ∈ K} log(1 − D(G(x, p))),   (4.8)

where R and K are the real and generated image sets, respectively. On the other hand, G aims to fool D into classifying the generated image G(x, p) as real with the following loss:

LGgan = E_{x ∈ K} log(D(G(x, p))).   (4.9)

The competition between G and D improves both modules. In the early stages, when face images are not fully frontalized, D focuses on the pose of the face to make the real-or-generated decision, which in turn helps G to generate a frontal face. In the later stages, when face images are frontalized, D focuses on subtle details of frontal faces, which guides G to generate a realistic frontal face that is difficult to achieve with the supervision of (4.5), (4.6), and (4.7) alone.

4.2.4 Recognition Module

A key challenge in large-pose face frontalization is to preserve the original identity in the generated frontal face. This is a difficult task due to self-occlusion in profile faces. The above discriminator can only determine whether the generated image is realistic and in frontal view, but cannot tell whether the identity of the input image is retained. Although we have the L1, total variation, and masked symmetry losses for face generation, they treat each pixel equally, which results in a loss of discriminative power for the identity features. Therefore, we use a recognition module C to impart the correct identity to the generated images.

C is a general face recognition engine into which any state-of-the-art framework can easily be plugged. We use a CASIA-Net structure, which has proved to work well for face recognition. A cross-entropy loss is used for training C to classify image x with the ground truth identity y, where y is a one-hot vector whose element for the correct identity is 1:

min_C  LC = Σ_j [ −y_j log(C_j(x)) − (1 − y_j) log(1 − C_j(x)) ],   (4.10)

where j is the index of the identity classes and C_j(x) is the probability of the input x belonging to the j-th identity.

Similar to the competition between G and D, our generator G must also fool C into classifying the generated image as having the same identity as the input image. If the identity label of the input image is not available, we regularize the extracted identity features h^f of the generated image to be similar to those of the input image, denoted as h. The loss from the generated images is back-propagated to update the generator G:

LGid = −log(C(G(x, p)))  if the identity label y is available (∃y),  and  LGid = ‖h^f − h‖²₂  otherwise (∄y).   (4.11)

During training, C is updated with real input images to retain discriminative power.
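The adversarial and identity objectives in (4.8)–(4.11) can be sketched as follows, written as losses to be minimized (so the signs are flipped relative to the maximization in (4.8)). D is assumed here to output the probability of its input being a real frontal face and C to output identity logits; C.extract_features is a hypothetical hook for the identity feature h, so this is an illustration rather than the exact interface of the networks in Figure 4.5.

    import torch
    import torch.nn.functional as F

    EPS = 1e-8

    def d_loss(D, x_real_frontal, x_generated):
        """Negative of Eq. (4.8): D scores real frontal faces high, generated faces low."""
        return -(torch.log(D(x_real_frontal) + EPS).mean()
                 + torch.log(1.0 - D(x_generated) + EPS).mean())

    def g_adv_loss(D, x_generated):
        """Eq. (4.9) as a loss: G tries to make D classify its output as real."""
        return -torch.log(D(x_generated) + EPS).mean()

    def g_id_loss(C, x_generated, y=None, h_input=None):
        """Eq. (4.11): identity supervision when a label y exists, feature matching otherwise."""
        if y is not None:
            return F.cross_entropy(C(x_generated), y)      # -log C_y(G(x, p))
        h_generated = C.extract_features(x_generated)      # hypothetical feature hook
        return (h_generated - h_input).pow(2).sum(dim=1).mean()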
To summarize the framework, the reconstruction module R provides 3DMM prior knowledge to the frontalization process through (4.4), the discriminator D does so through (4.8), and the recognition engine C through (4.10). The generator G combines all these sources of information to optimize an overall objective function:

min_G  LG = λ_rec LGrec + λ_tv LGtv + λ_sym LGsym + λ_gan LGgan + λ_id LGid.   (4.12)

It is important to balance the weights of these losses; Section 4.3.2 discusses how this is done and how each component contributes to the joint optimization of G.

Figure 4.5: Detailed network structure of FF-GAN, listing the layer configurations of models R, G, D, and C (convolution, full convolution, volumetric max pooling, and linear layers, specified as filter size / number of outputs / stride).

4.3 Implementation Details

4.3.1 Network Structures

Figure 4.5 shows the detailed network structure of FF-GAN, composed of the 3DMM reconstruction module R, the generator G, the discriminator D, and the recognition engine C.

The 3DMM module R takes the input image x and generates the 3DMM coefficients p, including the weak perspective matrix m ∈ R^{8×1}, the shape coefficients α_id ∈ R^{199×1}, the expression coefficients α_exp ∈ R^{29×1}, and the texture coefficients α_tex ∈ R^{40×1}. We use the coefficients provided in [166] as our ground truth for training. The original 3DMM texture model consists of 199 bases, of which only the first 40 are used in [166]. We use the CASIA-Net structure, where the texture coefficients are separated from the shape-related coefficients in the later layers, which empirically yields better performance in our experiments.

The generator G takes the image x and the estimated 3DMM coefficients p as inputs to generate a frontal-view face x^f. The 3DMM coefficients provide a frontal low-frequency basis, and the detailed appearance is expected to be recovered from the raw pose-variant input image. Clearly, these two inputs are not in the same domain. We apply three full convolution layers to up-sample p and one convolution layer to down-sample x to the same size of 50×50×64. The outputs are concatenated and fed to an encoder-decoder structured network, which includes two skip connections that provide high-frequency information to the decoding process. The feature after encoding is of dimension 512×12×12, which maintains the spatial information needed to recover the input image when necessary, i.e., if the input image x is already of frontal view, our network should produce an identity mapping.
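The two-stream fusion of p and x described above can be approximated by the following PyTorch sketch. The 50×50×64 fusion size and the roughly 12×12×512 bottleneck follow the text, but the kernel sizes are simplified, the skip connections are omitted, and the final resize is an assumption; Figure 4.5 specifies the actual layer configuration, so treat this purely as an illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoStreamGenerator(nn.Module):
        """Fuse the 3DMM code p with the input image x, then encode-decode a frontal face."""
        def __init__(self, p_dim=236):
            super().__init__()
            # up-sample the coefficient vector to a 50 x 50 x 64 feature map
            self.p_stream = nn.Sequential(
                nn.ConvTranspose2d(p_dim, 64, kernel_size=50), nn.ReLU(inplace=True))
            # down-sample the 100 x 100 input image to the same 50 x 50 x 64 size
            self.x_stream = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True))
            # encoder keeps a spatially resolved bottleneck (roughly 12 x 12 x 512)
            self.encoder = nn.Sequential(
                nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, 3, stride=2), nn.ReLU(inplace=True))
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(512, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

        def forward(self, x, p):
            p_map = self.p_stream(p.view(p.size(0), -1, 1, 1))   # (N, 64, 50, 50)
            x_map = self.x_stream(x)                              # (N, 64, 50, 50)
            out = self.decoder(self.encoder(torch.cat([p_map, x_map], dim=1)))
            # resize to the input resolution; the real network matches sizes exactly
            return F.interpolate(out, size=x.shape[-2:], mode='bilinear', align_corners=False)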
The discriminator D aims to distinguish between the generated face x f and the real frontal face xg. This is a relatively easy task, so we use a shallow network with five convolution layers and one 60 linear layer, which outputs a 2D vector with each dimension indicating the probability of the input belonging to the generated image or the real image. In each iteration during training, D is updated with two batches of samples from x f and xg, respectively. The recognition engine C also adopts a CASIA-Net structure. Instead of using the max pooling layer as CASIA-Net, we choose volumetric max pooling, which applies pooling not only in the spatial dimensions but also across the feature channels. We find this to be helpful for face recog- nition. C is pre-trained with CASIA-Webface dataset [150] and fixed in the first two stages of the training process. Later, we update C using the original input image x. Note that x f are the input to fool C during the training of G and gradients flow through C to update the generator G. 4.3.2 Training Strategies Our framework consists of mainly four parts as shown in Figure 4.2, the deep 3DMM reconstruc- tor R, a two-way fused encoder-decoder generator G, the real/generated discriminator D and a face recognition engine C jointly trained for the identity regularization. The training of the overall net- work can be hardly initialized from scratch. The generator G expects to receive the correct 3DMM coefficients, whereas the reconstructor R needs to be pre-trained. Our identity regularization also requires correct identity information from the recognizer. Thus, the reconstructor R is pre-trained until we achieve comparable performance for face alignment compared to previous work [166] using 300W-LP [166]. The recognizer is pre-trained using CASIA-Webface and verified with promising verification accuracy on LFW. The end-to-end joint training is conducted after R and C are well pre-trained. Notice that we leave the generator G and the discriminator D training from scratch simultaneously because we believe pre-trained G and D do not contribute much to the adversarial training process. Good G with poor D will quickly pull G to be poor again and vice versa. Further these two components 61 should also match with each other. Good G may be evaluated poor by a good D as the discriminator may be trained from other sources. 4.4 Experimental Results 4.4.1 Settings and Datasets We evaluate our proposed FF-GAN on a variety of tasks including face frontalization, landmark localization, 3D face reconstruction, and face recognition. Frontalization and 3D reconstruction are evaluated qualitatively by comparing the visual quality of the generated images to the ground truth. We also report some quantitative results on sparse 2D landmark localization accuracy, which indicates our method does fairly well on pose estimation, even though we do not train for this specific task. Face recognition is evaluated quantitatively over several challenging face verification and identification datasets. We pre-process the images by applying state-of-the-art face detection and face alignment algorithms and crop to 100× 100 size across all the datasets. The face datasets used in this work are introduced below. 300W-LP consists of 122,450 images that are augmented from 300W [113] by the face profiling approach of Zhu et al. [166], which is designed to generate images with yaw angles ranging from −90◦ to 90◦. 
We use 300W-LP as our training set by forming image pairs of pose-variant and frontal-view images with the same identity. The estimated 3DMM coefficients provided with the images are treated as the ground truth to train module R. AFLW2000 is constructed for 3D face alignment evaluation by the same face profiling method applied in 300W-LP. The dataset includes the estimated 3DMM coefficients and augmented 68 landmarks for the first 2,000 images in AFLW. We use this dataset to evaluate module R for recon- struction. 62 Multi-PIE consists of 754,200 images from 337 subjects with large variations in PIE. We se- lect a subset of 301,600 images with 13 poses, 20 illuminations, neutral expression from all four sessions. The first 200 subjects are used for training and the remaining 137 subjects for testing, similar to the setting of [132]. We randomly choose one image for each subject with frontal pose and neutral illumination as gallery and all the rest as probe images. CASIA-Webface consists of 494,414 images of 10,575 subjects where the images of 26 overlap- ping subjects with IJB-A are removed. It is a widely applied large-scale dataset for face recogni- tion. We apply it to pre-train and finetune module C. LFW contains 13,233 images collected from the Internet. The verification set consists of 10 fold- ers, each with 300 same-person pairs and 300 different-person pairs. We evaluate face verification performance on frontalized images and compare with previous frontalization algorithms on LFW. IJB-A includes 5,396 images and 20,412 video frames for 500 subjects, which is a challenging dataset with large pose variation. Different from previous datasets, IJB-A defines face template matching where each template contains a variant number of images. It consists of 10 folders, each of which being a different partition of the full set. We finetune model C on the training set of each folder and evaluate on the testing set for face verification and identification. CFP is composed of 500 subjects with 10 frontal and 4 profile faces for each subject. We use this dataset to explore the frontalization quality of face images with extreme profile pose (90◦). For in-the-wild experiments, we train our model using 300W-LP. We prepare the training image pairs by setting one pose-variant face image (15◦-90◦) as the input and the frontal-view face image of the same subject (0◦-15◦) as the target. We use Adam solver for optimization with a batch size of 128. The weight decay is set to 2e−4 and momentum is set to 0.9. The initial learning rate is set to 2e−4. We reduce the learning rate by a factor of 10 for every 20 epochs. As shown in (4.12), we set up five balance factors to control the contribution of each objective 63 to the overall loss. The end-to-end training can be divided into three stages. For the first stage, λrec is set to 0 and λid is set to 0.01, since these two parts are highly related with the mapping from the generated output to the reference input. Typical values for λtv, λsym, and λgan are all 1.0s. Once the training error of G and D strikes a balance within usually 20 epochs, we change λrec and λid to be 1.0s while tuning down λtv to be 0.5, λsym to be 0.8, respectively for the second stage. It takes another 20 epochs to strike a new balance. Notice that for these two stages’ training, we fix model C. After that, we relax model C and further fine-tune all the modules jointly with a learning rate of 1e−6 for the third stage. 
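The staged schedule just described can be summarized as a small configuration sketch. The weight and learning-rate values are those stated in the text; the dictionary format, the assumption that λ_gan stays at 1.0 throughout, and the assumption that the stage-2 weights carry over unchanged into stage 3 are illustrative choices rather than documented settings.

    # Loss-weight schedule for Eq. (4.12) when training on 300W-LP (Section 4.3.2).
    # Stages 1 and 2 each run until G and D reach a balance (typically about 20 epochs).
    FF_GAN_STAGES = [
        {"lambda_rec": 0.0, "lambda_id": 0.01, "lambda_tv": 1.0,
         "lambda_sym": 1.0, "lambda_gan": 1.0, "lr": 2e-4, "update_C": False},  # stage 1
        {"lambda_rec": 1.0, "lambda_id": 1.0, "lambda_tv": 0.5,
         "lambda_sym": 0.8, "lambda_gan": 1.0, "lr": 2e-4, "update_C": False},  # stage 2
        {"lambda_rec": 1.0, "lambda_id": 1.0, "lambda_tv": 0.5,
         "lambda_sym": 0.8, "lambda_gan": 1.0, "lr": 1e-6, "update_C": True},   # stage 3
    ]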
For controlled experiments on Multi-PIE, we finetune the network from the models trained on 300W-LP. We mix 300W-LP with Multi-PIE, where 300W-LP is used only to update module R. The weight for each loss is set to 1. Since we already have a good starting point, we do not need to adjust the weights dynamically on Multi-PIE. The initial learning rate is set to 1e-4 for the first two stages, when model C is fixed, and reduced to 5e-5 when model C is relaxed. The first two stages need approximately 10 epochs of finetuning. The other hyper-parameters are the same as in the experiments on 300W-LP.

During the testing stage, module R is used to estimate the 3DMM coefficients, module G is used for face frontalization, module C is used for feature extraction, and module D is used to predict the confidence score of the generated image. These outputs are all used in our experiments.

4.4.2 3D Reconstruction

FF-GAN borrows prior shape and appearance information from 3DMM to serve as the reference for frontalization. Though we do not specifically optimize for the reconstruction task, it is interesting to see whether our reconstructor does a fair job on 3D reconstruction.

Figure 4.6: (a) Our landmark localization and face frontalization results; (b) our 3DMM estimation; (c) ground truth from [166].

Figure 4.6(a) shows five examples on AFLW2000 for landmark localization and frontalization. Our method localizes the key points correctly and generates realistic frontal faces even for extreme profile inputs. We also quantitatively evaluate the landmark localization performance using the normalized mean square error. Our model R achieves a normalized mean square error of 6.01, compared to 5.42 for 3DDFA [166] and 6.12 for SDM [146]. Note that our method achieves competitive performance compared to 3DDFA and SDM, even though those methods are tuned specifically for the localization task. This indicates that our reconstruction module performs well in providing correct geometric information.

Given the input images in (a), we compute the 3DMM coefficients with our model R and generate the 3D geometry and texture using (4.1), as shown in Figure 4.6(b). We observe that our method effectively preserves shape and identity information in the estimated 3D models, which can even outperform the ground truth provided by 3DDFA. For example, the shape and texture estimates in the last example are more similar to the input, while the ground truth clearly shows a male subject rather than a female. Given that the 3DMM coefficients cannot preserve local appearance, we obtain such high-frequency information from the input image. Thus, the choice of fusing the 3DMM coefficients with the original input is shown to be a reasonable one empirically.

4.4.3 Face Recognition

One of our motivations for face frontalization is to see whether the frontalized images bring in the correct identity information for the parts missing due to self-occlusion, and thus boost face recognition performance. To verify this, we evaluate our framework on LFW [62], Multi-PIE [45], and IJB-A [77] for verification and identification tasks. Features are extracted from module C across all the experiments. Euclidean distance is used as the metric for face matching.

Evaluation on LFW. We evaluate the face verification performance on our frontalized images of LFW, compared to previous face frontalization methods.
LFW-FF-GAN denotes the frontalized images generated by our method, LFW-3D is from [55], and LFW-HPEN is from [167]. These collected datasets are pre-processed in the same way as ours. Table 4.1 shows the face verification performance, where the average and standard deviation over the 10 folders are reported. Our method achieves strong results compared to the state-of-the-art methods, which verifies that our frontalization technique preserves the identity information.

Table 4.1: Performance comparison on LFW dataset with accuracy (ACC) and area-under-curve (AUC).

Dataset               ACC (%)        AUC (%)
Ferrari et al. [40]   –              94.29
LFW-3D [55]           93.62 ± 1.17   98.36 ± 0.06
LFW-HPEN [167]        96.25 ± 0.76   99.39 ± 0.02
LFW-FF-GAN            96.42 ± 0.89   99.45 ± 0.03

Figure 4.7 shows some visual examples. Compared to the state-of-the-art face frontalization algorithms, the proposed FF-GAN can generate realistic and identity-preserved faces, especially for large poses. The facial detail filling technique proposed in [167] relies on a symmetry assumption and may lead to inferior results (3rd row, 2nd and 7th columns). In contrast, we introduce a symmetry loss in the training process that generalizes well to the test images, without the need for post-processing to impose symmetry as a hard constraint.

Figure 4.7: Face frontalization results comparison on LFW. (a) Input; (b) LFW-3D [55]; (c) HPEN [167]; (d) FF-GAN.

Evaluation on IJB-A. We further evaluate our algorithm on the IJB-A dataset. Following prior work [132], we select a subset of well-aligned images in each template for face matching. We define our distance metric as the original image pair distance plus the weighted generated image pair distance, where the weight is the confidence score provided by our module D, i.e., D(G(x, p)). Recall that module D is trained for the real-or-generated classification task, which reflects the quality of the generated images. Obviously, the poorer the quality of the generated images, the less the generated image pair contributes to the fused distance metric. With the fused distance metric, we expect the generated images to provide complementary information to boost recognition performance.

Table 4.2 shows the verification and identification performance. On verification, our method achieves consistently better accuracy compared to the baseline methods. The gap is 6.46% at FAR 0.01 and 11.13% at FAR 0.001, which is a significant improvement. On identification, our fused metric also achieves consistently better results, with a 4.95% improvement at Rank-1 and 1.66% at Rank-5. As a challenging in-the-wild face dataset, IJB-A exhibits large pose variation, complex backgrounds, and uncontrolled illumination, which prevent the compared methods from performing well. Closing one of those variation gaps can lead to a large improvement, as evidenced by our face frontalization method rectifying the pose variation.

Table 4.2: Performance comparison on IJB-A dataset.

Method             Verif. @FAR=0.01   @FAR=0.001   Identif. @Rank-1   @Rank-5
OpenBR [77]        23.6 ± 0.9         10.4 ± 1.4   24.6 ± 1.1         37.5 ± 0.8
GOTS [77]          40.6 ± 1.4         19.8 ± 0.8   44.3 ± 2.1         59.5 ± 2.0
Wang et al. [137]  72.9 ± 3.5         51.0 ± 6.1   82.2 ± 2.3         93.1 ± 1.4
PAM [95]           73.3 ± 1.8         55.2 ± 3.2   77.1 ± 1.6         88.7 ± 0.9
DCNN [21]          78.7 ± 4.3         –            85.2 ± 1.8         93.7 ± 1.0
DR-GAN [132]       77.4 ± 2.7         53.9 ± 4.3   85.5 ± 1.5         94.7 ± 1.1
FF-GAN             85.2 ± 1.0         66.3 ± 3.3   90.2 ± 0.6         95.4 ± 0.5

Evaluation on Multi-PIE. Multi-PIE allows for a graded evaluation with respect to PIE variations.
Thus, it is an important dataset to validate the performance of our method against prior work. The rank-1 identification rate is reported in Table 4.3. Note that previous works only consider poses within 60°, while our method can handle all pose variations including profile views at 90°. The results suggest that when the pose variation is within 15°, i.e., near frontal, our method is competitive with state-of-the-art methods. But when the pose is 30° or larger, our method demonstrates significant advantages over all the other methods. We achieve a 3.8% improvement at 60° and a 2.8% higher average accuracy from 0° to 60° compared to the previous best result. The average accuracy on 0° to 90° drops only 4.5% from our method's average on 0° to 60°. Further visual results in Figures 4.8 and 4.12 (second row) also support that our method is almost invariant to pose.

Table 4.3: Performance comparison on Multi-PIE dataset (rank-1 identification rate, %).

Method            0°     15°    30°    45°    60°    75°    90°    Avg(0°-60°)   Avg(0°-90°)
Zhu et al. [168]  94.3   90.7   80.7   64.1   45.9   –      –      72.9          –
Zhu et al. [169]  95.7   92.8   83.7   72.9   60.1   –      –      79.3          –
Yim et al. [151]  99.5   95.0   88.5   79.9   61.9   –      –      83.3          –
DR-GAN [132]      97.0   94.0   90.1   86.2   83.2   –      –      89.2          –
FF-GAN            95.5   94.8   93.4   91.0   87.0   82.7   71.7   92.0          87.5

4.4.4 Face Frontalization

In this section, we illustrate further face frontalization results on the Multi-PIE, AFLW, IJB-A, and CFP datasets.

Visualization on Multi-PIE. Figure 4.8 shows the face frontalization results of eight subjects in the test set of Multi-PIE. The proposed FF-GAN generates realistic frontal faces that are similar to the ground truth (top rows are the inputs, where the frontal ground truth is the image in the middle column) across all different poses. Furthermore, the gender, race, and attributes like eyeglasses are well preserved. It is clear that the larger the pose angle, the more difficult it is for the generated output to preserve identity. Surprisingly, for large poses (up to 90°), FF-GAN can still preserve the identity to a large extent. To the best of our knowledge, this is the first work to show face frontalization results for faces beyond 60°.

Figure 4.8: Visual results on Multi-PIE. Each example shows 13 pose-variant inputs (top) and the generated frontal outputs (bottom). We clearly observe that the outputs consistently recover similar frontal faces across all the pose intervals.

Visualization on AFLW. Figure 4.9 shows the face frontalization results on AFLW, which encompasses more pose variation than LFW. For better visualization, we separate the faces into three groups with small, medium, and large pose variation, defined based on the visibility of the two eyes (both visible for small pose, one eye half-occluded for medium pose, and one eye fully occluded for large pose). FF-GAN works extremely well for the face images with small pose, in rows (a) and (b). For face images with medium or large poses in rows (c) and (d), respectively, FF-GAN still generates plausible results without many artifacts. We note that even for nearly profile views in row (d), high-frequency details of facial features are recovered well, the frontalized face is symmetric, and identity is preserved quite well. Row (e) shows results for input images under various lighting conditions or expressions. Again, FF-GAN works well under these variations.

Visualization on IJB-A. Figure 4.10 shows the face frontalization results on IJB-A, which consists of large-pose and low-quality face images.
The input images are of medium to large pose and under a large variation of race, age, expression, and lighting conditions. However, FF-GAN can still generate realistic and identity-preserved frontal faces. Visualization on CFP We further explore face frontalization on CFP, which is a challenging dataset with extreme profile faces (90◦). Note that our training set, 300W-LP, has a large systematic gap from CFP. Therefore, we finetune our models with a limited subset (1,600 profile and 4,000 frontal images of 400 subjects) from CFP. Figure 4.11 shows the face frontalization results on some of the remaining unseen profile faces. Despite some artifacts and blurring effects in the occluded side of the face, FF-GAN manages to generate a consistent frontal face. We observe that identity is preserved to some degree and facial features are reconstructed to a reasonable extent. However, we observe that there is some blurriness in the frontalized output. This is attributed to the fact that the face images and crops are different from other datasets. For instance, the ear and neck regions are prominently visible in CFP but not in other datasets. Thus, they are not entirely eliminated in the frontalized output and cause ghosting artifacts. A larger training dataset that includes profile faces similar to those in CFP will likely alleviate this issue. In summary, our frontalization results are of very high quality in Multi-PIE, LFW, AFLW, IJB-A datasets, with some room for improvement in the CFP dataset. 71 (a) (b) (c) (d) (e) Figure 4.9: Face frontalization results on AFLW. FF-GAN achieves very promising visual effects for faces with small (row (a) and (b)), medium (row (c)), large (row (d)) poses and under various lighting conditions and expressions (row (e)). We observe that the proposed FF-GAN achieves accurate frontalization, while recovering high frequency facial details as well as identity, even for face images observed under extreme variations in pose, expression or illumination. 72 Figure 4.10: Face frontalization results on IJB-A. Odd rows are all profile-view inputs and even rows are the frontalized results. Figure 4.11: Face frontalization results on CFP. Odd rows are all profile-view inputs and even rows are the frontalized results. 73 Table 4.4: Quantitative results of ablation study. removed module performance (syn.) − 74.2 C 59.2 D 73.4 R 68.5 Gid 69.3 Gtv Gsym 72.9 73.1 4.4.5 Ablation Study FF-GAN consists of four modules M = {R,G,C,D}. Our generator G is the key component for image synthesis, which cannot be removed. We train three partial variants by removing each of the remaining modules, which results in M\{C}, M\{D}, and M\{R}. Further, we train another three variants by removing each of the three loss functions (including Gid, Gtv, Gsym) applied on the generated images, resulting in M\{Gid}, M\{Gtv}, and M\{Gsym}. We keep the training process and all hyper-parameters the same and explore how the performances of those models differ. Figure 4.12 shows visual comparisons between the proposed framework and its incomplete variants. Our method is visually better than those variants, across all different poses, which sug- gests that each component in our model is essential for face frontalization. Without the recognizer C, it is hard to preserve identity especially on large poses. Without the discriminator D, the gen- erated images are blurry without much high-frequency identity information. 
Without the recon- structor R, there are artifacts on the generated faces, which highlights the effectiveness of 3DMM in frontalization. Without the reconstruction loss Gid, the identity can be preserved to some extent, however the overall image quality is low, and the lighting condition is not preserved. Table 4.4 shows the quantitative results of the ablation study models by evaluating the recogni- tion rate of the synthetic images generated from each model. Our FF-GAN with all modules and all loss functions performs the best among all other variants, which suggests the effectiveness of each part of our framework. For example, the performance drops dramatically without the recognition engine regularization. The 3DMM module also performs a significant role in face frontalization. 74 Figure 4.12: Ablation study results. (a) input images. (b) M (ours). (c) M\{C}. (d)M\{D}. (e) M\{R}. (f) M\{Gid}. 4.5 Summary In this work, we propose a 3DMM conditioned GAN framework to frontalize faces under all pose ranges including profile views. To the best of our knowledge, this is the first work to ex- pand pose ranges to 90◦ in challenging large-scale datasets. The 3DMM coefficients provide an important shape and appearance prior to guide the generator to rotate faces. The recognition en- gine regularizes the generated image to preserve identity information. We propose new losses and carefully design the training procedure to obtain high-quality generated images. Extensive experi- ments consistently suggest that our frontalization algorithm may potentially boost face recognition performances and be applied for 3D face reconstruction tasks. Large-pose face frontalization is a challenging and ill-posed problem, but we believe this work has made convincing progress towards a viable solution. 75 (a)(b)(c)(d)(e)(f)1 Chapter 5 Feature Transfer Learning 5.1 Introduction Face recognition is one of the ongoing success stories of the deep learning era, yielding very high accuracies on traditional datasets [62, 77, 49]. However, it remains undetermined how these re- sults translate to practical applications, or how deep learning classifiers for fine-grained recognition must be trained to maximally exploit real-world data. While it has been established that recogni- tion engines are data-hungry and keep improving with more volume [123], mechanisms to derive benefits from the vast diversity of real data are relatively unexplored. In particular, real-world data is long-tailed [59], with only a few samples available for most classes. In practice, effective han- dling of long-tail classes is also indispensable in surveillance applications where subjects may not cooperate during data collection. It is evident that classifiers that ignore this long-tail nature of data likely imbibe hidden biases. Consider the example of the CASIA-Webface dataset [150] in Fig. 5.1(a), where about 39% of the 10K subjects have less than 20 images. A simple solution is to simply ignore the long-tail classes, as common for traditional batch construction and weight update schemes [48]. Besides reduction in the volume of data, the inherently uneven sampling leads to biases in the weight norm distribution across head and tail classes (Fig. 5.1(b,c)). Sampling tail classes at a higher frequency addresses the latter, but still leads to biased decision boundaries due to insufficient intra-class variance in tail 76 Figure 5.1: (a) The long-tail distribution of CASIA-WebFace [150]. 
(b) The classifier weight norm varies across classes in proportion to their volume. (c) The weight vector norm for head class ID 1008 is larger than that for tail class ID 10,449, causing a bias in the decision boundary (dashed line) towards ID 10,449. (d) Even after data re-sampling, the variance of ID 1008 is much larger than that of ID 10,449, causing the decision boundary to still be biased towards the tail class. We augment the feature space of the tail classes as the dashed ellipsoid and propose improved training strategies, leading to an improved classifier.

classes (Fig. 5.1(d)).

In this work, we propose strategies for training more effective classifiers for face recognition by adapting the distribution of learned features from tail classes to mimic that of head (or regular) classes. We propose to handle long-tail classes during training by augmenting their feature space using a center-based transfer. In particular, we assume a Gaussian prior, whereby most of the variance of regular classes is captured by the top few components of a Principal Components Analysis (PCA) decomposition. By transferring the principal components from regular to long-tail classes, we encourage the variance of long-tail classes to mimic that of regular classes. Motivation for center-based transfer can also be found in recent works on the related problem of low-shot recognition [120], where the feature center is found to be a good proxy that preserves identity. Thus, restricting the transfer variance to within the minimum inter-class distance limits the transfer error to be within the classifier error.

Our feature transfer overcomes the issues of imbalanced and limited training data. However, directly using the augmented data for training is sub-optimal, since the transfer might further skew the class distributions. Thus, we propose a training regimen that alternates between carefully designed choices to solve for the feature transfer (with the goal of obtaining a less biased decision boundary) and feature learning (with the goal of learning a more discriminative representation).
Further, we propose a novel metric regularization that jointly regularizes softmax feature space and weight templates, leading to empirical benefits such as reduced problems with vanishing gradients. An approach for such feature-level transfer has also been proposed by Hariharan and Gir- shick [54] for 1K-class ImageNet classification [112]. But the face recognition problem is geared towards at least two orders of magnitude more classes, which leads to significant differences due to more compact decision boundaries and different nature of within-class variances. In particular, we note that the intuition of [54] to transfer semantic aspects based on relative positions in feature space is valid for ImageNet categories that vary greatly in shape and appearance, but not for face recognition. Rather, we must transfer the overall variance in feature distributions from regular to long-tail classes. To study the empirical properties of our method, we mimic a long-tail dataset by limiting the number of samples for various proportions of classes in the MS-Celeb-1M dataset [49], while evaluating on LFW, IJB-A and the hold-out set from MS-Celeb-1M dataset. We observe that our feature transfer consistently improves upon a method that does not specifically handle long-tail classes. Moreover, we observe that adding more long-tail classes improves the overall performance of face recognition. We compare against the state-of-the-art on LFW and IJB-A benchmarks, to obtain highly competitive results that demonstrate improvement due to our feature transfer. Fur- ther, our method can be applied to challenging low-shot or one-shot scenarios, where we show competitive results on the one-shot MS-Celeb-1M challenge [48] without any tuning. Finally, we visualize our feature transfer through smooth interpolations, which demonstrate that a disentan- gled representation is learned that preserves identity while augmenting non-identity aspects of the feature space. 78 To summarize, we make the following contributions to face recognition: • A center-based feature-level transfer algorithm to enrich the distribution of long-tailed classes, leading to diversity without sacrificing volume. It also leads to an effective disentanglement of identity and non-identity feature representation. • A simple but effective metric regularization to enhance performances for both our method and baselines, which is also applicable to other recognition tasks. • A two-stage alternating training scheme to achieve an unbiased classifier and retain discrim- inative power of the feature representation despite augmentation. • Empirical analysis through extensive ablation studies and demonstration of benefits for face recognition in both general and one-shot settings. 5.2 Proposed Method In Section 5.2.1, we introduce the problems caused by long-tail classes on training, such as clas- sifier weight norm bias or intra-class variance bias, and overview challenges and solutions that will be discussed with more details in later sections. Then, we demonstrate the overall frame- work in Section 5.2.2 with a novel regularization. In Section 5.2.3, a center-based feature transfer method is proposed to resolve the intra-class variance bias of long-tail classes. We finally present an alternating regimen for updating the classifier with the proposed feature transfer and the feature representation to effectively train the entire system in Section 5.2.4. 
79 5.2.1 Motivations It is known that training deep face representations using data with long-tail distribution results in degraded performance [160]. We have similar observations in our experiments, where we train CASIA-Net [150] on CASIA-Webface [150], whose data distribution indeed shows long-tail be- havior as in Fig. 5.1 (a). We further observe two atypical classifier behaviors, such as significant variations on norms of classifier weights or intra-class variances between regular and long-tail classes. Imbalance in Classifier Weight Norm: As shown in Fig. 5.1 (b), we observe the norm of classifier weight (i.e., weight matrix of last fully connected layer) of regular classes is much larger than that of tail classes, which causes the decision boundary biases towards the tail class. This is mainly due to the fact that the weights of regular classes are more frequently updated than those of tail classes. In this regard, there exist several well-known solutions, such as data re-sampling in proportion to the volume of each class or class weights normalization [48]. Imbalance in Intra-class Variance: Unfortunately, we still observe significant imbalance after weight norm regularization via data re-sampling.1 As an illustrative example, we randomly pick two classes, one from a regular class (ID=1008) and the other from a tail class (ID=10449). We visualize the features from two classes projected onto 2D space using t-SNE [93] in Figure 5.1(c) and those after weight norm regularization in Figure 5.1(d). Although the weights are regularized to be similar, the low intra-class variance of the tail class is not fully resolved. This causes the decision boundary to be biased, which impacts recognition performance. We build upon this observation to posit that enlarging the intra-class variance for tail classes is the key to alleviate the impact of long-tail classes. In particular, we propose a data augmentation 1We found it harder to train models with weight normalization [48], nonetheless, the intra-class variance issues to which we allude would still remain. 80 Figure 5.2: The proposed framework includes a feature extractor Enc, a decoder Dec, a feature filtering module R, and a fully connected layer as classifier FC. The proposed feature transfer module G generates new feature ˜g from original feature g. The network is trained with an alterna- tive bi-stage strategy. At stage 1, we fix Enc and apply feature transfer G to generate new features (green triangle) that are more diverse and likely to violate decision boundary. In stage 2, we fix the rectified classifier FC, and update all the other models. As a result, the samples that are originally on or across the boundary are pushed towards their center (blue arrows in bottom right). Best viewed in color. approach at the feature-level that can be used as extra positive examples for tail classes to enlarge the intra-class variance. As illustrated in Figure 5.1(d), the feature distribution augmented by these virtual positive examples helps rectify the classifier boundary, which in turn allows reshaping the feature representation. The remainder of this section proposes specific mechanisms for regularization, feature aug- mentation and neural network training to realize the above intuitions. 5.2.2 Proposed Framework Many recent successes in deep face recognition are attributable to the design of novel loss or regularization [88, 114, 106, 87, 31, 115], that reduces over-fitting to limited amount of labeled training data. 
In contrast, our method focuses on recovering the missing samples of tail classes by transferring knowledge from regular classes to enlarge their intra-class variance. At first glance, our goal of diversifying the features of tail classes appears to contradict the general premise of deep learning frameworks, which is to learn compact and discriminative features. However, we argue that it is more advantageous to learn the intra-class variance of tail classes for generalization, that is, for adapting to unseen examples. To achieve this, we enlarge the intra-class variance of tail classes at a lower layer, which we call a rich feature layer [50], while subsequent filtering layers learn a compact representation with the softmax loss. Next, we define the training objectives of our proposed framework.

As illustrated in Figure 5.2, the proposed face recognition system is composed of several components, namely an encoder, a decoder, and a feature transfer module followed by filtering and classifier layers, as well as multiple training losses, such as an image reconstruction loss and a classification loss.

An encoder Enc computes a rich feature g = Enc(x) ∈ R^320 of an input image x ∈ R^{100×100}, and a decoder Dec reconstructs the input image, i.e., x′ = Dec(g) = Dec(Enc(x)) ∈ R^{100×100}. This pathway is trained with the following pixel-wise reconstruction loss:

Lrecon = ‖x′ − x‖²₂.   (5.1)

The reconstruction loss allows g to contain diverse attributes besides identity, such as pose and expression, that are to be transferred from regular classes to tail classes. A feature transfer module G transfers the variance computed from regular classes and generates a new feature g̃ = G(g) ∈ R^320 for tail classes, as described in the next section. Then, a filtering network R is applied to generate identity-related features f = R(g) ∈ R^320 that are fed to a classifier layer FC with weight matrix W = [w_j] ∈ R^{Nc×320}. This pathway optimizes the softmax loss:

Lsfmx = −log ( exp(w_iᵀ f) / Σ_{j=1}^{Nc} exp(w_jᵀ f) ),   (5.2)

where i is the ground truth label of f.

We note that the softmax loss is scale-dependent, that is, the loss can be made arbitrarily small by scaling the norm of the weights w_j or the feature f. Typical solutions to prevent this problem are to either regularize the norm of the weights² or features [106], or to normalize them [48, 138]. However, we argue that these are too stringent, since they penalize the norms of individual weights and features without considering their compatibility. Instead, we propose to directly regularize the norm of the exponent of the softmax loss as follows:

Lreg = ‖Wᵀ f‖²₂.   (5.3)

We term our proposed regularization a metric L2, or m-L2. As we will discuss in Section 5.3.2, joint regularization of weights and features through the magnitude of their inner product works better in practice than individual regularization. Finally, we formulate the overall training loss as Equation 5.4, with the coefficients set to α_sfmx = α_recon = 1 and α_reg = 0.25 unless otherwise stated:

L = α_sfmx Lsfmx + α_recon Lrecon + α_reg Lreg.   (5.4)

²http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression#Weight_Decay
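A minimal PyTorch sketch of the objective (5.1)–(5.4) is given below. Enc, Dec, R, and fc_weight stand for the modules and the classifier weight matrix W of Figure 5.2; the coefficient defaults are the values stated above, while the function name and the batch-averaging are illustrative choices rather than the exact normalization of the released implementation.

    import torch
    import torch.nn.functional as F

    def total_loss(x, y, Enc, Dec, R, fc_weight,
                   a_sfmx=1.0, a_recon=1.0, a_reg=0.25):
        """Eq. (5.4): softmax + reconstruction + metric-L2 (m-L2) regularization."""
        g = Enc(x)                          # rich feature, (N, 320)
        x_rec = Dec(g)                      # reconstructed image
        f = R(g)                            # identity feature, (N, 320)
        logits = f @ fc_weight.t()          # (N, Nc): w_j^T f for every class j

        l_recon = (x_rec - x).pow(2).flatten(1).sum(dim=1).mean()   # Eq. (5.1)
        l_sfmx = F.cross_entropy(logits, y)                          # Eq. (5.2)
        l_reg = logits.pow(2).sum(dim=1).mean()                      # Eq. (5.3): ||W^T f||_2^2
        return a_sfmx * l_sfmx + a_recon * l_recon + a_reg * l_reg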
5.2.3 Long-Tail Class Feature Transfer

Following previous face recognition approaches, such as joint Bayesian face models [18, 17], we assume that the rich features g_ik from class i lie in a Gaussian distribution with class-specific mean c_i and covariance matrix Σ_i. To transfer intra-class variance from regular to long-tail classes, we assume the covariance matrices are shared across all classes, Σ_i = Σ. Under this assumption, the mean, or class center, is simply estimated as the arithmetic average of all features from the same class.

Figure 5.3: Visualization of samples closest to the feature center. (Left) We find that near-frontal, close-to-neutral faces are the nearest neighbors of the feature center for regular classes. (Right) Faces closest to the center are from classes with the fewest samples; they still contain pose and expression variance, as tail classes may severely lack neutral samples. Features are extracted by VGGFace [101] and samples are from CASIA-WebFace [150].

As shown in the left of Figure 5.3, the center representation for regular classes is identity-specific while removing irrelevant factors of variation such as pose, expression, or illumination. However, as in the right of Figure 5.3, due to the lack of training examples, the center estimate of long-tail classes is not accurate and is often biased towards certain identity-irrelevant factors, such as pose, which we find dominant in practice. To improve the quality of the center estimate for long-tail classes, we discard examples with extreme pose variations. Furthermore, we average features from both the original and the horizontally flipped images. With ḡ_ik ∈ R^320 a rich feature extracted from the flipped image, the feature center is estimated as follows:

c_i = (1 / (2|Ω_i|)) Σ_{k∈Ω_i} (g_ik + ḡ_ik),   Ω_i = {k | ‖p_ik − p̄_ik‖₂ ≤ τ},   (5.5)

where p_ik and p̄_ik are the pose codes for g_ik and ḡ_ik, respectively. Ω_i includes the indices of examples with yaw angle less than a threshold τ.

Next, we transfer the variance estimated from the regular classes to the long-tail classes. In theory, one can draw feature samples of long-tail classes by adding a noise vector ε ∼ N(0, Σ). However, the directions of noise vectors sampled from this distribution may be too random and do not reflect the true factors of variation found in the regular classes. Instead, we transfer the intra-class variance evaluated from individual samples of regular classes. To further remove the identity-related component in the variance, we filter it using a PCA basis Q ∈ R^{320×150} [142] obtained from the intra-class variations of all regular classes. We take the top 150 eigenvectors, which preserve 95% of the energy. Our center-based feature transfer is performed as:

g̃_ik = c_i + QQᵀ(g_jk − c_j),   (5.6)

where g_jk and c_j are a feature sample and the center of a regular class j, c_i is the feature center of a long-tail class i, and g̃_ik is the transferred feature for class i. Here, g̃_ik preserves the same identity as c_i, with intra-class variance similar to that of g_jk. By sufficiently sampling g_jk across different regular classes, we expect to obtain an enriched distribution of the long-tail class i, which consists of both the original observed features g_ik and the transferred features g̃_ik.
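The center estimation (5.5) and the center-based transfer (5.6) amount to a few lines of linear algebra, sketched below in NumPy. The inputs are assumed to be features already extracted by Enc from the original and flipped images, along with their pose codes; all function and variable names are illustrative rather than part of the released implementation.

    import numpy as np

    def class_center(g, g_flip, pose, pose_flip, tau):
        """Eq. (5.5): average original and flipped features over near-frontal samples.
        g, g_flip: (m, 320); pose, pose_flip: (m, d_pose)."""
        keep = np.linalg.norm(pose - pose_flip, axis=1) <= tau
        if not keep.any():                  # fall back to all samples if none pass
            keep = np.ones(len(g), dtype=bool)
        return 0.5 * (g[keep] + g_flip[keep]).mean(axis=0)

    def variance_pca_basis(g_reg, centers_reg, labels_reg, k=150):
        """PCA basis Q of the intra-class variations pooled over regular classes.
        labels_reg are assumed to index rows of centers_reg."""
        residuals = g_reg - centers_reg[labels_reg]          # (N_reg, 320)
        # top-k right singular vectors == eigenvectors of the pooled covariance
        _, _, vt = np.linalg.svd(residuals, full_matrices=False)
        return vt[:k].T                                      # (320, k)

    def transfer_feature(g_regular, c_regular, c_tail, Q):
        """Eq. (5.6): move a regular sample's filtered deviation onto a tail-class center."""
        return c_tail + Q @ (Q.T @ (g_regular - c_regular))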
5.2.4 Alternating Training Strategy

Given a training set of regular and long-tail classes D = {Dreg, Dlt}, we first train all modules M = {Enc, Dec, R, FC} using Equation 5.4 without any feature transfer. Then, we alternately train, until convergence, the classifier with decision boundary reshaping using our proposed feature transfer, and the feature representation with the boundary-corrected classifier. The overview of our two-stage alternating training process is illustrated in Algorithms 5.1 and 5.2. We describe each training stage in more detail below.

Stage 1: Decision Boundary Reshape. In this stage, we update R and FC while fixing the other modules, using features whose variance is transferred from regular to long-tail classes in order to enlarge the intra-class variance of long-tail classes and thus reshape the decision boundary. We first update the statistics, including the feature centers C, the PCA basis Q, and an index list h of hard samples, i.e., samples whose distance from the center is larger than the average distance for each regular class. The PCA basis Q is obtained by decomposing the covariance matrix V computed with the samples from the regular classes Dreg. Three batches are used for training in each iteration: a regular batch sampled from the hard index list h, {g_r, y_r}; a long-tail batch sampled from the long-tail classes, {g_t, y_t}; and a transferred batch {g̃_t, y_t} obtained by transferring the variance from the regular batch to the long-tail batch.

Algorithm 5.1: Alternating training scheme for feature transfer learning.
  Stage 0: model pre-training
    train M with dataset D using Eqn. 5.4
  Stage 1: decision boundary reshape
    fix Enc and Dec, train R and FC
    [C, Q, h] = UpdateStats()
    init G(C, Q)
    for i = 1, ..., Niter do
      train 1st batch from h: {x_r, y_r}
      train 2nd batch from Dlt: {x_t, y_t}
      g̃_t = Transfer(x_r, y_r, y_t)
      train 3rd batch: {g̃_t, y_t}
  Stage 2: compact feature learning
    fix FC, train Enc, Dec, and R
    for i = 1, ..., Niter do
      randomly sample from D: {x, y}
      train {x, y} using Eqn. 5.4
  alternate stage 1 and 2 until convergence

Stage 2: Compact Feature Learning. In this stage, we train Enc, Dec, as well as R, using normal batches {x, y} from regular and long-tail classes with Equation 5.4 and without the transferred batch. We keep FC fixed, since it is already well trained from the previous stage with the decision boundary corrected using feature transfer. The gradient directly back-propagates to R and Enc for a more compact representation, which decreases violations of the class boundaries.

Algorithm 5.2: Functions that are called in Algorithm 5.1.
  Function [C, Q, h] = UpdateStats()
    init C = [], V = [], h = []
    for i = 1, ..., Nc do
      g_i = Enc(x_i), ḡ_i = Enc(x̄_i)
      c_i = (1 / (2|Ω_i|)) Σ_{j∈Ω_i} (g_ij + ḡ_ij)
      C.append(c_i)
      if i in Dreg then
        d = (1 / m_i) Σ_j ‖g_ij − c_i‖₂
        for j = 1, ..., m_i do
          V.append(g_ij − c_i)
          if ‖g_ij − c_i‖₂ > d then h.append([i, j])
    Q = PCA(V)
  Function g̃_t = Transfer(x_r, y_r, y_t)
    g_r = Enc(x_r)
    for k = 1, ..., Nb do
      c_i = C(y_t_k, :), c_j = C(y_r_k, :)
      g̃_t_k = c_i + Q Qᵀ (g_r_k − c_j)

5.3 Experimental Results

We train our models on the MS-Celeb-1M dataset [49], which consists of 10M images from around 100K celebrities. Due to label noise, we adopt a cleaned version from [143] and further remove the subjects overlapping with LFW and IJB-A, which results in 4.8M images of 76.5K classes for training. A class with no more than 20 images is considered a long-tail class, following [160].

For implementation, we apply an encoder-decoder structure similar to [132] and ResNet-54 for Enc in Section 5.3.5. Model R consists of an FC layer, two full convolution layers, two convolution layers, and another FC layer to obtain f ∈ R^{320×1}. More details are provided in the supplementary material. The Adam solver with learning rate 2e-4 is used in stage 0. A learning rate of 1e-5 is used in stages 1 and 2, which are alternated every 5K iterations.

Figure 5.4: (a) Center estimation error comparison. (b) Two classes with intra-class and inter-class variance illustrated. Circles from small to large show the minimum, mean, and maximum distance from intra-class samples to the center. Distances are averaged across 1K classes.
5.3.1 Feature Center Estimation

Feature center estimation is a key step for feature transfer. To evaluate center estimation for tail classes, 1K regular classes are selected from MS-Celeb-1M and features are extracted using a pretrained recognition model. We randomly select a subset of 1, 5, 10, or 20 images to mimic a long-tail class. Three methods are compared: (1) “PickOne”, which randomly picks one sample as the center; (2) “AvgAll”, which averages the features of all images; (3) “AvgFlip”, the proposed method in Equation 5.5. The error is the difference between the center of the full set (ground truth) and that of the subset. The intra-class and inter-class variances are provided as references. All errors are normalized by the inter-class variance. The results in Figure 5.4 show that our “AvgFlip” achieves a clearly smaller error. Compared to the intra-class variance, the error is considerably smaller, which indicates that our center estimate is accurate enough to support the feature transfer.

Figure 5.5: Toy example on MNIST to show the effectiveness of our m-L2 regularization. (a) The training loss/accuracy comparison. (b) Feature distribution on the test set for the model trained without m-L2 regularization. (c) Feature distribution with m-L2 regularization.

5.3.2 Effects of m-L2 Regularization

To study the effects of the proposed m-L2 regularization, we show a toy example on the MNIST dataset [79]. We use the LeNet++ network [141] to learn a 2D feature for better visualization. Two models are trained: one with the softmax loss only, and the other with the softmax loss and m-L2 regularization (α_reg = 0.001).

m-L2 regularization has several advantages. (1) m-L2 effectively avoids over-fitting. In Figure 5.5, softmax training shows over-fitting as the training error goes to 0, whereas with m-L2 the training error stays small but nonzero. (2) m-L2 enforces a more balanced feature distribution. Figure 5.5(c) shows a more balanced angular distribution than Figure 5.5(b). m-L2 improves the softmax accuracy from 99.06% to 99.35%. We believe m-L2 is a simple yet powerful general regularization that can be easily adapted to other recognition problems.

5.3.3 Ablation Study

We study two factors to analyze the long-tail training: (1) the ratio of the portion of regular classes vs. the portion of long-tail classes; (2) the number of images per long-tail class. We use discrete

Table 5.1: Results on the controlled experiments by varying the ratio between regular and long-tail classes in the training sets. Test → Train↓ Method↓ 10K0K sfmx sfmx+m-L2 sfmx sfmx+m-L2 Ours sfmx sfmx+m-L2 Ours sfmx sfmx+m-L2 Ours sfmx sfmx+m-L2 10K10K 10K30K 10K50K 60K0K LFW g f – – 97.15 97.45 97.00 97.88 97.85 97.08 97.85 96.72 98.33 97.80 97.13 98.08 96.87 98.42 97.93 97.32 98.10 96.95 98.48 97.52 98.30 97.90 98.85 – FAR@.01 @.001 Rank-1 Rank-5 Reg. IJB-A: Verif. 69.39 73.00 72.96 74.07 80.25 74.03 76.92 81.80 72.87 78.52 82.60 82.75 86.38 33.04 44.78 49.22 46.27 54.95 47.93 47.17 61.04 49.04 53.44 62.60 62.33 74.44 IJB-A: Identif.
MS1M: NN LT 87.17 82.47 90.21 84.68 85.87 85.25 89.48 84.10 92.27 88.16 86.14 85.47 90.60 86.40 91.76 88.72 85.28 84.21 90.24 87.11 92.08 89.36 90.43 89.54 93.68 93.46 90.35 91.49 90.46 91.74 92.83 91.25 91.93 92.62 91.15 92.17 93.08 93.78 94.65 81.63 83.77 82.38 83.70 85.88 83.04 84.81 86.08 82.40 84.95 86.53 87.11 89.34 approximation to mimic the real regular vs. long-tail class distribution and the continuous distri- bution of number of samples per tail class. Our main focus is to analyze the long-tail distribution impact on recognition thus assume discrete setting for simplicity. Regular/Long-Tail Class Ratio: we use 60K regular classes with most number of images from MS-Celeb-1M. The top 10K classes are selected as regular classes which are shared among all training sets. We regard the 10K and 60K sets to serve as the lower and upper bounds. Among the rest 50K classes sorted by number of images, we select the first 10K, 30K and 50K and randomly pick 5 images per class. In this way, we form the training set of 10K10K, 10K30K, and 10K50K, of which the first 10K are regular and the last 10K or 30K or 50K are called faked long-tail classes. A hold-out testing set is formed by selecting 5 images from each of the shared 10K regular classes and 10K tail classes, resulting in 100K testing images. The evaluation on the hold out test set from MS-Celeb-1M is to mimic the low-shot learning 90 scenario, where we use the feature center from the training images as the gallery and nearest neighbor (NN) for face matching. The rank-1 accuracy for both regular and long-tail classes are reported. We also evaluate the general face recognition performance on LFW and IJB-A. The results are shown in Table 5.1 and we draw the following observations. • The feature space g is less discriminative than the feature space f, which validates our as- sumption that g is rich in intra-class variance for feature transfer while f is more discrimina- tive for face recognition. • The proposed m-L2 regularization boosts the performance with a large margin over the base- line softmax loss. • The proposed transfer method consistently improves over sfmx and sfmx+m-L2 with signif- icant margins, and largely close the gap from 10K0K to 60K0K. • Our method is more beneficial when more long-tail classes are added to training as more long-tail classes lead to better face recognition performance. Number of Images per Long-Tail Class: we vary n = 1,5,10,20 under setting 10K30K. Ta- ble 5.2 reveals that more images in long-tail classes leads to better results, due to better center esti- mation. Consistent with Table 5.1, the proposed algorithm significantly improves performance on low-shot setting of MS-Celeb-1M and general face recognition on LFW and IJB-A. On 10K30K (n = 5) setting, we look into the FC classifier performance, 93.59% and 2.04% for regular and long-tail respectively. Whereas our method achieves 96.26% and 81.89% accordingly, which sug- gests our method’s effectiveness in correcting classifier bias. 91 Table 5.2: Results of the controlled experiments by varying the number of images for each long-tail class in the training sets. Test → Train ↓ 10K30K sfmx (n = 1) Method↓ sfmx+m-L2 Ours 10K30K sfmx (n = 5) sfmx+m-L2 Ours 10K30K sfmx (n = 10) sfmx+m-L2 Ours 10K30K sfmx (n = 20) sfmx+m-L2 Ours LFW f 97.82 97.93 98.28 97.80 98.08 98.42 97.98 98.38 98.60 98.08 98.58 98.83 FAR@.01 @.001 Rank-1 Rank-5 Reg. IJB-A: Identif. 
The results are shown in Table 5.1 and we draw the following observations.

• The feature space g is less discriminative than the feature space f, which validates our assumption that g is rich in intra-class variance for feature transfer while f is more discriminative for face recognition.

• The proposed m-L2 regularization boosts the performance by a large margin over the baseline softmax loss.

• The proposed transfer method consistently improves over sfmx and sfmx+m-L2 with significant margins, and largely closes the gap from 10K0K to 60K0K.

• Our method becomes more beneficial as more long-tail classes are added to training: additional long-tail classes lead to better face recognition performance.

Number of Images per Long-Tail Class: we vary n = 1, 5, 10, 20 under the 10K30K setting. Table 5.2 reveals that more images per long-tail class lead to better results, owing to better center estimation. Consistent with Table 5.1, the proposed algorithm significantly improves performance on the low-shot setting of MS-Celeb-1M and on general face recognition on LFW and IJB-A. Under the 10K30K (n = 5) setting, we also examine the FC classifier performance: it achieves 93.59% and 2.04% for regular and long-tail classes respectively, whereas our method achieves 96.26% and 81.89%, which suggests that our method is effective in correcting the classifier bias.

Table 5.2: Results of the controlled experiments by varying the number of images for each long-tail class in the training sets.

Train      Method      LFW (f)  IJB-A Verif.        IJB-A Identif.     MS1M: NN
                                FAR@.01   @.001     Rank-1   Rank-5    Reg.    LT
10K30K     sfmx        97.82    72.03     43.56     82.51    91.01     87.35   86.94
(n = 1)    sfmx+m-L2   97.93    74.22     47.79     83.94    91.52     90.47   84.85
           Ours        98.28    78.65     51.15     85.82    92.23     92.65   88.99
10K30K     sfmx        97.80    74.03     47.93     83.04    91.25     86.14   85.47
(n = 5)    sfmx+m-L2   98.08    76.92     47.17     84.81    91.93     90.60   86.40
           Ours        98.42    81.80     61.04     86.08    92.62     91.76   88.72
10K30K     sfmx        97.98    75.67     52.48     83.41    91.34     86.04   85.93
(n = 10)   sfmx+m-L2   98.38    80.11     56.51     86.00    93.11     90.83   88.77
           Ours        98.60    84.07     64.73     87.55    93.72     92.89   90.89
10K30K     sfmx        98.08    76.36     54.14     83.68    91.77     86.42   86.76
(n = 20)   sfmx+m-L2   98.58    80.61     59.75     86.34    93.36     91.40   90.05
           Ours        98.83    85.27     67.19     88.42    94.14     93.38   92.26

Figure 5.6: Center visualization: (a) one sample image from the selected class; (b) the decoded image from the feature center.

5.3.4 One-Shot Face Recognition

While our method is only tangentially related to one-shot recognition, we evaluate on the MS1M one-shot challenge as an illustration [48]. In this setting, the training data consists of a base set with 20K classes, each with 50∼100 images, and a novel set of 1K classes, each with only 1 image. The test set consists of 1 image for each base class and 5 images for each novel class. The main purpose is to evaluate the recognition performance on the novel classes while monitoring the performance on the base classes.

Table 5.3: Comparison on the one-shot MS-Celeb-1M challenge. Results on the base classes are reported as rank-1 accuracy and results on the novel classes are reported as Coverage@Precision = 0.99. The compared methods are MCSM [148], Cheng et al. [25], Choe et al. [27], UP [48], Hybrid [144], DM [119], and Ours; for each method the table lists whether external data is used, the number of models, and the base and novel results.

Table 5.4: Face recognition on LFW and IJB-A. "MP" represents media pooling and "TA" represents template adaptation. The best and second-best results are highlighted. The LFW part compares MTL [153], L-Softmax [88], VGG Face [101], DeepID2 [125], NormFace [138], CenterLoss [141], SphereFace [87], RangeLoss [160], FaceNet [115], sfmx, sfmx+L2, sfmx+m-L2, and Ours by verification accuracy; the IJB-A part compares PAMs [95], DR-GAN [132], FF-GAN [154], TA [31], TPE [114], NAN [149], sfmx, sfmx+L2, sfmx+m-L2, Ours, Ours+MP, and Ours+MP+TA by verification TAR at FAR=.01/.001 and identification Rank-1/-5/-10.

As shown in Table 5.3, we achieve 95.48% rank-1 accuracy with a single model and single-crop testing. We use the output of the softmax layer as the confidence score and achieve 92.60% coverage at a precision of 0.99. Note that the best models [144] and [25] use model ensembling with different crops for testing. Compared to methods with a similar setting [48, 27], we achieve competitive performance on the base classes and much better results on the novel classes.
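Our reading of the Coverage@Precision metric can be sketched as follows. This is an illustrative sketch, not the official evaluation code: predictions are sorted by their softmax confidence, and coverage is the largest fraction of novel-class test images that can be answered while keeping precision at or above the target.

```python
import numpy as np

def coverage_at_precision(confidences, is_correct, target=0.99):
    # confidences: max softmax score per test image; is_correct: whether top-1 is right.
    order = np.argsort(-confidences)              # most confident predictions first
    correct = np.asarray(is_correct, dtype=float)[order]
    precision = np.cumsum(correct) / np.arange(1, len(correct) + 1)
    meets = np.where(precision >= target)[0]      # prefixes whose precision meets the target
    return 0.0 if len(meets) == 0 else (meets[-1] + 1) / len(correct)
```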
Figure 5.7: Feature transfer visualization between two classes for every two columns. The first row shows the inputs, where the odd column denotes class 1 (x1) and the even column denotes class 2 (x2). The second row shows the reconstructed images x′1 and x′2. In the third row, the odd-column image is decoded from the feature transferred from class 1 to class 2, and the even-column image is decoded from the feature transferred from class 2 to class 1. The transferred features clearly share the identity of the target class while retaining the source image's non-identity variations, including pose, expression, and illumination.

5.3.5 Large-Scale Face Recognition

In this section, we train our model on the full MS-Celeb-1M dataset and evaluate on LFW and IJB-A. The cleaned dataset includes 76.5K classes, of which 9.5K classes contain fewer than 20 images. We use a ResNet-54 structure for Enc. As shown in Table 5.4, the deeper network structure with our proposed m-L2 regularization already provides good results, and our feature transfer learning further improves the performance significantly. On LFW, our performance is among the state of the art. On IJB-A, our method significantly outperforms most of the methods, except "NAN". While "NAN" is designed with an attention/aggregation model to specifically exploit temporal information, our method is geared towards still-image recognition with long-tail classes.

5.3.6 Qualitative Results

We apply the decoder Dec in our framework for feature visualization. It is well known that skip links between an encoder and a decoder can improve visual quality [154]. However, we do not use them, so that the feature g is encouraged to capture the intra-class variance itself rather than passing it through the skip links.

Center Visualization: Given a class with multiple samples, we compute its feature center and apply Dec to generate a center face. Figure 5.6 confirms the observation that the center is mostly an identity-preserved frontal neutral face; this also holds for portraits and cartoon figures.

Feature Transfer: The transferred features are visualized using Dec. Let x1,2, x′1,2, g1,2, c1,2 denote the input images, reconstructed images, encoded features, and feature centers of two classes, respectively. We transfer a feature from class 1 to class 2 by g12 = c2 + QQ^T (g1 − c1) and visualize the decoded image; we do the same from class 2 to class 1. Figure 5.7 shows examples of feature transfer between two classes. The transferred images preserve the target-class identity while retaining the intra-class variance of the source in terms of pose, expression, and lighting, which shows that our feature transfer is effective at enlarging the intra-class variance.

Feature Interpolation: Interpolating between two facial representations shows the appearance transition from one to the other [105, 132]. Let g1,2 and c1,2 denote the encoded features and the feature centers of two samples. Previous work generates a new representation as g = g1 + α(g2 − g1), where identity and non-identity changes are mixed together. In our work, we can generate a smooth transition of the non-identity change as g = c1 + αQQ^T (g2 − c2) and of the identity change as g = g1 + α(c2 − c1). Figure 5.8 shows an interpolation example between a female face with a left pose and a male face with a right pose, where the illumination also changes significantly. Traditional interpolation generates undesirable artifacts, whereas our method shows smooth transitions, which verifies that the proposed model is effective at disentangling identity and non-identity information.
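The two operations above reduce to a few lines of linear algebra. The sketch below is illustrative (function names are ours); Q denotes the basis used by the transfer model as defined earlier in this chapter (assumed column-orthonormal, of shape feature_dim × k), and the inputs are encoder features g and class centers c.

```python
import numpy as np

def transfer(g_src, c_src, c_tgt, Q):
    # Feature transfer from a source class to a target class, g_12 = c_2 + Q Q^T (g_1 - c_1):
    # keep the target identity (its center) and borrow the source's intra-class
    # variation projected onto the basis Q.
    return c_tgt + Q @ (Q.T @ (g_src - c_src))

def interpolate_non_identity(c1, g2, c2, Q, alpha):
    # Smooth non-identity transition: g = c_1 + alpha * Q Q^T (g_2 - c_2).
    return c1 + alpha * (Q @ (Q.T @ (g2 - c2)))

def interpolate_identity(g1, c1, c2, alpha):
    # Smooth identity transition: g = g_1 + alpha * (c_2 - c_1).
    return g1 + alpha * (c2 - c1)
```

Decoding each interpolated feature with Dec produces the transitions shown in Figure 5.8: as α sweeps from 0 to 1, only one factor (identity or non-identity) changes at a time.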
5.4 Summary

In this work, we propose a novel feature transfer approach for deep face recognition that exploits the long-tailed nature of training data. We observe that generic approaches to deep face recognition encounter classifier bias problems due to the imbalanced distribution of training data across identities. In particular, uniform sampling of both regular and long-tail classes leads to biased classifier weights, since the weights of long-tail classes are updated much less frequently. By applying the proposed feature transfer approach, we enrich the feature space of the tail classes while retaining identity. Utilizing the generated data, our alternating feature learning method rectifies the classifier and learns more compact feature representations. Our proposed m-L2 regularization demonstrates consistent advantages and can boost performance across different recognition tasks. The disentangled nature of the augmented feature space is visualized through smooth interpolations. Experiments consistently show that our method learns better representations that improve the performance on regular, long-tail, and unseen classes. While this work focuses on face recognition, our future work will also derive advantages from the proposed feature transfer for other recognition applications, such as long-tail natural species.

Figure 5.8: Transition from the top-left image to the top-right image via feature interpolation (α from 0.1 to 1.0). For each example, the first row shows traditional feature interpolation; the second row shows our transition of the non-identity variance; the third row shows our transition of the identity variance.

Chapter 6

Conclusions and Future Work

6.1 Conclusions

Face recognition is an important research topic because it has many real-world applications in surveillance, law enforcement, commercial systems, etc. With deep neural networks, the performance on benchmark datasets has improved dramatically, which contributes to the deployment of face recognition systems in our daily life. For example, faces are starting to replace fingerprints for unlocking personal devices. Companies are using face recognition systems to control building access. Railway stations provide self-service check-in that compares a customer's face with his or her ID photo to make the transit process more efficient. These applications have driven face recognition research for several decades.

Throughout this dissertation, we have presented three different approaches from the perspectives of representation learning and image synthesis for deep face recognition. Representation learning is the essential component in designing a face recognition method. It is challenging but also promising with the development of novel network structures and loss functions. On the other hand, image synthesis has the advantage of providing visually appealing results for better understanding. These two directions are often not independent of each other: a good representation is very important for identity-preserved face image synthesis, while synthesized images can provide complementary features for representation learning. We have made efforts from both perspectives to advance state-of-the-art deep face recognition in this dissertation.

6.2 Future Work

While the recognition performance on benchmark face databases has improved dramatically [115, 33, 139], unconstrained face recognition is still not a solved problem, as many failure cases can occur in real-world deployment. Such failure cases include faces captured with heavy occlusions (hair, eyeglasses, etc.) or under bad lighting conditions. One application with increasing interest is surveillance face recognition, which is challenging as the face images often have very low quality.
Improving the recognition performance on low-quality face images is important to facilitate surveillance face recognition. Moreover, the generalization ability of the learnt representation is usually very poor when the test data distribution differs from the training data, which is partly due to our limited understanding of the representation learnt in a DNN framework. We are interested in these two aspects for future work.

Surveillance Face Recognition: Face images captured by surveillance cameras are different from the celebrity face images collected from the internet. Surveillance Face Recognition (SFR) is difficult mainly due to the absence of surveillance data. Fortunately, it has been attracting more attention lately. The UCCS dataset [47] was released for unconstrained face detection and open-set face recognition from surveillance videos. Cheng et al. [26] introduce the SFR challenge, where state-of-the-art recognition algorithms are still far from satisfactory in the surveillance scenario. Clearly, more research effort is needed to tackle this problem. As with current face recognition techniques, representation learning and image synthesis are the two main directions for exploring SFR. Representation learning-based methods [140] are less studied compared to synthesis-based methods for face super-resolution, where facial priors such as landmarks and attributes are used for image restoration [14, 24, 155]. Although face super-resolution is a promising direction, the current experimental setups are based on low-resolution images down-sampled from the original high-resolution face images, which is quite different from real low-resolution / low-quality images. We will explore how to better combine representation learning and image synthesis for SFR.

Feature Interpretation: Unlike traditional features (LBP, HOG, SIFT, etc.), where the feature extraction process is well defined, deep features, though more discriminative, are less interpretable. While a tremendous amount of work focuses on the design of representation learning methods, only limited work has been introduced to understand the learnt representation. Gong et al. [42, 43] are the first to study the capacity and the intrinsic dimension of a face representation, providing the insight that a state-of-the-art face representation (the 128-d FaceNet features) has high capacity and low intrinsic dimension. This work raises the question of how many dimensions are really needed for face recognition, which is relatively unexplored in the face recognition community. Besides studying the dimension of the representation, the meaning of the representation, or the logic inside a CNN framework, is also studied in [159, 152]. Zhang et al. [159] introduce an interpretable CNN that automatically assigns each filter in a convolution layer to an object part during training. Such a formulation can encode more meaningful knowledge into the higher convolution layers, at the price of decreased classification accuracy. Yin et al. [152] propose interpretable face recognition that encourages the diversity of the filters and of the learnt representation, which is shown to improve performance. However, there is still a gap compared to state-of-the-art non-interpretable face recognition methods. How to learn an interpretable representation that achieves competitive performance is an interesting problem to study in the future.

BIBLIOGRAPHY

[1] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia.
Multi-task CNN model for attribute prediction. TMM, 2015.
[2] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (12):2037–2041, 2006.
[3] S. Akaho. A kernel method for canonical correlation analysis. arXiv preprint cs/0609071, 2006.
[4] E. N. Arcoverde Neto, R. M. Duarte, R. M. Barreto, J. P. Magalhães, C. Bastos, T. I. Ren, and G. D. Cavalcanti. Enhanced real-time head pose estimation system for mobile device. Integrated Computer-Aided Engineering, 2014.
[5] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 2008.
[6] A. Asthana, T. K. Marks, M. J. Jones, K. H. Tieu, and M. Rohith. Fully automatic pose-invariant face recognition via 3D pose normalization. In ICCV, 2011.
[7] A. Asthana, C. Sanderson, T. D. Gedeon, R. Goecke, et al. Learning-based face synthesis for pose-robust recognition from single image. In BMVC, 2009.
[8] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018.
[9] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. Technical report, Yale University New Haven United States, 1997.
[10] L. Best-Rowden and A. K. Jain. Longitudinal study of automatic face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
[11] S. Biswas, G. Aggarwal, P. J. Flynn, and K. W. Bowyer. Pose-robust recognition of low-resolution face images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[12] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. ACM Press/Addison-Wesley Publishing Co., 1999.
[13] G. Brazil, X. Yin, and X. Liu. Illuminating pedestrians via simultaneous detection and segmentation. In ICCV, 2017.
[14] A. Bulat and G. Tzimiropoulos. Super-FAN: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. In CVPR, 2018.
[15] K. Cao, Y. Rong, C. Li, X. Tang, and C. C. Loy. Pose-robust face recognition via deep residual equivariant mapping. In CVPR, 2018.
[16] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In FG, 2018.
[17] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun. A practical transfer learning algorithm for face verification. In ICCV, 2013.
[18] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, 2012.
[19] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, 2013.
[20] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In ECCV, 2014.
[21] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In WACV, 2016.
[22] J.-C. Chen, J. Zheng, V. M. Patel, and R. Chellappa. Fisher vector encoded deep convolutional features for unconstrained face verification. In ICIP, 2016.
[23] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[24] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang. Fsrnet: End-to-end learning face super-resolution with facial priors. In CVPR, 2018.
[25] Y. Cheng, J. Zhao, Z. Wang, Y. Xu, K. Jayashree, S. Shen, and J. Feng. Know you at one glance: A compact vector representation for low-shot learning. In ICCV workshop, 2017.
[26] Z. Cheng, X. Zhu, and S. Gong. Surveillance face recognition challenge. arXiv preprint arXiv:1804.09691, 2018.
[27] J. Choe, S. Park, K. Kim, J. Hyun Park, D. Kim, and H. Shim. Face generation for low-shot learning using generative adversarial networks. In ICCV workshop, 2017.
[28] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[29] B. Chu, S. Romdhani, and L. Chen. 3D-aided face recognition robust to expression and pose variations. In CVPR, 2014.
[30] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[31] N. Crosswhite, J. Byrne, C. Stauffer, O. Parkhi, Q. Cao, and A. Zisserman. Template adaptation for face verification and identification. In FG, 2017.
[32] J. Daugman. How iris recognition works. In The Essential Guide to Image Processing, pages 715–739. 2009.
[33] J. Deng, J. Guo, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698, 2018.
[34] O. Déniz, G. Bueno, J. Salido, and F. De la Torre. Face recognition using histograms of oriented gradients. Pattern Recognition Letters, 32(12):1598–1603, 2011.
[35] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015.
[36] C. Ding and D. Tao. A comprehensive survey on pose-invariant face recognition. ACM Transactions on Intelligent Systems and Technology (TIST), 2016.
[37] C. Ding, C. Xu, and D. Tao. Multi-task pose-invariant face recognition. IEEE Transactions on Image Processing (TIP), 2015.
[38] L. El Shafey, C. McCool, R. Wallace, and S. Marcel. A scalable formulation of probabilistic linear discriminant analysis: Applied to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[39] J. F. Fagan III. Infants' recognition of invariant features of faces. Child Development, 1976.
[40] C. Ferrari, G. Lisanti, S. Berretti, and A. Bimbo. Effective 3D based frontalization for unconstrained face recognition. In ICPR, 2016.
[41] H. Gao, H. Ekenel, and R. Stiefelhagen. Pose normalization for local appearance-based face recognition. Advances in Biometrics, 2009.
[42] S. Gong, V. N. Boddeti, and A. K. Jain. On the capacity of face representation. arXiv preprint arXiv:1709.10433, 2017.
[43] S. Gong, V. N. Boddeti, and A. K. Jain. On the intrinsic dimensionality of face representation. arXiv preprint arXiv:1803.09672, 2018.
[44] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[45] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 2010.
[46] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein GANs. In NIPS, 2017.
[47] M. Günther, P. Hu, C. Herrmann, C. H. Chan, M. Jiang, S. Yang, A. R. Dhamija, D. Ramanan, J. Beyerer, J. Kittler, et al. Unconstrained face detection and open-set face recognition challenge. In IJCB, 2017.
[48] Y. Guo and L. Zhang. One-shot face recognition by promoting underrepresented classes. arXiv preprint arXiv:1707.05574, 2017.
[49] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.
[50] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[51] N. Hadad, L. Wolf, and M. Shahar. A two-step disentanglement method. In CVPR, 2018.
[52] H. Han, S. Shan, X. Chen, S. Lao, and W. Gao. Separability oriented preprocessing for illumination-insensitive face recognition. In ECCV, 2012.
[53] H. Han, S. Shan, L. Qing, X. Chen, and W. Gao. Lighting aware preprocessing for face recognition across varying illumination. In ECCV, 2010.
[54] B. Hariharan and R. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.
[55] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015.
[56] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[57] X. He, Y. Zhou, Z. Zhou, S. Bai, and X. Bai. Triplet-center loss for multi-view 3d object retrieval. In CVPR, 2018.
[58] E. Hjelmås and B. K. Low. Face detection: A survey. Computer Vision and Image Understanding, 83(3):236–274, 2001.
[59] G. V. Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
[60] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.
[61] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[62] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[63] R. Huang, S. Zhang, T. Li, R. He, et al. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In ICCV, 2017.
[64] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[65] A. K. Jain and S. Z. Li. Handbook of face recognition. Springer, 2011.
[66] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In ICML, 2009.
[67] H. Jiang and E. Learned-Miller. Face detection with the faster R-CNN. In FG, 2017.
[68] A. Jourabloo and X. Liu. Pose-invariant 3D face alignment. In ICCV, 2015.
[69] A. Jourabloo and X. Liu. Large-pose face alignment via CNN-based dense 3D model fitting. In CVPR, 2016.
[70] A. Jourabloo and X. Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. International Journal of Computer Vision, 124(2):187–203, 2017.
[71] M. Kan, S. Shan, H. Chang, and X. Chen. Stacked progressive auto-encoders (SPAE) for face recognition across poses. In CVPR, 2014.
[72] M. Kan, S. Shan, and X. Chen. Multi-view deep network for cross-view classification. In CVPR, 2016.
[73] N. Kanwisher, J. McDermott, and M. M. Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11):4302–4311, 1997.
[74] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
[75] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
[76] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[77] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, 2015.
[78] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[79] Y. LeCun, C. Cortes, and C. J.C. Burges. The MNIST database of handwritten digits. Technical report, 1998.
[80] J. Z. Leibo, J. Mutch, and T. Poggio. Why the brain separates face recognition from object recognition. In NIPS, 2011.
[81] A. Li, S. Shan, X. Chen, and W. Gao. Maximizing intra-individual correlations for face recognition across pose differences. In CVPR, 2009.
[82] S. Li, X. Liu, X. Chai, H. Zhang, S. Lao, and S. Shan. Morphable displacement field based image matching for face recognition across pose. In ECCV, 2012.
[83] Y. Li, B. Zhang, S. Shan, X. Chen, and W. Gao. Bagging based efficient kernel fisher discriminant analysis for face recognition. In ICPR, 2006.
[84] S. Liao, A. K. Jain, and S. Z. Li. Partial face recognition: Alignment-free approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.
[85] G. Lin, A. Milan, C. Shen, and I. D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
[86] E. Littwin and L. Wolf. The multiverse loss for robust transfer learning. In CVPR, 2016.
[87] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[88] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.
[89] Y. Liu, F. Wei, J. Shao, L. Sheng, J. Yan, and X. Wang. Exploring disentangled feature representation beyond face identification. In CVPR, 2018.
[90] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[91] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[92] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
[93] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[94] D. Maltoni, D. Maio, A. K. Jain, and S. Prabhakar. Handbook of fingerprint recognition. Springer Science & Business Media, 2009.
[95] I. Masi, S. Rawls, G. Medioni, and P. Natarajan. Pose-aware face recognition in the wild. In CVPR, 2016.
[96] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[97] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[98] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
[99] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[100] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
[101] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
[102] X. Peng, X. Yu, K. Sohn, D. N. Metaxas, and M. Chandraker. Reconstruction-based disentanglement for pose-invariant face recognition. In ICCV, 2017.
[103] T. K. Perrachione, S. N. Del Tufo, and J. D. Gabrieli. Human voice recognition depends on language ability. Science, 333(6042):595–595, 2011.
[104] C. Qi and F. Su. Contrastive-center loss for deep neural networks. In ICIP, 2017.
[105] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[106] R. Ranjan, C. D. Castillo, and R. Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
[107] R. Ranjan, V. M. Patel, and R. Chellappa. HyperFace: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. arXiv preprint arXiv:1603.01249, 2016.
[108] S. A. Rizvi, P. J. Phillips, and H. Moon. The FERET verification testing protocol for face recognition algorithms. In FG, 1998.
[109] J. Roth, X. Liu, and D. Metaxas. On continuous user authentication via typing behavior. IEEE Transactions on Image Processing (TIP), 23(10):4611–4624, 2014.
[110] J. Roth, Y. Tong, and X. Liu. Unconstrained 3D face reconstruction. In CVPR, 2015.
[111] J. Rupnik and J. Shawe-Taylor. Multi-view canonical correlation analysis. In Conference on Data Mining and Data Warehouses (SiKDD), 2010.
[112] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[113] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 Faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV workshop, 2013.
[114] S. Sankaranarayanan, A. Alavi, C. Castillo, and R. Chellappa. Triplet probabilistic embedding for face verification and clustering. arXiv preprint arXiv:1604.05417, 2016.
[115] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[116] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
[117] X. Shi, S. Shan, M. Kan, S. Wu, and X. Chen. Real-time rotation-invariant face detection with progressive calibration networks. In CVPR, 2018.
[118] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. Fisher vector faces in the wild. In BMVC, 2013.
[119] E. Smirnov, A. Melnikov, S. Novoselov, E. Luckyanets, and G. Lavrentyeva. Doppelganger mining for face representation learning. In ICCV workshop, 2017.
[120] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[121] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[122] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
[123] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[124] Y. Sun, D. Liang, X. Wang, and X. Tang. DeepID3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
[125] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In CVPR, 2014.
[126] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, 2015.
[127] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[128] O. Tadmor, Y. Wexler, T. Rosenwein, S. Shalev-Shwartz, and A. Shashua. Learning a metric embedding for face recognition using the multibatch method. In NIPS, 2016.
[129] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[130] W. Tan, B. Yan, and B. Bare. Feature super-resolution: Make machine see more clearly. In CVPR, 2018.
[131] Y. Tian, P. Luo, X. Wang, and X. Tang. Pedestrian detection aided by deep learning semantic tasks. In CVPR, 2015.
[132] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
[133] D. Y. Tsao, W. A. Freiwald, R. B. Tootell, and M. S. Livingstone. A cortical region consisting entirely of face-selective cells. Science, 311(5761):670–674, 2006.
[134] C. Turati, H. Bulf, and F. Simion. Newborns' face recognition over changes in viewpoint. Cognition, 2008.
[135] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In CVPR, 1991.
[136] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In NIPS, 2016.
[137] D. Wang, C. Otto, and A. K. Jain. Face search at scale. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[138] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: l_2 hypersphere embedding for face verification. arXiv preprint arXiv:1704.06369, 2017.
[139] H. Wang, Y. Wang, Z. Zhou, X. Ji, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[140] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang. Studying very low resolution recognition using deep networks. In CVPR, 2016.
[141] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
[142] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987.
[143] X. Wu, R. He, Z. Sun, and T. Tan. A light CNN for deep face representation with noisy labels. arXiv preprint arXiv:1511.02683, 2015.
[144] Y. Wu, H. Liu, and Y. Fu. Low-shot face recognition with hybrid classifiers. In ICCV workshop, 2017.
[145] C. Xiong, X. Zhao, D. Tang, K. Jayashree, S. Yan, and T.-K. Kim. Conditional convolutional neural network for modality-aware face recognition. In ICCV, 2015.
[146] X. Xiong and F. D. la Torre. Supervised descent method and its applications to face alignment. In CVPR, 2013.
[147] D. Xu, S. Yan, D. Tao, S. Lin, and H.-J. Zhang. Marginal fisher analysis and its variants for human gait recognition and content-based image retrieval. IEEE Transactions on Image Processing (TIP), 16(11):2811–2821, 2007.
[148] Y. Xu, Y. Cheng, J. Zhao, Z. Wang, L. Xiong, K. Jayashree, H. Tamura, T. Kagaya, S. Pranata, S. Shen, et al. High performance large scale face recognition with multi-cognition softmax and feature retrieval. In ICCV workshop, 2017.
[149] J. Yang, P. Ren, D. Chen, F. Wen, H. Li, and G. Hua. Neural aggregation network for video face recognition. In CVPR, 2017.
[150] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
[151] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In CVPR, 2015.
[152] B. Yin, L. Tran, H. Li, X. Shen, and X. Liu. Towards interpretable face recognition. arXiv preprint arXiv:1805.00611, 2018.
[153] X. Yin and X. Liu. Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing (TIP), 2018.
[154] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In ICCV, 2017.
[155] X. Yu, B. Fernando, R. Hartley, and F. Porikli. Super-resolving very low-resolution face images with supplementary attributes. In ECCV, 2018.
[156] C. Zhang and Z. Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. In WACV, 2014.
[157] H. Zhang, Y. Zhang, and T. S. Huang. Pose-robust face recognition via sparse representation. Pattern Recognition, 2013.
[158] L. Zhang, L. Lin, X. Liang, and K. He. Is faster R-CNN doing well for pedestrian detection? In ECCV, 2016.
[159] Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. 2018.
[160] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao. Range loss for deep face recognition with long-tailed training data. In CVPR, 2017.
[161] Y. Zhang, M. Shao, E. K. Wong, and Y. Fu. Random faces guided sparse many-to-one encoder for pose-invariant face recognition. In ICCV, 2013.
[162] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
[163] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
[164] J. Zhao, Y. Cheng, Y. Xu, L. Xiong, J. Li, F. Zhao, K. Jayashree, S. Pranata, S. Shen, J. Xing, et al. Towards pose invariant face recognition in the wild. In CVPR, 2018.
[165] Y. Zheng, D. K. Pal, and M. Savvides. Ring loss: Convex feature normalization for face recognition. In CVPR, 2018.
[166] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.
[167] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[168] Z. Zhu, P. Luo, X. Wang, and X. Tang. Deep learning identity-preserving face space. In ICCV, 2013.
[169] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: A deep model for learning face identity and view representations. In NIPS, 2014.
[170] W. W. Bledsoe. The model method in facial recognition. Panoramic Research Inc., Palo Alto, CA, Rep. PR1, 15(47):2, 1966.