TOWARDS INTERPRETABLE FACE RECOGNITION

By

Bangjie Yin

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Master of Science

2019

ABSTRACT

TOWARDS INTERPRETABLE FACE RECOGNITION

By

Bangjie Yin

Deep CNNs have been pushing the frontier of visual recognition over the past years. Beyond recognition accuracy, the strong demand for understanding deep CNNs in the research community has motivated the development of tools that dissect pre-trained models to visualize how they make predictions. Recent works push interpretability further into the network learning stage to learn more meaningful representations. In this work, focusing on a specific area of visual recognition, we report our efforts towards interpretable face recognition. We propose a spatial activation diversity loss to learn more structured face representations. By leveraging this structure, we further design a feature activation diversity loss to push the interpretable representations to be discriminative and robust to occlusions. We demonstrate on three face recognition benchmarks that our proposed method achieves state-of-the-art face recognition accuracy with easily interpretable face representations.

Copyright by
BANGJIE YIN
2019

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the help of many people. I am very honored to have Dr. Xiaoming Liu as my advisor. His expectations and encouragement have made me achieve more than I could ever have imagined. The time we spent discussing experiments, brainstorming, and polishing papers has refined my skills in critical thinking, presentation, and writing. By setting himself as an example, he has taught me what a good researcher should be like.

I am grateful to my labmates, Yousef Atoum, Xi Yin, Amin Jourabloo, Luan Tran, Garrick Brazil, Yaojie Liu, Joel Stehouwer, Shengjie Zhu, and Masa Hu. The valuable comments in paper reviews, the willingness to help, the encouragement when I was in a bad mood, and the time spent together have made this a very pleasant journey.

Thanks to my friends at Michigan State University, Tony, Zhongzheng, Bohao, Hieu, Lisheng, Xiaoyan, Ding, and Zhiming, for their company that kept my mind refreshed.

Finally, I would like to thank my parents, who have taught me to be brave, positive, and kindhearted. Thanks to my wife Jiajia for her long-term support of my career and life.

TABLE OF CONTENTS

LIST OF TABLES . . . vii
LIST OF FIGURES . . . viii
CHAPTER 1 INTRODUCTION . . . 1
CHAPTER 2 RELATED WORK . . . 4
  2.1 Interpretable Representation Learning . . . 4
  2.2 Parts and Occlusion in Face Recognition . . . 5
  2.3 Occlusion Handling with CNNs . . . 6
CHAPTER 3 PROPOSED METHOD . . . 7
  3.1 Spatial Activation Diversity Loss . . . 8
    3.1.1 Spatial Activation Diversity Loss . . . 8
    3.1.2 Our Proposed Modifications . . . 9
  3.2 Feature Activation Diversity Loss . . . 10
  3.3 Implementation Details . . . 11
CHAPTER 4 EXPERIMENTAL RESULTS . . . 14
  4.1 Experimental Settings . . . 14
    4.1.1 Introduction . . . 14
    4.1.2 Database . . . 14
  4.2 Ablation Study . . . 15
    4.2.1 Different Thresholds . . . 15
    4.2.2 Different Occlusions and Dynamic Window Size . . . 17
    4.2.3 Spatial vs. Feature Diversity Loss . . . 17
  4.3 Qualitative Evaluation . . . 17
    4.3.1 Spreadness of Average Locations of Filter Response . . . 17
    4.3.2 Mean Feature Difference Comparison . . . 19
    4.3.3 Visualization on Feature Difference Vectors . . . 20
    4.3.4 Filter Response Visualization . . . 20
  4.4 Quantitative Evaluation on Benchmark . . . 21
    4.4.1 Standard Deviation of Peaks . . . 21
    4.4.2 Generic in-the-wild faces . . . 22
    4.4.3 Occlusion faces . . . 24
  4.5 Other Applications . . . 26
    4.5.1 Partial face retrieval . . . 26
    4.5.2 Occlusion detection . . . 28
CHAPTER 5 CONCLUSIONS . . . 31
CHAPTER 6 RESNET50 . . . 32
  6.1 The network structure of our modified ResNet50 . . . 32
BIBLIOGRAPHY . . . 34

LIST OF TABLES

Table 3.1: The structures of our network architecture. . . . 12
Table 4.1: Ablation study on IJB-A database. . . . 16
Table 4.2: Comparison of selected filter numbers. . . . 16
Table 4.3: Compare standard deviations of peaks with varying d. . . . 19
Table 4.4: Comparison on IJB-A database. . . . 23
Table 4.5: Comparison on IJB-C database. . . . 24
Table 4.6: Comparison on IJB-A database with synthetic occlusions. . . . 25
Table 4.7: Comparison on IJB-A database with natural occlusions. . . . 25
Table 4.8: Comparison on IJB-C database with natural occlusions. . . . 25
Table 6.1: The structures of the modified ResNet50. . . . 32
LIST OF FIGURES

Figure 1.1: An example of the behaviors of an interpretable face recognition system. . . . 2
Figure 3.1: Overall network architecture of the proposed method. . . . 7
Figure 3.2: With barycentric coordinates, we warp the vertices of the template face mask to each image within the 64-image mini-batch. . . . 13
Figure 4.1: Example faces from (a) IJB-A, (b) IJB-C and (c) AR face databases. . . . 15
Figure 4.2: The average locations of positive and negative peak responses of 320 filters for three models. . . . 19
Figure 4.3: Mean of feature difference on two occluded parts. . . . 20
Figure 4.4: The correspondence between feature difference magnitude and occlusion locations. . . . 21
Figure 4.5: Visualization of filter response "heat maps" of 10 different filters. . . . 22
Figure 4.6: Histograms of standard deviations of peak locations for positive and negative responses. . . . 23
Figure 4.7: ROC curves of different models on AR database. . . . 26
Figure 4.8: Partial face retrieval with mouth (left), and nose (right). . . . 26
Figure 4.9: The overall framework of partial face retrieval. . . . 27
Figure 4.10: The framework of occlusion detection on AR database. . . . 28
Figure 6.1: The block setting. . . . 33

CHAPTER 1

INTRODUCTION

In the era of deep learning, one major focus in the research community has been on designing network architectures and objective functions towards discriminative feature learning He et al. (2016); Iandola et al. (2014); Lin et al. (2017); Wen et al. (2016); Liu et al. (2017b); Tran et al. (2017a). Meanwhile, given its superior, even surpassing-human, recognition accuracy He et al. (2015); Lu & Tang (2015), there is a strong demand from both researchers and general audiences to interpret its successes and failures Goodfellow et al. (2014); Olah et al. (2018), and to understand, improve, and trust its decisions. Increased interest in visualizing CNNs has led to a set of useful tools that dissect their prediction paths to identify the important visual cues Olah et al. (2018).
While it is interesting to see the visual evidence for predictions from pre-trained models, it is even more interesting to guide the learning towards better interpretability.

CNNs trained towards discriminative classification may learn filters with wide-spreading attentions, which are usually hard for humans to interpret. Prior work has even empirically demonstrated that models and humans attend to different image areas in visual understanding Das et al. (2017). Without a design that harnesses interpretability, even when filters are observed to actively respond to a certain local structure across several images, there is nothing preventing them from simultaneously capturing a different structure; and the same structure may activate other filters too. One potential solution to address this issue is to provide annotations to learn locally activated filters and construct a structured representation from the bottom up. However, in practice, this is rarely feasible. Manual annotations are expensive to collect, difficult to define in certain tasks, and sub-optimal compared with end-to-end learned filters. A desirable solution would keep the end-to-end training pipeline intact and encourage interpretability with a model-agnostic design. However, in the recent interpretable CNNs Zhang et al. (2017), where filters are trained to represent object parts to make the network representation interpretable, they observe degraded recognition accuracy after introducing interpretability. While the work is seminal and inspiring, this drawback largely limits its practical applicability.

Figure 1.1: An example of the behaviors of an interpretable face recognition system: the leftmost column shows three faces of the same identity and the right six columns show responses from six filters; each filter captures a clear and consistent semantic face part, e.g., eyes, nose, and jaw; heavy occlusions, such as eyeglasses or a scarf, alter the responses of the corresponding filters and make the responses more scattered, as shown in the red bounding boxes.

In this paper, we study face recognition and strive to learn an interpretable face representation (Fig. 1.1). We define interpretability as follows: when each dimension of the representation represents a face structure or a face part, the face representation is of higher interpretability. Although the concept of part-based representations has been around Li et al. (2001); Felzenszwalb et al. (2008); Berg & Belhumeur (2013); Li & Hua (2017), prior methods are not easily applicable to deep CNNs. Especially in face recognition, as far as we know, this problem is rarely addressed in the literature. In our method, the filters are learned end-to-end from data and constrained to be locally activated with the proposed spatial activation diversity loss. We further introduce a feature activation diversity loss to better align filter responses across faces and encourage filters to capture more discriminative visual cues for face recognition, especially occluded face recognition. Compared with the interpretable CNNs from Zhang et al. Zhang et al. (2017), our final face representation does not compromise recognition accuracy; instead, it achieves improved performance as well as enhanced robustness to occlusion. We empirically evaluate our method on three face recognition benchmarks with detailed ablation studies on the proposed objective functions.
To summarize, our contributions in this paper are threefold: 1) we propose a spatial activation diversity loss to encourage learning interpretable face representations; 2) we introduce a feature activation diversity loss to enhance discrimination and robustness to occlusions, which promotes the practical value of interpretability; 3) we demonstrate superior interpretability, while achieving improved or similar face recognition performance on three face recognition benchmarks, compared to base CNN architectures.

CHAPTER 2

RELATED WORK

2.1 Interpretable Representation Learning

Understanding visual recognition has a long history in computer vision Mahendran & Vedaldi (2016); Sudderth et al. (2005); Juneja et al. (2013); Singh et al. (2012); Parikh & Zitnick (2011). In the early days, when most models used hand-crafted features, a number of studies focused on how to interpret the predictions. Back then, visual cues included image patches Juneja et al. (2013), body parts Yao et al. (2011), face parts Li & Hua (2017), or mid-level representations Singh et al. (2012), contingent on the task. For example, Vondrick et al. Vondrick et al. (2013) developed HOGgles to visualize HOG descriptors in object detection. Since features such as SIFT Lowe (2004) and LBP Ahonen et al. (2006) are extracted from image patches and serve as building blocks in the recognition pipeline, it was intuitive to describe the process at the level of patches.

With the more complicated CNNs, new tools are needed to dissect their predictions. Early works include direct visualization of the filters Zeiler & Fergus (2014), deconvolutional networks to reconstruct inputs from different layers Zeiler et al. (2011), gradient-based methods to generate novel inputs that maximize certain neurons Nguyen et al. (2015), etc. Recent efforts along this line include CAM Zhou et al. (2016), which leverages the global max pooling layer to visualize dimensions of the representation, and Grad-CAM Selvaraju et al. (2016), which relaxes the constraints on the network with a general framework to visualize any convolution filters. While our method is related to visualization of CNNs, and we leverage such tools to visualize our learned filters, visualization is not the focus of this paper. Visualizing a CNN is a good way to interpret the network, but by itself it does not make the network more interpretable. Attention models Xu et al. (2015) have been used in image caption generation. With the attention mechanism, their model can push the feature maps to respond separately to each predicted caption word, which is seemingly close to our idea, but it requires a large amount of labeled data for training.

One recent work on learning a more meaningful representation is the interpretable CNNs Zhang et al. (2017). In their method, they design two losses to regularize the training of late-stage convolutional filters: one to encourage each filter to encode a distinctive object part and another to push it to respond to only one local region. AnchorNet Novotny et al. (2017) adopts a similar idea, encouraging orthogonality of the filters and filter responses so that each filter is activated by a local and consistent structure. In our method, we extend the ideas in AnchorNet with new aspects for face recognition in designing our spatial activation diversity loss. Another line of research in learning interpretable representations is also referred to as feature disentangling, e.g., InfoGAN Chen et al. (2016), face editing Shu et al. (2017), 3D face recognition Liu et al.
(2018), and face modeling Tran & Liu (2018). These methods intend to factorize the latent representation to describe the inputs from different aspects, a direction that largely diverges from our goal in this paper.

2.2 Parts and Occlusion in Face Recognition

Face recognition is extensively studied in computer vision Learned-Miller et al. (2016); O'Toole et al. (2018). Early works constructing meaningful representations for face recognition were mostly intended to improve the recognition accuracy. Some face representations are composed from face parts. The part-based models are either learned unsupervised from data Li et al. (2013) or specified by manually annotated landmarks Cao et al. (2010). Besides local parts, face attributes are also interesting elements with which to build face representations. Kumar et al. (2009) proposed to encode a face image with scores from attribute classifiers and demonstrated improved verification performance before the deep learning era. In this paper, we propose to learn meaningful part-based face representations with a deep CNN, where the face part filters are learned with carefully designed losses. We demonstrate how we leverage the interpretable representation for occlusion-robust face recognition. Prior methods addressing pose variations in face recognition Li et al. (2013); Cao et al. (2010); Tran et al. (2017b); Chai et al. (2007); Yin et al. (2017); Yin & Liu (2018) can be related, since pose changes may lead to self-occlusions. However, in this work, we are more interested in the more explicit situations where faces are occluded by hands, sunglasses, and other objects. Interestingly, this specific aspect is rarely studied with CNNs. Cheng et al. (2015) propose to restore occluded faces with a deep auto-encoder for improved recognition accuracy. Zhou et al. (2015) argue that naively training a high-capacity network with sufficient coverage in the training data can achieve superior recognition performance. In our experiments, we indeed observe improved recognition accuracy on occluded faces after augmenting the training data with synthetically occluded faces. However, with the proposed method, we can further improve robustness to occlusion without increasing network capacity, which highlights the merits of an interpretable representation.

2.3 Occlusion Handling with CNNs

Different methods have been proposed to handle occlusion with CNNs for robust object detection and recognition. Wang et al. (2017) learn an object detector by generating an occlusion mask for each object, which synthesizes harder samples for the adversarial network. In Singh & Lee, occlusion masks are utilized to force the network to pay attention to different parts of the objects. Ge et al. (2017) address face detection under heavy occlusions by proposing a masked face dataset and applying it to their proposed LLE-CNNs. Although we also use masked images, our occlusion robustness mainly comes from enforcing constraints on the spreadness of the feature activations and guiding the network to extract features from different parts of the face.

CHAPTER 3

PROPOSED METHOD

Our network architecture in training is shown in Fig. 3.1. From a high-level perspective, we construct a Siamese network with two branches sharing weights to learn face representations from two faces: one with synthetic occlusion and one without. We would like to learn a set of diverse filters F, which apply to a hyper-column descriptor Φ consisting of features at multiple semantic levels.
The proposed Spatial Activation Diversity (SAD) loss encourages the face representation to be structured with consistent semantic meaning. A softmax loss helps encode the identity information. The input to the lower network branch is a synthetically occluded version of the upper branch's input. The proposed Feature Activation Diversity (FAD) loss requires filters to be insensitive to the occluded part, hence more robust to occlusion. At the same time, we mask out the parts of the face representation sensitive to the occlusion and train to identify the input face solely based on the remaining elements. As a result, the filters responding to non-occluded parts are trained to capture more discriminative cues for identification.

Figure 3.1: Overall network architecture of the proposed method.

3.1 Spatial Activation Diversity Loss

Novotny et al. (2017) proposed a diversity loss for semantic matching by penalizing correlations among filter weights and their responses. While their idea is general enough to extend to face representation learning, in practice their design is not directly applicable due to the prohibitively large number of identities (classes) in face recognition. Their approach also suffers from degradation in recognition accuracy. We first introduce their diversity loss and then describe our proposed modifications tailored to face recognition.

3.1.1 Spatial Activation Diversity Loss

For each of the K classes in the training set, Novotny et al. (2017) proposed to learn a set of diverse filters with discriminative power to distinguish an object of the category from background images. The filters F apply to a hypercolumn descriptor Φ(I), created by concatenating the filter responses of an image I at different convolutional layers Hariharan et al. (2015). This helps F aggregate features at different semantic levels. The response map of this operation is denoted as ψ(I) = F ∗ Φ(I). The diversity constraint is implemented by two diversity losses, L^{filter}_{SAD} and L^{response}_{SAD}, encouraging the orthogonality of the filters and of their responses, respectively. L^{filter}_{SAD} makes filters orthogonal by penalizing their correlations:

\mathcal{L}^{filter}_{SAD}(\mathbf{F}) = \sum_{i \neq j} \Big| \sum_{p} \frac{\langle \mathbf{F}^p_i, \mathbf{F}^p_j \rangle}{\|\mathbf{F}^p_i\|_F \, \|\mathbf{F}^p_j\|_F} \Big|,   (3.1)

where F^p_i is the column of filter F_i at the spatial location p. Note that orthogonal filters are likely to respond to different image structures, but this is not necessarily the case. Thus, the second term L^{response}_{SAD} is introduced to directly decorrelate the filters' response maps ψ_k(I):

\mathcal{L}^{response}_{SAD}(\mathbf{I}; \Phi, \mathbf{F}) = \Big\| \sum_{i \neq j} \frac{\langle \psi_i, \psi_j \rangle}{\|\psi_i\|_F \, \|\psi_j\|_F} \Big\|_2.   (3.2)

This term is further regularized by using the smoothed response maps ψ′(I) ≜ g_σ ∗ ψ(I) in place of ψ(I) when computing L^{response}_{SAD}. Here, a channel-wise Gaussian kernel g_σ is applied to encourage filter responses to spread farther apart by dilating their activations.
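To make the two SAD terms concrete, below is a minimal PyTorch sketch of how they could be computed. The thesis does not provide code, so the tensor shapes, the Gaussian-smoothing details, and the exact reduction over filter pairs are assumptions rather than the authors' implementation.

```python
# A minimal sketch of the two SAD terms, assuming filters of shape (N_f, C, k, k)
# and response maps psi of shape (B, N_f, H, W).
import torch
import torch.nn.functional as F_nn  # avoid clashing with the filter tensor name F


def filter_diversity_loss(filters):
    """Penalize correlations between columns of different filters (cf. Eq. 3.1)."""
    n_f, c, k, _ = filters.shape
    # Treat each spatial location p as a C-dim column: (N_f, k*k, C).
    cols = filters.view(n_f, c, k * k).permute(0, 2, 1)
    cols = F_nn.normalize(cols, dim=2)                  # unit-norm columns
    # Correlation between filters i and j, accumulated over spatial locations p.
    corr = torch.einsum('ipc,jpc->ij', cols, cols)
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().sum()


def response_diversity_loss(psi, sigma=1.0, kernel_size=5):
    """Decorrelate smoothed response maps of different filters (cf. Eq. 3.2)."""
    b, n_f, h, w = psi.shape
    # Channel-wise Gaussian smoothing g_sigma * psi (dilates the activations).
    xs = torch.arange(kernel_size, dtype=psi.dtype, device=psi.device) - kernel_size // 2
    g1d = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    g2d = (g1d[:, None] * g1d[None, :]) / (g1d.sum() ** 2)
    kernel = g2d.repeat(n_f, 1, 1, 1)                   # (N_f, 1, k, k), one per channel
    psi_s = F_nn.conv2d(psi, kernel, padding=kernel_size // 2, groups=n_f)
    # Normalize each response map and accumulate pairwise correlations.
    flat = F_nn.normalize(psi_s.view(b, n_f, -1), dim=2)
    corr = torch.einsum('bif,bjf->bij', flat, flat)
    eye = torch.eye(n_f, device=psi.device).unsqueeze(0)
    return (corr * (1 - eye)).abs().sum() / b
```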
3.1.2 Our Proposed Modifications

Novotny et al. (2017) learn K sets of filters, one for each of the K categories. The discrimination of the features is maintained by K binary classification losses, each separating one category from background images. Their discriminative loss enhances (or suppresses) the maximum value in the response maps ψ_k for the positive (or negative) class, and the final feature representation f is obtained via a global max-pooling operation on ψ. This design is not applicable to a face classification CNN, as the number of identities K is usually prohibitively large (on the order of ten thousand or above). Here, to make the feature discriminative, we learn only one set of filters and connect the representation f(I) directly to a K-way softmax classification:

L_id = − log(P_c(f(I))).   (3.3)

Here we minimize the negative log-likelihood of the feature f(I) being classified to its ground-truth identity c.

Furthermore, global max-pooling can lead to unsatisfactory recognition performance, as shown in Novotny et al. (2017), where they observed minor performance degradation compared to the model without their diversity loss. One empirical explanation of this degradation is that max-pooling has a similar effect to the ReLU activation, which biases the response distribution to the non-negative range [0, +∞) and hence significantly limits the feasible learning space. Most recent works instead use global average pooling Yi et al. (2014); Tran et al. (2017b). However, when applying average pooling to introduce interpretability, it does not promote the desired spatially peaky distribution. Empirically, we found that the feature response maps learned with average pooling fail to have strong activations in small local regions.

We therefore propose to design a pooling operation that satisfies two objectives: i) promote a peaky distribution that cooperates well with the spatial activation diversity loss; ii) maintain the statistics of the feature responses so that global average pooling still achieves good recognition performance. Based on these considerations, we propose an operation termed Large Magnitude Filtering (LMF), as follows: for each channel in the feature response map, we set the d% of elements with the smallest magnitude to 0. The size of the output remains the same. The L^{response}_{SAD} loss is applied to the modified response map ψ′(I) ≜ g_σ ∗ LMF(ψ(I)) in place of ψ(I) in Eqn. 3.2. Then, the conventional global average pooling is applied to LMF(ψ(I)) to obtain the final representation f(I). By removing small-magnitude values from ψ_k, f is not affected much by global average pooling, which favors discriminative feature learning. On the other hand, the peaks of the response maps are still well maintained, which leads to a more reliable computation of the diversity loss.
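The LMF operation itself is simple; one possible realization is sketched below, assuming response maps of shape (batch, N_f, 24, 24). This is an illustrative sketch, not the authors' code.

```python
# A possible realization of Large Magnitude Filtering (LMF): zero out the d
# fraction of smallest-magnitude responses in each channel, keep the size.
import torch


def lmf(psi, d=0.9583):
    """psi: (B, N_f, H, W) response maps; d: fraction of elements to zero out."""
    b, n_f, h, w = psi.shape
    flat = psi.view(b, n_f, h * w)
    k = int(round((1.0 - d) * h * w))            # number of elements to keep
    # Per-channel threshold: the k-th largest magnitude.
    thresh = flat.abs().topk(k, dim=2).values[..., -1:]
    mask = (flat.abs() >= thresh).float()
    return (flat * mask).view(b, n_f, h, w)


def representation(psi, d=0.9583):
    # Global average pooling on LMF(psi) gives the final representation f(I).
    # With d = 95.83% and 24x24 maps, only 24 of the 576 responses per channel survive.
    return lmf(psi, d).mean(dim=(2, 3))
```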
3.2 Feature Activation Diversity Loss

One way to evaluate whether the diversity loss is effective is to compute the average location of the peaks within the k-th response map ψ′_k(I) over an image set. If the average locations across the filters spread all over the face spatially, the diversity loss is functioning well and can associate each filter with a specific face area. With the SAD loss, we do observe improved spreadness compared to the base CNN model trained without it. Since we believe that more spreadness indicates higher interpretability, we hope to further boost the spreadness of the average peak locations across filters, i.e., across elements of the learned representation. Motivated by the goal of learning part-based face representations, it is desirable to encourage that any local face area only affects a small subset of the filter responses. To fulfill this desire, we propose to create synthetic occlusions on local areas of a face image, and constrain the difference between its feature response and that of the non-occluded original image.

The second motivation for our proposal is to design an occlusion-robust face recognition algorithm, which, in our view, should be a natural by-product of the part-based face representation. With this in mind, we propose a Feature Activation Diversity (FAD) loss to encourage the network to learn filters robust to occlusions. That is, occlusion in a local region should only affect a small subset of elements within the representation. Specifically, leveraging pairs of face images I and Î, where Î is a version of I with a synthetically occluded region, we enforce the majority of the two feature representations, f(I) and f(Î), to be similar:

\mathcal{L}_{FAD}(\mathbf{I}, \hat{\mathbf{I}}) = \sum_{i} \big| \tau_i(\mathbf{I}, \hat{\mathbf{I}}) \big[ f_i(\mathbf{I}) - f_i(\hat{\mathbf{I}}) \big] \big|,   (3.4)

where the feature selection mask τ(I, Î) is defined with a threshold t: τ_i(I, Î) = 1 if |f_i(I) − f_i(Î)| < t, otherwise τ_i(I, Î) = 0. There are multiple design choices for the threshold: number-of-elements based or value based. We evaluate and discuss these choices in the experiments.

We also would like to correctly classify occluded images using just the subset of feature elements that is insensitive to occlusion. Hence, the softmax identity loss in the occlusion branch is applied to the masked feature:

\mathcal{L}^{occluded}_{id} = -\log\big(P_c(\tau(\mathbf{I}, \hat{\mathbf{I}}) \odot \mathbf{f}(\hat{\mathbf{I}}))\big).   (3.5)

By sharing the classifier's weights between the two branches, the classifier is learned to be more robust to occlusion. It also leads to a better representation, as filters responding to non-occluded parts need to be more discriminative.
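A minimal sketch of the FAD loss and the masked identity loss follows. The number-of-elements-based mask shown here is one of the two threshold choices discussed above, and the classifier interface is an assumption for illustration.

```python
# A sketch of the FAD loss (Eq. 3.4) and the masked identity loss (Eq. 3.5).
import torch
import torch.nn.functional as F_nn


def fad_loss(f_clean, f_occ, t=260):
    """Encourage the t elements with the smallest |difference| to stay similar."""
    diff = (f_clean - f_occ).abs()                     # (B, N_f)
    # Number-of-elements-based threshold: keep the t least-affected dimensions.
    idx = diff.topk(t, dim=1, largest=False).indices
    tau = torch.zeros_like(diff).scatter_(1, idx, 1.0)
    return (tau * diff).sum(dim=1).mean(), tau


def masked_id_loss(classifier, f_occ, tau, labels):
    """Softmax identity loss on the masked occluded feature, classifier weights shared."""
    logits = classifier(tau * f_occ)
    return F_nn.cross_entropy(logits, labels)
```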
3.3 Implementation Details

Our proposed method is model agnostic. To demonstrate this, we apply the proposed SAD and FAD losses to two different network architectures: one inspired by the widely used CASIA-Net Yi et al. (2014); Tran et al. (2018), the other based on ResNet50 He et al. (2016), both of which are popular in the face recognition community. The structure of the CASIA-Net-based model is shown in Tab. 3.1. Since ResNet50 contains many more layers, we put its structure in the appendix (Chapter 6). We add HC-descriptor-related blocks for our SAD loss learning: the conv33, conv44, and conv54 layers are used to construct the HC descriptor via convolutional upsampling layers. We set the feature dimension N_f = 320. As for ResNet50, we take the modified version in Deng et al. (2018), where N_f = 512, and we likewise construct the HC descriptor from three layers at different resolutions. To speed up training, we reuse the pretrained feature extraction networks shared by Tran et al. (2018) and Deng et al. (2018). All new weights are randomly initialized using a truncated normal distribution with a standard deviation of 0.02. The entire network is jointly trained using the SGD optimizer with an initial learning rate of 0.001 and a momentum of 0.9. The learning rate is divided by 10 twice, each time the training loss plateaus.

Table 3.1: The structures of our network architecture.

Layer     | Input          | Filter/Stride | Output Size
conv11    | Image          | 3 × 3 / 1     | 96 × 96 × 32
conv12    | conv11         | 3 × 3 / 1     | 96 × 96 × 64
conv21    | conv12         | 3 × 3 / 2     | 48 × 48 × 64
conv22    | conv21         | 3 × 3 / 1     | 48 × 48 × 64
conv23    | conv22         | 3 × 3 / 1     | 48 × 48 × 128
conv31    | conv23         | 3 × 3 / 2     | 24 × 24 × 128
conv32    | conv31         | 3 × 3 / 1     | 24 × 24 × 96
conv33    | conv32         | 3 × 3 / 1     | 24 × 24 × 192
conv41    | conv33         | 3 × 3 / 2     | 12 × 12 × 192
conv42    | conv41         | 3 × 3 / 1     | 12 × 12 × 128
conv43    | conv42         | 3 × 3 / 1     | 12 × 12 × 256
conv51    | conv43         | 3 × 3 / 2     | 6 × 6 × 256
conv52    | conv51         | 3 × 3 / 1     | 6 × 6 × 160
conv53    | conv52         | 3 × 3 / 1     | 6 × 6 × N_f
conv43-U  | conv43         | upsampling    | 24 × 24 × 256
conv44    | conv43-U       | 1 × 1 / 1     | 24 × 24 × 192
conv53-U  | conv53         | upsampling    | 24 × 24 × 320
conv54    | conv53-U       | 1 × 1 / 1     | 24 × 24 × 192
Φ (HC)    | conv33, 44, 54 | concat        | 24 × 24 × 576
Ψ         | Φ              | 3 × 3 / 1     | 24 × 24 × N_f
AvgPool   | Ψ              | 24 × 24 / 1   | 1 × 1 × N_f

Figure 3.2: With barycentric coordinates, we warp the vertices of the template face mask to each image within the 64-image mini-batch.

For FAD, the feature mask τ can be computed per image pair I and Î. However, to obtain a more reliable feature mask, we opt to compute τ using multiple image pairs sharing the same physical occlusion mask, i.e.,

\tau_i(\{\mathbf{I}_j, \hat{\mathbf{I}}_j\}_{j=1}^{N}) = 1 \;\; \text{if} \;\; \frac{1}{N}\sum_{j=1}^{N} \big| f_i(\mathbf{I}_j) - f_i(\hat{\mathbf{I}}_j) \big| < t, \;\; \text{otherwise } 0.

To apply the same physical mask to all images in a batch regardless of their poses, we first define a frontal face template with 142 triangles created from 68 facial landmarks. A 32 × 12 rectangle, randomly placed so that it may cover a face area such as an eye, the nose, or the mouth, is selected as a normalized mask. Each of the rectangle's four vertices can be represented by its barycentric coordinates w.r.t. the triangle enclosing it. For each image within a mini-batch, the corresponding four vertices of a quadrilateral can be found via the same barycentric coordinates. This quadrilateral denotes the location of the warped mask for that image. An example of this mask warping process is shown in Fig. 3.2.
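The barycentric transfer of the mask vertices can be sketched as follows. The triangle bookkeeping and the enclosing-triangle search are illustrative assumptions; only the weight transfer itself follows the description above.

```python
# A sketch of warping the rectangular mask's vertices with barycentric coordinates.
import numpy as np


def barycentric_coords(p, tri):
    """Barycentric coordinates of 2D point p w.r.t. triangle tri (3x2 array)."""
    a, b, c = tri
    m = np.column_stack([b - a, c - a])                # 2x2 basis of the triangle
    u, v = np.linalg.solve(m, p - a)
    return np.array([1.0 - u - v, u, v])


def warp_mask_vertices(rect_vertices, template_tris, image_tris):
    """Map each mask vertex from the frontal template to one target image.

    rect_vertices: (4, 2) corners of the 32x12 mask on the template face.
    template_tris / image_tris: (T, 3, 2) matching triangles on template and image.
    """
    warped = []
    for p in rect_vertices:
        # Find the enclosing template triangle (all barycentric weights >= 0).
        for tri_t, tri_i in zip(template_tris, image_tris):
            w = barycentric_coords(p, tri_t)
            if np.all(w >= -1e-6):
                warped.append(w @ tri_i)               # same weights, image triangle
                break
    return np.array(warped)                            # (4, 2) quadrilateral corners
```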
CHAPTER 4

EXPERIMENTAL RESULTS

4.1 Experimental Settings

4.1.1 Introduction

The following sections provide ablation studies, and qualitative and quantitative evaluations. First, to analyze the influence of the parameter settings in our model, we conduct ablation studies: we vary the threshold in the feature activation diversity loss and examine the face recognition performance, and we compare models trained with different occlusion types and dynamic occlusion window sizes. Besides, turning off one of the diversity losses also helps us understand the effect of each of the two proposed loss functions. Second, to better illustrate the results of our method, we present qualitative visualizations of the learned representations, response maps, etc. Last, we compare the face recognition performance on three benchmark datasets: IJB-A Klare et al. (2015), IJB-C Brianna Maze et al. (2018), and AR face Martinez (1998).

4.1.2 Database

For CASIA-Net, we take the CASIA-WebFace database Yi et al. (2014) as the training database. For ResNet50, we use the MS-Celeb-1M database Guo et al. (2016) for training, and IJB-A Klare et al. (2015), IJB-C Brianna Maze et al. (2018), and AR face Martinez (1998) for testing. CASIA-WebFace contains 493,456 images of 10,575 subjects. MS-Celeb-1M includes 1M images of 100k subjects; since it contains much labeling noise, we use a cleaned version of MS-Celeb-1M Guo et al. (2016). In our experiments, we evaluate IJB-A in three different scenarios, i.e., original faces, synthetic occlusion faces, and natural occlusion faces. For synthetic occlusion, we randomly generate a warped occluded area for each testing image, the same as what we do in training. IJB-C extends IJB-A and is also a video-based face database, with 3,134 images and 117,542 frames from videos of 3,531 subjects. One unique property of IJB-C is its labels of fine-grained occlusion areas. Thus, we use IJB-C to evaluate occlusion-robust face recognition, using testing images with at least one occluded area. AR face is another natural-occlusion face database, with ∼4K faces of 126 subjects. We only use AR faces with natural occlusions, including eyeglasses and scarves. Some examples of the IJB-A, IJB-C, and AR databases are shown in Fig. 4.1. Following the setting in Deng et al. (2018), all training and test images are processed and resized to 112 × 112. Note that all ablation and qualitative evaluations use the CASIA-Net-based model, and the quantitative evaluations use both models.

Figure 4.1: Example faces from (a) IJB-A, (b) IJB-C and (c) AR face databases. The occlusions include scarves, eyeglasses, hands, etc.

4.2 Ablation Study

4.2.1 Different Thresholds

As mentioned in Sec. 3.2, the threshold t in Eqn. 3.4 denotes the number of elements of the two N_f-dim features that the FAD loss encourages to be similar. To study the effect of t on face recognition, we train different models with t = 130, 260, 320. The first three rows of Tab. 4.1 show the comparison on all three variants of the IJB-A dataset. When forcing all elements of f(I) and f(Î) to be the same (t = N_f = 320), the performance drops significantly on all three sets. In this case, the feature representation of the non-occluded face is negatively affected, as it is completely pushed toward the representation of the occluded one. While the model with t = 130 has similar performance to the model with t = 260, we use the latter for the rest of the paper, since it affects fewer filters, pushes the other filter responses away from any local occlusion, and subsequently enhances the spreadness of the average response locations.

Table 4.1: Ablation study on IJB-A database.

Method          | IJB-A                   | Manual Occlusion        | Natural Occlusion
Metric (%)      | @FAR=.01   | @Rank-1    | @FAR=.01   | @Rank-1    | @FAR=.01   | @Rank-1
BlaS (t = 130)  | 79.0 ± 1.6 | 89.5 ± 0.8 | 76.1 ± 1.7 | 88.0 ± 1.4 | 66.2 ± 4.0 | 73.0 ± 3.3
BlaS (t = 260)  | 79.2 ± 1.8 | 89.4 ± 0.8 | 76.1 ± 1.4 | 88.0 ± 1.2 | 66.5 ± 6.4 | 72.3 ± 2.8
BlaS (t = 320)  | 74.6 ± 2.4 | 88.9 ± 1.3 | 71.8 ± 3.1 | 87.5 ± 1.6 | 61.0 ± 6.5 | 71.6 ± 3.2
GauD (t = 260)  | 79.3 ± 2.0 | 89.9 ± 1.0 | 76.2 ± 2.4 | 88.6 ± 1.1 | 66.8 ± 3.5 | 73.2 ± 3.3
SAD only        | 78.1 ± 1.8 | 88.1 ± 1.1 | 66.6 ± 5.6 | 81.2 ± 1.9 | 64.2 ± 6.9 | 71.0 ± 3.3
FAD only        | 76.7 ± 2.0 | 88.1 ± 1.1 | 75.2 ± 2.4 | 85.1 ± 1.2 | 66.5 ± 6.4 | 72.3 ± 2.8

Table 4.2: Comparison of selected filter numbers.

Method (n_sel \ n_total) | Nose     | Forehead | Eye      | L cheek  | Mouth
GauD (t = 260)           | 60 \ 320 | 60 \ 320 | 60 \ 320 | 60 \ 320 | 60 \ 320
BlaS (fixed)             | 59 \ 320 | 1 \ 320  | 98 \ 320 | 41 \ 320 | 82 \ 320

Moreover, in our feature activation diversity loss (FAD), the feature mask is computed by thresholding the averaged feature difference. There are multiple design choices for this thresholding operation, such as number-of-elements-based or value-based thresholding. In this paper, we explored the first choice: minimizing the difference of the t (out of 320) elements with the smallest averaged difference values. Under this setting, the number of filters enforced to be similar by the FAD loss is fixed regardless of the occlusion location.
Intuitively, a dynamically selected number of filters can be more natural, as different occlusions may cover different regions, some of which contain more discriminative features (e.g., the eye and mouth areas) than others (e.g., the cheek and forehead). To evaluate this effect, we explore the second thresholding option: setting a fixed threshold value, which selects a different number of filters for each occlusion location. As shown in Tab. 4.2, only one filter is selected for the forehead, which means that covering the forehead barely affects face recognition, since the forehead contains little identity information. However, for the eye, nose, and mouth, around 80 filters on average are affected, because they are the most discriminative parts of the face.

4.2.2 Different Occlusions and Dynamic Window Size

In the FAD loss, we use a warped black window as the synthetic occlusion. It is important to introduce another type of occlusion to see the effect on face recognition. Thus, we use Gaussian noise to replace the black color in the occlusion window. Further, we employ a dynamic window size by randomly sampling a value from [12, 32] for both the window height and width. The face recognition results on IJB-A are shown in Tab. 4.1, where 'BlaS' means a black window with static size, while 'GauD' means a Gaussian-noise window with dynamic size. From the results, it is interesting to find that 'GauD' performs slightly better than 'BlaS'; compared to the black window, Gaussian noise contains more diverse information.

4.2.3 Spatial vs. Feature Diversity Loss

Since we propose two different diversity losses in our model, it is important to evaluate the effect of each loss on face recognition performance. As shown in Tab. 4.1, we can train our models using either diversity loss, or both of them. We observe that, while the SAD loss performs reasonably well on the general IJB-A set, it suffers on data with occlusions, whether synthetic or natural. Alternatively, using only the FAD loss improves the performance on the two occlusion sets. Finally, using both losses, the row 'BlaS (t = 260)', improves upon both models with only one loss.

4.3 Qualitative Evaluation

4.3.1 Spreadness of Average Locations of Filter Response

Given an input face image, our model computes ψ′(I), the 320 feature maps of size 24 × 24, where the average pooling of one map is one element of the final 320-d feature representation. Each feature map contains both positive and negative response values, which are distributed over different spatial areas of the face. We select the locations of the highest positive value and the lowest negative value as the peak response locations. To better illustrate the spatial distribution of the peak locations, we randomly select 1,000 testing images and calculate the weighted average location for each filter, with three notes. First, there are two types of locations, for the highest (positive) and lowest (negative) responses respectively. Second, since the filters are responsive to semantic facial components, their 2D spatial locations may change with pose. To compensate for that, we warp the peak location in an arbitrary-view face to a canonical frontal-view face, via its barycentric coordinates w.r.t. the triangle enclosing it. Similar to Fig. 3.2, we use 68 estimated landmarks Liu et al. (2017a); Jourabloo & Liu (2017) and control points on the image boundary to define the triangular mesh. Finally, the weight of each location is determined by the magnitude of its peak response.
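The peak-response statistics described above are straightforward to extract; the sketch below shows one way to do it, with the pose-normalizing warp to the frontal template omitted. The array shapes are assumptions.

```python
# A sketch of extracting the positive and negative peak-response locations used in
# the spreadness analysis, assuming psi has shape (N_f, 24, 24) for one image.
import numpy as np


def peak_locations(psi):
    """Return (positive peaks, negative peaks, weights) for each of the N_f maps."""
    n_f, h, w = psi.shape
    flat = psi.reshape(n_f, -1)
    pos_idx = flat.argmax(axis=1)                      # highest positive response
    neg_idx = flat.argmin(axis=1)                      # lowest negative response
    pos_xy = np.stack(np.unravel_index(pos_idx, (h, w)), axis=1)
    neg_xy = np.stack(np.unravel_index(neg_idx, (h, w)), axis=1)
    pos_w = flat.max(axis=1)                           # peak magnitudes act as weights
    neg_w = np.abs(flat.min(axis=1))
    return pos_xy, neg_xy, pos_w, neg_w

# The weighted average location of filter k over an image set is then
#   c_k = sum_n w_k(n) * xy_k(n) / sum_n w_k(n).
```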
With that, the average locations of all feature maps (or filters) are shown in Fig. 4.2. To compare the visualization results between our models and the base CNN model, we compute

\bar{d} = \frac{1}{N_f} \sum_{i}^{N_f} \Big| c_i - \frac{1}{N_f} \sum_{i}^{N_f} c_i \Big|

to quantify the spreadness of the average locations, where c_i denotes the (x, y) coordinates of the i-th average location. For both the positive and negative peak responses, we take the mean of their d̄. As shown in Fig. 4.2, our model with the SAD loss enlarges the spreadness of the average locations, and our model with both losses pushes the filter responses even farther apart from each other. This demonstrates that our model is indeed able to push filters to attach to diverse face areas, whereas for the base model the filters do not attach to specific facial parts, resulting in average locations that stay near the image center. In addition, we compute the standard deviation of each filter's peak location over the 1,000 images. From Fig. 4.2, we observe that the base model has larger standard deviations than the SAD-only model or our full model, which means our model can better concentrate on a local part than the base model.

Figure 4.2: The average locations of positive (top) and negative (bottom) peak responses of 320 filters for three models: (a) base CNN model (d̄ = 6.9), (b) our model (SAD only, d̄ = 17.1), and (c) our model (d̄ = 18.7), where d̄ quantifies the spreadness of the average locations. The color at each location denotes the standard deviation of peak locations. The face size is 96 × 96.

Table 4.3: Compare standard deviations of peaks with varying d.

LMF (d%)        | 0         | 75.00     | 87.50     | 95.83
std (pos./neg.) | 25.7/25.7 | 14.7/14.4 | 13.5/14.0 | 12.9/13.4

In the above analysis, we set the LMF rate d to 95.83%. It is worth ablating the impact of the rate d. We train models with different d of 0%, 75%, 87.5%, and 95.83%. Since the feature map before average pooling is 24 × 24, the last three percentages mean that we remove 24 × 18, 24 × 21, and 24 × 23 responses respectively for each model, and 0% denotes the base model. Tab. 4.3 compares the average of the standard deviations of peak locations across the 320 filters. Note that the values of the best model (12.9/13.4) equal the average color of Fig. 4.2(c). When we use a larger LMF rate, the model tends to concentrate more on a local facial part. For this reason, we set d = 95.83%.
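As a concrete reference, the spreadness measure d̄ can be computed as below. Interpreting |·| as the Euclidean distance of each 2D average location from the centroid is our assumption; the thesis does not spell out that detail.

```python
# A small sketch of the spreadness measure d-bar: the mean distance of the per-filter
# average peak locations from their centroid. The (N_f, 2) layout is an assumption.
import numpy as np


def spreadness(avg_locations):
    """avg_locations: (N_f, 2) array of per-filter average peak coordinates."""
    centroid = avg_locations.mean(axis=0)
    # Distance of each filter's average location from the centroid, then averaged.
    return np.linalg.norm(avg_locations - centroid, axis=1).mean()

# Larger d-bar means the filters' average peak locations cover the face more widely
# (e.g., 6.9 for the base CNN vs. 18.7 for the full model in Fig. 4.2).
```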
FDA loss further enhances robustness by only letting the occlusion alternate a small portion of the representation, keeping the remaining elements invariant to the occluded part. 4.3.3 Visualization on Feature Difference Vectors Fig. 4.2 demonstrates that each of our filter spatially corresponds to a face location. Here we further study the relation of these average locations and semantic meaning on input images. In Fig. 4.4, we visualize the magnitude of changes of each filter response due to five different occlusions. We observe the locations of points with large feature difference are around the occluded face area, which means our learned filters are indeed sensitive to various facial areas. Further, the magnitude of the feature difference can be vary with different occlusions. The maximum feature difference can be as high as 0.7 with occlusion in eye or mouth, meanwhile this number is only 0.3 in less critical area, e.g., forehead. 4.3.4 Filter Response Visualization Fig. 4.5 visualizes the feature responses of some filters on different subjects’ faces. From the heat maps, we can see how each filter is attached to a specific semantic location on the faces, independent to either identities or poses. This is especially impressive for faces with varying poses, 20 Figure 4.4: The correspondence between feature difference magnitude and occlusion locations. Best viewed electronically. in that despite no pose prior is used in training, the filter can always respond to the semantically equivalent local part. 4.4 Quantitative Evaluation on Benchmark Although we have shown some results of standard deviations of peaks in Fig. 4.2, we still want to further investigate the response concentration properties between our model and base model, which we will give some histograms to illustrate. To show that our method is model agnostic, we use two different base CNN models, CASIA-Net and ResNet50. Our proposed method and the respective base model only differs in the loss functions. E.g., both our CASIA-Net-based model and base CASIA-Net model use the same network architecture as Tab. 3.3. We test on two types of datasets: the generic in-the-wild faces and occlusion faces. 4.4.1 Standard Deviation of Peaks Fig. 4.2 shows the average of the peak locations of 320 filter responses, and also use the different color to show the vary standard deviations of peak locations for each filter across 1,000 images. But it is hard to tell which model has the smallest standard deviation, therefore We can also compute the histograms of standard deviations across 320 filters, for both the positive responses and negative responses, respectively. As shown in Fig. 4.6, our model can generate filter responses whose locations have smaller standard deviations than CNN base model, i.e., our filter can be more concentrated on one specific location of the face. Note that the SAD loss only model also reduces the standard deviation over the CNN base model, our model can have a slightly smaller standard 21 Figure 4.5: Visualization of filter response “heat maps" of 10 different filters on faces from different subjects (top 4 rows) and the same subject (bottom 4 rows). The positive and negative responses are shown as two colors within each image. Note the high consistency of response locations across subjects and across poses. deviations than SAD loss only model. 4.4.2 Generic in-the-wild faces As shown in Tabs. 
4.4.2 Generic in-the-wild faces

As shown in Tabs. 4.4 and 4.5, when compared to the base CASIA-Net model, our CASIA-Net-based model with the two losses achieves superior performance. The same superiority is demonstrated w.r.t. CASIA-Net with data augmentation, which shows that the gain is caused by the novel loss function design. For the deeper ResNet50 structure, our proposed model achieves similar performance to the base model, and both outperform the models with CASIA-Net as the base. Even compared to state-of-the-art methods, the performance of our ResNet50-based model is still competitive. It is worth noting that this is the first time a reasonably interpretable representation has been able to demonstrate competitive state-of-the-art recognition performance on a widely used benchmark, e.g., IJB-A.

Table 4.4: Comparison on IJB-A database.

Method                        | Verification            | Identification
Metric (%)                    | @FAR=.01   | @FAR=.001  | @Rank-1    | @Rank-5
DR-GAN Tran et al. (2018)     | 79.9 ± 1.6 | 56.2 ± 7.2 | 88.7 ± 1.1 | 95.0 ± 0.8
CASIA-Net                     | 74.3 ± 2.8 | 49.0 ± 7.4 | 86.6 ± 2.0 | 94.2 ± 0.9
Ours (CASIA-Net)              | 79.3 ± 2.0 | 60.2 ± 5.5 | 89.9 ± 1.0 | 95.6 ± 0.6
FaceID-GAN Shen1 et al. (2018)| 87.6 ± 1.1 | 69.2 ± 2.7 | −          | −
VGGFace2 Cao et al. (2018)    | 93.9 ± 1.3 | 85.1 ± 3.0 | 96.1 ± 0.6 | 98.2 ± 0.4
PRFace Cao2 et al. (2018)     | 94.4 ± 0.9 | 86.8 ± 1.5 | 92.4 ± 1.6 | 96.2 ± 1.0
ResNet50 He et al. (2016)     | 94.8 ± 0.6 | 86.0 ± 2.6 | 94.1 ± 0.8 | 96.1 ± 0.6
Ours (ResNet50)               | 94.6 ± 0.8 | 87.9 ± 1.0 | 93.7 ± 0.9 | 96.0 ± 0.5

Table 4.5: Comparison on IJB-C database.

Method                       | Verification         | Identification
Metric (%)                   | @FAR=.01 | @FAR=.001 | @Rank-1 | @Rank-5
DR-GAN Tran et al. (2018)    | 88.2     | 73.6      | 74.0    | 84.2
CASIA-Net                    | 87.1     | 72.9      | 74.1    | 83.5
Ours (CASIA-Net)             | 89.2     | 75.6      | 77.6    | 86.1
VGGFace2 Cao et al. (2018)   | 95.0     | 90.0      | 89.8    | 93.9
Mn-v Xie & Zisserman (2018)  | 96.5     | 92.0      | −       | −
AIM Zhao et al. (2018)       | 96.2     | 93.5      | −       | −
ResNet50 He et al. (2016)    | 95.9     | 93.2      | 90.5    | 93.2
Ours (ResNet50)              | 95.8     | 93.2      | 90.3    | 93.2

4.4.3 Occlusion faces

We test our models and the base models on multiple occluded-face datasets. The synthetic-occlusion subset of IJB-A, the natural-occlusion subset of IJB-A, and the natural-occlusion subset of IJB-C have 25,795, 12,703, and 78,522 images of 500, 466, and 3,329 subjects, respectively. As shown in Tabs. 4.6, 4.7, and 4.8, the performance improvements on the occlusion datasets are more substantial than on the generic IJB-A database, which shows the advantage of interpretable representations in handling occlusions. For AR faces, we select all 810 images with eyeglass and scarf occlusions, from which 6,000 same-person and 6,000 different-person pairs are randomly selected. We compute the representations of each image pair and their cosine distance. As shown in Fig. 4.7, the Equal Error Rates of CASIA-Net, ours (CASIA-Net), ResNet50, and ours (ResNet50) are 21.6%, 16.2%, 4.2%, and 3.9%, respectively. Our model based on CASIA-Net achieves superior performance compared to the CASIA-Net base model, and even for the state-of-the-art ResNet50 model, we still observe a performance improvement of our model over the ResNet50 base model.

Table 4.6: Comparison on IJB-A database with synthetic occlusions.

Method                     | Verification            | Identification
Metric (%)                 | @FAR=.01   | @FAR=.001  | @Rank-1    | @Rank-5
DR-GAN Tran et al. (2018)  | 61.9 ± 4.7 | 35.8 ± 4.3 | 80.0 ± 1.1 | 91.4 ± 0.8
CASIA-Net                  | 61.8 ± 5.5 | 39.1 ± 7.8 | 79.6 ± 2.1 | 91.4 ± 1.2
Ours (CASIA-Net)           | 76.2 ± 2.4 | 55.5 ± 5.7 | 88.6 ± 1.1 | 95.0 ± 0.7
ResNet50 He et al. (2016)  | 93.0 ± 0.7 | 80.9 ± 4.7 | 92.8 ± 0.9 | 95.5 ± 0.8
Ours (ResNet50)            | 94.2 ± 0.6 | 87.5 ± 1.5 | 93.4 ± 0.7 | 95.8 ± 0.4

Table 4.7: Comparison on IJB-A database with natural occlusions.

Method                     | Verification            | Identification
Metric (%)                 | @FAR=.01   | @FAR=.001  | @Rank-1    | @Rank-5
DR-GAN Tran et al. (2018)  | 64.7 ± 4.1 | 41.8 ± 6.4 | 70.8 ± 3.6 | 81.7 ± 2.9
CASIA-Net                  | 64.4 ± 6.1 | 40.7 ± 6.8 | 71.3 ± 3.5 | 81.6 ± 2.5
Ours (CASIA-Net)           | 66.8 ± 3.4 | 48.3 ± 5.5 | 73.2 ± 2.5 | 82.3 ± 3.3
ResNet50 He et al. (2016)  | 86.0 ± 1.8 | 64.3 ± 7.7 | 79.8 ± 4.2 | 84.9 ± 3.1
Ours (ResNet50)            | 86.0 ± 1.6 | 72.6 ± 5.0 | 80.0 ± 3.2 | 85.0 ± 3.1

Table 4.8: Comparison on IJB-C database with natural occlusions.

Method                     | Verification         | Identification
Metric (%)                 | @FAR=.01 | @FAR=.001 | @Rank-1 | @Rank-5
DR-GAN Tran et al. (2018)  | 66.1     | 70.8      | 82.8    | 82.4
CASIA-Net                  | 67.0     | 72.1      | 83.3    | 83.3
Ours (CASIA-Net)           | 69.3     | 74.5      | 83.6    | 83.8
ResNet50 He et al. (2016)  | 89.0     | 87.5      | 91.0    | 93.1
Ours (ResNet50)            | 89.8     | 87.4      | 90.7    | 93.4

Figure 4.7: ROC curves of different models on AR database.
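For reference, the AR verification protocol of Sec. 4.4.3 (cosine similarity between pair representations, then the Equal Error Rate of the scores) can be sketched as follows; the threshold-free EER estimate below is a standard recipe, not the authors' code.

```python
# A sketch of pairwise cosine scoring and Equal Error Rate computation.
import numpy as np


def cosine_score(f1, f2):
    """Cosine similarity between two N_f-dim representations."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))


def equal_error_rate(scores, labels):
    """scores: np.array of pair similarities; labels: 1 for same-person, 0 otherwise."""
    order = np.argsort(-scores)                        # descending similarity
    labels = labels[order]
    tp = np.cumsum(labels)
    fp = np.cumsum(1 - labels)
    tpr = tp / max(labels.sum(), 1)
    fpr = fp / max((1 - labels).sum(), 1)
    frr = 1.0 - tpr
    # EER is the operating point where false accepts equal false rejects.
    idx = np.argmin(np.abs(fpr - frr))
    return (fpr[idx] + frr[idx]) / 2.0
```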
4.5 Other Applications

4.5.1 Partial face retrieval

In addition to interpretable face recognition, another potential application of our method is partial face retrieval. Suppose we are interested in retrieving images with a similar mouth; we can define "mouth filters" based on the filters' average peak locations under our models, as in Fig. 4.9.

Figure 4.8: Partial face retrieval with mouth (left), and nose (right).

Figure 4.9: The overall framework of partial face retrieval.

Assume we have M original and occluded face pairs as input. Over all the pairs, we compute their average feature difference

f_{diff} = \big| f_{avg}(\mathbf{I}) - f_{avg}(\hat{\mathbf{I}}) \big|, \;\; \text{where} \;\; f_{avg} = \frac{1}{M}\sum_{i=1}^{M} f_i.

Then we find the indices of the top N_f − t elements of f_diff, which we denote ID_large. After that, we use ID_large to select the "mouth"-related feature elements of each testing face I_i. Given a probe face I_i and gallery faces I_j, j ∈ {1, ..., L}, our model computes the features f(I_i) and f(I_j). For each probe-gallery pair, the "mouth"-related features, f_mouth(I_i) and f_mouth(I_j), are computed. By applying the cosine distance to these two part-based features, we can retrieve the I_j most similar to I_i.

For experimental demonstration, we select one pair of images for each of 150 identities from the IJB-A test set, creating a set of 300 images in total. Using different facial parts of each image as a query, our accuracy of retrieving the remaining image of the same subject as the top-1 result is 71%, 58%, and 69% for eyes, mouth, and nose, respectively. Results are visualized in Fig. 4.8; we can retrieve facial parts that are not from the same identity but are visually very similar to the query part.
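The retrieval pipeline of Fig. 4.9 can be sketched as below. The array shapes and the choice of keeping the top N_f − t = 60 most-affected dimensions are assumptions drawn from the description above.

```python
# A sketch of partial-face retrieval: pick the filters most affected by occluding a
# part (e.g., the mouth), then rank gallery faces by cosine similarity restricted
# to those dimensions.
import numpy as np


def part_filter_ids(f_clean, f_occluded, n_keep=60):
    """f_clean, f_occluded: (M, N_f) features of M pairs sharing one occluded part."""
    f_diff = np.abs(f_clean.mean(axis=0) - f_occluded.mean(axis=0))
    return np.argsort(-f_diff)[:n_keep]                # indices with largest difference


def retrieve(probe_feat, gallery_feats, ids):
    """Rank gallery faces by cosine similarity of the part-related dimensions."""
    p = probe_feat[ids]
    g = gallery_feats[:, ids]
    sims = g @ p / (np.linalg.norm(g, axis=1) * np.linalg.norm(p) + 1e-12)
    return np.argsort(-sims)                           # best match first
```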
4.5.2 Occlusion detection

Detecting occluded face areas is another interesting task that we can explore. As observed in Fig. 1.1, filter responses become weak and scattered if there is a heavy occlusion on the region to which the filter responds. This observation can be leveraged to detect occluded areas in an unsupervised manner. Fig. 4.10 describes our approach to occlusion detection. For each filter, its visibility is defined by three criteria: (i) the distance to the average peak location, (ii) the feature activation spreadness, and (iii) the inverse peak value. We discuss this approach in detail below.

Figure 4.10: The framework of occlusion detection on AR database.

Assume we have N occluded images like the leftmost face shown in Fig. 4.10. We create a 6 × 4 grid on the facial part of each occluded face. Since manually labeling these grids would be inefficient, we first select a frontal face and create the grid for it, and then use barycentric coordinates to warp the vertices of the grid to the other faces within the set of N occluded images. Once we have the grids, their ground-truth labels can be assigned. Looking at the rightmost face in Fig. 4.10, the ground-truth label is constructed as a binary string: 1 denotes an occluded square of the grid, while 0 denotes a non-occluded square. For example, the ground-truth label of the face in Fig. 4.10 is "000000001111111100000000".

After obtaining labels for all the occluded faces, we design our approach to detect the occlusions. Before that, we pair each occluded face image with a twin image that has similar properties except for the heavy occlusion. Utilizing the twin images, we can build a two-class classifier for visibility. Since we have defined three criteria for the visibility of each filter, it is worth discussing their detailed formulations.

First, the distance to the average peak location is a meaningful measure of visibility. Our previous experiments have shown that the standard deviations among peaks become small when applying our two loss functions. If there is a heavy occlusion, the peak response locations of the filters can be scattered. We compute an average location for each filter across N non-occluded images, and then, for both the non-occluded and occluded image sets, we calculate the average distance to this average peak location for each filter. Ideally, the distance computed on the occluded images will be larger.

Second, we evaluate the difference in feature activation spreadness. In our assumption, smaller activation spreadness means stronger interpretability. We observe that the feature response of a filter for a non-occluded face is concentrated on a local part; in other words, its area is small. For an occluded face, the heavy occlusion pushes the filter to respond over a scattered area. Based on this observation, we compute the average response area of each filter for the two sets of images. By comparing the average areas, we gain some knowledge about the difference between occluded and non-occluded images.

Third, besides the spreadness of the feature response, we explore the response strength. Looking at the peak values of each filter, we find that the filters tend to respond more strongly to non-occluded faces than to occluded faces. To quantitatively evaluate this response strength, we compute the average peak value of each filter across the N images for both the occluded and non-occluded sets.

The sum of the normalized scores of the three criteria determines the filter visibility. For each region in the 6 × 4 grid, the binary decision of the region's visibility is made by a majority vote of all the filters it contains. Note that our output is also a binary string. As shown in Fig. 4.10, the middle face illustrates the estimated detection result; its predicted label is "001001111110111000000100". Using the Simple Matching Coefficient (SMC) metric, the coefficient of this sample image is 0.71. On N = 810 occluded faces of the AR dataset, our method achieves an average SMC score of 0.58.
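A small sketch of the grid-level majority vote and the SMC metric used above follows; the per-filter visibility scores from the three criteria are assumed to be computed separately.

```python
# A sketch of the grid-level decision and the Simple Matching Coefficient (SMC).
import numpy as np


def grid_prediction(filter_visible, filter_to_cell, n_cells=24):
    """Majority vote of the filters assigned to each of the 6x4 = 24 grid cells.

    filter_visible: (N_f,) bool, True if the filter is judged visible.
    filter_to_cell: (N_f,) int, grid cell containing each filter's average peak.
    """
    pred = np.zeros(n_cells, dtype=int)
    for cell in range(n_cells):
        votes = filter_visible[filter_to_cell == cell]
        # A cell is predicted occluded (1) if most of its filters look non-visible.
        if votes.size and (~votes).sum() > votes.sum():
            pred[cell] = 1
    return pred


def simple_matching_coefficient(pred, truth):
    """Fraction of grid cells where prediction and ground truth agree."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float((pred == truth).mean())

# Example: the sample in Fig. 4.10 with predicted label "001001111110111000000100"
# and ground truth "000000001111111100000000" agrees on 17 of 24 cells, SMC ~ 0.71.
```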
On N = 810 occluded faces from the AR dataset, our method achieves an average SMC score of 0.58.

CHAPTER 5

CONCLUSIONS

In this thesis, we present our efforts towards interpretable face recognition. Our grand goal is to learn from data a structured face representation in which each dimension activates on a consistent semantic face part and captures its identity information. We propose two novel losses to encourage both spatial activation diversity and feature activation diversity in the final-stage convolutional filters and the face representation. We empirically demonstrate that the proposed method leads to more locally constrained individual filter responses and a more widely spread overall filter distribution. A by-product of the harnessed interpretability is improved robustness to occlusions in face recognition.

CHAPTER 6

RESNET50

6.1 The network structure of our modified ResNet50

Table 6.1: The structures of the modified ResNet50.

Layer(s)          Input                        Filter/Stride                                            Output Size
conv11            Image                        3 × 3/2                                                  56 × 56 × 64
MaxPool           conv11                       3 × 3/2                                                  56 × 56 × 64
conv21–conv29     MaxPool                      [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3                   56 × 56 × 256
conv31–conv312    conv29                       [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4                 28 × 28 × 512
conv41–conv418    conv312                      [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6                14 × 14 × 1024
conv51–conv59     conv418                      [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 (last → N_f)   7 × 7 × N_f
conv418-U         conv418                      upsampling                                               28 × 28 × 1024
conv419           conv418-U                    1 × 1/1                                                  28 × 28 × 512
conv59-U          conv59                       upsampling                                               28 × 28 × 512
conv510           conv59-U                     1 × 1/1                                                  28 × 28 × 512
Φ (HC)            conv312, conv419, conv510    concatenation                                            28 × 28 × 1536
Ψ                 Φ                            3 × 3/1                                                  28 × 28 × N_f
AvgPool           Ψ                            28 × 28/1                                                1 × 1 × N_f

The first bottleneck block of the conv3, conv4, and conv5 stages uses a stride-2 convolution to halve the spatial resolution; the remaining convolutions in each stage use stride 1, and the last 1 × 1 convolution of conv59 outputs N_f channels.

As shown in Tab. 6.1, for ResNet50 the input image size changes to 112 × 112, and the dimension of the final feature representation is again N_f = 512. We adopt the same block setting as described in Deng et al. (2018), which is shown in Fig. 6.1. This improved residual unit, with a BN-Conv-BN-PReLU-Conv-BN structure, has proven effective for face recognition.

Figure 6.1: The block setting.
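To illustrate the block setting of Fig. 6.1, the following is a minimal PyTorch sketch of a BN-Conv-BN-PReLU-Conv-BN residual unit in the style of Deng et al. (2018); the class name IRBlock, the placement of the stride, and the 1 × 1 shortcut projection are illustrative assumptions, not the exact implementation used in this thesis.

import torch
import torch.nn as nn

class IRBlock(nn.Module):
    """Improved residual unit: BN-Conv-BN-PReLU-Conv-BN with an identity (or 1x1) shortcut."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Project the shortcut when the shape changes (an assumption; not specified in the text).
        self.shortcut = (
            nn.Identity()
            if stride == 1 and in_ch == out_ch
            else nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        )

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

# Example: one stride-2 unit mapping a 14x14x256 feature map to 7x7x512.
x = torch.randn(1, 256, 14, 14)
print(IRBlock(256, 512, stride=2)(x).shape)   # torch.Size([1, 512, 7, 7])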
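Similarly, the interpretable head appended to the backbone in Tab. 6.1 (upsampling conv418 and conv59, reducing them with the 1 × 1 convolutions conv419 and conv510, concatenating with conv312 into the hypercolumn Φ, applying the 3 × 3 layer Ψ, and average pooling) can be sketched as below; the module name, bilinear upsampling, and padding choices are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HypercolumnHead(nn.Module):
    """Fuse conv312 (28x28x512), conv418 (14x14x1024), conv59 (7x7x512) into an N_f-D feature."""

    def __init__(self, n_f=512):
        super().__init__()
        self.conv419 = nn.Conv2d(1024, 512, kernel_size=1)          # reduce upsampled conv418
        self.conv510 = nn.Conv2d(512, 512, kernel_size=1)           # transform upsampled conv59
        self.psi = nn.Conv2d(1536, n_f, kernel_size=3, padding=1)   # Psi: 3x3 on the hypercolumn

    def forward(self, conv312, conv418, conv59):
        c418_u = F.interpolate(conv418, size=conv312.shape[-2:], mode="bilinear", align_corners=False)
        c59_u = F.interpolate(conv59, size=conv312.shape[-2:], mode="bilinear", align_corners=False)
        phi = torch.cat([conv312, self.conv419(c418_u), self.conv510(c59_u)], dim=1)  # 28x28x1536
        feat_map = self.psi(phi)                                     # 28x28xN_f
        return feat_map.mean(dim=(2, 3))                             # AvgPool -> N_f-D feature

head = HypercolumnHead()
f = head(torch.randn(1, 512, 28, 28), torch.randn(1, 1024, 14, 14), torch.randn(1, 512, 7, 7))
print(f.shape)    # torch.Size([1, 512])

Feeding the three feature maps of Tab. 6.1 (28 × 28 × 512, 14 × 14 × 1024, and 7 × 7 × 512) yields a 28 × 28 × N_f map whose average pooling gives the final N_f-dimensional representation.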
BIBLIOGRAPHY

Ahonen, Timo, Abdenour Hadid & Matti Pietikainen. 2006. Face description with local binary patterns: Application to face recognition. TPAMI.

Berg, Thomas & Peter N. Belhumeur. 2013. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR.

Brianna Maze, Jocelyn Adams, James A. Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K. Jain, W. Tyler Niggel, Janet Anderson, Jordan Cheney & Patrick Grother. 2018. IARPA Janus Benchmark-C: Face dataset and protocol. In ICB.

Cao, Qiong, Li Shen, Weidi Xie, Omkar M. Parkhi & Andrew Zisserman. 2018. VGGFace2: A dataset for recognising faces across pose and age. In FG.

Cao, Zhimin, Qi Yin, Xiaoou Tang & Jian Sun. 2010. Face recognition with learning-based descriptor. In CVPR.

Cao, Kaidi, Yu Rong, Cheng Li, Xiaoou Tang & Chen Change Loy. 2018. Pose-robust face recognition via deep residual equivariant mapping. In CVPR.

Chai, Xiujuan, Shiguang Shan, Xilin Chen & Wen Gao. 2007. Locally linear regression for pose-invariant face recognition. TIP.

Chen, Xi, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever & Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS.

Cheng, Lele, Jinjun Wang, Yihong Gong & Qiqi Hou. 2015. Robust deep auto-encoder for occluded face recognition. In ICM.

Das, Abhishek, Harsh Agrawal, Larry Zitnick, Devi Parikh & Dhruv Batra. 2017. Human attention in visual question answering: Do humans and deep networks look at the same regions? CVIU.

Deng, Jiankang, Jia Guo & Stefanos Zafeiriou. 2018. ArcFace: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698.

Felzenszwalb, Pedro, David McAllester & Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. In CVPR.

Ge, Shiming, Jia Li, Qiting Ye & Zhao Luo. 2017. Detecting masked faces in the wild with LLE-CNNs. In CVPR.

Goodfellow, Ian J., Jonathon Shlens & Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Guo, Yandong, Lei Zhang, Yuxiao Hu, Xiaodong He & Jianfeng Gao. 2016. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV.

Hariharan, Bharath, Pablo Arbeláez, Ross Girshick & Jitendra Malik. 2015. Hypercolumns for object segmentation and fine-grained localization. In CVPR.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.

Iandola, Forrest, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell & Kurt Keutzer. 2014. DenseNet: Implementing efficient ConvNet descriptor pyramids. arXiv preprint arXiv:1404.1869.

Jourabloo, Amin & Xiaoming Liu. 2017. Pose-invariant face alignment via CNN-based dense 3D model fitting. IJCV.

Juneja, Mayank, Andrea Vedaldi, CV Jawahar & Andrew Zisserman. 2013. Blocks that shout: Distinctive parts for scene classification. In CVPR.

Klare, Brendan F., Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah & Anil K. Jain. 2015. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark-A. In CVPR.

Kumar, Neeraj, Alexander C. Berg, Peter N. Belhumeur & Shree K. Nayar. 2009.
Attribute and simile classifiers for face verification. In ICCV.

Learned-Miller, Erik, Gary B. Huang, Aruni RoyChowdhury, Haoxiang Li & Gang Hua. 2016. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis.

Li, Haoxiang & Gang Hua. 2017. Probabilistic elastic part model: A pose-invariant representation for real-world face verification. TPAMI.

Li, Haoxiang, Gang Hua, Zhe Lin, Jonathan Brandt & Jianchao Yang. 2013. Probabilistic elastic matching for pose variant face verification. In CVPR.

Li, Stan Z., Xin Wen Hou, Hong Jiang Zhang & Qian Sheng Cheng. 2001. Learning spatially localized, parts-based representation. In CVPR.

Lin, Tsung-Yi, Priya Goyal, Ross Girshick, Kaiming He & Piotr Dollár. 2017. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Liu, Feng, Dan Zeng, Qijun Zhao & Xiaoming Liu. 2018. Disentangling features in 3D face shapes for joint face reconstruction and recognition. In CVPR.

Liu, Yaojie, Amin Jourabloo, William Ren & Xiaoming Liu. 2017a. Dense face alignment. In ICCV Workshop.

Liu, Yu, Hongyang Li & Xiaogang Wang. 2017b. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890.

Lowe, David G. 2004. Distinctive image features from scale-invariant keypoints. IJCV.

Lu, Chaochao & Xiaoou Tang. 2015. Surpassing human-level face verification performance on LFW with GaussianFace. In AAAI.

Mahendran, Aravindh & Andrea Vedaldi. 2016. Visualizing deep convolutional neural networks using natural pre-images. IJCV.

Martinez, Aleix M. 1998. The AR face database. CVC Technical Report 24.

Nguyen, Anh, Jason Yosinski & Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR.

Novotny, David, Diane Larlus & Andrea Vedaldi. 2017. AnchorNet: A weakly supervised network to learn geometry-sensitive features for semantic matching. In CVPR.

Olah, Chris, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye & Alexander Mordvintsev. 2018. The building blocks of interpretability. Distill.

O'Toole, Alice J., Carlos D. Castillo, Connor J. Parde, Matthew Q. Hill & Rama Chellappa. 2018. Face space representations in deep convolutional neural networks. Trends in Cognitive Sciences.

Parikh, Devi & C. Zitnick. 2011. Human-debugging of machines. NIPS WCSSWC.

Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh & Dhruv Batra. 2016. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391.

Shen, Yujun, Ping Luo, Junjie Yan, Xiaogang Wang & Xiaoou Tang. 2018. FaceID-GAN: Learning a symmetry three-player GAN for identity-preserving face synthesis. In CVPR.

Shu, Zhixin, Ersin Yumer, Sunil Hadap, Kalyan Sunkavalli, Eli Shechtman & Dimitris Samaras. 2017. Neural face editing with intrinsic image disentangling. arXiv preprint arXiv:1704.04131.

Singh, Krishna Kumar & Yong Jae Lee. 2017. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV.

Singh, Saurabh, Abhinav Gupta & Alexei A. Efros. 2012. Unsupervised discovery of mid-level discriminative patches. In ECCV.

Sudderth, Erik B., Antonio Torralba, William T. Freeman & Alan S. Willsky. 2005. Learning hierarchical models of scenes, objects, and parts. In ICCV.

Tran, Luan & Xiaoming Liu. 2018. Nonlinear 3D face morphable model. In CVPR.
Tran, Luan, Xiaoming Liu, Jiayu Zhou & Rong Jin. 2017a. Missing modalities imputation via cascaded residual autoencoder. In CVPR.

Tran, Luan, Xi Yin & Xiaoming Liu. 2017b. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR.

Tran, Luan, Xi Yin & Xiaoming Liu. 2018. Representation learning by rotating your faces. TPAMI.

Vondrick, Carl, Aditya Khosla, Tomasz Malisiewicz & Antonio Torralba. 2013. HOGgles: Visualizing object detection features. In ICCV.

Wang, Xiaolong, Abhinav Shrivastava & Abhinav Gupta. 2017. A-Fast-RCNN: Hard positive generation via adversary for object detection. In CVPR.

Wen, Yandong, Kaipeng Zhang, Zhifeng Li & Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In ECCV.

Xie, Weidi & Andrew Zisserman. 2018. Multicolumn networks for face recognition. arXiv preprint arXiv:1807.09192.

Xu, Kelvin, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel & Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Yao, Bangpeng, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas & Li Fei-Fei. 2011. Human action recognition by learning bases of action attributes and parts. In ICCV.

Yi, Dong, Zhen Lei, Shengcai Liao & Stan Z. Li. 2014. Learning face representation from scratch. arXiv preprint arXiv:1411.7923.

Yin, Xi & Xiaoming Liu. 2018. Multi-task convolutional neural network for pose-invariant face recognition. TIP.

Yin, Xi, Xiang Yu, Kihyuk Sohn, Xiaoming Liu & Manmohan Chandraker. 2017. Towards large-pose face frontalization in the wild. In ICCV.

Zeiler, Matthew D. & Rob Fergus. 2014. Visualizing and understanding convolutional networks. In ECCV.

Zeiler, Matthew D., Graham W. Taylor & Rob Fergus. 2011. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV.

Zhang, Quanshi, Ying Nian Wu & Song-Chun Zhu. 2017. Interpretable convolutional neural networks. arXiv preprint arXiv:1710.00935.

Zhao, Jian, Yu Cheng, Yi Cheng, Yang Yang, Haochong Lan, Fang Zhao, Lin Xiong, Yan Xu, Jianshu Li, Sugiri Pranata, Shengmei Shen, Junliang Xing, Hengzhu Liu, Shuicheng Yan & Jiashi Feng. 2018. Look across elapse: Disentangled representation learning and photorealistic cross-age face synthesis for age-invariant face recognition. arXiv preprint arXiv:1809.00338.

Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva & Antonio Torralba. 2016. Learning deep features for discriminative localization. In CVPR.

Zhou, Erjin, Zhimin Cao & Qi Yin. 2015. Naive-deep face recognition: Touching the limit of LFW benchmark or not? arXiv preprint arXiv:1501.04690.