FROM PIXELS TO IDENTITY: VISUAL RECOGNITION AND BIOMETRIC APPLICATIONS

By Minchul Kim

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2025

ABSTRACT

Automated visual recognition has undergone a transformative evolution, advancing from handcrafted feature extraction to deep learning-driven systems that now permeate modern security, social, and personal computing platforms. Within this rapidly evolving landscape, face and body recognition have emerged as critical tasks—driven by their non-contact nature, scalability, and growing presence in real-world applications. However, achieving robust and generalizable performance in unconstrained settings continues to pose significant challenges, including image degradation, pose misalignment, limited training data, and the complexities of multimodal recognition.

This thesis investigates these challenges through the lens of biometric recognition, leveraging the transformative potential of deep learning and generative artificial intelligence to address both algorithmic and data-centric limitations. It introduces six major contributions. AdaFace proposes an adaptive margin loss that prioritizes learning from high-quality samples, improving performance in low-quality image conditions. CAFace targets video-based recognition with an attention-based feature aggregation framework optimized for temporal redundancy and long-duration sequences. DCFace pioneers synthetic dataset generation using a dual-condition diffusion model, enabling ethical, diverse, and scalable data creation for face recognition. KPRPE introduces a keypoint-aware positional encoding scheme that enhances robustness to misalignment and geometric variation. SapiensID unifies face and full-body recognition via a multi-resolution transformer trained on the large-scale, multimodal WebBody4M dataset.

Building upon these advances, the thesis concludes with a contribution aimed at real-world deployment: an efficient unified backbone for human recognition. This architecture introduces Keypoint-based Token Fusion (KP-ToFu) and Keypoint Absolute Position Encoding (KP-APE) to reduce computational cost while preserving spatial fidelity and identity-relevant detail. The result is a model that achieves strong performance with significantly lower FLOPs, making unified recognition systems viable for resource-constrained applications.

Together, these contributions form a comprehensive exploration of visual recognition in the deep learning era, highlighting how adaptive loss design, synthetic data generation, positional encoding, and architectural innovations can collectively address longstanding challenges. This thesis lays the foundation for the next generation of intelligent biometric systems—systems that are robust and explainable for deployment in complex, real-world environments.

Copyright by MINCHUL KIM 2025

ACKNOWLEDGEMENTS

Throughout the course of my PhD journey, I have received immense support, guidance, and encouragement from many incredible individuals, to whom I am deeply grateful. First and foremost, I would like to express my sincere gratitude to my advisor, Dr. Xiaoming Liu, for his invaluable guidance, unwavering support, and insightful mentorship throughout my PhD. Your expertise and encouragement have shaped both my research and professional growth. I am also grateful to Dr.
Anil Jain for his wisdom, inspiring feedback, and the many thought-provoking discussions that helped refine my work. It has been an honor to learn from you.

To my thesis committee members — Dr. Arun Ross, Dr. Anil Jain, Dr. Daniel Morris, and Dr. Kong Yu — thank you for your time, thoughtful questions, and constructive suggestions. Your perspectives and insights pushed me to broaden the scope and impact of my research. A special shout-out to Christopher Perry — working with you on the BRIAR project was a true pleasure. Your technical skill, dedication, and collaboration made a tremendous difference, and the project wouldn't have been possible without you.

To my fellow labmates in the Computer Vision Lab — Feng Liu, Shengjie Zhu, Masa Hu, Andrew Hou, Abhinav Kumar, Vishal Asnani, Xiao Guo, Yiyang Su, Jie Zhu, Girish Ganesan, Zeyuan Yin, Zhihao Zhang, Zhiyuan Ren, Zhizhong Huang, Gu Ziang, and Dingqiang Ye — thank you for the stimulating discussions, support, and camaraderie. Our time together, both inside and outside the lab, has been invaluable.

Most importantly, I would like to thank my wife for standing by my side throughout this journey — your patience, love, and belief in me mean everything. And to my son, Gio, who was born during my PhD and continues to grow into a happy, curious little human — your presence gave new meaning and motivation to my work. Finally, thank you to my parents for your unconditional love, sacrifices, and unwavering support — this achievement is as much yours as it is mine.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
  1.1 Thesis Contributions
  1.2 Thesis Organization

CHAPTER 2  ADAFACE: QUALITY ADAPTIVE MARGIN LOSS FOR FACE RECOGNITION
  2.1 Introduction
  2.2 Related Work
  2.3 Proposed Approach
  2.4 Experiments
  2.5 Gradient Scaling Term
  2.6 Feature Norm Analysis
  2.7 Visualization of Success and Failed Test Images
  2.8 Comparison with General Image-Quality Aware Learning Method
  2.9 Effect of Batch Size
  2.10 Implementation Details and Code
  2.11 Conclusion

CHAPTER 3  CLUSTER AND AGGREGATE: FACE RECOGNITION WITH LARGE PROBE SET
  3.1 Introduction
  3.2 Related Work
  3.3 Proposed Approach
  3.4 Experiments
  3.5 Implementation Details
  3.6 Norm Embedding
  3.7 Additional Performance Results
  3.8 Resource and Efficiency Comparison
  3.9 Training Progress and Learned Assignment
  3.10 Weight Visualization
  3.11 Comparison of Assignment Maps in Various Scenarios
  3.12 Effect of Sequence Length
  3.13 Conclusions

CHAPTER 4  DCFACE: SYNTHETIC FACE GENERATION WITH DUAL CONDITION DIFFUSION MODEL
  4.1 Introduction
  4.2 Related Works
  4.3 Proposed Approach
  4.4 Dataset Evaluation
  4.5 Experiments
  4.6 Training Details
  4.7 More Experiment Results
  4.8 Analysis
  4.9 Visualizations
  4.10 Miscellaneous
  4.11 Societal Concerns
  4.12 Implementation Details and Code
  4.13 Conclusion

CHAPTER 5  KEYPOINT RELATIVE POSITION ENCODING FOR FACE RECOGNITION
  5.1 Introduction
  5.2 Related Works
  5.3 Proposed Method
  5.4 Face Recognition Experiments
  5.5 Gait Recognition Experiments
  5.6 Training Details
  5.7 Supplementary Performance Analysis
  5.8 Training Landmark Detector (MobileNet-RetinaFace)
  5.9 IJB-S Evaluation Method
  5.10 Alignment Visualizations
  5.11 Comparison with SoTA Off-the-Shelf Landmark Detector
  5.12 Pipeline Detail
  5.13 KPRPE Visualization
  5.14 Conclusion

CHAPTER 6  SAPIENSID: FOUNDATION MODEL FOR UNIFIED HUMAN RECOGNITION
  6.1 Introduction
  6.2 Related Works
  6.3 Proposed Method
  6.4 Experiments
  6.5 Method Details
  6.6 Performance
  6.7 Visualization
  6.8 Potential Application of Retina Patch
  6.9 Limitations
  6.10 Ethical Concerns
  6.11 Conclusion

CHAPTER 7  EFFICIENT HUMAN RECOGNITION FRAMEWORK
  7.1 Introduction
  7.2 Related Work
  7.3 Proposed Work
  7.4 Experiments
  7.5 Conclusion

CHAPTER 8  DISCUSSION AND CONCLUSION
  8.1 Historical Context and Research Trajectory
  8.2 Limitations and Open Challenges
  8.3 Looking Ahead: Potential Future Directions
  8.4 Closing Remarks

BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

For five decades, the pursuit of automated facial recognition has evolved from a futuristic concept into a tangible and widely deployed technology. Initially perceived as a computationally formidable challenge demanding laborious, hand-crafted feature extraction techniques, exemplified by early efforts [113], face recognition (FR) has matured into a foundational element of contemporary security systems, user convenience features, and online social platforms. This remarkable trajectory, increasingly driven by sophisticated deep learning models, is intertwined with complex considerations surrounding ethics, data governance, and the very definition of 'identity' in automated systems. This thesis examines this half-century progression, highlighting key technological milestones, persistent challenges, and the future directions anticipated for automated FR.

Today, FR stands as perhaps the most widely utilized biometric identification method, partly because it closely resembles the way humans naturally identify one another [4, 108]. Several practical advantages underpin its ubiquity. Facial identification can occur without direct contact and from a distance, offering a less invasive experience compared to biometrics requiring physical touch, such as fingerprint or iris scanning [6]. The technology readily integrates with affordable camera sensors, fostering accessibility and large-scale implementation across various applications [6, 108]. The non-contact nature also presents hygienic advantages, a factor gaining prominence recently [5, 135]. Moreover, FR systems can operate discreetly using prevalent surveillance infrastructure and benefit significantly from the vast repositories of facial images already existing in government and institutional databases, including passport, visa, and driver's license photos [108].
The field of face recognition has undergone a paradigm shift with the advent of generative artificial intelligence (GenAI) [43, 81]. No longer confined to passive analysis, visual recognition now operates within an ecosystem where synthetic data generation, augmentation, and domain adaptation powered by GenAI have become integral to advancing state-of-the-art systems [15, 228]. This transition is reshaping how machines perceive, interpret, and interact with the visual world, unlocking possibilities that were previously unattainable. From identifying individuals in images to interpreting complex scenes and bridging multimodal domains, visual recognition is now a cornerstone of modern AI applications.

However, achieving robust visual recognition in unconstrained, real-world environments presents multifaceted challenges. Variability in image quality, diverse lighting conditions, complex poses, occlusions, and the scarcity of high-quality labeled datasets remain persistent barriers. GenAI [64, 98, 232, 302] offers transformative solutions to these issues, yet it also introduces new concerns, such as the domain gap, data privacy, and the potential misuse of synthetic content [82, 133, 234, 300]. To fully harness the capabilities of generative technologies while addressing these concerns, the field must innovate across algorithm design, data generation strategies, and ethical frameworks.

This thesis explores visual recognition through the lens of biometric recognition (face and body) and the challenges posed by image degradation, the limitations of existing datasets, and the need for unified recognition systems that can operate seamlessly across modalities. While generative AI plays a pivotal role in some aspects—particularly in data synthesis and domain adaptation—the broader narrative reflects a holistic effort to build recognition systems that are robust, efficient, and generalizable to real-world complexities.

AdaFace [122] introduces an adaptive loss function that prioritizes learning from high-quality, informative samples, mitigating the adverse effects of low-quality images often encountered in real-world data. CAFace [123] extends the challenge to the video domain and proposes an attention-based feature fusion framework that performs frame selection on videos of arbitrary length. Both works lay the foundation for conducting robust visual recognition in the presence of low-quality imagery.

DCFace [124] pioneers the face dataset generation task, synthesizing diverse and realistic face datasets. This work marks a significant step toward replacing real datasets with synthetic ones, addressing ethical concerns associated with biometric data collection. Additionally, it investigates the potential advantages of incorporating synthetic datasets alongside real data to enhance visual recognition performance.

KPRPE [125] incorporates semantic keypoints into visual recognition, enhancing resilience to the misalignment and geometric transformations often prevalent in low-quality images. This innovation highlights how generative-inspired positional encoding can empower models to better handle real-world variability. SapiensID [126] takes this idea of keypoint-enhanced recognition further and introduces a unified recognition system that spans facial and full-body identification, demonstrating the potential to overcome modality-specific boundaries.
Complemented by the creation of the WebBody4M dataset, SapiensID exemplifies the importance of high-quality datasets in creating versatile and generalizable recognition systems.

Together, these contributions provide a comprehensive framework for advancing visual recognition systems in the context of real-world challenges and the transformative influence of generative artificial intelligence. By addressing critical aspects such as image degradation, the scarcity of high-quality datasets, and the complexities of multimodal recognition, this thesis demonstrates how a combination of innovative algorithms, adaptive methodologies, and ethical considerations can push the boundaries of the field. Generative AI serves as both a tool and a catalyst in this journey, enabling solutions that are robust, efficient, and ethically sound.

This exploration of visual recognition under the GenAI paradigm underscores its pivotal role in shaping the future of biometric and multimodal recognition. By seamlessly integrating generative methodologies with resilient recognition frameworks, this work lays the foundation for systems capable of thriving in diverse and unpredictable environments. The advancements presented here not only address the specific challenges of face and body recognition but also pave the way for a new generation of intelligent systems equipped to meet the demands of increasingly complex visual domains.

1.1 Thesis Contributions

This thesis addresses critical challenges in visual recognition, focusing on robustness, data efficiency, and ethical considerations. The primary contributions are as follows:
• Robust Recognition under Image Degradation: AdaFace introduces a novel adaptive loss function that learns better representations from low-quality images, enhancing performance in degraded conditions.
• Scalable Recognition in Video Data: CAFace proposes a novel feature fusion framework with attention-based clustering and aggregation mechanisms, enabling efficient recognition that scales to long videos.
• Synthetic Dataset Generation: DCFace pioneers the generation of diverse synthetic datasets with a dual condition diffusion model and demonstrates the benefits of combining synthetic and real data for enhanced visual recognition.
• Resilience to Misalignment: KPRPE introduces KeyPoint Relative Position Encoding, an enhancement to traditional relative position encoding, making recognition robust to misalignment and geometric transformations.
• Unified Recognition Across Modalities: SapiensID presents a system for recognizing both faces and bodies, supported by the WebBody4M dataset, emphasizing versatility and generalizability.
• Efficient Unified Recognition: The final contribution proposes an efficient ViT-based architecture that unifies face and body recognition while reducing computational cost. It introduces Keypoint-based Token Fusion (KP-ToFu) and Keypoint Absolute Position Encoding (KP-APE), achieving state-of-the-art results with significantly lower FLOPs.

1.2 Thesis Organization

The thesis is organized as follows:
• Chapter 2: Introduces AdaFace for handling low-quality images.
• Chapter 3: Discusses CAFace for robust video-based recognition.
• Chapter 4: Presents DCFace for synthetic dataset generation.
• Chapter 5: Explores KPRPE for improved robustness to misalignment.
• Chapter 6: Details SapiensID for unified face and body recognition.
• Chapter 7: Describes the proposed efficient ViT-based backbone with KP-ToFu and KP-APE for real-time unified human recognition.
• Chapter 8: Discusses current limitations and future research directions.

CHAPTER 2
ADAFACE: QUALITY ADAPTIVE MARGIN LOSS FOR FACE RECOGNITION

Recognition in low-quality face datasets is challenging because facial attributes are obscured and degraded. Advances in margin-based loss functions have resulted in enhanced discriminability of faces in the embedding space. Further, prior studies have examined adaptive losses that assign more importance to misclassified (hard) examples. In this work, we introduce another aspect of adaptiveness in the loss function, namely the image quality. We argue that the strategy to emphasize misclassified samples should be adjusted according to their image quality. Specifically, the relative importance of easy or hard samples should be based on the sample's image quality. We propose a new loss function that emphasizes samples of different difficulties based on their image quality. Our method achieves this in the form of an adaptive margin function by approximating the image quality with feature norms. Extensive experiments show that our method, AdaFace, improves the face recognition performance over the state-of-the-art (SoTA) on four datasets (IJB-B, IJB-C, IJB-S and TinyFace). Code and models are released in Link.

2.1 Introduction

Image quality is a combination of attributes that indicates how faithfully an image captures the original scene [206]. Factors that affect the image quality include brightness, contrast, sharpness, noise, color constancy, resolution, tone reproduction, etc. Face images, the focus of this paper, can be captured under a variety of settings for lighting, pose and facial expression, and sometimes under extreme visual changes such as a subject's age or make-up. These parameter settings make the recognition task difficult for learned face recognition (FR) models. Still, the task is achievable in the sense that humans or models can often recognize faces under these difficult settings [231]. However, when a face image is of low quality, depending on the degree, the recognition task becomes infeasible. Fig. 2.1 shows examples of both high quality and low quality face images. It is not possible to recognize the subjects in the last column of Fig. 2.1.

Figure 2.1 Examples of face images with different qualities and recognizabilities. Both high and low quality images contain variations in pose, occlusion and resolution that sometimes make the recognition task difficult, yet achievable. Depending on the degree of degradation, some images may become impossible to recognize. By studying the different impacts these images have in training, this work aims to design a novel loss function that is adaptive to a sample's recognizability, driven by its image quality.

Low quality images like the bottom row of Fig. 2.1 are increasingly becoming an important part of face recognition datasets because they are encountered in surveillance videos and drone footage. Given that SoTA FR methods [55, 56, 102, 139] are able to obtain over 98% verification accuracy on relatively high quality datasets such as LFW or CFP-FP [100, 202], recent FR challenges have moved to lower quality datasets such as IJB-B, IJB-C and IJB-S [112, 169, 253]. Although the challenge is to attain high accuracy on low quality datasets, most popular training datasets still consist largely of high quality images [55, 82]. Since only a small portion of training data is low quality, it is important to properly leverage it during training.
One problem with low quality face images is that they tend to be unrecognizable. When the image degradation is too large, the relevant identity information vanishes from the image, resulting in unidentifiable images. These unidentifiable images are detrimental to the training procedure since a model will try to exploit other visual characteristics, such as clothing color or image resolution, to lower the training loss. If these images are dominant in the distribution of low quality images, the model is likely to perform poorly on low quality datasets during testing.

Figure 2.2 Conventional margin based softmax loss vs. our AdaFace. (a) A FR training pipeline with a margin based softmax loss. The loss function takes the margin function to induce smaller intra-class variations. Some examples are SphereFace, CosFace and ArcFace [55, 154, 240]. (b) The proposed adaptive margin function (AdaFace), which is adjusted based on the image quality indicator. If the image quality is indicated to be low, the loss function emphasizes easy samples (thereby avoiding unidentifiable images). Otherwise, the loss emphasizes hard samples.

Motivated by the presence of unidentifiable facial images, we would like to design a loss function which assigns different importance to samples of different difficulties according to the image quality. We aim to emphasize hard samples for high quality images and easy samples for low quality images. Typically, assigning different importance to different difficulties of samples is done by looking at the training progression (curriculum learning) [19, 102]. Yet, we show that the sample importance should be adjusted by looking at both the difficulty and the image quality. The reason why the importance should be set differently according to the image quality is that naively emphasizing hard samples always puts a strong emphasis on unidentifiable images. This is because one can only make a random guess about unidentifiable images and thus, they are always in the hard sample group. There are challenges in introducing image quality into the objective: image quality is a term that is hard to quantify due to its broad definition, and scaling samples based on the difficulty often introduces ad-hoc procedures that are heuristic in nature.

In this work, we present a loss function to achieve the above goal in a seamless way. We find that 1) the feature norm can be a good proxy for the image quality, and 2) various margin functions amount to assigning different importance to different difficulties of samples. These two findings are combined in a unified loss function, AdaFace, that adaptively changes the margin function to assign different importance to different difficulties of samples, based on the image quality (see Fig. 2.2). In summary, the contributions of this paper include:
• We propose a loss function, AdaFace, that assigns different importance to different difficulties of samples according to their image quality.
By incorporating image quality, we avoid emphasizing unidentifiable images while focusing on hard yet recognizable samples.
• We show that the angular margin scales the learning signal (gradient) based on the training sample's difficulty. This observation motivates us to change the margin function adaptively to emphasize hard samples if the image quality is high, and to ignore very hard samples (unidentifiable images) if the image quality is low.
• We demonstrate that feature norms can serve as a proxy of image quality. This bypasses the need for an additional module to estimate image quality. Thus, the adaptive margin function is achieved without additional complexity.
• We verify the efficacy of the proposed method by extensive evaluations on 9 datasets (LFW, CFP-FP, CPLFW, AgeDB, CALFW, IJB-B, IJB-C, IJB-S and TinyFace) of various qualities. We show that the recognition performance on low quality datasets can be greatly increased while maintaining performance on high quality datasets.

2.2 Related Work

Margin Based Loss Function  The margin based softmax loss function is widely used for training face recognition (FR) models [55, 102, 154, 240]. A margin is added to the softmax loss because without the margin, learned features are not sufficiently discriminative. SphereFace [154], CosFace [240] and ArcFace [55] introduce different forms of margin functions. Specifically, it can be written as
$$\mathcal{L} = -\log \frac{\exp(f(\theta_{y_i}, m))}{\exp(f(\theta_{y_i}, m)) + \sum_{j \neq y_i}^{n} \exp(s \cos\theta_j)}, \qquad (2.1)$$
where $\theta_j$ is the angle between the feature vector and the $j$-th classifier weight vector, $y_i$ is the index of the ground truth (GT) label, and $m$ is the margin, a scalar hyper-parameter. $f$ is a margin function, where
$$f(\theta_j, m)_{\text{SphereFace}} = \begin{cases} s\cos(m\theta_j) & j = y_i \\ s\cos\theta_j & j \neq y_i \end{cases}, \qquad (2.2)$$
$$f(\theta_j, m)_{\text{CosFace}} = \begin{cases} s(\cos\theta_j - m) & j = y_i \\ s\cos\theta_j & j \neq y_i \end{cases}, \qquad (2.3)$$
$$f(\theta_j, m)_{\text{ArcFace}} = \begin{cases} s\cos(\theta_j + m) & j = y_i \\ s\cos\theta_j & j \neq y_i \end{cases}. \qquad (2.4)$$
Sometimes, ArcFace is referred to as an angular margin and CosFace as an additive margin. Here, $s$ is a hyper-parameter for scaling. P2SGrad [287] notes that $m$ and $s$ are sensitive hyper-parameters and proposes to directly modify the gradient to be free of $m$ and $s$. Our approach aims to model the margin $m$ as a function of the image quality because $f(\theta_{y_i}, m)$ has an impact on which samples contribute more gradient (i.e., learning signal) during training.

Adaptive Loss Functions  Many studies have introduced an element of adaptiveness in the training objective for either hard sample mining [145, 248], scheduling difficulty during training [102, 211], or finding optimal hyperparameters [286]. For example, CurricularFace [102] brings the idea of curriculum learning into the loss function. During the initial stages of training, the margin for $\cos\theta_j$ (the negative cosine similarity) is set to be small so that easy samples can be learned, and in the later stages, the margin is increased so that hard samples are learned. Specifically, it is written as
$$f(\theta_j, m)_{\text{Curricular}} = \begin{cases} s\cos(\theta_j + m) & j = y_i \\ N(t, \cos\theta_j) & j \neq y_i \end{cases}, \qquad (2.5)$$
where
$$N(t, \cos\theta_j) = \begin{cases} \cos\theta_j & s\cos(\theta_{y_i} + m) \geq \cos\theta_j \\ \cos\theta_j (t + \cos\theta_j) & s\cos(\theta_{y_i} + m) < \cos\theta_j \end{cases}, \qquad (2.6)$$
and $t$ is a parameter that increases as the training progresses. Therefore, in CurricularFace, the adaptiveness in the margin is based on the training progression (curriculum). On the contrary, we argue that the adaptiveness in the margin should be based on the image quality. We believe that among high quality images, if a sample is hard (with respect to a model), the network should learn to exploit the information in the image; but among low quality images, if a sample is hard, it is more likely to be devoid of proper identity clues and the network should not try hard to fit on it.
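To make the margin functions in Eqs. 2.2-2.4 concrete, the sketch below applies the CosFace and ArcFace margins to the target logit of a cosine-similarity classifier. This is a minimal illustration rather than the released AdaFace training code; the function name, the hard-coded values of s and m, and the toy inputs are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(cos_theta, labels, margin="arcface", s=64.0, m=0.5):
    """Cross-entropy over margin-adjusted cosine logits (cf. Eqs. 2.1, 2.3, 2.4).

    cos_theta: (B, C) cosine similarities between normalized features and
               normalized classifier weights.
    labels:    (B,) ground-truth class indices.
    """
    target = cos_theta.gather(1, labels.view(-1, 1))           # cos(theta_yi)
    if margin == "cosface":                                     # Eq. 2.3
        target_m = target - m
    elif margin == "arcface":                                   # Eq. 2.4
        theta = torch.acos(target.clamp(-1 + 1e-7, 1 - 1e-7))
        target_m = torch.cos(theta + m)
    else:
        raise ValueError(margin)
    logits = cos_theta.clone()
    logits.scatter_(1, labels.view(-1, 1), target_m)            # replace only the target logit
    return F.cross_entropy(s * logits, labels)

# toy usage with random, already-normalized inputs
feat = F.normalize(torch.randn(8, 512), dim=1)
W = F.normalize(torch.randn(512, 100), dim=0)
loss = margin_softmax_loss(feat @ W, torch.randint(0, 100, (8,)))
```

Note that only the ground-truth logit is modified, matching the piecewise definitions above; non-target logits keep the plain scaled cosine.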
MagFace [172] explores the idea of applying different margins based on recognizability. It applies large angular margins to high norm features on the premise that high norm features are easily recognizable. The large margin pushes features of high norm closer to the class centers. Yet, it fails to emphasize hard training samples, which is important for learning discriminative features. It is also worth mentioning that DDL [101] uses a distillation loss to minimize the gap between easy and hard sample features.

Face Recognition with Low Quality Images  Recent FR models have achieved high performance on datasets where facial attributes are discernible, e.g., LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174] and CALFW [297]. Good performance on these datasets can be achieved when the FR model learns discriminative features invariant to lighting, age or pose variations. However, FR in unconstrained scenarios such as surveillance or low quality videos [276] brings more problems to the table. Examples of datasets in this setting are IJB-B [253], IJB-C [169] and IJB-S [112], where most of the images are of low quality, and some do not contain sufficient identity information, even for human examiners. The key to good performance involves both 1) learning discriminative features for low quality images and 2) learning to discard images that contain few identity cues. The latter is sometimes referred to as quality aware fusion. To perform quality aware fusion, probabilistic approaches have been proposed to predict uncertainty in the FR representation [38, 139, 181, 208, 298]. They assume the features are distributions and that the variance can be used to calculate the certainty in prediction. However, probabilistic approaches often resort to learning the mean and variance separately, which is not simple during training and is suboptimal as the variance is optimized with a fixed mean. Our work, however, is a modification to the conventional softmax loss, making the framework easy to use, and we use the feature norm as a proxy for quality during quality-aware fusion. QSub-PM [293] and UGG [294] also show good performance in LQ video recognition, by using a rich subspace (matrix) representation for comparison and by using auxiliary context (such as a body) to aid feature fusion, respectively. Synthetic data or augmentations can be used to mimic low quality data. [69, 210] adopt 3D reconstruction to generate faces. Such extra steps complicate the training procedure, making it hard to generalize to other domains. We adopt easily applicable crop, blur and photometric augmentations.

2.3 Proposed Approach

The cross entropy softmax loss of a sample $x_i$ can be formulated as follows,
$$\mathcal{L}_{CE}(x_i) = -\log \frac{\exp(W_{y_i} z_i + b_{y_i})}{\sum_{j=1}^{C} \exp(W_j z_i + b_j)}, \qquad (2.7)$$
where $z_i \in \mathbb{R}^d$ is $x_i$'s feature embedding, and $x_i$ belongs to the $y_i$-th class. $W_j$ refers to the $j$-th column of the last FC layer weight matrix, $W \in \mathbb{R}^{d \times C}$, and $b_j$ refers to the corresponding bias term. $C$ refers to the number of classes. During test time, for an arbitrary pair of images, $x_p$ and $x_q$, the cosine similarity metric, $\frac{z_p \cdot z_q}{\|z_p\|\|z_q\|}$, is used to find the closest matching identities.
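A minimal sketch of this test-time comparison, assuming two already-extracted embeddings and an illustrative (dataset-dependent) decision threshold:

```python
import numpy as np

def cosine_similarity(z_p, z_q):
    """Cosine similarity between two feature embeddings."""
    return float(np.dot(z_p, z_q) / (np.linalg.norm(z_p) * np.linalg.norm(z_q)))

# toy usage: decide whether two embeddings belong to the same identity
z_p, z_q = np.random.randn(512), np.random.randn(512)
same_identity = cosine_similarity(z_p, z_q) > 0.3   # threshold chosen for illustration only
```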
To make the training objective directly optimize the cosine distance, [154, 239] use normalized softmax, where the bias term is set to zero and the feature $z_i$ is normalized and rescaled with $s$ during training. This modification results in
$$\mathcal{L}_{CE}(x_i) = -\log \frac{\exp(s \cdot \cos\theta_{y_i})}{\sum_{j=1}^{C} \exp(s \cos\theta_j)}, \qquad (2.8)$$
where $\theta_j$ corresponds to the angle between $z_i$ and $W_j$. Follow-up works [55, 240] take this formulation and introduce a margin to reduce the intra-class variations. Generally, it can be written as Eq. 2.1, where the margin functions are defined in Eqs. 2.2, 2.3 and 2.4, respectively.

Figure 2.3 Illustration of different margin functions and their gradient scaling terms on the feature space. B0 and B1 show the decision boundary without and with margin m, respectively. The yellow arrow indicates the shift in the boundary due to margin m. Our work adaptively changes the margin functions based on the norm. With high norm, we emphasize samples away from the boundary, and with low norm we emphasize samples near the boundary. Circles and triangles in the arc show example scenarios in the rightmost plot (AdaFace).

2.3.1 Margin Form and the Gradient

Previous works on margin based softmax focused on how the margin shifts the decision boundaries and what their geometric interpretations are [55, 240]. In this section, we show that during backpropagation, the gradient change due to the margin has the effect of scaling the importance of a sample relative to the others. In other words, the angular margin can introduce an additional term in the gradient equation that scales the signal according to the sample's difficulty. To show this, we will look at how the gradient equation changes with the margin function $f(\theta_{y_i}, m)$.

Let $P_j^{(i)}$ be the probability output at class $j$ after the softmax operation on an input $x_i$. By deriving the gradient equations for $\mathcal{L}_{CE}$ w.r.t. $W_j$ and $x_i$, we obtain the following,
$$P_j^{(i)} = \frac{\exp(f(\cos\theta_{y_i}))}{\exp(f(\cos\theta_{y_i})) + \sum_{j \neq y_i}^{n} \exp(s \cos\theta_j)}, \qquad (2.9)$$
$$\frac{\partial \mathcal{L}_{CE}}{\partial W_j} = \left(P_j^{(i)} - \mathbb{1}(y_i = j)\right) \frac{\partial f(\cos\theta_j)}{\partial \cos\theta_j} \frac{\partial \cos\theta_j}{\partial W_j}, \qquad (2.10)$$
$$\frac{\partial \mathcal{L}_{CE}}{\partial x_i} = \sum_{k=1}^{C} \left(P_k^{(i)} - \mathbb{1}(y_i = k)\right) \frac{\partial f(\cos\theta_k)}{\partial \cos\theta_k} \frac{\partial \cos\theta_k}{\partial x_i}. \qquad (2.11)$$
In Eqs. 2.10 and 2.11, the first two terms, $\left(P_j^{(i)} - \mathbb{1}(y_i = j)\right)$ and $\frac{\partial f(\cos\theta_j)}{\partial \cos\theta_j}$, are scalars. Also, these two are the only terms affected by the parameter $m$ through $f(\cos\theta_{y_i})$. As the direction term $\frac{\partial \cos\theta_j}{\partial W_j}$ is free of $m$, we can think of the first two scalar terms as a gradient scaling term (GST) and denote
$$g := \left(P_j^{(i)} - \mathbb{1}(y_i = j)\right) \frac{\partial f(\cos\theta_j)}{\partial \cos\theta_j}. \qquad (2.12)$$
For the purpose of the GST analysis, we will consider the class index $j = y_i$, since all negative class indices $j \neq y_i$ do not have a margin in Eqs. 2.2, 2.3, and 2.4. The GST for the normalized softmax loss is
$$g_{\text{softmax}} = (P_{y_i}^{(i)} - 1)s, \qquad (2.13)$$
since $f(\cos\theta_{y_i}) = s \cdot \cos\theta_{y_i}$ and $\frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} = s$. The GST for CosFace [240] is also
$$g_{\text{CosFace}} = (P_{y_i}^{(i)} - 1)s, \qquad (2.14)$$
as $f(\cos\theta_{y_i}) = s(\cos\theta_{y_i} - m)$ and $\frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} = s$. Yet, the GST for ArcFace [55] turns out to be
$$g_{\text{ArcFace}} = (P_{y_i}^{(i)} - 1)\, s \left(\cos(m) + \frac{\cos\theta_{y_i}\sin(m)}{\sqrt{1 - \cos^2\theta_{y_i}}}\right). \qquad (2.15)$$
Since the GST is a function of $\theta_{y_i}$ and $m$ as in Eq. 2.15, it is possible to use it to control the emphasis on samples based on the difficulty, i.e., $\theta_{y_i}$, during training.

To understand the effect of GST, we visualize GST w.r.t. the features. Fig. 2.3 shows the GST as the color in the feature space. Note that for the angular margin, the GST peaks at the decision boundary but slowly decreases as it moves away towards $W_j$, and harder samples receive less emphasis. If we change the sign of the angular margin, we see an opposite effect. Note that, in the 6th column, MagFace [172] is an extension of ArcFace (positive angular margin) with a larger margin assigned to high norm features. Both ArcFace and MagFace fail to put high emphasis on hard samples (green area near $W_j$). We combine all margin functions (positive and negative angular margins and additive margins) to emphasize hard samples when necessary. Note that this adaptiveness is also different from approaches that use the training stage to change the relative importance of different difficulties of samples [102]. Fig. 2.3 shows CurricularFace, where the decision boundary and the GST $g$ change depending on the training stage.

2.3.2 Norm and Image Quality

Image quality is a comprehensive term that covers characteristics such as brightness, contrast and sharpness. Image quality assessment (IQA) is widely studied in computer vision [283]. SER-FIQ [227] is an unsupervised DL method for face IQA. BRISQUE [173] is a popular algorithm for blind/no-reference IQA. However, such methods are computationally expensive to use during training. In this work, we refrain from introducing an additional module that calculates the image quality. Instead, we use the feature norm as a proxy for the image quality. We observe that, in models trained with a margin-based softmax loss, the feature norm exhibits a trend that is correlated with the image quality. In Fig. 2.4 (a) we show a correlation plot between the feature norm and the image quality (IQ) score calculated with (1−BRISQUE) as a green curve. We randomly sampled 1,534 images from the training dataset (MS1MV2 [55] with augmentations described in Sec. 2.4.1) and calculate the feature norm using a pretrained model. At the final epoch, the correlation score between the feature norm and the IQ score reaches 0.5235 (on a scale of −1 to 1). The corresponding scatter plot is shown in Fig. 2.4 (b). This high correlation between the feature norm and the IQ score supports our use of the feature norm as the proxy of image quality.

Figure 2.4 (a) A plot of Pearson correlation with the image quality score (1−BRISQUE) over training epochs. The green and orange curves correspond to the correlation plots using the feature norm ∥z_i∥ and the probability output for the ground truth index P_{y_i}, respectively. (b) and (c) Corresponding scatter plots for the last epoch. The blue line on each scatter plot and the corresponding equation show the least squares line fitted to the data points.

In Fig. 2.4 (a) we also show a correlation plot between the probability output P_{y_i} and the IQ score as an orange curve. Note that the correlation is always higher for the feature norm than for P_{y_i}. Furthermore, the correlation between the feature norm and the IQ score is visible from an early stage of training. This is a useful property for using the feature norm as the proxy of image quality because we can rely on the proxy from the early stage of training.
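As a concrete illustration of this analysis, the snippet below computes the Pearson correlation between feature norms and an image-quality score for a set of images. The feature extractor is not included and the synthetic data only stands in for real embeddings and BRISQUE scores; names and values here are illustrative assumptions, not the exact tooling used for Fig. 2.4.

```python
import numpy as np
from scipy.stats import pearsonr

def norm_quality_correlation(features, quality_scores):
    """Pearson correlation between feature norms and image-quality scores.

    features:       (N, d) array of embeddings from a face recognition model.
    quality_scores: (N,) array, e.g. 1 - BRISQUE (higher = better quality).
    """
    norms = np.linalg.norm(features, axis=1)
    r, _ = pearsonr(norms, quality_scores)
    return r

# toy example with synthetic data standing in for real embeddings
rng = np.random.default_rng(0)
quality = rng.uniform(0.0, 1.0, size=1534)
feats = rng.normal(size=(1534, 512)) * (1.0 + quality)[:, None]  # norm grows with quality
print(f"Pearson r = {norm_quality_correlation(feats, quality):.3f}")
```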
Also, in Fig. 2.4 (c), we show a scatter plot between P_{y_i} and the IQ score. Notice that there is a non-linear relationship between P_{y_i} and the image quality. One way to describe a sample's difficulty is with 1 − P_{y_i}, and the plot shows that the distribution of the difficulty of samples is different based on image quality. Therefore, it makes sense to consider the image quality when adjusting the sample importance according to the difficulty.

2.3.3 AdaFace: Adaptive Margin based on Norm

To address the problem caused by the unidentifiable images, we propose to adapt the margin function based on the feature norm. In Sec. 2.3.1, we have shown that using different margin functions can emphasize different difficulties of samples. Also, in Sec. 2.3.2, we have observed that the feature norm can be a good way to find low quality images. We will merge the two findings and propose a new loss for FR.

Image Quality Indicator  As the feature norm ∥z_i∥ is a model dependent quantity, we normalize it using batch statistics µ_z and σ_z. Specifically, we let
$$\widehat{\|z_i\|} = \left\lfloor \frac{\|z_i\| - \mu_z}{\sigma_z / h} \right\rceil_{-1}^{1}, \qquad (2.16)$$
where µ_z and σ_z are the mean and standard deviation of all ∥z_i∥ within a batch, and ⌊·⌉ refers to clipping the value between −1 and 1 and stopping the gradient from flowing. Since (∥z_i∥ − µ_z)/(σ_z/h) makes the batch distribution of $\widehat{\|z_i\|}$ approximately unit Gaussian, we clip the value to be within −1 and 1 for better handling. It is known that approximately 68% of the unit Gaussian distribution falls between −1 and 1, so we introduce the term h to control the concentration. We set h such that most of the values (∥z_i∥ − µ_z)/(σ_z/h) fall between −1 and 1. A good value to achieve this would be h = 0.33. Later, in Sec. 2.4.2, we ablate and validate this claim. We stop the gradient from flowing during backpropagation because we do not want features to be optimized to have low norms.

If the batch size is small, the batch statistics µ_z and σ_z can be unstable. Thus, we use the exponential moving average (EMA) of µ_z and σ_z across multiple steps to stabilize the batch statistics. Specifically, let µ_z^{(k)} and σ_z^{(k)} be the k-th step batch statistics of ∥z_i∥. Then
$$\mu_z = \alpha \mu_z^{(k)} + (1 - \alpha)\mu_z^{(k-1)}, \qquad (2.17)$$
where α is a momentum set to 0.99. The same holds for σ_z.

Adaptive Margin Function  We design a margin function such that 1) if image quality is high, we emphasize hard samples, and 2) if image quality is low, we de-emphasize hard samples. We achieve this with two adaptive terms, g_angle and g_add, referring to angular and additive margins, respectively. Specifically, we let
$$f(\theta_j, m)_{\text{AdaFace}} = \begin{cases} s\left(\cos(\theta_j + g_{\text{angle}}) - g_{\text{add}}\right) & j = y_i \\ s\cos\theta_j & j \neq y_i \end{cases}, \qquad (2.18)$$
where g_angle and g_add are functions of $\widehat{\|z_i\|}$. We define
$$g_{\text{angle}} = -m \cdot \widehat{\|z_i\|}, \qquad g_{\text{add}} = m \cdot \widehat{\|z_i\|} + m. \qquad (2.19)$$
Note that when $\widehat{\|z_i\|} = -1$, the proposed function becomes ArcFace. When $\widehat{\|z_i\|} = 0$, it becomes CosFace. When $\widehat{\|z_i\|} = 1$, it becomes a negative angular margin with a shift. Fig. 2.3 shows the effect of the adaptive function on the gradient. The high norm features will receive a higher gradient scale far away from the decision boundary, whereas the low norm features will receive a higher gradient scale near the decision boundary. For low norm features, the harder samples away from the boundary are de-emphasized.
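The sketch below ties Eqs. 2.16-2.19 together: it normalizes batch feature norms with EMA statistics and turns them into the adaptive angular and additive margins applied to the target logit. It is a simplified illustration of the idea, not the released implementation; the class name, buffer handling, and initial statistics are placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMargin(nn.Module):
    """Sketch of the AdaFace adaptive margin (Eqs. 2.16-2.19)."""

    def __init__(self, m=0.4, h=0.33, s=64.0, alpha=0.99):
        super().__init__()
        self.m, self.h, self.s, self.alpha = m, h, s, alpha
        # arbitrary initial values; quickly overwritten by the EMA updates
        self.register_buffer("mu_z", torch.tensor(20.0))
        self.register_buffer("sigma_z", torch.tensor(5.0))

    def forward(self, cos_theta, norms, labels):
        # EMA batch statistics of the feature norm (Eq. 2.17)
        if self.training:
            self.mu_z = self.alpha * norms.mean().detach() + (1 - self.alpha) * self.mu_z
            self.sigma_z = self.alpha * norms.std().detach() + (1 - self.alpha) * self.sigma_z
        # image quality indicator, clipped to [-1, 1] with gradient stopped (Eq. 2.16)
        z_hat = ((norms - self.mu_z) / (self.sigma_z / self.h + 1e-8)).clamp(-1, 1).detach()
        g_angle = -self.m * z_hat                 # Eq. 2.19
        g_add = self.m * z_hat + self.m
        # adaptive margin applied only to the target logit (Eq. 2.18)
        target = cos_theta.gather(1, labels.view(-1, 1)).squeeze(1)
        theta = torch.acos(target.clamp(-1 + 1e-7, 1 - 1e-7))
        target_m = torch.cos(theta + g_angle) - g_add
        logits = cos_theta.clone()
        logits.scatter_(1, labels.view(-1, 1), target_m.unsqueeze(1))
        return F.cross_entropy(self.s * logits, labels)
```

In this sketch the margin depends only on the batch-normalized feature norm: at z_hat = −1 it reduces to an ArcFace-style margin and at z_hat = 0 to a CosFace-style margin, mirroring the discussion above.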
2.4 Experiments

2.4.1 Datasets and Implementation Details

Datasets  We use MS1MV2 [55], MS1MV3 [57] and WebFace4M [300] as our training datasets. Each dataset contains 5.8M, 5.1M and 4.2M facial images, respectively. We test on 9 datasets of varying qualities. Following the protocol of [210], we categorize the test datasets into 3 types according to the visual quality (examples shown in Fig. 2.5).
• High Quality: LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174] and CALFW [297] are popular benchmarks for FR in the well controlled setting. While the images show variations in lighting, pose, or age, they are of sufficiently good quality for face recognition.
• Mixed Quality: IJB-B and IJB-C [169, 253] are datasets collected for the purpose of introducing low quality images in the validation protocol. They contain both high quality images and low quality videos of celebrities.
• Low Quality: IJB-S [112] and TinyFace [46] are datasets with low quality images and/or videos. IJB-S is a surveillance video dataset, with test protocols such as Surveillance-to-Single, Surveillance-to-Booking and Surveillance-to-Surveillance. The first/second word in the protocol refers to the probe/gallery image source. Surveillance refers to the surveillance video, Single refers to a high quality enrollment image and Booking refers to multiple enrollment images taken from different viewpoints. TinyFace consists only of low quality images.

Figure 2.5 Examples of the three categories of test datasets in our study: (a) high quality, (b) mixed quality, and (c) low quality.

Training Settings  We preprocess the dataset by cropping and aligning faces with five landmarks, as in [55, 285], resulting in 112 × 112 images. For the backbone, we adopt ResNet [86] as modified in [55]. We use the same optimizer and learning rate schedule as in [102], and train for 24 epochs. The model is trained with SGD with an initial learning rate of 0.1 and step scheduling at 10, 18 and 22 epochs. If the dataset contains augmentations, we add 2 more epochs for convergence. For the scale parameter s, we set it to 64, following the suggestion of [55, 240].

Augmentations  Since our proposed method is designed to train better in the presence of unidentifiable images in the training data, we introduce three on-the-fly augmentations that are widely used in image classification tasks [88], i.e., cropping, rescaling and photometric jittering. These augmentations will create more data but also introduce more unidentifiable images. It is a trade-off that has to be balanced. In FR, these augmentations are not commonly used because they generally do not bring benefit to the performance (as shown in Sec. 2.4.2). We show that our loss function is capable of reaping the benefit of augmentations because it can adapt to ignore unidentifiable images. Cropping defines a random rectangular area (patch) and sets the region outside the area to 0. We do not cut and resize the image, as the alignment of the face is important. Photometric augmentation randomly scales hue, saturation and brightness. Rescaling involves resizing an image to a smaller scale and back, resulting in blurriness. These operations are applied randomly with a probability of 0.2.

2.4.2 Ablation and Analysis

For the hyperparameter m and h ablation, we adopt a ResNet18 backbone and use a randomly sampled 1/6th of MS1MV2. We use two performance metrics. For High Quality Datasets (HQ), we use an average of 1:1 verification accuracy on LFW, CFP-FP, CPLFW, AgeDB and CALFW. For Low Quality Datasets (LQ), we use an average of the closed-set rank-1 retrieval and the open-set TPIR@FPIR=1% for all 3 protocols of IJB-S. Unless otherwise stated, we augment the data as described in Sec. 2.4.1.

Effect of Image Quality Indicator Concentration h  In Sec. 2.3.3, we claim that h = 0.33 is a good value. To validate this claim, we show in Tab. 2.1 the performance when varying h. When h = 0.33, the model performs the best. For h = 0.22 or h = 0.66, the performance is still higher than CurricularFace. As long as h is set such that the indicator has some variation, h is not very sensitive. We set h = 0.33.

Effect of Hyperparameter m  The margin m corresponds to both the maximum range of the angular margin and the magnitude of the additive margin. Tab. 2.1 shows that the performance is best for HQ datasets when m = 0.4 and for LQ datasets when m = 0.75. A large m results in a large angular margin variation based on the image quality, resulting in more adaptivity. In subsequent experiments, we choose m = 0.4 since it achieves good performance for LQ datasets without sacrificing performance on HQ datasets.

Effect of Proxy Choice  In Tab. 2.1, to show the effectiveness of using the feature norm as a proxy for image quality, we switch the feature norm with other quantities such as (1−BRISQUE) or P_{y_i}. The performance using the feature norm is superior to using the others. The BRISQUE score is precomputed for the training dataset, so it is not as effective in capturing the image quality when training with augmentation. We include P_{y_i} to show that the adaptiveness in feature norm is different from adaptiveness in difficulty.

Method               | h    | m    | Proxy     | HQ Datasets | LQ Datasets
CurricularFace [102] | -    | 0.50 | -         | 93.43       | 32.92
AdaFace              | 0.22 | 0.40 | Norm      | 93.67       | 34.92
AdaFace              | 0.33 | 0.40 | Norm      | 93.74       | 35.40
AdaFace              | 0.66 | 0.40 | Norm      | 93.70       | 35.29
AdaFace              | 0.33 | 0.40 | Norm      | 93.74       | 35.40
AdaFace              | 0.33 | 0.50 | Norm      | 93.56       | 35.23
AdaFace              | 0.33 | 0.75 | Norm      | 93.37       | 35.69
AdaFace              | 0.33 | 0.40 | Norm      | 93.74       | 35.40
AdaFace              | 0.33 | 0.40 | 1-BRISQUE | 93.43       | 34.55
AdaFace              | 0.33 | 0.40 | P_yi      | 93.46       | 35.17

Table 2.1 Ablation of our margin function parameters h and m, and the image quality proxy choice, on the ResNet18 backbone. The performance metrics are as described in Sec. 2.4.2.

Effect of Augmentation  We introduce on-the-fly augmentations in our training data. Our proposed loss can effectively handle the unidentifiable images, which are generated occasionally during augmentations. We experiment with a larger model, ResNet50, on the full MS1MV2 dataset. Tab. 2.2 shows that the augmentation indeed brings performance gains for AdaFace. The performance on HQ datasets stays the same, whereas LQ datasets enjoy a significant performance gain. Note that the augmentation hurts the performance of CurricularFace, which is in line with our assumption that augmentation is a tradeoff between a positive effect from getting more data and a negative effect from unidentifiable images. Prior works on margin-based softmax do not include on-the-fly augmentations as the performance could be worse.

Method               | p   | HQ Datasets | LQ Datasets
CurricularFace [102] | 0.0 | 96.85       | 41.00
CurricularFace [102] | 0.2 | 96.75       | 40.84
CurricularFace [102] | 0.3 | 96.59       | 40.58
AdaFace              | 0.0 | 96.72       | 40.95
AdaFace              | 0.2 | 96.88       | 41.82
AdaFace              | 0.3 | 96.78       | 41.93

Table 2.2 Ablation of the augmentation probability p, on the ResNet50 backbone. The metrics are the same as in Tab. 2.1.
AdaFace avoids overfitting on unidentifiable images, therefore it can exploit the augmentation better.

Analysis  To show how the feature norm ∥z_i∥ and the difficulty of training samples change during training, we plot the sample trajectories in Fig. 2.6. A total of 1,536 samples are randomly sampled from the training data. Each column in the heatmap represents a sample, and the x-axis is sorted according to the norm of the last epoch. Sample #600 is approximately the middle point of the transition from low to high norm samples. The bottom plot shows that many of the probability trajectories of low norm samples never reach a high probability till the end. It is in line with our claim that low norm features are more likely to be unidentifiable images. It justifies our motivation to put less emphasis on these cases, although they are "hard" cases. The percentage of samples with augmentations is higher for the low norm features than for the high norm features. For samples #0 to #600, about 62.0% have at least one type of augmentation. For samples #600 or higher, the percentage is about 38.5%.

Figure 2.6 A plot of training samples' trajectories of the feature norm ∥z_i∥ and the probability output for the ground truth index P_{y_i}. We randomly select 1,536 samples from the training data with augmentations, and show 8 images evenly sampled from them. The features with low norm have a different probability trajectory than the others, and the corresponding images are hard to identify.

Time Complexity  Compared to classic margin-based loss functions, our method adds a negligible amount of computation in training. With the same setting, ArcFace [55] takes 0.3193s per iteration while AdaFace takes 0.3229s (+1%).

2.4.3 Comparison with SoTA Methods

To compare with SoTA methods, we evaluate ResNet100 trained with the AdaFace loss on the 9 datasets listed in Sec. 2.4.1. For the high quality datasets, Tab. 2.3 (a) shows that AdaFace performs on par with competitive methods such as BroadFace [128], SCF-ArcFace [139] and VPL-ArcFace [56]. This strong performance on high quality datasets is due to the hard sample emphasis on high quality cases during training. Note that some performances on high quality datasets are saturated, making the gain less pronounced. Thus, choosing one model over the others is somewhat difficult based solely on the numbers. Unlike SCF-ArcFace, our method does not use additional learnable layers, nor does it require two-stage training. It is a revamp of the loss function, which makes it easier to apply our method to new tasks or backbones.

Method                   | Venue  | Train Data | LFW   | CFP-FP | CPLFW | AgeDB | CALFW | AVG   | IJB-B | IJB-C
CosFace (m=0.35) [240]   | CVPR18 | MS1MV2     | 99.81 | 98.12  | 92.28 | 98.11 | 95.76 | 96.82 | 94.80 | 96.37
ArcFace (m=0.50) [55]    | CVPR19 | MS1MV2     | 99.83 | 98.27  | 92.08 | 98.28 | 95.45 | 96.78 | 94.25 | 96.03
AFRN [114]               | ICCV19 | MS1MV2     | 99.85 | 95.56  | 93.48 | 95.35 | 96.30 | 96.11 | 88.50 | 93.00
MV-Softmax [248]         | AAAI20 | MS1MV2     | 99.80 | 98.28  | 92.83 | 97.95 | 96.10 | 96.99 | 93.60 | 95.20
CurricularFace [102]     | CVPR20 | MS1MV2     | 99.80 | 98.37  | 93.13 | 98.32 | 96.20 | 97.16 | 94.80 | 96.10
URL [210]                | CVPR20 | MS1MV2     | 99.78 | 98.64  | -     | -     | -     | -     | -     | 96.60
BroadFace [128]          | ECCV20 | MS1MV2     | 99.85 | 98.63  | 93.17 | 98.38 | 96.20 | 97.25 | 94.97 | 96.38
MagFace [172]            | CVPR21 | MS1MV2     | 99.83 | 98.46  | 92.87 | 98.17 | 96.15 | 97.10 | 94.51 | 95.97
SCF-ArcFace [139]        | CVPR21 | MS1MV2     | 99.82 | 98.40  | 93.16 | 98.30 | 96.12 | 97.16 | 94.74 | 96.09
DAM-CurricularFace [152] | ICCV21 | MS1MV2     | -     | -      | -     | -     | -     | -     | 95.12 | 96.20
AdaFace (m=0.4)          | CVPR22 | MS1MV2     | 99.82 | 98.49  | 93.53 | 98.05 | 96.08 | 97.19 | 95.67 | 96.89
VPL-ArcFace [56]         | CVPR21 | MS1MV3     | 99.83 | 99.11  | 93.45 | 98.60 | 96.12 | 97.42 | 95.56 | 96.76
AdaFace (m=0.4)          | CVPR22 | MS1MV3     | 99.83 | 99.03  | 93.93 | 98.17 | 96.02 | 97.40 | 95.84 | 97.09
ArcFace* [55]            | CVPR19 | WebFace4M  | 99.83 | 99.19  | 94.35 | 97.95 | 96.00 | 97.46 | 95.75 | 97.16
AdaFace (m=0.4)          | CVPR22 | WebFace4M  | 99.80 | 99.17  | 94.63 | 97.90 | 96.05 | 97.51 | 96.03 | 97.39

(a) A performance comparison of recent methods on high quality (LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174], CALFW [297]) and mixed quality (IJB-B [253], IJB-C [169]) datasets.

Method                | Train Data      | S2Single R-1 | R-5   | 1%    | S2Booking R-1 | R-5   | 1%    | S2Surv R-1 | R-5   | 1%   | TinyFace R-1 | R-5
PFE [208]             | MS1MV2 [55]     | 50.16        | 58.33 | 31.88 | 53.60         | 61.75 | 35.99 | 9.20       | 20.82 | 0.84 | -            | -
ArcFace [55]          | MS1MV2 [55]     | 57.35        | 64.42 | 41.85 | 57.36         | 64.95 | 41.23 | -          | -     | -    | -            | -
URL [210]             | MS1MV2 [55]     | 59.79        | 65.78 | 41.06 | 61.98         | 67.12 | 42.73 | -          | -     | -    | 63.89        | 68.67
CurricularFace* [102] | MS1MV2 [55]     | 62.43        | 68.68 | 47.68 | 63.81         | 69.74 | 47.57 | 19.54      | 32.80 | 2.53 | 63.68        | 67.65
AdaFace (m=0.4)       | MS1MV2 [55]     | 65.26        | 70.53 | 51.66 | 66.27         | 71.61 | 50.87 | 23.74      | 37.47 | 2.50 | 68.21        | 71.54
AdaFace (m=0.4)       | MS1MV3 [57]     | 67.12        | 72.67 | 53.67 | 67.83         | 72.88 | 52.03 | 26.23      | 40.60 | 3.28 | 67.81        | 70.98
ArcFace* [55]         | WebFace4M [300] | 69.26        | 74.31 | 57.06 | 70.31         | 75.15 | 56.89 | 32.13      | 46.67 | 5.32 | 71.11        | 74.38
AdaFace (m=0.4)       | WebFace4M [300] | 70.42        | 75.29 | 58.27 | 70.93         | 76.11 | 58.02 | 35.05      | 48.22 | 4.96 | 72.02        | 74.52

(b) A performance comparison of recent methods on low quality datasets: the IJB-S [112] Surveillance-to-Single, Surveillance-to-Booking and Surveillance-to-Surveillance protocols (Rank-1, Rank-5 and TPIR@FPIR=1%, denoted 1%), and TinyFace [46] (Rank-1, Rank-5).

Table 2.3 Comparison on benchmark datasets, with the ResNet100 backbone.

For mixed quality datasets, Tab. 2.3 (a) clearly shows the improvement of AdaFace. On IJB-B and IJB-C, AdaFace reduces the errors of the second best relatively by 11% and 9%, respectively. This shows the efficacy of using feature norms as an image quality proxy to treat samples differently. For low quality datasets, Tab. 2.3 (b) shows that AdaFace substantially outperforms all baselines. Compared to the second best, our averaged performance gain over 4 Rank-1 metrics is 3.5%, and over 3 TPIR@FPIR=1% metrics is 2.4%. These results show that AdaFace is effective in learning a good representation for the low quality settings as it prevents the model from fitting on unidentifiable images. We further train on a refined dataset, MS1MV3 [57], for a fair comparison with a recent work, VPL-ArcFace [56]. The performance using MS1MV3 is higher than MS1MV2 due to less noise in MS1MV3. We also train on the newly released WebFace4M [300] dataset. While one method might shine on one type of data, it is remarkable to see that collectively AdaFace achieves SoTA performance on test data with a wide range of image quality, and on various training sets.

2.5 Gradient Scaling Term

In Sec. 2.3.1, the gradient scaling term (GST), g, is introduced. Specifically, it is derived from the gradient equation for the margin-based softmax loss and defined as
$$g := \left(P_j^{(i)} - \mathbb{1}(y_i = j)\right) \frac{\partial f(\cos\theta_j)}{\partial \cos\theta_j}, \qquad (2.20)$$
where
$$P_j^{(i)} = \frac{\exp(f(\cos\theta_{y_i}))}{\exp(f(\cos\theta_{y_i})) + \sum_{j \neq y_i}^{n} \exp(s \cos\theta_j)}. \qquad (2.21)$$
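As a numerical illustration of Eqs. 2.20-2.21, the snippet below evaluates the GST for the softmax, CosFace, and ArcFace margins on a single sample. It is a toy, one-sample sketch: the fixed non-target logits and the class count are assumptions chosen only to show how the angular margin reweights samples by difficulty, not a faithful reproduction of training statistics.

```python
import numpy as np

def gst(cos_t, margin="softmax", s=64.0, m=0.5, n_cls=100):
    """Gradient scaling term g (Eq. 2.20) at j = y_i for one sample with target cosine cos_t.

    Non-target logits are fixed at cos(theta_j) = 0 purely for illustration.
    """
    if margin == "softmax":
        f, df = s * cos_t, s
    elif margin == "cosface":
        f, df = s * (cos_t - m), s
    elif margin == "arcface":
        f = s * np.cos(np.arccos(cos_t) + m)
        # derivative of s*cos(theta + m) w.r.t. cos(theta); derived in Sec. 2.5.1 (Eq. 2.23)
        df = s * (np.cos(m) + cos_t * np.sin(m) / np.sqrt(1.0 - cos_t**2))
    p = np.exp(f) / (np.exp(f) + (n_cls - 1) * np.exp(s * 0.0))   # Eq. 2.21
    return (p - 1.0) * df                                          # Eq. 2.20

for c in (0.3, 0.6, 0.9):   # harder -> easier samples
    print(c, {k: round(gst(c, k), 2) for k in ("softmax", "cosface", "arcface")})
```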
This scalar term, g, affects the magnitude of the gradient during backpropagation from the margin-based softmax loss. The form of g depends on the form of the margin function f(\cos\theta_j). In Tab. 1 of the AdaFace supplementary material, we summarize the margin function f(\cos\theta_j) and the corresponding GST when j = y_i, the ground truth index. Note that P_{y_i} is also affected by the choice of the margin function f(\cos\theta_{y_i}), as in Eqn. 2.21. So g is a function of m: except for Softmax, g is affected by m through f(\cos\theta_{y_i}) in P_{y_i}. For the Angular Margin, m appears in the equation for g directly. We derive g for the Angular Margin below. The term g for the Adaptive Angular Margin and CurricularFace [102] can be obtained using the g from the Angular Margin. The GST term for AdaFace can be obtained by using g for the Angular Margin and the Additive Margin, and replacing m with the adaptive terms g_{angle} and g_{add}. This is possible because \|z_i\| is treated as a constant.
2.5.1 Derivation of Angular Margin
We can rewrite f(\cos\theta_{y_i}) as
f(\cos\theta_{y_i}) = s \cdot \cos(\theta_{y_i} + m) = s \cdot \left( \cos\theta_{y_i}\cos m - \sin\theta_{y_i}\sin m \right) = s \cdot \left( \cos\theta_{y_i}\cos m - \sqrt{1 - \cos^2\theta_{y_i}}\,\sin m \right), \qquad (2.22)
by the laws of trigonometry. Therefore,
\frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} = s\left( \cos m + \frac{\cos\theta_{y_i}\sin m}{\sqrt{1 - \cos^2\theta_{y_i}}} \right). \qquad (2.23)
2.5.2 Interpretation of g
For Softmax and the Additive Margin, we see that g = (P^{(i)}_{y_i} - 1)s. Since the softmax operation in P^{(i)}_{y_i} has a tendency to scale the result to be close to either 0 or 1, the first term in g, (P^{(i)}_j - 1), tends to be close to 1 or 0 far away from the decision boundary. In the equation for P_{y_i}, there is also s, a scaling hyper-parameter which is often set to s = 64 [55, 102, 154, 240]. This high s makes the softmax operation even steeper near the decision boundary. This results in almost equal GST for samples away from the decision boundary, regardless of how far they are from it. This is evident in Fig. 2.7, where the blue curve is flat except near the decision boundary when s is high.
Figure 2.7 Plot of P_{y_i} for different values of s. In this figure, P_{y_i} is calculated with f(\cos\theta_j) from Softmax (i.e. m = 0).
For Softmax and the Additive Margin, \frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} = s. This term is different for the Angular Margin because \frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} is a function of \cos\theta_{y_i}. The exact form of this derivative for the Angular Margin is found in Eqn. 2.23. As shown in Fig. 2.8, Eqn. 2.23 is monotonically increasing with respect to \cos\theta_{y_i} when m > 0, and vice versa. Note that \cos\theta_{y_i} measures how close the sample is to the ground truth weight vector, and it is closely related to the difficulty of the sample during training. Therefore, this partial derivative term from the angular margin can be viewed as scaling the importance of a sample based on its difficulty.
Figure 2.8 Plot of \frac{\partial f(\cos\theta_{y_i})}{\partial \cos\theta_{y_i}} for different values of m when the margin function is the Angular Margin.
2.6 Feature Norm Analysis
2.6.1 Correlation between Norm and BRISQUE during Training
In Sec. 2.3, we introduce the idea of using the feature norm as a proxy of the image quality. We observe that in models trained with a margin-based softmax loss, the feature norm exhibits a trend that is correlated with the image quality.
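As a hypothetical illustration of how such a correlation can be measured, the sketch below computes the Pearson correlation between the feature norms produced by an embedding network and an externally supplied image-quality score (e.g., a negated BRISQUE value); the stand-in `backbone` and `quality_scores` are assumptions, not part of the released code.

```python
import torch

@torch.no_grad()
def norm_quality_correlation(backbone, images, quality_scores):
    """Pearson correlation between ||z_i|| and an image-quality score.

    backbone:       any embedding network mapping (B, 3, 112, 112) -> (B, 512)
    images:         float tensor of aligned face crops
    quality_scores: (B,) tensor, higher = better quality (e.g. -BRISQUE)"""
    feats = backbone(images)                    # (B, 512) embeddings z_i
    norms = feats.norm(dim=1)                   # feature norms ||z_i||
    x = norms - norms.mean()
    y = quality_scores - quality_scores.mean()
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)

# toy usage with a stand-in backbone and random scores
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 112 * 112, 512))
imgs = torch.randn(16, 3, 112, 112)
qual = torch.randn(16)
print(float(norm_quality_correlation(backbone, imgs, qual)))
```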
Here, we show that for both ArcFace and AdaFace, the loss functions exhibit this trend, in Fig. 2.9. Regardless of the form of the margin function, the correlation between the feature norm and the image quality is quite similar (green plot in the 1st and 2nd columns). We leverage this behavior to design the proxy for the image quality.
Figure 2.9 Comparison between ArcFace and AdaFace on the correlation between the feature norm and the image quality (panels: a) correlation for all epochs, b) feature norm vs. image quality, c) probability output vs. image quality). We randomly sampled 1,534 images from the training dataset (MS1MV2 [55]) to show this plot.
We use three concepts (image quality, feature norm and sample difficulty) to describe a sample, as illustrated in Fig. 2.10. We leverage the correlation between the feature norm and the image quality to apply different emphasis to different difficulties of samples. In contrast, MagFace learns a representation that aligns the feature norm with recognizability. The term image quality in the MagFace paper [172] refers to the face recognizability, which is closer in meaning to the sample difficulty than to the term image quality as we use it in our paper. Please refer to Fig. 1 (a) and the first contribution claim of the MagFace paper [172]. Also note the difference in gradient flow through the feature norm, \|z_i\|. MagFace relies on learning a feature that has \|z_i\| aligned with the recognizability of the sample, requiring the gradient to flow through \|z_i\| during backpropagation. The loss function has the incentive to reduce the margin by reducing \|z_i\|. However, our objective is to adaptively change the loss function itself, so we treat \|z_i\| as a constant. Finally, from Tab. 2.3, AdaFace substantially outperforms MagFace, e.g., reducing the errors of MagFace on IJB-B and IJB-C relatively by 21% and 23%, respectively.
Figure 2.10 An illustration of the different components used to describe a sample (1. feature norm \|z_i\|, 2. image quality, e.g. BRISQUE, 3. sample difficulty, 1 - P_{y_i}) and their usage in previous works: MagFace [172] relates sample difficulty to \|z_i\| and lets the gradient flow to \|z_i\|, whereas AdaFace relates image quality to \|z_i\| without gradient flow to \|z_i\|.
2.6.2 Training Sample Visualization
Figure 2.11 Actual training data examples corresponding to 6 zones. A pretrained AdaFace model is used as a feature extractor.
We show some visualizations of the actual training images. From the randomly sampled 1,534 images from the training dataset (MS1MV2 [55]), we divide the samples into 6 different zones. We plot the samples with \cos\theta_{y_i} (decreasing) as the x-axis and the feature norm \|z_i\| as the y-axis in Fig. 2.11. We divide the plot into 6 zones and sample a few images from each group. Clearly, there are not many samples in the zones highlighted by the gray area (top right and bottom left). This indicates that the sample difficulty distribution is different for each level of feature norm. Furthermore, the samples in the dark green area are mostly unrecognizable images. AdaFace de-emphasizes these samples. Also, the samples in the bright pink area are more difficult samples than those in the dark pink area. AdaFace puts more emphasis on the harder samples when the feature norm is high.
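To make this adaptive emphasis concrete, here is a minimal sketch, consistent with the quality-adaptive margin described in this chapter (the adaptive terms g_{angle} and g_{add} driven by the normalized feature norm); the batch statistics `mean_norm` and `std_norm` are placeholders rather than learned EMA values.

```python
import torch

def adaface_margin_logits(cos_theta, norms, labels, m=0.4, h=0.33, s=64.0,
                          mean_norm=20.0, std_norm=5.0):
    """Sketch of a quality-adaptive margin: the normalized feature norm shifts
    the margin between angular (hard-sample emphasis) and additive
    (de-emphasis) behavior. mean_norm/std_norm stand in for EMA batch stats."""
    norm_hat = ((norms - mean_norm) / (std_norm / h)).clamp(-1.0, 1.0)
    g_angle = -m * norm_hat              # adaptive angular margin
    g_add = m * norm_hat + m             # adaptive additive margin

    B = cos_theta.size(0)
    idx = torch.arange(B)
    cos_yi = cos_theta[idx, labels].clamp(-1 + 1e-7, 1 - 1e-7)
    theta_yi = torch.acos(cos_yi)

    logits = cos_theta.clone()
    logits[idx, labels] = torch.cos(theta_yi + g_angle) - g_add
    return s * logits

# toy usage: emphasis shifts as the feature norm goes from low to high
cos = torch.rand(4, 10) * 2 - 1
norms = torch.tensor([5.0, 15.0, 25.0, 35.0])
y = torch.tensor([0, 1, 2, 3])
print(adaface_margin_logits(cos, norms, y))
```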
We would like to remind the readers that this figure may serve as an empirical validation of the two-dimensional face image categorization we made in Fig. 1 of the main paper.
2.6.3 Training Samples' Gradient Scaling Term for AdaFace
Figure 2.12 (a) Scatter plot of samples from Fig. 2.11 with the color as the GST term. (b) Scatter plot of the same 1,534 points in angular space. For each feature, the angle from W_{y_i} is calculated from \cos\theta_{y_i} and the distance from the origin is calculated from \|z_i\|. Both terms are normalized for visualization. (c) Sample image visualization from the low norm and high norm regions of similar \cos\theta_{y_i}.
In Fig. 2.12 (a), we plot the actual GST term for AdaFace. We use the same 1,534 images from the training dataset (MS1MV2 [55]) as in Fig. 2.11. The color of the points indicates the magnitude of the GST term. The purple points on the left side of the scatter plot are samples past the decision boundary; therefore the magnitude of the GST term is low. The effective difference in the GST term for samples outside the decision boundary can be seen in the color change from green to yellow. Note that AdaFace de-emphasizes samples of low feature norm and high difficulty. This is shown in the lower right region of the plot. In Fig. 2.12 (b), we warp the plot into the angular space to make a correspondence with Fig. 3 of the main paper, where we illustrate the GST term for AdaFace. We illustrate how actual training samples are distributed in this angular space. In Fig. 2.12 (b) and (c), we visualize two groups of images, where one is from the low feature norm area (triangle) and the other is from the high feature norm area (star). AdaFace exploits images that are hard yet recognizable, as indicated by the yellow star regions, and lowers the learning signal from the unrecognizable images, as indicated by the green triangle regions.
2.6.4 Train Samples' Gradient Scaling Term Comparison with ArcFace
In Fig. 2.13, we compare the GST term placed on training samples. We have two groups of images. One group is comprised of unrecognizable images, shown under the red bar. Another group is comprised of hard yet recognizable images, shown under the green bar. Each bar corresponds to one training sample, and the height of the bar indicates the magnitude of the gradient scaling term (GST). For ArcFace, shown on the left, the same level of GST is placed on all samples. However, in AdaFace, unrecognizable samples are less emphasized relative to the recognizable samples.
Figure 2.13 Comparison of the magnitude of the GST term between ArcFace and AdaFace.
Figure 2.14 Examples from the IJB-C [169] dataset, where ArcFace fails to identify the subject whereas AdaFace successfully finds the correct match between the probe and the gallery. On the left is the set of probe images and on the right is the set of gallery images.
2.7 Visualization of Success and Failed Test Images
We show samples from the IJB-C [169] dataset to show which samples are correctly classified by AdaFace, compared to ArcFace [55]. In each pair of probe and gallery images, we write the rank and the similarity score for both ArcFace and AdaFace. Rank = 1 is the correct match and a high similarity score is desired.
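For reference, the rank and similarity score reported in these visualizations can be computed as in the short sketch below (with made-up tensors rather than the IJB-C protocol files): gallery identities are ranked by cosine similarity to the probe feature.

```python
import torch
import torch.nn.functional as F

def rank_and_score(probe_feat, gallery_feats, true_idx):
    """Rank of the true gallery identity and its cosine similarity score.
    probe_feat: (D,) probe embedding; gallery_feats: (G, D) gallery embeddings."""
    sims = F.cosine_similarity(probe_feat.unsqueeze(0), gallery_feats)  # (G,)
    order = sims.argsort(descending=True)
    rank = int((order == true_idx).nonzero(as_tuple=True)[0]) + 1
    return rank, float(sims[true_idx])

# toy usage: 5 gallery identities, the probe belongs to identity 2
gallery = F.normalize(torch.randn(5, 512), dim=1)
probe = F.normalize(gallery[2] + 0.3 * torch.randn(512), dim=0)
print(rank_and_score(probe, gallery, true_idx=2))
```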
Note that the majority of the cases where AdaFace successfully matches samples that are hard for ArcFace are comprised of low quality samples. This shows that AdaFace indeed works well on low quality images.
2.8 Comparison with General Image-Quality Aware Learning Method
We compare our method with QualNet [120] (CVPR21), a general image-quality aware learning method. The scope of general image-quality aware learning methods is not limited to face recognition, but the idea is applicable. In Tab. 2.4, we show the comparison with QualNet with models trained on CASIA-WebFace. AdaFace outperforms QualNet on the TinyFace test set. QualNet aligns the low quality (LQ) image feature distribution to the high quality (HQ) features' distribution via a fixed pretrained decoder. In contrast, AdaFace prevents LQ images from degrading the overall recognition performance by de-emphasizing heavily degraded LQ images. Since LQ facial images can often be devoid of identity, this helps to avoid overfitting on unidentifiable LQ images and to learn to exploit the identifiable LQ images. This improves generalization across HQ and LQ.
Method          Training Set     TinyFace Rank-1   TinyFace Rank-5
QualNet [120]   CASIA-WebFace    35.54             44.39
AdaFace         CASIA-WebFace    44.45             47.23
Table 2.4 Closed set identification performance (ranked match rate) on TinyFace. For a fair comparison, we adopt the train/test setting of QualNet. QualNet results are directly taken from the CVPR21 paper.
2.9 Effect of Batch Size
Our image quality proxy \widehat{\|z_i\|} does not depend on the batch size due to the exponential moving average in its definition (rewritten below):
\widehat{\|z_i\|} = \left\lfloor \frac{\|z_i\| - \mu_z}{\sigma_z / h} \right\rceil^{1}_{-1}, \qquad (2.24)
\mu_z = \alpha \mu^{(k)}_z + (1 - \alpha)\mu^{(k-1)}_z. \qquad (2.25)
To empirically show this, we train an R50 model on MS1MV2 with batch sizes of 128, 256 and 512 and report their performance on IJB-B TAR@FAR=0.01%. As shown in Tab. 2.5, the difference due to the batch size is minimal.
Method    Batch size 128   Batch size 256   Batch size 512
AdaFace   94.35            94.32            94.42
Table 2.5 Performance comparison by varying the batch size. This shows that the AdaFace performance is not sensitive to the batch size.
2.10 Implementation Details and Code
The code is released at https://github.com/mk-minchul/AdaFace. For preprocessing the training data MS1MV2 [55], we reference InsightFace [1] and InsightFacePytorch [2]; for the backbone model definition, TFace [3]; and for the evaluation of LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174], CALFW [297], IJB-B [253], and IJB-C [169], we use InsightFace [1]. For preprocessing IJB-S [112] and TinyFace [46], we use MTCNN [285] to align faces.
2.11 Conclusion
In this work, we address the problem arising from unidentifiable face images in the training dataset. Data collection processes or data augmentations introduce these images into the training data. Motivated by the difference in recognizability based on image quality, we tackle the problem by 1) using the feature norm as a proxy for the image quality and 2) changing the margin function adaptively based on the feature norm to control the gradient scale assigned to different qualities of images. We evaluate the efficacy of the proposed adaptive loss on datasets of various qualities and achieve SoTA performance on mixed and low quality face datasets.
Limitations This work addresses the existence of unidentifiable images in the training data. However, noisy labels are also one of the prominent characteristics of large-scale facial training datasets.
Our loss function does not give special treatment to mislabeled samples. Since our adaptive loss assigns large importance to difficult samples of high quality, high quality mislabeled images can be wrongly emphasized. We believe future works may adaptively handle both unidentifiability and label noise at the same time. Potential Societal Impacts We believe that the Computer Vision community as a whole should strive to minimize the negative societal impact. Our experiments use the training dataset MS1MV*, which is a by-product of MS-Celeb [161], a dataset withdrawn by its creator. Our usage of MS1MV* is necessary to compare our result with SoTA methods on a fair basis. However, we believe the community should move to new datasets, so we include results on newly released WebFace4M [300], to facilitate future research. In the scientific community, collecting human data requires IRB approval to ensure informed consent. While IRB status is typically not provided by dataset creators, we assume that most FR datasets (with the exceptions of IJB-S) do not have IRB, due to the nature of collection procedures. One direction of the FR community is to collect large datasets with informed consent, fostering R&D without societal concerns. 32 CHAPTER 3 CLUSTER AND AGGREGATE: FACE RECOGNITION WITH LARGE PROBE SET Feature fusion plays a crucial role in unconstrained face recognition where inputs (probes or galleries) comprise of a set of N low quality images whose individual qualities vary. Advances in attention and recurrent modules have led to feature fusion that can model the relationship among the images in the input set. However, attention mechanisms cannot scale to large N due to their quadratic complexity and recurrent modules suffer from input order sensitivity. We propose a two-stage feature fusion paradigm, Cluster and Aggregate, that can both scale to large N and maintain the ability to perform sequential inference with order invariance. Specifically, Cluster stage is a linear assignment of N inputs to M global cluster centers, and Aggregation stage is a fusion over M clustered features. The clustered features play an integral role when the inputs are sequential as they can serve as a summarization of past features. By leveraging the order-invariance of incremental averaging operation, we design an update rule that achieves batch-order invariance, which guarantees that the contributions of early image in the sequence do not diminish as time steps increase. Experiments on IJB-B and IJB-S benchmark datasets show the superiority of the proposed two-stage paradigm in unconstrained face recognition. Code and pretrained models are available in Link. 3.1 Introduction Face Recognition (FR) matches a set of input query imagery, known as probe, to enrolled identity database, known as gallery. Verification is to confirm the claimed probe’s identity and identification is to identify the unknown probe’s identity by searching a known database [195]. In either case, a probe can go beyond an image and include a set of images, videos, or their combinations [21]. Thus FR involves fusing features of multiple images or videos to create a discriminative feature for a probe. Due to the interest in unconstrained surveillance scenarios, e.g. IJB-S [112], the role of fusion is becoming more important. Unconstrained FR is often based on probes from low 33 Figure 3.1 a) An illustration of the importance of the intra-set relationship in feature fusion. 
Without the intra-set relationship, a large weight on a good quality image can still be outweighed by many bad quality images in a probe set. b) We need a framework that can both account for the intra-set relationship of large N probes and handle sequential inputs with order invariance. c) The role of the fusion model increases with a larger probe size. For our proposed method, CAFace, the relative performance gain over the Naïve (simple averaging) method, i.e., (CAFace − Naïve)/Naïve × 100%, increases with the probe size across four datasets. PFE [209] and CFAN [78] are single-image based and lack the intra-set relationship. RSA [156] computes the intra-set relationship but is unusable for large N.
quality images and videos. It is challenging due to two issues: 1) individual video frames can be of poor quality, causing erroneous FR model predictions and 2) the number of images in a probe can be very large, e.g., a probe video in IJB-S may have 500,000 frames. Feature fusion across all frames in the probe is especially crucial if frame-based predictions are unreliable. While prior works [122, 172] address the first issue of prediction in low quality images, the large size of the probe set was not addressed. Fig. 3.1 a) illustrates the problem caused by the absence of proper feature fusion. The contribution of a good quality image can be made insignificant in the presence of many other poor quality images in the set.
This paper aims to learn a fusion function that maps an unordered set of N probe features {f_i}^N of the same person to a single fused output f. Note that f_i = E(x_i) is the feature extracted from the i-th sample in the set, using a fixed feature extractor E. The task of fusing multiple features involves 1) estimating the quality of individual features and 2) modeling the intra-set relationship of the features. Prior feature fusion works utilize either simple average pooling [42, 184], reinforcement learning [157], recurrent models [77] or self-attention [78, 156, 159, 247, 266]. Typically, to compute the intra-set relationship among inputs of an arbitrary size N, one would adopt set-to-set functions such as Multihead Self Attention (MSA) [156, 236, 247], enabling inputs to propagate information among themselves. The downside of this approach is its computational cost of O(N^2), which becomes infeasible when N exceeds a few thousand. Also, when the inputs are sequential, as in a live video feed, it is nontrivial to model the intra-set relationship except to compute attention over all past frames at each time step. Recurrent methods [77, 93] are useful in sequential inference but their drawback is set order inconsistency, i.e., as the number of sequential steps T increases, the contribution of early frames in a set decreases. Fig. 3.1 b) contrasts various fusion methods. A feature fusion framework that can consider both 1) the intra-set relationship for a large N and 2) efficient sequential inference is necessary in real-world unconstrained FR scenarios. Fig. 3.1 c) shows the average probe sizes of four datasets.
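As a point of reference for the fusion function introduced above, the following is a minimal sketch of the simple averaging baseline (not the proposed method), in which features with larger norms naturally dominate the fused template.

```python
import torch
import torch.nn.functional as F

def naive_fuse(feats):
    """Fuse a set of N embeddings into one by simple averaging; features
    with larger norms dominate the unnormalized mean.
    feats: (N, D) per-image embeddings f_i = E(x_i)."""
    fused = feats.mean(dim=0)          # unnormalized average over the set
    return F.normalize(fused, dim=0)   # unit-length template for matching

# toy usage: a probe set of 100 frames with 512-dim features
probe = torch.randn(100, 512)
print(naive_fuse(probe).shape)         # torch.Size([512])
```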
IJB-S [112]'s probe size is too large for intra-set attention such as RSA [156] to perform inference with all frames concurrently. We present a feature fusion framework, Cluster and Aggregate (CAFace), that achieves the two abovementioned criteria. It consists of two modules: the Cluster Network (CN) and the Aggregation Network (AGN). CN makes soft assignments of N features into a fixed number of M clusters, i.e., {f_i}^N → {f'_j}^M where M ≪ N. While N varies from one set to another, M is fixed. AGN combines the M clustered features into a single feature f, i.e., {f'_j}^M → f. Conceptually, the M intermediate cluster features serve as a summarization of the N inputs, and AGN models the intra-set relationship among {f'_j}^M.
The proposed framework depends on learning global cluster assignments {f_i}^N → {f'_j}^M that are consistent across different probes. Thus, we propose learning shared cluster centers that are input independent. These centers govern the clustering assignments. But it is not obvious which clustering criterion is the best for feature fusion. Thus, we design CN to discover learned clusters with an end-to-end differentiable framework that allows AGN to back-propagate the gradients to CN. The cluster assignments are learned to maximize the FR performance. We also design an input pipeline, the Style Input Maker (SIM), that helps CN perform class (identity) agnostic clustering efficiently.
The purpose of introducing an intermediate stage {f'_j}^M is to facilitate sequential inference. The key design of CN is to formulate {f'_j}^M as a linear combination of {f_i}^N. This guarantees that even when the input sequence of set length N is divided into T smaller batches of set length N', {f'_j}^M can be sequentially updated with batch-order invariance. This is due to our update rule, inspired by the order invariance property of the averaging operation, as in Eq. 3.8. When the inputs are sequential, we feed only the new features to CN and update the cached {f'_j}^M. It achieves a similar effect as having used all previous features simultaneously. Fig. 3.2 shows the contrast with previous approaches. For readability, we will interchange the set notation {f_i}^N with the matrix notation F ∈ R^{N×C}.
Figure 3.2 Comparison of feature fusion paradigms. a) In the individual paradigm, each probe sample's weight is determined independently. b) In the intra-set paradigm, the sample weight is determined based on all inputs. However, when N is large or sequential, intra-set calculations become infeasible. c) In the Cluster and Aggregate paradigm, the intermediate representation F' (green) can be updated across batches, allowing for large N intra-set modeling and sequential inference. Sharing universal cluster centers C ensures consistency of F' across batches. Unlike an RNN, the update rule is batch-order invariant.
In summary, the contributions of this paper include:
• A novel feature fusion framework for both large N feature fusion and efficient sequential inference. To our knowledge, this is the first approach to utilize linearly combined intermediate clusters to achieve batch-order invariance with intra-set relationship modeling.
• An task-driven clustering mechanism that can discover latent clustering centers that maximize the task performance. In our case, the task is FR. We achieve the task-driven clustering with an assignment algorithm using the global query and decoupled key and value structure. • We show the superiority of CAFace in unconstrained face recognition on multiple datasets. 3.2 Related Work Feature Fusion (Unordered Set) The simplest way of feature fusion is to average over a set of features {fi}N [42, 184]. In this case, the features with larger norms play a bigger role, and it generally works since easy samples tend to show larger norms [172, 191]. To learn the weights, CFAN and QAN utilize the self-attention mechanism, a learned weighted averaging mechanism [78, 159]. The drawback of these approaches is the lack of an intra-set relationship during the weight calculation process. Previous works that adopt the intra-set attention mechanism are Non-local Neural network and RSA [156, 247]. These works use intermediate feature maps Ui of size RCM ×H×W during aggregation because feature maps provide rich and complementary information that can be refined by taking the spatial relationship into account. However, the drawback is in the heavy computation in the attention calculation. For a set of N features maps, an attention module involves making (N × H × W )2 sized affinity map. Our Cluster Network utilizes a compact style vector from SIM and makes N 2 sized affinity map which greatly increases the computation efficiency in attention computation. DAC [157] and MARN [77] propose RL-based and RNN-based quality estimators, respec- tively. Yet, they fail to be agnostic to input order, thus unsuitable for modeling long-range dependencies. Our method can split the N inputs into T smaller batches and still achieve batch-order invariance. Lastly, modeling the intra-set relationship with auxiliary context 37 (such as a body) is shown to be helpful [294]. Video Recognition (Ordered Set) The feature fusion for recognition has a resemblance to video-based recognition [155, 298], but set inputs cannot always expect the temporal dependencies to be available. Therefore, most video-based approaches for tasks such as action recognition or quality enhancement [13, 20, 127, 162, 176, 224, 288] focus on exploiting the relationship between nearby frames, whereas feature fusion approaches do not define them. In video-based FR, the general trend is to focus more on assessing the quality of individual frames as opposed to exploring the relationship among nearby frames. Some examples of video-based FR utilize n-order statistics [166], affine hulls [37, 97, 267], SPD matrices [106] and manifolds [84, 244]. Recently, probabilistic representation such as PFE [209] gained popularity [12, 39, 204, 209] since the variance in distribution serves as a quality estimation for individual frames. QSub-PM [293] bypassed the need for a single feature by representing a video with a subspace (matrix) and proposing a novel subspace comparison. Attention Mechanism Multihead Self Attention (MSA) [236] is a widely adopted set-to-set function that models intra-set relationships via an affinity map. It is also a key component in transformer architectures which outperform CNNs in various vision tasks [36, 48, 61, 141, 160, 230, 282]. The underlying mechanism of MSA which uses the affinity of query and key to update the value is versatile in its application beyond recognition and has led to its usage in memory retrieval and grouping [32, 226, 263]. 
The unique property of the proposed Cluster Network is in the linear combination of the value assignment, which enables batch-order invariance using an incremental average update rule. Unlike MSA, which requires concurrent inputs during inference for the intra-set relationship, ours can split the inference and establish a connection across batches without decreasing the contribution of early inputs.
3.3 Proposed Approach
The Cluster and Aggregate paradigm seeks to divide the large N inference into partitioned inferences while still obtaining the same result as seeing all inputs at the same time. This can be achieved if 1) each partitioned inference can update the intermediate representation with the necessary information and 2) the order of inference does not affect the final outcome, so the information in early batches is not forgotten. In essence, the intermediate representation serves as a communication channel across batches. We achieve this by designing a Cluster Network (CN) and an Aggregation Network (AGN). Fig. 3.2 c) shows the proposed paradigm. In this section, we will elaborate on how we obtain the global assignment that is consistent across batches and how the update rule can be batch-order invariant.
We formally lay out a few assumptions for the Cluster and Aggregate paradigm in the face recognition (FR) task, as shown in Fig. 3.3. Let {x_i}^N be a set of N facial images from the same person. The task is to produce a single feature vector f from {x_i}^N that is discriminative for the recognition task. We assume that a single image based pretrained face recognition model E : x_i → f_i is available, following the settings of previous works [78, 156, 209]. For readability, we will interchange the set notation {f_i}^N with the matrix notation F ∈ R^{N×C}, where {f_i}^N is the input set of length N and F simplifies the equations. For clarity, we denote N to be the probe size (the number of images in a set) and N' to be the partitioned set size when N is large. During training, we fix the number of images for fusion as N'. Note that the shape of the inputs during training would have one more dimension, the training batch size B, i.e. F ∈ R^{B×N'×C}.
Figure 3.3 An overview of CAFace with the cluster and aggregate paradigm. The task is to fuse a sequence of images into a single feature vector f for face recognition. SIM is responsible for decoupling facial identity features F from the image style S that carries information for feature fusion (Sec. 3.3.1). The Cluster Network (CN) calculates the affinity of S to the global centers C and produces an assignment map A. It will be used to map F and S to create fixed size representations F' and S'. Note that F' and S' are linear combinations of the raw inputs F and S, respectively. This property ensures that the previous and current batch representations can be combined using a weighted average, which is order-invariant. Lastly, AGN computes the intra-set relationship of S' to estimate the importance of F' for fusion. For interpretability, AGN produces the weights for averaging F' to obtain f.
The training batch size refers to the number of persons sampled in a mini-batch, different from the number of images per person, N'. We drop the training batch size dimension in the equations for brevity.
3.3.1 Architecture
Cluster Network (CN) The Cluster Network is responsible for mapping inputs F ∈ R^{N'×C} of variable size N' to F' ∈ R^{M×C} of a fixed size M. A natural choice for the architecture would be the Transformer [61, 236], as it is a set-to-set function. However, there are two problems with it. 1) It cannot handle large inputs due to the quadratic complexity of MSA. 2) When the inputs are partitioned and inferred sequentially, the intra-set information across batches is lost, as MSA computes the affinity within the given inputs. CN solves this problem by modifying the Transformer with 1) shared queries and 2) a linear value mapping. These changes result in a clustering mechanism. We first consider the following generic attention equation [236] with query Q, key K and value V,
\text{Attn}(Q, K, V) = \text{SoftMax}_{\text{row}}\!\left( \frac{Q W_q (K W_k)^{\top}}{\sqrt{d}} \right) W_v V, \qquad (3.1)
where W_q, W_k, W_v are learnable weights and d is the channel dimension of K. The row-wise SoftMax ensures that the output is the weighted average of all projected values W_v V for each query index. We modify this to
\text{Assign}_C(K, V) = \text{SoftMax}_{\text{col}}\!\left( \frac{C W_q (K W_k)^{\top}}{\sqrt{d}} \right) V = A V. \qquad (3.2)
First, unlike K and V, which are inputs, the query is now a shared learnable parameter C initialized at the beginning of training. Secondly, removing W_v and the column-wise SoftMax ensure that for each query index of C, the output is a (soft) selection of the values V. These two modifications result in a learned soft assignment mechanism where C serves as the global shared center. We name the assignment map A. The difference between A from the row and column SoftMax is shown in Fig. 3.4 a) and b). We then divide A by the weight of samples assigned to each center (the row-sum of A), as in
\text{Cluster}_C(K, V) = \frac{A}{\sum_j A_{i,j}} V, \qquad \text{CN}(K, V) = \text{Cluster}_C(\text{Transformer}([K, C]), V). \qquad (3.3)
Figure 3.4 a) Row-SoftMax: the sum across each row is 1.0. b) Column-SoftMax: each column sums to 1.0. c) Column-SoftMax with row normalization (Eq. 3.3). d) Depiction of how values are assigned to centers when A is multiplied with V. The matrix is deliberately made sparse for visualization, but it can hold soft assignments.
Note that \text{Cluster}_C(K, V) is linear in V, while the prediction of A is nonlinear. To further add nonlinearity to A in the Cluster Network, we first embed the keys K with a shallow Transformer before clustering. The combined result CN(·) is the learned soft assignment of values according to the affinity between the keys and the global queries.
Style Input Maker (SIM) So far, we have discussed the generic Cluster Network algorithm. For face recognition, we still need to decide on the keys K and values V for feature fusion. It is clear that V should be F, the facial identity features, as it is what we are interested in merging. It is possible to use F for K as well, but K should ideally contain useful information for fusion and be compatible with the queries, which are the global centers C. However, f_i is optimized to be invariant to any characteristics other than the identity. Thus it lacks the input image style, which encompasses various image traits such as brightness, contrast, quality, pose, or domain differences from the training data.
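A minimal PyTorch sketch of the soft assignment in Eqs. (3.2)–(3.3) is given below; the shallow key Transformer is omitted, and the dimensions and initialization are illustrative assumptions rather than the released CAFace configuration.

```python
import torch
import torch.nn as nn

class SoftClusterAssign(nn.Module):
    """Soft assignment of N' values to M global centers, Eqs. (3.2)-(3.3).
    The column-wise softmax distributes each input over the M centers; the
    row normalization turns each center into a weighted average of its inputs."""
    def __init__(self, dim_key=128, dim_attn=128, num_centers=4):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_centers, dim_attn) * 0.02)  # C
        self.w_q = nn.Linear(dim_attn, dim_attn, bias=False)
        self.w_k = nn.Linear(dim_key, dim_attn, bias=False)

    def forward(self, keys, values):
        # keys: (N', dim_key)  values: (N', C_feat)
        logits = self.w_q(self.centers) @ self.w_k(keys).T / keys.size(1) ** 0.5
        A = torch.softmax(logits, dim=0)               # column softmax: (M, N')
        A_norm = A / (A.sum(dim=1, keepdim=True) + 1e-8)
        return A_norm @ values, A                      # F' = A_norm V, and A

# toy usage: 100 style keys (128-d) and identity features (512-d) -> 4 clusters
assign = SoftClusterAssign()
S, F = torch.randn(100, 128), torch.randn(100, 512)
F_prime, A = assign(S, F)
print(F_prime.shape, A.shape)   # torch.Size([4, 512]) torch.Size([4, 100])
```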
In light of the success of using first and second-order feature statistics as an image style [116, 136, 175], we propose SIM for extracting style information using the intermediate representations of the feature extractor. The benefit of modeling the keys K in clustering with style over using identity is shown in Sec. 3.4.2. We also ablate the benefit of further including the feature norm \|f_i\| in K, as it is sometimes used to approximate the confidence of the prediction [122, 172]. Let U_i ∈ R^{C_M×H×W} be the intermediate feature. We capture the image style with a style vector \gamma_i ∈ R^{64},
\gamma_i = \text{BatchNorm}(\text{FC}(\text{ReLU}(\text{AvgPool}(W_s \odot \Gamma)))), \qquad (3.4)
where \Gamma = [\mu_{sty}, \sigma_{sty}], \mu_{sty} = \text{SpatialMean}(U_i), \sigma_{sty} = \text{SpatialStd}(U_i). A learnable matrix W_s ∈ R^{C_M×2} controls the importance of \mu_{sty} and \sigma_{sty} via the element-wise multiplication \odot. Simply put, SIM is a shallow network on the spatial mean and standard deviation of U_i. One can take U_i from more than one intermediate location, and in such a case, we concatenate them. To verify whether the feature norm would further benefit the fusion process, we embed the feature norm \|f_i\|_2 into a 64-dim vector, following the convention of the Sinusoidal conversion [236], which is analogous to the position embeddings in ViT [61]. The norm embedding n_i is a 64-dim vector. Finally, the output of SIM is the concatenation s_i = [\gamma_i, n_i], where s_i ∈ R^d and d = 64 + 64 = 128. For readability, we denote the set {s_i}^{N'} ∈ R^{N'×128} as S. In summary, we decouple the style S and the identity F and use S as keys to map
F' = \text{CN}(\text{key} = S, \text{value} = F), \qquad S' = \text{CN}(\text{key} = S, \text{value} = S), \qquad (3.5)
which are the intermediates that will be used for subsequent fusion in AGN or stored for sequential inference. We also map S to S' using the same assignment. Fig. 3.3 shows the overall diagram.
Aggregation Network (AGN) The Aggregation Network is responsible for fusing a fixed number M of inputs, F' and S', into a single fused output f with the intra-set relationship. We adopt MLP-Mixer [229] as it can efficiently propagate information for the fixed-size input. For interpretability, we predict weights P ∈ R^{M×C} that combine F' into f ∈ R^C. Specifically, f = \text{AGN}(S', F') is
f = \sum^{M} P \odot F', \qquad P = \text{SoftMax}(\text{MLPMixer}([S', C])), \qquad (3.6)
where [S', C] denotes the concatenation along the channel dimension. The magnitude of P is an interpretable quantity showing the importance of each cluster during fusion. The final output f is a weighted average of F' whose weight is P.
Sequential Inference A key characteristic of CAFace is its ability to divide the inputs into a T-step sequential inference of smaller set length N' when N is large, and still achieve similar results as the concurrent inference. This is possible as the intermediates F' and S' are linear combinations of F and S respectively, although estimating the combination weights A is non-linear. This allows us to formulate the update rule as an incremental weighted average, whose innate property is order invariance. Consider partitioned inputs F_1, ..., F_T, with corresponding predicted weights A_1, ..., A_T. Since by definition (Eq. 3.5), F'_t = A_t F_t / \sum^{N'}_{j} A_{t,(i,j)}, we can write the cumulative intermediate \widehat{F}'_T as
\widehat{F}'_T = \frac{A_1 F_1 + \dots + A_T F_T}{\sum_{t=1}^{T} \sum_{j=1}^{N'} A_{t,(i,j)}}. \qquad (3.7)
This formulation requires storing all inputs of timesteps 1, ..., T. We can easily convert this to
\widehat{F}'_T = \frac{a_{T-1}\,\widehat{F}'_{T-1} + \sum_{j=1}^{N'} A_{T,(i,j)} F_T}{a_{T-1} + \sum_{j=1}^{N'} A_{T,(i,j)}}, \qquad \text{where } a_{T-1} = \sum_{t=1}^{T-1} \sum_{j=1}^{N'} A_{t,(i,j)}, \qquad (3.8)
which requires storing only the cumulative row-summed assignment map a_{T-1} and the cumulative intermediate \widehat{F}'_{T-1} of the previous time-step. The same logic applies to S' as well. Note that this operation, by design, is invariant to the inference order (batch-order), as the final result will always be the total weighted average. However, we do not obtain element-wise permutation invariance, as the prediction of A_t will change with different inputs. We test the susceptibility to element-wise permutation in Sec. 3.4.3 and it has minimal impact on the overall performance.
Figure 3.5 A plot of the assignment map A ∈ R^{4×23} (right) and the mean of the cluster weights P (left) for samples in IJB-B [253] (mean cluster weights in this example: P_1 = 0.653, P_2 = 0.258, P_3 = 0.089, P_4 = 0.000, each computed as the mean over the 512 dimensions of P ∈ R^{4×512}). For each column in A, the values sum up to 1.0. A shows that 1) high quality images are assigned to clusters 1, 2 and 3, with large mean cluster weights P; low quality images are assigned to cluster 4 with near 0.0 weight. 2) There are variations among clusters 1, 2 and 3 as to which images have more influence, e.g., cluster 3 focuses on relatively blurred or occluded images.
3.3.2 Loss Function
Template Loss The objective is to make the fused output f close to the ground truth class center f_{GT}. The task can be viewed as correctly inferring the true class center in the presence of low quality features {f_i}^N. Here f_{GT} is dependent on the pretrained feature extractor E and can be either taken from the last FC layer of E or computed using the per-subject average of the embeddings f_i in the training data. Our loss function can be viewed as the cosine distance version of the Center loss [251]. In training, we randomly sample B subjects and N' images per subject. Let the superscript in F^{(b)} denote the b-th subject. The loss to increase the cosine similarity is
L_t = \frac{1}{B} \sum_{b=1}^{B} \left( 1 - \text{CosSim}\!\left( \text{AGN}(S'^{(b)}, F'^{(b)}),\, f^{(b)}_{GT} \right) \right). \qquad (3.9)
Set Permutation Consistency Loss The Cluster and Aggregate paradigm achieves batch-wise order invariance by formulation, but it does not achieve element-wise permutation invariance, as noted in Sec. 3.3.1. Therefore, we explore the added benefit of element-wise permutation consistency using an additional loss function. The set permutation consistency loss L_p is
L_p = \frac{1}{B} \sum_{b=1}^{B} \left( 1 - \text{CosSim}\!\left( \text{AGN}(S'^{(b)}, F'^{(b)}),\, \text{AGN}(\widehat{S}'^{(b)}_T, \widehat{F}'^{(b)}_T) \right) \right). \qquad (3.10)
It encourages the split inference outcome to be similar to the concurrent inference. Sec. 3.4.2 shows the benefit of L_p, but it is small, meaning the batch-order invariance from the model design is already powerful. The final loss is
L = L_t + \lambda_p L_p, \qquad (3.11)
where \lambda_p is the scaling term for L_p.
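As a sanity check of the batch-order invariance provided by Eq. (3.8), the following small numerical sketch (with random stand-in assignment maps instead of CN outputs) accumulates per-batch contributions and verifies that reversing the batch order leaves the cumulative intermediate unchanged.

```python
import torch

def sequential_fuse(batches):
    """Incremental weighted average of Eq. (3.8) over (A_t, F_t) batches.
    A_t: (M, N') assignment map for batch t; F_t: (N', C) features."""
    num = None   # running sum of A_t @ F_t
    den = None   # running row-summed assignment a_t
    for A_t, F_t in batches:
        contrib = A_t @ F_t                       # (M, C)
        weight = A_t.sum(dim=1, keepdim=True)     # (M, 1)
        num = contrib if num is None else num + contrib
        den = weight if den is None else den + weight
    return num / den                              # cumulative F_hat'_T

# toy check of batch-order invariance: M = 4 clusters, 512-dim features
torch.manual_seed(0)
batches = [(torch.rand(4, 32), torch.randn(32, 512)) for _ in range(3)]
f1 = sequential_fuse(batches)
f2 = sequential_fuse(batches[::-1])               # reversed batch order
print(torch.allclose(f1, f2, atol=1e-5))          # True
```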
3.4 Experiments
3.4.1 Datasets and Implementation Details
We use WebFace4M [300] as our training dataset. It is a large-scale dataset with 4.2M facial images from 205,990 identities. The single image based pretrained face recognition model E has been trained with the whole training dataset. To train the aggregation module, we use its randomly sampled subset, consisting of 813,482 images from 10,000 identities. We do not use VGG-2 [33] or MS1MV2 [55, 82] as they were withheld by their distributors due to privacy and other issues. For the pretrained face recognition model E, we use the IResNet-101, trained with the ArcFace loss [55]. Since the performance of the aggregation depends on the quality of E, we set E to be the same for all experiments. E produces an embedding vector f_i ∈ R^{512} for each image. To offer variations in the training data features, we randomly augment the dataset with cropping, blurring, and photometric augmentations.
We test on the IJB-B [253], IJB-C [169] and IJB-S [112] datasets. IJB-B is a widely used FR test set containing both high-quality images and low-quality videos of celebrities (see Fig. 3.5 for examples). IJB-C is an updated version of IJB-B with more complex motions in the videos. IJB-S is a surveillance video dataset, benchmarking extremely low-quality image/video face recognition. The probe and gallery set size can exceed 500,000. Within the set, there are many low-quality images, making IJB-S very challenging and suitable for measuring the feature fusion framework (see Fig. 3.6 for examples). For IJB-S, we use the protocols Surv.-to-Single, Surv.-to-Booking and Surv.-to-Surv. The first/second word in the protocol refers to the probe/gallery image source. Surv. is the surveillance video, 'Single' is the frontal high-quality enrollment image and 'Booking' refers to the 7 high-quality enrollment images. For the ablation study, Sec. 3.4.2, we report the average of all 9 metrics listed in Tab. 3.4.
Figure 3.6 A plot of the similarity of the fused probe vs. the gallery. The circles represent individual probe images in IJB-S. The colors represent the contribution of each image during fusion (top: CAFace, bottom: averaging). The x-axis is the cosine similarity with the gallery feature (closer to the right: the better the match). The black star represents the fused feature. For CAFace, we also plot the 4 intermediate features F' that go into AGN. (1) Compared to the averaging scheme, in CAFace only a select few samples contribute to the fusion (few red). Since most samples are low quality, a sparse selection of samples leads to a better result. (2) Note that, out of the 4 intermediates, F'_3 falls behind the others. This is because CN tends to assign bad quality samples to one cluster, e.g., C_3.
3.4.2 Ablation and Analysis
Effect of Different Style Input In Tab. 3.1, we ablate the efficacy of various components in SIM, which prepares the input for CN. The table shows that using f_i for clustering is harmful to performance and increases the number of parameters of CN. It also shows that using the additional norm embedding n_i along with s_i produces the best results for the IJB-B and IJB-S datasets. However, the margin is small, so simply using s_i would suffice for a setting that requires faster computation.
Figure 3.7 A plot of IJB-B performance of CAFace with varied temporal batch size N' for models with different numbers of clusters M.
(Table 3.1 ablates the SIM inputs (f_i, s_i, n_i), the loss L_p, and the number of centers M, reporting the number of parameters, IJB-B TAR@FAR=1e-3 and 1e-4, and the IJB-S average, with the naive average fusion included as a baseline.)
Table 3.1 Ablation of varied inputs, loss functions and the number of centers.
Effect of L_p As noted in Sec. 3.3.2, we further propose to constrain the set permutation consistency with the additional loss L_p.
The ablation between λ_p = 0 and λ_p = 1 is shown in Tab. 3.1.
Effect of Number of Clusters In Tab. 3.1, the effect of the number of clusters M is shown. The IJB-S performance peaks when M = 4. When M = 2, the summary representations F' and S' have only two assignment options, where one cluster takes the poor quality images with low weights. The behavior could be interpreted as performing outlier detection, which is powerful enough to give high performance. When M > 2, F' and S' have the capacity to store a richer history of previous frames, which would be beneficial in sequential inference. IJB-S has large N probes, which require dividing the inference into batches. The higher IJB-S performance when M = 4 indicates that the freedom to assign samples to different clusters is important in the sequential setting. A similar phenomenon is observed in Fig. 3.7. The performance gap widens for M = 2 as we reduce the batch size to make more sequential steps in inference.
Weight Visualizations An example of clustering assignments when M = 4 can be viewed in Figs. 3.5 and 3.6. Fig. 3.5 shows how samples are soft-assigned to different clusters along with the weight estimation. The cluster weight is calculated by averaging along the 512 dimensions of P ∈ R^{M×512}. Note that each column sums to 1, but F' and S' are calculated by averaging each row. Thus, the relative contribution of samples in each row is important. Fig. 3.6 shows the actual contribution of each sample during fusion. The contribution can be calculated by multiplying the magnitudes of A and P. Note that in the presence of many poor quality images, selecting a few good ones is very important, and the sample weight of our method can effectively select a subset of samples during fusion.
3.4.3 Comparison with SoTA methods
To compare with prior feature aggregation methods, we use the same feature extractor E as in Sec. 3.4.1, for a fair comparison. Average is the conventional embedding f_i averaging scheme that is adopted in the absence of a learned aggregation model. It is equivalent to the stand-alone ArcFace model performance. The rest of the methods learn an additional network for fusing the set of features. In Tab. 3.2, we show the performance of various feature fusion methods on IJB-B. CAFace achieves a large performance gain in all TAR@FAR metrics. CFAN [78] and PFE [209] do not use any intra-set relationship, as they learn to predict the confidence of a single image. RSA [156] calculates intra-set attention over feature maps, which is computationally costly and incapable of sequential inference. CAFace obtains the best results with the least number of parameters. In Tab. 3.3, the performance on the IJB-C dataset is also shown, with similar observations as for IJB-B. We also include an additional backbone, AdaFace [122], to highlight how CAFace can work across different backbones.
In Tab. 3.4, we compare feature aggregation models on the IJB-S dataset that has large N low quality images/videos in the probes. RSA [156] cannot load all images in the probes concurrently for large N. As an alternative, we divide the probes into a manageable size of
N' = 256 and average the results. Since RSA does not have a sequential update mechanism, dividing large N probes reduces the performance, which shows why the sequential capacity is important. CAFace also divides the probe images into batches of N' = 256 images, yet achieves a large margin improvement on IJB-S. It shows that our two-stage mechanism is very effective in the large N setting. In particular, the performance gain in the hardest protocol, Surveillance-to-Surveillance, is the largest. We also randomly shuffle images within the probes 5 times and measure the mean and std. of the performance in the last row. The result shows that our model is robust to input ordering. We also include an experiment on a high quality image dataset, IJB-A [131], later, and note that the performance gain with feature fusion is negligible. As noted in Fig. 3.1 c), the improvement over the baseline (averaging) goes up with the increased number of images in the probe, which highlights the importance of large N scalability.
Method      # of Params  Intra-set Att  Seq. Inference  FPS     TAR@FAR=1e-3  TAR@FAR=1e-4  TAR@FAR=1e-5
Average     0            ✕              ✓               -       96.10         94.30         89.53
PFE [209]   13.37M       ✕              ✓               360.1×  96.37         94.82         91.02
CFAN [78]   12.85M       ✕              ✓               554.1×  96.43         94.83         91.10
RSA [156]   2.62M        ✓              ✕               3.1×    96.41         95.00         91.22
CAFace      0.79M        ✓              ✓               64.4×   96.91         95.53         92.29
Table 3.2 A performance comparison of recent methods on the IJB-B [253] dataset.
Method         Train Data       Backbone E                 TAR@FAR=1e-3  TAR@FAR=1e-4  TAR@FAR=1e-5
Naive Average  WebFace4M [300]  IResNet101+ArcFace [55]    97.30         95.78         92.60
PFE [209]      WebFace4M [300]  IResNet101+ArcFace [55]    97.53         96.33         94.16
CFAN [78]      WebFace4M [300]  IResNet101+ArcFace [55]    97.55         96.45         94.40
RSA [156]      WebFace4M [300]  IResNet101+ArcFace [55]    97.49         96.49         94.58
CAFace         WebFace4M [300]  IResNet101+ArcFace [55]    97.99         97.15         95.78
Naive Average  WebFace4M [300]  IResNet101+AdaFace [122]   97.63         96.42         94.47
CAFace         WebFace4M [300]  IResNet101+AdaFace [122]   98.08         97.30         95.96
Table 3.3 A performance comparison of recent methods on the IJB-C [169] dataset. CAFace achieves the best result on the IJB-C dataset. We also compare two different backbones, ArcFace [55] and AdaFace [122].
(Table 3.4 reports Rank-1, Rank-5 and TPIR@FPIR=1% on the IJB-S [112] Surveillance-to-Single, Surveillance-to-Booking and Surveillance-to-Surveillance protocols for Naive Average, PFE [209], CFAN [78], RSA [156], CAFace and CAFace with randomly shuffled probe order (mean ± std over 5 shuffles).)
Table 3.4 A performance comparison of recent methods on the IJB-S [112] dataset.
3.4.4 Resource and Computation Efficiency
Since CAFace is built on top of a single image feature extractor E, we show the relative FPS of CAFace with respect to the FPS of E in Tab. 3.2. The relative FPS reported in Tab. 3.2 is computed with the input sequence length N = 256.
(Table 3.5 lists, for PFE, CFAN, CAFace and RSA, the maximum set size that fits in memory — 115,200, 115,200, 12,000 and 384, respectively — and the relative FPS for set sizes N = 16, 32, 64, 256 and 512; at N = 256 the relative FPS is 360.1× for PFE, 544.1× for CFAN, 64.4× for CAFace and 3.1× for RSA.)
Table 3.5 A table of relative FPS of the fusion model with respect to the FPS of the backbone. We compare various fusion models with varied input size N. As N increases, it requires more GPU memory as well.
Max N in the second column refers to the maximum number of images that can be in a set without causing an out-of-memory error (OOM). The third to the seventh columns represent the relative FPS under different set lengths N. The higher the relative FPS, the faster the fusion method.
It shows that the single image based quality estimation methods, PFE and CFAN, are the fastest, and RSA with intra-set attention is the slowest. CAFace achieves a relatively good speed and obtains the best performance. Another aspect of the computation requirement is GPU memory usage. In the second column of Tab. 3.5, we show the maximum sequence length N that each method can take simultaneously to perform the feature fusion. It shows that RSA with the intra-set attention cannot handle a sequence length N larger than 384. This is a drawback that prevents the method from fusing large N features. On the other hand, CAFace can take a large N sequence of up to 12,000 simultaneously. Note that sequence lengths larger than this can still be handled because CAFace has a sequential inference scheme, as described in Sec. 3.3.1. In other words, we can divide the input into smaller set sizes N', and the intermediate representation is updated to account for all elements in the set. Tab. 3.5 also shows the relative FPS of the fusion model compared to the backbone FPS under different sequence lengths N.
3.5 Implementation Details
To train the fusion network F, which is comprised of SIM, CN and AGN, we set the batch size to 512. We take the pretrained model E, which is an IResNet-101 [55], trained on WebFace4M [300] with the ArcFace loss [55], and freeze it without further tuning. For training CAFace, the number of images per identity N is randomly chosen between 2 and 16 during each step of training, and we take two sets per identity. The intermediate feature for the Style Input Maker (SIM) is taken from blocks 3 and 4 of the IResNet-101. The number of clusters in CN is varied in the ablation studies and fixed to 4 for subsequent experiments. The number of layers L in CN is equal to 2. We train the whole network end-to-end for 10 epochs with an AdamW optimizer [164]. The learning rate is set to 1e−3 and decayed by 1/10 at epochs 6 and 9. The weight decay is set to 5e−4. For the loss terms, we use λ_t = 1.0 and λ_p = 1.0, while the efficacy of λ_p = 1.0 is ablated against λ_p = 0.0 in the ablation studies. For f^{(b)}_{GT}, we take the feature embeddings f_i extracted from E for each labeled image in the training data, and average them per identity, with a flip augmentation.
3.6 Norm Embedding
For an embedding vector f_i, the norm is a model-dependent quantity, so we normalize the feature norm using the batch statistics \mu_f and \sigma_f and convert it to a bounded integer in [-qk, qk):
\widehat{\|f_i\|} = \left\lfloor q \cdot \left( \left\lfloor \frac{\|f_i\| - \mu_f}{\sigma_f} \right\rceil^{k}_{-k} \right) \right\rfloor. \qquad (3.12)
Two hyper-parameters, q and k, control the concentration of the \widehat{\|f_i\|} distribution; \lfloor \cdot \rceil^{k}_{-k} refers to clipping the value between -k and k, and \lfloor \cdot \rfloor refers to the floor operation that converts the quantity to an integer. Following the convention of the Sinusoidal position embedding in [236], we let
n_i(2t) = \sin\!\left( \widehat{\|f_i\|} / 10000^{\frac{2t}{c}} \right), \qquad n_i(2t+1) = \cos\!\left( \widehat{\|f_i\|} / 10000^{\frac{2t}{c}} \right), \qquad (3.13)
where t is the channel index and c is the dimension of the norm embedding. The resulting n_i ∈ R^c is a 64-dim vector in our experiments.
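A small PyTorch sketch of the norm embedding in Eqs. (3.12)–(3.13) follows; the values of q, k and the batch statistics are illustrative assumptions.

```python
import torch

def norm_embedding(feat_norms, mu_f, sigma_f, q=4, k=8, c=64):
    """Sinusoidal embedding of the quantized feature norm, Eqs. (3.12)-(3.13).
    feat_norms: (B,) tensor of ||f_i||; mu_f, sigma_f: batch statistics."""
    z = ((feat_norms - mu_f) / sigma_f).clamp(-k, k)     # standardize and clip
    z = torch.floor(q * z)                               # bounded integer in [-qk, qk)
    t = torch.arange(c // 2, dtype=torch.float32)        # channel index
    freq = z.unsqueeze(1) / (10000.0 ** (2 * t / c))     # (B, c/2)
    emb = torch.zeros(feat_norms.size(0), c)
    emb[:, 0::2] = torch.sin(freq)                       # even channels
    emb[:, 1::2] = torch.cos(freq)                       # odd channels
    return emb                                           # (B, c) norm embedding n_i

# toy usage with placeholder batch statistics
norms = torch.tensor([18.0, 22.5, 30.1])
print(norm_embedding(norms, mu_f=24.0, sigma_f=4.0).shape)  # torch.Size([3, 64])
```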
3.7 Additional Performance Results
In this section, we provide additional performance results on the IJB-A [131], IJB-B [253], IJB-C [169] and IJB-S [112] datasets with additional backbones.
(Table 3.6 reports TAR@FAR=0.001 and TAR@FAR=0.01 on IJB-A [131] for QAN [159], NAN [266] and RSA [156] trained on VGGFace2 (3.3M) [33] or 3M web-crawled images [266] with ResNet50, CNN256 and GoogleNet backbones, and for Naive Average, PFE [209], CFAN [78], RSA [156] and CAFace trained on WebFace4M [300] with the IResNet101+ArcFace [55] backbone.)
Table 3.6 A performance comparison of recent methods on the IJB-A [131] dataset. The ± sign refers to the standard deviation calculated from the official 10-fold cross validation splits of the dataset. For recent SoTA backbone models, the performance is saturated above 98.5.
Method         Train Data       Backbone E                 TAR@FAR=1e-3  TAR@FAR=1e-4  TAR@FAR=1e-5
Naive Average  WebFace4M [300]  IResNet101+ArcFace [55]    97.30         95.78         92.60
PFE [209]      WebFace4M [300]  IResNet101+ArcFace [55]    97.53         96.33         94.16
CFAN [78]      WebFace4M [300]  IResNet101+ArcFace [55]    97.55         96.45         94.40
RSA [156]      WebFace4M [300]  IResNet101+ArcFace [55]    97.49         96.49         94.58
CAFace         WebFace4M [300]  IResNet101+ArcFace [55]    97.99         97.15         95.78
Naive Average  WebFace4M [300]  IResNet101+AdaFace [122]   97.63         96.42         94.47
CAFace         WebFace4M [300]  IResNet101+AdaFace [122]   98.08         97.30         95.96
Table 3.7 A performance comparison of recent methods on the IJB-C [169] dataset. CAFace achieves the best result on the IJB-C dataset. We also compare two different backbones, ArcFace [55] and AdaFace [122] (CVPR'22). The performance gain is observed with both backbones.
Method         Train Data       Backbone E                 TAR@FAR=1e-3  TAR@FAR=1e-4  TAR@FAR=1e-5
Naive Average  WebFace4M [300]  IResNet101+ArcFace [55]    96.10         94.30         89.53
CAFace         WebFace4M [300]  IResNet101+ArcFace [55]    96.91         95.53         92.29
Naive Average  WebFace4M [300]  IResNet101+AdaFace [122]   96.66         94.84         90.86
CAFace         WebFace4M [300]  IResNet101+AdaFace [122]   96.97         95.78         92.78
Table 3.8 Additional performance on the IJB-B [253] dataset. We compare two different backbones, ArcFace [55] and AdaFace [122] (CVPR'22).
(Table 3.9 reports Rank-1, Rank-5 and TPIR@FPIR=1% on the IJB-S [112] Surveillance-to-Single, Surveillance-to-Booking and Surveillance-to-Surveillance protocols for Naive Average and CAFace with the ArcFace [55] and AdaFace [122] (CVPR'22) backbones.)
Table 3.9 Additional performance results on the IJB-S [112] dataset with two different backbones, ArcFace [55] and AdaFace [122] (CVPR'22). AdaFace [122] combined with our proposed CAFace achieves a large margin improvement on IJB-S.
The size of the probes N in each dataset increases in the order of IJB-A [131], IJB-B [253], IJB-C [169] and IJB-S [112]. As the probe size increases, the role of a feature fusion model also increases. As noted in Fig. 3.1 c), previous methods either fail to model the intra-set relationship or to scale to large N, which results in a suboptimal performance with an increasing probe size.
The plot of the relative performance increase over the naive average baseline shows that for CAFace, as the set size increases, the performance gain also increases. The relative performance gain for Fig. 1 c) is calculated as (Method - Naive) / Naive × 100%, where the metrics for each dataset are TAR@FAR=0.001 for IJB-A, TAR@FAR=1e-4 for IJB-B and IJB-C, and the average of the 9 metrics across all 3 protocols for IJB-S.

3.8 Resource and Efficiency Comparison
We report the FPS (frames per second) to estimate how much resource the feature fusion framework takes with respect to the backbone E. For the tables below, we use the IResNet-101 [55] backbone. We measured the FPS on an Nvidia RTX 3090, which is equipped with 24 GB of GPU memory. For measuring the time, we feed a random array as input to the model and simulate the run 1,000 times. In Tab. 3.10, we first show the FPS of the backbone E. The FPS increases with batch size due to the efficiency of the GPU architecture. We take 1,288 FPS as the FPS of the backbone and measure the relative FPS of the fusion models F with respect to the backbone, i.e. FPS(F)/FPS(E).

In Tab. 3.11, we show FPS(F)/FPS(E) of various feature fusion models with varied set size N. First, note that the feature fusion model's inference speed is always faster than the backbone model, i.e. FPS(F)/FPS(E) > 1. In practice, we would like the fusion time to be a fraction of the backbone inference time. Secondly, we show the maximum set size N each method can take. Note that methods without intra-set relationships, PFE [209] and CFAN [78], are computationally very fast and require little memory. Therefore, they can take many samples together (large N) during inference. In contrast, the maximum set size N for RSA [156] is 384 because the intra-set attention with the feature map is a memory-intensive module. CAFace is fast and uses relatively little memory, allowing a maximum set size of N = 12,000. Note that the ability to perform sequential inference is different from handling a large N directly. For instance, with CAFace, we can split a set of size 64,000 into batches of 64 and run 1,000 sequential inferences without sacrificing performance. This is evident in the high performance on the IJB-S dataset, where we adopt a batch size of 256.

Batch Size | FPS
1          | 91
256        | 1,288

Table 3.10 FPS of the face recognition backbone model IResNet-101. The higher the FPS, the faster the inference speed per image.

Method | Max N   | N=16  | N=32   | N=64   | N=256  | N=512
PFE    | 115,200 | 21.8x | 44.1x  | 86.3x  | 360.1x | 2133.6x
CFAN   | 115,200 | 82.6x | 158.7x | 268.8x | 544.1x | 664.2x
CAFace | 12,000  | 4.2x  | 8.2x   | 16.4x  | 64.4x  | 129.3x
RSA    | 384     | 6.9x  | 13.1x  | 9.2x   | 3.1x   | OOM

Table 3.11 A table of the relative FPS of the fusion model w.r.t. the FPS of the backbone, i.e. FPS(F)/FPS(E). We compare various fusion models with varied input size N. As N increases, more GPU memory is required as well. Max N refers to the maximum number of images that can be in a set without causing an out-of-memory (OOM) error. The higher the FPS(F)/FPS(E), the faster the fusion method.

3.9 Training Progress and Learned Assignment
To see how the assignment behavior changes during training, we plot the entropy of the assignment map A ∈ R^{M×N} over the training epochs. We note that each j-th cluster is a weighted average of the individual N samples. Therefore, if all samples contribute equally to the j-th cluster, the entropy of each row of A would be high.
When a few samples' contributions are larger than the others (i.e., A is sparse), the entropy would be low. We use entropy as a proxy for how sparse the influence of the samples is for each cluster. The entropy is calculated as

\frac{1}{M}\sum_{j=1}^{M}\sum_{i=1}^{N} -p_{j,i}\,\log(p_{j,i}), \qquad p_{j,i} = A_{j,i} \Big/ \sum_{i'=1}^{N} A_{j,i'}.

In other words, it is the mean of the row-wise entropy of the normalized assignment map. A lower entropy value indicates that the cluster features are deviating from a simple average of all samples. In Fig. 3.8, we show the plot of the mean entropy over the training progression using the IJB-B dataset [253]. The value decreases steeply during the first few epochs, indicating that the clustering mechanism quickly deviates from a simple averaging of the given samples.

Figure 3.8 A plot of the mean entropy during training. The samples used are 200 random probes taken from the IJB-B [253] dataset.

3.10 Weight Visualization
We show a few examples of the weight visualizations of different methods. The weights for CAFace are calculated as

w_i = \frac{1}{z}\sum_{j=1}^{M} A_{j,i}\left(\frac{1}{C}\sum_{c=1}^{C} P_{j,c}\right),

the sum of the contributions each sample makes to each cluster, weighted by the importance of the cluster. C is the dimension of f, which is 512 in our backbone. M is the number of clusters. z is the normalization constant that makes \sum_{i=1}^{N} w_i = 1. For Averaging, the weights are the normalized feature norms. For PFE and CFAN, the weights are the outputs of the respective modules. Note that RSA does not have a weight estimation, as it directly estimates the fused output instead of estimating the weights. The circles in the plot represent individual probe images in IJB-S, and the color represents the magnitude of the weights. The horizontal axis represents the similarity of individual probe images to the gallery shown on the right. The vertical axis exists only to scatter the points. Note that for both PFE and CFAN, the weight estimation is based on a single image.

Figure 3.9 Visualization of importance weights.

3.11 Comparison of Assignment Maps in Various Scenarios
To analyze the behavior of the assignment map A ∈ R^{4×N} of CAFace in varied scenarios, we show in Fig. 3.10 IJB-S [112] probe examples that come from 3 typical settings: mixed-, poor- and good-quality image scenarios. The mixed-quality probe is comprised of both low- and high-quality images, as illustrated in scenario 1. On the other hand, probes could contain all poor- or all good-quality images, as illustrated by scenarios 2 and 3. Note that each column of A sums to 1, and each row of A contains the relative weights responsible for creating each clustered vector in F' ∈ R^{4×512}. Note that cluster 4 works as a place to which bad-quality images are strongly assigned. Since the mean of P_4 is close to zero, all images assigned to cluster 4 have very little contribution to the final fused output f. For scenario 2, where all of the images are of bad quality, a few relatively better images are still assigned to clusters 1, 2 and 3, making it possible to perform feature fusion with bad-quality images. This is possible because CAFace incorporates intra-set relationships that allow information to be communicated among the inputs to determine which features are more usable than the others. For scenario 3, we can observe that most of the images are quite similar to one another, providing duplicated information.
Therefore, the assignments are learned to discard many of the duplicated images, as shown by the high (red) values in the last row of scenario 3.

3.12 Effect of Sequence Length
In Fig. 3.11, to illustrate the importance of using all video sequences, we show how the IJB-S performance of CAFace changes as we divide the probe videos into 10 partitions and use the first 1:k partitions. The increasing trend reveals that longer video sequences provide more information for fusion.

Figure 3.10 The comparison of assignment maps depending on the probe image configurations.

Figure 3.11 The performance on IJB-S with increasing video sequence length. The metric on the y-axis is the average of all protocols in IJB-S, and 1:10 corresponds to using all videos in the probe.

3.13 Conclusions
We address the two problems arising from the feature fusion of large-N inputs, a common scenario in unconstrained FR. With large-N features, modeling intra-set relationships with attention mechanisms is prohibitive due to computational constraints, while sequential inference suffers from the reduced contribution of early frames. In this work, we explore the possibility of dividing N inputs into T smaller unordered batches whose inference result is the same as concurrent inference over all N inputs. To this end, we introduce a two-stage cluster-and-aggregate paradigm. The clustering stage, inspired by the order-invariance of the incremental mean operation, is designed to linearly combine N inputs into M global cluster centers whose assignment is invariant to the batch order. The aggregation stage efficiently produces a fused output from the M clustered features while utilizing the intra-set relationship. We show that our proposed CAFace outperforms baselines on unconstrained face datasets such as IJB-B and IJB-S.

Limitations Cluster and Aggregate is a feature fusion framework that learns the weights of individual inputs, given a fixed feature extractor E. Weight estimation, in other words, is an interpolation among the given set of features, which is a double-edged sword: it provides interpretability, but it is not capable of extrapolation. Therefore, when the given feature extractor E is sub-optimal, it could be favorable to relax the constraint and let the model extrapolate for better performance.

Potential Negative Societal Impacts We believe that the machine learning community as a whole should strive to minimize negative societal impacts. Large-scale face recognition training datasets inevitably comprise web-crawled images collected without formal consent or IRB review. We refrained from using any dataset withdrawn by its creators, such as VGG-2 [33] or MS1MV2 [82], to avoid any known copyright issues. We hope that the FR community can collectively move toward collecting datasets with informed consent, fostering R&D without societal concern.
59 CHAPTER 4 DCFACE: SYNTHETIC FACE GENERATION WITH DUAL CONDITION DIFFUSION MODEL Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of same subjects under different factors (e.g., variations in pose, illumina- tion, expression, aging and occlusion) which follows the real image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel Patch-wise style extractor and Time-step dependent ID loss enables DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Code Link 4.1 Introduction What does it take to create a good training dataset for visual recognition? An ideal training dataset for recognition tasks would have 1) large inter-class variation, 2) large intra-class variation and 3) small label noise. In the context of face recognition (FR), it means, the dataset has a large number of unique subjects, large intra-subject variations, and reliable subject labels. For instance, large-scale face datasets such as WebFace4M [300] contain over 1M subjects and large number of images/subject. Both the number of subjects and the number of images per subject are important for training FR models [55, 122]. Also, datasets amassed by crawling the web are not free from label noise [33, 300]. In various domains, synthetic datasets are traditionally used to help generalize deep 60 Figure 4.1 Illustration of three factors that characterize a labeled face dataset. It contains large subject variation, style variation and label consistency. Synthetic face datasets should be created with all three factors in mind. Face images in this figure are samples generated by our proposed method which combines arbitrary ID condition with style condition while preserving subject identity. models when only limited real datasets could be collected [64, 98, 232, 302] or when bias exists in the real dataset [133, 234]. Lately, more attention has been drawn to training with only synthetic datasets in the face domain, as synthetic data can avoid leaking the privacy of real individuals. This is important as real face datasets have been under scrutiny for their lack of informed consent, as web-crawling is the primary means of large-scale data collection [82,99,300]. Also, synthetic training datasets can remedy some long-standing issues in real datasets, e.g. the long tail distribution, demographic bias, etc. When it comes to generating synthetic training datasets, the following questions should be raised. (i) How many novel subjects can be synthesized (ii) How well can we mimic the distribution of real images in the target domain and (iii) How well can we consistently generate multiple images of the same subjects? 
We start with the hypothesis that face dataset generation can be formulated as a problem that maximizes these criteria together. Previous efforts in generating synthetic face datasets touch on one of the three aspects but do not consider all of them together [17, 188]. SynFace [188] generates high-fidelity face images based on DiscoFaceGAN [59], coming close to real images in terms of the FID metric [90]. However, we were surprised to find that the actual number of unique subjects that can be generated by DiscoFaceGAN is less than 500, a finding that will be discussed in Sec. 4.3.1. The recent state of the art (SoTA), DigiFace [17], can generate 1M large-scale synthetic face images with many unique subjects based on 3D parametric model rendering. However, it falls short in matching the quality and style of real face images.

We propose a new data generation scheme that addresses all three criteria, i.e., a large number of novel subjects (uniqueness), real-dataset style matching (diversity) and label consistency (consistency). In Fig. 4.1, we illustrate the high-level idea by showcasing some of our generated face samples. The key motivation of our paper is that the synthetic dataset generator needs to control the number of unique subjects, match the training dataset's style distribution and be consistent in the subject label. In light of this, we formulate face image generation as a dual-condition inverse problem: retrieving the unknown image Y from the observable identity condition Xid and style condition Xsty. Specifically, Xid specifies how a person looks and Xsty specifies how Xid should be portrayed in an image. Xsty contains identity-independent information such as pose, expression, and image quality. Our choice of dual conditions (identity and style) is important in how we generate a synthetic dataset, as the ID and style conditions are controllable factors that govern the dataset's characteristics.

To achieve this, we propose a two-stage generation paradigm. First, we generate a high-quality face image Xid using a face image generator and sample a style image Xsty from a style bank. Secondly, we mix these two conditions using a dual condition generator which predicts an image that has the ID of Xid and the style of Xsty. An illustration is given in Fig. 4.2.

Figure 4.2 Two-stage dataset generation paradigm. In the sampling stage, 1) Gid generates a high-quality face image Xid that defines how a person looks and 2) the style bank selects a style image Xsty that defines the overall style of the final image. The mixing stage generates an image with the identity from Xid and the style from Xsty. Repeating this process multiple times, one can generate a labeled synthetic face dataset.

Training the mixing generator in stage 2 is not trivial, as it would require a triplet (X^A_id, X^B_sty, X^A_sty) where X^A_sty is a hypothetical combination of the ID of subject A and the style of subject B. To solve this problem, we propose a new dual condition generator that can learn from (X^A_id, X^A_sty), a tuple of same-subject images that can always be obtained in a labeled dataset.
The novelty lies in our style condition extractor and ID loss which prevents the training from falling into a degenerate solution. We modify the diffusion model [91, 213] to take in dual conditions and apply an auxiliary time-dependent ID loss that can control the balance between sample diversity and label consistency. We show that our Dual Condition Face Dataset Generator (DCFace) is capable of surpassing the previous methods in terms of FR performance, establishing a new benchmark in face recognition with synthetic face datasets. We also show the roles dataset subject uniqueness, diversity and consistency play in face recognition performance. The followings are the contributions of the paper. • We propose a two-stage face dataset generator that controls subject uniqueness, diversity and consistency. • For this, we propose a dual condition generator that mixes the two independent conditions Xid and Xsty. • We propose uniqueness, consistency and diversity metrics that quantify the respective properties of a given dataset, useful measures that allow one to compare datasets apart from the recognition performance. • We achieve SoTA in FR with 0.5M image synthetic training dataset by surpassing the previous methods by 6.11% on average in 5 popular test datasets. 63 4.2 Related Works Face Recognition Face Recognition (FR) is the task of matching query imagery to an enrolled identity database. SoTA FR models are trained on large-scale web-crawled datasets [55,82,300] with margin-based softmax losses [55, 102, 122, 154, 240]. The FR performance is measured on various benchmark datasets such as LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174] and CALFW [297]. These datasets are designed to measure factors such as pose changes and age variations. Performance on these datasets for models trained on large-scale datasets such as WebFace260M is well above 97% [122] in verification accuracy. Synthetic Face Generation Recent advances in generative models allow high fidelity synthetic face image generations [30, 47, 91, 115–117, 215]. GANs have been widely used to manipulate, animate or enhance face images [47, 59, 96, 143, 187, 221, 231, 262]. They typically learn disentangled representations in GAN latent space that control desired face properties. On the contrary, some works leverage the 3D face prior from 3D datasets (e.g., 3DMM [24]) for controllable synthesis [52, 73, 75, 119, 170, 178, 185, 207]. These methods have advantages in the fine-grained control over face generation and 3D consistency yet lack in style or domain variation. Recent advances in the latent variable models such as diffusion or score-based models have shown great success in high-quality image generation with a more stable and simple objective of MSE loss [91, 179, 213, 215–218]. Diffusion models have advanced the conditional image generation in tasks such as text-conditional image generation, inpainting, etc [25,190,196,246]. We adopt the diffusion model as a backbone and explore how the two image characteristics, namely ID and style images, can control complementary information, the subject appearance and the style of an image. Face Recognition with Synthetic Dataset Synthetic training datasets offer an advantage over real datasets with regards to ethical issues and class imbalance problems as large- scale face datasets have been criticized for lacking informed consent and reflecting racial biases [17,55,274,300]. 
Despite the benefit, the use of synthetic datasets as the sole training data is not widely adopted due to the resulting low recognition performance. In various domains such as face recognition [17, 150, 188], fingerprint recognition [64, 260], and anti-spoofing [158, 219], synthetic datasets have been shown to improve recognition when combined with real images. In the face domain, SynFace [188] studied the efficacy of using DiscoFaceGAN [59] for synthetic face generation. Recently, DigiFace-1M [17] studied the efficacy of 3D model based face rendering in combination with image augmentations to create a synthetic dataset. We propose a face dataset generation method that can generate both a large number of subjects and diverse styles that are close to the real dataset.

4.3 Proposed Approach
We propose the Dual Condition Face Dataset Generator (DCFace), a two-stage dataset generator (see Fig. 4.2). Stage 1 is the Condition Sampling Stage, which generates a high-quality ID image (Xid) of a novel subject and selects one arbitrary style image (Xsty) from the bank of real training data. Stage 2 is the Mixing Stage, which combines the two images using the Dual Condition Generator. Regarding the trainable models in each stage, Stage 1 requires training an ID image generator Gid; for the style bank, we can conveniently use any real face dataset that we wish the generated samples to follow. Stage 2 requires training a dual condition mixer Gmix. Both Gid and Gmix are based on diffusion models [91]. We describe each component and the associated training procedure in the following subsections.

4.3.1 Preliminary
Diffusion models [91, 213] are a class of denoising generative models that are trained to predict an image from random noise through a gradual denoising process. One notable difference from the class of GAN-based generators [79] is in the objective function and the sampling procedure. The forward process, as expressed in Eq. 4.1, corrupts the input X using variance-controlled Gaussian noise over t time-steps,

q(X_t | X_{t-1}) = \mathcal{N}\big(X_t;\ \sqrt{1-\beta_t}\,X_{t-1},\ \beta_t I\big),   (4.1)

and the denoising is done by training a model ϵ_θ(X_t, t) to predict the initial noise ϵ with an L2 objective,

L = \mathbb{E}_{t, X_0, \epsilon}\Big[\big\|\epsilon_\theta\big(\underbrace{\sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon}_{X_t},\ t\big) - \epsilon\big\|_2^2\Big].   (4.2)

β_t and \bar{\alpha}_t are pre-set variance scheduling scalars. The denoising diffusion model (DDPM) has shown success in producing diverse samples in text-conditioned image generation [190]. We find that in unconditional face generation, DDPM is also capable of generating many unique subjects. For instance, Fig. 4.3 compares DiscoFaceGAN [59] with DDPM [91] in their capacity to generate different subjects for every sample.

Figure 4.3 Comparison of the number of unique subjects generated by DiscoFaceGAN [59] and unconditional DDPM [91]. Uniqueness is the number of unique subjects measured by a face recognition model. By varying the threshold that determines a match between two subjects, we plot the number of unique subjects as defined in Eq. 4.11. Unconditional DDPM and DiscoFaceGAN are trained on FFHQ [116] and each generates 10,000 samples. The ability to generate novel subjects is larger for DDPM.

It shows that DDPM [91] is a good model choice for Gid and Gmix as it can generate many unique subjects. For Gid, we adopt the unconditional DDPM trained on FFHQ [116], having observed that it is capable of generating a large number of unique subject images.
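As a concrete reference for Eqs. 4.1-4.2, the following is a minimal PyTorch sketch of the DDPM training objective; the function signature, the precomputed alpha_bar schedule, and the eps_model interface are assumptions for illustration, not the exact training code.

import torch

def ddpm_loss(eps_model, x0, alpha_bar, T=1000):
    # eps_model(x_t, t): U-Net predicting the noise; alpha_bar: (T,) cumulative product of (1 - beta_t).
    b = x0.size(0)
    t = torch.randint(0, T, (b,), device=x0.device)          # random time-step per sample
    eps = torch.randn_like(x0)                                # epsilon ~ N(0, I)
    a = alpha_bar[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                # corrupted input X_t (Eq. 4.1, marginalized)
    return ((eps_model(x_t, t) - eps) ** 2).mean()            # L2 objective (Eq. 4.2)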
4.3.2 Dual Condition Generator Gmix
The two-stage data generation requires the Dual Condition Generator Gmix, which is a conditional DDPM. Specifically, the two conditions Xid and Xsty are injected into the denoiser ϵ_θ(X_t, t, E_id(X_id), E_sty(X_sty)) using trainable feature extractors E_id and E_sty and cross-attentions. Gmix is responsible for the operation X^A_id + X^B_sty → X^A_sty, a mixing of an image of a novel subject A and an arbitrary style image of a different subject B. Naive training would require the reference image X^A_sty, an image of subject A in the style of X^B_sty. This reference is absent in the labeled training dataset. As such, we modify the operation to X^A_id + X^A_sty → X^A_sty, using two different images of the same subject, as illustrated in Fig. 4.4(a). But this formulation is prone to a trivial solution of ignoring X^A_id, making the ID condition unused during test time. To mitigate this issue, we propose the following two elements.

Figure 4.4 a) A diagram of Gmix during training. At each step, we draw two labeled images from the labeled training dataset and use them as Xid and Xsty. We ensure that Xid is a good-quality, frontal-view image. t_emb is the time-step embedding in DDPM [91]. Xsty also serves as the target image, and we apply Gaussian noise ϵ to Xsty to create X_t as DDPM specifies. Then ϵ_θ(X_t, t, Xid, Xsty) is trained to predict ϵ using L_MSE, conceptually equivalent to a reconstruction loss that recovers Xsty. We also apply L_ID as in Eq. 4.10 for the dependence on Xid. b) The Patch-wise Style Extractor generates style vectors from small patches of images. Style vectors are architecturally constrained from containing full ID information. c) The Time-step dependent ID Loss is a linear interpolation between Xid and Xsty in the recognition feature space. It forces ϵ_θ to rely on Xid to extract the subject's appearance and to gradually shift the style to Xsty.

Patch-wise Style Extractor E_sty The motivation of the Style Extractor is to map an image Xsty to a feature that contains little ID information, forcing Gmix to rely on Xid for ID information. In prior works such as StyleGAN, the 1st and 2nd order statistics of a feature are shown to resemble the image style [116, 123, 136]. Yet, the resulting statistics are reduced in spatial dimensions and consequently lack spatially local information such as pose. We propose a module that can extract style information without losing spatial information. Specifically, consider a pretrained and fixed face recognition model F_s and its intermediate feature F_s(X_sty) = I_sty ∈ R^{C×H×W}. We divide the feature into a k×k grid. For each element in the grid, I^{k_i}_{sty} ∈ R^{C×(H/k)×(W/k)}, we perform a non-linear mapping on the mean and variance of I^{k_i}_{sty}. Specifically,

\hat{I}^{k_i} = \mathrm{BN}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Dropout}(I^{k_i}_{sty})))),   (4.3)
\mu^{k_i}_{sty} = \mathrm{SpatialMean}(\hat{I}^{k_i}), \qquad \sigma^{k_i}_{sty} = \mathrm{SpatialStd}(\hat{I}^{k_i}),   (4.4)
s^{k_i} = \mathrm{LN}\big((W_1 \odot \mu^{k_i}_{sty} + W_2 \odot \sigma^{k_i}_{sty}) + P_{emb}\big),   (4.5)
E_{sty}(X_{sty}) := s = [s^1, s^2, ..., s^{k_i}, ..., s^{k\times k}, s'],   (4.6)

where s' corresponds to the global feature obtained with k = 1. The final output s is a concatenation of all style vectors for each patch.
Each s^{k_i} is a mean and variance of local information, which is constrained from containing full pixel-level details together with the ID information. P_emb is a learned position embedding that lets the model differentiate between patch locations. BN and LN are BatchNorm [107] and LayerNorm [16]. F_s is a shallow CNN taken from the early layers of a pretrained FR model. It is fixed and not updated, to prevent it from optimizing I_sty; it serves only to create style information. By varying the grid size k×k, we can represent style at different spatial locations. An illustration of E_sty can be found in Fig. 4.4(b).

Figure 4.5 Illustration of conditional distributions in 2D space. Colored regions represent the true data distribution, with individual colors representing different labels. Colored triangles represent generated samples with corresponding labels. For each scenario except (a), the generated distribution does not follow the true distribution. Consistency, diversity and uniqueness analysis can quantify the shortcomings.

Time-step Dependent ID Loss To train the Dual Condition Generator Gmix, the original DDPM objective, the L2 loss of Eq. 4.2, is not sufficient to guarantee consistency in subject identity between the ID condition Xid and the prediction X̂_0. To ensure ID consistency, one could devise a loss function to maximize the similarity between Xid and the predicted denoised image X̂_0 in the ID feature space of a pretrained FR model F. Specifically, following Eq. 15 of DDPM [91], the one-step prediction of the original image is

\hat{X}_0 = \big(X_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(X_t, t, X_{id}, X_{sty})\big) \big/ \sqrt{\bar{\alpha}_t}.   (4.7)

A simple ID loss to increase the cosine similarity (CS) is

L_{naive1} = -\mathrm{CS}\big(F(X_{id}), F(\hat{X}_0)\big).   (4.8)

However, this loss is in conflict with the MSE loss and is empirically observed to reduce the predicted image quality. This is because the FR model F is not invariant to image style; some style of Xid has to match in order to completely reduce L_naive1. In contrast, one could also use

L_{naive2} = -\mathrm{CS}\big(F(X_{sty}), F(\hat{X}_0)\big),   (4.9)

as during training the labels of Xsty and Xid are the same. However, L_naive2 causes the model to depend on Xsty for ID information. Thus, during evaluation, when Xsty and Xid are different subjects, the label consistency of the generated dataset is compromised. We show this in Tab. 4.2. Instead, we propose to interpolate between F(Xid) and F(Xsty) across diffusion time-steps. Specifically,

L_{ID} = -\gamma_t\,\mathrm{CS}\big(F(X_{id}), F(\hat{X}_0)\big) - (1-\gamma_t)\,\mathrm{CS}\big(F(X_{sty}), F(\hat{X}_0)\big),   (4.10)

where γ_t = t/T is a time-dependent weight that changes linearly from 0 to 1. When t = T, ϵ_θ is predicting X_{t-1} from random noise, and we let the model fully exploit the ID information of Xid. Gradually, as t decreases, we let the model's prediction walk in the direction of Xsty. Note that during training, the actual labels of Xsty and Xid are the same, so the interpolation in the loss forces the prediction to stay the same in identity while gradually shifting in style toward Xsty. This loss allows ϵ_θ(X_t, t, Xid, Xsty) to play different roles depending on t. For t ≈ T, ϵ_θ will exploit Xid to infer a front-view, ID-rich image. And as t → 0, it will change the image's style to match the style of Xsty. The final loss is L_MSE + λ L_ID, with λ as a scaling parameter.
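The following is a minimal PyTorch sketch of the time-step dependent ID loss in Eqs. 4.7-4.10, assuming a frozen FR model fr_model and a precomputed alpha_bar schedule; the names and the mean reduction are illustrative assumptions.

import torch
import torch.nn.functional as F

def time_dependent_id_loss(eps_pred, x_t, t, T, alpha_bar, x_id, x_sty, fr_model):
    # One-step estimate of the clean image from the predicted noise (Eq. 4.7).
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x0_hat = (x_t - (1 - a).sqrt() * eps_pred) / a.sqrt()
    # Cosine similarities in the ID feature space of the fixed FR model.
    cs_id = F.cosine_similarity(fr_model(x_id), fr_model(x0_hat), dim=1)
    cs_sty = F.cosine_similarity(fr_model(x_sty), fr_model(x0_hat), dim=1)
    gamma = t.float() / T                                     # gamma_t = t / T
    return (-gamma * cs_id - (1 - gamma) * cs_sty).mean()     # Eq. 4.10

In training, this term is added to the reconstruction objective as L_MSE + λ L_ID, with λ = 0.05 as stated in Sec. 4.6.2.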
Eid and Conditioning Mechanism Following the success of text-conditional image generation and inpainting using DDPM [186, 190, 246], we adopt a similar architecture for inserting conditions into the model. We concatenate E_id(X_id) and E_sty(X_sty) and feed them into ϵ_θ using cross-attention and adaptive group normalization (AdaGN) layers [186]. E_id is a CNN with the same architecture as a small FR model (e.g., ResNet50), and it is trained end-to-end with ϵ_θ to extract ID features that are useful for ϵ_θ.

4.3.3 Condition Sampling Strategy
ID Image Sampling For sampling ID images, we generate 200,000 facial images from Gid, from which we remove faces that are wearing sunglasses or are too similar to the subjects in CASIA-WebFace, using Feval with a cosine similarity threshold of 0.3. We are left with 105,446 images. Then we narrow them down to 62,570 images that are unique according to the uniqueness criterion, Eq. 4.11, using Feval and r = 0.3. We then explore two different options: 1) random sampling and 2) gender/ethnicity-balanced sampling, as Gid has a distribution skewed towards White subjects, as shown in Tab. 4.1. We use [9] to classify ethnicity and [109, 257] to detect sunglasses. We denote sampling option 1 as random and option 2 as balance.

Style Image Sampling For style sampling, for each Xid we randomly sample Xsty from the style bank. We denote this option as random. We also explore the option of sampling Xsty from the pool of images whose gender/ethnicity matches that of Xid. We denote this option as match.

Ethnicity     | White | Asian | Others | Black | Indian
CASIA-WebFace | 0.634 | 0.144 | 0.074  | 0.074 | 0.072
DDPM Gid      | 0.660 | 0.209 | 0.046  | 0.034 | 0.048
Balanced      | 0.200 | 0.200 | 0.200  | 0.200 | 0.200

Table 4.1 Ethnicity distribution of CASIA-WebFace. Ethnicity prediction is made using [9]. DDPM Gid is trained on FFHQ [116].

4.4 Dataset Evaluation
In evaluating a synthesized dataset, one often adopts 1) FID [90] for evaluating the distribution similarity to the real images and 2) the subsequent recognition performance. In this section, we propose three class-dependent metrics that aid us in understanding the properties of generated labeled datasets. We let Feval be a recognition model used for evaluating synthesized face datasets. Note that this is different from F in the ID loss: F is a model used in the training loss, while Feval is used for evaluating the metrics. The more generalizable Feval is, the more accurate the metrics become in capturing the identity and diversity of the synthesized dataset. Let y_c be a class label, f_i = Feval(X_i), and let d(f_i, f_j) be the distance between two images in the Feval feature space.

Uniqueness Consider the following non-overlapping r-ball set in the Feval space,

U = \{\, f_i : d(f_i, f_j) > r,\ \forall j < i,\ i, j \in \{1, ..., N\} \,\},   (4.11)

where d(f_i, f_j) is the cosine distance. Then |U| is the count of unique subjects determined by the threshold r in an unlabeled dataset. Note that the set U is equivalent to sequentially adding r-balls into the Feval space until no more can be added without collision. |U| is subject to both r and Feval. In FR, r is the threshold of the FR model that is set to determine match or non-match. For a labeled synthetic dataset, one generates multiple feature sets {f_i^c} for the same label. To count the number of unique subjects, we calculate the number of unique centers, \bar{f}^c = \frac{1}{N_c}\sum_{i=1}^{N_c} f_i^c for c ∈ {1, ..., C}, where C is the number of subjects and N_c is the number of images per subject.
Then we define the number of unique subjects in a labeled dataset as |U_c|, where U_c is

U_c = \{\, \bar{f}^{c_n} : d(\bar{f}^{c_n}, \bar{f}^{c_m}) > r,\ \forall m < n,\ n, m \in \{1, ..., C\} \,\}.   (4.12)

For the metric, we use U_class = |U_c| / C, the ratio between the number of unique subjects and the number of labels.

Intra-class Consistency It measures how consistent the generated samples are in adhering to the label condition, as

C_{intra} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c} \mathbb{1}\big[\, d(f_i^c, \bar{f}^c) < r \,\big],   (4.13)

which is the ratio of individual features f_i^c being close to the class center \bar{f}^c. For a given threshold r, higher values of C_intra mean the samples are more likely to be the same subject under the same label.

Intra-class Diversity It measures how diverse the generated samples are under the same label condition. Note that the diversity is in the style of an image, not in the subject's identity. We define the style space as the vector space of Inception Network [198] features pretrained on ImageNet [53], following the convention of [134], and denote the real and generated image Inception vectors as {s_i^c} and {ŝ_j^c}. For intra-class diversity, we measure how many real images fall into the style-space manifold defined by the generated images under the same label condition. We compute this by extending the Improved Recall Metric [134] from comparing the unconditional distributions of real and fake images to comparing the label-conditional distributions. Specifically, for a set of real and generated feature vectors {s_i^c}, {ŝ_j^c} under the same label condition y_c, we define the k-nearest feature distance r_k as r_k = d\big(ŝ_j^c, \mathrm{NN}_k(ŝ_j^c, \{ŝ_j^c\})\big), where NN_k returns the k-nearest feature vector in {ŝ_j^c}, and

I(s_i^c, \{ŝ_j^c\}) = \begin{cases} 1, & \exists\, ŝ_j^c \in \{ŝ_j^c\}\ \text{s.t.}\ d(s_i^c, ŝ_j^c) \le r_k \\ 0, & \text{otherwise,} \end{cases}   (4.14)

where d(·) is a Euclidean distance. Then the diversity is defined by

D_{intra} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i=1}^{N_c} I(s_i^c, \{ŝ_j^c\}),   (4.15)

which is the fraction of the real-image style manifold covered by the generated-image style manifold, as defined by the k-nearest-neighbor ball. If the style variation is small, then r_k becomes small, reducing the chance of d(s_i^c, ŝ_j^c) ≤ r_k. We compute the recall per class to capture the style variation conditional on the subject label.

In Fig. 4.5, we illustrate different scenarios of conditional generation and how these metrics can capture the shortcomings in each scenario. In Sec. 4.5 and Fig. 4.6, we measure the metrics on our generated datasets and compare with previous synthetic datasets [17, 188]. We find that FR performance is best when consistency and diversity are balanced. Also, we find that SynFace and DigiFace have high C_intra and low D_intra compared to our method in Fig. 4.6.

Figure 4.6 A plot of FR performance on 5 synthetic datasets with respect to the Consistency and Diversity metrics. Color intensity and circle size denote the FR accuracy.

4.5 Experiments
For Gid, which generates ID images, we adopt the publicly released unconditional DDPM [91] trained on FFHQ [116]. For Gmix, we train it on CASIA-WebFace [99] after initializing weights from Gid. Although using all of CASIA-WebFace is a valid setting, we split it into a 95-5 split between train and validation sets.
The validation set is used as a real dataset in measuring the uniqueness, consistency and diversity metrics. Gmix is trained for 10 epochs with a batch size of 256 using the AdamW optimizer [129, 164] with a learning rate of 0.001. Training takes 8 hours using two A100 GPUs. Once Gmix is trained, we use Gid, Gmix and a style bank to generate a synthetic labeled dataset. The style bank is the CASIA-WebFace training set. For sampling, we use DDIM [215] with 200 intervals. Generating 500K samples takes about 20 hours using one A100 GPU. To train FR models, for a fair comparison, we adopt the training scheme of [17, 188] using IR-SE-50 [55] as a backbone and AdaFace [122] as a loss function. We evaluate the trained FR models on five datasets: LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174] and CALFW [297]. CFP-FP and CPLFW are designed to measure FR under large pose variation, while AgeDB and CALFW target large age variation. To measure the consistency, diversity and uniqueness during evaluation, we adopt as Feval an IR101 [55] model trained on WebFace4M [300] with the AdaFace [122] loss.

Grid Size SynFace DigiFace 1×1 3×3 5×5 7×7 5×5 Loss - - LID Lnaive1 Lnaive2 LID 5×5 LID F - - Loss Model Uclass Cintra Dintra FR Perf. 0.131 0.9966 0.080 0.178 0.297 0.9973 0.978 0.9987 0.4418 0.7030 0.9809 0.956 0.7734 0.9035 0.924 0.690 0.5937 0.7950 0.988 0.9996 0.6546 0.8046 0.866 0.7835 0.924 0.9035 0.7734 0.7734 0.9035 0.924 0.954 0.9197 0.7715 74.75 83.45 79.28 85.79 89.04 50.00 84.75 50.00 89.04 89.04 89.89 F Fbigger F

Table 4.2 Model Ablation. For FR performance, we generate a synthetic dataset of 10K subjects with 50 images per subject using the (random, random) ID and style sampling strategy. Blue color indicates the adopted setting for subsequent experiments.

4.5.1 Model Ablation
To show the efficacy of our proposed modules, we ablate on 1) the grid size in the style extractor Esty, 2) the time-step dependent ID loss and 3) the ID loss backbone F. The number of samples we generate for the ablation is 10K subjects with 50 images per subject, similar to the CASIA-WebFace image count. We report the FR performance with the synthetic data by averaging the verification accuracies on the 5 validation sets. To measure Uclass, Cintra and Dintra, we use 500 subjects with 20 real images from the held-out validation set of CASIA-WebFace and generate an equivalent number of images with each method.

Grid Size We choose 4 grid sizes ranging from 1×1 to 7×7. Note that 1×1 corresponds to the style vector of the whole image. We expect to see higher spatial control in Xsty as the grid size increases. In Tab. 4.2, we report the three metrics Uclass, Cintra and Dintra. As the grid size increases, Esty features contain more fine-grained information, possibly related to ID, lowering the consistency. However, the diversity increases, making the conditional distribution more similar to the real dataset. The subsequent FR performance is best in the 5×5 setting, which is a good compromise between consistency and diversity. In Fig. 4.7, we show the effect of the grid size with examples.

ID Loss For the ID loss, we compare LID with Lnaive1 and Lnaive2 in Tab. 4.2. Using Lnaive1 or Lnaive2 both suffer from lower FR performance, but for different reasons.
Lnaive1 has low diversity because it is optimized to be similar to Xid, which consists of front-view, high-quality face images. Lnaive2 has low consistency because of the lack of dependence on Xid, effectively giving the resulting dataset random labels. An FR performance of 0.5 means the model diverged and is returning random predictions. LID, a linear interpolation of Lnaive1 and Lnaive2 across time-steps, results in the best performance.

ID Loss Backbone F The ID loss requires a pretrained FR model F. For all of our experiments, we use as F an IR50 trained on CASIA-WebFace. But we are curious whether there is a benefit to having a better representation from F. For this, we ablate Fbigger, a model pretrained on a larger dataset, WebFace4M [300]. Tab. 4.2 shows that a better FR backbone induces the generator to synthesize better datasets, even without explicitly showing WebFace4M images to the generators. But for fairness in comparing to the real CASIA-WebFace dataset, we do not use Fbigger for the subsequent analysis.

ID      | Style     | LFW   | CFP-FP | CPLFW | AgeDB | CALFW | AVG
random  | random    | 98.05 | 84.17  | 82.20 | 89.38 | 91.40 | 89.04
random  | match     | 98.28 | 84.61  | 82.32 | 89.12 | 91.28 | 89.12
balance | random    | 98.30 | 83.27  | 81.60 | 89.40 | 91.27 | 88.77
balance | match     | 98.38 | 84.06  | 82.45 | 89.30 | 91.38 | 89.11
balance | over smpl | 98.55 | 85.33  | 82.62 | 89.70 | 91.60 | 89.56

Table 4.3 Sampling Ablation. We generate a synthetic dataset of 10K subjects with 50 images per subject, using the setting indicated by the blue text in Tab. 4.2. over smpl is over-sampling Xid during training to show more front-view faces.

Methods              | Venue  | # images (# IDs × # imgs/ID) | LFW   | CFP-FP | CPLFW | AgeDB | CALFW | Avg   | Gap to Real
SynFace              | ICCV21 | 0.5M (10K × 50)              | 91.93 | 75.03  | 70.43 | 61.63 | 74.73 | 74.75 | 26.58
DigiFace             | WACV23 | 0.5M (10K × 50)              | 95.40 | 87.40  | 78.87 | 76.97 | 78.62 | 83.45 | 13.39
DCFace (Ours)        | -      | 0.5M (10K × 50)              | 98.55 | 85.33  | 82.62 | 89.70 | 91.60 | 89.56 | 5.65
DigiFace             | WACV23 | 1.2M (10K × 72 + 100K × 5)   | 96.17 | 89.81  | 82.23 | 81.10 | 82.55 | 86.37 | 9.55
DCFace (Ours)        | -      | 1.0M (20K × 50)              | 98.83 | 88.40  | 84.22 | 90.45 | 92.38 | 90.86 | 4.14
DCFace (Ours)        | -      | 1.2M (20K × 50 + 40K × 5)    | 98.58 | 88.61  | 85.07 | 90.97 | 92.82 | 91.21 | 3.74
CASIA-WebFace (Real) | -      | 0.49M (approx. 10.5K × 47)   | 99.42 | 96.56  | 89.73 | 94.08 | 93.32 | 94.62 | 0.0

Table 4.4 Verification accuracies of FR models trained with SoTA synthetic training datasets. SynFace [188] is a GAN-based dataset with a latent-space mixup technique. DigiFace [17] is a 3D model-based dataset with heavy image augmentation. DCFace uses the model setting from the ablation studies, Tabs. 4.2 and 4.3, indicated by blue color. The FR backbone is IR-SE50 [55] + AdaFace [122] to match the setting of DigiFace.

4.5.2 Sampling Ablation
Using the sampling strategy defined in Sec. 4.3.3, we ablate on the ID sampling options (random, balance) and style sampling methods (random, match) in Tab. 4.3. We find that neither balancing the gender/ethnicity distribution nor making the gender/ethnicity of the style image equal to that of the ID image brings a significant performance gain. On the other hand, to compensate for the lower label consistency compared to the real dataset, we include the same Xid 5 additional times for each label. This has the effect of oversampling Xid during FR training. When we add the oversampling option to the (balance, match) setting, we observe an average verification accuracy of 89.56%, a 0.52% increase over the (random, random) setting.

Figure 4.7 An example of SynFace and DigiFace in rows 1-2 and DCFace with different grid size settings in rows 3-7.
SynFace (DiscoFaceGAN) generates mostly frontal-view, high-quality images, and DigiFace contains synthetic face images with unrealistic texture compared to real images. Our grid size ablation changes the relative contribution of Xsty and Xid; a good FR performance is a compromise in between, at 5×5. Note that our method can produce diverse styles such as low lighting, pose variation, glasses, hats, etc. Using Xid to query subjects in the CASIA-WebFace and DCFace datasets returns the top-5 most similar subjects. We see that Xid is sufficiently different from other (real or fake) subjects.

4.5.3 Comparison with Previous Methods
For training FR models with synthetic datasets, we compare with SynFace [188] and DigiFace [17]. We compare the 0.5M and 1.2M image-count settings. The first setting corresponds to the size of the CASIA-WebFace real dataset. The second setting is to evaluate the effect of increasing the training dataset size. In Tab. 4.4, we show the verification accuracies on the 5 validation sets. In the 0.5M regime, our DCFace surpasses DigiFace in 4 out of 5 datasets with an improvement of 6.11% on average. On the CFP-FP dataset with extremely large pose variation, DigiFace performs better, showing the merit of 3D-consistent face synthesis using 3D models. DCFace has a good balance of consistency and diversity with many unique subjects, leading to better FR performance in general. Note the larger style variation compared to SynFace and DigiFace in Fig. 4.7.

The last column of Tab. 4.4 shows the gap between synthetic and real, calculated as (REAL - SYN)/SYN, e.g., 5.65% = (94.62 - 89.56)/89.56. It indicates how much improvement is needed to be on par with the real dataset. In the 0.5M setting, DCFace reduces the gap to the real-data performance by 57% over the SoTA. When we use more synthetic data, as in the 1.2M regime, the synthetic dataset performance comes closer to that of the real dataset (3.74% gap), a 60.9% improvement over the previous method (9.55% gap).

4.6 Training Details
4.6.1 Architecture Details
The dual condition generator Gmix is a modification of DDPM [91] to incorporate two conditions. We insert the two conditions Xid and Xsty into the denoising U-Net ϵ_θ(X_t, t, Xid, Xsty). The conditioning images Xsty and Xid are mapped to features using Esty and Eid, respectively. Following Eq. 4.6, the style information Esty(Xsty) is the concatenation of style vectors at the different k×k patch locations,

E_{sty}(X_{sty}) := s = [s^1, s^2, ..., s^{k_i}, ..., s^{k\times k}, s'] \in \mathbb{R}^{(k^2+1)\times C}.   (4.16)

On the other hand, the ID information is a concatenation of features extracted from a trainable CNN (e.g., ResNet50 [86]), which produces an intermediate feature I_id of shape R^{7×7×512} and a feature vector f_id of shape R^{512}. Specifically,

E_{id}(X_{id}) := i = [\mathrm{Flatten}(I_{id}), f_{id}] + P_{emb} \in \mathbb{R}^{50\times C},   (4.17)

where Flatten refers to removing the H×W spatial dimension, and R^{50×C} comes from concatenating features of length 7×7 and 1. P_emb is a learnable position embedding for distinguishing each feature position in the subsequent cross-attention operation. Detailed illustrations of Esty(Xsty) and Eid(Xid) are shown in Fig. 4.8. The channel dimension C of Esty(Xsty) and Eid(Xid) is 512. When Esty(Xsty) and Eid(Xid) are prepared, they together form (k² + 1) + 50 vectors of dimension 512.
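The following is a minimal PyTorch sketch of how the condition tokens of Eqs. 4.16-4.17 can be assembled; the tensor names, shapes and the placement of the learned position embedding are illustrative assumptions rather than the exact implementation.

import torch

def build_condition_tokens(style_tokens, id_map, id_vec, pos_emb):
    # style_tokens: (B, k*k + 1, C) output of E_sty                    (Eq. 4.16)
    # id_map: (B, C, 7, 7) intermediate E_id feature; id_vec: (B, C) final ID vector f_id
    # pos_emb: (1, 50, C) learnable position embedding P_emb
    id_tokens = torch.cat([id_map.flatten(2).transpose(1, 2),          # (B, 49, C)
                           id_vec.unsqueeze(1)], dim=1)                # (B, 50, C)
    id_tokens = id_tokens + pos_emb                                    # Eq. 4.17
    return torch.cat([id_tokens, style_tokens], dim=1)                 # (B, (k*k + 1) + 50, C)

The resulting (k² + 1) + 50 condition tokens then serve as the additional keys and values Kc = Vc in the cross-attention operation described next.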
These can be injected into the U-Net ϵ_θ by following the convention of DDPM-based text-conditional image generators [190]. Specifically, the cross-attention operation can be written as a modification of the attention equation [236] with query Q, key K and value V, and additional condition key Kc and value Vc:

\mathrm{Attn}(Q, K, V) = \mathrm{SoftMax}\Big(\tfrac{QW_q\,(KW_k)^\top}{\sqrt{d}}\Big)\, W_v V,   (4.18)

\text{Cross-Attn}(Q, K, V, K_c, V_c) = \mathrm{SoftMax}\Big(\tfrac{QW_q\,([K, K_c]W_k)^\top}{\sqrt{d}}\Big)\, W_v [V, V_c],   (4.19)

where W_q, W_k and W_v are learnable weights and [·] refers to the concatenation operation. In our case, Q = K = V is an arbitrary intermediate feature in the U-Net, and Kc = Vc are the conditions generated by Esty(Xsty) and Eid(Xid), concatenated together. This operation allows the model to update the intermediate features with the conditions when necessary. We insert the cross-attention module in the last two down-sampling Residual Blocks of the U-Net, as shown in Fig. 4.9.

Figure 4.8 Left: An illustration of Esty. The key property of Esty is that it restricts the information in Xsty from flowing freely to the next layer. The fixed feature encoder Fs and the patch-wise spatial mean-variance operation destroy the detailed ID information while preserving the style of an image. We create an output of size R^{(k²+1)×C}. Right: A simple CNN based on ResNet50. We take an intermediate representation and the last feature vector and concatenate them to create an output of size R^{50×C}.

To increase the effect of Xid in the conditioning operation, we also add f_id to the time-step embedding t_emb. As shown on the right side of Fig. 4.9, the Residual Block in the U-Net modulates the intermediate features according to the scaling vector provided by f_id + t_emb. GNorm [259] refers to Group Normalization and SiLU refers to the Sigmoid Linear Unit [63]. Adding f_id to t_emb in the Residual Block allows more paths for Xid to change the output of the U-Net.

Figure 4.9 Illustration of the DDPM U-Net with the conditioning operations highlighted. The red arrow indicates how the dual conditions are injected into the intermediate features of the U-Net using cross-attention layers. For clarity, the up-sampling stages are not illustrated, but they are symmetric to the down-sampling stages. On the right is a detailed illustration of the Residual Block with the time-step and ID condition. t_emb and f_id from Eid are added together and used to scale the output of the Residual Block.

4.6.2 Training Hyper-Parameters
The final loss for training the model end-to-end is L_MSE + λ L_ID, with λ as a scaling parameter. We set λ = 0.05 to compensate for the different scales of the L2 loss and the cosine similarity. All our input image sizes are 112×112, following the convention of SoTA face recognition datasets [55, 99, 300]. Our code is implemented in PyTorch.

4.7 More Experiment Results
4.7.1 Adding Real Dataset
We include additional experiment results that involve adding real images. Although the motivation of the paper is to use an only-synthetic dataset to train a face recognition model, the performance comparison with an added subset of the real dataset has its merits; it shows 1) whether the synthetic dataset is complementary to the real dataset and 2) whether the synthetic dataset can work as an augmentation for real images. Tab.
4.5 shows the performance comparison between DigiFace [17] and our proposed 80 𝑬𝒔𝒕𝒚ℝ(𝒌𝟐&𝟏&𝟓𝟎)×𝟓𝟏𝟐𝑬𝒊𝒅Noised Image 𝑿𝒕predicted$𝑿𝟎𝒇𝒊𝒅Style Image 𝑿𝑺𝒕𝒚ID Image 𝑿𝒊𝒅Res-BlockRes-BlockRes-BlockRes-BlockCrossAttCrossAtt𝒕𝒆𝒎𝒃Residual BlockU Net with Dual ConditionGNormSiLUConvGNorm+𝒇𝒊𝒅𝒕𝒆𝒎𝒃Scaling feature with 𝒕𝒆𝒎𝒃+ 𝒇𝒊𝒅 DCFace when 1) a few real images are added and 2) both synthetic datasets are combined. The performance gap for DigiFace is large, jumping from 86.37 to 92.67 on average when 2K real subjects with 20 images per subject are added. In contrast, ours show a relatively less dramatic gain, 91.21 to 92.90 when few real images are added. This indicates that DigiFace [17] is quite different from the real images and ours is similar to the real images. This is in-line with our expectation as we have created a synthetic dataset that tries to mimic the style distribution of the training dataset, whereas DigiFace simulates image styles using 3D models. 4.7.2 Combining Multiple Synthetic Datasets In the second to the last row of Tab. 4.5, when we combined the two synthetic datasets without the real images, the performance is the highest, reaching 93.06 on average. This result indicates that different synthetic datasets can be complementary when they are generated using different methods. # Synthetic Imgs # Real Imgs LFW CFPFP CPLFW AGEDB CALFW AVG DigiFace DigiFace DCFace DCFace 1.2M (10K×72+100K×5) 1.2M (10K×72+100K×5) 1.2M (20K×50+40K×5) 1.2M (20K×50+40K×5) DCFace+DigiFace (2.4M) CASIA 0 0 2K×20 0 2K×20 0 0.5M 96.17 99.17 98.58 98.97 99.20 99.42 89.81 94.63 88.61 94.01 93.63 96.56 82.23 88.1 85.07 86.78 87.25 89.73 81.10 90.5 90.97 91.80 92.25 94.08 82.55 90.97 92.82 92.95 92.95 93.32 86.37 92.67 91.21 92.90 93.06 94.62 Gap to Real 8.72 2.06 3.61 1.82 1.65 0 Table 4.5 Verification accuracies of FR models trained with synthetic datasets and subset of real datasets. In all settings, the backbone is set to IR50 [55] model with AdaFace loss [122] for a fair comparison. 4.8 Analysis C.1 Unique Subject Counts In Fig. 4.10, we plot the number of unique subjects that (cid:77) can be sampled as we increase the sample size. The blue curve shows that the number of unique samples that can be generated by a DDPM of our choice does not saturate when we sample 200, 000 samples. At 200, 000 samples, the unique subjects are about 60, 000. And by extrapolating the curve, we estimate the number might reach 80, 000 with more samples. 81 Our DDPM of choice is trained on FFHQ [116] dataset which contains 70, 000 unlabeled high-quality images. The orange line shows the number of unique samples that are sufficiently different from the subjects in the CASIA-WebFace dataset. The green line shows the number of unique samples left after filtering images that contain sunglasses. The flat region is due to the filtering stage reducing the total candidates. The plot shows that DDPM trained on FFHQ dataset can sufficiently generate a large number of unique and new samples that are different from CASIA-WebFace dataset. However, with more samples, eventually there is a limit to the number of unique samples that can be generated. When the number of total generated samples is 100, 000, one additional sample has approximately 24% chance of being unique, whereas, at 200, 000, the probability is 15%. The rate of sampling another unique subject decreases with more samples. The model used for evaluating the uniqueness is IR101 [55] trained on the WebFace4M [300] dataset. And we use the threshold of 0.3. 
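As a concrete reference for how such unique-subject counts can be obtained, the following is a minimal NumPy sketch of the greedy r-ball counting of Eq. 4.11. The use of cosine distance on L2-normalized Feval embeddings and the function name are assumptions for illustration.

import numpy as np

def count_unique_subjects(features, r=0.3):
    # features: (N, d) L2-normalized embeddings from F_eval, so that
    # the cosine distance is d(f_i, f_j) = 1 - <f_i, f_j>.
    kept = []
    for f in features:
        # keep f only if it is farther than r from every feature kept so far (Eq. 4.11)
        if all(1.0 - float(np.dot(f, g)) > r for g in kept):
            kept.append(f)
    return len(kept)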
We would like to note a typo in Sec. 3.3 of the main paper, where the number of unique subjects should be corrected from 62, 570 to 42, 763. Figure 4.10 Plot of unique subject count as the number of samples from Gid is increased from 1000 to 200, 000. At 200, 000, one additional sample has approximately 15% chance of being unique. And the rate decreases with more samples. C.2 Feature Plot In Fig. 4.11, we show the 2D t-SNE [235] plot of synthetic images (cid:77) generated by 3 different methods (DiscoFaceGAN [59], DigiFace [17] and proposed DCFace). The red circles represent real images from CASIA-WebFace. We extract the features from each image using a pre-trained face recognition model, IR101 [55] trained on WebFace4M [300]. We show two settings we sample (a) 50 subjects with 1 image per subject and (b) 1 subject 82 Number of SamplesUniqueness Count with 50 images per subject. Note that the proximity of DCFace image features is closer to CASIA-WebFace image features, highlighted in a circle. For each setting, we show the features extracted from an intermediate layer of IR101 and the last layer. As the layer becomes deeper, the features become suitable for recognition, as shown in the last column of the figure. Figure 4.11 (a) the t-SNE plot of features from synthetic and real datasets of 50 subjects per dataset. It shows how 50 randomly sampled subjects from each dataset are distributed. The distribution between real (red) and DCFace (green) is the closest. (b) the t-SNE plot of features from synthetic and real datasets of 1 subject per dataset with 50 images. We randomly sample 1 subject from each dataset. The last layer features are well separated as the model is a face recognition model that separates the features of different subjects. C.3 Comparison with Classifier Free Guidance (cid:77) When ϵ(xt, c) learns to use the condition c, the difference ϵ(xt, c)−ϵ(xt) can give further guidance during sampling to increase the dependence on c. But, in our case, the ID condition is the fine-grained facial difference that is hard to learn with MSE loss. Proposed Time- dependent ID loss, LID helps the model learn this directly. Row 3 vs 4 of Tab. 4.6 shows that LID is more effective than CFG. Interestingly, with a large guidance scale, CFG becomes harmful. CFG decreases diversity as pointed out by [92]. We observe that guidance with Xid leads to consistent ID but with little facial variation, the same phenomenon in DCFace with grid-size 1x1 in Esty, in Tab. 2 83 Intermediate Layer FeatureLast Layer FeatureIntermediate Layer FeatureLast Layer Feature(b) 1 Subject 50 Images (Intra-class Dist.)(a) 50 Subject 1 Images (Inter-class Dist.) Conditions Train Loss Sampling FR.Perf ↑ + Guide 1 CNN(Xid), CNN(Xsty) 2 CNN(Xid), Esty(Xsty) 3 CNN(Xid), Esty(Xsty) 4 CNN(Xid), Esty(Xsty) MSE+LID MSE MSE MSE × + Guide × 73.38 82.30 84.05 89.56 Table 4.6 Green Esty and LID indicates the novelty of our paper. For guidance, we adopt 10% condition masking during training and the guidance scale of 3 during sampling. FR.Perf is an average of 5 face recognition performances as in the main paper. (main). Good FR datasets need both large intra and inter-subject variability and we combine Esty and LID to achieve this. C.4 FID Scores Note that our generated data is not high-res images like FFHQ when (cid:77) compared to how SynFace is similar to FFHQ. (Tab. 4.7 row 5 vs 6). But, we point out that our aim is not to create HQ images but to create a database with realistic inter/intra-subject variations. 
In that regard, we have successfully approximated the distribution of the popular FR training dataset CASIA-WebFace (FID=13.67).

    Generator Train Data   Target (real)      Source (real/syn)   FID ↓
1   -                      CASIA (val)        CASIA (train)        9.57
2   CASIA (train)          CASIA (val)        DCFace              13.67
3   FFHQ+3DMM              CASIA (val)        SynFace             38.48
4   3D Face Capture        CASIA (val)        DIGIFACE1M          71.65
5   CASIA (train)          FFHQ (train+val)   DCFace              35.45
6   FFHQ+3DMM              FFHQ (train+val)   SynFace             21.75
7   3D Face Capture        FFHQ (train+val)   DIGIFACE1M          68.67

Table 4.7 FID scores of synthetic vs. real datasets. For synthetic datasets, we randomly sampled 10,000 images. See Line 630 for the CASIA-WebFace train and validation set split. All images are aligned and cropped to 112×112 to be in accordance with CASIA-WebFace.

Having said this, we note FID is not comprehensive in evaluating labeled datasets. It cannot capture label consistency nor directly relate to FR performance. As such, SynFace/DigiFace do not report FID. We propose the U, D, C metrics that enable a holistic analysis of labeled datasets.

C.5 Does DCFace change gender? DCFace combines Xid and Xsty while adhering to the subject ID as defined by a pre-trained FR model. Factors weakly related to ID, such as age and hair style, can vary. Biometric ambiguity can occur due to makeup, wigs, weight change, etc., even in real life. The perceived gender may change, but changes such as hair are less relevant to subject ID for the FR model.

C.6 Why is DCFace better in U, D, C metrics? We note DCFace is not better in all of U, D, C. Fig. 6 (main) shows SynFace has the highest consistency (C). But DCFace excels in the tradeoff between C and D. In other words, style similarity to the real dataset (i.e., D) is lacking in other datasets, and it is as important as ID consistency. As such, the U, D, C metrics reveal the weak and strong points of synthetic datasets.

4.9 Visualizations

4.9.1 Time-step Visualization Fig. 4.12 shows how the DDPM generates output at each time-step. The far left column shows Xsty, the desired style of an image. The far right column shows Xid, the desired ID image of choice. In early time-steps, the network reconstructs the front-view image with the ID of Xid. Gradually, it interpolates the image into the desired style of Xsty. The gradual transition can be in the pose, hair-style, expression, etc.

Figure 4.12 A plot of DCFace outputs at each time-step.

4.9.2 Interpolation In Fig. 4.13, we show the plot of interpolation in Xsty. While keeping the same identity Xid, we take two style images Xsty1 and Xsty2. We interpolate with α in αEsty(Xsty1) + (1 − α)Esty(Xsty2), with α increasing linearly from 0 to 1. The interpolation is smooth, creating an intermediate pose and expression that did not exist before.

Figure 4.13 A plot of DCFace output with style interpolation.

4.10 Miscellaneous

Similarity threshold Threshold=0.3 is based on the FR evaluation model having a threshold of 0.3080 for verification with TPR@FPR=0.01%: 97.17% on IJB-B [253]. FPR=0.01% is widely used in practice, and the scale of similarity is (−1, 1). At threshold=0.3, FFHQ has 200 (2%) more unique subjects than the DDPM, signaling a similar level of uniqueness.

Style Extracting Model We use the early layers of a face recognition model for the style extractor backbone. Our rationale for adopting the early layers of the FR model, as opposed to those of an ImageNet-trained model, is that the early layers extract low-level features and we wanted features optimized on the face dataset.
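A minimal sketch of this design choice, assuming a pre-trained FR backbone whose child modules can simply be truncated after the first few stages; the truncation point and the loader name in the usage comment are illustrative, not the exact DCFace architecture split.

```python
import torch
import torch.nn as nn

def build_style_extractor(fr_backbone: nn.Module, keep_children: int = 3) -> nn.Module:
    """Keep only the first few blocks of a pre-trained face-recognition backbone
    so that the style extractor outputs low-level (texture/style) feature maps
    rather than identity-discriminative embeddings."""
    early_layers = list(fr_backbone.children())[:keep_children]
    style_extractor = nn.Sequential(*early_layers)
    for p in style_extractor.parameters():      # optionally freeze the extractor
        p.requires_grad = False
    return style_extractor

# Usage (hypothetical): fr = load_pretrained_ir50()      # any FR backbone works
# E_sty = build_style_extractor(fr, keep_children=3)
# style_feat = E_sty(torch.randn(1, 3, 112, 112))        # low-level feature map
```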
But it is possible to use other models as long as they generate low-level features.

Evaluation on Harder Datasets We evaluate on IJB-B [253] (TPR@FPR=0.01%: 75.12) and TinyFace [46] (Rank-1: 41.66). We include this result so that future works can evaluate on harder datasets.

Real and Generated Similarity Analysis In addition to Fig. 7, which matches ˆXid with CASIA-WebFace, matching all ˆX0 (generated) images against CASIA-WebFace at threshold=0.3 yields a 0.0026% FMR. This implies that only a small fraction of CASIA-WebFace images are similar to the generated images.

4.11 Societal Concerns

We believe that the Machine Learning and Computer Vision community should strive together to minimize negative societal impact. Our work falls into the categories of 1) image generation using generative models and 2) synthetic labeled dataset generation. In the field of image generation, unfortunately, there are numerous well-known malicious applications of generative models. Fake images can be used to impersonate high-profile figures and create fake news. Conditional image generation models make malicious use cases easier to adapt to different scenarios because of user controllability. Fortunately, GAN-based generators produce subtle artifacts in the generated samples that allow visual forgery detection [14, 76, 245, 279]. With the recent advances in DDPMs, the community is optimistic about detecting forgeries from diffusion models [203]. It is also known that proactive treatments on generated images increase forgery detection performance [14], and as generative models become more sophisticated, proactive measures may be advised whenever possible.

Synthetic dataset generation is, on the other hand, an effort to avoid infringing on the privacy of individuals on the web. Large-scale face datasets are collected without informed consent, and only a few evaluation datasets such as IJB-S [112] have IRB compliance for safe and ethical research. Collecting large-scale datasets with informed consent is prohibitively challenging, and the community uses web-crawled datasets for lack of an alternative. Therefore, efforts to create synthetic datasets with synthetic subjects can be a practical solution to this problem. In our method, we still use real images to train the generative models. We hope that research in synthetic dataset generation will eventually replace real images, not just in recognition tasks but also in generative tasks, removing the need for using real datasets in any form.

4.12 Implementation Details and Code

The code will be released at https://github.com/mk-minchul/dcface. For preprocessing the training data CASIA-WebFace [99], we reference AdaFace [122] and use MTCNN [285] for aligning and cropping faces. For the backbone model definition, we use TFace [3], and for evaluation on LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174], and CALFW [297], we use the AdaFace repository [122].

4.13 Conclusion

This paper presents a method for creating a synthetic training dataset for face recognition. Dataset generation is studied from the perspective of generating many unique subjects with large style diversity and label consistency. We propose the Dual Condition Face Generator to this end and show its large FR performance gain over previous methods for synthetic dataset generation. We believe our approach takes one step towards matching the performance of real training datasets with synthetic training datasets.
Limitations This work addresses the problem of generating label consistent and diverse datasets for face recognition model training. In our model ablation, we find that sacrificing label consistency for diversity to some degree is beneficial for the FR model training. However, this is not ideal; for instance, our synthetic face generator lacks 3D consistency across pose, which is an advantage of generative models with 3D priors. Secondly, the goal of our research is to release a synthetic face dataset that alleviates the dependence on large-scale web-crawled images. As shown in our experiments, there is still some performance gap between real and synthetic training datasets. In this work, we take one step towards the goal and hope that the continued research will introduce a standalone synthetic face dataset. 88 CHAPTER 5 KEYPOINT RELATIVE POSITION ENCODING FOR FACE RECOGNITION In this paper, we address the challenge of making ViT models more robust to unseen affine transformations. Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE, which leverages key points (e.g. facial landmarks) to make ViT more resilient to scale, translation, and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE, however, can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle, where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints, the model can more effectively retain spatial relationships, even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images, particularly where alignment is prone to failure. Code and pre-trained models are available. 5.1 Introduction Geometric alignment has shown to be highly effective for certain computer vision problems, such as face, body and gait recognition [55, 56, 100, 122, 128, 131, 139, 149, 154, 169, 172, 174, 202, 240, 253, 289, 290, 292, 296]. Alignment is the process of transforming input images, to a consistent and standardized form, often by scaling, rotating, and translating. This standardization helps recognition models learn the underlying patterns and features more effectively. As a result, many state-of-the-art (SoTA) face recognition models [55,122,172,240] rely on well-aligned datasets [54, 55, 82, 300] to achieve high accuracy. Fig. 5.1 shows a toy example with a training dataset MNIST [58] and test set AffNIST [197] which is in unseen affine transformation of MNIST. Using a shallow ViT [61] model, one can 89 Figure 5.1 Toy Example illustrating how different Position Embeddings impact the ViT’s robustness to unseen affine transforms. Abs-PE refers to the learnable Absolute Position Embedding. RPE and iRPE refers to Relative Position Embedding adopted to ViT [105, 256]. Keypoints in MNIST is arbitrarily defined to be the four corners of a box that covers a digit. Abs-PE* is drawing the keypoints onto the input image. KP-RPE uses the keypoints to adjust the RPE. easily achieve 98.1% accuracy in the MNIST test set. 
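A robustness check in the spirit of this toy example can be scripted by evaluating an MNIST-trained classifier under random affine perturbations as a stand-in for AffNIST; the transform ranges below are illustrative assumptions rather than the exact AffNIST parameters.

```python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Evaluation transform that mimics unseen affine perturbations at test time.
affine_eval_tf = transforms.Compose([
    transforms.RandomAffine(degrees=20, translate=(0.2, 0.2), scale=(0.8, 1.2)),
    transforms.ToTensor(),
])

@torch.no_grad()
def affine_accuracy(model, batch_size=256, device="cuda"):
    """Accuracy of an MNIST-trained model on affine-perturbed MNIST test digits."""
    test_set = datasets.MNIST(root="data", train=False, download=True,
                              transform=affine_eval_tf)
    loader = DataLoader(test_set, batch_size=batch_size)
    model.eval().to(device)
    correct = total = 0
    for x, y in loader:
        pred = model(x.to(device)).argmax(dim=1).cpu()
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```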
However, in AffNIST, ViT with the original Absolute Position Embedding obtains only 77.27% accuracy. Such a sharp decrease in performance with unseen affine transform causes problems in applications that rely on accurate input alignments. In face recognition, alignment can be imperfect, especially in low-quality images where accurate landmark detection is difficult [54, 148]. Thus, images with low resolution or taken in poor lighting may result in misalignment during testing. Given the interplay between alignment and recognition, it becomes crucial to proactively handle potential alignment failures, which often result from, e.g., low-quality images. In other words, there is a need for a recognition model that is robust to scale, rotation, and translation variations. We revisit the Relative Position Encoding (RPE) concept used in ViT [61] and find 90 Train Set(MNIST)Test Set(AffNIST)Toy Example for Measuring Affine Transformation RobustnessKeypoints*Abs-PEAbs-PE*RPEiRPEKP-RPEAffNIST Test Accuracy %Unseen Affine Transform KP-RPE offers robustness to unseen affine transforms by using keypoints. Figure 5.2 Illustration of RPE [205] and proposed KP-RPE. The blue arrow represents the learned attention offset Bij between a query i and key j of attention in RPE. The query-key relationship at the same i and j should represent different relationships as the scale or pose change. But Bij does not change in RPE. KP-RPE addresses this issue by incorporating the distance to the keypoints when calculating the learned attention offset in RPE. that RPE can be useful for introducing affine transform robustness. RPE [205] enables the model to capture the relative spatial relationships among regions of an image, learning the positional dependencies without relying on absolute coordinates. As shown in Fig. 5.1, adding RPE to ViT increases the performance in AffNIST. With RPE [205], queries and keys of self-attention [236] at closer distances can be assigned different attention weights compared to those at a greater distance. While RPE allows the model to exploit relative positions, it has a limitation: even if an image changes in terms of scaling, shifting, or orientation, the significance of the key-query position in RPE stays the same. This static behavior is illustrated in Figs. 5.2 a)-c). Notably, the key-query relationship is the same regardless of the corresponding pixels’ semantic meaning changes. We hypothesize that an RPE which dynamically adapts based on image keypoints, such as facial landmarks, could improve the model’s comprehension of spatial relationships in the image. By leveraging the spatial relationships with respect to these keypoints, the model can adapt to variations in scale, rotation, and translation, resulting in a more robust recognition system capable of handling both aligned and misaligned datasets. Fig. 5.2 d) highlights a keypoint-dependent query-key relationship. To this end, we introduce KeyPoint RPE (KP-RPE), a method that dynamically adapts 91 QueryKeyRelative Position EmbeddingImage Spacea) Eye to Nose RelationshipScale ChangePose Changeb) Eye to Mouth Relationshipc) Skin to Skin RelationshipRPE: relationship based on distance in the image plane.d) Eye to Mouth RelationshipKeypoint ( ) DependentKP-RPEProblem: query key relationship is invariant to scale, pose change=𝑓( , , )=𝑓( , )=𝑓( , )=𝑓( , )𝐁𝐢𝐣 for RPE(learned attention offset)Unique 𝐁𝐢𝐣 for different == the spatial relationship in ViT based on the keypoints present in the image. 
Our experi- ments demonstrate that incorporating KP-RPE into ViT significantly enhances the model’s robustness to misaligned test datasets while maintaining or even improving performance on aligned test datasets. We show the usage of KP-RPE in face recognition and gait recognition as the inputs share the same topology (face or body) that allows the keypoints to be defined. Finally, KP-RPE is an order of magnitude faster than iRPE [256], a widely used RPE that depends on the image content. In summary, the contributions of this paper include: • The insight that RPE (or its variants) can improve the robustness of ViT to unseen affine transformations. • The development of Keypoint RPE (KP-RPE), a novel method that dynamically adapts the spatial relationship in Vision Transformers (ViT) based on the keypoints in the image, significantly enhancing the model’s robustness to misaligned test datasets while maintaining or improving performance on aligned test datasets. • Comprehensive experimental validation demonstrating the effectiveness of our proposed KP-RPE, showcasing its potential for advancing the field of recognition by bringing model’s robustness to geometric transformation. We improve the recognition performance across unconstrained face datasets such as TinyFace [46] and IJB-S [112] and even non-face datasets such as Gait3D [67, 292]. 5.2 Related Works Relative Position Encoding in ViT Relative Position Encoding (RPE) is first introduced by Shaw et al. [205] as a technique for encoding spatial relationships between different elements in a sequence. By adding relative position encodings into the queries and keys, the model can effectively learn positional dependencies without relying on absolute coordinates. Subsequent works, such as those by Dai et al. [50] and Huang et al. [105], refine and expand upon the concept of RPE, demonstrating its effectiveness in natural language processing (NLP) tasks. The adoption of RPE in Vision Transformers [61] has been explored by several researchers. 92 For instance, Ramachandran et al. [189] propose a 2D RPE method that computes the x, y distance in an image plane separately to include directional information. A notable RPE method in ViT is iRPE [256], which considers directional relative distance modeling as well as the interactions between queries and relative position encodings in a self-attention mechanism. Despite the success of these RPE methods in various vision tasks, they do not specifically address the challenges associated with scale, rotation, and translation variations in face recog- nition applications. This shortcoming highlights the need for RPE methods that can better handle these variations, which are common in real-world low-quality face recognition scenarios. To address this, we propose KP-RPE, which incorporates keypoint information during the network’s feature extraction, significantly enhancing the model’s ability to generalize across affine transformations. Keypoints and Spatial Reasoning Keypoint detection, often associated with landmarks, has been fundamental in various vision tasks such as human pose estimation [35, 177], face landmark detection [31, 132, 224, 285], and object localization [183]. These keypoints serve as representative points that capture the essential structure or layout of an object, facilitating tasks like alignment, recognition, and even animation. Face landmark detection is commonly carried out alongside face detection. 
MTCNN [285] is a widely-used method for combined face detection and facial landmark localization, utilizing cascaded CNNs (P-Net, R-Net, and O-Net) that collaborate to detect faces and landmarks in an image. RetinaFace [54], on the other hand, is a single-stage detector [144, 153] based landmark localization algorithm, demonstrating strong performance when trained on the annotated WiderFace [269] dataset. TinaFace [299] further enhances detection capabilities by incorporating SoTA generic object detection algorithms. MTCNN and RetinaFace are often used for aligning face datasets. Recent advances in keypoint detection techniques, particularly using deep neural networks, have led to using keypoints to improve the performance of recognition tasks [220, 265]. For instance, [83] proposes a keypoint-based pooling mechanism and shows promising results 93 in skeleton-based action recognition and spatio-temporal action localization tasks. Albeit its benefit, many models including ViTs do not have pooling mechanisms. KP-RPE is the first attempt at incorporating keypoints into the RPE which can be easily inserted into ViT models. 5.3 Proposed Method 5.3.1 Background Self-Attention Self-attention is a crucial component of transformers [236], which is a popular choice for a wide range of NLP tasks. ViT [61] applies the same self-attention mechanism to images, treating images as sequences of non-overlapping patches. The self-attention mechanism in Transformers calculates attention weights based on the compatibility between a query and a set of keys. Given a set of input vectors, the Transformer computes query (Q), key (K), and value (V) matrices through linear transformations: Qi = xiWQ, Kj = xjWK, Vj = xjWV , (5.1) where xi is the i-th input vector, and WQ, WK, and WV are learnable weight matrices. The self-attention mechanism computes attention weights as the dot product between the query and key vectors, followed by a softmax normalization: eij = QiKT j√ dk , aij = exp(eij) j=1 exp(eij) (cid:80)N , (5.2) where dk is the dimension of the key vectors. Finally, the output matrix Y is computed as the product of the attention weight matrix and the value matrix: Yi = (cid:80)N j=1 aijVj. Absolute Position Encoding Transformers are inherently order invariant, as their self-attention mechanism does not consider input token positions. To address this, absolute position encoding is introduced [74, 236], which adds fixed, learnable positional embeddings to input tokens: x′ i = xi + PE(i), (5.3) where x′ i is the updated input token with positional information, xi is the original input token, and PE(i) is the positional encoding for the i-th position. These embeddings, generated using 94 sinusoidal functions or learned directly, enable the model to capture the absolute positions of elements. Relative Position Encoding (RPE) RPE, introduced by Shaw et al. [205] and refined by Dai et al. [50] and Huang et al. [105], encodes relative position information, essential for tasks focusing on input element relationships. Unlike absolute position encoding, RPE considers query-key interactions based on sequence-relative distances. The modified self-attention calculation for RPE is: e′ ij = (Qi + RQ ij)(Kj + RK √ ij )T dk , Yi = n (cid:88) j=1 aij(Vj + RV ij). (5.4) Here, RQ ij , RK ij , and RV ij are relative position encoding between the i-th query and j-th key with shape Rdz . 
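As a concrete illustration of Eq. (5.4), the sketch below implements single-head self-attention with Shaw-style query/key relative encodings for a 1D token sequence; the clipping distance, the 1D distance function, and the omission of the value-side term R^V are simplifications for brevity rather than the exact design used in this chapter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShawRPEAttention(nn.Module):
    """Single-head self-attention with Shaw-style relative position encoding
    on the query/key side, using a learnable table indexed by clipped distance."""
    def __init__(self, dim, max_dist=8):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.max_dist = max_dist
        # One learnable embedding per clipped relative distance in [-max_dist, max_dist].
        self.rel_q = nn.Embedding(2 * max_dist + 1, dim)
        self.rel_k = nn.Embedding(2 * max_dist + 1, dim)

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, D = x.shape
        Q, K, V = self.q(x), self.k(x), self.v(x)
        pos = torch.arange(N, device=x.device)
        d = (pos[:, None] - pos[None, :]).clamp(-self.max_dist, self.max_dist) + self.max_dist
        RQ, RK = self.rel_q(d), self.rel_k(d)              # (N, N, dim) table lookups R[d(i, j)]
        # e_ij = (Q_i + RQ_ij) . (K_j + RK_ij) / sqrt(d_k), expanded into four terms.
        logits = (Q @ K.transpose(-1, -2)
                  + torch.einsum("bid,ijd->bij", Q, RK)
                  + torch.einsum("bjd,ijd->bij", K, RQ)
                  + (RQ * RK).sum(-1)) / D ** 0.5
        attn = F.softmax(logits, dim=-1)
        return attn @ V
```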
Each R is a learnable matrix of RK×dz , where Ri,j corresponds to the relative position encoding for distance d(i, j) = k and K is the maximum possible value of d(i, j). To obtain relative position encoding, we index the R matrix using the computed distance R[d(i, j)]. Common choices for d are quantized Euclidean distance, separate x, y cross distance [189]. [256] uses a quantized x, y product distance, which encodes direction information. Note, query location i is a 2D point (ix, iy). Fig. 5.3 a) and b) illustrate the distance between i and all possible j with different distance functions. For KP-RPE, we modify [256] and allow the RPE to be keypoint dependent. 5.3.2 Keypoint Relative Position Encoding Building upon the general formulation of [256], we begin with the following RPE formula- tion: e′ ij = QiKjT + Bij √ dk . (5.5) Here, Bij is a scalar that adjusts the attention matrix based on the query and key indices i, j. Assuming a set of keypoints P ∈ RNL×2 is available for each x, our goal is to make Bij dependent on P. For face recognition, P is the five facial landmarks (two eyes, nose, mouth tips). For gait recognition, P is 17 points from the joint locations of skeleton predictions. For the MNIST toy example, P is five keypoints from the four corners and the center of the 95 Figure 5.3 Depiction of key-query combinations in an image, given a query location i = (7, 7) (⋆). Distinct colors represent varying attention offset values in RPE based on the distance between i and j. We are showing Bi=(7,7),j for all j ∈ (14 × 14). a) The distance function is a quantized Euclidean distance. b) Product distance proposed in iRPE accounts for direction. c) We adopt b) and allow Bi,j to vary based on keypoint locations (•). minimum cover box of a foreground image. As such P can be defined for objects with shared topology. The novelty of KP-RPE lies in the design of Bij. Since Bij = W[d(i, j)] ∈ R1, (5.6) comprises of a learnable table W and a distance function d(i, j), we can make W or d(i, j) depend on the keypoints. At a first glance, d(i, j, P), conditioning the distance on P seems plausible. However, we find that it leads to inefficiencies, as distance caching, which is precomputing d(i, j) for a given input size, is only feasible when d(i, j) is independent of the input. Therefore, we propose an alternative where the bias matrix W, is a function of P: Bij = f (P)[d(i, j)]. (5.7) 96 a) iRPE: Euclidean Distanceb) iRPE: Product Distancec) KP-RPE: Product Distance (Two different keypoints) Figure 5.4 a) Illustration of KP-RPE. First a mesh grid M and an image-specific keypoints P are generated. Then the broadcasted difference D is calculated, and we linearly map D to f (P). Finally for a given i, j, we can find the Bij = f (P)[i, d(i, j)]), which is used to adjust the attention map in self-attention. b) Backbone contains multiple transformer blocks followed by an MLP for classification. KP-RPE is used where multi-head attention modules exist. KP-RPE is efficient as f (P) is computed once. We propose three variants of f (P) building up from the simplest solution. Absolute f (P) Let P ∈ RNL×2 be the normalized keypoints between 0 and 1. First, the simplest way to model the indexing table is to linearly map P to the desired shape. f (P) = P′WL where P′ ∈ R1×(2NL) is reshaped keypoints P and WL ∈ R(2NL)×K is a learnable matrix. K is the maximum distance value in d(i, j). For each distance between i and j, we learn a keypoint adaptive offset value. 
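A minimal sketch of this simplest (absolute) variant, assuming normalized keypoints and a precomputed quantized distance index d(i, j); the module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class AbsoluteKPBias(nn.Module):
    """Simplest KP-RPE table: f(P) = reshape(P) @ W_L, then B_ij = f(P)[d(i, j)]."""
    def __init__(self, num_keypoints: int, num_buckets: int):
        super().__init__()
        # W_L in R^{(2*N_L) x K}: maps flattened keypoints to one offset per distance bucket.
        self.W_L = nn.Linear(2 * num_keypoints, num_buckets, bias=False)

    def forward(self, keypoints, dist_idx):
        # keypoints: (B, N_L, 2) normalized to [0, 1]
        # dist_idx:  (N, N) long tensor, the precomputed quantized distance d(i, j)
        table = self.W_L(keypoints.flatten(1))      # f(P): (B, K), one row per image
        bias = table[:, dist_idx]                   # (B, N, N): B_ij = f(P)[d(i, j)]
        return bias                                 # added to Q K^T before 1/sqrt(d_k) scaling (Eq. 5.5)
```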
However, this f (P) only works with the absolute position information of P and the relative distance between i and j. It is missing the relative distance between P and (i, j). Relative f (P) To improve, f (P) can be adjusted to work with the position of keys and queries relative to the keypoints. In other words, so that the query-key relationship in Bij depends on the query-landmark relationship. To achieve this, we generate a mesh grid M ∈ RN ×2 of patch locations containing all possible combinations of ix and iy. N represents the number of patches. We then compute the element-wise difference between the normalized 97 Keypoints 𝐏(5×2)𝑖!𝑖"Mesh Grid 𝐌2D 𝒙,𝒚 Grid(N×5×2)Broadcasted Difference…(N×K)𝐖𝐋𝑑=0𝑑=1𝑑=2𝑑=𝐾𝐁(,*=𝑓𝐏[i,di,j]𝑓𝐏b) Model OverviewAdd & NormFeed ForwardAdd & NormMulti-headSelf-AttentionN-layer Transformer BlocksKP-RPEInput Embedding𝑓(𝐋)𝑫𝑑: distance between 𝑖,𝑗𝑖=0𝑖=1𝑖=𝑁𝑖: attention query location𝑗: attention key location(N×2)𝑖=0𝑖=1a) KP-RPE detailed diagram𝐁(,* is keypoint dependent. grid and keypoints P to obtain a grid of i, j relative to the keypoints: D = Expand(M, dim = 1) − Expand(P, dim = 0), (5.8) where D is the broadcasted tensor difference of shape RN×NL×2 . Finally, we reshape D and linearly project it with WL. Specifically, D′ = Reshape(D) ∈ RN×(2NL) f (P) = D′WL ∈ RN ×K Bij = f (P)[i, d(i, j)] ∈ R1. (5.9) (5.10) (5.11) In other words, the offset value Bij is determined with respect to the positions of the keypoints and is unique for each query location. This approach allows for more expressive control of the query-key relationships with the keypoint locations. An illustration of this is shown in Fig. 5.4. Multihead Relative f (P) Lastly, we can further enhance our method by tailoring the query-keypoint relationship for each head in the attention mechanism. When there are H heads, we simply expand the dimension of WL to WL ∈ R(2NL)×HK. By reshaping f (P), we obtain f (P)h for each head. Furthermore, considering the multiple self-attentions in ViT which entails multiple RPEs, we can individualize f (P) for each self-attention by additionally increasing the dimension of WL to WL ∈ R(2NL)×NdHK, where Nd represents the transformer’s depth. Since f (P) is computed only once per forward pass, this modification introduces negligible computational overhead compared to other operations. In Sec. 5.4.2, we evaluate and compare the various KP-RPE versions (basic, relative keypoint, multiple relative keypoint), demonstrating the superior performance of the multiple relative keypoint approaches. 5.4 Face Recognition Experiments 5.4.1 Datasets and Implementation Details To validate the efficacy of KP-RPE, we train our model using aligned face training data and evaluate on three distinct types of datasets: 1) aligned face data, 2) intentionally 98 Method ViT ViT + iRPE ViT+KP-RPE Low Quality Aligned Dataset IJB-S [112] TinyFace [46] Rank-1 Rank-5 Rank-1 Rank-5 68.31 68.24 70.50 69.05 72.04 69.88 72.96 73.10 74.25 59.60 62.49 63.44 High Quality Aligned Dataset High Quality Unaligned Dataset CFPFP [202] Verification 96.11 97.01 96.60 CFPFP [202] Verification 72.81 77.91 93.56 IJB-C [169] TAR@0.01% 21.62 34.73 91.85 IJB-C [169] TAR@0.01% 92.22 92.72 94.20 Table 5.1 Ablation of RPE on ViT-small. Aligned is the standard protocol with raw face images (detector bounding box) aligned by RetinaFace [54] and resized to 112×112. Unaligend takes the raw face images and simply resizes it to 112×112. 
Aligned setting always shows better performances and Unaligned is for simulating alignment failure. Low Quality Aligned dataset may have alignment failures. Method Low Quality Aligned Dataset IJB-S [112] TinyFace [46] KP-RPE Absolute f (P) KP-RPE Relative f (P) KP-RPE MultiHead f (P) Rank-1 Rank-5 Rank-1 Rank-5 69.13 68.11 70.77 69.42 72.04 69.88 72.42 73.71 74.25 9.97 62.51 63.44 High Quality Aligned Dataset High Quality Unaligned Dataset CFPFP [202] Verification 96.51 96.74 96.60 CFPFP [202] Verification 68.09 89.70 93.56 IJB-C [169] TAR@0.01% 90.96 94.28 94.20 IJB-C [169] TAR@0.01% 14.91 85.22 91.85 Table 5.2 Ablation of KP-RPE with three different formulations of keypoint dependent RPE tables f (P). The sharp increase in Unaligned setting shows the robustness to unseen affine transform manifests with Relative f (P). Multihead f (P) further improves the performance. unaligned face data, and 3) low-quality face data containing misaligned images. For the evaluation, aligned face datasets include CFPFP [202], AgeDB [174], and IJB-C [169]. For unaligned face data, we intentionally use the raw CFPFP [202] and IJB-C [169] datasets without aligning them. Raw images, as provided by their respective creators, are equivalent to images cropped based on face detection bounding boxes. Lastly, we assess the model’s robustness on low-quality face datasets, specifically TinyFace [46] and IJB-S [112], which are prone to alignment failures. This comprehensive setup enables us to examine the effectiveness of our proposed method across diverse data conditions. The training datasets MS1MV2 [55] MS1MV3 [57] and WebFace4M [300] are released as aligned and resized to 112×112 by RetinaFace [54] whose backbone is ResNet50 model trained on WiderFace [269]. For keypoint detection in KP-RPE, we also use RetinaFace [54] but with lighter backbone MobileNetV2 for faster inference. Given the sensitivity of ViTs to hyperparameters, we report the exact settings for learning rate, weight decay, and other parameters in the later section. For ablation dataset, we take the MS1MV2 subset dataset as 99 used in [122]. Following the training conventions of [122, 230], we adopt RandAug [49], repeated aug- mentation [94], random resized crop, and blurring. We utilize the AdaFace [122] loss function to train all models. For ablation, we employ ViT-small, while for SoTA comparisons, we use ViT-base models. The AdamW [164] optimizer and Cosine Learning Rate scheduler [163, 254] are used. In WebFace4M trained models, we adopt PartialFC [10, 11] to reduce the classifier’s dimension. 5.4.2 Ablation Analysis Row 1 in Tab. 5.1 shows results on the baseline ViT. Row 2 and 3 show results on the baseline ViT with iRPE and our proposed KP-RPE. KP-RPE demonstrates a substantial performance improvement on unaligned and low-quality datasets, without compromising performance on aligned datasets. Last row highlights the difference between ViT and ViT+KP-RPE. Also, Fig. 5.5 shows the sensitivity to the affine transformation, i.e., how the performance changes when one interpolates the affine transformation from the face detection images to the aligned images in CFPFP dataset. Tab. 5.2 further investigates the effect of modifications to KP-RPE. By making KP-RPE dependent on the difference between the query and keypoints (row 2), we observe a significant improvement in unaligned dataset performance. Also, by allowing unique mapping for each head and module in ViT (row 3), we achieve a further improvement. 
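For reference, the relative (and optionally multi-head) f(P) of Eqs. (5.8)–(5.11) can be sketched as below: the mesh grid of patch centers is differenced against the keypoints, projected by W_L, and indexed per query location. The grid construction, bucket count, and gather-based lookup are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class RelativeKPBias(nn.Module):
    """Relative KP-RPE: D = grid - P (broadcast), f(P) = reshape(D) @ W_L,
    B_ij = f(P)[i, d(i, j)], with one table per head when num_heads > 1."""
    def __init__(self, grid_size, num_keypoints, num_buckets, num_heads=1):
        super().__init__()
        self.K, self.H = num_buckets, num_heads
        self.W_L = nn.Linear(2 * num_keypoints, num_heads * num_buckets, bias=False)
        ys, xs = torch.meshgrid(torch.linspace(0, 1, grid_size),
                                torch.linspace(0, 1, grid_size), indexing="ij")
        self.register_buffer("grid", torch.stack([xs, ys], -1).reshape(-1, 2))  # M: (N, 2)

    def forward(self, keypoints, dist_idx):
        # keypoints: (B, N_L, 2) in [0, 1];  dist_idx: (N, N) long, quantized d(i, j)
        B, N = keypoints.shape[0], self.grid.shape[0]
        D = self.grid[None, :, None, :] - keypoints[:, None, :, :]     # (B, N, N_L, 2)
        table = self.W_L(D.flatten(2)).view(B, N, self.H, self.K)      # f(P): (B, N, H, K)
        table = table.permute(0, 2, 1, 3)                              # (B, H, N, K)
        idx = dist_idx.expand(B, self.H, N, N)                         # same d(i, j) for every head
        return torch.gather(table, 3, idx)                             # B_ij: (B, H, N, N)
```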
In other words, more expressive KP-RPE is beneficial for learning complex RPE that depends on the keypoints of an image. Overall, the ablation study highlights the necessity of each component in KP-RPE and the effectiveness of KP-RPE in enhancing the robustness of face recognition models, particularly with unaligned and low-quality datasets. 5.4.3 Computation Analysis In this section, we analyze the computational efficiency of our proposed KP-RPE in terms of FLOPs, throughput, and the number of parameters. Tab. 5.3 shows that KP-RPE is highly efficient, with only a small increase in the computational cost (FLOPs) compared to 100 Figure 5.5 Plot of Verification Accuracy in CFPFP [202]. On the X-axis, we interpolate the affine transformation from raw data (Detection Image) to canonical alignment (Alignment Image). Note KP-RPE is robust to affine transformations, while all models have been trained on the aligned image dataset. GFLOP ∆ in GFLOP IResNet50 ViT Small ViT Small + iRPE ViT Small + KP-RPE ViT Small + KP-RPE (+ Ldmk) IResNet101 ViT Base ViT Base + iRPE ViT Base + KP-RPE ViT Base + KP-RPE (+ Ldmk) 12.62 17.42 18.13 17.44 17.58 24.19 24.83 26.25 24.90 25.04 - 1 1 +0.71 1 +0.02 1 +0.16 - 2 2 +1.42 2 +0.07 2 +0.21 Eval Throughput 1432.72 img/s 1303.15 imgs/s 832.12 imgs/s 1145.90 imgs/s 1085.22 imgs/s 773.12 imgs/s 644.10 imgs/s 337.32 imgs/s 502.57 imgs/s 489.37 imgs/s Train Throughput 337.93 img/s 333.17 img/s 186.55 img/s 302.70 img/s 302.70 img/s 189.74 img/s 162.94 img/s 79.40 img/s 136.15 img/s 136.15 img/s %∆ in Train Throughput - 1 1 -44.01% 1 -9.15% 1 -9.15% - 2 2 -51.27% 2 -16.44% 2 -16.44% # Param 43.59M 95.95M 96.07M 96.00M 96.49M 65.15M 114.87M 114.98M 115.08M 115.56M Table 5.3 Computation resource comparison. GFLOP refers to Giga Floating Operating per Second. We measure it as [193]. Throughput refers to the number of images processed per second during the train/eval iteration. the backbone: 0.02 GFLOP increase for ViT Small and 0.07 GFLOP increase for ViT Base (ViT vs ViT+KP-RPE). Notably, KP-RPE is considerably more efficient than iRPE, which incurs an increase of 0.71 GFLOP for ViT Small and 1.42 GFLOP for ViT Base. Considering training throughput, which factors in computation time during training (with backpropagation), KP-RPE’s efficiency is more pronounced. It only reduces throughput by 9.15% for ViT Small and 16.44% for ViT Base, as opposed to iRPE’s larger decrease. Also, we show the GFLOP and throughput with the landmark detection time included. Landmark detection time is negligible compared to the total feature extraction time. Also, our method introduces a negligible increase in the number of parameters: just 0.05M for ViT Small and 0.21M for ViT Base. 
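Throughput numbers of the kind reported in Tab. 5.3 can be reproduced approximately by timing forward passes with explicit GPU synchronization; the batch size, input resolution, and warm-up counts below are arbitrary choices for illustration.

```python
import time
import torch

@torch.no_grad()
def eval_throughput(model, batch_size=64, img_size=112, iters=50, device="cuda"):
    """Inference images/second and parameter count for a given backbone."""
    model.eval().to(device)
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)
    for _ in range(10):                       # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    imgs_per_sec = batch_size * iters / (time.time() - start)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{imgs_per_sec:.1f} imgs/s, {n_params / 1e6:.2f}M params")
    return imgs_per_sec
```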
Hence, incorporating KP-RPE into the model 101 Low Quality Dataset High Quality Dataset Method Backbone Train Data TinyFace [46] IJB-S [112] PFE [208]aaa ArcFace [55] URL [210] CurricularFace [102] AdaFace [55] AdaFace [55] AdaFace [122] AdaFace [122] ArcFace [55] AdaFace [122] AdaFace [122] AdaFace [122] AdaFace [122] AdaFace [122] AdaFace [122] CNN64 ResNet101 ResNet101 ResNet101 ResNet101 ResNet101 ViT MS1MV2 [55] MS1MV2 [55] MS1MV2 [55] MS1MV2 [55] MS1MV2 [55] MS1MV3 [57] MS1MV3 [57] ViT+KP-RPE MS1MV3 [57] WebFace4M [300] WebFace4M [300] WebFace4M [300] WebFace4M [300] ViT+KP-RPE WebFace4M [300] WebFace12M [300] ViT+KP-RPE WebFace12M [300] ResNet101 ResNet101 ViT ViT+iRPE ResNet101 Rank-1 Rank-5 Rank-1 Rank-5 58.33 64.42 65.78 68.68 70.53 72.67 71.64 73.25 74.31 75.29 77.09 77.14 78.20 77.04 77.46 - - 68.67 67.65 71.54 70.98 74.84 76.39 74.38 74.52 77.58 77.98 78.49 74.81 78.97 - - 63.89 63.68 68.21 67.81 72.05 73.50 71.11 72.02 74.81 74.92 75.80 72.42 76.18 50.16 57.35 59.79 62.43 65.26 67.12 65.95 67.62 69.26 70.42 71.90 71.93 72.78 71.46 72.94 AgeDB [174] CFPFP [202] Verification Accuracy - 98.28 - 98.32 98.05 98.17 97.87 97.98 97.93 97.90 97.48 97.15 97.67 98.00 98.07 - 98.27 98.64 98.37 98.49 99.03 99.06 99.11 99.06 99.17 98.94 99.01 99.01 99.24 99.30 IJB-C [169] TAR@FAR=0.01% - 96.03 96.60 96.10 96.89 97.09 97.10 97.16 96.63 97.39 97.14 97.01 97.13 97.66 97.82 Table 5.4 SoTA comparison on low-quality and high-quality datasets. ViT models are ViT-Base sized. achieves enhanced performance without a substantial rise in computational cost or model complexity. 5.4.4 Comparison with SoTA Methods In this section, we position ViT+KP-RPE, against SoTA face recognition methodologies with large-scale datasets and large models. We undertake a comprehensive evaluation, covering both high-quality and low-quality image datasets. The results, as shown in Tab.5.4, underscore the strengths of KP-RPE. Notably, the inclusion of KP-RPE does not impair the performance on high-quality datasets, a testament to its applicability to both low and high-quality datasets. This becomes particularly compelling when we observe the performance on low-quality datasets. Consistent with the findings of our ablation study, the introduction of KP-RPE leads to an appreciable improvement in these challenging scenarios. This supports our thesis that face recognition models with robust alignment capabilities can indeed enhance performance on low-quality datasets. In summary, our model with KP-RPE not only maintains competitive performance on high-quality datasets but also brings significant improvements on low-quality ones, marking it a valuable contribution to the field of face recognition. 102 5.4.5 Note on the Landmark Predictor KP-RPE in all experiments uses our own MobileNet [199] based RetinaFace [54] to predict landmarks for KP-RPE. We train MobileNet version for computation efficiency. However, the original landmark predictor used for aligning the test datasets is ResNet50-RetinaFace [54]. We also report the KP-RPE performance with the officially released ResNet50-RetinaFace. We report this to compare KP-RPE on the same ground with other models by using the same landmark used to pre-align the testset. The face recognition performance of KP-RPE+Official is similar to KP-RPE+Ours (75.86 vs 75.80 in TinyFace Rank1). Our MobileNet-RetinaFace is improved to perform similarly to ResNet50 in landmark prediction by applying additional tricks while training. Therefore, the face recognition performances are also similar. 
Unlike vanilla RetinaFace on face alignment, ours is fully differentiable during inference and named Differentiable Face Aligner. 5.4.6 Scalability on Larger Training Datasets We train the ViT+KP-RPE model on a larger WebFace12M [300] dataset to demonstrate the potential of KP-RPE in its scalability and applicability in real-world, data-rich scenarios. Tab.5.4’s last row shows that the performance continues to increase with WebFace12M dataset. Discussion Why are noisy keypoints more useful in KP-RPE than in simple alignment? The short answer is that not all predicted points are noisy in an image while alignment as a result of one or more noisy point impacts all pixels. 5.5 Gait Recognition Experiments KP-RPE is a generic method that can generalize beyond face recognition to any task with keypoints. We apply KP-RPE to gait recognition using body joints as the keypoints. Dataset. We train and evaluate on Gait3D [292], an in-the-wild gait video dataset. In our experiments, we use silhouettes and 2D keypoints preprocessed and released by the authors directly. Following SMPLGait [273, 292], we use rank-n accuracy (n = 1, 5, 10), mean Average Precision (mAP), and mean Inverse Negative Penalty (mINP) for evaluation. 103 Model Rank-1 Rank-5 mAP mINP GaitSet [40] MTSGait [291] DANet [167] GaitGCI [62] GaitBase [67] HSTL [242] DyGait [243] SwinGait-2D [66] + KP-RPE 36.7 48.7 48.0 50.3 64.6 61.3 66.3 67.1 68.2 58.3 67.1 69.7 68.5 81.5 76.3 80.8 83.7 84.4 30.01 37.63 — 39.5 55.31 55.48 56.40 58.76 60.81 17.30 21.92 — 24.3 31.63 34.77 37.30 34.36 36.19 Table 5.5 KP-RPE performance on Gait3D [292] compared with the baseline. KP-RPE boosts all metrics by a large margin. Implementation Details We implement SwinGait-2D [66] as the baseline in our experiments. SwinGait-2D is chosen over SwinGait-3D [66] because we focus on exploiting the geometric information in gait recognition. SwinTransformer [160] uses vanilla relative positional encoding for each windowed self-attention. To incorporate KP-RPE into the SwinTransformer, we modify the 2D grid M to be the size of the window as opposed to the image size. Following the default configuration of [292], we use an AdamW [164] optimizer with a learning rate 3 × 10−4 and weight decay 2 × 10−2, accompanied by an SGDR [163] scheduler. We train our models for 60,000 iterations, sampling 32 subjects and 4 sequences per subject in a batch. Results and Analyses In Tab. 5.5, we compare to SoTA approaches, including SwinGait- 2D [66], with and without KP-RPE. We can see that the KP-RPE shows a significant improvement over SwinGait-2D, with 1.1 % and 0.7 % improvement on rank-1 and -5 accuracies, respectively. mAP has improved by 2.05 % and mINP by 1.23 % of the baseline) compared to SwinGait-2D. We believe that a great portion of the improvement comes from KP-RPE exploiting the gait information contained in 2D skeletons. Gait skeletons contain identity- related information, such as body shape and walking posture. This demonstrates that KP-RPE is both effective and generalizable to gait recognition. 5.6 Training Details Training code will be released for reproducibility. Our experiments were conducted using the PyTorch deep learning framework. Detailed information pertaining to the training 104 parameters, configurations, and specifics can be referred to in Tab. 5.6. We employed the Vision Transformer (ViT) model architectures as implemented in the InsightFace GitHub repository, ensuring a well-established and tested model foundation. 
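A hedged sketch of how the optimizer and schedule in Tab. 5.6 could be instantiated in PyTorch is given below; the warm-up handling and epoch-to-step conversion are simplified, and the exact released training code may differ.

```python
import math
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=34,
                                  lr=1e-3, weight_decay=0.05, warmup_epochs=3):
    """AdamW with linear warm-up followed by cosine decay, roughly matching Tab. 5.6."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:                                    # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```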
When measuring the throughput of our KeyPoint Relative Position Embedding (KPRPE), we utilized an NVIDIA RTX3090 GPU. Ablation Experiments Large Scale Experiments Backbone LR Batch Size Epoch Momentum Weight Decay Scheduler Optimizer Warmup AdaFace Loss Margin AdaFace Loss h Augmentation ViT Small 0.001 512 34 ViT Large 0.0001 1024 36 0.9 0.05 Cosine AdamW 3 0.4 0.333 Flip, Brightness, Contrast, Scaling, Translation, RandAug [49](magnitude:14/31), Blur, Cutout, Rotation (20◦) PartialFC RepeatedAug Prob None 0.5 sampling rate 0.6 0.1 Table 5.6 Details for training face recognition models with or without KPRPE. 5.7 Supplementary Performance Analysis 5.7.1 Performance Across Various Loss Functions In our extensive evaluation, we have employed three popular loss functions: AdaFace [122], CosFace [240], and ArcFace [55], to train the Vision Transformer (ViT) in combination with our proposed KeyPoint Relative Position Embedding (KPRPE). As demonstrated by the results in Tab. 5.7 rows 3-6, our method exhibits consistent performance improvements on lower quality datasets across all three loss functions when compared to the standalone ViT. This signifies the versatility of KPRPE in synergizing with a variety of loss functions to enhance the robustness of face recognition models to less-than-optimal image quality. 105 Method Backbone Train Data AdaFace [122] WebFace4M [300] ViT AdaFace [122] ViT+KPRPE WebFace4M [300] ArcFace [55] ViT+KPRPE WebFace4M [300] CosFace [240] ViT+KPRPE WebFace4M [300] TinyFace [46] Low Quality Dataset IJB-S [112] Rank-1 Rank-5 Rank-1 Rank-5 Verification Accuracy 74.81 75.80 75.62 75.48 High Quality Dataset AgeDB CFPFP IJB-C 0.01% 97.14 97.13 97.21 96.98 77.09 78.20 78.62 77.67 71.90 72.78 73.04 72.22 98.94 99.01 99.06 98.94 77.58 78.49 78.57 78.30 97.48 97.67 97.57 97.45 Table 5.7 SoTA comparison on low-quality and high-quality datasets. IJB-C [253] reports TAR@FAR=0.01%. 5.7.2 Performance with Different Number of Keypoints We include the impact of the number of keypoints in KP-RPE. We initiated the analysis with 5 keypoints, the maximum available in RetinaFace. And gradually reduce the number of points. Number of Keypoints 5 4 3 2 No Keypoints (Vanilla ViT) TinyFace Rank1 TinyFace 5 AgeDB CFPFP 74.25 73.63 73.95 73.42 72.96 96.60 96.57 96.80 95.97 96.11 69.88 69.58 69.66 69.26 68.24 95.92 95.65 95.77 95.73 95.57 Table 5.8 Performance by changing the number of keypoints. For datasets characterized by lower image quality like TinyFace, the performance dimin- ishes as the number of keypoints reduce. But it does not diminish compared to not using the keypoints. It could be that the information about the scale and rotation of an image could still be captured by few points as 2 or 3. Interestingly, in high-resolution datasets, the trend is absent and the performance remains relatively consistent regardless of the number of keypoints used. More keypoints can be adopted with other landmark detectors but they are not trained with low quality images in WiderFace as the dataset only provides 5 points. 5.7.3 Sensitivty to Landmark Error in KPRPE To test the sensitivity of KPRPE to the landmark prediction error, we take the prediction of the landmark predictor and perturb it by the following equation, Lpert = L + αL. (5.12) 106 Figure 5.6 Verification accuracy measured in CFPFP dataset with added noise in landmark predictions. α is a parameter that changes the level of noise in the prediction. 
We vary α from 0 to 0.1, noting that α = 0.1 drives the NME to 0.12, far worse than the NME of 0.05 on WiderFace, which is a harder dataset. Therefore, α = 0.1 is an extreme scenario in which every input fails at a level exceeding the average failure level on WiderFace by two times. Note that as we add noise to the landmark prediction, the performance goes down, signaling that KPRPE is dependent on the landmark prediction. However, the performance drop within the range of realistic noise levels is modest (about 1.5%). Fig. 5.11 shows the experiment setting in a diagram.

5.7.4 Why are noisy keypoints more useful in KP-RPE than in simple alignment?

The short answer is that not all predicted points in an image are noisy, while alignment, as the result of one or more noisy points, impacts all pixels. For a more concrete example, in Fig. 5.7, we take images from WiderFace, which contains human-annotated ground-truth keypoints, and compare them with RetinaFace predictions. Fig. 5.7 (a) shows a well-aligned scenario. (b) and (c) show that when one or two landmarks (red color) deviate from the ground truth (GT), the resulting alignment changes dramatically. For KP-RPE, this is a less severe problem because individual landmarks affect the RPE independently in the landmark space (0-1). On the other hand, when an affine transformation is regressed to align the image to a canonical space, individual landmark errors become correlated and amplified.

Figure 5.7 Keypoints in aligned images. Blue: ground-truth keypoints. Yellow/Red: RetinaFace keypoints with less/more than 5% error from GT. The overlay of (b, c) shows how a small deviation in one or two points can lead to a significant scale and translation change.

5.8 Training Landmark Detector (MobileNet-RetinaFace)

RetinaFace [54], a single-stage face detector, is built upon the Feature Pyramid Network [144] (FPN) and the Single Shot MultiBox Detector [153]. It is originally designed for detecting multiple faces using anchor boxes at each location in an image. However, in our case, we assume the presence of one face, and we leverage this constraint to improve the landmark detection performance and efficiency of the model. This assumption is valid if a face detector crops out a face, which is standard practice in face recognition. With this assumption, we can modify RetinaFace to predict more accurate landmarks when the input image is cropped. We adopt a few training techniques and a faster aggregation technique and name the result the Differentiable Face Aligner (DFA). The name reflects that, with the modifications we propose, the face alignment network is differentiable (unlike RetinaFace, because of NMS and CPU-based cropping), making it potentially useful for other applications in computer vision.

Training Data Adaptation We adapt the training data WiderFace [269] for our Differentiable Face Aligner (DFA) by cropping out facial images using the ground-truth bounding boxes, and we resize the input to 160×160. This change in data size and distribution allows the model to specialize in localizing landmarks for single faces, ultimately improving its performance.
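This data adaptation step can be sketched as follows: each face is cropped with its ground-truth box expanded by a padding ratio and resized to 160×160. The padding convention and interpolation mode are assumptions for illustration.

```python
import cv2
import numpy as np

def crop_face(image: np.ndarray, box, pad_ratio: float = 0.1, out_size: int = 160):
    """Crop a face with `pad_ratio` extra context around the ground-truth box
    (x1, y1, x2, y2) and resize it to out_size x out_size."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    pad_x, pad_y = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
    x1, y1 = max(0, int(x1 - pad_x)), max(0, int(y1 - pad_y))
    x2, y2 = min(w, int(x2 + pad_x)), min(h, int(y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_LINEAR)
```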
Aggregation Network The motivation for the aggregation network is to eliminate the Non-Maximum Suppression (NMS) and output a single landmark prediction from multiple anchor boxes. We design a network that takes in the output of FPN and aggregates it to a single prediction. The architecture of the aggregation network consists of MixerMLP [229]. Specifically, let X be an image, and let Fbbox, Fscore and Fldmk be the set of the output of FPN followed by the corresponding multitask head (bounding box, face score and landmark prediction) for each anchor box. For example, when an image is sized 160 × 160, there are 1050 anchor boxes. Based on these outputs, we predict the weights for fusing the outputs. Specifically, O = Concat(Fbbox, Fscore, Fldmk) ∈ R1050×(Cbbox+Cscore+Cldmk) = R1050×(4+1+10), w = Softmax(MixerMLP(O)) ∈ R1050, L = wT Fldmk. (5.13) (5.14) (5.15) The final output L is the weighted average of the landmarks in all anchor boxes. The aggregation network is trained end to end with the rest of the detection model with the smooth L2 Loss [194] between L and the ground truth landmark LGT . By incorporating these modifications, we show in Sec. 5.8.1 that our DFA achieves superior landmark detection performance compared to the RetinaFace while using a more efficient backbone architecture. Training Details For the training of our Differentiable Face Aligner (DFA), we incorporated specific training settings to optimize the performance. We used an input image size of 160 pixels, with a batch size of 320. Training was conducted for 750 epochs, ensuring that the model had adequate exposure to learn and generalize from the dataset. Training was 109 Figure 5.8 Cumulative Error Distribution curve and the corresponding NME for models evaluated on WiderFace [269] validation set. performed using the WiderFace training dataset, with images cropped using the ground truth bounding boxes and a padding of 0.1. 5.8.1 Landmark Detection Performance In this section, we evaluate the performance of our proposed Differentiable Face Aligner (DFA) in terms of landmark detection. We use the Normalized Mean Error (NME) as the metric and evaluate on WiderFace validation set [269] as in RetinaFace [54]. Fig. 5.8 shows an improvement in NME when using DFA compared to the baseline RetinaFace. The RetinaFace with MobileNet backbone achieves an NME of 0.077, while the one with ResNet50 achieves 0.0553. In contrast, our DFA achieves 0.0518, demonstrating its superiority in landmark detection. Moreover, the DFA model benefits from the introduction of the aggregation network, which eliminates the need for the NMS stage. The improvement in NME due to the aggregation network is from 0.0527 to 0.0518. This not only simplifies the overall pipeline but also contributes to the enhanced performance of the DFA model in the landmark detection. With a straightforward modification in the training data and an aggregation stage that assumes a single-face image, a lightweight backbone with better performance can be trained. 110 5.9 IJB-S Evaluation Method IJB-S [112] is a video-based dataset that defines probe and gallery templates according to its predefined video clip of arbitrary length. This naturally implies one must perform feature aggregation (fusion) when frame-level features are predicted. Since the backbone predicts a unit-norm feature vector for one image, the simplest method would be to average all the features within the template. 
The most popular method is to utilize norm-weighted average, where the features are averaged before normalization [122]. This only works if the norm is a good proxy for the prediction quality. However, in certain cases, depending on various factors such as dataset, learning rate, backbone, optimizer, etc., that go into the training of a model, this may not be the case. Also, in our experience, ViT+KPRPE was not the case. Therefore, we propose a proxy that could easily replace the norm with another quantity that can be found within a model. Since DFA predicts the landmarks L and a face score Fscore, we derive a fusion score using those quantities. First, let us review the conventional norm-weighted feature fusion equation for a set of N number of feature vectors {fi}N where fi = ||fi||2 · ¯fi decomposes fi into the norm and the unit length feature. fnorm weighted = (cid:80)N i=1 ||fi||2 · ¯fi N . (5.16) In the equation above, fi represents the i-th frame-level feature, and N is the total number of frames. Now, for KPRPE, we propose a new feature fusion method, incorporating the face score and the Euclidean distance between predicted landmarks L and the canonical landmark ˆL, which is a known set of landmarks that the training images are aligned to. This distance score, d, is computed as: di = h − min(|Li − ˆL|2, h) h , (5.17) where h = 0.2 is a fixed constant that allows the score to be bounded between 0 and 1. The face score Fi score represents the quality of the image, and di assigns more weight to well-aligned images. Proposed feature fusion equation, hence, becomes: fKPRPE = (cid:80)N i=1(di · Fi N score) · ¯fi . (5.18) 111 This method allows for the aggregation of features even when the feature norm does not serve as a good proxy for the quality of an image. In computing IJB-S result for ViT+KPRPE, we use this fusion method. However, for a fair comparison in IJB-S, it is important to apply this fusion method to previous methods. Therefore, we include the breakdown of with and without landmark score based fusion. For single image based datasets such as TinyFace, AgeDB or CFPFP, feature fusion is not needed. Training Data: MS1MV3 Feature Fusion Method ViT Base+IRPE ViT Base+IRPE ViT Base+KPRPE ViT Base+KPRPE Average Landmark based Average Landmark based Training Data: WebFace4M Feature Fusion Method ViT Large+IRPE ViT Large+IRPE ViT Large+KPRPE ViT Large+KPRPE Average Landmark based Average Landmark based IJBS Rank1 62.49 63.81 63.44 64.68 IJBS Rank1 71.32 71.93 65.95 72.78 IJBS Rank5 TinyFace Rank1 70.50 71.30 72.04 72.33 69.05 69.05 69.88 69.88 IJBS Rank5 TinyFace Rank1 76.22 77.14 71.64 78.20 74.92 74.92 75.80 75.80 Table 5.9 Breakdown of with and without fusion method in various backbones and datasets. The performance of ViT+KP-RPE consistently surpasses ViT+iRPE, both in scenarios using Averaging or Landmark-based fusion. This affirms the efficacy of KP-RPE in enhancing performance, even in single image contexts like TinyFace. Importantly, while the keypoint detection step is integral to KP-RPE, it isn’t incorporated within iRPE, making a direct comparison based on this score less straightforward for iRPE. Interestingly, average fusion does not synergize well with ViT+KP-RPE. Contrary to typical observations where feature magnitude positively correlates with image quality [122], with ViT+KP-RPE, a higher feature magnitude actually suggests reduced image quality. It remains unclear why this inverse relation emerges in our model. 
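For reference, the two fusion rules compared above (Eqs. (5.16)–(5.18), with h = 0.2) can be sketched as below, assuming per-frame features, face scores, and predicted/canonical landmarks are already available; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def norm_weighted_fusion(feats):
    """Eq. (5.16): averaging raw frame features equals the norm-weighted
    average of their unit-length counterparts."""
    return feats.mean(dim=0)                                        # (D,)

def landmark_score_fusion(feats, face_scores, landmarks, canonical, h=0.2):
    """Eqs. (5.17)-(5.18): weight unit-length features by the detector's face
    score and an alignment-quality score derived from landmark distance."""
    unit_feats = F.normalize(feats, dim=1)                          # f_bar_i, (N, D)
    lmk_err = (landmarks - canonical).flatten(1).norm(dim=1)        # |L_i - L_hat|_2, (N,)
    d = (h - lmk_err.clamp(max=h)) / h                              # Eq. (5.17), in [0, 1]
    w = d * face_scores                                             # (N,)
    fused = (w[:, None] * unit_feats).sum(dim=0) / feats.shape[0]   # Eq. (5.18)
    return F.normalize(fused, dim=0)                                # re-normalized for cosine matching
```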
Through empirical observations, the relationship between feature magnitude and image quality appears contingent on the chosen training dataset and model architecture. For instance, models based on the ResNet architecture consistently exhibit a positive correlation between feature magnitude and image quality.

5.10 Alignment Visualizations

Recall that TinyFace [46] and IJB-S [112] are prone to alignment failures. In Fig. 5.9, we show some success and failure cases in alignment. These images are taken from the released aligned datasets themselves.

Figure 5.9 Actual examples of aligned and mis-aligned images from the TinyFace [46] (rows 1, 3) and IJB-S [112] (rows 2, 4) datasets. These are shown as processed and used by [122]. Lines are placed on the eyes as a visual guide for alignment.

5.11 Comparison with SoTA Off-the-Shelf Landmark Detector

We evaluate the off-the-shelf landmark detector SLPT [261] (CVPR 2022), which delivers strong performance on the high-quality WFLW [258] dataset. However, its performance dips significantly on the WiderFace dataset, populated with lower-quality images, as demonstrated in Tab. 5.10. This evaluation is not aimed at drawing a direct comparison between SLPT and DFA, as DFA is trained specifically on WiderFace. Instead, it serves to underline the performance variations of landmark detectors when trained on diverse datasets, stressing the importance of training dataset selection. Additionally, DFA boasts an order-of-magnitude faster speed than SLPT. Since SLPT predicts 98 landmarks compared to 5 landmarks in DFA, we convert the SLPT landmarks by selecting the indices that represent the left eye, right eye, nose, and left and right mouth tips. An example is shown in Fig. 5.10.

Model                Train Data        NME      FLOP         Params
DFA MobileNet        WiderFace [269]   0.0518   0.14 GFLOP    0.49M
SLPT [261] 6 Layer   WFLW [258]        0.1104   8.40 GFLOP   13.19M

Table 5.10 Comparison of DFA to a SoTA landmark detector. Note that NME is evaluated on the WiderFace validation set. DFA is trained on the WiderFace training set; SLPT is trained on WFLW. A direct NME comparison is not fair, as the training datasets differ.

Figure 5.10 For converting the 98-point landmarks from the SLPT output, we choose indices 96, 97, 57, 76, 82.

5.12 Pipeline Detail

In this section, we elaborate on the inference scenarios involved in the evaluation pipelines. A face recognition pipeline can be simplified to the following diagram. For a given raw image (a), the face detector crops out an image containing a face region (b). Then a conventional alignment algorithm (MTCNN, RetinaFace, DFA) predicts the landmarks (c) from (b). A least-squares minimization is used to align (b) into the aligned image (d) using the keypoints (c) and a reference landmark. This reference landmark is arbitrarily chosen, but the FR community usually adopts one popular setting. When one trains or evaluates face recognition models, aligned images (d) are used most of the time, highlighted by the blue path. In our main paper, Tables 1, 2, and 4, the aligned datasets and low-quality datasets are evaluated this way. The unaligned datasets in Tables 1 and 2 refer to the orange path. Whenever KPRPE is used, the keypoints are predicted from the inputs (b) or (d), depending on the path.

Figure 5.11 An illustration of the face recognition pipeline from the raw image (a) to the aligned image (d).

5.13 KPRPE Visualization

We show the learned attention offset values in KPRPE.
The red star denotes the query location and the blue circles represent the predicted landmarks. We pick head index 0 and plot Transformer depths 0, 1, 3, 5, and 7. Fig. 5.12 shows different patterns of learned offsets depending on depth and query location. Note that higher values are denoted by a stronger blue color. Some attention offsets are 1) far from the query location, 2) horizontal patterns, etc., but there is an inherent bias toward attending to nearby pixels. In Fig. 5.13, we also show the same model applied to different images, and therefore different landmark patterns. The changes in attention are not as dramatic as the changes across different heads or depths. However, the changes observed in Fig. 5.13 account for the spatial variations in the image once they accumulate over all of the attention modules in the model.

Figure 5.12 KPRPE learned offset B_ij visualization for different Transformer depths (0, 1, 3, 5, 7).

Figure 5.13 Cross-image learned KPRPE visualization. We show depth 5 and head index 0 of the same model.

5.14 Conclusion
In this work, we introduce Keypoint-based Relative Position Encoding (KP-RPE), a method designed to enhance the robustness of recognition models to alignment errors. Our method uniquely establishes key-query relationships in self-attention based on their distance to the keypoints, improving performance across a variety of datasets, including those with low-quality or misaligned images. KP-RPE demonstrates superior efficiency in terms of computational cost, throughput and recognition performance, especially when affine transform robustness is beneficial. We believe that KP-RPE opens a new avenue in recognition research, paving the way for the development of more robust models.

Limitations While KP-RPE shows impressive face recognition capabilities, it does require keypoint supervision, which may not always be readily available and can constrain its application, particularly when the dataset is not comprised of images with a consistent topology. Future work should consider the self-discovery of keypoints to lessen this dependence, thereby boosting the model's flexibility.

Potential Societal Impacts Within the CV/ML community, we must strive to mitigate any negative societal impacts. This study uses the MS1MV* dataset, derived from the discontinued MS-Celeb, to allow a fair comparison with SoTA methods. However, we encourage a shift towards newer datasets, showcasing results using the recent WebFace4M dataset. Data collection ethics are paramount, often requiring IRB approval for human data collection. Most face recognition datasets likely lack IRB approval due to their collection methods. We support the community in gathering large, consent-based datasets or fully synthetic datasets [17, 124], enabling research without societal backlash.

CHAPTER 6
SAPIENSID: FOUNDATION MODEL FOR UNIFIED HUMAN RECOGNITION

Existing human recognition systems often rely on separate, specialized models for face and body analysis, limiting their effectiveness in real-world scenarios where pose, visibility, and context vary widely.
This paper introduces SapiensID, a unified model that bridges this gap, achieving robust performance across diverse settings. SapiensID introduces (i) Retina Patch (RP), a dynamic patch generation scheme that adapts to subject scale and ensures consistent tokenization of regions of interest, (ii) a Masked Recognition Model (MRM) that learns from variable token lengths, and (iii) Semantic Attention Head (SAH), a module that learns pose-invariant representations by pooling features around key body parts. To facilitate training, we introduce WebBody4M, a large-scale dataset capturing diverse poses and scale variations. Extensive experiments demonstrate that SapiensID achieves state-of-the-art results on various body ReID benchmarks, outperforming specialized models in both short-term and long-term scenarios while remaining competitive with dedicated face recognition systems. Furthermore, SapiensID establishes a strong baseline for the newly introduced challenge of Cross Pose-Scale ReID, demonstrating its ability to generalize to complex, real-world conditions. The dataset, code and models will be released.

6.1 Introduction
Human recognition has traditionally been approached through domain-specific models focused exclusively on either face [55, 102, 122-124, 128, 154, 239, 240] or body [80, 110, 140, 149, 151, 268] recognition (or ReID). Each of these modalities relies heavily on specific dataset alignments, where face recognition models are optimized for tightly cropped, aligned facial images [1, 54, 82, 300], and body recognition models are designed to process full-body images of standing individuals [212, 250, 268, 295]. Despite the advances in both face and body recognition, no single model has yet effectively managed to handle a diverse range of poses and visible areas simultaneously.

Figure 6.1 SapiensID is a human recognition model trained on a large-scale dataset of human images featuring varied poses and visible body parts. For the first time, a single model performs effectively across diverse face and body benchmarks [100, 212, 268, 297]. This marks a significant improvement over previous body recognition models, which were often limited to one specific camera setup or image alignment per model, with worse performance in in-the-wild scenarios. Additionally, we introduce a large-scale, cross-pose and cross-scale training and evaluation set designed to facilitate further research in this area. The name SapiensID pertains to the ability to recognize humans.

However, in real-world scenarios, human recognition often requires harnessing the full spectrum of available clues, integrating both face and body information. Typically, individual modality outputs are fused at the feature or score level [87, 147] to mitigate this issue. In other words, no single model can handle both face images and body images as robustly as the modality-specific models. A unified model would mark a significant advance in human recognition, freeing users from constraints on visible facial or standing-body views and allowing reliable identification across varied poses and scales of different body parts. As shown in Fig. 6.2, current research on body recognition relies heavily on in-domain datasets and fails to generalize effectively to other datasets.

Addressing this gap is important for several reasons. In real-world applications, human recognition systems should operate across a variety of poses (sitting vs. standing) and visible contextual areas (upper torso vs. whole body) [271].
Furthermore, a model capable of handling varied inputs simplifies model deployment and usage for downstream tasks by eliminating the need for preprocessing steps such as face alignment [54] or dependency on camera setups [212, 268].

Figure 6.2 Conventionally, face and body recognition were handled independently. Furthermore, body recognition models were trained on one specific dataset without the ability to generalize to other datasets. The SapiensID model, for the first time, generalizes across modalities and different body poses and camera settings.

However, addressing this problem is not trivial. First, it requires a large-scale labeled human image dataset that captures a wide range of poses and visibility variations. Secondly, even with such a dataset, the model must be capable of managing the substantial variability in scale and pose that human images naturally show. As in Fig. 6.1, close-up portraits show a large face, while full-body shots display it much smaller. Modality-specific models have eliminated the scale inconsistency problem with some form of pre-alignment stage. For instance, body recognition models assume a consistent camera setup [212, 268] and face recognition models assume the images are aligned with 5 facial landmarks to a canonical position [54, 285]. Such transformations of the input reduce irrelevant variability in recognizing a person, making training easier. However, models fail to generalize when the preprocessing step fails [125]. To this end, we propose SapiensID, one model capable of handling the complexities of human recognition in diverse settings. Our contributions are:

• Model Innovations: We introduce three major improvements over conventional specialized recognition models: 1. Retina Patch addresses scale variations often encountered in human images by dynamically allocating more patches to important regions. 2. Masked Recognition Model reduces the number of tokens, achieving an 8× speed up in ViT during training. 3. Semantic Attention Head addresses pose variations by learning to pool features around keypoints.
• Data Contribution: To aid the development and evaluation of SapiensID, we release WebBody4M (Fig. 6.1), a large-scale dataset specifically designed for comprehensive human recognition across different poses and scales.

Our approach is a paradigm shift in human recognition, laying the groundwork for research that bridges the gap between specialized models and holistic recognition systems.

6.2 Related Works
6.2.1 Face Recognition
Face Recognition (FR) matches query images to an enrolled identity database. State-of-the-art (SoTA) FR models are trained on large-scale datasets [55, 82, 300] with margin-based softmax losses [55, 102, 122, 154, 240]. FR performance is evaluated on a set of benchmarks, e.g. LFW [100], CFP-FP [202], CPLFW [296], AgeDB [174], CALFW [297], and IJB-B,C [169, 253]. They are designed to assess the model's robustness to factors such as pose variations and age differences. Models trained on large datasets, e.g. WebFace260M, achieve over 97% verification accuracy on these benchmarks [122]. FR in low-quality imagery is substantially harder, and TinyFace [46] and IJB-S [112] are popular benchmarks.
Face recognition is often accompanied by facial landmark prediction [31, 132, 224, 285] so that input faces are aligned and tightly cropped around the facial region. However, when alignment fails, FR models perform poorly [125]. Eliminating alignment would not only simplify the pipeline but also enhance robustness in conditions where alignment is prone to fail. We propose an alignment-free paradigm capable of handling any human image, with or without a visible face.

6.2.2 Body Recognition
Body recognition, a.k.a. Person Re-identification (ReID), seeks to identify individuals across different times, locations, or camera settings. Prior works [71, 72, 137, 138, 146, 241, 278, 284, 295] focus on short-term scenarios where subjects generally appear in the same attire. Removing this assumption has led to long-term, cloth-changing ReID [41, 80, 95, 110, 140, 238, 268, 280], on datasets such as PRCC [268], LTCC [212], CCDA [149] and CelebReID [103, 104]. All of these datasets are composed primarily of whole-body images, where the subjects are fully visible from head to toe, with poses generally limited to walking or standing. While this format has been valuable in the development of person ReID models for controlled environments, it lacks the scale and visibility variety often encountered in real-world applications. To address these limitations, we propose a novel model capable of handling diverse and complex poses and visible areas. Further, to facilitate the training and evaluation of these models, we introduce a new large-scale, labeled dataset that significantly broadens pose-scale diversity.

6.2.3 Patch Generation for Vision Transformers
In the Vision Transformer (ViT) [61], an image is divided into patches, with each transformed into a token via linear projection. This patch-based approach turns images into an unordered set of tokens for sequence-to-sequence modeling [236], processing images in a scalable and flexible way in downstream tasks. Typically, patches are created by dividing an image into a grid with a specific number of patches. Several works explore how the patchifying process helps ViT capture multi-scale objects in images [249]. For instance, [51] predefines patch counts without resizing the input, retaining the image's aspect ratio and scale. [22] randomizes patch sizes in training for generalization across image scales, enhancing efficiency while sometimes reducing accuracy. Importantly, the representation quality of specific regions, such as the face or hands, depends on the number of tokens allocated to those areas. A smaller face within a constant patch size, for example, generates fewer tokens and thus captures less detail than a larger face. To address this, we propose to maintain a consistent number of tokens for regions of interest while ensuring full, non-overlapping coverage across the image, in line with grid-based tokenization principles.

6.3 Proposed Method
A human recognition model is formulated as a metric learning task such that images of the same subject are closer in feature space than those of different subjects, satisfying

    d(f^i_A, f^j_A) < d(f^i_A, f^k_B),    (6.1)

where f^i_A and f^j_A denote the feature vectors of two different images i and j of the same subject A, while f^k_B represents the feature vector of an image of a different subject B. Notably, the subjects A and B are not observed during training.
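The constraint in Eq. 6.1 can be made concrete with a small sketch, assuming cosine distance on L2-normalized embeddings; the tensors and the 512-dimensional feature size are placeholders rather than the actual model outputs.

    import torch
    import torch.nn.functional as F

    def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """d(f, g) = 1 - cos(f, g), a common choice of d for open-set recognition."""
        return 1.0 - F.cosine_similarity(a, b, dim=-1)

    # Hypothetical embeddings: two images of subject A and one image of subject B.
    f_A_i, f_A_j, f_B_k = torch.randn(3, 512).unbind(0)

    # Eq. 6.1: a well-trained model should place the same-subject pair closer than the
    # cross-subject pair, even though neither A nor B was seen during training.
    satisfies_eq_6_1 = bool(cosine_distance(f_A_i, f_A_j) < cosine_distance(f_A_i, f_B_k))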
Following established research on margin-based techniques for enhancing intra-class compactness in the feature space [55, 122, 154, 172, 240], we utilize a margin-based softmax loss [122] to train our model on a labeled dataset. We collect a large-scale, web-collected human image training dataset, which will be discussed in Sec. 6.3.4. The key challenge that sets this apart from prior work on separate face [55, 154] or body [140, 268] recognition tasks is that the input image can vary widely in 1) scale and 2) body pose. To tackle these challenges, we propose a new architecture, which will be discussed in the following subsections.

6.3.1 Retina Patch (RP)
To address the issue of varying scale in human images, we propose a novel Retina Patch mechanism inspired by the human eye's ability to adapt focus dynamically to regions of interest (ROIs) within a scene. In natural images, subjects can appear in diverse poses and with varying visibility of the face and body, leading to substantial differences in scale across regions. For instance, in a full-body image, a face may be a small portion, whereas in a close-up, it dominates. To account for these variations, our Retina Patch dynamically assigns more patches to critical regions within the image.

Figure 6.3 Comparison between the standard grid patch scheme of Vision Transformers (ViT) and our Retina Patch. While maintaining the same or lower computational budget (number of tokens), Retina Patch dynamically allocates more patches to critical regions (e.g., face and upper torso) in an image (in the example shown, the grid patch uses 576 patches with 25 on the face, while Retina Patch uses 348 patches with 144 on the face). This allocation enhances the model's ability to capture fine-grained details in important regions and to handle varying scales more effectively than a fixed grid patch.

Assume we have an input image i and a set of image-dependent regions of interest, {ROI^i_r | r = 0, 1, ..., R}, each defined by a bounding box. There are R ROIs per image; details on how the ROIs are computed will be discussed later. We also let ROI^i_0 be the whole image. For each ROI^i_r, we set a specific number of patches m_r and an order z_r, both controlling how many patches can come from each ROI^i_r. To obtain patches, we may perform a grid patching operation on each ROI independently. However, this would naturally result in overlapping patches with redundant feature extraction. Our aim is to cover the whole image with patches without any overlap. To avoid redundancy, overlapping patches between regions with a lower order (e.g., order z = 1) and those with a higher order (e.g., order z = 2) are excluded from the patch set of the low-order regions. This selective inclusion process ensures that each patch belongs uniquely to the ROI with the highest priority, as indicated by the order. Specifically,

    P^i = ⋃_{r1=0}^{R} ( P^i_{ROI_{r1}} − ⋃_{r2=r1+1}^{R} P^i_{ROI_{r2}} ),    (6.2)

where P^i_{ROI_r} represents the set of patches for region ROI_r of image i, and r denotes the index of each ROI, ordered by their respective priorities for patch inclusion.
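The exclusion rule of Eq. 6.2 can be sketched as follows. This is an illustrative simplification under two assumptions: each ROI is gridded independently, and a lower-priority patch is dropped when its center falls inside any higher-priority ROI. In the actual method the ROI boxes are snapped to the patch grid (Sec. 6.5.4), so the overlap removal is exact rather than center-based; the boxes and counts below are placeholders.

    from dataclasses import dataclass
    from typing import List, Tuple

    Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

    @dataclass
    class Patch:
        roi_index: int
        box: Box

    def grid_patches(roi: Box, m: int, roi_index: int) -> List[Patch]:
        """Split one ROI into an m x m grid of patches."""
        x1, y1, x2, y2 = roi
        pw, ph = (x2 - x1) / m, (y2 - y1) / m
        return [Patch(roi_index, (x1 + c * pw, y1 + r * ph, x1 + (c + 1) * pw, y1 + (r + 1) * ph))
                for r in range(m) for c in range(m)]

    def covered_by(patch: Patch, roi: Box) -> bool:
        """Center-in test standing in for exact overlap removal."""
        px1, py1, px2, py2 = patch.box
        cx, cy = (px1 + px2) / 2, (py1 + py2) / 2
        x1, y1, x2, y2 = roi
        return x1 <= cx <= x2 and y1 <= cy <= y2

    def retina_patches(rois: List[Box], m_per_roi: List[int]) -> List[Patch]:
        """Eq. 6.2: keep a patch of ROI r1 only if no higher-priority ROI r2 > r1 covers it."""
        kept = []
        for r1, (roi, m) in enumerate(zip(rois, m_per_roi)):
            for p in grid_patches(roi, m, r1):
                if not any(covered_by(p, rois[r2]) for r2 in range(r1 + 1, len(rois))):
                    kept.append(p)
        return kept

    # Example ordering: whole image < upper torso < face (later ROIs have higher priority).
    rois = [(0, 0, 384, 384), (96, 48, 288, 240), (144, 64, 240, 160)]
    patches = retina_patches(rois, m_per_roi=[12, 12, 12])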
Figure 6.4 Illustration of Retina Patch and Position Encoding computation. Top: three different ROIs generate patches at various scales (e.g., full image, upper torso, face), together with the corresponding position encodings sampled from the same spatial locations as the patches, allowing ViT to infer spatial context and understand where each patch originated within the image. Bottom: patches and position embeddings created by Retina Patch.

This approach allows us to dynamically allocate more patches to critical regions while ensuring that the entire image is represented by patches without repetition. Also, the scale inconsistency is mitigated as long as the ROIs are semantically defined (e.g., face, upper torso). The number of patches within each ROI is kept consistent across images, ensuring that each patch covers a similar scale within its designated ROI. Fig. 6.3 uses an example to compare the vanilla grid patch of ViT with our proposed Retina Patch.

Computing ROI Retina Patch is a generic algorithm that can work for any class of images by designing ROIs for the particular domain. In this paper, for recognizing a subject from a human image, we set the ROIs to 3 parts: 1) whole image, 2) upper torso and 3) face. The upper torso and face ROIs are computed using the off-the-shelf body keypoint detector [34].

Tokenization The input to ViT's transformer block is a set of tokens or feature vectors. Since each patch's size depends on both the ROI size and the number of patches m_r, the size of each patch may not be the same across ROIs. We simply resize all patches to the size of the patches from the whole-image ROI^i_0. We then use a linear layer to map each patch to the desired dimension, as in ViT.

Position Embedding Since the Transformer operates on sets of tokens without inherent order, Position Embedding (PE) is crucial for informing ViT of the spatial origin of each patch within the original image. For the tokens of Retina Patch, we cannot use a traditional PE, as the patch's source location is dynamic. Thus, we propose a Region-Sampled PE. Let PE ∈ R^{C×H×W} be the fixed 2D sin-cosine position embedding [23, 45] for the whole image. Given a normalized region of interest ROI^i_r = (x^i_r, y^i_r, w^i_r, h^i_r) with values between 0 and 1, we define a sampling grid Grid_{ROI^i_r} over the region [x^i_r, x^i_r + w^i_r] and [y^i_r, y^i_r + h^i_r] within the position embedding PE. Let (h'_r, w'_r) be the target output shape for PE_{ROI^i_r}, such that h'_r · w'_r = m_r, the desired number of patches for ROI^i_r. The Region-Sampled PE, PE_{ROI^i_r}, is then obtained by bilinearly interpolating PE at the points in Grid_{ROI^i_r} to match the shape (h'_r, w'_r):

    PE_{ROI^i_r} = GridSample(PE, Grid_{ROI^i_r}, (h'_r, w'_r)).    (6.3)

By using region-specific position embeddings, Retina Patch enables the model to differentiate between patches from distinct areas of the image while preserving a spatial structure similar to that of the patches. An example is shown in Fig. 6.4.
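A minimal PyTorch sketch of the Region-Sampled PE in Eq. 6.3, assuming the ROI is given in normalized (x1, y1, x2, y2) coordinates and that the sampling locations are spread uniformly over the ROI; the embedding size and grid resolution are placeholders.

    import torch
    import torch.nn.functional as F

    def region_sampled_pe(pe: torch.Tensor, roi: tuple, out_hw: tuple) -> torch.Tensor:
        """Bilinearly sample the full-image 2D position embedding inside one ROI (Eq. 6.3).

        pe     : (C, H, W) fixed sin-cos position embedding of the whole image.
        roi    : normalized (x1, y1, x2, y2), values in [0, 1].
        out_hw : (h', w') patch grid of the ROI, e.g. (12, 12).
        """
        x1, y1, x2, y2 = roi
        h, w = out_hw
        ys = torch.linspace(y1, y2, h)
        xs = torch.linspace(x1, x2, w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1) * 2 - 1        # grid_sample expects (x, y) in [-1, 1]
        sampled = F.grid_sample(pe[None], grid[None], mode="bilinear", align_corners=False)
        return sampled[0]                                    # (C, h', w')

    pe = torch.randn(768, 24, 24)                            # hypothetical PE map
    pe_face = region_sampled_pe(pe, roi=(0.40, 0.20, 0.60, 0.45), out_hw=(12, 12))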
6.3.2 Masked Recognition Model (MRM)
For each image, Retina Patch results in different numbers of tokens because different ROIs create different areas of intersection. For example, the number of patches from ROI_0 in Fig. 6.4 is 12 × 12, but the upper torso ROI_1 subtracts 4 × 4 patches from ROI_0 to avoid overlap. This operation leads to a different number of tokens per image, which prevents us from training and testing with batched inputs. To address the token inconsistency, we propose the Masked Recognition Model (MRM), introducing two key techniques: (1) masking with attention scaling and (2) a variable masking rate.

Figure 6.5 Illustration of the Masked Recognition Backbone with the masking and attention scaling trick for batched input during training. In testing, we pad with mask tokens to make the lengths the same.

Masking with Attention Scaling During training, we select tokens to keep. Unlike MAE [85], which discards the masked tokens, we replace them with a learnable mask token. We do this because (i) the mask token will be used during testing for padding the input, and (ii) this allows the model to explicitly know how many tokens are masked. Yet, since all masked tokens share the same value, we can reduce computation by applying the Attention Scaling Trick. Specifically, although there are multiple masked tokens, we can achieve the same effect with a single mask token by adjusting its attention scores to reflect the total number of masked tokens. Let n_i be the total number of tokens for the i-th image, n_k be the number of tokens we keep, and n_{m,i} = n_i − n_k be the number of masked tokens. We modify the attention computation in the Transformer as:

    A = softmax( QK^⊤ / √d + δ ),    (6.4)

where Q ∈ R^{(n_k+1)×d} and K ∈ R^{(n_k+1)×d} are the query and key matrices over the kept tokens and one mask token, and d is the embedding dimension. We add a bias matrix δ ∈ R^{n×n} so that it is mathematically equivalent to repeating the mask token n_{m,i} times during attention computation:

    δ_ij = log n_{m,i}  if j is the mask token,  and 0 otherwise.    (6.5)

In summary, we reduce the number of tokens from n_i to (n_k + 1). Note that (n_k + 1) is fixed and not image dependent, but we adjust the attention to make it equivalent to using n_i tokens, where n_{m,i} tokens are replaced by learnable mask tokens. By applying the Attention Scaling Trick, we handle varying token counts in training. In practice, n_k is set to about 1/3 of n_i, masking 66% of tokens for the speed gain. During testing, we simply find the longest token length and pad the others with the mask token to batchify the inputs. An illustration is in Fig. 6.5.

Variable Masking Rate As we view masked training as a form of augmentation, we randomize n_k during training and adjust the batch size correspondingly. For each batch, let n̂_k be the sampled number of tokens to keep,

    n̂_k = n_k + (n_i − n_k) · e^{−λ·U(0,1)}.    (6.6)

λ is a scaling factor, and U(0,1) denotes a random uniform distribution between 0 and 1. In short, n̂_k is sampled from a distribution that peaks at n_k and exhibits an exponential decay in probability toward n_i. With a randomized token length n̂_k, we adjust the batch size B based on the relationship B ∝ 1/n_k^2, where increasing n_k requires decreasing B to maintain the same GPU memory and FLOPs. We also adjust the learning rate according to the effective batch size, L_adj = L_{n̂_k} × B_{n̂_k} / B_{n_k}, to maintain consistent gradient magnitudes per sample. The effect of (1) masking with attention scaling and (2) the variable masking rate is ablated in Tab. 6.5. While (1) and (2) are both helpful, the effect of (2) is more pronounced.
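The attention scaling trick of Eqs. 6.4-6.5 amounts to adding log n_{m,i} to the logit of the single mask token. Below is a self-contained sketch, assuming the mask token sits at the last index; the batch size, token count, and dimensions are illustrative.

    import math
    import torch

    def masked_attention_with_scaling(q, k, v, num_masked):
        """One mask token stands in for n_m identical masked tokens (Eqs. 6.4-6.5).

        q, k, v    : (B, n_k + 1, d) kept tokens plus a single mask token at the last index.
        num_masked : (B,) number of masked tokens n_m represented by that mask token.
        """
        d = q.shape[-1]
        logits = q @ k.transpose(-2, -1) / math.sqrt(d)      # (B, n_k+1, n_k+1)
        delta = torch.zeros_like(logits)
        # Bias only the mask-token column; the clamp guards against log(0) when nothing is masked.
        delta[..., -1] = torch.log(num_masked.clamp(min=1).float())[:, None]
        attn = torch.softmax(logits + delta, dim=-1)
        return attn @ v

    # Hypothetical batch of 2 images, 145 retained tokens (144 kept + 1 mask), dimension 64.
    q = k = v = torch.randn(2, 145, 64)
    out = masked_attention_with_scaling(q, k, v, num_masked=torch.tensor([287, 100]))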
Figure 6.6 Illustration of semantic pooling in the Semantic Attention Head. Keypoints (e.g., nose, feet) are used to grid-sample position embeddings (PE), forming queries that are repeated n times and added to a global offset bias B. This setup enables attention to slightly varied locations around each keypoint. The Value comes from the ViT backbone and the Key is the PE. The result is a learned pooling mechanism.

6.3.3 Semantic Attention Head (SAH)
In biometric recognition, the head module is key for converting the backbone's output feature map into a compact feature vector for recognition. Face recognition models flatten the feature map and apply linear layers [55, 122], while body recognition models use horizontal pooling [34, 272]. However, these approaches rely on input image alignment (an aligned face or a standing body), which fails when there are large pose variations. To tackle this, we introduce a Semantic Attention Head (SAH) that extracts semantic part features from key body parts, making the representation less sensitive to pose. Our method uses keypoints (e.g., nose, hip) for capturing semantic parts. But instead of sampling features only at keypoints, which may miss the surrounding context, SAH learns to pool features around each keypoint. We construct a semantic query Q^i_kp (e.g., nose) using 2D position embeddings (PE) from the backbone, sampled at keypoint locations:

    Q^i_kp = GridSample(PE, kp^i) + B,    (6.7)

where PE is the fixed 2D image position embedding and kp^i ∈ R^{n_k×2} is the image-specific set of predicted keypoints [34]. We duplicate the keypoints n times and add a shared bias B ∈ R^{n_k×C}. The purpose of B is to learn to offset the center of attention so that the model learns to pool from diverse locations around the keypoints. The Key in the attention is the fixed PE. The Value is the backbone's feature map. The attention with Q^i_kp captures the neighborhood of the backbone feature map around the keypoints:

    O^i_part = Attention(Q^i_kp, PE, backbone(X^i)).    (6.8)

O^i_part ∈ R^{B×k×C} contains semantic part features corresponding to the k keypoints. Finally, applying a multi-layer perceptron (MLP) to the flattened O^i_part produces a feature,

    f^i = MLP(flatten(O^i_part)).    (6.9)

By learning to pool features adaptively around each keypoint, this attention mechanism enables pose-invariant recognition that goes beyond conventional alignment-dependent methods. Fig. 6.6 illustrates the attention pooling.

Training with Mixed Datasets While SAH effectively handles pose variations, we hypothesize that the key cues for recognition differ between short-term and long-term training datasets. Clothing and hairstyle, for example, are useful in short-term datasets but less reliable in long-term ones due to possible appearance changes. To aid learning with mixed datasets that combine short-term and long-term data, we introduce one more measure during training: a learnable scale that controls the importance of individual part features in O^i_part for each dataset. It allows the model to emphasize the features that are most discriminative for each dataset. During testing, however, we use the average scale because we do not want to utilize knowledge about the test dataset a priori. Specifically, let W_t ∈ R^k be a weight for the t-th dataset. For each sample, we choose the weight and apply

    f^i = MLP(flatten(O^i_part · σ(W_t))),    (6.10)

where σ is the Sigmoid function, ensuring the weights are between 0 and 1, controlling the influence of each of the k semantic parts.
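A single-head sketch of the pooling in Eqs. 6.7-6.9, omitting the learned projections of Eq. 6.12 for brevity; the keypoint count, repeat factor, and feature sizes are illustrative assumptions rather than the trained configuration.

    import torch
    import torch.nn.functional as F

    def semantic_attention_head(feat_map, pe, keypoints, offset_bias, n_repeat=4):
        """Pool backbone features around keypoints (Eqs. 6.7-6.9, single head, no projections).

        feat_map    : (B, HW, C) backbone tokens, used as the Value.
        pe          : (C, H, W) fixed 2D position embedding, used as the Key.
        keypoints   : (B, K, 2) normalized (x, y) keypoints in [0, 1].
        offset_bias : (n_repeat*K, C) global offset bias B added to the repeated queries.
        """
        Bsz, K, _ = keypoints.shape
        C, H, W = pe.shape
        grid = keypoints[:, None] * 2 - 1                                    # (B, 1, K, 2)
        q = F.grid_sample(pe[None].expand(Bsz, -1, -1, -1), grid,
                          mode="bilinear", align_corners=False)              # (B, C, 1, K)
        q = q[:, :, 0].transpose(1, 2)                                       # (B, K, C)
        q = q.repeat(1, n_repeat, 1) + offset_bias[None]                     # Eq. 6.7
        key = pe.flatten(1).transpose(0, 1)[None].expand(Bsz, -1, -1)        # (B, HW, C)
        attn = torch.softmax(q @ key.transpose(1, 2) / C ** 0.5, dim=-1)
        o_part = attn @ feat_map                                             # Eq. 6.8: (B, n_repeat*K, C)
        return o_part.flatten(1)                                             # flattened for the MLP of Eq. 6.9

    feat = torch.randn(2, 24 * 24, 256)
    pe = torch.randn(256, 24, 24)
    kps = torch.rand(2, 19, 2)
    bias = torch.zeros(4 * 19, 256)
    pooled = semantic_attention_head(feat, pe, kps, bias)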
We observe that after training, short-term datasets 130 Method Arch Train Data CAL [80] CAL [80] CAL [80] CLIP3DReID [151] CLIP3DReID [151] SOLDIER [44] SOLDIER [44] HAP [281] HAP [281] HAP [281] HAP [281] HAP [281] SapiensID (Ours) LTCC PRCC LTCC+PRCC LTCC PRCC LU4M+Market1501 LU4M+MSMT17 LU4M+LTCC LU4M+PRCC LU4M+Market1501 LU4M+MSMT17 R50 R50 R50 R50 R50 Swin-Base Swin-Base ViT-Base ViT-Base ViT-Base ViT-Base ViT-Base WebBody4M (Ours) ViT-Base WebBody4M (Ours) Method Arch Train Data CAL [80] CAL [80] CAL [80] CLIP3DReID [151] CLIP3DReID [151] SOLDIER [44] SOLDIER [44] HAP [281] HAP [281] HAP [281] HAP [281] HAP [281] SapiensID (Ours) LTCC PRCC LTCC+PRCC LTCC PRCC LU4M+Market1501 LU4M+MSMT17 LU4M+LTCC LU4M+PRCC LU4M+Market1501 LU4M+MSMT17 R50 R50 R50 R50 R50 Swin-Base Swin-Base ViT-Base ViT-Base ViT-Base ViT-Base ViT-Base WebBody4M (Ours) ViT-Base WebBody4M (Ours) Avg 48.64 35.07 49.69 50.89 35.14 64.85 70.19 45.71 54.09 66.61 66.64 61.49 73.05 Avg 28.40 24.71 29.46 30.24 25.79 24.84 22.17 20.21 26.12 27.49 21.61 44.90 66.30 top1 75.63 74.48 74.83 77.28 71.73 40.27 32.73 44.16 49.15 54.74 37.81 89.00 92.57 mAP 95.64 99.76 99.01 96.43 99.84 99.53 98.71 86.44 98.38 98.45 96.50 98.26 98.79 mAP 40.84 6.19 38.12 45.15 6.19 36.28 36.74 29.02 29.36 35.97 32.07 25.88 34.56 top1 99.51 100.00 99.54 99.43 100.00 99.51 99.30 95.53 98.84 99.30 99.15 99.72 100.00 LTCC (General) PRCC (SC) [268] CCVID (General) top1 74.04 20.69 72.41 75.66 21.30 73.83 74.44 65.11 63.29 73.02 67.95 56.80 72.01 LTCC (CC) [212] top1 38.01 6.38 33.16 41.84 6.63 25.00 26.02 25.00 29.08 24.74 23.47 22.70 42.35 mAP 28.08 20.86 29.43 30.01 19.81 36.56 27.76 30.43 37.73 45.14 30.52 71.65 77.82 CCVID (CC) [80] top1 74.97 71.61 73.89 76.28 69.32 39.61 31.85 41.64 45.73 52.37 34.54 88.34 88.72 PRCC (CC) mAP top1 35.20 37.00 55.64 55.69 45.42 45.39 38.38 40.81 61.97 62.40 32.12 26.87 25.36 22.27 22.34 26.14 41.94 38.05 37.00 33.90 25.00 23.82 49.38 54.93 72.60 78.75 mAP 18.84 3.14 16.27 22.58 3.17 12.18 11.33 11.63 12.52 11.71 10.74 9.96 17.79 mAP 25.08 17.40 26.65 26.69 16.38 35.48 26.48 25.77 33.12 41.33 26.81 68.66 72.22 mAP 16.11 6.47 21.03 20.33 7.49 94.04 73.20 27.29 50.11 92.20 57.07 42.41 68.26 Market1501 top1 35.60 18.97 43.65 41.66 20.93 97.03 89.85 51.63 73.49 96.23 80.37 66.18 88.18 CCDA [149] mAP top1 9.67 3.91 8.61 2.85 9.14 3.74 10.18 4.31 8.89 3.17 16.48 8.62 15.54 8.79 11.18 4.56 13.40 5.13 16.02 8.30 13.33 6.27 41.49 28.80 69.08 61.84 MSMT17 [250] mAP top1 5.06 15.92 0.69 2.56 4.44 14.48 5.50 17.45 0.85 3.28 22.77 48.64 78.01 91.12 6.56 20.89 10.99 29.61 23.02 48.01 75.85 89.13 21.42 43.61 31.02 67.25 Celeb-ReID [103] top1 37.42 23.59 37.11 37.31 23.82 46.37 47.95 30.28 37.79 44.38 46.37 65.78 92.80 mAP 3.92 2.20 3.81 4.02 2.17 5.66 6.14 3.54 4.48 5.20 5.77 18.93 66.92 Table 6.1 Generalization comparison with SoTA ReID models on two settings. "Long-term" refers to clothing change (CC) protocol of LTCC, PRCC, and CCVID datasets, while "short-term" the same clothing (SC) protocol. For other datasets, the data capture characteristics define short or long-term conditions. SapiensID demonstrates superior generalization in both settings. Our WebBody4M dataset shows higher performance in long-term ReID, but not with the dataset alone, as shown in the comparison of HAP vs SapiensID with the same training set. The proposed Retina-Patch and Semantic Attention Head are essential for learning under large pose and scale variations. tend to focus on the clothing and long-term datasets focus on the upper torso. 
The weight is for learning discriminative parts during training, but we do not use dataset-specific weights in testing.

6.3.4 WebBody Dataset
To facilitate the training, we collect a large-scale, labeled human dataset from the web. Specifically, we gather 94 million images with 3.8 million celebrity names. Given the inherent noise in web-sourced name queries, we perform extensive label cleaning. First, we use YOLOv8 [111] to crop the dominant person in each image to a size of 384 × 384, adding padding to maintain the aspect ratio. We then extract facial features using RetinaFace [54] and KP-RPE [125]. Following the approach in [300], we apply DBSCAN [65] clustering to identify the most consistent group of images for each name. By assuming all images stem from a single name query, we relax the similarity threshold beyond conventional face recognition standards. We also exclude any images with face features matching those in validation sets [100, 174, 202, 296, 297]. This process yields a labeled dataset with 4.4 million images across 217,722 unique subjects. However, because the dataset is labeled based on facial similarity, it lacks images where the face is obscured (e.g., back-facing images). To address this, we incorporate additional body ReID training datasets [70, 80, 103, 212, 222, 264, 268, 295], which account for approximately 10% of the final dataset. After merging, the resulting dataset, named WebBody4M, comprises 4.9 million images and 263,920 subjects in total. WebBody4M is the largest labeled dataset to date with high pose and scale variation. The keypoint visibility distribution of different body parts shows a predominance of visible upper bodies, with visibility decreasing gradually down the body (around 17% visible ankles). An example of the WebBody4M dataset can be seen in Fig. 6.1. The dataset collection and label cleaning procedure is similar to that of the WebFace4M dataset [300]. We compare the face-cropped version of WebBody4M with WebFace4M and observe that an FR model trained on WebBody4M-FaceCrop is similar in performance to one trained on WebFace4M.

Separate from WebBody4M, we also prepare a test set called WebBody-Test to evaluate cross pose-scale ReID performance. It comprises 96,624 images of 4,000 gallery and probe subjects. Examples are shown in Fig. 6.2.

Method           | Arch      | Train Data | WebBody Testset Avg | top1  | mAP
CAL [80]         | R50       | PRCC       | 2.47                | 4.29  | 0.64
CAL [80]         | R50       | LTCC       | 3.79                | 6.57  | 1.02
SOLDIER [44]     | Swin-Base | Market1501 | 3.22                | 5.42  | 1.02
SOLDIER [44]     | Swin-Base | MSMT17     | 5.96                | 9.95  | 1.98
HAP [281]        | ViT-Base  | LTCC       | 1.74                | 2.89  | 0.58
HAP [281]        | ViT-Base  | PRCC       | 2.61                | 4.37  | 0.85
HAP [281]        | ViT-Base  | Market1501 | 4.31                | 7.22  | 1.39
HAP [281]        | ViT-Base  | MSMT17     | 4.87                | 8.22  | 1.52
HAP [281]        | ViT-Base  | WebBody4M  | 47.12               | 64.36 | 29.89
SapiensID (Ours) | ViT-Base  | WebBody4M  | 64.41               | 76.82 | 52.00

Table 6.2 ReID performance on the variable pose and scale setting (WebBody Testset).

6.4 Experiments
Implementation Details To train SapiensID on WebBody4M, we use the AdaFace [122] loss and ViT-Base with KP-RPE as the main backbone [125], following the convention of the face recognition training pipeline. We do not include additional losses such as Triplet Loss [200], since there is a sufficient number of subjects in the training set. The input image size is 384 × 384, with white padding if the aspect ratio is not 1. We use 3 ROIs (whole image, upper torso, and head) and the grid size per ROI is 12 × 12, leading to a maximum of 144 × 3 patches. With masked recognition training, we replace at most 66% of tokens with the mask token (Sec. 6.3.2), leading to a ∼9 times speed up in training.
The masking probability and batch size rule are discussed later. We use 7 H100 GPUs to train the whole model in 2 days, starting from scratch.

Whole Body ReID The task identifies individuals walking or standing in distant camera views, categorized into short-term or long-term scenarios based on the time gap between captures and the likelihood of clothing changes. Tab. 6.1 shows our results on the ReID benchmarks. A significant departure from prior works is the use of a single SapiensID model across all evaluation settings, whereas previous methods employ fine-tuned models for each evaluation dataset (one model per dataset). This distinction highlights SapiensID's potential for deployment in diverse, unseen, real-world environments. SapiensID achieves the highest average mAP of 73.05% across short-term ReID benchmarks. Furthermore, we attain SoTA results on all evaluated long-term ReID datasets. This strong performance underscores the value of the WebBody4M dataset in training a generalizable model. However, this achievement would not have been possible without our SapiensID architecture, which effectively handles variations in pose and visible body areas. A strong baseline (HAP [281]) trained on WebBody4M alone does not achieve comparable results, highlighting the importance of our architectural innovations in leveraging the dataset. SapiensID marks a significant advance by being the first single model capable of strong performance across short- and long-term ReID tasks.

Method              | Training Data        | OccludedReID top1 | OccludedReID mAP
KPR [214] + SOLDIER | LU4M + OccludedReID  | 84.80             | 82.60
SapiensID           | WebBody4M            | 87.30             | 75.57

Table 6.3 Performance in occluded ReID. SapiensID achieves a higher top-1 accuracy, while KPR [214] shows a higher mAP. SapiensID is trained without OccludedReID training data.

Method           | Train Data          | LFW [100] | CPLFW [296] | CFPFP [202] | CALFW [297] | AGEDB [174] | Face Avg | LTCC [212] | Market1501 [295] | Body Avg | Combined Avg
AdaFace-ViT [122] | WebBody4M-FaceCrop | 99.82     | 95.12       | 99.19       | 96.07       | 97.97       | 97.63    | 21.70      | 7.81             | 14.76    | 56.19
SapiensID (Ours)  | WebBody4M          | 99.82     | 94.85       | 98.74       | 95.78       | 97.33       | 97.31    | 72.01      | 88.18            | 80.10    | 89.80

Table 6.4 Performance in the cross-modality setting. Face recognition is evaluated on aligned face recognition datasets and body recognition is evaluated on short-term ReID datasets. LTCC and Market1501 report top-1 of the short-term setting.

Cross Pose-Scale ReID Real-world human recognition can present scenarios where subjects are captured across varying camera viewpoints and exhibit diverse poses, such as sitting, bending, or engaging in activities. For example, a security camera might capture a person standing upright, while a social media photo shows the same individual sitting in a cafe. This poses a challenge for conventional ReID systems. We refer to this setting as Cross Pose-Scale ReID. To evaluate this setting, we introduce the WebBody-Test dataset, specifically designed to encompass such pose and scale variations. Tab. 6.2 details the performance comparison on this dataset. Conventional ReID models struggle to generalize to this scenario due to the significant shift in visual appearance caused by pose and scale changes. SapiensID, with the highest performance, establishes a strong baseline for this research area. Since the task itself is challenging, there is still room for improvement. The WebBody dataset demonstrates the potential of SapiensID to address the complexities of Cross Pose-Scale ReID, while offering a valuable starting point for future research in this area.
Model                        | All   | Face  | Whole Body ReID Short | Whole Body ReID Long
(1) ViT                      | 59.54 | 90.63 | 56.17                 | 31.81
(2) ViT+RP                   | 66.35 | 92.93 | 59.16                 | 46.95
(3) ViT+SAH                  | 71.67 | 95.84 | 72.63                 | 46.55
(4) ViT+RP+SAH (SapiensID)   | 78.67 | 96.66 | 73.05                 | 66.30
(4) − Learned Mask           | 76.99 | 96.08 | 70.44                 | 64.46
(4) − Variable n_k           | 74.39 | 95.95 | 69.58                 | 57.64

Table 6.5 Ablation study of SapiensID. Face is the average accuracy of CPLFW, CFPFP, CALFW, and AGEDB. Short- and Long-Term use the average of the datasets in Tab. 6.1. Results show the necessity and strong complementarity of both RP and SAH in SapiensID.

Parts added       | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
0 − None          | 0.00         | 3.56        | 1.47         | 4.28
1 + Nose          | 25.77        | 5.78        | 27.21        | 21.04
2 + Eye           | 30.61        | 8.87        | 63.87        | 55.17
3 + Mouth         | 38.01        | 11.81       | 73.36        | 65.05
4 + Ear           | 39.80        | 14.05       | 77.65        | 70.45
5 + Shoulder      | 41.84        | 15.82       | 79.73        | 73.14
6 + Elbow         | 41.07        | 16.64       | 80.55        | 73.54
7 + Wrist         | 41.07        | 17.16       | 79.34        | 73.16
8 + Hip           | 40.56        | 17.50       | 79.99        | 73.38
9 + Knee          | 42.35        | 17.73       | 79.00        | 72.88
10 + Ankle (Full) | 42.35        | 17.79       | 78.75        | 72.60

Table 6.6 Impact of adding body parts on ReID. None means all features are zeroed out. Each row adds features to the previous row.

Occluded ReID Occlusions, whether due to obstacles in the scene or self-occlusion from the subject's pose, present a further challenge for robust human recognition. We evaluate SapiensID in occluded scenarios on the OccludedReID dataset [301], comparing with KPR [214], a SoTA method designed for occlusion handling. As shown in Tab. 6.3, SapiensID achieves a competitive performance of 87.30% top-1, demonstrating its strong ability to handle occlusions even without being explicitly trained on the OccludedReID dataset. This result further underscores the value of our architecture and training dataset in learning representations that are resilient to real-world challenges like occlusions.

Face Recognition We evaluate on traditional aligned face recognition benchmarks to assess the ability to handle FR tasks. Tab. 6.4 compares SapiensID with a SoTA FR model, AdaFace [122], both with a ViT-Base backbone. AdaFace is trained on faces aligned and cropped to 112 × 112 by [54]. AdaFace achieves a slightly higher average accuracy of 97.63% across five benchmarks. This marginal difference is expected, given AdaFace's training on tightly cropped, aligned faces. However, SapiensID's performance remains highly competitive, bridging the gap between specialized face recognition and general human recognition tasks. While AdaFace excels on FR datasets, its performance degrades when applied to ReID datasets, which contain images without a visible face region (e.g., the back of the head). AdaFace is evaluated by cropping faces using [54]. In contrast, SapiensID maintains strong performance across both modalities.

Ablation of Components Tab. 6.5 ablates SapiensID's key components: Retina Patch (RP) and Semantic Attention Head (SAH). Starting from a simple ViT backbone with AvgMax pooling [80] as a baseline, we progressively incorporate RP and SAH to analyze their individual and combined contributions. Performance is evaluated across face recognition and both short-term and long-term ReID. The results show that both RP and SAH are essential. We also show the importance of MRM. (4) − Learned Mask means using MAE [85] to simply drop tokens. (4) − Variable n_k means fixing n_k without sampling. The results show that the learned mask is of some benefit, while varying the masking rate during training is of larger benefit.

Analysis of Part Contribution To see the impact of body parts in recognition, we erase part features by making them zero.
Tab. 6.6 shows a trend of performance gain as more parts are added. For the LTCC dataset, accuracy increases from 25.77% to 42.35% as body parts from the nose to the ankle are incorporated. This suggests that including the full range of body parts aids recognition. In contrast, PRCC achieves high performance by using upper-body cues, reaching a top-1 accuracy of 80.55% with parts up to the shoulder and elbow. Lower-body features add minimal or even negative value. This analysis implies the benefit of scenario-specific adjustments, where relevant body regions can optimize recognition performance. We also visualize the part feature similarity with sample images from the test set of WebBody4M in Fig. 6.7. Samples of different scales and poses are visualized.

Figure 6.7 Part Similarity Visualization. The top shows same-subject pairs. The bottom shows different-subject pairs. Part features provide some indication of where the similar parts are, but the final similarity is generated through a nonlinear mapping of the part features.

Figure 6.8 Illustration of the feature vector generation in SapiensID. First, Retina Patch (RP) generates image patches. Then, the Masked Recognition Model (MRM) modifies the number of tokens. Finally, the Semantic Attention Head (SAH) produces the feature vector from the set of tokens.

6.5 Method Details
6.5.1 Training Details
The training pipeline of SapiensID is largely similar to the setting of training a ViT model in face recognition [125]. This is possible because WebBody4M is a labeled dataset with a sufficient number of subjects, just as face recognition datasets are. We use the AdaFace [122] loss and optimize the model with the AdamW [164] optimizer for 33 epochs. The learning rate is scheduled by the Cosine Annealing Learning Rate Scheduler [163] with an additional warm-up period of 3 epochs. The maximum learning rate is set to 0.0001. We use 7 A100 GPUs with a batch size of 128. We also change the classifier to PartialFC [11] with a sampling ratio of 0.1 to save GPU memory and gain computational efficiency. An overview of the model is shown in Fig. 6.8. For data augmentation, we find that it is important to use a moderate amount of geometric augmentation (zoom in-out: 0.9 ∼ 1.1, translation: ±0.05) and aspect ratio adjustments (0.95 ∼ 1.05). We also find it effective for improving aligned face recognition performance to include face-zoomed-in images frequently (40%). We also oversample images that contain more visible keypoints, because those images are relatively scarce (note Tab. 6.12).

6.5.2 Notation Clarification in the Main Paper
In the Semantic Attention Head (SAH), the attention presented in Eq. 6.8,

    O^i_part = Attention(Q^i_kp, PE, backbone(X^i)),    (6.11)

with Attention(Q, K, V), is specifically defined as:

    O^i_part = softmax( W_q Q (W_k K)^⊤ / √d ) W_v V,    (6.12)

where Q, K, and V represent the query, key, and value matrices, respectively, and W_q, W_k, and W_v are their associated projection weights. This is how the size of the attention is modulated during learning. Also notice that, without the learnable projections W_{q,k,v} and with a small d, the attention simply focuses on the position with the highest proximity to the keypoint.
To make sure that we retain the feature from the sharp peak at the keypoint location, we additionally use

    O^i_peak = softmax( QK^⊤ / √d ) V.    (6.13)

The final feature vector is computed by concatenating the two sets of semantic features O^i_part and O^i_peak and flattening them for the MLP projection. Specifically, it is

    f^i = MLP(flatten([O^i_part, O^i_peak])).    (6.14)

The addition of O^i_peak is simply to ensure that the model always has the feature from the keypoint location. We have not tested how much of a performance gap is created by removing this inductive bias in SAH. The final number of part features is 152 (19 keypoints × 4 offset repeats × 2 from concatenating O^i_part and O^i_peak). We realize that readers could be confused by the formulation of the SAH attention, so we will make it clearer in the main paper.

6.5.3 Things We Tried That Did Not Make It into the Main Algorithm
• We tried to initialize the model with the Sapiens [118] pretrained backbone, thinking it would be a good starting point that leads to better generalization. However, it did not lead to better performance. We believe this is because: 1) our patch scheme is dramatically different from the original patch scheme, and 2) Sapiens is trained with the MAE [85] objective, which is suitable for dense prediction tasks. However, SapiensID is a classification (or metric learning) task. Dense prediction tasks prioritize spatial consistency and detailed reconstruction, whereas classification tasks focus on extracting discriminative features, which may require different feature representations.
• We tried using the differential layerwise learning rate [270], but it did not help and the learning was only slower.
• We tried not learning the size and offset for the Semantic Attention Head (SAH) by simply taking the feature from the keypoint locations. This led to worse performance in general.

6.5.4 Transforming Keypoints to ROIs
SapiensID relies on predicted keypoints to define Regions of Interest (ROIs). Assuming we have an input image roughly cropped around the visible body area (typically using a person detector's bounding box), we start with a set of predicted keypoints K = {(x_k, y_k)}_{k=1}^N, where N is the number of keypoints. Our goal is to generate bounding boxes for each ROI. Specifically, we generate two bounding boxes, for the face and the upper torso, in the format (x1, y1, x2, y2), representing the top-left and bottom-right corners.

1. Valid Keypoint Selection: Let K = {1, 2, ..., N} be the set of keypoint indices. For each keypoint k ∈ K, the coordinates are (x_k, y_k) ∈ R^2. We define a visibility indicator v_k for each keypoint:

    v_k = 1 if x_k ≠ −1 and y_k ≠ −1, and 0 otherwise.    (6.15)

Define the sets of keypoint indices relevant to each ROI: K1: Left Eye, K2: Right Eye, K3: Left Ear, K4: Right Ear, K5: Nose, K6: Left Mouth Corner, K7: Right Mouth Corner, K8: Left Shoulder, K9: Right Shoulder. The Face Keypoints are then M_f = {K1, K2, K3, K4, K5, K6, K7}, and the Upper Torso Keypoints are M_u = M_f ∪ {K8, K9, K10, K11}. The valid keypoints for each ROI are those that are both visible and relevant:

    V_face = {k ∈ M_f | v_k = 1},    (6.16)
    V_torso = {k ∈ M_u | v_k = 1}.    (6.17)

2. Bounding Box Center and Size Calculation: For each ROI (face or upper torso), we compute the center using the set V, which is either V_face or V_torso. First, compute the minimum and maximum coordinates among the valid keypoints:

    x_min = min_{k∈V} x_k,  y_min = min_{k∈V} y_k,  x_max = max_{k∈V} x_k,  y_max = max_{k∈V} y_k.    (6.18)

Then calculate the center of the bounding box:

    c_x = (x_min + x_max) / 2,  c_y = (y_min + y_max) / 2.    (6.19)

Then determine the maximum distance d from the center to the valid keypoints:

    d = max_{k∈V} √( (x_k − c_x)^2 + (y_k − c_y)^2 ).    (6.20)

3. Bounding Box with Padding: First define the bounding box size s with a padding factor p (e.g., p = 0.3):

    s = d × (1 + p).    (6.21)

Then calculate the coordinates of the bounding box:

    x1 = c_x − s,  y1 = c_y − s,  x2 = c_x + s,  y2 = c_y + s.    (6.22)-(6.24)

4. Making the Bounding Box Divisible: To ensure that the patches cover the image without any overlap, the boundaries of the bounding box must snap onto the patch grid. In other words, the bounding box coordinates should be divisible by the patch size (p_w, p_h) of the enclosing ROI. Let n_r and n_c be the desired number of rows and columns for patches within the ROI. We modify the bounding box size s to ensure divisibility:

    x'_1 = ⌊x1 / p_w⌋ × p_w,  y'_1 = ⌊y1 / p_h⌋ × p_h,  x'_2 = ⌈x2 / p_w⌉ × p_w,  y'_2 = ⌈y2 / p_h⌉ × p_h.    (6.25)-(6.26)

The final, grid-aligned bounding box is then:

    b = (x'_1, y'_1, x'_2, y'_2) ∈ R^4.    (6.27)

This snapping process ensures that the bounding box boundaries coincide with patch boundaries, resulting in clean, non-overlapping patch extraction. We compute two bounding boxes, b_face and b_torso, using this process. All of these steps can be conducted on the GPU for efficient computation.
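The four steps above can be sketched compactly as follows; the padding factor, patch size, and toy keypoints are assumptions, and the snap-to-grid step expands the box outward so its borders land on patch boundaries.

    import numpy as np

    def keypoints_to_roi(kps, indices, pad=0.3, patch=32):
        """Sec. 6.5.4 sketch: valid-keypoint selection, centered square box, padding, grid snapping.

        kps     : (N, 2) keypoints, invalid ones marked as (-1, -1).
        indices : keypoint indices relevant to this ROI (face or upper torso).
        patch   : patch size (p_w = p_h) of the enclosing ROI.
        """
        pts = kps[list(indices)]
        pts = pts[(pts[:, 0] != -1) & (pts[:, 1] != -1)]        # step 1: visible keypoints only
        if len(pts) == 0:
            return None                                          # assumption: no valid keypoints -> no ROI
        xmin, ymin = pts.min(axis=0)
        xmax, ymax = pts.max(axis=0)
        cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2            # step 2: center
        d = np.sqrt(((pts - [cx, cy]) ** 2).sum(axis=1)).max()   # max distance to center
        s = d * (1 + pad)                                        # step 3: padded half-size
        x1, y1, x2, y2 = cx - s, cy - s, cx + s, cy + s
        # step 4: snap outward onto the patch grid
        x1, y1 = np.floor(x1 / patch) * patch, np.floor(y1 / patch) * patch
        x2, y2 = np.ceil(x2 / patch) * patch, np.ceil(y2 / patch) * patch
        return (x1, y1, x2, y2)

    # Hypothetical face ROI from 5 visible facial keypoints on a 384 x 384 crop.
    kps = np.full((17, 2), -1.0)
    kps[:5] = [[150, 90], [190, 88], [170, 120], [155, 150], [185, 150]]
    face_roi = keypoints_to_roi(kps, indices=range(5), patch=32)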
6.5.5 Proof of Scaled Attention Equivalence
Let the scaled dot-product attention mechanism for self-attention be defined as:

    A = softmax( QK^⊤ / √d ) V.

We aim to prove that when a scaling factor δ ∈ R^{1×M} is added to the logits,

    A = softmax( QK^⊤ / √d + δ ) V,

this is equivalent to repeating each key K_j and value V_j exactly m_j times, where δ_j = log m_j.

Proof: Consider the term QK^⊤/√d + δ. For a query i and key j, the element of this matrix is:

    ( QK^⊤/√d + δ )_ij = Q_i · K_j^⊤ / √d + log m_j,

where Q_i is the i-th query and K_j is the j-th key. Applying the softmax function, we get:

    A_ij = exp( Q_i · K_j^⊤ / √d + log m_j ) / Σ_k exp( Q_i · K_k^⊤ / √d + log m_k ).

Using the property exp(a + b) = exp(a) exp(b), this simplifies to:

    A_ij = m_j exp( Q_i · K_j^⊤ / √d ) / Σ_k m_k exp( Q_i · K_k^⊤ / √d ).

This is equivalent to each key K_j and corresponding value V_j being duplicated m_j times. We discard the values corresponding to the mask, so the result of the attention mechanism is the same. Thus, the attention mechanism with δ scaling is mathematically equivalent to duplicating the keys and values proportionally to the number of times the mask appears.
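The equivalence can also be checked numerically. The sketch below compares attention with the mask token explicitly repeated n_m times against attention with a single mask token whose logit is shifted by log n_m; all sizes are arbitrary.

    import math
    import torch

    torch.manual_seed(0)
    d, n_keep, n_mask = 16, 5, 7
    q = torch.randn(1, d)                                    # a single query row
    k_keep, v_keep = torch.randn(n_keep, d), torch.randn(n_keep, d)
    k_mask, v_mask = torch.randn(1, d), torch.randn(1, d)    # the shared mask token

    # (a) Explicitly repeat the mask token n_mask times.
    k_full = torch.cat([k_keep, k_mask.expand(n_mask, d)])
    v_full = torch.cat([v_keep, v_mask.expand(n_mask, d)])
    attn_full = torch.softmax(q @ k_full.T / math.sqrt(d), dim=-1) @ v_full

    # (b) Keep one mask token and add delta = log(n_mask) to its logit (Sec. 6.5.5).
    k_one = torch.cat([k_keep, k_mask])
    v_one = torch.cat([v_keep, v_mask])
    logits = q @ k_one.T / math.sqrt(d)
    logits[:, -1] += math.log(n_mask)
    attn_one = torch.softmax(logits, dim=-1) @ v_one

    assert torch.allclose(attn_full, attn_one, atol=1e-6)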
6.6 Performance
6.6.1 WebBody4M vs WebFace4M Comparison
To assess the quality of the face image data within WebBody4M, we create WebBody-Facecrop by cropping faces from the WebBody dataset, and we compare its face recognition performance against WebFace4M [300], a dedicated large-scale face recognition dataset. We train the same ViT-based model with the AdaFace loss on both datasets. Tab. 6.7 presents the results on standard face recognition benchmarks (LFW, CPLFW, CFPFP, CALFW, and AGEDB). The model trained on WebBody4M achieves a slightly higher average accuracy (97.63%) compared to that of WebFace4M (97.44%). This indicates that the WebBody4M labels are of comparable quality, even slightly exceeding the WebFace4M labels.

Dataset       | Avg   | LFW   | CPLFW | CFPFP | CALFW | AGEDB
WF4M          | 97.44 | 99.80 | 94.97 | 96.03 | 98.94 | 97.48
WB4M-Facecrop | 97.63 | 99.82 | 95.12 | 99.19 | 96.07 | 97.97

Table 6.7 Performance comparison between WebFace4M and WebBody4M in the face recognition task.

6.6.2 Fusion Performance
While SapiensID inherently handles both face and body information within a single model, a common alternative approach involves training separate face and body recognition models and fusing their outputs. We compare SapiensID's performance with such multi-modal fusion methods. We consider a baseline where a body model (CAL [80]) is trained on either PRCC or LTCC, and a face model (ViT-Base [122]) is trained on WebFace4M. We then fuse the similarity scores of these two dedicated face and body models using three common fusion strategies: Max Fusion, Min-Max Normalization Fusion, and Mean Fusion. Tab. 6.8 presents the performance.

Method        | AVG   | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
Body          | 42.04 | 38.01        | 18.84       | 55.69        | 55.63
Face          | 36.56 | 17.60        | 4.91        | 72.62        | 51.10
Fused-Max     | 42.93 | 39.80        | 13.25       | 61.22        | 57.45
Fused Min-Max | 49.92 | 39.80        | 12.95       | 79.00        | 67.93
Fused-Mean    | 49.99 | 39.80        | 12.82       | 79.48        | 67.85
SapiensID     | 52.87 | 42.35        | 17.79       | 78.75        | 72.60

Table 6.8 Performance table of score fusion (Body and Face).

As shown in the table, even the best fusion strategy (Mean Fusion) achieves an average of 49.99%, lower than SapiensID's 52.87%. Fusion is more helpful on PRCC but not so much on LTCC, with an increase in Top-1 and a decrease in mAP. This result highlights the advantage of SapiensID's unified architecture, which learns to integrate face and body information more effectively than post-hoc fusion methods. Fusion methods treat each modality independently, potentially missing valuable contextual information that arises from their combined analysis.

6.6.3 Occluded ReID
Occlusions pose a significant challenge for robust human recognition. While specialized methods can be effective within their training domain, generalization to unseen scenarios is crucial for real-world deployment. We compare SapiensID's performance with KPR [214] combined with SOLDIER, a state-of-the-art occlusion handling method, to evaluate their respective generalization capabilities. KPR+SOLDIER is trained on a combination of LUPerson4M and the OccludedReID [301] dataset, while SapiensID is trained on our WebBody4M dataset without any OccludedReID data. Tab. 6.9 presents the results on OccludedReID and the LTCC dataset (both the General and Clothing Change protocols).

Method              | Training Data            | OccludedReID top1 | OccludedReID mAP | LTCC General top1 | LTCC General mAP | LTCC CC top1 | LTCC CC mAP
KPR [214] + SOLDIER | LUPerson4M + OccludedReID | 84.80            | 82.60            | 68.15             | 32.42            | 21.17        | 10.19
SapiensID           | WebBody4M                 | 87.30            | 75.57            | 74.24             | 36.88            | 42.60        | 17.39

Table 6.9 Generalization performance comparison under occlusion. SapiensID demonstrates superior generalization to unseen datasets (LTCC) compared to KPR+SOLDIER.
While KPR+SOLDIER and SapiensID achieve similar performance on OccludedReID, SapiensID demonstrates significantly better generalization performance. On LTCC, SapiensID substantially outperforms KPR+SOLDIER across both protocols, highlighting the limitations of specialized training. This underscores the importance of training on diverse datasets like WebBody4M to achieve robust generalization in real-world human recognition. SapiensID, by learning from a wide range of poses, viewpoints, and clothing styles, is more adaptable and effective in unseen scenarios.

(a) Top-down: adding part features from Nose to Ankle
Parts added       | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
None              | 0.00         | 3.56        | 1.47         | 4.28
1 + Nose          | 25.77        | 5.78        | 27.21        | 21.04
2 + Eye           | 30.61        | 8.87        | 63.87        | 55.17
3 + Mouth         | 38.01        | 11.81       | 73.36        | 65.05
4 + Ear           | 39.80        | 14.05       | 77.65        | 70.45
5 + Shoulder      | 41.84        | 15.82       | 79.73        | 73.14
6 + Elbow         | 41.07        | 16.64       | 80.55        | 73.54
7 + Wrist         | 41.07        | 17.16       | 79.34        | 73.16
8 + Hip           | 40.56        | 17.50       | 79.99        | 73.38
9 + Knee          | 42.35        | 17.73       | 79.00        | 72.88
10 + Ankle (Full) | 42.35        | 17.79       | 78.75        | 72.60

(b) Bottom-up: adding part features from Ankle to Nose
Parts added       | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
None              | 0.00         | 3.56        | 1.47         | 4.28
1 + Ankle         | 27.04        | 7.37        | 45.05        | 35.32
2 + Knee          | 32.14        | 9.55        | 55.12        | 44.97
3 + Hip           | 35.71        | 12.34       | 66.07        | 55.04
4 + Wrist         | 37.24        | 13.83       | 67.63        | 58.43
5 + Elbow         | 40.05        | 15.72       | 69.57        | 62.61
6 + Shoulder      | 41.33        | 16.87       | 73.84        | 67.80
7 + Ear           | 41.58        | 17.61       | 76.21        | 70.62
8 + Mouth         | 41.58        | 17.95       | 78.18        | 72.63
9 + Eye           | 41.58        | 17.80       | 79.23        | 72.92
10 + Nose (Full)  | 42.35        | 17.79       | 78.75        | 72.60

Table 6.10 Comparison of feature erasing performance. (a) shows the performance as we progressively introduce features from Nose to Ankle (top-down approach). (b) demonstrates the performance when adding features from Ankle to Nose (bottom-up approach). Results are evaluated on the LTCC and PRCC Cloth Changing (CC) protocols.

(a) Top-add: adding visible image strips from the top
Strips added | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
None         | 2.30         | 1.89        | 12.67        | 4.78
1 + Top1     | 5.10         | 2.61        | 78.04        | 67.29
2 + Top2     | 27.04        | 11.88       | 79.25        | 70.53
3 + Top3     | 29.34        | 13.20       | 78.35        | 69.85
4 + Top4     | 33.67        | 13.88       | 77.82        | 69.55
5 + Top5     | 37.24        | 14.65       | 76.97        | 69.28
6 + Top6     | 36.48        | 15.49       | 78.55        | 70.39
7 + Top7     | 41.07        | 16.63       | 80.07        | 71.52
Full         | 42.35        | 17.79       | 78.75        | 72.60

(b) Bottom-add: adding visible image strips from the bottom
Strips added | LTCC CC Top1 | LTCC CC mAP | PRCC CC Top1 | PRCC CC mAP
None         | 2.30         | 1.87        | 12.50        | 4.78
1 + Bottom1  | 2.81         | 2.26        | 24.56        | 10.89
2 + Bottom2  | 6.12         | 3.08        | 31.22        | 16.94
3 + Bottom3  | 5.87         | 3.62        | 33.78        | 20.65
4 + Bottom4  | 10.20        | 4.26        | 33.08        | 24.59
5 + Bottom5  | 12.50        | 5.33        | 22.10        | 21.31
6 + Bottom6  | 16.07        | 6.48        | 24.47        | 24.80
7 + Bottom7  | 35.46        | 13.20       | 29.07        | 28.63
Full         | 42.35        | 17.79       | 78.75        | 72.60

Table 6.11 Impact of progressively adding visible parts from (a) the top and (b) the bottom. In contrast to Tab. 6.10, which measures the performance with the intermediate features zeroed out, here the actual input image is masked out.

Figure 6.9 Illustration of how images are erased from top to bottom or bottom to top.

6.6.4 Impact of Body Part Features
We investigate the relative importance of different body parts in human recognition by conducting an ablation study on the Semantic Attention Head (SAH). Starting from part features (O^i_part in Eq. 6.8) multiplied by zero, we progressively undo the masking, either from nose to ankles (top-down) or ankles to nose (bottom-up). We evaluate performance on LTCC (Clothing Change protocol) and PRCC (Clothing Change protocol). Results are presented side-by-side in Tab. 6.10. The top-down approach generally yields faster performance gains than bottom-up, suggesting that upper-body features contribute more significantly to recognition.
Interestingly, ankle features alone appear more discriminative than nose features alone. However, this counter-intuitive finding does not imply that ankles are inherently more informative than noses for person identification. We hypothesize that this observation arises because each part feature within SAH is not solely derived from the corresponding body part. Due to the preceding ViT backbone's attention mechanism, each part feature incorporates information from other body regions. Therefore, the presented results reflect the discriminative power of a part plus peripheral information from other parts, rather than the isolated contribution of each part. A more accurate assessment of a part's individual discriminative ability would involve manipulating the input image directly, such as by occluding specific body parts. This approach, which isolates the impact of each part, is explored in the following section.

6.6.5 Impact of Actual Image Erased

To isolate the contribution of each body region, we conduct a second ablation study in which we progressively erase sections of the input image, either top-down or bottom-up, as illustrated in Fig. 6.9. We erase equal-sized horizontal strips, starting with a single strip and progressively adding more until the whole image is erased (represented as "None" in the tables). The "Full" row represents the baseline performance with the complete image. Results are presented in Tab. 6.11.

The direct manipulation of the image confirms the importance of upper body regions: on both datasets, removing the top portion of the image drastically reduces performance. Surprisingly, PRCC achieves very good performance with only the single topmost strip of the image, whereas LTCC also requires the lower parts to reach good performance. This indicates that different datasets exhibit different characteristics that can be exploited for ReID.

6.7 Visualization

6.7.1 Token Length Sampling Distribution

In the Masked Recognition Model (MRM), we propose an adaptive token sampling strategy during training to enhance the robustness and generalization of our masked recognition model. Fig. 6.10 illustrates the sampling distribution and its effect on the input image.

Figure 6.10 Illustration of the masked image and the sampling distribution of the number of tokens to keep, n̂_k. The red vertical line shows where the sampling took place for the right image. From top to bottom, fewer tokens are kept (more masking).

The number of tokens to keep, n̂_k, is determined by Eqn. 6.6:

n̂_k = n_k + (n_i − n_k) · e^(−λ·U(0,1)),

where n_i is the maximum possible number of tokens (432 in our case, with 3 ROIs of 12x12 patches each), n_k is the minimum number of tokens to keep, U(0,1) is a uniform random variable, and λ controls the decay rate (set to 4). This sampling strategy allows us to retain between 26% and 80% of the tokens (112 to 345 tokens), with an average of 166 tokens per batch.

As depicted in Fig. 6.10, heavy masking can significantly distort the input image. Fixing the masking rate at such high levels could introduce a distribution shift between training and testing (where all tokens are used), causing a performance drop. Our adaptive sampling mitigates this issue by exposing the model to a variety of masking ratios, encouraging it to learn robust representations that generalize well to the full token input during inference. One thing to note is that the sampling of n̂_k happens per batch.
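As a concrete illustration of Eqn. 6.6, the sketch below samples the per-batch token budget with n_k = 112, n_i = 432, and λ = 4 taken from the description above; the function name sample_num_tokens is a hypothetical stand-in for the corresponding training-loop code.

import math
import random

def sample_num_tokens(n_min=112, n_max=432, decay=4.0):
    # n_hat = n_k + (n_i - n_k) * exp(-lambda * U(0,1)); smaller u means lighter masking.
    u = random.random()
    return round(n_min + (n_max - n_min) * math.exp(-decay * u))

# Because u is uniform and the decay is steep, heavily masked batches are far more
# common than lightly masked ones, exposing the model to a wide range of masking ratios.
print([sample_num_tokens() for _ in range(8)])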
When a larger n̂_k is sampled, we reduce the batch size accordingly to fit the given GPU memory (see Sec. 6.3.2 for more details).

6.7.2 WebBody4M Dataset Body Parts Visibility

The WebBody4M dataset encompasses a wide range of human poses and viewpoints, resulting in varying visibility of body keypoints. Tab. 6.12 presents the percentage of images in which each keypoint (left and right sides) is visible. As expected, upper-body keypoints exhibit high visibility rates: the eyes are visible in over 93% of images, the ears in over 74%, and the shoulders in over 88%. Visibility decreases progressively down the body, with elbows and wrists around 50%, hips around 45%, and knees and ankles below 24% and 17%, respectively. This distribution reflects the natural tendency for upper body parts to be more frequently visible in unconstrained images, as lower body parts are often occluded by clothing, objects, or the image frame itself. This distribution also helps explain why upper body parts provide greater discriminative power for person ReID in our earlier analysis (Sec. 6.6.4).

Visibility (%)  Eye    Ear    Shoulder  Elbow  Wrist  Hip    Knee   Ankle
Left            93.49  76.87  88.15     53.76  49.98  45.68  23.92  16.98
Right           93.59  74.48  90.04     53.80  50.35  45.70  23.95  17.00

Table 6.12 Keypoint visibility in the WebBody4M dataset.

6.7.3 Visualization of Part Weights

To facilitate effective learning from a mixture of short-term and long-term ReID datasets, we hypothesize that it would be helpful to add learnable weights that modulate the importance of individual part features within the Semantic Attention Head (SAH). Our conjecture is that the discriminative characteristics of body parts can vary significantly depending on whether clothing remains constant or varies in the training dataset.

Fig. 6.11 visualizes the learned weights (Eqn. 6.14) for WebBody4M and several additional whole-body ReID datasets.

Figure 6.11 Comparison of learned part weights across seven datasets. Left and right sides are averaged together before visualization.

WebBody4M, primarily composed of web-collected images, exhibits a higher emphasis on facial features compared to lower body parts. This is expected, as WebBody4M was collected largely based on facial similarity. In contrast, auxiliary datasets like Market1501, LTCC, and PRCC, which feature many images with consistent clothing (e.g., 1-3 outfits across 20-30 images per person), show increased emphasis on body features for recognition. This highlights the importance of body shape, pose, and clothing appearance as discriminative cues when attire remains relatively constant. Celeb-ReID, however, like WebBody4M, primarily contains images with clothing changes across captures. Consequently, Celeb-ReID exhibits a similar weighting pattern, with less emphasis on body features and a relatively higher focus on other cues, likely facial features.

To validate the hypothesis, we conducted an ablation study to evaluate the impact of training with learnable weights. Tab. 6.13 presents a comparison between SapiensID and SapiensID without the learnable weights. In the latter, all aspects remain the same except that the learnable weights are removed during training. From the results, it is evident that the inclusion of learnable weights does not yield a significant overall improvement. Instead, it shows a specific enhancement in long-term ReID performance, possibly because WebBody4M's learning was not hindered by the influence of short-term datasets with the same clothing.
However, for short-term datasets, the addition of weights does not result in performance gains. This suggests that while the weighting mechanism provides insights into dataset-specific learning behaviors, it is not a definitive factor for achieving better ReID performance. In conclusion, while the introduction of learnable weights is interesting for analytical purposes, we emphasize that it is not a deciding factor for learning a universal representation that works for both short-term and long-term ReID. Future research could explore alternative methods that better balance the learning from diverse dataset characteristics without negatively impacting specific subsets.

Method              All    Face   Whole Body ReID Short  Whole Body ReID Long
SapiensID           78.67  96.66  73.05                  66.30
SapiensID - Weight  78.59  96.66  75.72                  63.39

Table 6.13 Performance comparison of SapiensID and SapiensID without the learnable part weights during training, across different metrics.

6.7.4 SAH Visualization

The Semantic Attention Head (SAH) plays a crucial role in SapiensID by generating pose-invariant features. To understand how SAH behaves after training, we visualize its attention maps in Fig. 6.13. To be specific, we visualize the following. Let Q_kp^i = GridSample(PE, kp^i) + B be the semantic query embedding for the i-th image, created by sampling from the fixed 2D position embeddings (PE) at the 19 keypoint locations. The dimension is Q_kp^i ∈ R^(nk×C), where k = 19 and n = 4 because the query is repeated 4 times to learn 4 different offsets. In SAH, we perform attention between Q_kp^i and PE by

O_part^i = softmax( (W_q Q)(W_k K)^T / √d ) W_v V.    (6.28)

In our visualization, we show softmax( (W_q Q)(W_k K)^T / √d ) for each keypoint and each offset, yielding the nk attention maps shown in the visualization.

Figure 6.13 Visualization of attention maps in the Semantic Attention Head (SAH). Regions with higher attention values are highlighted in red, while regions with lower attention values are shown in blue. Blacked-out areas represent parts of the images without visible keypoints.

The visualizations show how SAH learns both varied sizes and offsets based on a set of keypoints. For each input image, each row corresponds to a different offset; there are 4 rows because we learn n = 4 offsets for each of the 19 keypoints. Offset refers to B ∈ R^(nk×C) in Eqn. 6.7, and the offset bias allows a keypoint to move slightly from its original position. Each column corresponds to a different keypoint used by SAH (e.g., nose, left shoulder, right shoulder). As the visualization shows, the learned attention maps are not limited to the keypoint locations but also shift around the keypoints and vary in size.

6.8 Potential Application of Retina Patch

While SapiensID focuses on human recognition, the Retina Patch (RP) mechanism has broader applicability to other domains. Figure 6.12 demonstrates its potential for fine-grained visual recognition, using the CUB birds dataset as an example.

Figure 6.12 Keypoint visualization (left) and corresponding Retina Patch results (right) for images from the CUB dataset.

This dataset provides semantic keypoints, enabling the definition of meaningful regions of interest (ROIs) for RP. We define two ROIs: "head" (beak, forehead, crown, left eye, right eye, throat) and "body" (back, belly, breast, nape, left wing, right wing), excluding the tail, left leg, and right leg. The figure showcases multiple bird images processed with RP, illustrating its ability to handle variations in bird size and head size.
By dynamically allocating more patches to these regions, RP ensures consistent representation of crucial features, regardless of their scale within the image. Although we have not verified whether RP improves CUB bird classification performance, we suggest that RP could be beneficial for general recognition tasks in which images naturally contain large pose and scale variation. Future work could explore integrating RP into models for a broader set of datasets to quantitatively evaluate its benefits.

6.9 Limitations

While SapiensID demonstrates promising results for human recognition, its reliance on predefined Regions of Interest (ROIs) introduces certain limitations. The effectiveness of the Retina Patch mechanism hinges on the ability to define meaningful ROIs that capture discriminative features. This approach works well for humans, who share a consistent body topology and for whom keypoints like the face, torso, and limbs provide valuable cues for recognition.

However, this reliance on ROIs poses challenges when dealing with objects or entities that lack a consistent or well-defined structure. For instance, applying SapiensID to amorphous objects, scenes with highly variable elements, or categories with significant intra-class topological differences would require alternative strategies. In such cases, predefined ROIs might not adequately capture the relevant information, or might even be detrimental by focusing on irrelevant or inconsistent features. Future research could explore more flexible or adaptive mechanisms for defining regions of interest, enabling the application of similar principles to a wider range of object recognition tasks.

6.10 Ethical Concerns

Our goal is to facilitate research in human recognition while operating strictly within the bounds of copyright law, privacy regulations, and ethical considerations. For large-scale image datasets, it is common practice to release datasets in URL format [18, 201] because researchers do not hold the rights to redistribute the data directly. By providing permanent URLs, labels, and a one-step script to download and prepare the dataset, we enable researchers to access and utilize the data responsibly, while respecting the rights of copyright holders and individuals. We believe this approach balances the need for large-scale datasets to advance research with the imperative to protect intellectual property and privacy.

6.11 Conclusion

SapiensID presents a paradigm shift in human recognition, moving beyond modality-specific models to a unified architecture capable of identification across diverse poses and body-part scales. The Retina Patch, Semantic Attention Head, and Masked Recognition Model, combined with the WebBody4M dataset, enable SapiensID to achieve state-of-the-art performance across various ReID benchmarks and establish a strong baseline for Cross Pose-Scale ReID. This work marks a step towards holistic human recognition systems.

CHAPTER 7
EFFICIENT HUMAN RECOGNITION FRAMEWORK

While unified face and body recognition models offer enhanced robustness across diverse poses and scales, their reliance on Vision Transformers (ViTs) processing numerous tokens often leads to prohibitive computational costs, hindering practical deployment in real-time applications. This chapter introduces a novel approach to significantly improve the efficiency of unified biometric recognition ViTs without compromising accuracy.
We propose Keypoint- based Token Fusion (KP-ToFu), a heuristic token reduction strategy specifically designed for biometrics, which merges less informative tokens while strategically preserving those corresponding to crucial human keypoints essential for identification. To maintain spatial reasoning capabilities after the token structure is altered by fusion, we develop Keypoint Absolute Position Encoding (KP-APE). Additionally, we introduce Reasoning Tokens, pro- gressively added learnable tokens that compensate for the reduced input token count and enhance the model’s representational capacity for complex identity reasoning. Our synergistic approach, combining KP-ToFu, KP-APE, and Reasoning Tokens, achieves state-of-the-art performance on challenging joint face and body recognition benchmarks while providing substantial computational speed-ups. We further demonstrate the versatility of our efficient backbone by successfully adapting it to gait recognition. This work paves the way for fast, accurate, and deployable unified human recognition systems. 7.1 Introduction Human recognition remains a fundamental challenge in computer vision, crucial for applications ranging from security surveillance to personalized user experiences. Historically, this task has been tackled using disparate approaches: highly specialized models for face recognition [55,101,102,122–124,128,154,239,240,252,276] and separate models for body-based person re-identification (ReID) [80, 110, 140, 149, 151, 268]. While successful in constrained environments relying on specific alignments [1, 54] or consistent camera views [212, 268], this fragmented strategy falls short in real-world settings. Practical scenarios often present 155 humans in diverse poses (sitting, standing, partial views) and scales, requiring systems to leverage both face and body cues opportunistically [112, 271]. The conventional solution involves fusing outputs from multiple models [87,147], adding system complexity and potential failure points. A unified model, capable of processing the full spectrum of human appearance variations, promises greater robustness and simpler deployment. Recent advancements, such as the SapiensID framework, have moved towards this unifica- tion using Vision Transformer (ViT) architectures. By processing multiple input resolutions (e.g., whole body, upper torso, face), SapiensID aims to achieve scale invariance. However, this approach comes at a steep price: computational cost. Feeding multiple high-resolution views into a ViT drastically increases the number of input tokens (e.g., 432), leading to significant computational overhead and slow inference speeds. It forms a critical bottleneck, rendering such powerful unified models impractical for many real-world biometric applica- tions, including real-time video analysis, large-scale identity searches, and deployment on resource-constrained edge devices, where low latency and high throughput are paramount. Addressing this efficiency gap is therefore essential to unlock the potential of unified human recognition. A promising direction for enhancing ViT efficiency is token fusion [27, 121], which reduces the computational load by dropping or averaging redundant or less informative tokens within the network. However, applying standard token fusion techniques directly to biometric recognition tasks presents a major challenge. 
Biometric identification relies heavily on preserving fine-grained details and the precise spatial arrangement of keypoints (e.g., facial landmarks, body joints). Naive token fusion, which merges tokens without considering their semantic importance, can inadvertently collapse these critical keypoint representations, severely degrading the model’s discriminative ability and undermining the core purpose of recognition. To overcome this limitation while still reaping the benefits of token reduction, we first propose Keypoint-based Token Fusion (KP-ToFu). The core motivation is to achieve computa- 156 tional efficiency without sacrificing the crucial fine-grained information needed for biometrics. KP-ToFu intelligently identifies tokens corresponding to essential human keypoints and explicitly prevents them from being merged. Similar tokens are fused, significantly reducing the token count while ensuring that the structural integrity of the human form, vital for part-based matching, is preserved. This allows for substantial speed-ups while safeguarding recognition accuracy. Secondly, the act of merging tokens fundamentally disrupts the regular grid structure inherent in the initial tokenization of the image. This disruption poses a significant problem because standard methods for incorporating spatial awareness in ViTs, such as 2D Relative Position Encoding (RPE) [89] or KP-RPE [125], rely on this grid structure to calculate positional relationships. Without effective positional encoding, the model loses vital informa- tion about where features are located relative to each other, further hindering recognition. To address this, we introduce efficient keypoint Absolute Position Encoding. KP-APE is specifically designed to calculate meaningful positional biases to keypoints even after tokens have been fused and their original grid coordinates are lost, thereby allowing the model to maintain spatial reasoning capabilities within the reduced and irregular token set. Finally, while KP-ToFu preserves keypoints and KP-APE maintains spatial awareness, the overall reduction in token count via fusion might lead to a potential loss in the model’s representational capacity. The concern is that fewer tokens might limit the network’s ability to perform complex reasoning and integrate information across different parts of the input effectively. To counteract this, we introduce Reasoning Tokens. These are a small number of randomly initialized, learnable tokens that are progressively added into the ViT blocks alongside the image tokens. Similar to the [CLS] token, these reasoning tokens are not tied to specific spatial locations. They serve as adaptable computational resources, providing the network with additional capacity to synthesize features, model complex relationships, and perform higher-level reasoning about identity, compensating for the information density increase caused by token fusion. 157 By combining KP-ToFu, KP-APE, and Reasoning Tokens, we construct an efficient yet powerful backbone for unified human recognition. Our approach achieves state-of-the-art results on challenging benchmarks requiring joint face and body identification, demonstrating significant improvements in both computational efficiency and recognition accuracy. Fur- thermore, we showcase the versatility of our optimized backbone by successfully adapting it for efficient gait recognition through the addition of temporal attention mechanisms. 
This work paves the way for deploying robust, unified human recognition systems in demanding real-world applications where both accuracy and speed are critical. 7.2 Related Work 7.2.1 Biometric Recognition The field of biometric recognition has traditionally operated with distinct silos for face recognition (FR) [55, 101, 102, 122–124, 128, 154, 239, 240, 252, 276] and body recognition (Person ReID) [80,110,140,149,151,268]. While achieving high performance, these specialized models often depend on constrained inputs, such as aligned faces [1, 54] or canonical full-body poses [212, 268], limiting their utility in unconstrained real-world scenarios [112, 271]. Recent advancements, exemplified by models like SapiensID, have pioneered a unified approach, developing single models capable of jointly processing face and body information across varying scales and poses, often eliminating the need for strict pre-alignment [125]. This unification offers enhanced robustness and simplifies deployment pipelines. However, this progress towards unification introduced a new, critical challenge: computa- tional efficiency. Architectures like SapiensID, which employ Vision Transformers (ViTs) and process multiple input resolutions (e.g., face, upper-torso, full-body) to handle scale variance, inherently generate a very large number of tokens. This large token count leads to substantial computational demands and slow inference speeds, creating a significant barrier to deploying these powerful unified models in practical, real-time biometric systems where low latency is often crucial. The prohibitive cost associated with these initial unified models underscores the urgent need for methods that can preserve their recognition capabilities while drastically 158 improving their computational efficiency. 7.2.2 Token Reduction The quadratic complexity of self-attention makes Vision Transformers (ViTs) compu- tationally expensive, hindering their use in efficiency-sensitive applications like real-time biometrics and motivating token reduction strategies. Existing methods include learned approaches [8, 68, 142, 171, 182, 192, 275, 277] that prune tokens using auxiliary modules often requiring complex training, and simpler, training-free heuristic alternatives. Heuristic tech- niques like pooling [168], sampling [68], and Token Merging (ToMe) [28, 29]—which pioneered similarity-based training-free merging. Token Fusion [121] further explored strategies blending pruning and merging concepts. However, a critical limitation of these heuristic methods for biometrics is their lack of semantic awareness. Merging tokens based solely on general feature similarity risks destroying the fine-grained details and spatial relationships of key anatomical points (e.g., eyes, joints) crucial for identification. Our work addresses this via Keypoint-based Token Fusion (KP- ToFu), an approach that explicitly preserves important keypoint tokens during fusion. We also propose KP-APE which is a modified version of KP-RPE [125] that can be applied to reduced token sets. 7.3 Proposed Work Our goal is to develop an efficient backbone for unified human recognition, capable of processing diverse inputs containing faces and bodies across various scales and poses, while remaining computationally tractable for practical applications. We formulate the task as metric learning, aiming to produce discriminative embeddings where images of the same identity are closer than images of different identities, trained using a margin-based softmax loss [122]. 
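For reference, a minimal sketch of a margin-based softmax loss of the kind referenced above is shown below, in its ArcFace-style additive-angular-margin form; the actual training uses the quality-adaptive margin of [122], so the fixed margin, scale, and function name here are illustrative simplifications.

import torch
import torch.nn.functional as F

def margin_softmax_loss(embeddings, labels, class_weights, margin=0.5, scale=64.0):
    # Cosine logits between L2-normalized embeddings and class prototypes.
    emb = F.normalize(embeddings, dim=1)                  # (B, C)
    w = F.normalize(class_weights, dim=1)                 # (num_ids, C)
    cos = (emb @ w.t()).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    target = F.one_hot(labels, num_classes=w.size(0)).bool()
    # Add the angular margin only to the ground-truth class before scaling.
    logits = torch.where(target, torch.cos(theta + margin), cos) * scale
    return F.cross_entropy(logits, labels)

# Example with random tensors.
feats = torch.randn(8, 512)
protos = torch.randn(1000, 512, requires_grad=True)
ids = torch.randint(0, 1000, (8,))
margin_softmax_loss(feats, ids, protos).backward()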
The overall pipeline is shown in Fig. 7.1.

Figure 7.1 Overview of the proposed pipeline. Tokens are merged at each layer while the keypoint tokens remain intact.

7.3.1 Overview and Baseline Input Processing

Inspired by multi-region processing but simplified for efficiency, we handle scale variance by extracting the whole image, upper torso, and face regions (derived via keypoints [34]), resizing each to a standard resolution (e.g., 384x384), and concatenating them horizontally into a single wide image (e.g., 384x1152). This wide image is processed by a standard Vision Transformer (ViT) backbone [60]. Using 32x32 patches results in a large number of initial tokens (N = 432 for a 384x1152 input). These tokens are augmented with standard position embeddings.

The primary challenge is the computational cost associated with processing this large initial token count (N = 432) through self-attention layers, which have O(N^2) complexity. This significantly hinders practical deployment. To address this efficiency bottleneck, we introduce the following techniques designed to reduce the effective number of processed tokens while preserving critical biometric information: Keypoint-based Token Fusion (KP-ToFu), Keypoint Absolute Position Encoding (KP-APE), and Reasoning Tokens, detailed in the subsequent sections.

7.3.2 Keypoint-based Token Fusion (KP-ToFu)

Our approach to improving the efficiency of the ViT backbone hinges on reducing the number of tokens N processed in its layers. We adapt recent training-free token reduction techniques, specifically focusing on token merging, but tailor them to biometric recognition.

1. Standard Token Merging. Standard token merging employs Bipartite Soft Matching (BSM) [28] to identify the r most similar pairs of tokens (idx_src, idx_dst) within an input sequence X ∈ R^(N×C). Similarity is typically based on token features from the preceding Multihead Self Attention (MSA) layer (e.g., using Key K vectors averaged across heads). The r source tokens (indexed by idx_src) are then merged into their corresponding destination tokens (indexed by idx_dst) using average merging via scatter_reduce [121]:

X_src ← X[idx_src]    (7.1)
X' ← X.scatter_reduce(X_src, idx_dst, mode='mean')    (7.2)

After merging, the r source tokens are removed, yielding N − r tokens. However, this standard approach is unaware of token semantics and can inadvertently merge tokens representing anatomical keypoints, degrading biometric performance.

2. Keypoint Identification and Index Partitioning. For robust biometric recognition, preserving keypoint information is crucial. We first identify the indices of tokens corresponding to K anatomical keypoints (detected via [34]), denoted as the set I_kp ⊂ {1, ..., N}. The remaining token indices form the non-keypoint set I_nkp = {1, ..., N} \ I_kp. This distinction is fundamental to our modified fusion strategy.

3. KP-ToFu: Keypoint-Preserving Merging. Our Keypoint-based Token Fusion (KP-ToFu) method modifies the BSM matching process to explicitly protect keypoint tokens. We achieve this by carefully defining the source and destination pools for BSM (a minimal sketch of this construction is given below):
• The non-keypoint indices I_nkp are partitioned into two subsets, I_nkp^(1) and I_nkp^(2), based on alternating index.
• The Source (SRC) pool for BSM is restricted to tokens indexed by I_nkp^(1).
• The Destination (DST) pool for BSM includes tokens indexed by the other non-keypoint partition plus all keypoint tokens, i.e., I_nkp^(2) ∪ I_kp.
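The sketch referenced above is given here; it assumes per-token keys from the preceding MSA layer are available, and the function name kp_tofu, the tensor layout, and the explicit loop (in place of scatter_reduce) are illustrative simplifications of the merging described in this section.

import torch
import torch.nn.functional as F

def kp_tofu(x, keys, kp_idx, r):
    # x: (N, C) token features; keys: (N, C) similarity keys; kp_idx: keypoint token
    # indices (never removed); r: number of non-keypoint tokens to merge away.
    n = x.size(0)
    is_kp = torch.zeros(n, dtype=torch.bool)
    is_kp[kp_idx] = True
    nkp_idx = torch.nonzero(~is_kp, as_tuple=False).squeeze(1)

    src_idx = nkp_idx[0::2]                           # I_nkp^(1): candidate sources only
    dst_idx = torch.cat([nkp_idx[1::2], kp_idx])      # I_nkp^(2) plus all keypoint tokens

    sim = F.normalize(keys[src_idx], dim=1) @ F.normalize(keys[dst_idx], dim=1).t()
    best_sim, best_dst = sim.max(dim=1)               # best DST partner for each SRC token
    chosen = best_sim.argsort(descending=True)[:r]    # the r most similar pairs

    merged, counts = x.clone(), torch.ones(n, 1)
    for s in chosen:                                  # running average realizes mean merging
        src, dst = src_idx[s], dst_idx[best_dst[s]]
        merged[dst] = (merged[dst] * counts[dst] + x[src]) / (counts[dst] + 1)
        counts[dst] += 1

    keep = torch.ones(n, dtype=torch.bool)
    keep[src_idx[chosen]] = False                     # keypoint tokens are always kept
    return merged[keep], keep                         # (N - r, C) tokens and the keep mask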
Then we find the r most similar pairs (idx_src, idx_dst) between the SRC and DST pools, using the same similarity metric (averaged Key vectors). Crucially, this construction guarantees that the source indices idx_src are always a subset of I_nkp^(1), ensuring no keypoint token is ever selected for removal (for all i ∈ idx_src, i ∉ I_kp). The merging (Eq. 7.2) and subsequent removal of the r source tokens proceed as in standard merging, producing a sequence of N − r tokens. KP-ToFu thus efficiently reduces the token count while guaranteeing the preservation of all K keypoint tokens, maintaining the structural fidelity essential for accurate biometric recognition.

7.3.3 Keypoint Absolute Position Encoding (KP-APE)

Token fusion via KP-ToFu (Sec. 7.3.2) disrupts the token grid, making standard positional encoding methods like KP-RPE [125] impractical due to the excessive overhead of merging relative positional biases at each layer. To efficiently provide spatial awareness in the reduced token set, we introduce Keypoint Absolute Position Encoding (KP-APE).

KP-APE leverages the importance of anatomical landmarks in biometrics. We define a set of learnable absolute position embeddings, P_kp ∈ R^(K×C), one p_k ∈ R^C for each of the K keypoint types. At each layer l, KP-APE updates every token x_i^(l) in the current sequence X^(l) by adding a distance-weighted sum of the keypoint embeddings. Using learnable non-negative decay parameters λ_k, the update is:

x'_i^(l) = x_i^(l) + Σ_{k=1}^{K} e^(−λ_k d_{i,k}) p_k    (7.3)

Here, x'_i^(l) denotes the updated token feature vector, and d_{i,k} is the distance between token i and keypoint k. This distance is determined by tracking token locations: we initialize (x, y) coordinates for each token and update them at each layer l in parallel with KP-ToFu. As tokens merge (using indices idx_src, idx_dst), their coordinates are merged as well; in other words, the average coordinate represents the location of the new token. d_{i,k} is then the Euclidean distance between the tracked coordinate of token i and the coordinate of keypoint k.

This approach efficiently encodes spatial information. The benefit is that keypoint tokens receive a strong signal from their corresponding embedding (d_{i,k} ≈ 0 implies e^(−λ_k d_{i,k}) ≈ 1, assuming keypoint tokens are not merged into non-keypoint tokens). Furthermore, all tokens gain awareness of their position relative to keypoints via distance-modulated contributions based on their dynamically updated effective locations. KP-APE adapts seamlessly to the fused token sequence at each layer using tracked coordinates and fixed embeddings, and thus maintains crucial spatial reasoning capabilities focused on keypoints within an efficient, fusion-compatible framework.

7.3.4 Reasoning Tokens

To enhance pose/shape invariance and compensate for the increased information density after token fusion (Sec. 7.3.2), we introduce Reasoning Tokens (RTs). These are learnable, randomly initialized tokens, not tied to specific spatial image locations, functioning as adaptable computational resources within the network.

RTs are progressively added based on a schedule specifying r_l ≥ 0 new tokens for each transformer block l. Let X^(l) ∈ R^(N^(l)×C) be the image tokens entering block l, and R^(l) ∈ R^(M^(l)×C) be the RTs propagated from the previous block (M^(1) = 0). We initialize r_l new RTs, R_new^(l) ∈ R^(r_l×C). The full input sequence to block l's self-attention is the concatenation:

Z^(l) = Concat(X^(l), R^(l), R_new^(l)) ∈ R^((N^(l)+M^(l)+r_l)×C)    (7.4)

All tokens in Z^(l) interact.
The output tokens corresponding to R^(l) and R_new^(l) form the propagated set R^(l+1) for the next block. These RTs provide additional capacity for the model to synthesize features, integrate information across image tokens, and potentially distill more abstract, invariant semantic information crucial for robust identity recognition, mitigating potential representation loss from token fusion.

Method           LFW [100]  CPLFW [296]  CFPFP [202]  CALFW [297]  FLOPs (G)
SapiensID [125]    99.82       94.85        98.74        95.78       31.77
Proposed Work      99.82       94.92        98.80        95.88       20.87

Table 7.1 Face recognition performance comparison. Accuracy (%) is reported. FLOPs (G) are measured for a single forward pass with the standard input (384x1152).

7.4 Experiments

Implementation Details. We evaluate our proposed efficient unified recognition backbone against the SapiensID [126] baseline. Both models use a ViT-Base architecture and are trained on the WebBody4M dataset using the AdaFace loss [122], following established training practices. The input image size is 384 × 384 (with padding), and 3 ROIs (whole image, upper torso, head) are extracted, initially leading to 12 × 12 × 3 = 432 patches for the ViT backbone.

Face Recognition Performance. We evaluate performance on standard aligned face recognition benchmarks. Table 7.1 compares our Proposed Work against the SapiensID baseline. The results show that our method achieves highly comparable accuracy across all datasets (LFW, CPLFW, CFPFP, CALFW). This demonstrates that the proposed efficiency enhancements, including token fusion and the KP-APE positional encoding, successfully preserve the fine-grained details necessary for face identification.

Body Recognition Performance. We further evaluate on body recognition benchmarks, including long-term ReID (LTCC and PRCC, focusing on clothing changes) and short-term ReID (Market1501). Table 7.2 shows the top-1 accuracy comparison. Our Proposed Work demonstrates strong performance, achieving higher accuracy on PRCC long-term ReID compared to the baseline. While lower on LTCC and Market1501, the results remain competitive, showcasing the effectiveness of the proposed architecture in handling body recognition tasks under challenging conditions even after significant token reduction. The performance trade-offs might reflect the different balance struck between spatial detail and semantic abstraction due to token fusion and the KP-APE vs. KP-RPE encoding strategies.

Method           LTCC [212]  PRCC [268]  Market1501 [295]
SapiensID [125]    42.60       66.69        90.53
Proposed Work      29.59       69.94        87.98

Table 7.2 Body recognition performance comparison. Top-1 accuracy (%) is reported. The long-term datasets (LTCC, PRCC) use clothing-change protocols; Market1501 is short-term ReID.

Efficiency Analysis. A primary contribution of our work is the enhancement of computational efficiency. As indicated in Table 7.1, our Proposed Work reduces the computational cost from 31.77 GFLOPs (SapiensID baseline) to 20.87 GFLOPs. This constitutes a significant reduction of approximately 34.3%, offering a theoretical speedup of about 1.52x. This efficiency gain is primarily achieved through Keypoint-based Token Fusion (KP-ToFu), which substantially decreases the number of tokens processed by the computationally intensive self-attention layers, while the addition of Reasoning Tokens incurs minimal overhead. This makes our proposed backbone much more suitable for deployment in resource-constrained environments or real-time applications.
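The reduction and speedup figures quoted above follow directly from the two FLOP counts in Table 7.1; the following is a quick arithmetic check.

# Quick check of the efficiency numbers in Table 7.1.
baseline_gflops = 31.77   # SapiensID
proposed_gflops = 20.87   # Proposed Work
reduction = 1.0 - proposed_gflops / baseline_gflops
speedup = baseline_gflops / proposed_gflops
print(f"FLOP reduction: {reduction:.1%}, theoretical speedup: {speedup:.2f}x")
# -> FLOP reduction: 34.3%, theoretical speedup: 1.52x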
Visualization of Token Retention Figure 7.2 visualizes the effectiveness of KP-ToFu in preserving keypoint-related tokens during the fusion process. For two example inputs—one seated and one standing—the visualizations compare the standard ToFu (top row) and our proposed KP-ToFu (bottom row) at deeper layers (e.g., depth 23). Each red dot indicates a remaining token after fusion. The facial and body keypoints are marked on top of the input images. We observe that KP-ToFu explicitly retains tokens near critical landmarks (e.g., facial features, joints), whereas standard ToFu can aggressively merge away fine-grained regions. This highlights how KP-ToFu maintains semantic structure crucial for accurate biometric recognition while achieving substantial token reduction. 165 Figure 7.2 Visualization of token retention with ToFu vs. KP-ToFu. Each red dot represents a retained token at a deeper transformer layer. KP-ToFu better preserves keypoint-related tokens (e.g., facial landmarks, joints), improving structural integrity for recognition tasks. 7.5 Conclusion In this work, we present a new backbone for unified human recognition that addresses a critical challenge in modern biometric systems: how to maintain high recognition accuracy across diverse human appearances while significantly reducing computational cost. We intro- duce three key innovations—Keypoint-based Token Fusion (KP-ToFu), Keypoint Absolute Position Encoding (KP-APE), and Reasoning Tokens—that together enable effective token reduction in Vision Transformers without compromising the fine-grained spatial information vital for face and body recognition. Through extensive experiments, we demonstrate that our proposed method achieves recog- nition performance on par with or exceeding the state-of-the-art across multiple benchmarks, while reducing FLOPs by over 34%. Our approach maintains strong discriminative power for both aligned face recognition and unconstrained person re-identification, showcasing its robustness across pose, scale, and appearance variations. Furthermore, visualizations confirm that KP-ToFu preserves semantic structure by protecting keypoint-relevant tokens, a critical feature that standard token fusion methods lack. Our efficient and unified backbone paves the way for scalable, real-time biometric systems deployable on edge devices, enabling practical applications in security, forensics, and personalized user interfaces. 166 CHAPTER 8 DISCUSSION AND CONCLUSION 8.1 Historical Context and Research Trajectory As shown in Fig 8.1, face recognition has undergone several paradigm shifts over the past decades. Early systems relied on hand-crafted features such as Eigenfaces [233] and Local Binary Patterns (LBP) [7], later evolving into more structured feature engineering approaches like SIFT [165] and Haar features [237]. The major turning point arrived with the advent of deep learning around 2014. Pioneering models like DeepFace [225] and FaceNet [200] introduced end-to-end learning pipelines that significantly outperformed traditional techniques. This deep learning wave was quickly followed by a flurry of innovations in loss functions [55, 102, 122, 154, 239, 240], architecture design [86], and large-scale dataset development [300], each contributing to improved robustness and scalability. This dissertation contributes to this ongoing evolution by addressing key limitations of modern face recognition. 
It comprises five core works that span the three foundational pillars of progress in this domain: loss functions, dataset design, and architectural innovations. These include:
• Loss function innovation: AdaFace, which adapts margin constraints based on image quality for robust training.
• Architectural design: CAFace and KPRPE, which address challenges in large-scale video recognition and pose misalignment, respectively.
• Dataset and generative modeling: DCFace, a dual-conditioned diffusion framework for generating high-quality synthetic training identities.
• Multi-modal biometrics: SapiensID, a large-scale multimodal model developed to support a new biometric paradigm that unifies face and body recognition.

Figure 8.1 Timeline of face recognition evolution, tracing the transition from hand-crafted features to deep learning-based approaches. The contributions of this thesis are in bolded text.

8.2 Limitations and Open Challenges

8.2.1 Precision of Facial Features

As face recognition systems are increasingly deployed in unconstrained, real-world environments, the precision of facial features becomes a critical factor. In particular, systems must handle low-quality imagery, varied poses, occlusions, and other challenging conditions. Under such circumstances, the quality and reliability of the underlying keypoint detections can substantially influence recognition performance.

Architectures like KP-RPE [125] aim to incorporate structural priors by explicitly modeling spatial relationships between facial landmarks. However, such approaches inherently rely on the success of the keypoint detector itself [54, 165]. When landmarks are mislocalized due to motion blur, extreme pose, or occlusion, the downstream recognition model can suffer from incorrect relational encodings.

This points to a broader challenge: the current decoupling between keypoint detection and identity recognition creates a potential vulnerability. For future systems, it may be beneficial to explore joint optimization frameworks where the keypoint estimation is co-trained or tightly coupled with the recognition objective, improving resilience in low-quality scenarios. Moreover, advancements in keypoint detection, especially under degraded conditions such as surveillance video, will be essential for enabling robust structural encodings in the wild. As applications move beyond controlled datasets, these front-end challenges are likely to become bottlenecks, underscoring the need for more integrated, end-to-end solutions.

8.2.2 Identical Twins and Similarity Challenges

Despite significant advances, deep face recognition systems still struggle with edge cases such as identical twins. Even under high-quality imaging conditions, the facial similarities between twins can be so high that image-based models fail to reliably distinguish them [130]. Consumer devices have demonstrated this limitation: basic facial recognition methods, which rely on matching key facial features, can sometimes be fooled by twins. To remedy this, advanced systems use depth sensors to reduce the likelihood of errors.
This challenge highlights a fundamental limitation of appearance-based recognition: when inter-subject variance is extremely low, visual information alone may be insufficient. Recent work has extended this insight to the broader problem of lookalike disambiguation, where non-twin individuals exhibit extremely similar facial appearances. Swearingen and Ross [223] proposed a reranking strategy that augments traditional face matchers with a dedicated disambiguator, specifically tuned to distinguish between such lookalikes. Their method improved closed-set identification accuracy on the challenging TinyFace dataset, suggesting that hybrid architectures may offer a viable path forward in these hard scenarios. To further improve robustness, future systems will likely need to adopt multi-modal inputs, integrating cues such as depth or voice. These additional modalities can provide independent, discriminative signals that are more effective when appearance alone fails. As face recognition moves into more diverse and security-critical applications, handling these hard cases will be essential for practical reliability. 8.2.3 Interpretability and Trustworthiness Modern recognition systems function as black boxes, often outputting similarity scores without rationale. Despite efforts in visualizing attention or embedding distances, inter- pretability remains minimal. In high-stakes scenarios such as border control or forensic analysis, models must provide calibrated confidence, clear failure modes, and possibly human- interpretable explanations. 169 8.2.4 Identity Capacity of Generative Models An emerging question in synthetic dataset design is not just whether generated faces look realistic, but how many truly distinct and usable identities a generative model can produce. This is fundamentally a question of identity capacity: given a fixed number of real training images, how many well-separated subjects can a model generate? DCFace [124], trained on 52k real face images, generates 20k new synthetic identities. In contrast, Vec2Face [255], trained on a much larger dataset (360k images), achieves up to 200k well-separated identities. This scaling behavior demonstrates that generative identity capacity is closely related to the diversity and richness of the real training data. Recent work by Boddeti et al. [26] offers a principled statistical framework for estimating the upper bound of this capacity, framing it as a hyperspherical packing problem in the feature space of a face recognition model. They define capacity as the maximum number of identities that can be placed in this space without exceeding a predefined similarity threshold (related to a false acceptance rate). Their empirical estimates show that StyleGAN3 have a practical upper bound—approximately 1.43 million identities at a 0.1% FAR, which decreases sharply with stricter thresholds. For class-conditional models like DCFace, the capacity was significantly lower, due to its greater intra-class variation. These results underscore an important insight: while generative models can amplify identity diversity, their capacity is not unlimited. The sampling distribution remains bounded by the identity entropy encoded during training. Thus, future research can aim to formalize these constraints, explore the theoretical upper bounds of novel identity generation, and propose methods for synthetic identities to be meaningfully distinct and diverse. 
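To make the hyperspherical-packing view of capacity concrete, the toy sketch below bounds the number of caps of angular radius θ/2 that fit on a unit hypersphere, where θ is the minimum angular separation implied by a cosine-similarity threshold. It is only a geometric illustration of why capacity shrinks sharply as the separation requirement tightens; it is not the estimator of Boddeti et al. [26], and the absolute numbers are meaningless for real face models, whose embeddings occupy a far smaller region than the full sphere.

import math
from scipy.special import betainc

def cap_fraction(phi, d):
    # Fraction of the unit sphere S^(d-1) lying within angle phi of a fixed point.
    return 0.5 * betainc((d - 1) / 2.0, 0.5, math.sin(phi) ** 2)

def packing_estimate(cos_threshold, d=512):
    # Identities must be separated by at least theta = arccos(threshold); a simple
    # counting bound divides the sphere's area by the area of one cap of radius theta/2.
    theta = math.acos(cos_threshold)
    return 1.0 / cap_fraction(theta / 2.0, d)

# A stricter threshold (lower allowed similarity, hence larger required separation)
# yields a far smaller capacity than a looser one.
print(packing_estimate(0.3), packing_estimate(0.7))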
8.2.5 Recognition at Scale: The Challenge of Large Galleries

Beyond academic benchmarks, real-world biometric systems often operate at an entirely different scale. For example, India's national identification system, Aadhaar, maintains biometric records, including face, fingerprint, and iris data, for over 1.4 billion individuals [180]. In such large-scale deployments, the gallery size is not in the thousands but in the hundreds of millions or more. At this scale, even a small drop in recognition accuracy can lead to a significant number of false matches or missed identifications, potentially affecting millions of people.

Gallery Setting           Gallery Size  Rank-1 Acc.  Rank-5 Acc.  TPIR @ FPIR=0.01
Baseline Gallery               202         62.0%        68.2%         46.1%
+1K External Imposters       1,202         56.1%        61.6%         43.7%
+5K External Imposters       5,202         51.1%        57.3%         40.8%
+10K External Imposters     10,202         48.4%        55.1%         38.0%

Table 8.1 Performance degradation on IJB-S (surv2single-small protocol) as the gallery size increases with imposters sampled from an external dataset. The addition of external distractors reveals challenges not captured in standard closed-set benchmarks.

While academic benchmarks like IJB-S offer a valuable setting to evaluate face recognition systems, they often fall short in simulating the true scale and complexity of operational deployments. In real-world applications, systems must search against vast galleries filled with distractors, occlusions, and varying quality levels. To better approximate such conditions, we conducted an experiment where the baseline IJB-S gallery (containing approximately 202 identities in this setup) was augmented by sampling additional imposter identities from an external dataset. We added between 1,000 and 10,000 such imposters and measured the impact on recognition performance using the surv2single-small protocol.

As detailed in Table 8.1, the results show a marked deterioration in accuracy as the number of external distractors increases. This decline highlights a fundamental truth: face recognition, especially when dealing with large galleries containing unknown imposters, remains a significant challenge. This experiment serves as a reminder that deploying face recognition systems at scale introduces complexities not yet fully captured in many academic settings. As we move forward, bridging the gap between benchmark success and real-world reliability, especially under large-scale, open-set conditions, remains a central challenge for the field.
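For clarity, the following is a minimal sketch of the open-set identification metrics reported in Table 8.1 (rank-1 accuracy and TPIR at a fixed FPIR) under their common definitions; the exact IJB-S template construction and score normalization are omitted, and the function name and array layout are illustrative.

import numpy as np

def open_set_metrics(sim, probe_ids, gallery_ids, nonmated_sim, fpir=0.01):
    # sim:          (num_mated_probes, gallery_size) probe-vs-gallery similarities
    # probe_ids:    identity of each mated probe; gallery_ids: identity of each gallery entry
    # nonmated_sim: similarities for probes whose identity is absent from the gallery
    top_idx = sim.argmax(axis=1)
    top_score = sim[np.arange(sim.shape[0]), top_idx]
    correct = gallery_ids[top_idx] == probe_ids          # rank-1 hit against the full gallery
    rank1 = correct.mean()

    # Threshold at which the chosen fraction of non-mated probes would be falsely accepted.
    tau = np.quantile(nonmated_sim.max(axis=1), 1.0 - fpir)
    tpir = np.mean(correct & (top_score >= tau))         # TPIR @ FPIR
    return rank1, tpir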
These could include metrics for: 1) Intervention accuracy (when the model seeks assistance) 2) Hypothesis quality in low-confidence settings 3) Interpretability and decision traceability. 8.3.3 Multimodal and Personalized Recognition Multimodal fusion—combining face, body, gait, or even voice or language—continues to hold promise for boosting robustness. Similarly, personalized models that adapt to specific users or deployment contexts may improve usability. Pursuing these directions will 172 likely involve considerations around continual learning, fusion architectures, and cross-modal representation learning. 8.4 Closing Remarks The journey of face recognition, significantly accelerated by deep learning, has reached impressive milestones, yet it is crucial to acknowledge that the core challenge of reliable biometric identification in diverse, real-world conditions is far from solved. This dissertation, therefore, concludes not at an end-point, but at what appears to be an inflection point for the community. Also the focus may increasingly pivot from pure accuracy maximization towards the creation of systems that embody trustworthiness, interpretability, and effective human interaction. Looking ahead, the goal seems less about marginal gains on leaderboards and more about engineering resilient systems designed to coexist meaningfully with humans, adapt appropriately to context, and navigate ambiguity with grace. 173 BIBLIOGRAPHY [1] [2] [3] [4] InsightFace. https://github.com/deepinsight/insightface.git. Accessed: 2021-09-01. InsightFacePytorch. https://github.com/TreB1eN/InsightFacePytorch.git. Accessed: 2021-09-01. TFace. https://github.com/Tencent/TFace.git. Accessed: 2021-10-03. Face detection vs facial recognition – what’s the difference?, 06 2022. Accessed: 2025- 03-21. [5] Pros and cons of facial recognition, 08 2023. Accessed: 2025-03-21. [6] Why facial recognition is the best biometric, 06 2023. Accessed: 2025-03-21. [7] [8] Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen. Face recognition with local binary patterns. In ECCV, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part I 8, pages 469–481. Springer, 2004. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized Multi-Query transformer models from Multi-Head checkpoints. In arXiv, May 2023. [9] Vítor Albiero. Face analysis pytorch. https://github.com/vitoralbiero/faceanalysis, 2022. [10] Xiang An, Jiankang Deng, Jia Guo, Ziyong Feng, XuHan Zhu, Jing Yang, and Tongliang Liu. Killing two birds with one stone: Efficient and robust training of face recognition cnns by partial fc. In CVPR, pages 4042–4051, 2022. [11] Xiang An, Xuhan Zhu, Yuan Gao, Yang Xiao, Yongle Zhao, Ziyong Feng, Lan Wu, Bin Qin, Ming Zhang, Debing Zhang, et al. Partial fc: Training 10 million identities on a single machine. In ICCV, pages 1445–1449, 2021. [12] Ognjen Arandjelovic, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell. Face recognition with image sets using manifold density divergence. In CVPR, 2005. [13] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In ICCV, 2021. [14] Vishal Asnani, Xi Yin, Tal Hassner, Sijia Liu, and Xiaoming Liu. Proactive image manipulation detection. In CVPR, 2022. [15] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. 
arXiv preprint, 2023. [16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint, 2016. 174 [17] Gwangbin Bae, Martin de La Gorce, Tadas Baltrusaitis, Charlie Hewitt, Dong Chen, Julien Valentin, Roberto Cipolla, and Jingjing Shen. Digiface-1m: 1 million digital face images for face recognition. In WACV, 2023. [18] Romain Beaumont. img2dataset: Easily turn large sets of image urls to an image dataset. https://github.com/rom1504/img2dataset, 2021. [19] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48, 2009. [20] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding. In ICML, 2021. [21] Lacey Best-Rowden, Hu Han, Charles Otto, Brendan F Klare, and Anil K Jain. Unconstrained face recognition: Identifying a person of interest from a media collection. TIFS, 2014. [22] Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, and Filip Pavetic. Flexivit: One model for all patch sizes. In CVPR, 2023. [23] Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Better plain vit baselines for imagenet-1k. arXiv preprint, 2022. [24] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999. [25] Andreas Blattmann, Robin Rombach, Kaan Oktay, and Björn Ommer. Retrieval- augmented diffusion models. arXiv preprint, 2022. [26] Vishnu Naresh Boddeti, Gautam Sreekumar, and Arun Ross. On the biometric capacity of generative face models. In 2023 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2023. [27] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv preprint, 2022. [28] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023. [29] Daniel Bolya and Judy Hoffman. Token merging for fast stable diffusion. In CVPR Workshop, pages 4598–4602, Mar. 2023. [30] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint, 2018. [31] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017. 175 [32] Mikhail S Burtsev, Yuri Kuratov, Anton Peganov, and Grigory V Sapunov. Memory transformer. arXiv preprint, 2020. [33] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In FG, 2018. [34] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. PAMI, 2019. [35] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, pages 7291–7299, 2017. [36] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. [37] Hakan Cevikalp and Bill Triggs. Face recognition based on image sets. In CVPR, 2010. [38] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. Data uncertainty learning in face recognition. In CVPR, pages 5710–5719, 2020. [39] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. 
Data uncertainty learning in face recognition. In CVPR, 2020. [40] Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 8126–8133, 2019. [41] Jiaxing Chen, Xinyang Jiang, Fudong Wang, Jun Zhang, Feng Zheng, Xing Sun, and Wei-Shi Zheng. Learning 3d shape feature for texture-insensitive person re-identification. In CVPR, 2021. [42] Jun-Cheng Chen, Rajeev Ranjan, Amit Kumar, Ching-Hui Chen, Vishal M Patel, and Rama Chellappa. An end-to-end system for unconstrained face verification with deep convolutional neural networks. In ICCVW, 2015. [43] Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv preprint, 2024. [44] Weihua Chen, Xianzhe Xu, Jian Jia, Hao Luo, Yaohua Wang, Fan Wang, Rong Jin, and Xiuyu Sun. Beyond appearance: a semantic controllable self-supervised learning framework for human-centric visual tasks. In CVPR, 2023. [45] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021. [46] Zhiyi Cheng, Xiatian Zhu, and Shaogang Gong. Low-resolution face recognition. In Asian Conference on Computer Vision, pages 605–621, 2018. 176 [47] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to- image translation. In CVPR, 2018. [48] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint, 2021. [49] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR workshops, pages 702–703, 2020. [50] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In ACL, pages 2978–2988, 2019. [51] Mostafa Dehghani, Basil Mustafa, Josip Djolonga, Jonathan Heek, Matthias Minderer, Mathilde Caron, Andreas Steiner, Joan Puigcerver, Robert Geirhos, Ibrahim M Alab- dulmohsin, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution. In NeurIPS, 2024. [52] Jiankang Deng, Shiyang Cheng, Niannan Xue, Yuxiang Zhou, and Stefanos Zafeiriou. UV-GAN: Adversarial facial uv map completion for pose-invariant face recognition. In CVPR, 2018. [53] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. Ieee, 2009. [54] Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. In CVPR, pages Retinaface: Single-shot multi-level face localisation in the wild. 5203–5212, 2020. [55] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, pages 4690–4699, 2019. [56] Jiankang Deng, Jia Guo, Jing Yang, Alexandros Lattas, and Stefanos Zafeiriou. Vari- ational prototype learning for deep face recognition. In CVPR, pages 11906–11915, 2021. [57] Jiankang Deng, Jia Guo, Debing Zhang, Yafeng Deng, Xiangju Lu, and Song Shi. Lightweight face recognition challenge. In ICCV Workshops, pages 0–0, 2019. [58] Li Deng. The mnist database of handwritten digit images for machine learning research. 
IEEE Signal Processing Magazine, 29(6):141–142, 2012. [59] Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, and Xin Tong. Disentangled and controllable face image generation via 3D imitative-contrastive learning. In CVPR, 2020. 177 [60] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint, 2020. [61] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. [62] Huanzhang Dou, Pengyi Zhang, Wei Su, Yunlong Yu, Yining Lin, and Xi Li. Gaitgci: Generative counterfactual intervention for gait recognition. In CVPR, 2023. [63] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107, 2018. [64] Joshua J Engelsma, Steven A Grosz, and Anil K Jain. Printsgan: synthetic fingerprint generator. TPAMI, 2022. [65] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, 1996. [66] Chao Fan, Saihui Hou, Yongzhen Huang, and Shiqi Yu. Exploring deep models for practical gait recognition. arXiv preprint, 2023. [67] Chao Fan, Junhao Liang, Chuanfu Shen, Saihui Hou, Yongzhen Huang, and Shiqi Yu. Opengait: Revisiting gait recognition toward better practicality. arXiv preprint, 2022. [68] Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In ECCV, pages 396–414, 2022. [69] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3d face recon- struction and dense alignment with position map regression network. In ECCV, pages 534–551, 2018. [70] Dengpan Fu, Dongdong Chen, Hao Yang, Jianmin Bao, Lu Yuan, Lei Zhang, Houqiang Li, Fang Wen, and Dong Chen. Large-scale pre-training for person re-identification with noisy labels. arXiv preprint, 2022. [71] Yixiao Ge, Dapeng Chen, and Hongsheng Li. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In ICLR, 2020. [72] Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, et al. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. In NeurIPS, 2020. [73] Baris Gecer, Binod Bhattarai, Josef Kittler, and Tae-Kyun Kim. Semi-supervised adversarial learning to generate photorealistic face images of new identities from 3D morphable model. In ECCV, 2018. 178 [74] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In ICML, pages 1243–1252. PMLR, 2017. [75] Zhenglin Geng, Chen Cao, and Sergey Tulyakov. 3D guided fine-grained face manipula- tion. In CVPR, 2019. [76] Sharath Girish, Saksham Suri, Sai Saketh Rambhatla, and Abhinav Shrivastava. To- wards discovery and attribution of open-world gan generated images. In ICCV, 2021. [77] Sixue Gong, Yichun Shi, and Anil Jain. Low quality video face recognition: Multi-mode aggregation recurrent network (MARN). In ICCVW, 2019. [78] Sixue Gong, Yichu Shi, Nathan D Kalka, and Anil K Jain. 
Video face recognition: Component-wise feature aggregation network (C-Fan). In ICB, 2019. [79] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11), 2020. [80] Xinqian Gu, Hong Chang, Bingpeng Ma, Shutao Bai, Shiguang Shan, and Xilin Chen. Clothes-changing person re-identification with RGB modality only. In CVPR, 2022. [81] J Gui, Z Sun, Y Wen, D Tao, and J Ye. A review on generative adversarial networks: Algorithms, theory, and applications. arxiv preprint arxiv: 200106937. 2020. [82] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, pages 87–102, 2016. [83] Ryo Hachiuma, Fumiaki Sato, and Taiki Sekii. Unified keypoint-based action recognition framework via structured keypoint pooling. In CVPR, pages 22962–22971, 2023. [84] Mehrtash T Harandi, Conrad Sanderson, Sareh Shirazi, and Brian C Lovell. Graph embedding discriminant analysis on grassmannian manifolds for improved image set matching. In CVPR, 2011. [85] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022. [86] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016. [87] Mingxing He, Shi-Jinn Horng, Pingzhi Fan, Ray-Shine Run, Rong-Jian Chen, Jui- Lin Lai, Muhammad Khurram Khan, and Kevin Octavius Sentosa. Performance evaluation of score level fusion in multimodal biometric systems. Pattern Recognition, 43(5):1789–1800, 2010. [88] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In CVPR, pages 558–567, 2019. 179 [89] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embed- ding for vision transformer. In ECCV (ECCV), 2024. [90] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 30, 2017. [91] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33, 2020. [92] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022. [93] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computa- tion, 1997. [94] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In CVPR, pages 8129–8138, 2020. [95] Peixian Hong, Tao Wu, Ancong Wu, Xintong Han, and Wei-Shi Zheng. Fine-grained shape-appearance mutual learning for cloth-changing person re-identification. In CVPR, 2021. [96] Qiyang Hu, Attila Szabó, Tiziano Portenier, Paolo Favaro, and Matthias Zwicker. Disentangling factors of variation by mixing them. In CVPR, 2018. [97] Yiqun Hu, Ajmal S Mian, and Robyn Owens. Sparse approximated nearest points for image set classification. In CVPR, 2011. [98] Yuan-Ting Hu, Jiahong Wang, Raymond A Yeh, and Alexander G Schwing. Sail-vos 3d: A synthetic dataset and baselines for object detection and 3d mesh reconstruction from video data. In CVPR, 2021. [99] Gary Huang, Marwan Mattar, Honglak Lee, and Erik Learned-Miller. Learning to align from scratch. NeurIPS, 25, 2012. 
[100] Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled Faces in the Wild: A database forstudying face recognition in unconstrained environments. In Workshop on Faces in’Real-Life’Images: Detection, Alignment, and Recognition, 2008. [101] Yuge Huang, Pengcheng Shen, Ying Tai, Shaoxin Li, Xiaoming Liu, Jilin Li, Feiyue Huang, and Rongrong Ji. Improving face recognition from hard samples via distribution distillation loss. In ECCV, pages 138–154, 2020. [102] Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. CurricularFace: adaptive curriculum learning loss for deep face recognition. In CVPR, pages 5901–5910, 2020. [103] Yan Huang, Qiang Wu, Jingsong Xu, and Yi Zhong. Celebrities-ReID: A benchmark for clothes variation in long-term person re-identification. In IJCNN, 2019. 180 [104] Yan Huang, Jingsong Xu, Qiang Wu, Yi Zhong, Peng Zhang, and Zhaoxiang Zhang. Beyond scalar neuron: Adopting vector-neuron capsules for long-term person re- identification. TCSVT, 2019. [105] Zhiheng Huang, Davis Liang, Peng Xu, and Bing Xiang. Improve transformer models with better relative position embeddings. In EMNLP, pages 3327–3335, Online, Nov. 2020. [106] Zhiwu Huang, Ruiping Wang, Shiguang Shan, Xianqiu Li, and Xilin Chen. Log- euclidean metric learning on symmetric positive definite manifold with application to image set classification. In ICML, 2015. [107] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. [108] Anil K Jain, Karthik Nandakumar, and Arun Ross. 50 years of biometric research: Accomplishments, challenges, and opportunities. Pattern recognition letters, 79:80–105, 2016. [109] Xiaoyi Jiang, Michael Binkert, Bernard Achermann, and Horst Bunke. Towards detection of glasses in facial images. Pattern Analysis & Applications, 3(1), 2000. [110] Xin Jin, Tianyu He, Kecheng Zheng, Zhiheng Yin, Xu Shen, Zhen Huang, Ruoyu Feng, Jianqiang Huang, Zhibo Chen, and Xian-Sheng Hua. Cloth-changing person re-identification from a single image with gait prediction and regularization. In CVPR, 2022. [111] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023. [112] Nathan D Kalka, Brianna Maze, James A Duncan, Kevin O’Connor, Stephen Elliott, IJB–S: IARPA Janus Surveillance Kaleb Hebert, Julia Bryan, and Anil K Jain. Video Benchmark. In 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), pages 1–9, 2018. [113] Takeo Kanade. Picture processing system by computer complex and recognition of human faces. 1974. [114] Bong-Nam Kang, Yonghyun Kim, Bongjin Jun, and Daijin Kim. Attentional feature-pair relation networks for accurate face recognition. In ICCV, pages 5472–5481, 2019. [115] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018. [116] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. [117] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In CVPR, 2020. 181 [118] Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. Sapiens: Foundation for human vision models. In ECCV, pages 206–228. Springer, 2025. 
[119] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, and Christian Theobalt. Deep video portraits. TOG, 2018. [120] Insoo Kim, Seungju Han, Ji-won Baek, Seong-Jin Park, Jae-Joon Han, and Jinwoo In CVPR, pages Shin. Quality-agnostic image recognition via invertible decoder. 12257–12266, 2021. [121] Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. In WACV, pages 1383–1392, 2024. [122] Minchul Kim, Anil K Jain, and Xiaoming Liu. AdaFace: Quality adaptive margin for face recognition. In CVPR, 2022. [123] Minchul Kim, Feng Liu, Anil Jain, and Xiaoming Liu. Cluster and aggregate: Face recognition with large probe set. NeurIPS, 2022. [124] Minchul Kim, Feng Liu, Anil Jain, and Xiaoming Liu. DCFace: Synthetic face generation with dual condition diffusion model. 2023. [125] Minchul Kim, Yiyang Su, Feng Liu, Anil Jain, and Xiaoming Liu. Keypoint relative position encoding for face recognition. In CVPR, 2024. [126] Minchul Kim, Dingqiang Ye, Yiyang Su, Feng Liu, and Xiaoming Liu. Sapiensid: Foundation for human recognition. In CVPR, 2025. [127] Tae Hyun Kim, Mehdi SM Sajjadi, Michael Hirsch, and Bernhard Scholkopf. Spatio- temporal transformer network for video restoration. In ECCV, 2018. [128] Yonghyun Kim, Wonpyo Park, and Jongju Shin. BroadFace: Looking at tens of thousands of people at once for face recognition. In ECCV, pages 536–552, 2020. [129] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. [130] Brendan Klare, Alessandra A Paulino, and Anil K Jain. Analysis of facial features in identical twins. In 2011 International Joint Conference on Biometrics (IJCB), pages 1–8. IEEE, 2011. [131] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kris- ten Allen, Patrick Grother, Alan Mah, and Anil K Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark-A. In CVPR, 2015. 182 [132] Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In CVPR, 2020. [133] David Kupas and Balazs Harangi. Solving the problem of imbalanced dataset with synthetic image generation for cell classification using deep learning. In EMBC, 2021. [134] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. NeurIPS, 32, 2019. [135] Kenneth Lai, Leonardo Queiroz, Vlad Shmerko, Kelly Sundberg, and Svetlana Yanushke- vich. Post-pandemic follow-up audit of security checkpoints. IEEE Access, 11:7599–7616, 2023. [136] HyunJae Lee, Hyo-Eun Kim, and Hyeonseob Nam. Srm: A style-based recalibration module for convolutional neural networks. In ICCV, 2019. [137] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised person re-identification by deep learning tracklet association. In ECCV, 2018. [138] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised tracklet person re- identification. PAMI, 2019. [139] Shen Li, Jianqing Xu, Xiaqing Xu, Pengcheng Shen, Shaoxin Li, and Bryan Hooi. Spherical confidence learning for face recognition. In CVPR, pages 15629–15637, 2021. [140] Yu-Jhe Li, Xinshuo Weng, and Kris M Kitani. 
Learning shape representations for person re-identification under clothing change. In WACV, 2021. [141] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In ICCV, 2021. [142] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. Feb. 2022. [143] Jianxin Lin, Yingce Xia, Tao Qin, Zhibo Chen, and Tie-Yan Liu. Conditional image-to- image translation. In CVPR, 2018. [144] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017. [145] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017. [146] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, 2019. 183 [147] Feng Liu, Ryan Ashbaugh, Nicholas Chimitt, Najmul Hassan, Ali Hassani, Ajay Jaiswal, Minchul Kim, Zhiyuan Mao, Christopher Perry, Zhiyuan Ren, et al. Farsight: A physics- driven whole-body biometric system at large distance and altitude. In WACV, 2024. [148] Feng Liu, Ryan Ashbaugh, Nicholas Chimitt, Najmul Hassan, Ali Hassani, Ajay Jaiswal, Minchul Kim, Zhiyuan Mao, Christopher Perry, Zhiyuan Ren, Yiyang Su, Pegah Varghaei, Kai Wang, Xingguang Zhang, Stanley Chan, Arun Ross, Humphrey Shi, Zhangyang Wang, Anil Jain, and Xiaoming Liu. Farsight: A physics-driven whole-body biometric system at large distance and altitude. In WACV, 2024. [149] Feng Liu, Minchul Kim, ZiAng Gu, Anil Jain, and Xiaoming Liu. Learning clothing and pose invariant 3d shape representation for long-term person re-identification. In ICCV, 2023. [150] Feng Liu, Minchul Kim, Anil Jain, and Xiaoming Liu. Controllable and guided face synthesis for unconstrained face recognition. In ECCV, 2022. [151] Feng Liu, Minchul Kim, Zhiyuan Ren, and Xiaoming Liu. Distilling clip with dual guidance for learning discriminative human body shape representation. In CVPR, 2024. [152] Jiaheng Liu, Yudong Wu, Yichao Wu, Chuming Li, Xiaolin Hu, Ding Liang, and Mengyu Wang. DAM: Discrepancy alignment metric for face recognition. In ICCV, pages 3814–3823, 2021. [153] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng- Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016. [154] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. In CVPR, pages SphereFace: Deep hypersphere embedding for face recognition. 212–220, 2017. [155] Xiaoming Liu and Tsuhan Chen. Video-based face recognition using adaptive hidden markov models. In CVPR, 2003. [156] Xiaofeng Liu, Zhenhua Guo, Site Li, Lingsheng Kong, Ping Jia, Jane You, and BVK Kumar. Permutation-invariant feature restructuring for correlation-aware image set- based recognition. In ICCV, 2019. [157] Xiaofeng Liu, BVK Kumar, Chao Yang, Qingming Tang, and Jane You. Dependency- aware attention control for unconstrained face recognition with image sets. In ECCV, 2018. [158] Yaojie Liu and Xiaoming Liu. Spoof trace disentanglement for generic face anti-spoofing. TPAMI, 45(3), 2023. [159] Yu Liu, Junjie Yan, and Wanli Ouyang. Quality aware network for set to set recognition. In CVPR, 2017. 
184 [160] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. [161] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015. [162] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv preprint, 2021. [163] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint, 2016. [164] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017. [165] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60:91–110, 2004. [166] Jiwen Lu, Gang Wang, and Pierre Moulin. Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning. In ICCV, 2013. [167] Kang Ma, Ying Fu, Dezhi Zheng, Chunshui Cao, Xuecai Hu, and Yongzhen Huang. Dynamic aggregated network for gait recognition. In CVPR, 2023. [168] Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, and Oncel Tuzel. Token pooling in vision transformers. In arxiv, Oct. 2021. [169] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, and Patrick Grother. IARPA Janus Benchmark-C: Face dataset and protocol. In 2018 International Conference on Biometrics (ICB), pages 158–165, 2018. [170] Safa C. Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B. Tenenbaum, Xiaoming Liu, and Tim K. Marks. MOST-GAN: 3d morphable stylegan for disentangled face image manipulation. In AAAI, 2022. [171] Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. AdaViT: Adaptive vision transformers for efficient image recognition. In CVPR, pages 12309–12318, June 2022. [172] Qiang Meng, Shichao Zhao, Zhida Huang, and Feng Zhou. MagFace: A universal representation for face recognition and quality assessment. In CVPR, pages 14225–14234, 2021. [173] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012. 185 [174] Stylianos Moschoglou, Athanasios Papaioannou, Christos Sagonas, Jiankang Deng, Irene Kotsia, and Stefanos Zafeiriou. AGEDB: the first manually collected, in-the-wild age database. In CVPR Workshops, pages 51–59, 2017. [175] Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias. In CVPR, 2021. [176] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In ICCV, 2021. [177] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499. Springer, 2016. [178] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yong-Liang Yang. HoloGAN: Unsupervised learning of 3d representations from natural images. In ICCV, 2019. [179] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion proba- bilistic models. In ICML, pages 8162–8171. PMLR, 2021. [180] Unique Identification Authority of India (UIDAI). Aadhaar dashboard. https://uidai.gov.in/aadhaardashboard/, 2024. Accessed: 2024-04-01. [181] Necmiye Ozay, Yan Tong, Frederick W. Wheeler, and Xiaoming Liu. 
Improving face recognition with a quality-based probabilistic framework. In Proceeding of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 134–141, 2009. [182] Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED: Interpretability-Aware redundancy reduction for vision transformers. In NeurIPS, volume 34, pages 24898–24911, 2021. [183] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, pages 4903–4911, 2017. [184] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC, 2015. [185] Jingtan Piao, Chen Qian, and Hongsheng Li. Semi-supervised monocular 3D face reconstruction with end-to-end shape-preserved domain transfer. In ICCV, 2019. [186] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwa- janakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In CVPR, 2022. [187] Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In ECCV, 2018. 186 [188] Haibo Qiu, Baosheng Yu, Dihong Gong, Zhifeng Li, Wei Liu, and Dacheng Tao. SynFace: Face recognition with synthetic data. In ICCV, pages 10880–10890, 2021. [189] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. NeurIPS, 32, 2019. [190] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierar- chical text-conditional image generation with clip latents. arXiv preprint, 2022. [191] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint, 2017. [192] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. In NeurIPS, Nov. 2021. [193] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion param- eters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020. [194] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real- time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. [195] Syed A Rizvi, P Jonathon Phillips, and Hyeonjoon Moon. The FERET verification testing protocol for face recognition algorithms. In FG, 1998. [196] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. [197] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. Advances in neural information processing systems, 30, 2017. [198] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. NeurIPS, 29, 2016. [199] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510– 4520, 2018. [200] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embed- ding for face recognition and clustering. In CVPR, pages 815–823, 2015. 
[201] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint, 2021. 187 [202] Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chel- lappa, and David W Jacobs. Frontal to profile face verification in the wild. In WACV, pages 1–9, 2016. [203] Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. De-fake: Detection and attribution of fake images generated by text-to-image diffusion models. arXiv preprint, 2022. [204] Gregory Shakhnarovich, John W Fisher, and Trevor Darrell. Face recognition from long-term observations. In ECCV, 2002. [205] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint, 2018. [206] Hamid R Sheikh and Alan C Bovik. Image information and visual quality. IEEE Transactions on Image Processing, 15(2):430–444, 2006. [207] Yujun Shen, Bolei Zhou, Ping Luo, and Xiaoou Tang. Facefeat-GAN: a two-stage approach for identity-preserving face synthesis. arXiv preprint, 2018. [208] Yichun Shi and Anil K Jain. Probabilistic face embeddings. In ICCV, pages 6902–6911, 2019. [209] Yichun Shi and Anil K Jain. Probabilistic face embeddings. In ICCV, 2019. [210] Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, and Anil K Jain. Towards universal representation learning for deep face recognition. In CVPR, pages 6817–6826, 2020. [211] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, pages 761–769, 2016. [212] Xiujun Shu, Xiao Wang, Xianghao Zang, Shiliang Zhang, Yuanqi Chen, Ge Li, and Qi Tian. Large-scale spatio-temporal person re-identification: Algorithms and benchmark. TCSVT, 2021. [213] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. [214] Vladimir Somers, Alexandre Alahi, and Christophe De Vleeschouwer. Keypoint prompt- able re-identification. In ECCV, 2025. [215] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021. [216] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. NeurIPS, 34, 2021. [217] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. NeurIPS, 32, 2019. 188 [218] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. NeurIPS, 33:12438–12448, 2020. [219] Joel Stehouwer, Amin Jourabloo, Yaojie Liu, and Xiaoming Liu. Noise modeling, synthesis and classification for generic object anti-spoofing. In CVPR, 2020. [220] Yukun Su, Guosheng Lin, Jinhui Zhu, and Qingyao Wu. Human interaction learning on 3d skeleton point clouds for video violence recognition. In ECCV, pages 74–90. Springer, 2020. [221] Tiancheng Sun, Jonathan T Barron, Yun-Ta Tsai, Zexiang Xu, Xueming Yu, Graham Fyffe, Christoph Rhemann, Jay Busch, Paul E Debevec, and Ravi Ramamoorthi. Single image portrait relighting. TOG, 2019. [222] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018. [223] Thomas Swearingen and Arun Ross. Lookalike disambiguation: Improving face identifi- cation performance at top ranks. 
In 2020 25th International Conference on Pattern Recognition (ICPR), pages 10508–10515. IEEE, 2021. [224] Ying Tai, Yicong Liang, Xiaoming Liu, Lei Duan, Jilin Li, Chengjie Wang, Feiyue Huang, and Yu Chen. Towards highly accurate and stable face alignment for high-resolution videos. In AAAI, 2019. [225] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014. [226] Yi Tay, Vinh Q Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. Transformer memory as a differentiable search index. In NeurIPS, 2022. [227] Philipp Terhörst, Jan Niklas Kolf, Naser Damer, Florian Kirchbuchner, and Arjan Kuijper. Ser-fiq: unsupervised estimation of face image quality based on stochastic embedding robustness. in 2020 ieee. In CVF Conference on Computer Vision and Pattern Recognition, CVPR, pages 13–19, 2020. [228] Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, and Phillip Isola. Learning vision from models rivals learning vision from data. In CVPR, pages 15887–15898, 2024. [229] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. In NeurIPS, 2021. [230] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablay- rolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021. 189 [231] Luan Tran, Xi Yin, and Xiaoming Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proceeding of IEEE Computer Vision and Pattern Recognition, pages 1415–1424, 2017. [232] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPRW, 2018. [233] Matthew A Turk, Alex Pentland, et al. Face recognition using eigenfaces. In CVPR, volume 91, pages 586–591, 1991. [234] Boris van Breugel, Trent Kyono, Jeroen Berrevoets, and Mihaela van der Schaar. Decaf: Generating fair synthetic data using causally-aware generative networks. NeurIPS, 34:22221–22233, 2021. [235] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008. [236] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. [237] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, volume 1, pages I–I. Ieee, 2001. [238] Fangbin Wan, Yang Wu, Xuelin Qian, Yixiong Chen, and Yanwei Fu. When person re-identification meets changing clothes. In CVPRW, 2020. [239] Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. NormFace: L2 hyper- sphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049, 2017. [240] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, pages 5265–5274, 2018. [241] Jingya Wang, Xiatian Zhu, Shaogang Gong, and Wei Li. 
Transferable joint attribute- identity deep learning for unsupervised person re-identification. In CVPR, 2018. [242] Lei Wang, Bo Liu, Fangfang Liang, and Bincheng Wang. Hierarchical spatio-temporal representation learning for gait recognition. In ICCV, 2023. [243] Ming Wang, Xianda Guo, Beibei Lin, Tian Yang, Zheng Zhu, Lincheng Li, Shunli Zhang, and Xin Yu. Dygait: Exploiting dynamic representations for high-performance gait recognition. In ICCV, 2023. [244] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao. Manifold-manifold distance with application to face recognition based on image set. In CVPR, 2008. 190 [245] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, 2020. [246] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. arXiv preprint, 2022. [247] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018. [248] Xiaobo Wang, Shifeng Zhang, Shuo Wang, Tianyu Fu, Hailin Shi, and Tao Mei. Mis- classified vector guided softmax loss for face recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12241–12248, 2020. [249] Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, and Gao Huang. Not all images are worth 16x16 words: Dynamic transformers for efficient image recognition. In NeurIPS, 2021. [250] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, 2018. [251] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016. [252] Frederick W Wheeler, Xiaoming Liu, and Peter H Tu. Multi-frame super-resolution for face recognition. In 2007 First IEEE International Conference on Biometrics: Theory, Applications, and Systems, 2007. [253] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. IARPA Janus Benchmark-B face dataset. In CVPR Workshops, pages 90–98, 2017. [254] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image- models, 2019. [255] Haiyu Wu, Jaskirat Singh, Sicong Tian, Liang Zheng, and Kevin W Bowyer. Vec2face: Scaling face dataset generation with loosely constrained vectors. arXiv preprint, 2024. [256] Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, and Hongyang Chao. Rethinking In ICCV, pages and improving relative position encoding for vision transformer. 10033–10041, 2021. [257] Tianxing Wu. Realtime glasses detection. https://github.com/TianxingWu/realtime- glasses-detection, 2022. [258] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, pages 2129–2138, 2018. [259] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018. 191 [260] Andre Brasil Vieira Wyzykowski and Anil K Jain. Synthetic latent fingerprint generator. In WACV, 2023. [261] Jiahao Xia, Weiwei Qu, Wenjian Huang, Jianguo Zhang, Xi Wang, and Min Xu. Sparse local patch transformer for robust face alignment and landmarks inherent relation learning. In CVPR, pages 4052–4061, 2022. [262] Taihong Xiao, Jiapeng Hong, and Jinwen Ma. Elegant: Exchanging latent encodings with GAN for transferring multiple face attributes. In ECCV, 2018. 
[263] Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, and Xiaolong Wang. GroupViT: Semantic segmentation emerges from text supervision. In CVPR, 2022. [264] Peng Xu and Xiatian Zhu. DeepChange: A large long-term person re-identification benchmark with clothes change. 2021. [265] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. [266] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua. Neural aggregation network for video face recognition. In CVPR, 2017. [267] Meng Yang, Pengfei Zhu, Luc Van Gool, and Lei Zhang. Face recognition based on regularized nearest points between image sets. In FG, 2013. [268] Qize Yang, Ancong Wu, and Wei-Shi Zheng. Person re-identification by contour sketch under moderate clothing change. PAMI, 2019. [269] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In CVPR, pages 5525–5533, 2016. [270] Zhilin Yang. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint, 2019. [271] Bangpeng Yao and Li Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010. [272] Dingqiang Ye, Chao Fan, Jingzhe Ma, Xiaoming Liu, and Shiqi Yu. Biggait: Learning gait representation you want by large vision models. In CVPR, 2024. [273] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6):2872–2893, 2021. [274] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint, 2014. 192 [275] Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-ViT: Adaptive tokens for efficient vision transformer. In CVPR, pages 10809–10818, June 2022. [276] Xi Yin, Ying Tai, Yuge Huang, and Xiaoming Liu. FAN: Feature adaptation network for surveillance face recognition and normalization. In Proceedings of the Asian Conference on Computer Vision, pages 301–319, 2020. [277] Hao Yu and Jianxin Wu. A unified pruning framework for vision transformers. volume 66, pages 1–2, Apr. 2023. [278] Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, and Jian- Huang Lai. Unsupervised person re-identification by soft multilabel learning. 2019. [279] Ning Yu, Larry S Davis, and Mario Fritz. Attributing fake images to gans: Learning and analyzing gan fingerprints. In ICCV, 2019. [280] Shijie Yu, Shihua Li, Dapeng Chen, Rui Zhao, Junjie Yan, and Yu Qiao. COCAS: A large-scale clothes changing person dataset for re-identification. In CVPR, 2020. [281] Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, et al. Hap: Structure-aware masked image modeling for human-centric perception. In NeurIPS, 2023. [282] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021. [283] Guangtao Zhai and Xiongkuo Min. Perceptual image quality assessment: a survey. Science China Information Sciences, 63(11):211301, 2020. [284] Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, and Yonghong Tian. 
Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification. In CVPR, 2020. [285] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016. [286] Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. Adacos: Adaptively scaling cosine logits for effectively learning deep face representations. In CVPR, pages 10823–10832, 2019. [287] Xiao Zhang, Rui Zhao, Junjie Yan, Mengya Gao, Yu Qiao, Xiaogang Wang, and Hongsheng Li. P2sGrad: Refined gradients for optimizing deep face models. In CVPR, pages 9906–9914, 2019. [288] Yanyi Zhang, Xinyu Li, Chunhui Liu, Bing Shuai, Yi Zhu, Biagio Brattoli, Hao Chen, Ivan Marsic, and Joseph Tighe. Vidtr: Video transformer without convolutions. In ICCV, 2021. 193 [289] Ziyuan Zhang, Luan Tran, Feng Liu, and Xiaoming Liu. On learning disentangled representations for gait recognition. IEEE T-PAMI, 44(1):345–360, 2020. [290] Weisong Zhao, Xiangyu Zhu, Kaiwen Guo, Xiao-Yu Zhang, and Zhen Lei. Grouped knowledge distillation for deep face recognition. AAAI, 2023. [291] Jinkai Zheng, Xinchen Liu, Xiaoyan Gu, Yaoqi Sun, Chuang Gan, Jiyong Zhang, Wu Liu, and Chenggang Yan. Gait recognition in the wild with multi-hop temporal switch. In Proceedings of the 30th ACM International Conference on Multimedia, 2022. [292] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. Gait recognition in the wild with dense 3d representations and a benchmark. In CVPR, pages 20228–20237, 2022. [293] Jingxiao Zheng, Rajeev Ranjan, Ching-Hui Chen, Jun-Cheng Chen, Carlos D Castillo, and Rama Chellappa. An automatic system for unconstrained video-based face recogni- tion. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2(3):194–209, 2020. [294] Jingxiao Zheng, Ruichi Yu, Jun-Cheng Chen, Boyu Lu, Carlos D Castillo, and Rama Chellappa. Uncertainty modeling of contextual-connections between tracklets for unconstrained video-based face recognition. In ICCV, pages 703–712, 2019. [295] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015. [296] Tianyue Zheng and Weihong Deng. Cross-Pose LFW: A database for studying cross- pose face recognition in unconstrained environments. Beijing University of Posts and Telecommunications, Tech. Rep, 5:7, 2018. [297] Tianyue Zheng, Weihong Deng, and Jiani Hu. Cross-Age LFW: A database for studying cross-age face recognition in unconstrained environments. CoRR, abs/1708.08197, 2017. [298] Shaohua Zhou, Volker Krueger, and Rama Chellappa. Probabilistic recognition of human faces from video. CVIU, 91(1-2):214–245, 2003. [299] Yanjia Zhu, Hongxiang Cai, Shuhan Zhang, Chenhao Wang, and Yichao Xiong. Tinaface: Strong but simple baseline for face detection. arXiv preprint, 2020. [300] Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu, Tian Yang, Jiwen Lu, Dalong Du, and Jie Zhou. WebFace260M: A benchmark unveiling the power of million-scale deep face recognition. In CVPR, pages 10492–10502, 2021. [301] Jiaxuan Zhuo, Zeyu Chen, Jianhuang Lai, and Guangcong Wang. Occluded person re-identification. In ICME, 2018. [302] Hasib Zunair and A Ben Hamza. Synthesis of covid-19 chest x-rays using unpaired image-to-image translation. Social network analysis and mining, 11(1), 2021. 194