ON THE INTERACTION BETWEEN MACHINE LEARNING ROBUSTNESS AND FAIRNESS: PROBLEMS, THEORIES AND METHODS

By Han Xu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science — Doctor of Philosophy

2024

ABSTRACT

Machine Learning (ML) and Artificial Intelligence (AI) have become powerful tools for solving a wide range of tasks, from language processing and image recognition to more complex fields such as healthcare and finance. These advanced capabilities are transforming industries and improving efficiency and accuracy in unprecedented ways. However, alongside the remarkable benefits of ML, there are significant concerns about its trustworthiness. Key among these concerns are issues related to safety, fairness, privacy, and explainability. Safety, particularly in terms of robustness to perturbations, is a crucial aspect of ML trustworthiness. It refers to the reliability of a model's predictions when faced with small changes or noise in the input data. Fairness is another critical dimension of ML trustworthiness: ML models should provide equalized prediction quality across different groups of people, ensuring that no individual or group is unfairly disadvantaged by the model's outputs. Although various techniques have been proposed to address these issues individually, the internal relationship between robustness and fairness remains largely unexplored. For instance, it is unclear whether enhancing robustness will inadvertently compromise fairness, or vice versa. This is a crucial question for applications like facial recognition systems, where both safety and equitable prediction quality are essential. If we aim to achieve both robustness and fairness, what is the optimal strategy to ensure both? In this thesis, we conduct a comprehensive study to explore the interplay between robustness and fairness in ML models. Our findings reveal a strong tension between these two factors: enhancing robustness can negatively impact fairness, and improving fairness can reduce robustness. This trade-off poses significant challenges for developing AI systems that are both safe and equitable. Based on our studies, we provide new insights into ML models and propose promising strategies to balance robustness and fairness.

To my family and friends for the love and support.

ACKNOWLEDGEMENTS

First, I am deeply grateful to my advisor, Dr. Jiliang Tang, for his invaluable advice, inspiration, encouragement, and support throughout my Ph.D. journey. Throughout this period, I learned so much from him. I would like to express my deepest appreciation to my dissertation committee members, Dr. Anil K. Jain, Dr. Charu C. Aggarwal, Dr. Sijia Liu, and Dr. Yuying Xie, for their insightful comments and helpful suggestions. Besides, I was fortunate to work as an intern at VISA Research with amazing colleagues and mentors: Dr. Menghai Pan, Dr. Huiyuan Chen and Dr. Zhimeng Jiang. I enjoyed the wonderful and productive days with them. During my years at Michigan State University, I was fortunate to have supportive and encouraging colleagues. I am grateful for the academic and personal mentorship of Dr. Xiaorui Liu, Dr. Yao Ma, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Wei Jin and Dr. Tyler Derr. Moreover, it has also been a great pleasure to work, collaborate and debate with Pengfei He, Dr. Wentao Wang, Yaxin Li, Jie Ren, Yuxuan Wan, Yingqian Cui, Shenglai Zeng, Yuping Lin.
Moreover, Dr. Sandeep Kulkarni is the professor who saved me from quitting and taught me how to be kind to others. Finally, I would like to thank my family for their unconditional love and support, and my roommates Tanner and Hunter for their company during the pandemic.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION ........................................................... 1
CHAPTER 2  ENHANCING ROBUSTNESS MAY CAUSE CLASS-WISE UNFAIRNESS ................... 6
CHAPTER 3  ENHANCING ROBUSTNESS MAY AMPLIFY INNER-CLASS UNFAIRNESS ................ 24
CHAPTER 4  ENHANCING ROBUSTNESS MAY AMPLIFY UNFAIRNESS ON IMBALANCED DATA ........ 45
CHAPTER 5  ENHANCING FAIRNESS MAY THREATEN ROBUSTNESS AGAINST POISONING ATTACKS .. 64
CHAPTER 6  CONCLUSION AND FUTURE DIRECTIONS ...................................... 81
BIBLIOGRAPHY ..................................................................... 83
APPENDIX ......................................................................... 95

CHAPTER 1
INTRODUCTION

1.1 Introduction

1.1.1 Motivation

Machine Learning (ML) has been applied across a diverse array of applications, including computer vision, language processing, and numerous other domains. Despite their considerable potential, these technologies may encounter various trustworthiness challenges, particularly when implemented in high-stakes applications such as autonomous vehicles [57] or medical diagnosis and treatment [68]. Specifically, there are profound concerns regarding the (adversarial) robustness [126] and fairness [34, 42] of these models. In detail,

• Adversarial robustness is crucial for ensuring that a model provides correct prediction outcomes, even when the training or test data contain malicious perturbations. For example, an attacker might add doodles onto traffic signs to mislead the recognition system of an autonomous vehicle [76], leading to potentially dangerous situations. Formally, most studies [41, 71, 15] formulate the adversarial attack as an optimization problem: given a classification model 𝑓 and a test sample 𝑥 with true label 𝑦, an attacker aims to introduce a perturbation 𝛿 such that the model's prediction loss is maximized:

$$\max_{\|\delta\| \le \epsilon} \; \mathcal{L}(f(x+\delta), y) \qquad (1.1)$$

Meanwhile, the norm of the perturbation 𝛿 is constrained to keep the perturbation slight. Since the model's loss on 𝑥 + 𝛿 with respect to the true label 𝑦 is high, the model can produce an incorrect prediction. Empirically, numerous studies in the literature demonstrate that popular ML models are highly vulnerable to adversarial attacks. For example, on a benchmark image classification task on ImageNet [57], an attacker can cause the model to have almost 0% accuracy by introducing noise that is imperceptible to human eyes [89]. Therefore, as a countermeasure to protect ML models against these attacks, the adversarial training strategy [71, 135] is one of the most studied methods: it first generates an adversarial perturbation for each training sample by solving the problem in Eq. 1.1, and then trains the model to minimize the error even when these perturbations exist:

$$\min_f \; \mathbb{E}_x \Big[ \max_{\|\delta\| \le \epsilon} \mathcal{L}(f(x+\delta), y) \Big]. \qquad (1.2)$$

In this way, the trained model will give the correct prediction even when such perturbations exist.
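To make Eqs. (1.1) and (1.2) concrete, below is a minimal PyTorch sketch of an 𝑙∞ PGD attack and a single adversarial-training step. The function names, hyperparameter values (e.g., step size 2/255, 10 steps) and the assumption that inputs lie in [0, 1] are illustrative choices, not prescriptions from this thesis.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization of Eq. (1.1) with l_inf PGD."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend on the loss, then project back into the l_inf ball of radius eps.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()   # assumes inputs live in [0, 1]

def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    """One stochastic step of the outer minimization of Eq. (1.2)."""
    model.eval()                                # freeze BN statistics while attacking
    x_adv = pgd_attack(model, x, y, eps=eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)     # minimize the loss on perturbed inputs
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full training run, `adversarial_training_step` would simply be called once per mini-batch inside the usual epoch loop.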
In addition to the case above, where the attacks happen during the inference phase, recent works have also found that slight perturbations in the training dataset of an ML model can greatly compromise the model's performance; such attacks are called data poisoning attacks [100, 46]. As countermeasures against data poisoning attacks, well-known defense strategies include data sanitization methods [29] and robust training [88].

• Fairness aims to ensure equitable treatment and consistent prediction quality across different individuals and groups. Thus, fair machine learning techniques are critical to promote equity in ML applications such as healthcare [72], hiring [96], lending [3], and law enforcement [83]. For linear ML models, improving fairness involves integrating fairness constraints into the optimization process and then solving constrained optimization problems [34, 128, 42]. In deep neural networks, techniques like fair representation learning and adversarial debiasing [129, 132] are used to create feature representations that are invariant to sensitive attributes.

Although multiple methods have been proposed to improve model robustness and fairness separately, it is still unknown whether these two properties interact with each other. Indeed, when we apply adversarial training to improve model robustness, will it impact the model's fairness and cause discrimination? This is a critical question for ML applications where both robustness and fairness are important. For instance, in human facial recognition, both robustness to perturbations and equalized prediction quality among different people are desired. In such an application, what is the optimal strategy to guarantee both robustness and fairness? This thesis aims to answer the questions above by conducting a comprehensive study on the interaction between ML robustness and fairness, uncovering the inner mechanisms of machine learning in terms of these two properties, and providing viable solutions to achieve both robustness and fairness.

1.1.2 Dissertation Contributions

This thesis comprehensively studies the interaction between ML robustness and fairness from different angles and perspectives. Overall, it first presents three works uncovering the impact of improving ML robustness on model fairness, where disparity is considered among different classes, among different sub-groups within each class, and in the scenario of imbalanced datasets. Then, the impact of improving ML fairness on model robustness, especially robustness to poisoning attacks, is discussed. Specifically,

• As one of the most popular and effective methods to improve ML robustness, adversarial training has been widely studied and applied. However, our work [121] has revealed that adversarial training tends to exaggerate class-wise disparity in terms of model accuracy and robustness (accuracy on adversarial examples), which has been validated on various benchmark image classification datasets. This empirical finding demonstrates that adversarial training tends to create obvious weak points in the model's accuracy and safety, which brings great concerns to the application of such methods. In the work [121], a further empirical analysis is conducted to validate that adversarial training naturally tends to sacrifice the performance on hard classes and favor easy classes.
This result and analysis not only uncovers a natural behavior of adversarial training, but also highlights the importance of debiasing when building adversarially robust machine learning models.

• Beyond the fairness / bias among different classes, a follow-up work [124] conducts a further analysis comparing the performance of adversarial training between different subgroups within each class of a classification task. Specifically, the work [124] considers the disparity between samples from the main subgroup (the data distribution from which most samples are drawn) and the rare subgroups (data samples that are not from the main distribution). Empirical results demonstrate that adversarial training tends to result in a greater discrepancy between the main subgroup and rare subgroups. This result shows that adversarial training also results in a greater bias compared to traditional training, which is consistent with the conclusion of the work [121].

• In machine learning practice, it frequently happens that the training data is imbalanced: some classes have many more training samples than others. In the work [111], a study of adversarial training on imbalanced datasets has been conducted. Overall, the work [111] finds that adversarial training tends to cause a greater performance disparity between well-represented classes (the classes with more training samples) and under-represented classes (the classes with fewer training samples). Moreover, the work [111] also demonstrates that adversarial training greatly sacrifices the performance on under-represented classes when applying common imbalanced training schemes such as re-weighting. These findings show that adversarial training also has severe bias / unfairness issues on imbalanced datasets.

• Conversely, fairness training techniques can also bring risks to ML safety. In our work [123], we conduct the first study to show that fair classification methods in linear models can increase vulnerability to data poisoning attacks, a type of adversarial attack that adds perturbations to the training samples of an ML model. Our work provides both experimental and theoretical studies to demonstrate why fair classification methods are more vulnerable to such attacks, and we propose corresponding defenses for safer fair classification.

1.1.3 Dissertation Structure

The remainder of this dissertation is organized as follows, and the overall picture is presented in Figure 1.1.

Figure 1.1: The overall outlook on the discussed topics about the interaction between ML robustness and fairness.

On the one hand, in Chapter 2, we systematically study the interaction between adversarial robustness and class-wise performance disparity (fairness). We provide comprehensive experimental and theoretical results to show that adversarial training can inherently introduce class-wise unfairness. In Chapter 3, we systematically study the interaction between adversarial robustness and performance disparity within each class of classification tasks. Indeed, we provide evidence demonstrating that adversarial training can cause greater unfairness between the subgroups within each class. In Chapter 4, we extend our discussion from class-balanced datasets to imbalanced datasets, where the numbers of training samples in each class are different.
Again, we provide detailed empirical evidence and theoretical analysis validating that improving adversarial robustness can introduce greater unfairness (between well-represented and under-represented classes). On the other hand, in Chapter 5, we conduct a study in the opposite direction to show that fair classification algorithms can hurt models' robustness to data poisoning attacks. Finally, we conclude the thesis by remarking on the importance of the strong tension between model robustness and fairness, and discussing future directions in Chapter 6.

CHAPTER 2
ENHANCING ROBUSTNESS MAY CAUSE CLASS-WISE UNFAIRNESS¹

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang, titled "To be robust or to be fair: Towards fairness in adversarial training", published at the International Conference on Machine Learning (2021) [121].

2.1 Introduction

The existence of adversarial examples [41] causes great concerns when applying deep neural networks to safety-critical tasks, such as autonomous driving vehicles and face identification [76, 99]. As countermeasures against adversarial examples, adversarial training algorithms aim to train a classifier that can classify input samples correctly even when they are adversarially perturbed. Namely, they optimize the model to minimize the adversarial risk, i.e., the risk that a sample can be perturbed to be wrongly classified:

$$\min_f \; \mathbb{E}_x \Big[ \max_{\|\delta\| \le \epsilon} \mathcal{L}(f(x+\delta), y) \Big]. \qquad (2.1)$$

These adversarial training methods [58, 71, 135] have been shown to be one of the most effective and reliable approaches to improve model robustness against adversarial attacks. Although promising for improving the model's robustness, we reveal an intriguing property of adversarial training algorithms: they usually result in a large disparity of accuracy and robustness among different classes. As a preliminary study in Section 2.3, we apply natural training and PGD adversarial training [71] on the CIFAR10 dataset [56] using a PreAct-ResNet18 [45] architecture. For a naturally trained model, the model performance in each class is similar. However, in the adversarially trained model, there is a severe class-wise performance discrepancy (in both accuracy and robustness). For example, the model has both low standard and robust errors on samples from the class "car", but it has a much larger error rate on "cat" images. Meanwhile, this fairness issue does not appear in natural models which are trained on clean data. Note that this phenomenon happens on the balanced CIFAR10 dataset and also exists in other datasets, model architectures and adversarial training algorithms. More details can be found in Section 2.3. If this phenomenon happens in real-world applications, it can raise huge concerns about safety. Imagine that a traffic sign recognizer has an overall high performance, but it is very inaccurate and vulnerable to perturbations only for some specific signs. The safety of such an autonomous driving car is still not guaranteed. Meanwhile, this phenomenon can also lead to issues from the social ethics perspective. For example, a robustly trained face identification system might provide different levels of safety for the services provided to different ethnic communities. Thus, there is a pressing need to study this phenomenon. In this work, we define this phenomenon as the "robust fairness" problem of adversarial training, and we aim to understand and mitigate this fairness problem.
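The class-wise errors discussed in this chapter (and measured in the preliminary study of Section 2.3 below) can be obtained with a per-class evaluation loop such as the sketch below. It reuses the illustrative `pgd_attack` helper sketched in Chapter 1 and assumes a 10-class dataset such as CIFAR10; it is a hypothetical evaluation protocol, not the exact code used in [121].

```python
import torch

def classwise_errors(model, loader, pgd_attack, num_classes=10, eps=8/255, device="cuda"):
    """Per-class standard and robust error rates (in %), measured with a PGD attack."""
    std_correct = torch.zeros(num_classes)
    rob_correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            std_pred = model(x).argmax(dim=1)
        x_adv = pgd_attack(model, x, y, eps=eps)    # the attack itself needs gradients
        with torch.no_grad():
            rob_pred = model(x_adv).argmax(dim=1)
        for c in range(num_classes):
            mask = (y == c)
            total[c] += mask.sum().item()
            std_correct[c] += (std_pred[mask] == c).sum().item()
            rob_correct[c] += (rob_pred[mask] == c).sum().item()
    std_err = 100.0 * (1.0 - std_correct / total)   # class-wise standard error
    rob_err = 100.0 * (1.0 - rob_correct / total)   # class-wise robust error
    return std_err, rob_err
```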
We first explore the question: "why does adversarial training have this accuracy/robustness disparity between classes while natural training does not present a similar issue?" To answer this question, we first study several rather simple classification problems with mixture Gaussian distributions, where we design the data in different classes to have different "difficulty levels" for classification. By studying the class-wise performance of naturally and adversarially trained models, we deepen our understanding of this problem. Compared to natural training, adversarial training has a stronger tendency to favor the accuracy of classes that are "easier" to predict. Meanwhile, adversarial training sacrifices the prediction performance of the "difficult" classes. As a result, there is a much more obvious class-wise performance discrepancy in adversarial training. Motivated by our empirical findings and theoretical understandings, we propose a dynamic debiasing framework, i.e., Fair Robust Learning (FRL), to mitigate the robust fairness issue in adversarial settings. Our main contributions are summarized as:

• We discover the robust fairness problem of adversarial training algorithms and empirically verify that this problem can be general;

• We build conceptual examples to understand the potential reasons that cause this fairness problem; and

• We propose a Fair Robust Learning (FRL) framework to mitigate the fairness issue in adversarial training.

2.2 Related Work

Adversarial Attacks and Adversarial Training. The existence of adversarial attacks [41, 102, 15] causes huge concerns when people adopt machine learning models in various application domains [126, 49]. As countermeasures against adversarial examples, adversarial training (robust optimization) algorithms [41, 71, 135, 98, 134] are formulated as a min-max problem that directly minimizes the model's risk on adversarial samples. Another main stream of defense methods is certified defense, which aims to provide provably robust DNNs under an 𝑙𝑝-norm bound [117, 24] and guarantee robustness.

Fairness in Machine Learning & Imbalanced Datasets. Fairness issues have recently drawn much attention from the machine learning community. These issues can generally be divided into two categories: (1) prediction outcome disparity [128]; and (2) prediction quality disparity [10]. Unlike existing works, this work is the first to study the unfairness issue in the adversarial setting. We also mention imbalanced data learning [43, 65] as a related topic of our work, since in our work (i.e., Figure 2.1) we show that prediction performance differences indeed exist between different classes. This phenomenon is also well studied in imbalanced data or long-tail distribution learning problems [113], where some classes have far fewer training samples than others. However, in our case, we show that this unfairness problem can generally happen in balanced datasets, so it calls for new perspectives and methods for further study.

Fairness in Robust Learning. A parallel and independent work [79] also finds that the phenomenon of class-wise unequal robustness can happen for many deep learning tasks in the wild, while our work focuses more on adversarial training algorithms and argues that adversarial training methods themselves can cause these fairness phenomena.
In addition, our work also discusses various potential mitigation methods to achieve more balanced robustness for adversarial training methods.

2.3 Preliminary Studies

In this section, we introduce our preliminary findings to show that adversarial training algorithms usually present the fairness issue, which is related to the strong disparity of standard accuracy and robustness among different classes. We first examine algorithms including PGD adversarial training [71] and TRADES [135] on the CIFAR10 dataset [56] under the 𝑙∞-norm adversarial attack (with perturbation bound 8/255). On CIFAR10, we both naturally and adversarially train PreAct-ResNet18 [43] models. The results are presented in Figure 2.1.

Figure 2.1: The class-wise standard / robust error of natural and adversarial training methods on CIFAR10, under PreAct-ResNet18. Panels: (a) Natural Training; (b) PGD Adv. Training; (c) TRADES (1/𝜆 = 1); (d) TRADES (1/𝜆 = 6).

From Figure 2.1, we can observe that, for the naturally trained models, every class has similar standard error (around 7±5%) and 100% robust error under the 8/255 PGD attack. However, for adversarially trained models, the disparity phenomenon is severe. For example, a PGD-adversarially trained model has 32.8% standard error and 82.4% robust error for samples in the class "cat", which are much larger than the model's average standard error of 15.5% and average robust error of 56.4%, while the best class, "car", only has 6.1% standard error and 34.3% robust error. These results suggest that adversarial training can cause strong disparities of standard / robustness performance between different classes, which are originally negligible in natural training.

To further support the claim that adversarial training amplifies the class-wise performance disparity, besides Figure 2.1, we provide additional statistical evidence to numerically compare each model's class-wise performance disparity. We measure each model's Standard Deviation of class-wise error (SD), Normalized Standard Deviation of class-wise error (NSD), and Normalized Standard Deviation of class-wise accuracy (NSD (Acc)).² Note that NSD and NSD (Acc) normalize out the scale of the average performance, which excludes the influence of the average performance on the disparity issue. From Table 2.1, we find that adversarial training methods have larger disparity than natural training across all three metrics, for both standard and robust performance. Thus, we can confirm that adversarial training does amplify the performance disparity between classes.

Table 2.1: The class-wise disparity on CIFAR10 (%).

Std. Error            SD     NSD    NSD (Acc)
Natural Train.        2.8    0.47   0.03
PGD Adv. Train.       9.2    0.57   0.11
TRADES (1/𝜆 = 1)      7.5    0.55   0.09
TRADES (1/𝜆 = 6)      9.6    0.49   0.12

Rob. Error            SD     NSD    NSD (Acc)
Natural Train.        0.0    0.0    -
PGD Adv. Train.       15.6   0.28   0.34
TRADES (1/𝜆 = 1)      16.8   0.29   0.40
TRADES (1/𝜆 = 6)      17.0   0.34   0.34

Moreover, we find that the reason for this fairness phenomenon might be the unequal influence of adversarial training on different classes. It tends to hurt the standard performance of classes which are intrinsically "harder" to classify, but does not effectively improve their robustness. In Table 2.2, we list the classes "dog" and "cat", which have the highest errors in natural training, as well as "car" and "ship", which have the lowest errors. We can observe that adversarial training increases the standard errors of "dog" and "cat" by a much larger margin than those of the classes "car" and "ship".
Similarly, adversarial training provides less help in reducing the robust errors of "dog" and "cat". In conclusion, we hypothesize that adversarial training tends to make the hard classes even harder to classify or to robustly classify. In Section 2.4, we will theoretically confirm this hypothesis. We further investigate other settings, including the WRN28 model architecture, the SVHN dataset, 𝑙2-norm adversarial training and the Randomized Smoothing [24] algorithm. Results can be found in Appendix 2.2, where we can make similar observations. These findings suggest that the observations from Figure 2.1, Table 2.1 and Table 2.2 are likely to generalize to other adversarial training algorithms, model architectures, datasets and other types of adversarial attacks.

²Normalized Standard Deviation: NSD = SD / mean. Note that NSD and NSD (Acc) have unequal "mean" values.

Table 2.2: The changes of standard & robust error in natural & adversarial training on CIFAR10.

Std. Error            Cat     Dog     Car     Ship
Nat. Train            11.3    10.0    1.8     3.5
PGD Adv. Train        34.8    26.9    6.1     6.4
Diff. (Adv. - Nat.)   23.5    16.9    4.3     2.9

Rob. Error            Cat     Dog     Car     Ship
Nat. Train            100     100     100     100
PGD Adv. Train        82.7    66.4    34.3    40.8
Diff. (Adv. - Nat.)   -17.3   -33.5   -65.7   -59.2

2.4 Theoretical Analysis

From our preliminary studies, we consistently observe that adversarially trained models have a great performance disparity (in both standard and robust errors) between different classes. Moreover, adversarial training tends to hurt the classes which are originally harder to classify in natural training. What is the reason that leads to this phenomenon? Is this "unfairness" property an inherent property of adversarial training methods? These questions are not trivial to answer because they are closely related to another property of adversarial training: it usually degrades the model's average accuracy. Given that naturally trained models already present slight disparities between classes (see Figure 2.1 (a)), is the larger class-wise accuracy disparity in adversarial training only a natural outcome of the model's worse average accuracy? To deepen our understanding of these questions, we theoretically study the effect of adversarial training on a binary classification task under a mixture Gaussian distribution. We design the two classes to have different "difficulties" to be classified. In this case, adversarial training does not significantly degrade the average standard error, but its decision boundary, compared to that of naturally trained models, is more biased to favor the "easier" class and hurt the "harder" class. From this theoretical study, we aim to validate that adversarial training methods can have an intrinsic tendency to exert unequal influence on different classes and consequently cause the fairness problem. In the following, we first introduce the necessary notations and definitions.

Notation. We use 𝑓 to denote the classification model, which is a mapping 𝑓 : X → Y from the input data space X to the output label space Y. It can be formulated as 𝑓 (𝑥) = sign(𝑤 · 𝑥 + 𝑏) with parameters 𝑤 and 𝑏. Generally, for a classifier 𝑓 , the overall standard error is defined as Rnat( 𝑓 ) = Pr( 𝑓 (𝑥) ≠ 𝑦); and its overall robust error is Rrob( 𝑓 ) = Pr(∃𝛿, ||𝛿|| ≤ 𝜖, s.t. 𝑓 (𝑥 + 𝛿) ≠ 𝑦), which is the probability that there exists a perturbation that makes the model give a wrong prediction.³ We use Rnat( 𝑓 ; 𝑦) to denote the standard error conditional on a specific class 𝑌 = 𝑦.

³In this section, we focus on 𝑙∞-norm bounded perturbations.
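Using the per-class errors just defined, the disparity statistics reported in Table 2.1 reduce to simple summaries of the class-wise error vector. The following small NumPy sketch (illustrative, not the authors' code) computes SD, NSD and NSD (Acc) from a list of class-wise error rates:

```python
import numpy as np

def disparity_metrics(class_errors):
    """SD, NSD and NSD (Acc) of a vector of class-wise error rates given in %."""
    err = np.asarray(class_errors, dtype=float)
    acc = 100.0 - err                       # class-wise accuracies
    sd = err.std()                          # standard deviation of class-wise errors
    nsd = sd / err.mean()                   # SD normalized by the mean error
    nsd_acc = acc.std() / acc.mean()        # SD of accuracies, normalized by mean accuracy
    return sd, nsd, nsd_acc
```

Note that when the average error in a row is zero (such as the natural model's robust-error row in Table 2.1), the normalized quantity is undefined, which is why that entry appears as "-" in the table.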
2.4.1 A Binary Classification Task

We start with a rather simple example of a binary classification task on Gaussian mixture data. Here, we aim to design the two classes with inherently different "difficulties" to be classified. Specifically, in the following definition, the data come from two classes Y = {−1, +1}, and the data of each class follow a Gaussian distribution centered at 𝜃 (for class "+1") and −𝜃 (for class "-1"), respectively. In our case, we specify a 𝐾-factor difference between the two classes' standard deviations: 𝜎+1 : 𝜎−1 = 𝐾 : 1 with 𝐾 > 1.

$$y \overset{u.a.r.}{\sim} \{-1, +1\}, \qquad \theta = (\underbrace{\eta, \ldots, \eta}_{\dim = d}), \qquad x \sim \begin{cases} \mathcal{N}(\theta, \; \sigma_{+1}^2 I) & \text{if } y = +1 \\ \mathcal{N}(-\theta, \; \sigma_{-1}^2 I) & \text{if } y = -1 \end{cases} \qquad (2.2)$$

Intuitively, the class "+1" is harder than class "-1" because it is less compact in the data space. In Theorem 1, we formally show that the class "+1" is indeed harder, because an optimal linear classifier gives a larger error for class "+1" than for class "-1".

Theorem 1 For the data distribution D in Eq. 2.2, the optimal linear classifier 𝑓nat which minimizes the average standard classification error,

$$f_{\text{nat}} = \arg\min_f \; \Pr(f(x) \neq y),$$

has the intra-class standard errors for the two classes:

$$\mathcal{R}_{\text{nat}}(f_{\text{nat}}, -1) = \Pr\big\{\mathcal{N}(0, 1) \le A - K \cdot \sqrt{A^2 + q(K)}\big\}$$
$$\mathcal{R}_{\text{nat}}(f_{\text{nat}}, +1) = \Pr\big\{\mathcal{N}(0, 1) \le -K \cdot A + \sqrt{A^2 + q(K)}\big\} \qquad (2.3)$$

where $A = \frac{2}{K^2 - 1} \cdot \frac{\sqrt{d}\,\eta}{\sigma}$ and $q(K) = \frac{2 \log K}{K^2 - 1}$, which is a positive constant depending only on 𝐾. As a result, the class "+1" has a larger standard error: Rnat( 𝑓nat, −1) < Rnat( 𝑓nat, +1).

A detailed proof of Theorem 1 can be found in Appendix 1.2. Theorem 1 demonstrates that class "+1" (with larger variance) is harder to classify than class "-1", because an optimal natural classifier presents a larger standard error on class "+1". Note that the class-wise difference is due to the positive term 𝑞(𝐾) in Eq. A.5, which depends on the variance ratio 𝐾. If the two classes' variances are equal, i.e., 𝐾 = 1, the standard errors of the two classes are the same. Next, we show that, in the setting of Eq. 2.2, an optimal robust classifier (adversarial training) gives a model which further benefits the easier class "-1" and hurts the harder class "+1".

2.4.2 Optimal Linear Model to Minimize Robust Error

In this subsection, we demonstrate that an adversarially trained model exacerbates the performance gap between these two classes, by giving a decision boundary which is closer to the samples of the harder class "+1" and farther from the class "-1". Similar to Theorem 1, we calculate the class-wise standard errors of robust classifiers.

Theorem 2 For the data distribution D in Eq. 2.2, the optimal robust linear classifier 𝑓rob which minimizes the average robust error,

$$f_{\text{rob}} = \arg\min_f \; \Pr(\exists \|\delta\| \le \epsilon \text{ s.t. } f(x + \delta) \neq y),$$

has the intra-class standard errors for the two classes:

$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, -1) = \Pr\Big\{\mathcal{N}(0, 1) \le B - K \cdot \sqrt{B^2 + q(K)} - \tfrac{\sqrt{d}}{\sigma}\epsilon\Big\}$$
$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, +1) = \Pr\Big\{\mathcal{N}(0, 1) \le -K \cdot B + \sqrt{B^2 + q(K)} - \tfrac{\sqrt{d}}{K\sigma}\epsilon\Big\} \qquad (2.4)$$

where $B = \frac{2}{K^2 - 1} \cdot \frac{\sqrt{d}(\eta - \epsilon)}{\sigma}$ and $q(K) = \frac{2 \log K}{K^2 - 1}$ is a positive constant depending only on 𝐾.

Note that we limit the perturbation margin 𝜖 to the region 0 < 𝜖 < 𝜂 to guarantee that the robust optimization gives a reasonable classifier in the setting of Eq. 2.2. The detailed proof is similar to that of Theorem 1 and is given in Appendix 1.2. From the results in Theorems 1 and 2, a corollary demonstrates that the robust classifier further hurts the "harder" class's performance.
Corollary 1 Adversarially trained models on D will increase the standard error of class "+1" and reduce the standard error of class "-1":

$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, -1) < \mathcal{R}_{\text{nat}}(f_{\text{nat}}, -1), \qquad \mathcal{R}_{\text{nat}}(f_{\text{rob}}, +1) > \mathcal{R}_{\text{nat}}(f_{\text{nat}}, +1).$$

Proof 1 A simplified proof sketch for Corollary 1 helps shed light on the reason for this behavior of adversarial training. First, from the definition of the distribution D, it is easy to obtain the intermediate result that the natural / robust classifiers have the same weight vector 𝑤nat = 𝑤rob = 1 (the all-ones vector). The only difference lies in their intercept terms 𝑏nat and 𝑏rob. In Appendix 1.2, we show that:

$$b_{\text{nat}} = \frac{K^2 + 1}{K^2 - 1} d\eta - K \sqrt{\frac{4}{(K^2 - 1)^2} \cdot d^2\eta^2 + d\sigma^2 q(K)} := g(\eta). \qquad (2.5)$$

In particular, for robust classifiers, if we look at the objective of robust classification,

$$\mathcal{R}_{\text{rob}}(f) = \Pr(\exists \|\delta\| \le \epsilon \text{ s.t. } f(x + \delta) \neq y) = \max_{\|\delta\| \le \epsilon} \Pr(f(x + \delta) \neq y) = \tfrac{1}{2}\Pr(f(x + \epsilon) \neq -1 \mid y = -1) + \tfrac{1}{2}\Pr(f(x - \epsilon) \neq +1 \mid y = +1),$$

the robust classifier 𝑓rob directly minimizes the standard error of the samples (𝑥 − 𝜖) for 𝑥 in class "+1", and it minimizes the error of the samples (𝑥 + 𝜖) for 𝑥 in class "-1". As a consequence, the robust model 𝑓rob minimizes the errors of samples whose centers are ±𝜃′ = ±(𝜂 − 𝜖, ..., 𝜂 − 𝜖), which are both 𝜖-distance closer to the origin compared to ±𝜃. Thus, we can obtain the intercept term of the robust model by simply replacing 𝜂 in 𝑏nat with (𝜂 − 𝜖): 𝑏rob = 𝑔(𝜂 − 𝜖). In Appendix 1.2, we show that 𝑔 is a monotonically increasing function on the interval from 0 to 𝜂; thus we have the relation 0 < 𝑏rob < 𝑏nat. This suggests that a robust classifier will predict more samples in D to be in the negative class "-1", and hence reduce the classification error of class "-1" but increase the error of class "+1".

Figure 2.2: Logistic Regression and Multi-layer Perceptron classifiers (natural and robust) on simulated binary data from Eq. 2.2. Panels: (a) Logistic Regression; (b) One-Layer Perceptron.

In Figure 2.2, we give an illustration of the theory in a 2-dimensional space to show the effect of adversarial training. In particular, we sample Gaussian data for class "+1" (blue), centered at 𝜃 = (2, 2) with variance 𝜎₊₁² = 2, and for class "-1" (yellow), centered at −𝜃 = (−2, −2) with variance 𝜎₋₁² = 1. We apply a logistic regression classifier with natural training (green line) and adversarial training (with perturbation bound 𝜖 = 0.5, red line). From Figure 2.2, we obtain results consistent with our theory: (1) more samples in class "+1" than in class "-1" are wrongly predicted by the natural classifier; (2) adversarial training further exacerbates the disparity by moving the decision boundary closer to the harder class "+1". In Figure 2.2 (right), we also show the results for a non-linear MLP classifier, where we make similar observations. These theoretical understandings of adversarial training suggest that adversarial training naturally brings unequal effects to different classes in the data distribution and consequently causes severe fairness issues.

2.5 Fair Robust Learning (FRL)

The observations from both the preliminary experiments and the theoretical understandings suggest that the fairness issue in adversarial training can be a general and serious concern. In this section, we first discuss the fairness requirements that an optimal robust model should satisfy, and then introduce algorithms to achieve these objectives.
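As a companion to the Gaussian example above (the setting of Eq. 2.2 and the illustration in Figure 2.2), the following is a minimal simulation sketch that fits a natural and an 𝑙∞-adversarially trained logistic regression on the two-dimensional mixture. The distribution parameters follow the values stated above (𝜃 = (2, 2), variances 2 and 1, 𝜖 = 0.5); the optimizer settings and sample size are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n = 2000
# Sample the 2-D mixture of Eq. (2.2): class +1 ~ N(theta, 2*I), class -1 ~ N(-theta, 1*I).
theta = torch.tensor([2.0, 2.0])
x_pos = theta + (2.0 ** 0.5) * torch.randn(n, 2)
x_neg = -theta + torch.randn(n, 2)
x = torch.cat([x_pos, x_neg])
y = torch.cat([torch.ones(n), -torch.ones(n)])

def train_linear(x, y, eps=0.0, steps=2000, lr=0.1):
    """Logistic regression; eps > 0 performs l_inf adversarial training.

    For a linear score y * (w.x + b), the worst-case l_inf perturbation is exactly
    x_adv = x - eps * y * sign(w), so the inner maximization is solved in closed form.
    """
    w = torch.zeros(2, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        x_adv = x - eps * y[:, None] * torch.sign(w.detach())[None, :]
        margin = y * (x_adv @ w + b)
        loss = torch.nn.functional.softplus(-margin).mean()   # logistic loss
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach(), b.detach()

w_nat, b_nat = train_linear(x, y, eps=0.0)   # natural training
w_rob, b_rob = train_linear(x, y, eps=0.5)   # adversarial training
print("natural boundary:", w_nat.tolist(), b_nat.item())
print("robust  boundary:", w_rob.tolist(), b_rob.item())
```

In line with Corollary 1, the learned robust intercept should come out smaller than the natural one, i.e., the decision boundary shifts toward the harder class "+1".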
2.5.1 Objective of Robust Fairness

In this work, we desire an optimal robust model that achieves parity of both standard prediction accuracy and adversarial robustness across all classes 𝑦 ∈ 𝑌:

• Equalized Accuracy: the classifier 𝑓 's standard error is statistically independent of the ground-truth label 𝑦: Pr( 𝑓 (𝑥) ≠ 𝑦 | 𝑌 = 𝑦) ≈ Pr( 𝑓 (𝑥) ≠ 𝑦) for all 𝑦 ∈ 𝑌.

• Equalized Robustness: the classifier 𝑓 's robust error is statistically independent of the ground-truth label 𝑦: Pr(∃ ||𝛿|| ≤ 𝜖, 𝑓 (𝑥 + 𝛿) ≠ 𝑦 | 𝑌 = 𝑦) ≈ Pr(∃ ||𝛿|| ≤ 𝜖, 𝑓 (𝑥 + 𝛿) ≠ 𝑦) for all 𝑦 ∈ 𝑌.

The notion of "Equalized Accuracy" is well studied in traditional fairness research [10, 128], which desires the model to provide equal prediction quality for different groups of people. "Equalized Robustness" is our newly desired "fairness" property for robustly trained models: for every class, the model should provide equal robustness and safety against adversarial attacks. Therefore, the robust model will have high overall safety but no obvious "weakness".

Faced with the fairness objectives mentioned above, we propose a Fair Robust Learning (FRL) strategy to train robust models that have equalized accuracy and robustness performance for each class. Formally, we aim to train a classifier 𝑓 with minimal overall robust error Rrob( 𝑓 ), while constraining 𝑓 to satisfy a series of fairness constraints:

$$\min_f \; \mathcal{R}_{\text{rob}}(f) \quad \text{s.t.} \quad \begin{cases} \mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) \le \tau_1 \\ \mathcal{R}_{\text{rob}}(f, i) - \mathcal{R}_{\text{rob}}(f) \le \tau_2 \end{cases} \quad \text{for each } i \in Y \qquad (2.6)$$

where 𝜏1 and 𝜏2 are small, positive, predefined parameters. The constraints in Eq. 2.6 require that the model's error for each class 𝑖 ∈ 𝑌 (both the standard error Rnat( 𝑓 , 𝑖) and the robust error Rrob( 𝑓 , 𝑖)) does not exceed the average level (Rnat( 𝑓 ) and Rrob( 𝑓 )) by a large margin. Thus, there will be no obvious worst group in the whole dataset. One thing to note is that, in Eq. 2.6, the robust error is always strongly related to the standard error [135, 105] (see Eq. 2.7). Thus, during the debiasing process, we could exert an entangled influence on class 𝑖's standard and robust errors. For example, if we apply an importance weighting method to upweight the cost of Rrob( 𝑓 , 𝑖), we also implicitly upweight the cost of Rnat( 𝑓 , 𝑖). Therefore, inspired by [135], we propose to separate the robust error into the sum of a standard error and a boundary error:

$$\begin{aligned} \mathcal{R}_{\text{rob}}(f, i) &= \Pr\{\exists \delta \text{ s.t. } f(x + \delta) \neq y \mid y = i\} \\ &= \Pr\{f(x) \neq y \mid y = i\} + \Pr\{\exists \delta,\; f(x + \delta) \cdot f(x) \le 0 \mid y = i\} \\ &= \mathcal{R}_{\text{nat}}(f, i) + \mathcal{R}_{\text{bndy}}(f, i) \end{aligned} \qquad (2.7)$$

where Rbndy( 𝑓 , 𝑖) = Pr{∃𝛿, 𝑓 (𝑥 + 𝛿) · 𝑓 (𝑥) ≤ 0 | 𝑦 = 𝑖} represents the probability that a sample from class 𝑖 lies close to the decision boundary and can be attacked. By separating the standard error and boundary error during adversarial training, we are able to independently address the unfairness of both the standard error and the boundary error. Formally, we have the training objective:

$$\min_f \; \mathcal{R}_{\text{nat}}(f) + \mathcal{R}_{\text{bndy}}(f) \quad \text{s.t.} \quad \begin{cases} \mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) \le \tau_1 \\ \mathcal{R}_{\text{bndy}}(f, i) - \mathcal{R}_{\text{bndy}}(f) \le \tau_2 \end{cases} \quad \text{for each } i \in Y \qquad (2.8)$$

When optimizing the boundary errors during training, we borrow the idea from [135], which minimizes the KL-divergence between the output logits of clean samples and those of their adversarial counterparts. In the following subsections, we explore effective methods to solve the problem in Eq. 2.8.

2.5.2 Reweight for Robust Fairness
In order to solve the fair robust training problem in Eq. 2.8, we first follow the main pipeline of traditional machine learning debiasing works such as [1, 128], which reduce the problem in Eq. 2.6 to a series of cost-sensitive classification problems and continuously penalize the terms that violate the fairness constraints. We begin by introducing non-negative Lagrange multipliers 𝜙 = (𝜙ⁱ_nat, 𝜙ⁱ_bndy) for each constraint in Eq. 2.6 and form the Lagrangian:

$$L(f, \phi) = \mathcal{R}_{\text{nat}}(f) + \mathcal{R}_{\text{bndy}}(f) + \sum_{i=1}^{|Y|} \phi^{i}_{\text{nat}} \big(\mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) - \tau_1\big) + \sum_{i=1}^{|Y|} \phi^{i}_{\text{bndy}} \big(\mathcal{R}_{\text{bndy}}(f, i) - \mathcal{R}_{\text{bndy}}(f) - \tau_2\big) \qquad (2.9)$$

Thus, the problem in Eq. 2.8 is equivalent to solving the max-min game between the two players 𝑓 and 𝜙:

$$\max_{\phi_{\text{nat}}, \phi_{\text{bndy}} \ge 0} \; \min_f \; L(f, \phi). \qquad (2.10)$$

We present the details of solving Eq. 2.10 in Algorithm 1. During the training process, we start from a pre-trained robust model and test its class-wise standard / boundary errors on a separate validation set (steps 5 and 6). We check whether the current model 𝑓 violates some constraints in Eq. 2.8. For example, if there exists a class 𝑖 whose standard error is higher than the average by a large margin, i.e., Rnat( 𝑓 , 𝑖) − Rnat( 𝑓 ) − 𝜏1 > 0, we adjust its multiplier 𝜙ⁱ_nat according to the extent of the violation (step 7). As a result, we increase the training weight of the standard error Rnat( 𝑓 , 𝑖) of class 𝑖. We also adjust the multipliers for the boundary errors at the same time (step 8). Next, fixing the multipliers 𝜙, the algorithm solves the inner minimization to optimize the model 𝑓 (step 9). By repeating steps 4-9, the model 𝑓 and the Lagrange multipliers 𝜙 are alternately updated toward the equilibrium, until we finally reach an optimal model that satisfies the fairness constraints. Note that we denote the framework with the proposed Reweight strategy as FRL (Reweight).

2.5.3 ReMargin for Robust Fairness

Upweighting the cost of Rnat( 𝑓 , 𝑖) has the potential to penalize a large Rnat( 𝑓 , 𝑖) and improve the standard performance of the worst groups. However, only upweighting the cost of one class's boundary error does not succeed in decreasing that class's boundary error. In Section 2.6.3, we show this fact for PGD adversarial training on CIFAR10, where we find that even giving a large weight ratio to a class's boundary error is not sufficient to reduce the boundary error of this class. Thus, we cannot mitigate the boundary error disparity and the robust error disparity. To solve this problem, we propose an alternative strategy that enlarges the perturbation margin 𝜖. It is evident from existing works [104, 31] that increasing the margin 𝜖 during adversarial training can effectively improve the model's robustness against attacks of the current intensity 𝜖. Therefore, enlarging the adversarial margin 𝜖 when generating adversarial examples during training, specifically for class 𝑖, has the potential to improve this class's robustness and reduce its large boundary error Rbndy( 𝑓 , 𝑖). In this work, we define this strategy as FRL (Remargin). FRL (Remargin) resembles the procedure in Algorithm 1, except for step 7, where we instead update the adversarial margin for the boundary errors. Specifically, we change the adversarial margin 𝜖𝑖 of class 𝑖 as follows:

$$\epsilon_i = \epsilon_i \cdot \exp\big(\alpha_2 \cdot (\mathcal{R}_{\text{bndy}}(f, i) - \tau_2)\big) \qquad (2.11)$$

Besides FRL (Reweight) and FRL (Remargin), we can combine Reweight and Remargin, where we jointly update the weights of the boundary errors and enlarge the margins.

Algorithm 1 The Fair Robust Learning (FRL) Algorithm.
1: Input: fairness constraints specified by 𝜏1 > 0 and 𝜏2 > 0, test-time attacking radius 𝜖, and hyperparameter update rates 𝛼1, 𝛼2
2: Output: a fairly robust neural network 𝑓
3: Initialize the network with a pre-trained robust model; set 𝜙ⁱ_nat = 0, 𝜙ⁱ_bndy = 0 and 𝜙 = (𝜙_nat, 𝜙_bndy)
4: repeat
5:   Rnat( 𝑓 ), Rnat( 𝑓 , 𝑖) = EVAL( 𝑓 )
6:   Rbndy( 𝑓 ), Rbndy( 𝑓 , 𝑖) = EVAL( 𝑓 , 𝜖)
7:   𝜙ⁱ_nat = 𝜙ⁱ_nat + 𝛼1 · (Rnat( 𝑓 , 𝑖) − Rnat( 𝑓 ) − 𝜏1)
8:   𝜙ⁱ_bndy = 𝜙ⁱ_bndy + 𝛼2 · (Rbndy( 𝑓 , 𝑖) − Rbndy( 𝑓 ) − 𝜏2)
9:   𝑓 ← TRAIN( 𝑓 , 𝜙, 𝜖)
10: until the model 𝑓 satisfies all constraints

2.6 Experiment

In this section, we present experimental results to validate the effectiveness of the proposed framework (FRL) in building fairly robust DNN models. We implement and compare our three proposed strategies (i.e., Reweight, Remargin and Reweight+Remargin) on real-world data and discuss their effectiveness.

2.6.1 Experimental Settings

We conduct our experiments on benchmark adversarial learning datasets, including CIFAR10 [56] and the SVHN dataset [80]. For both datasets, we study the algorithms under the model architectures PreAct-ResNet18 and WRN28. We only present the results of PreAct-ResNet18 in this section and leave the results of WRN28 to Appendix 1.2.2. As baselines, we present the original performance of two popular adversarial training algorithms [71, 135], and a debiasing method inherited from [1]. The latter is a traditional debiasing technique, which we directly apply to upweight the cost of the class with the highest robust error in the training data. Other existing debiasing methods, such as [128, 132], are not included in our experiments, because they are not proposed for deep learning models and share the idea of [1] to reweight the costs during training. In our implementation of the FRL methods, during the training process we split the training sets to obtain validation sets with 300 samples per class, which help us adjust the hyperparameters. For each variant of our proposed FRL method, we predefine the fairness constraints such that both 𝜏1 and 𝜏2 are no larger than 5% or 7%. In training, we start FRL from a pre-trained robust model (such as a PGD-adversarially trained model), and run FRL with a model parameter learning rate of 1e-3 and a hyperparameter learning rate of 𝛼1 = 𝛼2 = 0.05 for the first 40 epochs. Then we decay the model parameter learning rate and the hyperparameter learning rate by 0.1 every 40 epochs. During the evaluation phase, we report each trained model's average standard error rate, boundary error rate and robust error rate, as well as the worst intra-class error rates. Note that the boundary and robust errors are calculated with the PGD attack under the 𝑙∞-norm bound 8/255.

2.6.2 Fairness Performance

Table 2.3 shows each algorithm's performance on the CIFAR10 dataset, including each variant of FRL under the fairness constraints 𝜏1 = 𝜏2 = 5% and 𝜏1 = 𝜏2 = 7%. From the experimental results, we can see that all FRL algorithms reduce the worst-class standard errors and robust errors to different degrees.

Table 2.3: Average & worst-class standard error, boundary error and robust error for various algorithms on CIFAR10.

                                Avg. Std.  Worst Std.  Avg. Bndy.  Worst Bndy.  Avg. Rob.  Worst Rob.
PGD Adv. Training               15.5       33.8        40.9        55.9         56.4       82.7
TRADES (1/𝜆 = 1)                14.6       31.2        43.1        64.6         57.7       84.7
TRADES (1/𝜆 = 6)                19.6       39.1        29.9        49.5         49.3       77.6
Baseline Reweight               19.2       28.3        39.2        53.7         58.2       80.1
FRL (Reweight, 0.05)            16.0       22.5        41.6        54.2         57.6       73.3
FRL (Remargin, 0.05)            16.9       24.9        35.0        50.6         51.9       75.5
FRL (Reweight+Remargin, 0.05)   17.0       26.8        35.7        44.5         52.7       69.5
FRL (Reweight, 0.07)            16.1       23.8        38.4        55.2         54.0       75.2
FRL (Remargin, 0.07)            16.9       26.0        37.4        51.6         53.5       75.1
FRL (Reweight+Remargin, 0.07)   17.1       26.7        36.7        48.3         53.8       70.2

Table 2.4: Average & worst-class standard error, boundary error and robust error for various algorithms on SVHN.

                                Avg. Std.  Worst Std.  Avg. Bndy.  Worst Bndy.  Avg. Rob.  Worst Rob.
PGD Adv. Training               9.4        19.8        37.0        53.9         46.4       73.7
TRADES (1/𝜆 = 1)                9.9        18.6        39.1        60.6         48.0       78.3
TRADES (1/𝜆 = 6)                10.5       23.4        32.5        52.5         43.1       76.6
Baseline Reweight               8.8        17.4        39.3        54.7         48.2       72.1
FRL (Reweight, 0.05)            7.9        13.3        38.2        56.4         46.1       69.7
FRL (Remargin, 0.05)            9.2        17.4        39.7        49.6         48.9       67.0
FRL (Reweight+Remargin, 0.05)   7.7        12.8        36.2        51.2         43.9       64.0
FRL (Reweight, 0.07)            8.0        13.6        37.2        54.2         45.0       67.8
FRL (Remargin, 0.07)            8.5        14.2        36.9        50.6         45.5       64.8
FRL (Reweight+Remargin, 0.07)   8.3        15.4        36.7        51.4         45.0       64.9

FRL (Reweight) performs best at achieving the minimal worst-class standard error. Compared to vanilla methods, it gives around a 10% reduction in the worst class's standard error. However, it cannot equalize the boundary errors adequately. Alternatively, FRL (Remargin) is more effective than FRL (Reweight) at decreasing the worst-class boundary error. Furthermore, their combination, FRL (Reweight + Remargin), is the most effective at reducing the worst-class boundary and worst-class robust errors. It manages to train a robust classifier with a worst-class robust error of around 70%, which is 10 ∼ 15% lower than that of vanilla adversarial training methods. The baseline method (Baseline Reweight) [1] can only help decrease the worst-class standard error, but cannot equalize the boundary errors or robust errors. These results suggest that the FRL method can mitigate the fairness issues by improving the model's standard and robustness performance on the worst classes.

Notably, from the results in Table 2.3, we find that the methods which use the "Remargin" strategy usually have a 1 ∼ 2% larger average standard error compared to PGD adversarial training or TRADES (1/𝜆 = 1). This is because "Remargin" increases the perturbation margin 𝜖 for some classes during training, and according to existing works [104, 98], using a large perturbation margin might degrade the model's standard generalization. Thus, in this work we limit every training sample's perturbation margin to not exceed 16/255, to guarantee that the model gives an acceptable average standard performance. On the other hand, because the "Remargin" strategies increase the perturbation margin, their average boundary errors are smaller than those of the vanilla methods. As a result, their average robust errors are comparable with those of the vanilla methods, or even show a 2 ∼ 3% improvement.

In Table 2.4, we present the experimental results on the SVHN dataset. We make similar observations to those on CIFAR10: FRL (Reweight) achieves the minimal worst-class standard error, and FRL (Reweight+Remargin) gives the minimal worst-class robust error. One intriguing fact is that the FRL methods, including those which use "Remargin", do not cause an increase in the average standard error. Instead, each FRL method results in a 1 ∼ 2% decrease of the average standard error.
This might be because these methods improve the worst-class standard error by a large margin.

2.6.3 Ablation Study

From the results in Table 2.3, we find that FRL (Reweight) is not effective in improving the worst-class boundary error. As a result, it cannot sufficiently equalize the robustness performance between classes. In this subsection, we study the potential reasons behind this. To take a closer look at this issue, we implement adversarial training on the CIFAR10 dataset in two groups of experiments, following the basic settings described in Section 2.6.1. In each group, we only upweight a chosen class's boundary error by different ratios (from 1.0 to 4.5 times that of the other classes) and keep the other classes' weights fixed. Then, for each model, we present its standard / robustness performance on this class in Figure 2.3 (left). As a comparison, we also show the cases where we instead increase the perturbation margin (1.0 to 2.5 times the original margin) (Figure 2.3 (right)).

Figure 2.3: The effect of upweighting on boundary error (left) and standard error (right) for the class "deer". Panels: (a) Upweight Boundary Error; (b) Increase Perturbation Margin.

From Figure 2.3, we find that when we increase the weight only for the class "deer", the boundary error of this class decreases, but its standard error also increases. Thus, reweighting only trades off a class's boundary error against its standard error; it cannot improve the robustness of this class relative to other classes to resolve the fairness relationship between classes. In contrast, increasing the margin does not increase the standard error, while it effectively decreases this class's boundary error and robust error. Therefore, in the results of Tables 2.3 and 2.4, we observe that "Remargin"-based methods can successfully improve the worst-class robustness performance.

2.7 Conclusion

In this work, we first empirically and theoretically uncover one property of adversarial training algorithms: they can cause a serious disparity in both standard accuracy and adversarial robustness between different classes of the data. As a first attempt to mitigate the fairness issues arising from adversarial training, we propose the Fair Robust Learning (FRL) framework. We validate the effectiveness of FRL on benchmark datasets. In the future, we want to examine whether the fairness issue can be observed in other types of defense methods.

CHAPTER 3
ENHANCING ROBUSTNESS MAY AMPLIFY INNER-CLASS UNFAIRNESS¹

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Wentao Wang, Zitao Liu, Anil K. Jain, and Jiliang Tang, titled "How does the Memorization of Neural Networks Impact Adversarial Robust Models?", published at the Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining [124].

3.1 Introduction

Recent studies [39, 40, 4, 19, 77] uncover one key reason for the extraordinary performance of deep neural networks (DNNs), a.k.a. memorization (or benign overfitting). In detail, the works [39, 40] empirically show that popular image datasets, such as CIFAR10, CIFAR100 and ImageNet, always have very diverse data distributions and, in particular, contain a large fraction of "atypical" samples (see Fig. 3.1). These atypical samples are visually and statistically very different from the other (typical) samples in their class. For example, the atypical samples of "dog" in CIFAR10 can have very diverse appearances, species, camera angles and backgrounds, which make them visually distinct from typical "dogs". Besides, although the total amount of atypical samples is large, each atypical sample only appears in a low-frequency region of the whole data distribution. As a result, DNNs can only fit these atypical samples by "memorizing" their labels.
Surprisingly, the studies [39, 40] find that memorizing / fitting these atypical samples does not hurt the model performance on typical samples, but can boost DNNs' accuracy by correctly classifying the atypical samples in the test set. Thus, the capacity to memorize atypical samples is one key factor for DNNs to achieve better performance than shallow machine learning models. Similar to DNNs trained via empirical risk minimization (ERM), adversarial training methods [71, 58] are also devised to fit the whole training dataset. Specifically, adversarial training minimizes the model's error against adversarial perturbations [41, 102] by fitting the model on manually generated adversarial examples of all training data. Although adversarial training can fit and memorize all training data as well as their adversarially perturbed counterparts, the resulting models always suffer from both poor clean accuracy and poor adversarial accuracy (or robustness) on the test set [105, 94]. These facts suggest that memorization in adversarial training may behave very differently from memorization in traditional ERM. Therefore, it is natural to ask: What is the impact of memorization in adversarial training? Can memorizing the atypical samples and their adversarial counterparts benefit the model's accuracy and robustness?

To answer this question, we first conduct preliminary studies to explore whether memorizing atypical samples in adversarial training can benefit DNNs' test performance, especially on test atypical samples. We implement PGD adversarial training [71] on CIFAR100 under WideResNet28 [45] and fine-tune the models until they achieve the optimal training performance. In Fig. 3.2, we observe that memorization in adversarial training only benefits the clean accuracy on test atypical samples, not on their adversarial counterparts. In detail, when the DNNs fit / memorize the atypical samples, they can achieve clean accuracy close to ∼40% on the test atypical set. However, the adversarial accuracy on the test atypical set remains low (∼10%) during the whole training process, even though the models can fit almost all atypical (adversarial) training samples. Recalling that atypical samples occupy at least ∼40% of the whole CIFAR100 dataset, this fact suggests that failing on atypical adversarial examples at test time could be one of the key reasons for the poor robustness generalization of DNNs.

Next, we continue to study whether memorizing atypical samples in adversarial training can impact the model's performance on typical samples. We again implement PGD adversarial training [71] on CIFAR100 for multiple trials, trained with different amounts of atypical samples. From the results in Fig. 3.3, a model adversarially trained on the training set without any atypical samples has 95% clean accuracy and 55% adversarial accuracy on the test typical set, while the model trained with 100% of the atypical samples only has 90.2% / 50.4% clean / adversarial accuracy, respectively. This result shows that, in adversarial training, fitting atypical samples can even hurt DNNs' performance on "typical" samples.
In other words, atypical samples act more like "poisoning data" [6] that deteriorate model performance on typical samples. Notably, in Fig. 3.3, we also demonstrate that this poisoning effect is absent in traditional ERM: fitting atypical samples does not reduce the model accuracy in ERM. This fact highlights the uniqueness of memorization in adversarial training methods. In conclusion, these two facts about memorization uncover key differences between traditional ERM and adversarial training, and shed light on the reasons for the poor generalization of adversarial training.

Figure 3.1: Typical & atypical samples in CIFAR10 & CIFAR100, with different "memorization values". The higher the memorization value (= 1.0), the more likely the sample is atypical; the lower the memorization value (= 0.0), the more likely the sample is typical.

Motivated by our findings, we propose a novel algorithm called Benign Adversarial Training (BAT), which can eliminate the negative influence of memorizing the "harmful" atypical samples, while preserving the model's ability to memorize the "benign / useful" atypical samples. It is worth mentioning that, by fitting the "benign" atypical samples, the BAT method can achieve good clean accuracy on the atypical set; by eliminating the "harmful" atypical samples, the BAT method can improve the clean & adversarial accuracy on the typical set. In our experiments, we verify that BAT can benefit adversarial training by achieving a better accuracy vs. robustness trade-off.

3.2 Related Works

Memorization and atypical samples. The memorization effect of overparameterized DNNs has been extensively studied both empirically [133, 78] and theoretically [5]. From the traditional ML point of view, memorization can be harmful to model generalization, because it makes models easily fit outliers and noisy labels. However, recent studies point out the concept of "benign overfitting" [4, 39, 40], which suggests that the memorization effect is necessary for DNNs to achieve extraordinary performance on modern machine learning tasks. In particular, the recent work [40] empirically identifies the atypical / rare samples in benchmark datasets and shows the contribution of memorizing atypical samples to the DNN's performance. Besides the work [40], there are also other strategies [14] to find atypical samples in a training dataset.

Adversarial robustness. Adversarial training methods [71, 135] are considered one of the most reliable and effective methods to protect DNN models against adversarial attacks [41, 126]. However, adversarial training has several intrinsic properties that differ from traditional ERM and require deeper understanding. For example, adversarially trained models always suffer from poor robustness generalization [94, 89], and they always present a strong trade-off between clean accuracy and robustness [105, 135]. Our work aims to study the robust overfitting issue from the data perspective and demonstrate its critical connection to the memorization of DNNs.

Comparison to Existing Works. Notably, our work is not the first effort to study the influence of memorization on DNNs' adversarial robustness. A previous study [93] illustrates that memorizing mis-labeled samples might be one reason for DNNs' adversarial vulnerability. Similarly, the work [32] claims that fitting some "problematic" data can degrade the performance of adversarial training.
A concurrent work [33] finds that fitting some "hard" samples is also likely to hurt an adversarially trained model's performance. However, these works do not provide efficient strategies to trace the problematic / hard samples, which limits their empirical support. Our study, which focuses on the "atypical" samples, directly compares adversarial training and traditional ERM on a fixed set of samples (the atypical samples), which highlights the great importance of atypical samples to the problem of robust generalization. Moreover, we propose a novel defense method, BAT, to further improve the performance of adversarial training.

3.3 Definition and Notation

3.3.1 Atypical Samples and Memorization

In this section, we provide a formal definition of atypical samples, following the work [39]: atypical samples are the samples for which "a learning algorithm A cannot predict its label without observing it in the training dataset. Fitting it will require memorizing its label." To identify such atypical samples in practice, one representative strategy [40] proposes to examine which training samples can only be fitted by memorization, and measures each training sample's "memorization value". Formally, for a training algorithm A (i.e., ERM), the memorization value mem(A, D, 𝑥_𝑖) of a training sample (𝑥_𝑖, 𝑦_𝑖) ∈ D in training set D is defined as:

mem(A, D, 𝑥_𝑖) = Pr_{𝐹←A(D)}[𝐹(𝑥_𝑖) = 𝑦_𝑖] − Pr_{𝐹←A(D∖𝑥_𝑖)}[𝐹(𝑥_𝑖) = 𝑦_𝑖],   (3.1)

which is the difference between the probability that a model 𝐹 correctly predicts 𝑥_𝑖 when 𝑥_𝑖 is included in the training set D and when it is removed. In practice, [40] trains a DNN model using ERM for 1,000 trials, with each trial preserving 70% of the samples of the whole training dataset. Based on this metric, one finds that all common datasets have a large fraction of atypical samples. For example, CIFAR10, CIFAR100 and Tiny ImageNet have more than 11%, 40% and 49% of samples with a large memorization value (> 0.15), respectively. This strategy can also facilitate finding atypical samples in the test set, which are the samples that are strongly influenced by atypical training samples. In detail, for an atypical training sample (𝑥_𝑖, 𝑦_𝑖), we calculate the "influence value" on each test sample (𝑥′_𝑗, 𝑦′_𝑗) ∈ D′ in test set D′:

infl(A, D, 𝑥_𝑖, 𝑥′_𝑗) = Pr_{𝐹←A(D)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗] − Pr_{𝐹←A(D∖𝑥_𝑖)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗].   (3.2)

If the sample pair (𝑥_𝑖, 𝑥′_𝑗) has a high influence value, removing the atypical sample 𝑥_𝑖 from the training set will drastically decrease the model's accuracy on 𝑥′_𝑗. Remark: In this paper, we also use ERM as A to identify atypical samples, because ERM is computationally efficient and more accurate than other algorithms (such as adversarial training). In the appendix, we provide more discussion, including: (a) the frequency of atypical samples in common datasets and (b) alternative (more efficient) metrics to identify atypical samples.

3.3.2 Adversarial Training

Similar to classification models trained via empirical risk minimization (ERM), adversarial training methods [71, 58] are also devised to fit the whole dataset by training the model on manually generated adversarial examples. Formally, they minimize the adversarial loss:

min_𝐹 E_{(𝑥,𝑦)} [ max_{||𝛿||≤𝜖} L(𝐹(𝑥 + 𝛿), 𝑦) ],   (3.3)

which is the model 𝐹's average loss on data (𝑥, 𝑦) perturbed by adversarial noise 𝛿. These adversarial training methods have been shown to be among the most effective approaches to improve model robustness against adversarial attacks.
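For concreteness, the inner maximization in Eq. (3.3) is typically approximated with projected gradient descent (PGD) [71]. The following PyTorch-style sketch shows one adversarial-training step under an 𝑙∞ budget; the attack hyperparameters (step size, number of steps) and the use of eval mode during attack generation are illustrative choices, not the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, step=2/255, n_steps=10):
    """Approximate the inner max of Eq. (3.3) with l_inf-bounded PGD."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(n_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                     # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the l_inf ball
            x_adv = x_adv.clamp(0, 1)                              # stay in valid image range
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization of Eq. (3.3) on one mini-batch."""
    model.eval()                     # one common choice: do not update BN stats while attacking
    x_adv = pgd_perturb(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```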
Note that, similar to traditional ERM, adversarially trained models can also achieve very high training performance. For example, under WideResNet28-10 (WRN28) [45], PGD adversarial training can achieve 100% clean accuracy and 99% adversarial accuracy on the training set. This suggests that these DNN models have sufficient capacity to memorize the labels of the atypical samples and their adversarial counterparts. However, different from ERM, adversarially trained models usually suffer from poor generalization performance during test. For example, the WRN28 model only reaches 59% test clean accuracy and 24% test adversarial accuracy. These facts indicate that the memorization in adversarial training is probably not always beneficial to test performance and requires a deeper understanding.

3.4 Preliminary: Memorization in Adversarial Training

In this section, we attempt to understand the behavior of adversarial training by studying its relation to the memorization effect. The discussions are based on PGD adversarial training [71] on CIFAR100. Implementation details and similar experiments on CIFAR10 and Tiny ImageNet [60] are given in the appendix.

3.4.1 (a) Robustness of Atypical Samples Can Hardly Generalize

In this subsection, we first check whether fitting atypical samples in adversarial training can effectively help the model correctly and robustly predict the atypical samples in the test set. We apply PGD adversarial training [71] on the original CIFAR100 dataset for 200 epochs and evaluate the model's clean accuracy and adversarial accuracy on the training atypical set Datyp = {𝑥_𝑖 ∈ D : mem(𝑥_𝑖) > 0.15} and its corresponding test atypical set D′atyp = {𝑥′_𝑗 ∈ D′ : infl(𝑥_𝑖, 𝑥′_𝑗) > 0.15 for some 𝑥_𝑖 ∈ Datyp}. In Fig. 3.2, we report the algorithm's performance (clean & adversarial accuracy) on these atypical sets along the training process. From the results, we observe that the WRN28 model is capable of memorizing almost all atypical samples and their adversarial counterparts, since it achieves ≈100% clean & adversarial accuracy on the training atypical set. As training proceeds, the model's clean accuracy on the test atypical set gradually improves and finally approaches 41%. However, its adversarial robustness stays around 10% from the first epochs to the last, no matter how high the training performance is. These results suggest that memorizing atypical samples in adversarial training may only improve their test clean accuracy, but hardly helps their adversarial robustness.

Recall that in CIFAR100, the atypical set Datyp (with memorization value > 0.15) covers 40% of the samples of the whole dataset. Completely failing on the adversarial robustness of atypical samples could thus be one key reason for the poor robustness generalization of DNNs [89]. As a previous theoretical study [94] states, for a model to achieve good robustness generalization, it demands a training set with many more samples than a model that only needs good clean-accuracy generalization. In our case, the sub-population of each particular atypical sample appears with very low frequency in the training set and deviates from the main sub-populations. Within such a sub-population, learning is equivalent to a classification task on an extremely small dataset with only one or a few training samples. Thus, the robustness of atypical samples can be extremely hard to generalize.

Figure 3.2: Clean Acc. (left) & Adv. Acc. (right) on the Atypical Set of CIFAR100 under WRN28.
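The atypical sets used above are obtained by thresholding the memorization and influence values of Section 3.3.1. As a concrete illustration, the sketch below shows one way those Monte-Carlo estimates of Eqs. (3.1) and (3.2) could be computed from the leave-out retraining trials described in [40]; the array names and the placement of the 0.15 threshold are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def estimate_mem_and_infl(in_train, train_correct, test_correct):
    """Monte-Carlo estimates of memorization (Eq. 3.1) and influence (Eq. 3.2) values.

    in_train:      [T, n] bool, whether train sample i was in trial t's ~70% training subset
    train_correct: [T, n] bool, whether trial t's model labeled train sample i correctly
    test_correct:  [T, m] bool, whether trial t's model labeled test sample j correctly
    """
    in_f = in_train.astype(float)
    out_f = 1.0 - in_f
    train_f = train_correct.astype(float)
    test_f = test_correct.astype(float)
    # mem(x_i) ~= Pr[F(x_i)=y_i | x_i in training set] - Pr[F(x_i)=y_i | x_i held out]
    mem = (train_f * in_f).sum(0) / in_f.sum(0) - (train_f * out_f).sum(0) / out_f.sum(0)
    # infl(x_i, x'_j) ~= Pr[x'_j correct | x_i in] - Pr[x'_j correct | x_i out]
    p_in = (in_f.T @ test_f) / in_f.sum(0)[:, None]       # shape [n, m]
    p_out = (out_f.T @ test_f) / out_f.sum(0)[:, None]
    return mem, p_in - p_out

# Thresholding at 0.15 then yields the atypical sets used in this subsection:
#   D_atyp  = {i : mem[i] > 0.15}
#   D'_atyp = {j : infl[i, j] > 0.15 for some i in D_atyp}
```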
3.4.2 (b) Memorizing Atypical Samples Hurts Typical Samples' Performance

In this subsection, we find that fitting atypical samples can even bring negative effects on "typical" samples. Here, we define the "typical" samples as the subset of the training set D with low memorization values: Dtyp = {𝑥_𝑖 ∈ D : mem(𝑥_𝑖) < 0.02}. For the test typical set D′typ, we exclude all test samples that have a high influence value from any atypical training sample, and also exclude the samples that have a low success rate of being predicted by the ERM algorithm A (the samples which cannot be learned from D):

D′typ = D′ ∖ ( {𝑥′_𝑗 : infl(𝑥_𝑖, 𝑥′_𝑗) > 0.02 for some 𝑥_𝑖 ∈ Datyp} ∪ {𝑥′_𝑗 : Pr_{𝐹←A(D)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗] < 0.8} ).

We conduct PGD adversarial training [71] for several trials on resampled CIFAR100 datasets: each dataset is constructed with the whole training typical set Dtyp and a part of the training atypical set Datyp (randomly sampling 0%, 20% and 100% of Datyp). In Fig. 3.3, we report the adversarially trained models' clean and adversarial accuracy on the test typical set D′typ and check the impact of atypical samples on the typical samples. From the results, we find that the existence of atypical samples has a significant influence on the typical samples. For example, an adversarially trained model without atypical samples has 94.7% clean accuracy and 55.0% adversarial accuracy on the test typical samples (at the last epochs), while the model trained with all atypical samples only has 90.2% and 50.4% clean & adv. accuracy, respectively. These results suggest: the more atypical samples exist in the training set, the poorer the performance the model will have on D′typ. In other words, these atypical samples act like "poisoning" samples [6, 126] which can deteriorate the model's performance. It is also worth mentioning that this negative phenomenon does not appear in the traditional ERM algorithm. In Fig. 3.3c, we can see that the models have similar accuracy on the typical set when different numbers of atypical samples appear in the training set.

Figure 3.3: Adversarial Training vs. Traditional ERM on the Typical Set of CIFAR100 under WRN28: (a) Adv. Train (Clean Acc.); (b) Adv. Train (Adv. Acc.); (c) Traditional ERM (Clean Acc.).

Harmful Atypical Samples. A natural question is what kinds of atypical samples are likely to hurt model robustness, and why. Different from previous literature [61], which argues that mislabeled samples are very likely to degrade model performance, CIFAR100 is a clean dataset with no or very few wrong labels. Thus, we hypothesize that the "harmful" atypical samples are those which share similar features with a "wrong" class. Since atypical samples are always distinct from the main data distribution of their labeled class, it is likely that they are close to the distribution of a "wrong" class. For example, in Fig. 3.4, we identify an atypical "plate" in CIFAR100 (via the method in the next section) which has features, such as its shape, resembling images of "apple". If the model memorizes this atypical "plate" and predicts any samples with similar features to be "plate", the model cannot distinguish "apple" and "plate". Remarkably, these "harmful" atypical samples are still correctly labeled; they only share some similar features with other classes.

Figure 3.4: "Harmful" atypical samples: in the CIFAR100 dataset, there are atypical "plates" that look like "apples", and atypical "lamps" that look like "cups".

Why does this negative effect only happen during adversarial training, instead of traditional ERM?
To deepen our understanding of this question, we theoretically analyze the consequence of fitting "harmful" atypical samples for both algorithms. In particular, we formally show that the negative effect during adversarial training can be significantly stronger than in traditional ERM. Our analysis is based on a binary classification task with a two-dimensional Gaussian mixture distribution:

𝑦 ∼ u.a.r. {−1, +1},   𝜃 = [𝜂, 0], 𝜂 ∈ R+,   𝑥 ∼ N(𝜃, 𝜎²𝐼) if 𝑦 = +1;   𝑥 ∼ N(−𝜃, 𝜎²𝐼) if 𝑦 = −1,   (3.4)

and we consider linear models 𝐹 = sign(𝑤ᵀ𝑥 + 𝑏), where 𝑤 and 𝑏 are the model weight and bias. Moreover, we assume there is an atypical training sample 𝑥₀ ∈ R² labeled as "−1", with Euclidean distances to the centers of the two classes ||𝑥₀ − (−𝜃)||₂ = 𝑑₋₁ and ||𝑥₀ − 𝜃||₂ = 𝑑₊₁. Intuitively, this setting of 𝑥₀ resembles the cases in Fig. 3.4 if 𝑑₊₁ is relatively small, since 𝑥₀ then has a short distance to the wrong class. We also constrain 𝑑₋₁ ≤ 2𝜂 so that 𝑥₀ preserves a certain similarity to the class "−1". The following theorem shows that there always exists a model which fits the sample 𝑥₀ and achieves optimal accuracy (i.e., > 99%) on both classes, even though 𝑑₊₁ can be small.

Theorem 3 (Fitting Atypical Samples in Traditional ERM). For the data distribution in Eq. 3.4, given that 𝜂/𝜎 is large enough, define 𝑙 = √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2). For a point 𝑥₀ with 𝑑₋₁ ≤ 2𝜂, if 𝑑₊₁ ≥ 𝑙, then there exists a model 𝐹 fitting 𝑥₀, i.e., 𝐹(𝑥₀) = −1, with classification error at most 1% on both classes.

The detailed proof can be found in the appendix. Remarkably, for classification tasks where the two classes are well separated, i.e., 𝜎/𝜂 is sufficiently small, there exist models that achieve high performance even when the atypical sample 𝑥₀ is only about √2 · Φ⁻¹(0.99) · 𝜎 away from the center of class "+1", which is much smaller than the distance 2𝜂 between the two class centers. This suggests that for traditional ERM, the negative effect of fitting the atypical sample 𝑥₀ is negligible, even if 𝑥₀ is close to the wrong class "+1". Next, we show that under the same setting, any model which fits the adversarial counterpart of 𝑥₀ must have poor accuracy on the class "+1".

Theorem 4 (Fitting Atypical Samples in Adversarial Training). For the same data distribution as in Eq. 3.4 and the same definition 𝑙 = √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2), assume 𝑙 ≤ 𝑑₊₁ < 𝑘 · 𝑙, where 1 ≤ 𝑘 ≪ 𝜂/𝜎. Then any model 𝐹 has classification error at least 50% on class "+1" when 𝐹 fits the adversarial example of 𝑥₀, i.e., 𝐹(𝑥₀ + 𝛿) = −1 for all ||𝛿||₂ ≤ 𝜖, with

𝜖 ≥ 𝑘 · √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2).   (3.5)

Notably, when the two classes are well separated, i.e., 𝜎/𝜂 is small, the adversarial training bound 𝜖 is also much smaller than 𝜂. The analysis suggests that when 𝑥₀ is close to the class "+1", all linear models which fit the adversarial example of 𝑥₀ will have a large error on class "+1", even if the adversarial training bound 𝜖 is small. Thus, in adversarial training, the negative effect of fitting atypical samples is significantly greater than in traditional ERM. These theorems highlight the key difference between adversarial training and traditional ERM when fitting atypical samples, which motivates us to devise strategies to eliminate this negative effect during adversarial training.
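As a quick numerical sanity check of the closed-form threshold in Theorems 3 and 4, the snippet below evaluates 𝑙 for illustrative parameter values (𝜂 = 1, 𝜎 = 0.1, chosen here only for demonstration) and compares it to the class-center distance 2𝜂; the point is simply that 𝑙 ≪ 2𝜂 whenever 𝜎/𝜂 is small.

```python
from statistics import NormalDist
from math import sqrt

def theorem_threshold(eta: float, sigma: float, target_acc: float = 0.99) -> float:
    """l = sqrt(2) * Phi^{-1}(target_acc) * sigma * (1 + Phi^{-1}(target_acc) * sigma / eta)^(-1/2)."""
    z = NormalDist().inv_cdf(target_acc)      # Phi^{-1}(0.99) is roughly 2.33
    return sqrt(2) * z * sigma * (1 + z * sigma / eta) ** -0.5

eta, sigma = 1.0, 0.1                         # illustrative, well-separated classes
l = theorem_threshold(eta, sigma)
print(f"l = {l:.3f}, class-center distance 2*eta = {2 * eta:.1f}")
# With these values l comes out to roughly 0.3, far smaller than 2*eta = 2.0:
# an atypical point can sit quite close to the wrong class and ERM can still fit it,
# while Theorem 4 says that adversarially fitting it already breaks class "+1".
```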
3.5 Methodology: Benign Adversarial Training

Based on the previous discussions, the effect of atypical samples in adversarial training can be briefly summarized as: (a) they can benefit the model's clean accuracy (especially on atypical samples), but hardly improve its robustness; (b) they hurt the performance on typical samples. In this section, we ask: can we eliminate the negative influence of fitting atypical samples during adversarial training? Admittedly, stopping adversarial training at an early epoch [89] could potentially achieve this goal, since the atypical samples tend to be fitted in the later stages of training. However, it is still unclear how to properly identify such an epoch in general scenarios. Moreover, the early-stop mechanism tends to ignore all atypical samples and significantly limits the model's ability to handle atypical samples, especially on complex datasets such as CIFAR100 and ImageNet with large fractions of atypical samples.

In this work, we propose a novel algorithm, Benign Adversarial Training (BAT). It is composed of two major components: (i) Reweighting, to mitigate the impact of harmful atypical samples while learning benign atypical samples; and (ii) a Discrimination Loss, to further protect the discrimination ability among different classes. Combining these two components, BAT improves the performance on typical samples while preserving the ability to fit the "useful" atypical samples. Compared to PGD adversarial training [71], BAT enjoys a better clean vs. adversarial accuracy trade-off. More results can be found in Section 3.6. Next, we detail these two components of BAT.

3.5.1 Reweighting & Harmful Score

Following PGD adversarial training [71], BAT starts by fitting manually generated adversarial examples using Projected Gradient Descent (PGD). During training, in order to identify which samples can hurt or degrade the model's performance, we define the "harmful score" for each training (clean) sample 𝑥_𝑖 as:

q(𝑥_𝑖) = max_{𝑡≠𝑦} 𝐹_𝑡(𝑥_𝑖),   (3.6)

which is the model 𝐹's largest prediction score (after softmax) of the clean sample 𝑥_𝑖 for a wrong class 𝑡 other than the true class 𝑦. A high harmful score indicates that the current model predicts the sample 𝑥_𝑖 to be a wrong class with high confidence. Note that atypical samples are usually fitted by the model later than typical samples. Under a model with only a few atypical samples fitted, if an atypical sample 𝑥_𝑖 has a high q(𝑥_𝑖), then 𝑥_𝑖 is very likely to be close to the distribution of a wrong class instead of its true class. In Figure 3.4, we present several atypical samples with harmful scores larger than 0.8. These samples present semantic features which are very similar to the wrong class that the model predicts. Therefore, fitting these atypical samples could cause the model to learn mixed concepts of features and degrade its performance. During the training process, we want the model to mitigate the influence of the atypical samples with large harmful scores, but still learn the other "useful / benign" atypical samples. Thus, we design a cost-sensitive reweighting strategy which downweights the cost of atypical samples with large harmful scores. In particular, we specify the weight value 𝑤_𝑖 for each training sample 𝑥^adv_𝑖 as:

𝑤_𝑖 = exp(−𝛼 · q(𝑥_𝑖))  if mem(𝑥_𝑖) > 𝜎 and argmax_𝑡 𝐹_𝑡(𝑥_𝑖) ≠ 𝑦_𝑖;   𝑤_𝑖 = 1  otherwise,   (3.7)

where 𝛼 ∈ R+ and 𝜎 control the size of the reweighted atypical set.
Since the function exp(−𝛼(·)) is decreasing and ranges from 1 to 0, the samples with large harmful scores are assigned weights close to 0. Thus, in the reweighting algorithm, we train the model to find the optimal model 𝐹rw under the weight vector 𝑤:

𝐹rw = argmin_𝐹 (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖).   (3.8)

During training, the (adversarial) training samples with large harmful scores are assigned small weights and their influence is correspondingly largely mitigated. Note that in Eq. 3.7, estimating mem(𝑥_𝑖) for each 𝑥_𝑖 via the leave-out retraining method described in Eq. 3.1 can be computationally costly. However, alternative efficient strategies have been proposed to estimate how likely a sample is to be atypical, including confidence-based strategies and ensemble disagreement [14]. These methods are shown to align well with the method in Eq. 3.1, so they can be leveraged to improve the efficiency of Eq. 3.7.

3.5.2 Discrimination Loss

In addition, we introduce the Discrimination Loss to further eliminate the negative effect of fitting atypical samples. Based on the theoretical analysis in Section 3.4.2, imagine that the atypical sample 𝑥₀ has a fixed distance 𝑑₋₁ to the center of its labeled class; one strategy to mitigate the negative effect of fitting 𝑥₀ is to increase the two classes' center distance 2𝜂 and hence increase the distance 𝑑₊₁ of the harmful example to the poisoned class "+1". Motivated by this point, we introduce L_DL to increase the distances of typical samples from different classes in the representation space:

L_DL(𝐹) = E_{(𝑥_𝑖,𝑦_𝑖),(𝑥_𝑗,𝑦_𝑗),{(𝑥_𝑘,𝑦_𝑘)}_{𝑘=1}^𝐾} [ − log ( exp(ℎ(𝑥^adv_𝑖) · ℎ(𝑥^adv_𝑗) / 𝜏) / Σ_{𝑘=1}^𝐾 exp(ℎ(𝑥^adv_𝑖) · ℎ(𝑥^adv_𝑘) / 𝜏) ) ],   (3.9)
where 𝑦_𝑖 = 𝑦_𝑗; 𝑦_𝑘 ≠ 𝑦_𝑖 for all 𝑘 = 1, ..., 𝐾; and mem(𝑥_𝑖), mem(𝑥_𝑗), mem(𝑥_𝑘) < 𝜎.

Specifically, ℎ(·) is the model 𝐹's penultimate-layer output representation, 𝜏 ∈ R+ is a temperature value, and 𝐾 ∈ Z+ is a fixed number of samples randomly chosen with labels different from 𝑦_𝑖.² Intuitively, constraining the Discrimination Loss encourages the model to output similar representations for typical sample pairs (𝑥^adv_𝑖, 𝑥^adv_𝑗) from the same class, and distinct representations for pairs (𝑥^adv_𝑖, 𝑥^adv_𝑘) from different classes. Similar approaches have been widely used in representation learning, such as the Triplet Loss [95] and contrastive learning methods [22], which emphasize desirable properties of DNNs' representation space. With the Discrimination Loss incorporated, the final objective of BAT is:

𝐹bat = argmin_𝐹 ( (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖) + 𝛽 · L_DL(𝐹) ),   (3.10)

where 𝛽 > 0 controls the intensity of the Discrimination Loss. The detailed training scheme of Benign Adversarial Training is presented in Algorithm 2.

3.5.3 The Detailed Training Scheme of BAT

In this subsection, we provide the detailed training scheme of BAT in Algorithm 2. The BAT algorithm starts from a randomly initialized neural network. On each mini-batch, it applies the PGD attack to generate (training) adversarial examples (Step 5). Following Eq. 3.6 and Eq. 3.7, BAT determines which samples are likely to be harmful atypical samples and computes their corresponding weight values (Step 6). Next, within the current mini-batch, BAT calculates the Discrimination Loss of the typical samples Dtyp (Step 7). Finally, BAT uses SGD to update the model parameters to minimize the reweighted adversarial loss regularized by 𝛽 times the Discrimination Loss (Step 8).

²In practice, under training algorithms using mini-batches, the set {𝑥_𝑘}_{𝑘=1}^𝐾 can be replaced by all (typical) samples in the mini-batch with labels different from 𝑦_𝑖.
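Before the formal procedure in Algorithm 2 below, the following PyTorch-style sketch illustrates how one BAT mini-batch objective (Eqs. 3.6-3.10) could be assembled. It assumes the adversarial examples x_adv have already been generated by PGD (Step 5), that per-sample memorization estimates mem_scores are precomputed (e.g., with the efficient proxies of [14]), and that a hypothetical model.features(·) returns penultimate-layer representations; the hyperparameter values are illustrative, and this is a sketch rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def bat_objective(model, x_clean, x_adv, y, mem_scores,
                  alpha=1.0, beta=0.2, mem_thresh=0.15, tau=0.1):
    """Reweighted adversarial loss (Eq. 3.8) plus beta times the Discrimination Loss (Eq. 3.9)."""
    with torch.no_grad():
        # Harmful score q(x) (Eq. 3.6): largest softmax score on a wrong class.
        probs = F.softmax(model(x_clean), dim=1)
        wrong = probs.clone()
        wrong[torch.arange(len(y)), y] = 0.0
        q = wrong.max(dim=1).values
        # Reweighting (Eq. 3.7): down-weight atypical samples that the current
        # model confidently assigns to a wrong class; everything else keeps weight 1.
        atypical = mem_scores > mem_thresh
        harmful = atypical & (probs.argmax(dim=1) != y)
        w = torch.ones_like(q)
        w[harmful] = torch.exp(-alpha * q[harmful])

    ce = F.cross_entropy(model(x_adv), y, reduction="none")
    adv_loss = (w * ce).sum() / w.sum()

    # Discrimination Loss (Eq. 3.9), computed on typical samples only.
    typ = ~atypical
    dl = ce.new_zeros(())
    if typ.sum() > 1:
        h = F.normalize(model.features(x_adv[typ]), dim=1)   # hypothetical feature hook
        y_t = y[typ]
        sim = h @ h.t() / tau
        eye = torch.eye(len(y_t), dtype=torch.bool, device=sim.device)
        pos = (y_t.unsqueeze(0) == y_t.unsqueeze(1)) & ~eye
        neg = y_t.unsqueeze(0) != y_t.unsqueeze(1)
        # -log( exp(s_ij) / sum over negatives k of exp(s_ik) ) for each positive pair (i, j)
        neg_lse = torch.logsumexp(sim.masked_fill(~neg, float("-inf")), dim=1, keepdim=True)
        pair_loss = neg_lse - sim
        valid = pos & neg.any(dim=1, keepdim=True)
        if valid.any():
            dl = pair_loss[valid].mean()

    return adv_loss + beta * dl
```

Steps 5-8 of Algorithm 2 then correspond to generating x_adv, computing the weights w, computing L_DL, and taking an SGD step on this objective.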
Algorithm 2 The Benign Adversarial Training (BAT) Algorithm.
1: Input: Training dataset D, with typical set Dtyp = {𝑥 ∈ D : mem(𝑥) ≤ 𝜎} and atypical set Datyp = {𝑥 ∈ D : mem(𝑥) > 𝜎}. Targeted type of adversarial attack: 𝑙∞-𝜖 attack. Hyperparameters 𝛼, 𝛽 ∈ R+.
2: Randomly initialize the network 𝐹.
3: repeat
4:   Fetch mini-batch data {(𝑥_𝑖, 𝑦_𝑖)} at the current epoch.
5:   Use PGD to generate adversarial training samples {(𝑥^adv_𝑖, 𝑦_𝑖)}.
6:   Calculate the harmful score q(𝑥_𝑖) and weight 𝑤_𝑖 as in Equation 3.6 and Equation 3.7.
7:   Calculate the Discrimination Loss L_DL(𝐹) within the current mini-batch, as in Equation 3.9.
8:   Update 𝐹 by SGD on the objective: L_BAT = (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖) + 𝛽 · L_DL(𝐹).
9: until end of training

3.6 Experiment

In this section, we present experimental results to validate the effectiveness of the proposed BAT algorithm on benchmark datasets and compare it with state-of-the-art baseline methods.

3.6.1 Experimental Setup

We conduct the experiments mainly on the benchmark datasets CIFAR10, CIFAR100 [56] and Tiny ImageNet [60].

Figure 3.5: Frequencies of training samples with different memorization values: (a) CIFAR100; (b) CIFAR10; (c) Tiny ImageNet.

Description of Datasets. In Fig. 3.6, we provide several pairs of images with a high influence value (as defined in Section 3.3.1) over 0.15. In each pair, the training sample also has a high memorization value over 0.15. These examples suggest that atypical samples exist in both the training and test sets of CIFAR10, CIFAR100 and Tiny ImageNet. A pair of atypical samples (one in the training set and one in the test set) with a high influence value are visually very similar. Moreover, since they have high influence values, removing the atypical samples from the training set is very likely to cause the model to fail on the corresponding test atypical samples. Therefore, without memorizing the atypical samples in the training set, the model can hardly predict the atypical samples in the test set. In Fig. 3.5, we also provide histograms of the estimated memorization values of all training samples from CIFAR10, CIFAR100 and Tiny ImageNet. From Fig. 3.5, we can observe that atypical samples (with memorization value > 0.15) constitute a significant fraction (over 40% and 50%, respectively) of CIFAR100 and Tiny ImageNet. In CIFAR10, they also constitute a non-negligible fraction of over 10%.

Figure 3.6: High-influence train/test pairs with influence value > 0.15, from CIFAR100, CIFAR10 and Tiny ImageNet.

Benchmark Defenses. For each dataset, we study the algorithms under the model architectures ResNet and WideResNet (WRN28, width = 10) [45]. In this section, we only present the results of the WRN models and leave the results on ResNet to the appendix. For a fair comparison with BAT, we implement baselines including PGD adversarial training [71] as well as its most popular variant TRADES [135]. In addition, we include several recent algorithms, MART [112] and GAIRAT [137], which also incorporate reweighting strategies into adversarial training. For BAT
and all baseline methods, we run the algorithms using SGD [7] for 160 epochs, where the learning rate starts at 0.1 and decays by a factor of 0.1 after epochs 80 and 120. We report the performance of all methods at the checkpoints where they reach the maximal PGD adversarial accuracy on a separate validation set. More implementation details are listed in the appendix.

Table 3.1: Performance of BAT vs. Baselines on CIFAR100 Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 59.7 (-2.3) | 34.9 (-3.7) | 24.7 (-3.8) | 24.2 (-2.3) | 22.5 (-2.3)
TRADES (1/𝜆 = 5) | 57.3 (-4.7) | 34.5 (-4.1) | 24.9 (-3.6) | 24.6 (-1.9) | 22.9 (-1.9)
MART [112] | 56.5 (-5.5) | 36.1 (-2.5) | 26.8 (-1.7) | 25.3 (-1.2) | 23.8 (-1.0)
GAIRAT [137] | 60.2 (-1.8) | 34.8 (-3.8) | 24.4 (-4.1) | 24.8 (-1.5) | 22.9 (-1.9)
BAT (𝛼 = 1, 𝛽 = 0.2) | 62.0 (+0.0) | 38.6 (+0.0) | 28.5 (+0.0) | 26.5 (+0.0) | 24.8 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 61.4 (-0.6) | 38.9 (+0.3) | 28.2 (-0.3) | 26.3 (-0.2) | 24.8 (-0.0)

Table 3.2: Performance of BAT vs. Baselines on Tiny ImageNet Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 58.9 (-3.4) | 35.7 (-5.9) | 31.7 (-4.1) | 30.0 (-3.7) | 30.0 (-4.3)
TRADES (1/𝜆 = 5) | 59.7 (-2.6) | 37.4 (-4.2) | 32.0 (-3.8) | 31.8 (-1.9) | 32.0 (-2.3)
MART [112] | 58.2 (-4.1) | 41.0 (-0.6) | 35.6 (-0.2) | 33.9 (+0.2) | 34.1 (-0.2)
GAIRAT [137] | 60.5 (-1.8) | 39.1 (-2.5) | 33.1 (-2.7) | 31.9 (-1.8) | 32.3 (-2.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 62.3 (+0.0) | 41.6 (+0.0) | 35.8 (+0.0) | 33.7 (+0.0) | 34.3 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 62.4 (+0.1) | 43.1 (+1.5) | 37.4 (+1.6) | 35.3 (+1.6) | 35.6 (+0.7)

Table 3.3: Performance of BAT vs. Baselines on CIFAR10 Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 85.6 (-0.4) | 59.6 (-1.5) | 52.2 (-1.0) | 50.7 (-0.7) | 49.4 (-1.1)
TRADES (1/𝜆 = 5) | 83.8 (-2.2) | 62.7 (+1.6) | 54.9 (+1.7) | 52.6 (+1.2) | 51.8 (+1.3)
MART [112] | 83.1 (-2.9) | 63.5 (+2.4) | 56.8 (+3.6) | 53.6 (+2.2) | 52.2 (+1.7)
GAIRAT [137] | 85.3 (-0.7) | 63.1 (+2.0) | 55.8 (+2.6) | 52.6 (+1.2) | 51.8 (+1.3)
BAT (𝛼 = 1, 𝛽 = 0.3) | 86.0 (+0.0) | 61.1 (+0.0) | 53.2 (+0.0) | 51.4 (+0.0) | 50.5 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 85.7 (-0.3) | 60.9 (-0.2) | 53.2 (+0.0) | 51.1 (-0.3) | 50.3 (-0.2)

Performance on CIFAR100. For a comprehensive comparison between the different methods on CIFAR100, in Table 3.1 we report the models' clean accuracy and adversarial accuracy against 𝑙∞-8/255 adversarial examples generated by various attack algorithms, including the FGSM attack [41], PGD attack [71], CW [15] and Auto-Attack [25]. For BAT, we report its performance under its optimal hyperparameters: 𝛼 = 1, 𝛽 = 0.2 and 𝛼 = 2, 𝛽 = 0.2. In the appendix, we discuss the impact of the selection of 𝛼 and 𝛽 on BAT and argue that both components are necessary for BAT to achieve the optimal performance. In Table 3.1, we report the performance difference between each baseline method and BAT (𝛼 = 1, 𝛽 = 0.2) in parentheses. From the results, we find that BAT enjoys a good clean & adversarial accuracy trade-off: it achieves the highest clean accuracy as well as the highest adversarial accuracy among all methods. Compared to PGD adversarial training [71], BAT has around 2 to 3% improvement on both clean accuracy and adversarial accuracy under strong attacks such as PGD, CW and AA. Compared to other baseline methods such as MART [112], it has more than 1% robustness improvement under strong attacks, as well as much higher clean accuracy (> 5%).

Performance on Tiny ImageNet. Tiny ImageNet [60] contains 200 classes of images from the original ImageNet [57] dataset, with 500 training images for each class and image size 64 × 64. In our experiments, we only choose the first 50 classes of Tiny ImageNet for training and prediction.
Since the image size is 64×64, for both training and robustness evaluation we consider adversarial attacks bounded by 𝑙∞-norm 4/255. In Table 3.2, we report the performance of BAT and the baseline methods. Similar to the conclusions drawn from CIFAR100, BAT achieves consistently better clean & adversarial accuracy than the baseline methods.

Performance on CIFAR10. On CIFAR10 (Table 3.3), BAT also (slightly) outperforms PGD adversarial training, with a 0.4% improvement on clean accuracy and better adversarial accuracy. However, the improvement is not as significant as on CIFAR100 and Tiny ImageNet, because CIFAR10 contains much fewer atypical samples. Remarkably, this experimental result on CIFAR10 also suggests that the atypical samples (or memorization) have a non-negligible influence on model performance, even in datasets which contain relatively few atypical samples. Compared to other baseline methods such as TRADES, MART and GAIRAT, BAT has lower adversarial accuracy but better clean accuracy.

Performance on ResNet18. We also compare the performance of BAT with the baselines under a smaller architecture, ResNet18, with the results reported in Table 3.4, Table 3.5 and Table 3.6. From the results, we can see that BAT does not have as significant an advantage over the baselines as under WRN28. This is because the capacity of ResNet18 is much smaller than that of WRN28, and it sometimes cannot achieve 100% training accuracy; therefore, the proposed BAT method cannot achieve the best performance among the baselines. However, BAT still achieves performance similar to PGD adversarial training in most cases.

Table 3.4: Performance of BAT vs. Baselines on CIFAR100 Under ResNet18.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 56.9 (-2.6) | 36.0 (-1.3) | 27.4 (+0.1) | 25.4 (-1.2) | 23.6 (-0.7)
TRADES (1/𝜆 = 5) | 56.6 (-2.9) | 36.5 (-0.8) | 26.9 (-0.4) | 25.3 (-1.3) | 23.9 (-0.4)
MART [112] | 51.8 (-7.7) | 36.1 (-1.2) | 30.4 (+3.1) | 25.8 (-0.8) | 24.4 (+0.1)
BAT (𝛼 = 1, 𝛽 = 0.2) | 59.5 (+0.0) | 37.3 (+0.0) | 27.3 (+0.0) | 26.6 (+0.0) | 24.3 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 59.3 (-0.2) | 37.1 (-0.2) | 27.4 (+0.1) | 26.5 (-0.1) | 24.0 (+0.7)

Table 3.5: Performance of BAT vs. Baselines on CIFAR10 Under ResNet18.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 81.9 (-0.3) | 59.1 (-0.2) | 51.8 (+0.3) | 50.2 (+0.2) | 48.1 (+0.1)
TRADES (1/𝜆 = 5) | 81.2 (-1.0) | 60.0 (+0.7) | 52.2 (+0.7) | 50.8 (+0.8) | 48.6 (+0.6)
MART [112] | 80.5 (-1.7) | 62.2 (+1.9) | 53.5 (+2.0) | 52.5 (+2.5) | 49.6 (+1.6)
BAT (𝛼 = 1, 𝛽 = 0.2) | 82.2 (+0.0) | 59.3 (+0.0) | 51.5 (+0.0) | 50.0 (+0.0) | 48.0 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.3) | 81.7 (-0.5) | 59.7 (+0.4) | 52.3 (+0.8) | 50.3 (+0.3) | 48.3 (+0.3)

Table 3.6: Performance of BAT vs. Baselines on Tiny ImageNet Under ResNet32.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 56.3 (-3.1) | 37.5 (-2.9) | 32.3 (+0.3) | 29.8 (-1.7) | 29.8 (-2.2)
TRADES (1/𝜆 = 5) | 55.4 (-4.0) | 35.2 (-5.2) | 28.8 (-3.2) | 27.0 (-4.5) | 27.0 (-5.0)
MART [112] | 56.2 (-3.2) | 38.1 (-2.3) | 34.5 (+2.5) | 31.8 (+0.3) | 32.0 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 59.4 (+0.0) | 40.4 (+0.0) | 32.0 (+0.0) | 31.5 (+0.0) | 32.0 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 59.4 (+0.0) | 41.3 (+0.9) | 32.9 (+0.9) | 31.8 (+0.3) | 32.4 (+0.4)

Table 3.7: Human Evaluation on Atypical Samples.
 | Class 𝑦1 | Class 𝑦2 | Both Likely | Neither | Total
High Harmful Score | 59.0 | 12.0 | 26.0 | 4.0 | 100.0
Low Harmful Score | 96.0 | 0.0 | 1.5 | 2.5 | 100.0

3.6.2 Ablation Study

An ablation study on "harmful atypical samples". In this subsection, we conduct additional experiments to verify the claims about "harmful atypical samples".
In particular, we aim to demonstrate that the atypical images with high "harmful scores" are likely to contain features which visually resemble images from a different (wrong) class, so fitting them is likely to degrade the model performance. To validate this claim, we print out 200 atypical images from the CIFAR100 training set with the highest / lowest harmful scores as defined in Eq. 3.6. Then, we let two independent human annotators label each image by choosing one of 4 options: (a) this image belongs to class 𝑦1; (b) this image belongs to class 𝑦2; (c) both class 𝑦1 and 𝑦2 are likely; (d) neither class 𝑦1 nor 𝑦2. Here, 𝑦1 is the ground-truth label of the sample and 𝑦2 is the class other than 𝑦1 on which the model has the maximal confidence, 𝑦2 = argmax_{𝑡≠𝑦1} 𝐹_𝑡(𝑥), where the model 𝐹 is obtained via PGD adversarial training with all atypical samples removed from the training set. We report the percentage of the annotators' answers in Table 3.7. From the results, we can see that for the samples with high harmful scores, people are more likely to believe that "the image is from 𝑦2" or "the image is likely from either 𝑦1 or 𝑦2". This is because the highly harmful images contain many features shared between these two classes. It is also worth mentioning that even the samples annotators labeled as 𝑦1 usually also have features belonging to the class 𝑦2. For example, a "bowl" which is round and orange: although people confidently believe it is a bowl, it does have features, such as "round" and "orange color", which are also features of the class "orange".

An ablation study on the hyperparameters in BAT. In this subsection, we study the impact of the hyperparameters of BAT, namely 𝛼, which controls the reweighting process, and 𝛽, which controls the Discrimination Loss. In Figure 3.7, we conduct experiments on CIFAR100 with BAT, where 𝛼 is chosen from [0, 1, 2, 3] and 𝛽 from [0, 0.1, 0.2, 0.3]. Figure 3.7 shows each model's overall clean accuracy (left) and adversarial accuracy under PGD attack (right) along the Z-axis, while the X-axis and Y-axis indicate the model's corresponding values of (𝛼, 𝛽). Note that when 𝛼 = 𝛽 = 0, the BAT method reduces to the original PGD adversarial training. From the results, we can see that a positive pair of (𝛼, 𝛽) benefits both the model's clean and adversarial accuracy (under PGD attack). Therefore, both components of BAT (i.e., the reweighting and L_DL) are helpful and necessary. However, when 𝛼 or 𝛽 is too large, it hurts BAT's clean accuracy. Hence, on CIFAR100, BAT achieves the optimal performance when 𝛼 = 1 or 2 and 𝛽 = 0.2.

Figure 3.7: Clean Acc. (left) & Adv. Acc. (right) of BAT.

3.7 Conclusion

In this paper, we draw significant connections between the memorization effect of deep neural networks and the behaviors of adversarial training algorithms. Based on our findings, we devise a novel algorithm, BAT, to enhance the performance of adversarial training. The findings of this work can motivate future studies on building robust DNNs with more attention to the data perspective.

CHAPTER 4
ENHANCING ROBUSTNESS MAY AMPLIFY UNFAIRNESS ON IMBALANCED DATA¹

4.1 Introduction

The existence of adversarial samples [41] has raised serious concerns about applying deep neural network (DNN) models in security-critical applications, such as autonomous driving [21] and video surveillance systems [58].
As countermeasures against adversarial attacks, adversarial training [71, 135] has been empirically proven to be one of the most effective and reliable defense methods. In general, adversarial training can be formulated as minimizing the model's average error on adversarially perturbed input examples [71, 135, 89]. Although promising for improving model robustness, most existing adversarial training methods [135, 112] assume that the number of training examples from each class is equally distributed. However, datasets collected from real-world applications typically have imbalanced distributions [37, 66, 107]. Hence, it is natural to ask: what is the behavior of adversarial training under imbalanced scenarios? Can we directly apply existing imbalanced learning strategies from natural training to tackle the imbalance issue in adversarial training?

Recent studies find that adversarial training usually presents distinct properties from natural training [94, 122]. For example, compared to natural training, adversarially trained models suffer more from the overfitting issue [94]. Moreover, a recent study [122] shows that adversarially trained models tend to present strong class-wise performance disparities, even if the training examples are uniformly distributed over different classes. If the training data distribution is highly imbalanced, these properties of adversarial training can be greatly exaggerated and make it extremely difficult to apply in practice. Therefore, it is important but challenging to answer the aforementioned questions.

As an initial effort to study the imbalance problem in adversarial training, in this work we first investigate the performance of existing adversarial training under imbalanced settings. In a preliminary study presented in Section 4.3.1, we apply both natural training and PGD adversarial training [71] on multiple imbalanced image datasets constructed from the CIFAR10 dataset [56] using the ResNet18 architecture [45] and evaluate the trained models' performance on class-balanced test datasets. From the preliminary results, we observe that, compared to naturally trained models, adversarially trained models always present very low standard accuracy and robust accuracy² on under-represented classes. For example, a naturally trained model achieves around 40% and 60% standard accuracy on the under-represented classes "frog" and "truck", respectively, while an adversarially trained model gets 0% standard & robust accuracy on both of these classes. This observation suggests that adversarial training is more sensitive to imbalanced data distributions than natural training. Thus, when applying adversarial training in practice, imbalanced learning strategies should always be considered.

As a result, we explore potential solutions which can handle the imbalance issue for adversarial training. In this work, we focus on studying the behavior of the reweighting strategy [44] and leave other strategies, such as resampling [36], for future work. In Section 4.3.2, we apply the reweighting strategy to existing adversarial training with varied weights assigned to one under-represented class and evaluate the trained models' performance.

¹This chapter is based on previously published work, ©IEEE, reprinted with permission, from Han Xu, Wentao Wang, Xiaorui Liu, Yaxin Li, Bhavani Thuraisingham, Jiliang Tang, titled "Imbalanced adversarial training with reweighting", published at the International Conference on Data Mining (12/2022) [111].
²In this work, we denote standard accuracy as the model's accuracy on input samples without perturbations and robust accuracy as the model's accuracy on input samples which are adversarially perturbed. Unless otherwise specified, we consider perturbations constrained by 𝑙∞-norm 8/255.
From the results, we observe that, in adversarial training, increasing the weights for an under-represented class can substantially improve the standard & robust accuracy on this class, but drastically hurts the model's performance on the well-represented class. For example, the robust accuracy of the adversarially trained model on the under-represented class "horse" can be greatly improved when setting a relatively large weight, like 200, for its examples, but the model's robust accuracy on the well-represented class "cat" drops to even lower than that of the class "horse", and hence the overall robust performance of the model also decreases. These facts indicate that the performance of adversarially trained models is very sensitive to reweighting manipulations, and it could be very hard to find a reweighting strategy which is optimal for all classes. It is also worth noting that this phenomenon is absent in natural training under the same settings. In natural training, from the results in Section 4.3.2, we find that upweighting the under-represented class increases the model's standard accuracy on this class but only slightly hurts the accuracy on other classes, even when adopting a large weight for the under-represented class.

To further investigate the possible reasons leading to the different behaviors of the reweighting strategy in natural and adversarial training, we visualize their learned features via t-SNE [106]. As shown in Figure 4.3, we observe that the features of different classes learned by the adversarially trained model tend to mix together, while they are well separated for the naturally trained model. This observation motivates us to theoretically show that when the given data distribution has poor data separability, upweighting under-represented classes will hurt the model's performance on well-represented classes. Motivated by this theoretical understanding, we propose a novel algorithm, Separable Reweighted Adversarial Training (SRAT), to facilitate the reweighting strategy in imbalanced adversarial training by enhancing the separability of learned features. Through extensive experiments, we validate the effectiveness of SRAT.

4.2 Related Work

Adversarial Robustness. The vulnerability of DNN models to adversarial examples has been verified by many successful attack methods [41, 15, 71]. To improve model robustness against adversarial attacks, various defense methods have been proposed [41, 71, 85, 24]. Among them, adversarial training has been proven to be one of the most effective defense methods [2]. Adversarial training can be formulated as solving a min-max optimization problem, where the outer minimization enforces the model to be robust to the adversarial examples generated by the inner maximization via existing attack methods like PGD [71]. Based on adversarial training, several variants, such as TRADES [135], MART [112] and FAT [136], have been presented to further improve the model's performance. More details about adversarial robustness can be found in recent surveys [18, 125].
Since almost all studies of adversarial training focus on balanced datasets, it is worthwhile to investigate the performance of adversarial training methods on imbalanced training datasets.

Imbalanced Learning. Most existing works on imbalanced training can be roughly classified into two categories, i.e., re-sampling and reweighting. Re-sampling methods aim to reduce the level of imbalance by either over-sampling data examples from under-represented classes [9, 12] or under-sampling data examples from well-represented classes [48, 35, 43, 127]. Reweighting methods allocate different weights to different classes or even different data examples. For example, Focal loss [65] enlarges the weights of wrongly-classified examples while reducing the weights of well-classified examples in the standard cross-entropy loss, and LDAM loss [13] regularizes the under-represented classes more strongly than the over-represented classes to attain good generalization performance on under-represented classes. More information about imbalanced learning can be found in recent surveys [44, 51]. The majority of existing methods focus on the natural training scenario, and their trained models will break down when facing adversarial attacks [41]. Hence, in this paper, we develop a novel method that can defend against adversarial attacks and achieve satisfactory performance under the imbalanced setting.

Figure 4.1: Class-wise performance of natural & adversarial training under an imbalanced CIFAR10: (a) Natural Training Standard Acc.; (b) Adv. Train. Standard Acc.; (c) Adv. Train. Robust Acc.

4.3 Preliminary Study

4.3.1 The Behavior of Adversarial Training

In this subsection, we conduct preliminary studies to examine the performance of PGD adversarial training [71] on an imbalanced training dataset resampled from the CIFAR10 dataset [56]. Following previous imbalanced learning works [26, 13], we construct an imbalanced training dataset where each of the first 5 classes (well-represented classes) has 5,000 training examples, and each of the last 5 classes (under-represented classes) has 50 training examples. Figure 4.1 shows the performance of naturally and adversarially trained models under a ResNet18 [45] architecture. From the figure, we can observe that, compared with natural training, PGD adversarial training results in a larger performance gap between well-represented classes and under-represented classes. For example, in natural training, the ratio between the average standard accuracy of well-represented classes (brown) and under-represented classes (violet) is about 2:1, while in adversarial training this ratio expands to 16:1. Moreover, although adversarial training achieves good standard & robust accuracy on well-represented classes, it has extremely poor performance on under-represented classes: 3 out of the 5 under-represented classes have 0% standard & robust accuracy. In conclusion, the performance of adversarial training is more easily affected by an imbalanced distribution than natural training and suffers more on under-represented classes. In Appendix 3.1, we provide more implementation details of this experiment, as well as additional results of the same experiment under other imbalanced settings. The results in Appendix 3.1 further support our findings.

Figure 4.2: Class-wise performance of natural & adversarial training under an imbalanced CIFAR10: (a) Natural Training Standard Acc.; (b) Adv. Train. Standard Acc.; (c) Adv. Train. Robust Acc.
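As a concrete illustration of the setup above, the following sketch builds the step-imbalanced CIFAR10 training set (5,000 examples for each of the first five classes, 50 for each of the last five) with torchvision; the random seed and the omission of data augmentation are illustrative simplifications rather than the exact preprocessing used in our experiments.

```python
import numpy as np
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10

def step_imbalanced_cifar10(root="./data", n_well=5000, n_under=50, seed=0):
    """Step-imbalanced CIFAR10: classes 0-4 keep n_well examples, classes 5-9 keep n_under."""
    full = CIFAR10(root=root, train=True, download=True)
    targets = np.array(full.targets)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(10):
        idx = np.where(targets == c)[0]
        k = n_well if c < 5 else n_under
        keep.extend(rng.choice(idx, size=k, replace=False))
    return Subset(full, sorted(int(i) for i in keep))
```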
4.3.2 The Reweighting Strategy in Natural Training vs. Adversarial Training

The preliminary study in Section 4.3.1 demonstrates that it is highly demanding to adjust the original adversarial training methods to accommodate class-imbalanced data. Therefore, in this subsection, we investigate the effectiveness of existing imbalanced learning strategies from natural training when adopted in adversarial training. In this paper, we focus on the reweighting strategy [44] as the initial effort to study this problem and leave other methods, such as resampling [20], for future investigation.

In this subsection, we conduct experiments under a binary classification problem, where the training dataset contains two classes randomly selected from the CIFAR10 dataset, with the two classes having 5,000 and 50 training examples, respectively. On this training dataset, we arrange multiple trials of (reweighted) natural training and (reweighted) adversarial training, with the weight ratio between the under-represented class and the well-represented class ranging from 1:1 to 200:1. Figure 4.2a shows the experimental results with data from the classes "horse" and "cat". As demonstrated in Figure 4.2a, increasing the weight of the under-represented class drastically increases the model's performance on the under-represented class, while also severely decreasing the performance on the well-represented class. For example, when increasing the weight ratio between the two classes from 1:1 to 150:1, the under-represented class's standard accuracy can be improved from 0% to ∼60% and its robust accuracy from 0% to ∼50%. However, the standard & robust accuracy of the well-represented class also decreases drastically. For instance, the well-represented class's standard accuracy drops from 100% to 60%, and its robust accuracy drops from 100% to 50%. These results illustrate that adversarial training's performance can be significantly affected by the reweighting strategy. As a result, the reweighting strategy in this setting can hardly improve the overall performance no matter which weight ratio is chosen, because the model's performance always presents a strong tension between these two classes. As a comparison, for the naturally trained models (Figure 4.2a), increasing the weights for the under-represented examples only slightly decreases the performance on the well-represented class. More experiments using different binary imbalanced datasets are reported in Appendix 3.2, where we have similar observations.

4.4 Theoretical Analysis

In Section 4.3.2, we observe that in natural training, the reweighting strategy only has a small impact on the two classes' performance. This phenomenon has been extensively studied by recent works [12, 119], which investigate the decision boundaries of perfectly fitted DNNs. In particular, they consider the case where the data is linearly (or nonlinearly) separable and study the behavior of linear (or nonlinear) models optimized by reweighted SGD algorithms. Interestingly, they conclude that over the course of training, these models' decision boundaries eventually converge to weight-agnostic solutions. For example, a linear classifier optimized by SGD on linearly separable data converges to the solution of the hard-margin support vector machine [81]. In other words, as long as the data can be well separated, reweighting will not have a large influence on the finally trained models, which is consistent with what we observed above.
Although these studies only focus on natural training, their interpretations and conclusions motivate our hypothesis for adversarial training. For adversarial training, we conjecture that it is because the models separate the data poorly that their performance is highly sensitive to the reweighting strategy. As a direct validation of this hypothesis, in Figure 4.3 we visualize the learned (penultimate-layer) features of the imbalanced training examples used in the binary classification problem in Section 4.3.2. We find that adversarially trained models indeed present obviously poorer separability of the learned features. This suggests that, compared to naturally trained models, adversarially trained models have a weaker ability to separate training data, which could make them sensitive to reweighting. Next, we theoretically analyze the impact of reweighting on linear models optimized on poorly separable data. Since our empirical study shows that adversarially trained models usually separate the data poorly (see Figure 4.3), the analysis can hopefully shed light on the behavior of reweighting in adversarially trained models in practice.

Binary Classification Problem. To construct the theoretical study, we focus on a binary classification problem with a Gaussian mixture distribution D, defined as:

𝑦 ∼ {−1, +1},   𝑥 ∼ N(𝜇, 𝜎²𝐼) if 𝑦 = +1;   𝑥 ∼ N(−𝜇, 𝜎²𝐼) if 𝑦 = −1,   where 𝜇 = (𝜂, ..., 𝜂) ∈ R^𝑑,   (4.1)

where the two classes' centers are ±𝜇 ∈ R^𝑑, and each dimension has mean value ±𝜂 (𝜂 > 0) and variance 𝜎². Formally, we define the data separability as 𝑆 = 𝜂/𝜎². Intuitively, a larger separability term 𝑆 suggests that the two classes are farther apart or that the data examples of each class are more concentrated, so the two classes can be well separated. Previous works [12] have also closely studied this term to describe data separability. Besides, we define the imbalanced training dataset to satisfy the condition Pr(𝑦 = +1) = 𝐾 · Pr(𝑦 = −1) with 𝐾 > 1, which is the imbalance ratio between the two classes. During test, we assume that the two classes appear with equal probability. Under the data distribution D, we discuss the performance of linear classifiers 𝑓(𝑥) = sign(𝑤ᵀ𝑥 − 𝑏), where 𝑤 and 𝑏 are the weight and bias terms of the model 𝑓. If a reweighting strategy is involved, we assume the model upweights the under-represented class "−1" by 𝜌. In the following lemma, we first derive the solution of the optimal linear classifier 𝑓 trained on this imbalanced dataset. Then we extend the result of Lemma 4.4 to analyze the impact of data separability on the performance of the model 𝑓.

Under the data distribution D defined in Eq. (4.1), with an imbalance ratio 𝐾 and a reweighting ratio 𝜌, the optimal classifier which minimizes the (reweighted) empirical risk:

𝑓* = argmin_𝑓 [ Pr(𝑓(𝑥) ≠ 𝑦 | 𝑦 = −1) · Pr(𝑦 = −1) · 𝜌 + Pr(𝑓(𝑥) ≠ 𝑦 | 𝑦 = +1) · Pr(𝑦 = +1) ]   (4.2)

has the solution 𝑤 = 1 and 𝑏 = (1/2) log(𝜌/𝐾) · 𝑑𝜎²/𝜂 = (1/2) log(𝜌/𝐾) · 𝑑/𝑆.

The proof of Lemma 4.4 can be found in Appendix 3.3.1. Note that the final optimized classifier has a weight vector equal to 1, and its bias term 𝑏 only depends on 𝐾, 𝜌 and the data separability 𝑆. In the following, our first theorem focuses on the special setting 𝜌 = 1, which corresponds to the original ERM model without reweighting.
Specifically, Theorem 4.4 calculates and compares the model's performance under two data distributions: D₁ (with a higher separability 𝑆₁) and D₂ (with a lower separability 𝑆₂). From Theorem 4.4, we aim to compare the behavior of linear models when they separate the data poorly (like adversarially trained models) and when they separate the data well (like naturally trained models).

Under two data distributions (𝑥⁽¹⁾, 𝑦⁽¹⁾) ∈ D₁ and (𝑥⁽²⁾, 𝑦⁽²⁾) ∈ D₂ with separability 𝑆₁ > 𝑆₂, let 𝑓₁* and 𝑓₂* be the optimal non-reweighted classifiers (𝜌 = 1) under D₁ and D₂, respectively. Given that the imbalance ratio 𝐾 is large enough, we have:

Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = −1) − Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1)
< Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = −1) − Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1).   (4.3)

The proof of Theorem 4.4 is provided in Appendix 3.3.2. Intuitively, Theorem 4.4 suggests that when the data separability 𝑆 is low (as in D₂), the optimized classifier (without reweighting) intrinsically has a larger error difference between the under-represented class "−1" and the well-represented class "+1". Similar to the observations in Section 4.3.1 and Figure 4.3, adversarially trained models also present a weak ability to separate data, and they also present a strong performance gap between the well-represented class and the under-represented class. Conclusively, Theorem 4.4 indicates that a poor ability to separate the training data can be one important reason leading to the strong performance gap of adversarially trained models. Next, we consider the case when the reweighting strategy is applied. Similar to Theorem 4.4, we calculate the models' class-wise error under D₁ and D₂ with different levels of separability. In particular, Theorem 4.4 focuses on the well-represented class "+1" and calculates its error increase when upweighting the under-represented class "−1" by 𝜌. Through this analysis, we compare the impact of upweighting the under-represented class on the performance of the well-represented class.

Under two data distributions (𝑥⁽¹⁾, 𝑦⁽¹⁾) ∈ D₁ and (𝑥⁽²⁾, 𝑦⁽²⁾) ∈ D₂ with different separability 𝑆₁ > 𝑆₂, let 𝑓₁* and 𝑓₂* be the optimal non-reweighted classifiers (𝜌 = 1) under D₁ and D₂, respectively, and let 𝑓′₁* and 𝑓′₂* be the optimal reweighted classifiers under D₁ and D₂ given the optimal reweighting ratio (𝜌 = 𝐾). Given that the imbalance ratio 𝐾 is large enough, we have:

Pr(𝑓′₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1) − Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1)
< Pr(𝑓′₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1) − Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1).   (4.4)

We detail the proof of Theorem 4.4 in Appendix 3.3.3. The theorem shows that, when the data distribution has poorer data separability, such as D₂, upweighting the under-represented class causes greater harm to the performance of the well-represented class. This is also consistent with our empirical findings about adversarially trained models: since adversarially trained models separate the data poorly (Figure 4.3), upweighting the under-represented class always drastically decreases the performance of the well-represented class (Section 4.3.2).

Figure 4.3: t-SNE visualization of penultimate-layer features: (a) Natural Training; (b) Adversarial Training.

Through the discussions of both theorems, we can conclude that poor separability can be one important reason which makes it extremely difficult for adversarial training and its reweighted variants to achieve good performance under imbalanced data distributions.
Therefore, in the next section, we explore potential solutions which can facilitate the reweighting strategy in adversarial training.

4.5 Separable Reweighted Adversarial Training (SRAT)

The observations from both the preliminary studies and the theoretical understanding indicate that more separable data will advance the reweighting strategy in adversarial training under imbalanced scenarios. Thus, in this section, we present a framework, Separable Reweighted Adversarial Training (SRAT), that enables the effectiveness of the reweighting strategy in adversarial training under imbalanced scenarios by increasing the separability of the learned latent feature space.

4.5.1 Reweighted Adversarial Training

Given an input example (x, 𝑦), adversarial training [71] aims to obtain a robust model 𝑓_𝜃 that makes the same prediction 𝑦 for an adversarial example x′, generated by applying an adversarial perturbation to x. The adversarial perturbations are typically bounded by a small value 𝜖 under the 𝐿_𝑝-norm, i.e., ∥x′ − x∥_𝑝 ≤ 𝜖. More formally, adversarial training can be formulated as solving a min-max optimization problem, where a DNN model is trained to minimize the prediction error on adversarial examples generated by iteratively maximizing some loss function. As indicated in Section 4.3.1, adversarial training cannot be applied in imbalanced scenarios directly, as it presents very low performance on under-represented classes. To tackle this problem, a natural idea is to integrate existing imbalanced learning strategies proposed in natural training, such as reweighting, into adversarial training to improve the trained model's performance on those under-represented classes. Hence, reweighted adversarial training can be defined as:

min_𝜃 (1/𝑛) Σ_{𝑖=1}^𝑛 max_{∥x′_𝑖 − x_𝑖∥_𝑝 ≤ 𝜖} 𝑤_𝑖 L(𝑓_𝜃(x′_𝑖), 𝑦_𝑖),   (4.5)

where 𝑤_𝑖 is a reweighting value assigned to each input sample (x_𝑖, 𝑦_𝑖) based on the example size of the class (x_𝑖, 𝑦_𝑖) belongs to or on some properties of (x_𝑖, 𝑦_𝑖). In most existing adversarial training methods [71, 135, 112], the cross-entropy (CE) loss is adopted as the loss function L(·, ·). However, the CE loss could be suboptimal in imbalanced settings, and some loss functions designed specifically for imbalanced settings, such as Focal loss [65] and LDAM loss [13], have been proven superior in natural training. Hence, besides the CE loss, Focal loss and LDAM loss can also be adopted as the loss function L(·, ·) in Eq. (4.5).

4.5.2 Increasing Feature Separability

Our preliminary study indicates that reweighted adversarial training alone cannot work well under imbalanced scenarios. Moreover, the reweighting strategy behaves very differently between natural training and adversarial training. Meanwhile, our theoretical analysis suggests that the poor separability of the feature space produced by the adversarially trained model can be one reason behind these observations. Hence, in order to facilitate the reweighting strategy in adversarial training under imbalanced scenarios, we equip our SRAT method with a feature separation loss. We aim to make the learned feature space as separable as possible. More specifically, the goal of the feature separation loss is to make (1) the learned features of examples from the same class well clustered, and (2) the features of examples from different classes well separated. By achieving this goal, the model is able to learn more discriminative features for each class.
With such discriminative features, adjusting the decision boundary via the reweighting strategy to better fit the under-represented classes' examples will not drastically hurt the well-represented classes. The feature separation loss is formally defined as:

$\mathcal{L}_{sep}(x'_i) = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z'_i \cdot z'_p / \tau)}{\sum_{a \in A(i)} \exp(z'_i \cdot z'_a / \tau)},$   (4.6)

where $z'_i$ is the feature representation of the adversarial example $x'_i$ of $x_i$, $\tau \in \mathbb{R}^+$ is a scalar temperature parameter, $P(i)$ denotes the set of input examples belonging to the same class as $x_i$, and $A(i)$ denotes the set of all input examples except $x'_i$. When minimizing the feature separation loss during training, the learned features of examples from the same class tend to aggregate together in the latent feature space, resulting in a more separable latent feature space. Our proposed feature separation loss $\mathcal{L}_{sep}(\cdot)$ is inspired by the supervised contrastive loss proposed in [52]. The main difference is that, instead of applying data augmentation techniques to generate two different views of each data example and feeding the model with the augmented examples, our feature separation loss directly takes the adversarial example $x'_i$ of each data example $x_i$ as input.

4.5.3 Training Schedule

By combining the feature separation loss with reweighted adversarial training, the final objective function of Separable Reweighted Adversarial Training (SRAT) can be defined as:

$\min_\theta \frac{1}{n} \sum_{i=1}^{n} \max_{\|x'_i - x_i\|_p \le \epsilon} w_i \, \mathcal{L}(f_\theta(x'_i), y_i) + \lambda \, \mathcal{L}_{sep}(x'_i),$   (4.7)

where we use a hyper-parameter $\lambda$ to balance the contributions of the reweighted adversarial training and the feature separation loss. In practice, in order to better take advantage of the reweighting strategy in our SRAT method, we adopt a deferred reweighting training schedule [13]. Specifically, before annealing the learning rate, our SRAT method first trains a model guided by Eq. (4.7) without introducing the reweighting strategy, i.e., setting $w_i = 1$ for every input example $x'_i$, and then applies reweighting in the model training process with a smaller learning rate. Since our SRAT method enables the model to learn a more separable feature space, compared with applying the reweighting strategy from the beginning of training, this deferred re-balancing training schedule allows the reweighting strategy to obtain more benefits from SRAT and, as a result, boosts the performance of our SRAT method with the help of the reweighting strategy. The detailed training algorithm of SRAT is shown in Appendix 3.4.

4.6 Experiment

In this section, we perform comprehensive experiments to validate the effectiveness of our proposed SRAT method. We first compare our method with several representative imbalanced learning methods in adversarial training under various imbalanced scenarios and then conduct an ablation study to understand our method more deeply.

4.6.1 Experimental Settings

Datasets. We conduct experiments on multiple imbalanced training datasets artificially created from two benchmark image datasets, CIFAR10 [56] and SVHN [80], with diverse imbalanced distributions. Specifically, we consider two types of imbalance: Exponential (Exp) imbalance [26] and Step imbalance [9]. For Exp imbalance, the number of training examples of each class is reduced according to an exponential function $n = n_i \tau^i$, where $i$ is the class index, $n_i$ is the number of training examples for class $i$ in the original CIFAR10/SVHN training set, and $\tau \in (0, 1)$.
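As an illustration of this construction, the per-class sample counts can be computed as follows. This is a minimal sketch: the mapping from the imbalance ratio $K$ to $\tau$ (namely $\tau = K^{-1/(C-1)}$ for $C$ classes) and the "reduce half of the classes" rule for Step imbalance are assumptions used only to make the example concrete.

```python
# Hypothetical helpers for building per-class sample counts; illustrative only.
def exp_imbalance_counts(n_per_class=5000, num_classes=10, imbalance_ratio=100):
    # Assume tau is chosen so the rarest class keeps n_per_class / K samples.
    tau = imbalance_ratio ** (-1.0 / (num_classes - 1))
    return [round(n_per_class * tau ** i) for i in range(num_classes)]

def step_imbalance_counts(n_per_class=5000, num_classes=10, imbalance_ratio=100):
    # Assume half of the classes stay at full size; the other half are reduced by K.
    half = num_classes // 2
    return [n_per_class if i < half else n_per_class // imbalance_ratio
            for i in range(num_classes)]

print(exp_imbalance_counts())   # counts decaying from 5000 down to 50 when K = 100
print(step_imbalance_counts())  # [5000]*5 + [50]*5 when K = 100
```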
We categorize the five most frequent classes in the constructed imbalanced training dataset as well-represented classes and the remaining five classes as under-represented classes. For Step imbalance, we follow the same process adopted in Section 4.3.1 to construct imbalanced training datasets based on CIFAR10 and SVHN, separately. Moreover, for both imbalance types, we denote the imbalance ratio $K$ as the ratio between the training example sizes of the most frequent and the least frequent class. In our experiments, we construct four different imbalanced datasets, named "Step-100", "Step-10", "Exp-100", and "Exp-10", by adopting different imbalance types (Step or Exp) with different imbalance ratios ($K = 100$ or $K = 10$) to train models, and we evaluate the models' performance on the original uniformly distributed test sets of CIFAR10 and SVHN, respectively. More detailed information about the imbalanced training sets used in our experiments can be found in Appendix 3.5.

Baseline methods. We implement several representative and state-of-the-art imbalanced learning methods (or their combinations) in adversarial training as baseline methods. These methods include: (1) Focal loss (Focal); (2) LDAM loss (LDAM); (3) Class-balanced reweighting (CB-Reweight) [26], where each example is reweighted proportionally to the inverse of the effective number³ of its class; (4) Class-balanced Focal loss (CB-Focal) [26], a combination of the Class-balanced method [26] and the Focal loss [65], where well-classified examples are down-weighted while hard-classified examples are up-weighted, controlled by their corresponding effective numbers; (5) deferred reweighted CE loss (DRCB-CE), where a deferred reweighting training schedule is applied based on the CE loss; (6) deferred reweighted Class-balanced Focal loss (DRCB-Focal), where a deferred reweighting training schedule is applied based on the CB-Focal loss; (7) deferred reweighted Class-balanced LDAM loss (DRCB-LDAM) [13], where a deferred reweighting training schedule is applied based on the CB-LDAM loss. In addition, we also include the original PGD adversarial training method using the cross entropy loss (CE) in our experiments.

Our proposed methods. We evaluate three variants of our proposed SRAT method with different implementations of the prediction loss $\mathcal{L}(\cdot, \cdot)$ in Eq. (4.5), i.e., CE loss, Focal loss, and LDAM loss. The variant utilizing the CE loss is denoted as SRAT-CE, and, similarly, the other two variants are denoted as SRAT-Focal and SRAT-LDAM, respectively. For all three variants, the Class-balanced method [26] is adopted to set reweighting values within the deferred reweighting training schedule.

Implementation details. We implement all methods used in our experiments based on the PyTorch library DeepRobust [62]. For CIFAR10-based imbalanced datasets, the adversarial examples used in training are calculated by PGD-10, with a perturbation budget $\epsilon = 8/255$ and step size $\gamma = 2/255$. For robustness evaluation, we report robust accuracy under $l_\infty$-norm 8/255 attacks generated by PGD-20 on ResNet-18 [45] models. For SVHN-based imbalanced datasets, the setting is similar to that for the CIFAR10-based datasets, except that we set the step size $\gamma$ to 1/255 in both the training and test phases, as suggested in [118].
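Putting Eqs. (4.6)–(4.7) together, a single SRAT loss evaluation can be sketched as below. This is a minimal PyTorch-style sketch: the batch-level contrast, the feature normalization, and the temperature value are assumptions rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def feature_separation_loss(z_adv, y, tau=0.1):
    """Eq. (4.6) computed within a batch: z_adv are features of the adversarial
    examples (assumed L2-normalized here), y are their labels."""
    z = F.normalize(z_adv, dim=1)                          # cosine similarities (assumption)
    sim = z @ z.t() / tau                                  # z'_i . z'_a / tau
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = sim.masked_fill(self_mask, float("-inf"))     # A(i): everything but x'_i
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0 on the diagonal
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # P(i): same-class examples
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss_i = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count
    return loss_i.mean()

def srat_loss(logits_adv, z_adv, y, weights, lam=1.0):
    """Weighted prediction loss plus lambda * feature separation loss, as in Eq. (4.7)."""
    ce = F.cross_entropy(logits_adv, y, reduction="none")
    return (weights * ce).mean() + lam * feature_separation_loss(z_adv, y)
```

Here `z_adv` would be the features of the PGD adversarial examples (assumed to be penultimate-layer features, matching Figure 4.3), and the weights are all ones before the deferred reweighting stage described next.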
For the deferred reweighting training schedule used in our methods and some baseline methods, we set the number of training epochs to 200 and the initial learning rate to 0.1, and then decay the learning rate at epochs 160 and 180 with a ratio of 0.01. The reweighting strategy is applied starting from epoch 160.

³The effective number is defined as the volume of examples and can be calculated by $(1 - \beta^{n_i})/(1 - \beta)$, where $\beta \in [0, 1)$ is a hyperparameter and $n_i$ denotes the number of examples of class $i$.

4.6.2 Performance Comparison

Tables 4.1 and 4.2 show the performance comparison on various imbalanced CIFAR10 datasets with different imbalance types and imbalance ratios. In these two tables, we use bold values to denote the highest accuracy among all methods, and underlined values to indicate the SRAT variants that achieve the highest accuracy among their corresponding baseline methods utilizing the same loss function for making predictions. Due to space limitations, we report the performance comparison on SVHN-based imbalanced datasets in Appendix 3.6.

From Table 4.1 and Table 4.2, we can make the following observations. First, compared to the baseline methods, our SRAT methods obtain improved performance in terms of both overall standard and robust accuracy under almost all imbalanced settings. More importantly, our SRAT methods make significant improvements on the under-represented classes, especially under the extremely imbalanced setting. For example, on the Step-imbalanced dataset with imbalance ratio $K = 100$, our SRAT-Focal method improves the standard accuracy on under-represented classes from 21.81%, achieved by the best baseline method utilizing the Focal loss, to 51.83%, and the robust accuracy from 3.24% to 15.89%. These results demonstrate that our proposed SRAT method achieves better robustness under imbalanced settings. Second, the performance gap among the three variants SRAT-CE, SRAT-Focal, and SRAT-LDAM is mainly caused by the gap between the loss functions used in these methods. As shown in Tables 4.1 and 4.2, DRCB-LDAM typically performs better than DRCB-CE and DRCB-Focal, and similarly, SRAT-LDAM outperforms SRAT-CE and SRAT-Focal under the corresponding imbalanced settings.

4.6.3 Ablation Study

In this subsection, we provide an ablation study to understand our SRAT method more comprehensively.

Feature space visualization. In order to facilitate the reweighting strategy in adversarial training under the imbalanced setting, we present a feature separation loss in our SRAT method.

Table 4.1: Performance comparison on imbalanced CIFAR10 datasets (Imbalance Type: Step).

                     Imbalance Ratio 10                    Imbalance Ratio 100
              Standard Acc.     Robust Acc.         Standard Acc.     Robust Acc.
Method        Overall  Under    Overall  Under      Overall  Under    Overall  Under
CE            63.26    40.62    36.96    14.23      47.29     9.03    30.39     1.62
Focal         63.57    41.17    36.89    14.25      47.36     9.03    30.12     1.45
LDAM          57.08    31.09    37.18    12.44      42.49     0.85    30.80     0.05
CB-Reweight   73.30    74.80    41.34    42.15      37.68    19.64    25.58    10.33
CB-Focal      73.47    73.69    41.19    41.02      15.44     0.00    14.46     0.00
DRCB-CE       75.89    70.55    39.93    33.33      53.40    22.86    28.31     3.35
DRCB-Focal    74.61    67.06    37.91    29.50      52.75    21.81    27.78     3.24
DRCB-LDAM     72.95    75.42    45.23    44.98      61.60    50.69    31.37    16.25
SRAT-CE       76.32    73.20    41.71    37.86      59.10    40.24    30.02    11.72
SRAT-Focal    75.41    74.91    42.05    41.28      62.93    51.83    28.38    15.89
SRAT-LDAM     73.99    76.63    45.60    45.96      63.13    52.73    33.51    18.89
Table 4.2: Performance comparison on imbalanced CIFAR10 datasets (Imbalance Type: Exp).

                     Imbalance Ratio 10                    Imbalance Ratio 100
              Standard Acc.     Robust Acc.         Standard Acc.     Robust Acc.
Method        Overall  Under    Overall  Under      Overall  Under    Overall  Under
CE            71.95    64.09    37.94    26.79      48.40    23.04    26.94     6.17
Focal         72.06    63.99    37.62    26.27      49.16    23.69    26.84     5.88
LDAM          67.39    58.01    41.35    28.65      48.39    25.69    29.51     8.95
CB-Reweight   75.17    76.87    41.02    41.67      57.49    56.47    29.01    26.53
CB-Focal      74.73    76.67    38.86    42.41      50.35    60.05    27.15    33.56
DRCB-CE       76.25    75.83    40.02    37.93      57.30    37.90    26.97    10.57
DRCB-Focal    75.36    72.72    37.76    33.83      54.76    31.79    25.24     7.81
DRCB-LDAM     73.92    78.53    46.29    48.81      62.65    57.19    31.66    22.11
SRAT-CE       76.94    79.50    41.50    43.08      64.93    64.34    29.68    25.42
SRAT-Focal    75.26    80.52    42.37    47.22      62.57    64.88    30.34    28.66
SRAT-LDAM     74.63    79.82    46.72    50.38      63.11    65.60    34.22    32.55

The main goal of the feature separation loss is to make the learned feature space as separable as possible. To check whether the feature separation loss works as expected, we apply t-SNE [106] to visualize the latent feature space learned by our SRAT-LDAM method in Figure 4.4. As a comparison, we also provide the visualization of the feature spaces learned by the original PGD adversarial training method (CE) and the DRCB-LDAM method. As shown in Figure 4.4, the feature space learned by our SRAT-LDAM method is more separable than those of the two baseline methods. This observation demonstrates that, with our proposed feature separation loss, the adversarially trained model is able to learn much better features, and thus our SRAT method can achieve superior performance.

Figure 4.4: t-SNE feature visualization of training examples learned by SRAT and two baseline methods using the imbalanced training dataset "Step-100". (a) CE. (b) DRCB-LDAM. (c) SRAT-LDAM.

Impact of reweighting values. In all SRAT variants, we adopt the Class-balanced method [26] to assign different weights to different classes based on their effective number. To explore how the assigned weights impact the performance of our proposed SRAT method, we conduct experiments on a Step-imbalanced CIFAR10 dataset with imbalance ratio $K = 100$ to see how the model's performance changes with different reweighting values. In our experiments, we assign the five well-represented classes weight 1 and vary the weight of the remaining five under-represented classes from 10 to 200. The experimental results are shown in Figure 4.5. Here, we use the integer 78 to approximate the weight calculated by the Class-balanced method when the imbalance ratio equals 100. From Figure 4.5, we can observe that, for all SRAT variants, the model's standard accuracy increases as the weights assigned to the under-represented classes increase. However, the robust accuracy of the three methods does not change synchronously with the standard accuracy. When increasing the weights for under-represented classes, the robust accuracy of SRAT-LDAM is almost unchanged, while the robust accuracy of SRAT-CE and SRAT-Focal even decreases slightly. As a trade-off, using a relatively large weight, such as 78 (the value given by the Class-balanced method) or 100 (equal to the imbalance ratio $K$), allows our SRAT method to obtain satisfactory performance in both standard and robust accuracy.

Figure 4.5: The impact of reweighting values using the imbalanced training dataset "Step-100".

Figure 4.6: The impact of the hyper-parameter $\lambda$ using the imbalanced training datasets "Step-100" and "Exp-100". (a) Step-100. (b) Exp-100.

Impact of hyper-parameter $\lambda$.
In our proposed SRAT method, the contributions of the feature separation loss and the prediction loss are controlled by the hyper-parameter $\lambda$. In this part, we study how this hyper-parameter affects the performance of our SRAT method. In our experiments, we evaluate the performance of all SRAT variants with different values of $\lambda$ used in the training process on both the Step-imbalanced and Exp-imbalanced CIFAR10 datasets with imbalance ratio $K = 100$. As shown in Figure 4.6, the performance of all SRAT variants is not very sensitive to the choice of $\lambda$. However, a large value of $\lambda$, such as 8, may hurt the model's performance.

4.7 Conclusion

In this work, we first empirically investigate the behavior of adversarial training under imbalanced settings and explore potential solutions to assist adversarial training in tackling imbalance issues. As neither adversarial training itself nor adversarial training with a reweighting strategy works well under imbalanced scenarios, we further theoretically verify that poor data separability is one key reason for the failure of adversarial training based methods in these scenarios. Based on our findings, we propose the Separable Reweighted Adversarial Training (SRAT) framework to facilitate the reweighting strategy in imbalanced adversarial training by enhancing the separability of the learned features. Through extensive experiments, we validate the effectiveness of SRAT. In the future, we plan to examine how other types of defense methods perform under imbalanced scenarios and how other balanced learning strategies from natural training behave under adversarial training.

CHAPTER 5
ENHANCING FAIRNESS MAY THREATEN ROBUSTNESS AGAINST POISONING ATTACKS¹

5.1 Introduction

Data poisoning attacks [6, 23, 100] have brought huge safety concerns to machine learning systems that are trained on data collected from public resources [55, 115]. For example, studies [6, 74, 11, 100] have shown that an attacker can inject only a small fraction of fake data into the training pool of a classification model and intensely degrade its accuracy. As countermeasures against data poisoning attacks, there are defense methods [100, 29] which can successfully identify the poisoning samples and sanitize the training dataset. Recently, in addition to model safety, people have also paid significant attention to fairness. They stress that machine learning models should provide "equalized" treatment or "equalized" prediction quality among groups of the population [42, 1, 34, 128]. Since fair classification problems are human-society related, it is highly likely that training data is provided by humans, which makes it easy for adversarial attackers to inject malicious data. Therefore, fair classification algorithms are also prone to being threatened by poisoning attacks. Since fair classification problems have optimization objectives and optimization processes distinct from traditional classification, a natural question is: Can we protect fair classification from data poisoning attacks? In other words, are existing defenses sufficient to defend fair classification models? To answer these questions, we first conduct a preliminary study on the Adult Census Dataset to explore whether existing defenses can protect fair classification algorithms (see Section 3.2). In this work, we focus on representative defense methods including the k-NN Defense [53] and SEVER [29].
To fully exploit their vulnerability to poisoning attacks, we introduce a new attacking algorithm, F-Attack, where the attacker aims to cause the failure of fair classification. In detail, by injecting the poisoning samples, the attacker aims to mislead the trained classifier such that it cannot achieve good accuracy, or cannot satisfy the fairness constraints. From the preliminary results, we find that both the k-NN defense and SEVER suffer an obvious accuracy or fairness degradation after the attack. Moreover, we also compare F-Attack with one of the strongest poisoning attacks, the Min-Max Attack [100] (which is devised for traditional classification). The result demonstrates that our proposed F-Attack has a better attacking effect than the Min-Max Attack. In conclusion, our preliminary study highlights the vulnerability of fair classification to poisoning attacks, especially to F-Attack.

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Yuxuan Wan, Jiliang Tang titled "Towards Fair Classification against Poisoning Attacks", submitted to the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining [123].

In this paper, we further propose a defense framework, Robust Fair Classification (RFC), to improve the robustness of fair classification against poisoning attacks. Different from existing defenses, our method aims to scout abnormal samples within each individual sensitive subgroup of each class. To achieve this goal, RFC first applies a strategy similar to the works [30, 29] to find abnormal data samples which significantly deviate from the distribution of the other (clean) samples. Based on a theoretical analysis, we verify that RFC can exclude more poisoning samples than clean samples in each step. Moreover, to further avoid removing too many clean samples, we introduce an Online-Data-Sanitization process: in each iteration, we remove a possible poisoning set from a single subgroup of a single class and test the retrained model's performance on a clean validation set. This helps us locate the subgroup which contains the most poisoning samples. Through extensive experiments on two benchmark datasets, the Adult Census Dataset and COMPAS, we validate the effectiveness of our defense. Our key contributions are summarized as follows:

• We devise a strong attack method to poison fair classification and demonstrate the vulnerability of fair classification, even under the protection of traditional defenses against poisoning attacks.

• We propose an efficient and principled framework, Robust Fair Classification (RFC). Extensive experiments and theoretical analysis demonstrate the effectiveness and reliability of the proposed framework.

5.2 Problem Statement and Notations

In this section, we formally define the setting of our studied problem and the necessary notations.

Fair Classification. In this paper, we focus on classification problems which incorporate group-level fairness criteria. First, let $x \in \mathbb{R}^d$ be a random vector denoting the (non-sensitive) features, with a label $y \in \mathcal{Y} = \{Y_1, ..., Y_m\}$ with $m$ classes and a sensitive attribute $z \in \mathcal{Z} = \{Z_1, Z_2, ..., Z_k\}$ with $k$ groups. Let $f(x, w)$ represent a classifier with parameters $w \in \mathcal{W}$. Then, a fair classification problem can be defined as:

$\min_{w} \; \mathbb{E}\big[\, l(f(x, w), y) \,\big] \quad \text{s.t.} \quad g_j(w) \le \tau, \;\; \forall j \in \mathcal{Z}$   (5.1)

where $\mathbb{E}[l(\cdot)]$ is the expected loss on the test distribution and $\tau$ is the unfairness tolerance.
The constraint function $g_j(w) = \mathbb{E}\big[\, h(w, x, y) \mid z = j \,\big]$ represents the desired fairness metric for each group $j \in \mathcal{Z}$. For example, in binary classification problems, we use $f(x, w) > 0$ to indicate a positive classification outcome. Then, $h(w, x, y) = \mathbf{1}(f(x, w) > 0) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0)\big]$ refers to equalized positive rates in the equalized treatment criterion [73]. Similarly, $h(w, x, y) = \mathbf{1}(f(x, w) > 0 \mid y = \pm 1) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0 \mid y = \pm 1)\big]$ is for equalizing true / false positive rates in equalized odds [42]. Given any training dataset $D$, we define the empirical loss $L(D, w)$ as the average loss value of the model on $D$, and $g_j(D, w)$ as the empirical fairness constraint function. In our paper, we assume that the clean training samples are sampled from the true distribution $\mathcal{D}$ with density $P(x, y, z)$. We also use $\mathcal{D}_{u,v}$ to denote the distribution of (clean) samples given $y = Y_u$ and $z = Z_v$, which has density $P(x, y, z \mid y = Y_u, z = Z_v)$.

Poisoning Attack ($\epsilon$-poisoning model). In our paper, we consider the poisoning attack in the following scenario. Given a fair classification task, the poisoned dataset is generated as follows: first, $n$ clean samples $D_C = \{(x_i, y_i, z_i)\}_{i=1}^{n}$ are drawn from $\mathcal{D}$ to form the clean training set. Then, an adversary is allowed to insert an $\epsilon$ fraction of $D_C$ with arbitrary choices of $D_P = \{(x_i, y_i, z_i)\}_{i=1}^{\epsilon n}$. We refer to such a poisoned training set $D_C \cup D_P$ as an $\epsilon$-poisoning model.

5.3 Fair Classification is Vulnerable to Poisoning Attacks

In this section, we first introduce the algorithm of our proposed attack F-Attack under the $\epsilon$-poisoning model. Then, we conduct empirical studies to evaluate the robustness of fair classification algorithms (and popular defenses) against F-Attack and baseline attacks.

5.3.1 F-Attack: Poisoning Attacks for Fair Classification

Given a specific fair classification task, we consider an attacker who aims to contaminate the training set such that applying existing algorithms cannot successfully fulfill the fair classification goal. Note that for fair classification tasks, both accuracy and fairness are desired properties and they always have strong tension in practice [75]. Therefore, in our attack, we consider misleading the training algorithms such that at least one of the two criteria is unsatisfied. Formally, we define the attacker's objective as Eq. (5.2), where the attacker inserts a poisoning set $D_P$ with size $\epsilon n$ in the feasible injection space $\mathcal{F}$ to achieve:

$\max_{D_P \subseteq \mathcal{F}} \; \mathbb{E}\big[\, l(f(x, w^*), y) \,\big] \quad \text{s.t.} \quad w^* = \arg\min_{w \in \mathcal{H}_{fair}} L(D_C \cup D_P, w).$   (5.2)

It means that if the classifier $w^*$, which is trained on $D_C \cup D_P$ and has a low empirical loss $L(D_C \cup D_P, w^*)$, falls in the space $\mathcal{H}_{fair}$, then it will have a large expected loss (on the test distribution $\mathcal{D}$). Here, $\mathcal{H}_{fair}$ is the space of models (with a limited norm) that satisfy the fairness criteria on the clean distribution $\mathcal{D}$. To take a closer look at Eq. (5.2), we discuss it case by case. Suppose we obtain $w^*$ by fair classification on the set $D_C \cup D_P$; there are two cases:

1. $w^* \notin \mathcal{H}_{fair}$: The fairness criteria (on the test set) are not satisfied.
2. $w^* \in \mathcal{H}_{fair}$: Since $w^*$ is trained on $D_C \cup D_P$ to have low $L(D_C \cup D_P, w^*)$, $w^*$ will have high test error.

In either case, the model $w^*$ will have either unsatisfactory accuracy or unsatisfactory fairness. Next, we simplify the objective and constraints in Eq. (5.2) to transform it into a solvable problem.
We first conduct relaxations of the objective in Eq. (5.2) (similar to the works [100, 53]):

$\mathbb{E}\big[\, l(f(x, w), y) \,\big] \;\overset{(i)}{\approx}\; L(D_C, w) \;\overset{(ii)}{\le}\; L(D_C, w) + \epsilon L(D_P, w) = (1 + \epsilon) L(D_C \cup D_P, w)$

Specifically, approximation (i) holds if the clean training data $D_C$ has sufficient samples and is close to the test distribution $\mathcal{D}$, and the model $w$ is appropriately regularized. The upper bound (ii) holds because of the non-negativity of the loss values, and it can be tight if the fraction of poisoning samples $\epsilon$ is small. Thus, we transform Eq. (5.2) into a bi-level optimization problem between $w$ and $D_P$. If the model $f(\cdot)$ and loss $l(\cdot)$ are convex, we can further swap the two levels to get a min-max form:

$\max_{D_P \subseteq \mathcal{F}} \min_{w \in \mathcal{H}_{fair}} L(D_C \cup D_P, w) \;\;\rightarrow\;\; \min_{w \in \mathcal{H}_{fair}} \max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w)$   (5.3)

Our proposed F-Attack solves Eq. (5.3) as shown in Algorithm 3. It solves a saddle point problem by alternately finding the worst attack point $(x, y, z)$ w.r.t. the current model and then updating the model in the direction of the attack point. In detail, in Step (1), given the current model $w$, we solve the inner maximization problem to maximize $L(D_C \cup D_P, w)$, which is equal to finding the sample $(x, y, z)$ with the maximal loss:

$\max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w) = L(D_C, w) + \epsilon \cdot \max_{(x, y, z) \in \mathcal{F}} l(f(x, w), y).$   (5.4)

In Step (2), we update $w$ to minimize $L(D_C \cup D_P, w)$. Note that in Step (2) we should also constrain the model $w$ to fall into the fair model space $\mathcal{H}_{fair}$. Thus, when we update $w$ in Step (2), we also penalize the fairness violation of $w$. Here, we calculate the fairness violation as $(g_j(D_C, w) - \tau)_+$ (with weight parameter $\lambda > 0$) on the clean set $D_C$ to approximate the fairness violation on the real data $\mathcal{D}$.

Algorithm 3 Algorithm of F-Attack.
1: Input: Clean data $D_C$, number of poisoned samples $\epsilon n$, feasible set $\mathcal{F}$, fairness constraint functions $g_j(\cdot)$ and tolerances $\tau_j$ ($j = 1, ..., |\mathcal{Z}|$), $\lambda > 0$, $\eta > 0$, warm-up steps $n_{burn}$.
2: Output: A poisoning set $D_P$.
3: repeat
4:   Solve the inner maximization: $(x, y, z) = \arg\max_{(x, y, z) \in \mathcal{F}} l(f(x, w), y)$
5:   Solve the outer minimization: $w = w - \eta \cdot \nabla_w \big( L(D_C \cup D_P, w) + \lambda \sum_j (g_j(D_C, w) - \tau)_+ \big)$
6: until the algorithm converges

5.3.2 Preliminary Study on Adult Census Dataset

Adult Census Dataset. In this subsection, we conduct an experiment on the Adult Census Dataset [54] to test whether F-Attack can poison fair classification methods and whether existing defense methods can resist F-Attack. Here, we focus on the fairness criterion Equalized True Positive Rate (TPR) [42] between the genders, and we apply the constrained optimization method [34] to train linear classifiers to fulfill the fair classification objective. It is worth mentioning that this dataset contains many categorical features, such as marital-status, occupation, etc. For simplicity, we preprocess the dataset by transforming the categorical features into a continuous space spanned by the first 15 principal directions of the training (categorical) data. More details of the pre-processing procedure can be found in Appendix 4.2.

Figure 5.1: PCA visualization of Clean Samples and Poisoning Samples.

Defense Methods. Besides naive fair classification, we mainly consider two representative data-sanitization defenses, which are existing popular methods to defend against poisoning attacks:

• k-NN Defense [53]. This method removes the samples that are far from their k nearest neighbors. In detail, the k-NN defense calculates the "abnormal" score as $q_i = \|x_i - x_i^{(k,y)}\|_2$, where $x_i^{(k,y)}$ is the k-th nearest neighbor of sample $x_i$ in class $y$.
In this paper, we set $k = 5$.

• SEVER [29]. This method aims to find abnormal samples by tracing abnormal gradients. In each iteration, we first train a fair classifier $f$ (with fixed $\tau$) and calculate the gradient of the loss w.r.t. the weight $w$ for each training sample $(x_i, y_i)$, obtaining the normalized gradient matrix $Q = \big[ \nabla_w l(f(x_i, w), y_i) - \frac{1}{n} \sum_{j=1}^{n} \nabla_w l(f(x_j, w), y_j) \big]_{i=1,\dots,n}$. SEVER flags the samples with large "abnormal" score $q_i = (Q_i \cdot v)^2$ as abnormal samples, where $v$ is the top right singular vector of $Q$. Intuitively, the "abnormal" samples contribute greatly to the variation of the gradient matrix $Q$, which suggests that their gradients significantly deviate from the gradients of the other samples.

Results. In our experiments, we insert 10% poisoning samples into the training set and define the feasible injection set $\mathcal{F}$ to be $\{(x, y, z) : \|x - \mu_y\| \le d\}$, where $d$ is a fixed radius. This constrains the inserted samples to not be too far from the center of their labeled class, in order to evade potential defenses. Since $\mathcal{F}$ is not related to the sensitive attribute $z$, during F-Attack we generate poisoning samples with $z$ fixed to 0 (female) or 1 (male). During fairness training, we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following [59]). Then, we report the test performance of the model with the best validation performance (which considers both accuracy and fairness; see Section 5.4, Eq. (5.9) for more details). In Table 5.1, we report the performance² of the defense methods. From the results, we can see that all training methods suffer a significant performance degradation under F-Attack. For example, under F-Attack (z = 0), the SEVER defense has an ≈ 4% accuracy drop and a 2% fairness drop. This suggests that defenses such as SEVER and k-NN can be greatly vulnerable to poisoning attacks in fair classification. Moreover, we compare F-Attack with the baseline attack method Min-Max [100, 53], which is one of the strongest attacks for traditional classification. It also solves Eq. (5.3) but does not constrain $w \in \mathcal{H}_{fair}$. From Table 5.1, we can see that Min-Max causes a milder performance degradation and thus has a weaker attacking effect than F-Attack. This result highlights the threat of F-Attack to fair classification.

²We report the "goodness of fairness" as 1 − Unfair, i.e., 1 − |TPR(z = 0) − TPR(z = 1)|, in Table 5.1.

Table 5.1: F-Attack vs. Min-Max on Adult Census Dataset.

              No Attack      Min-Max (z=0)   Min-Max (z=1)   F-Attack (z=0)  F-Attack (z=1)
              Acc.   Fair.   Acc.   Fair.    Acc.   Fair.    Acc.   Fair.    Acc.   Fair.
No Defense    0.811  0.962   0.801  0.958    0.799  0.851    0.768  0.956    0.793  0.803
k-NN          0.794  0.952   0.681  0.936    0.680  0.904    0.655  0.951    0.695  0.895
SEVER         0.812  0.969   0.798  0.958    0.797  0.967    0.773  0.942    0.772  0.943

Discussion. To gain a deeper understanding of the behavior of F-Attack, in Figure 5.1 we visualize the clean samples and poisoning samples (via F-Attack and Min-Max Attack) in a 2-dimensional projected space (via PCA). From the figure, we can see that, compared to the Min-Max Attack (red), the samples obtained by F-Attack (yellow) have a smaller distance to the clean samples of their labeled class $y = 1$, although they are constrained in the same feasible injection set $\mathcal{F}$. This is because F-Attack aims to find samples with maximal loss (Eq. (5.4)) for fair classifiers, so the generated samples do not have the maximal loss for traditional classifiers. Thus, the poisoning samples from F-Attack stay closer to their labeled class.
This fact helps explain why F-Attack is more insidious than the Min-Max Attack under the detection of traditional defenses, such as SEVER.

5.4 Robust Fair Classification (RFC)

Motivated by the studies in Section 5.3, new defenses are desired to protect fair classification against poisoning attacks, especially against F-Attack. In this section, we first introduce a novel defense framework called Robust Fair Classification (RFC), and we provide a theoretical study to further understand the mechanism of RFC. In Section 5.5, we conduct empirical studies to validate the robustness of RFC in practice.

5.4.1 Robust Fair Classification (RFC)

Based on the discussion in Section 5.3, F-Attack can evade traditional defenses such as SEVER and the k-NN Defense because the generated poisoning samples are close to the clean data of their labeled class. However, we assume that they may deviate from the distribution of clean samples in their labeled subgroup (the data distribution $\mathcal{D}_{y,z}$ given $y$ and $z$). Refer to Figure 5.2, which shows the locations of poisoning samples generated via F-Attack (z = 0) and of clean samples in the subgroup (y = 1, z = 0) in the 2D projected space. It suggests that the poisoning samples greatly contaminate the information of the distribution $P(x, y, z \mid y = 1, z = 0)$ in the training data. Thus, the injected poisoning samples will not only confuse the original prediction task from $x$ to $y$, but also greatly disturb the fairness constraints (Eq. (5.1)). This observation motivates us to propose a new defense method that can scout abnormal samples from each individual subgroup in each class. Next, we introduce the details of our proposed defense RFC.

Figure 5.2: PCA visualization.

Centered Data Matrix & Alignment Score. Our method shares a similar high-level idea as [30, 29]: to find data points that systematically deviate from the distribution of the other (clean) samples. Specifically, in our method, given a (poisoned) dataset $D$, we repeatedly scout the poisoning samples from each subgroup $D(x, y, z \mid y = Y_u, z = Z_v)$, where we use $(u, v)$ to denote the index of each class and subgroup. In the later parts, we use $D_{u,v}$ to denote the samples in $D(x, y, z \mid y = Y_u, z = Z_v)$ for simplicity. Then, we define:

$Q^{u,v} = \Big[\, x_i - \frac{1}{n_{u,v}} \sum_{j=1}^{n_{u,v}} x_j \,\Big]_{(x_i, y_i, z_i) \in D_{u,v}},$   (5.5)

to be the centered data matrix of the training samples $D_{u,v}$, where $n_{u,v}$ is the size of the set $D_{u,v}$. For each $Q^{u,v}$, the top right singular vector $V^{u,v}$ of $Q^{u,v}$ is the direction which explains the variation of the data distribution in $(Y_u, Z_v)$. Similar to the studies in [30, 29], we conjecture that the poisoning samples deviate from the clean samples, so they take the major responsibility for the variation of the data matrix $Q^{u,v}$; in this way, they will have high alignments with the direction of $V^{u,v}$. Thus, we define the Alignment Score $q^{u,v}$ of each training sample $(x_i, y_i, z_i)$ in $D_{u,v}$ as:

$q^{u,v}(x_i) = Q^{u,v}_i \cdot (V^{u,v})^T.$   (5.6)

Notably, the poisoning samples may align with either the positive or the negative direction of the top right singular vector $V^{u,v}$, but they should share one common direction among themselves.
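As a concrete reference, the centered data matrix and alignment scores of Eqs. (5.5)–(5.6) for one subgroup can be computed as follows. This is a minimal NumPy sketch; the use of `numpy.linalg.svd` and the array layout are implementation assumptions, not the authors' released code.

```python
import numpy as np

def alignment_scores(X_subgroup):
    """X_subgroup: (n_uv, d) array of features of all samples with y = Y_u, z = Z_v.
    Returns the alignment score q^{u,v}(x_i) of Eq. (5.6) for each sample."""
    Q = X_subgroup - X_subgroup.mean(axis=0, keepdims=True)   # Eq. (5.5): centered matrix
    _, _, vt = np.linalg.svd(Q, full_matrices=False)          # SVD of the centered matrix
    v_top = vt[0]                                             # top right singular vector V^{u,v}
    return Q @ v_top                                          # Q_i . V^{u,v} for each row i
```

Samples whose scores have a large magnitude (of either sign) are the candidates collected into the proposed poisoning sets defined next.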
Because of this sign ambiguity, in our method we define two Proposed Poisoning Sets for each $(u, v)$, so that one of the two sets is likely to contain the poisoning samples:

$P^{u,v}_+ = \{ x_i \mid q^{u,v}(x_i) > \gamma_+, \; q^{u,v}(x_i) > 0 \}; \quad P^{u,v}_- = \{ x_i \mid -q^{u,v}(x_i) > \gamma_-, \; q^{u,v}(x_i) < 0 \};$   (5.7)

In Eq. (5.7), we set $\gamma_+$ (or $\gamma_-$) to be the q-th ($q = 90$) percentile of all the alignment scores (or negative alignment scores) in $D_{u,v}$, so that each proposed poisoning set only contains a small portion of $D_{u,v}$. In practice, we repeatedly test whether removing the proposed poisoning sets helps improve the retrained model's performance (both accuracy and fairness) on a clean validation set. This helps decide whether the proposed poisoning set contains poisoning samples. In Section 5.4.2, we further conduct a theoretical analysis to show that the poisoning samples are more likely to have higher poisoning scores.

Fair Classification by Excluding Poisoning Sets. Finally, given a (poisoned) training set $D$ as well as the proposed poisoning sets, we can keep retraining fair classifiers by excluding the proposed poisoning sets:

$w_+ = \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_+) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_+; w) \le \tau_j, \; \forall j \in \mathcal{Z}$
$w_- = \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_-) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_-; w) \le \tau_j, \; \forall j \in \mathcal{Z}$   (5.8)

In practice, our proposed RFC method repeatedly proposes potential poisoning sets for each individual subgroup in each class $D_{u,v}$ until finding the best poisoning set among all choices. Algorithm 4 provides a detailed description of the RFC procedure, which is an Online Data Sanitization process. Specifically, during each iteration of RFC, for each $D_{u,v}$, we first calculate the poisoning scores and the proposed poisoning sets (Steps (1) & (2)). Then, we remove the proposed poisoning set from the dataset $D$ and conduct fair classification on $D$ (Step (3)). In Step (4), we evaluate the retrained classifier on a separate clean validation set and find the proposed poisoning set which results in the highest validation performance. Notably, we measure the validation performance by considering both the accuracy and the fairness criteria, by defining:

$\text{ValScore} = \Pr\big(f(x, w) = y\big) - \sum_{j \in \mathcal{Z}} \lambda \cdot \big(g_j(D_{Val}, w) - \tau\big)_+,$   (5.9)

where $\tau$ is an unfairness tolerance threshold and $\lambda$ is a positive number (we set $\lambda = 3$ in this paper). The second term penalizes the models if some subgroup's unfairness violation exceeds $\tau$. Finally, we remove the proposed poisoning set which results in the highest ValScore and conduct the next round of searching (Step (6)).

Algorithm 4 Robust Fair Classification (RFC) - An Online Data Sanitization Algorithm.
1: Input: An $\epsilon$-poisoning model with dataset $D = \{(x_i, y_i, z_i)\}_{i=1,2,...,n}$, number of RFC iterations $T$.
2: Output: A fair classifier.
3: repeat
4:   repeat
5:     Get the Centered Data Matrix $Q^{u,v}$ and Poisoning Scores $q^{u,v}_i$ following Eq. (5.5) and Eq. (5.6)
6:     Get the Proposed Poisoning Sets $P^{u,v}_+$ and $P^{u,v}_-$ following Eq. (5.7)
7:     Conduct fair classification by removing each Proposed Poisoning Set via Eq. (5.8) and get $w^{u,v}_\pm$.
8:     Record the ValScore on a separate (clean) validation set for each $w^{u,v}_\pm$.
9:   until all subgroups are checked
10:  Get the best proposed poisoning set $P^*$ which achieves the highest ValScore across all $(u, v)$ and $P^{u,v}_\pm$.
11:  Remove the best proposed poisoning set from $D$ and set $D = D \setminus P^*$
12:  t = t + 1
13: until t is larger than T

Remarkably, it is also worth mentioning that the RFC framework can be extended to various model architectures, such as Deep Neural Networks (DNNs), for robust fair classification. For example, we can apply Energy-based Out-of-Distribution Detection [69] to find the abnormal samples in each $D_{u,v}$ and then follow a procedure similar to RFC to propose poisoning sets and conduct fair classification. We leave the study of DNNs for future exploration.

5.4.2 Theoretical Analysis

In this subsection, we conduct a theoretical analysis to further help understand the behavior of RFC, especially the role of the poisoning scores in finding poisoning samples. In particular, we consider a simple theoretical setting where the clean samples from each group $\mathcal{D}_{u,v}$ follow a distinct Gaussian distribution $\mathcal{D}_{u,v} \sim \mathcal{N}(\mu_{u,v}, \Sigma_{u,v})$, with center $\mu_{u,v}$ and covariance matrix $\Sigma_{u,v}$. In the following theorem, we show that when there are (poisoning) samples that deviate from the clean samples of $\mathcal{D}_{u,v}$ by having a center $\mu$ far from the center of the clean samples, they will have larger squared poisoning scores (Eq. (5.6)) than the clean samples. Thus, the proposed poisoning sets (Eq. (5.7)) are likely to contain more poisoning samples than clean samples. For simplicity, we use $\mathcal{D}$ and $\mathcal{N}(\mu, \Sigma)$ to denote the clean distribution of a given group $D_{u,v}$.

Theorem 5. Suppose that a set of "clean" samples $\mathcal{S}_{good}$ with size $n$ is i.i.d. sampled from the distribution $\mathcal{N}(\mu, \Sigma)$, where $\Sigma \preceq \sigma^2 I$. There is a set of "bad" samples $\mathcal{S}_{bad}$ with size $n_p = n/K$, $K > 1$, whose center satisfies $\|\mu_p - \mu\|_2 = d$ with $d = \gamma \cdot \sigma$. Then, the average squared poisoning scores of the clean samples and the bad samples satisfy:

$\mathbb{E}_{i \in \mathcal{S}_{bad}}\big[ q^2(x_i) \big] - \mathbb{E}_{i \in \mathcal{S}_{good}}\big[ q^2(x_i) \big] \ge \Big( \frac{K-1}{K+1} \cdot \gamma^2 - (K+1) \Big) \sigma^2$

Theorem 5 suggests that the difference between the average (squared) poisoning scores of $\mathcal{S}_{bad}$ and $\mathcal{S}_{good}$ is controlled by $\gamma$ and the sample ratio $K$. Since $K > 1$, if $\gamma^2 > \frac{(K+1)^2}{K-1}$ (which means the poisoning samples are sufficiently far from the clean distribution), we can conclude that the difference is positive. Thus, removing the samples with the highest positive (or lowest negative) poisoning scores (as in Eq. (5.7)) will help eliminate more poisoning samples than clean samples. If $\gamma^2$ is small, the poisoning samples are close to the true distribution, which also means they have limited influence on the model performance. The detailed proof of Theorem 5 is deferred to Appendix 4.1. In our RFC algorithm, we alternately check each proposed poisoning set and see whether removing it helps improve the retrained model's performance. This also avoids removing too many clean samples.

5.5 Experiment

5.5.1 Experimental Setup

In this section, we conduct comprehensive experiments to validate the effectiveness of our proposed attack and defense on two benchmark datasets, the Adult Census Dataset and the COMPAS Dataset. In this part, we only consider the Equalized True Positive Rate (TPR) between different sensitive subgroups, which is optimized via the fair classification method [34]. When applying [34], we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following [59]). Then, we report the test performance of the model with the best validation performance (which considers both accuracy and fairness; see Section 5.4, Eq. (5.9)).
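For reference, the validation score in Eq. (5.9) that drives this model selection can be sketched as follows. This is a minimal sketch; the function and argument names, and the default tolerance value, are illustrative assumptions.

```python
import numpy as np

def val_score(y_pred, y_true, group_violations, tau=0.05, lam=3.0):
    """Eq. (5.9): validation accuracy minus a penalty on fairness violations.
    group_violations[j] plays the role of g_j(D_Val, w) for each sensitive group j."""
    acc = np.mean(np.asarray(y_pred) == np.asarray(y_true))
    penalty = lam * np.sum(np.maximum(np.asarray(group_violations) - tau, 0.0))
    return acc - penalty
```

A higher ValScore indicates a retrained classifier that is both accurate and within the unfairness tolerance, which is how RFC (and the model selection above) picks among candidates.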
In Appendix 4.3, we provide additional results for a different type of fairness, "Equalized Treatment", and a different fair classification method [128].

Attacks. We consider that the training set can be contaminated by Label Flipping [84] and Sensitive Attribute Flipping [110]. We also consider the attack methods Min-Max and F-Attack, which are introduced in Section 5.3. Notably, for each method, we assume the poisoning samples are constrained in the same feasible injection set $\mathcal{F} = \{(x, y, z) : \|x - \mu_y\| \le d\}$, which limits the poisoning samples' distance to the class center. Thus, for Min-Max and F-Attack, we assign the generated samples a pre-defined sensitive attribute $z = 0$ or $z = 1$. Furthermore, we introduce an additional attack method, "F-Attack*", which has the same algorithm as F-Attack but a different feasible injection set: $\mathcal{F} = \{(x, y, z) : \|x - \mu_{y,z}\| \le d\}$, where $\mu_{y,z}$ is the center of the group $\mathcal{D}_{y,z}$. Because this feasible injection set is related to the sensitive attribute $z$, we do not need to pre-define $z$ during F-Attack*. Remarkably, this attack aims to test the robustness of RFC, because the major goal of RFC is to find samples in each group $D_{y,z}$ which are far from $\mu_{y,z}$. Thus, F-Attack* may evade RFC by constraining the poisoning samples' distance to $\mu_{y,z}$. In Appendix 4.3, we also report the performance of all attacks and defenses under different choices of the radius $d$.

Baseline Defenses. To validate the effectiveness of RFC, we include the following baseline defense methods: (1) the naive method, which does not apply any defense strategy; (2) SEVER [29], a representative defense for traditional classification tasks, where we apply [34] on the dataset sanitized by SEVER; (3) [90], a method to defend fair classification against label flipping attacks, which leverages an adversarial training strategy [132]; and (4) the method [110], which applies Distributionally Robust Optimization (DRO) to improve robustness when labels and sensitive attributes are contaminated. For the baseline methods, we report their performance with the choice of hyperparameter that achieves the optimal ValScore (Eq. (5.9)) on a clean validation set.

Table 5.2: RFC & Baseline Methods' Performance against Poisoning Attacks on Adult Census Dataset.

              No Attack      Label Flip (10%)  Label Flip (20%)  Attr. Flip (10%)  Attr. Flip (20%)
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.817  0.963   0.805  0.966      0.800  0.968      0.812  0.945      0.804  0.951
SEVER         0.812  0.960   0.805  0.967      0.803  0.960      0.813  0.937      0.803  0.944
Wang          0.809  0.958   0.793  0.970      0.779  0.961      0.809  0.945      0.813  0.935
Roh           0.805  0.948   0.798  0.937      0.788  0.939      0.800  0.950      0.795  0.943
RFC           0.811  0.959   0.807  0.973      0.796  0.966      0.805  0.967      0.802  0.965

(ε = 10%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.801  0.969   0.797  0.864      0.769  0.966      0.796  0.804      0.799  0.799
SEVER         0.800  0.960   0.795  0.954      0.778  0.945      0.773  0.937      0.772  0.956
Wang          0.782  0.976   0.783  0.968      0.779  0.956      0.791  0.948      0.792  0.945
Roh           0.780  0.963   0.782  0.961      0.765  0.959      0.766  0.956      0.776  0.944
RFC           0.802  0.967   0.811  0.946      0.803  0.950      0.808  0.952      0.809  0.951

(ε = 15%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.792  0.961   0.787  0.837      0.686  0.953      0.775  0.779      0.788  0.760
SEVER         0.792  0.959   0.783  0.978      0.765  0.952      0.755  0.909      0.765  0.927
Wang          0.777  0.970   0.779  0.965      0.738  0.955      0.762  0.960      0.711  0.955
Roh           0.726  0.948   0.748  0.957      0.722  0.962      0.764  0.929      0.774  0.929
RFC           0.801  0.954   0.796  0.941      0.800  0.963      0.795  0.947      0.802  0.945

5.5.2 Experimental Results

Adult Census Dataset. We first show the results on the Adult Census Dataset in Table 5.2. To further guarantee that the comparison between different defenses is fair, we use a balanced clean training dataset in which each class has an equal number of samples, since baseline methods such as SEVER can be affected by class imbalance. On this dataset, we set the desired fairness criterion to be |TPR(z = 0) − TPR(z = 1)| < 0.05. In Table 5.2, we also mark the cases (with brown color) where the algorithms output models with much poorer fairness than desired. From Table 5.2, we can see that RFC achieves good accuracy and fairness under the different types of dataset contamination. Especially under strong attacks such as F-Attack, the accuracy and fairness are only slightly degraded after injecting poisoning samples. However, baseline methods such as [110] and [90] have a clear performance (especially accuracy) degradation under F-Attack. Notably, the attack method F-Attack* has attacking performance similar to F-Attack (z = 1). This is because, on this dataset, F-Attack* also generates samples with z = 1.

COMPAS Dataset. On the COMPAS dataset [8], we consider the same type of fairness criterion, Equalized TPR. In this dataset, equity is desired among races, namely "Caucasian (z = 0), African-American (z = 1) and Hispanic (z = 2)", and we follow a preprocessing procedure similar to that for the Adult Census Dataset. The number of samples in the group "Hispanic" is much smaller than in the other two groups, so we only consider injecting poisoning samples with z = 0 or z = 1. In Table 5.3, we report the performance of the studied attacks and defenses, and we use $1 - \max_{j \in \mathcal{Z}} |\text{TPR}(z = j) - \widehat{\text{TPR}}|$ to measure the "goodness" of fairness, where $\widehat{\text{TPR}}$ is the average TPR over the whole dataset. During training, we set the desired fairness criterion to be $\max_{j \in \mathcal{Z}} |\text{TPR}(z = j) - \widehat{\text{TPR}}| \le 0.15$. From the results in Table 5.3, we can see that RFC is the only method that consistently preserves model accuracy and fairness after poisoning samples are injected into the dataset.

5.6 Related Works

Poisoning Attacks. In this section, we introduce related work and discuss how this work differs from prior studies. Data poisoning attacks [6] refer to the scenario in which models are threatened by adversaries who insert malicious training samples in order to take control of the trained model's behavior [64, 97]. In this work, we concentrate on untargeted poisoning attacks [6, 53], where the attacker aims to degrade the overall performance of the trained model. To defend against poisoning attacks, well-established methods [116, 91, 100, 29, 103, 114] have been proposed to defend efficiently and effectively in various scenarios. This paper is within the scope of linear classification problems, and we leave the study of DNN models for future work.

Fair Classification. Fairness issues have recently drawn much attention from the machine learning community. Fairness issues for common classification problems can be generally divided into two categories: (1) equalized treatment [128] (or "Statistical Rate"); and (2) equalized prediction quality [42].
Table 5.3: RFC & Baseline Methods' Performance against Poisoning Attacks on COMPAS Dataset.

              No Attack      Label Flip (10%)  Label Flip (20%)  Attr. Flip (10%)  Attr. Flip (20%)
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.656  0.865   0.655  0.866      0.664  0.836      0.677  0.839      0.665  0.847
SEVER         0.650  0.854   0.677  0.833      0.674  0.835      0.674  0.835      0.667  0.835
Wang          0.662  0.847   0.650  0.869      0.631  0.863      0.643  0.861      0.659  0.862
Roh           0.654  0.851   0.646  0.891      0.663  0.823      0.621  0.834      0.615  0.845
RFC           0.661  0.850   0.676  0.859      0.682  0.841      0.685  0.850      0.667  0.856

(ε = 10%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.649  0.845   0.645  0.837      0.665  0.832      0.630  0.860      0.661  0.826
SEVER         0.634  0.865   0.630  0.859      0.634  0.845      0.627  0.840      0.664  0.826
Wang          0.648  0.851   0.620  0.855      0.639  0.865      0.624  0.871      0.644  0.868
Roh           0.644  0.880   0.631  0.865      0.659  0.865      0.633  0.825      0.640  0.858
RFC           0.656  0.863   0.661  0.834      0.668  0.853      0.667  0.866      0.673  0.845

(ε = 15%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.676  0.802   0.644  0.863      0.665  0.802      0.630  0.860      0.642  0.831
SEVER         0.621  0.880   0.620  0.880      0.585  0.890      0.606  0.887      0.663  0.845
Wang          0.645  0.865   0.631  0.852      0.606  0.842      0.611  0.829      0.624  0.841
Roh           0.655  0.847   0.628  0.877      0.625  0.865      0.631  0.849      0.621  0.877
RFC           0.676  0.852   0.659  0.843      0.653  0.848      0.645  0.847      0.672  0.841

For classification models to satisfy these fairness criteria, popular methods including [128, 34, 1] solve constrained optimization problems, and [132] applies the adversarial training [71] method.

Comparison to Prior Works. There are recent works that test the robustness of fair classification methods by manipulating their training sets; they have also proposed possible strategies to defend against such perturbations. For example, the works [110, 59, 16, 17] consider injecting naturally / adversarially generated noise only on the sensitive attributes. Another line of research [90, 109] considers the vulnerability of fairness training to (coordinated) label-flipping attacks [84]. As countermeasures against their proposed perturbations, representative works such as [90] propose an adversarial training framework [132] to train the model to distinguish clean samples from poisoning samples while preserving model fairness, and the work [110] solves robust optimization problems by assigning soft sensitive attributes. In our work, in terms of attack, we consider a stronger attacker, who can insert sophisticatedly calculated features and sensitive attributes to fully exploit the vulnerability of fairness training methods.

5.7 Conclusion

In this work, we study the problem of poisoning attacks on fair classification. We propose a strong attack method that can evade the defenses of most existing methods. Then, we propose an effective strategy to greatly improve the robustness of fair classification methods. In the future, we aim to examine whether our findings can be generalized to other machine learning tasks and other machine learning models, such as Deep Neural Networks (DNNs).

CHAPTER 6
CONCLUSION AND FUTURE DIRECTIONS

6.1 Conclusion

In this thesis, we systematically study the interaction between machine learning robustness and fairness and demonstrate the negative impact they can have on each other.
On one hand, we show that adversarial training, as one of the most popular and successful methods to improve ML robustness, can result in unfairness from different perspectives. Specifically, in Chapter 2, Chapter 3, and Chapter 4, we observe that adversarial training either causes or exacerbates class-wise unfairness, inner-class unfairness, and unfairness on imbalanced datasets, respectively. On the other hand, we show in Chapter 5 that fair machine learning techniques can also make the model more vulnerable to data poisoning attacks. Overall, these results demand careful examination when applying techniques to improve either ML robustness or fairness, especially in applications where both are essential. Moreover, these empirical and theoretical results in classification tasks also motivate future investigations into: (1) the interaction between ML robustness and ML fairness in various types of models; and (2) the interaction of ML robustness or fairness with other ML properties, such as privacy and explainability.

6.2 Future Directions

6.2.1 ML Robustness vs. ML Fairness in Generative Models

In this thesis, most of the discussed works focus on image classification tasks and classification tasks on tabular data. However, there are other data domains and other types of machine learning tasks beyond classification that are worth studying. For example, previous works such as [120] focus on adversarial robustness in language processing tasks. Meanwhile, works such as [50, 70] focus on adversarial robustness in graph neural networks. Different from image or tabular data classification tasks, the machine learning models and training algorithms in these domains may exhibit different mechanisms and behaviors. Therefore, it is also important to extend the previous studies into these domains to obtain new findings.

6.2.2 ML Robustness / Fairness vs. ML Privacy / Copyright / Explainability

Beyond machine learning robustness and fairness, there are also other trustworthiness concerns from society about the application of ML / AI models. For example, people are concerned that ML models can release private information about their users [82, 130, 131]. Besides, machine learning models can also be misused to infringe the copyright of visual artworks [86], articles [92], or music [101]. Moreover, for machine learning techniques applied in high-stakes applications, such as healthcare or financial prediction, people also desire to understand and explain the predictions of ML models. Thus, methodologies have also been proposed to improve ML explainability [67]. Given these trustworthiness problems from various facets, it is also worthwhile to study the inner relationships among them. As one of the pioneering works, [38] finds a positive interaction between ML robustness and explainability in graph neural networks [50]. Moreover, there is a line of research showing that techniques for finding adversarial examples can naturally serve as tools for data protection against privacy and copyright infringement. Specifically, the works [46, 87, 63, 108] aim to add imperceptible perturbations to images such that machine learning models cannot effectively extract information from them, so that private information and copyright can be protected.
Similarly, there are also watermarking strategies [28, 27] trying to add perturbations to insert particular “fingerprint” information to the data which they aim to protect. 82 BIBLIOGRAPHY [1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69. PMLR, 2018. [2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018. [3] Golnoosh Babaei and Paolo Giudici. How fair is machine learning in credit lending? Quality and Reliability Engineering International, 2024. [4] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. [5] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. [6] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012. [7] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. [8] Tim Brennan, William Dieterich, and Beate Ehret. Evaluating the predictive validity of the compas risk and needs assessment system. Criminal Justice and behavior, 36(1):21–40, 2009. [9] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018. [10] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91, 2018. [11] Cody Burkard and Brent Lagesse. Analysis of causative attacks against svms learning from data streams. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, pages 31–36, 2017. [12] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881. PMLR, 2019. 83 [13] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019. [14] Nicholas Carlini, Ulfar Erlingsson, and Nicolas Papernot. Distribution density, tails, and outliers in machine learning: Metrics and applications. arXiv preprint arXiv:1910.13427, 2019. [15] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pages 39–57. IEEE, 2017. [16] L Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K Vishnoi. Fair classification with noisy protected attributes: A framework with provable guarantees. In International Conference on Machine Learning, pages 1349–1361. PMLR, 2021. [17] L Elisa Celis, Anay Mehrotra, and Nisheeth K Vishnoi. Fair classification with adversarial perturbations. arXiv preprint arXiv:2106.05964, 2021. [18] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018. 
[19] Niladri S Chatterji and Philip M Long. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. arXiv preprint arXiv:2004.12019, 2020. [20] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002. [21] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE international conference on computer vision, pages 2722–2730, 2015. [22] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020. [23] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017. [24] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019. [25] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an en- 84 semble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216. PMLR, 2020. [26] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019. [27] Yingqian Cui, Jie Ren, Yuping Lin, Han Xu, Pengfei He, Yue Xing, Wenqi Fan, Hui Liu, and Jiliang Tang. Ft-shield: A watermark against unauthorized fine-tuning in text-to-image diffusion models. arXiv preprint arXiv:2310.02401, 2023. [28] Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, and Jiliang Tang. Diffusionshield: A watermark for copyright protection against generative diffusion models. arXiv preprint arXiv:2306.04642, 2023. [29] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019. [30] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017. [31] Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. Max-margin adversarial (mma) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018. [32] Chengyu Dong, Liyuan Liu, and Jingbo Shang. Data profiling for adversarial training: On the ruin of problematic data. arXiv preprint arXiv:2102.07437, 2021. [33] Yinpeng Dong, Ke Xu, Xiao Yang, Tianyu Pang, Zhijie Deng, Hang Su, and Jun Zhu. Exploring memorization in adversarial training. arXiv preprint arXiv:2106.01606, 2021. [34] Michele Donini, Luca Oneto, Shai Ben-David, John Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. arXiv preprint arXiv:1802.08626, 2018. [35] Chris Drummond, Robert C Holte, et al. C4. 5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, volume 11, pages 1–8. Citeseer, 2003. 
[36] Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1):18–36, 2004. [37] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zis- 85 serman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. [38] Wenqi Fan, Yao Ma, Han Xu, Xiaorui Liu, Jianping Wang, Qing Li, and Jiliang Tang. Deep adversarial canonical correlation analysis. In Proceedings of the 2020 SIAM international conference on data mining, pages 352–360. SIAM, 2020. [39] Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020. [40] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. arXiv preprint arXiv:2008.03703, 2020. [41] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [42] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29:3315–3323, 2016. [43] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009. [44] Haibo He and Yunqian Ma. Imbalanced learning: foundations, algorithms, and applications. 2013. [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [46] Pengfei He, Han Xu, Jie Ren, Yingqian Cui, Hui Liu, Charu C Aggarwal, and Jiliang Tang. Sharpness-aware data poisoning attack. arXiv preprint arXiv:2305.14851, 2023. [47] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136, 2019. [48] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002. [49] Wei Jin, Yaxin Li, Han Xu, Yiqi Wang, and Jiliang Tang. Adversarial attacks and defenses on graphs: A review and empirical study. arXiv preprint arXiv:2003.00653, 2020. [50] Wei Jin, Yaxing Li, Han Xu, Yiqi Wang, Shuiwang Ji, Charu Aggarwal, and Jiliang Tang. Adversarial attacks and defenses on graphs. ACM SIGKDD Explorations Newsletter, 22(2):19– 86 34, 2021. [51] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019. [52] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020. [53] Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data sanitization defenses. Machine Learning, pages 1–47, 2021. [54] Ron Kohavi et al. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd, volume 96, pages 202–207, 1996. [55] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. 
[56] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [57] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012. [58] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016. [59] Alexandre Louis Lamy, Ziyuan Zhong, Aditya Krishna Menon, and Nakul Verma. Noisetolerant fair classification. arXiv preprint arXiv:1901.10837, 2019. [60] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7:7, 2015. [61] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 4313–4324. PMLR, 2020. [62] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. Deeprobust: A pytorch library for adversarial attacks and defenses. arXiv preprint arXiv:2005.06149, 2020. [63] Yaxin Li, Jie Ren, Han Xu, and Hui Liu. Neural style protection: Counteracting unauthorized neural style transfer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3966–3975, 2024. 87 [64] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. arXiv preprint arXiv:2007.08745, 2020. [65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. [67] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020. [68] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017. [69] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020. [70] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34:9720–9733, 2021. [71] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [72] Melissa D McCradden, Shalmali Joshi, Mjaye Mazwi, and James A Anderson. Ethical limitations of algorithmic fairness solutions in health care machine learning. The Lancet Digital Health, 2(5):e221–e223, 2020. [73] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1– 35, 2021. [74] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. 
[75] Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118. PMLR, 2018. [76] Nir Morgulis, Alexander Kreines, Shachar Mendelowitz, and Yuval Weisglass. Fooling a 88 real car with adversarial traffic signs. arXiv preprint arXiv:1907.00374, 2019. [77] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020. [78] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019. [79] Vedant Nanda, Samuel Dooley, Sahil Singla, Soheil Feizi, and John P Dickerson. Fairness through robustness: Investigating robustness disparity in deep learning. arXiv preprint arXiv:2006.12621, 2020. [80] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. [81] William S Noble. What is a support vector machine? Nature biotechnology, 24(12):1565– 1567, 2006. [82] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016. [83] Ioannis Pastaltzidis, Nikolaos Dimitriou, Katherine Quezada-Tavarez, Stergios Aidinlis, Thomas Marquenie, Agata Gurzawska, and Dimitrios Tzovaras. Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2302–2314, 2022. [84] Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C Lupu. Detection of adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint arXiv:1802.03041, 2018. [85] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018. [86] Jie Ren, Han Xu, Pengfei He, Yingqian Cui, Shenglai Zeng, Jiankun Zhang, Hongzhi Wen, Jiayuan Ding, Hui Liu, Yi Chang, et al. Copyright protection in generative ai: A technical perspective. arXiv preprint arXiv:2402.02333, 2024. [87] Jie Ren, Han Xu, Yuxuan Wan, Xingjun Ma, Lichao Sun, and Jiliang Tang. Transferable unlearnable examples. arXiv preprint arXiv:2210.10114, 2022. [88] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International conference on machine learning, pages 4334–4343. 89 PMLR, 2018. [89] Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pages 8093–8104. PMLR, 2020. [90] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. Fr-train: A mutual informationbased approach to fair and robust training. In International Conference on Machine Learning, pages 8147–8157. PMLR, 2020. [91] Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-hon Lau, Satish Rao, Nina Taft, and J Doug Tygar. Antidote: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 1–14, 2009. [92] Matthew Sag. Copyright safety for generative ai. Forthcoming in the Houston Law Review, 2023. 
[93] Amartya Sanyal, Puneet K Dokania, Varun Kanade, and Philip HS Torr. How benign is benign overfitting? arXiv preprint arXiv:2007.04028, 2020. [94] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018. [95] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. [96] Candice Schumann, Jeffrey Foster, Nicholas Mattei, and John Dickerson. We need fairness and explainability in algorithmic hiring. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2020. [97] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018. [98] Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Advances in Neural Information Processing Systems, pages 3358–3369, 2019. [99] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 acm sigsac conference on computer and communications security, pages 1528–1540, 2016. 90 [100] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3520–3532, 2017. [101] Bob LT Sturm, Maria Iglesias, Oded Ben-Tal, Marius Miron, and Emilia Gómez. Artificial intelligence and music: open questions of copyright law and engineering praxis. In Arts, volume 8, page 115. MDPI, 2019. [102] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. [103] Lue Tao, Lei Feng, Jinfeng Yi, Sheng-Jun Huang, and Songcan Chen. Better safe than sorry: Preventing delusive adversaries with adversarial training. Advances in Neural Information Processing Systems, 34, 2021. [104] Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Jörn-Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In International Conference on Machine Learning, pages 9561–9571. PMLR, 2020. [105] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018. [106] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [107] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017. [108] Yuxuan Wan, Han Xu, Xiaorui Liu, Jie Ren, Wenqi Fan, and Jiliang Tang. Defense against gradient leakage attacks via learning to obscure data. arXiv preprint arXiv:2206.00769, 2022. [109] Jialu Wang, Yang Liu, and Caleb Levy. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 526–536, 2021. 
[110] Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, and Michael I Jordan. Robust optimization for fairness with noisy protected groups. arXiv preprint arXiv:2002.09343, 2020. [111] Wentao Wang, Han Xu, Xiaorui Liu, Yaxin Li, Bhavani Thuraisingham, and Jiliang Tang. Imbalanced adversarial training with reweighting. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1209–1214. IEEE, 2022. 91 [112] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019. [113] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029–7039, 2017. [114] Yunjuan Wang, Poorya Mianjy, and Raman Arora. Robust learning for data poisoning attacks. In International Conference on Machine Learning, pages 10859–10869. PMLR, 2021. [115] Susan C Weller and A Kimball Romney. Systematic data collection, volume 10. Sage publications, 1988. [116] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic press, 2011. [117] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286– 5295. PMLR, 2018. [118] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33, 2020. [119] Da Xu, Yuting Ye, and Chuanwei Ruan. Understanding the role of importance weighting for deep learning. arXiv preprint arXiv:2103.15209, 2021. [120] Han Xu, Pengfei He, Jie Ren, Yuxuan Wan, Zitao Liu, Hui Liu, and Jiliang Tang. Probabilistic categorical adversarial attack and adversarial training. In International Conference on Machine Learning, pages 38428–38442. PMLR, 2023. [121] Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In International Conference on Machine Learning, pages 11492–11501. PMLR, 2021. [122] Han Xu, Xiaorui Liu, Yaxin Li, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. arXiv preprint arXiv:2010.06121, 2020. [123] Han Xu, Xiaorui Liu, Yuxuan Wan, and Jiliang Tang. Towards fair classification against poisoning attacks. arXiv preprint arXiv:2210.09503, 2022. [124] Han Xu, Xiaorui Liu, Wentao Wang, Wenbiao Ding, Zhongqin Wu, Zitao Liu, Anil Jain, and Jiliang Tang. Towards the memorization effect of neural networks in adversarial training. arXiv preprint arXiv:2106.04794, 2021. [125] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Ji-Liang Tang, and Anil K Jain. 92 Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020. [126] Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil Jain. Adversarial attacks and defenses in images, graphs and text: A review. arXiv preprint arXiv:1909.08072, 2019. [127] Show-Jane Yen and Yue-Shi Lee. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3):5718–5727, 2009. [128] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. 
In Proceedings of the 26th international conference on world wide web, pages 1171–1180, 2017. [129] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013. [130] Shenglai Zeng, Yaxin Li, Jie Ren, Yiding Liu, Han Xu, Pengfei He, Yue Xing, Shuaiqiang Wang, Jiliang Tang, and Dawei Yin. Exploring memorization in fine-tuned language models. arXiv preprint arXiv:2310.06714, 2023. [131] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. The good and the bad: Exploring privacy issues in retrieval-augmented generation (rag). arXiv preprint arXiv:2402.16893, 2024. [132] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018. [133] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016. [134] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2(3), 2019. [135] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019. [136] Jingfeng Zhang, Xilie Xu, Bo Han, Gang Niu, Lizhen Cui, Masashi Sugiyama, and Mohan Kankanhalli. Attacks which do not kill training make adversarial learning stronger. In International Conference on Machine Learning, pages 11278–11287. PMLR, 2020. 93 [137] Jingfeng Zhang, Jianing Zhu, Gang Niu, Bo Han, Masashi Sugiyama, and Mohan Kankanhalli. Geometry-aware instance-reweighted adversarial training. arXiv preprint arXiv:2010.01736, 2020. [138] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5810–5818, 2017. 94 APPENDIX 1 Supplementary to Chapter 2 1.1 More Results of Preliminary Studies In this section, we show more preliminary results on the robust fairness phenomenon of adversarial training in various settings. In addition to the results shown in Section 2.3, we present the results in the settings with one more architecture (WRN28), one more type of adversarial attack (𝑙2- norm attack), one more defense method (Randomized Smoothing) and one more dataset (SVHN). From all these settings, we observe the similar phenomenon as in Section 2.3, which show that the fairness phenomenon can be generally happening in adversarial training under different scenarios and can be a common concern during its application. Furthermore, we also present the detailed results as in Table 2.2, to show the fact that adversarial training usually gives an unequal influence on different classes, which can be a reason that causes this fairness phenomenon. In detail, for each dataset under PreAct ResNet18 architecture, for each adversarial training algorithm (including PGD-adversarial training [71] and TRADES [138]), we train the models following that as suggested by the original papers. We train the models for 200 epochs with learning rate 0.1 and decay the learning rate at the epoch 100 and 150 by factor 0.1. 
During the evaluation phase, we report the trained model's classwise standard error rate and robust error rate. Unless explicitly mentioned otherwise, we study the models' robustness against 𝑙∞-norm adversarial attacks with budget 8/255, and we implement a 20-step PGD attack for robustness evaluation.
1.1.1 Robust Fairness in WRN28 in CIFAR10
Figure A.2 presents the robust fairness issues on the CIFAR10 dataset under WRN28 models. Note that in Section 2.3 we also presented the corresponding results under PreAct ResNet18 models in Figure 2.1. We observe a similar robust fairness phenomenon under both models. Moreover, as clear evidence of the unequal effect of adversarial training among different classes, in Table A.1 and Table A.2 we compare the classwise standard error and robust error between natural training and PGD adversarial training. From the experimental results, we conclude that adversarial training usually increases the error rate more for classes, such as "dog" and "cat", which originally have larger errors in natural training. Similarly, adversarial training also gives less help in reducing the robust errors of these classes.
1.1.2 Robust Fairness in 𝑙2-norm Adversarial Training
Figure A.3 presents the robust fairness issues of adversarial training methods that target 𝑙2-norm attacks on the CIFAR10 dataset, which further confirms the existence of robust unfairness in adversarial training methods. In Figure A.3, we present the classwise standard errors and robust errors against an 𝑙2-norm 0.5 adversarial attack. During the robustness evaluation, we implement the PGD attack with step size 0.1 for 20 steps.
1.1.3 Robust Fairness for Certified Defenses
Certified defenses are another main type of effective defense strategy. Even though certified defenses do not train in the same way as traditional adversarial training methods, which train the models on generated adversarial examples, they minimize the probability that adversarial examples exist near the input data. This process also implicitly minimizes the model's overall robust error. Thus, in this section we study whether such certified defenses also have robust fairness issues. As a representative, we implement Randomized Smoothing [24], which is one of the state-of-the-art methods to certifiably defend against 𝑙2-norm adversarial attacks. In this experiment, we run Randomized Smoothing against 𝑙2-norm 0.5 attacks on the CIFAR10 dataset and report its class-wise certified standard error and certified robust error under different intensities. The results also suggest that the Randomized Smoothing certified defense presents a similar disparity phenomenon as the traditional adversarial training methods. Moreover, it preserves a similar classwise performance relationship, i.e., it has both high standard and robust error on the classes "cat" and "dog", but relatively low errors on "car" and "ship".
Figure A.1: Randomized Smoothing on CIFAR10.
1.1.4 Robust Fairness on SVHN Dataset
Figure A.4 presents the robust fairness issues of adversarial training methods on the SVHN dataset under the PreAct ResNet18 model. From the experimental results, we also observe a strong disparity of classwise standard errors and robust errors, which does not exist in natural training. In particular, the classes "3" and "8" have the largest standard error in a naturally trained model.
After adversarial training, these two classes also have the largest standard error increases among all classes, as well as the least robust error decreases. As a result, there is also a significant disparity of the standard / robustness performance among the classes. The full results are shown in Table A.3. (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 1) (d) TRADES (1/𝜆 = 6) Figure A.2: The class-wise performance of natural / adversarial training (target on 𝑙∞-norm 8/255 attack) on CIFAR10 under WRN28. 1.2 Theoretical Proof of Section 2.4 In this section, we formally calculate the classwise standard & robust errors in an optimal linear classifier and an optimal linear robust classifier. Then, we present our main conclusion that robust optimization will unequally influence the performance of the two classes and therefore result in a severe performance disparity. 97 (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 6) (d) TRADES (1/𝜆 = 6) Figure A.3: The class-wise performance of natural / adversarial training (target on 𝑙2-norm 0.5 attack) on CIFAR10 under PreActResNet18. (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 1) (d) TRADES (1/𝜆 = 6) Figure A.4: The class-wise performance of natural / adversarial training (target on 𝑙∞-norm 8/255 attack) on SVHN under PreActResNet18. Table A.1: The Changes of Standard & Robust Error in Natural & Adversarial Training in CIFAR10 on PreAct ResNet18. Std. Error plane car bird cat deer dog frog horse ship truck Natural Training 4.0 1.8 6.4 11.3 6.1 10.0 5.1 5.2 3.5 4.3 PGD Adv. Training 11.7 6.1 23.3 34.8 20.8 26.9 12.6 9.8 6.4 9.5 Diff. (Adv. - Nat.) 7.7 4.3 16.9 23.5 14.6 16.9 7.5 4.6 2.9 5.2 Rob. Error plane car bird cat deer dog frog horse ship truck Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. Training 44.9 34.3 68.4 82.7 74.7 66.4 51.5 47.0 40.8 42.3 Diff. (Adv. - Nat.) -55.1 -65.7 -31.6 -17.3 -25.3 -33.5 -48.5 -53.0 -59.2 -57.7 1.2.1 Proof of Theorem 2 In this subsection, we study an optimal linear classifier which minimizes the average standard error. By calculating its standard errors, we can get the conclusion that the class “+1” in distribution D is indeed harder than class “-1”. We first start from a lemma to calculate the weight vector of an optimal linear model. Lemma 6 (Weight Vector of an Optimal Classifier) For the data following the distributionDde- 98 Table A.2: The Changes of Standard & Robust Error in Natural & Adversarial Training in CIFAR10 on WRN28. Std. Error plane car bird cat deer dog frog horse ship truck Natural Training 3.9 1.9 6.4 10.0 6.2 9.0 5.0 4.6 3.3 4.2 PGD Adv. Training 10.0 4.9 21.4 24.6 17.4 26.2 12.4 9.4 6.3 9.2 Diff. (Adv. - Nat.) 6.1 3.0 15.0 14.6 11.2 17.2 7.4 4.8 3.0 5.0 Rob. Error plane car bird cat deer dog frog horse ship truck Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. Training 41.4 29.3 65.8 77.2 75.5 61.6 60.9 40.7 40.8 39.4 Diff. (Adv. - Nat.) -58.6 -70.7 -34.2 -22.8 -24.5 -38.4 -39.1 -59.3 -59.2 -60.6 Table A.3: The Changes of Standard & Robust Error in Natural & Adversarial Training in SVHN in PreAct ResNet18. Std. Error “0” “1” “2” “3” “4” “5” “6” “7” “8” “9” Natural Training 3.6 2.7 3.2 5.8 4.0 5.1 3.6 4.6 6.1 5.3 PGD Adv. Training 8.8 6.2 7.9 15.8 6.9 13.2 13.3 13.4 19.8 11.4 Diff. (Adv. - Nat.) 5.2 3.5 4.8 10.1 4.9 8.1 9.6 8.9 13.6 6.4 Std. Error “0” “1” “2” “3” “4” “5” “6” “7” “8” “9” Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. 
Training 47.2 38.8 49.0 63.4 41.9 57.1 62.6 55.2 73.7 57.1 Diff. (Adv. - Nat.) -52.8 -61.2 -51.1 -36.6 -58.1 -42.9 -37.4 -44.8 -26.3 -42.9 fined in Eq. 2.2, an optimal linear classifier 𝑓nat which minimizes the average standard classification error: 𝑓nat (𝑥) = sign(⟨𝑤nat, 𝑥⟩ + 𝑏nat) where 𝑤nat, 𝑏nat = arg min 𝑤,𝑏 Pr.(sign(⟨𝑤, 𝑥⟩ + 𝑏) ≠ 𝑦) has the optimal weight that satisfy: 𝑤nat = 1. Proof 2 (Proof of Lemma 8) In the proof we will use 𝑤 = 𝑤nat and 𝑏 = 𝑏nat for simplicity. Next, we will prove 𝑤1 = 𝑤2 = · · · = 𝑤𝑑 by contradiction. We define 𝐺 = {1, 2, . . . , 𝑑} and make the following assumption: for the optimal 𝑤 and 𝑏, we assume if there exist 𝑤𝑖 < 𝑤𝑗 for 𝑖 ≠ 𝑗 and 𝑖, 𝑗 ∈ 𝐺. Then we obtain the following standard errors for two classes of this classifier with weight 99 𝑤: Rnat ( 𝑓 ; −1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(−𝜂, 𝜎2− 1 ) + 𝑏 + 𝑤𝑖N(−𝜂, 𝜎2− 1 ) + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) > 0} Rnat ( 𝑓 ; +1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(+𝜂, 𝜎2 +1 ) + 𝑏 + 𝑤𝑖N(+𝜂, 𝜎2 +1 ) + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) < 0} (A.1) However, if we define a new classier 𝑓˜ whose weight 𝑤˜ uses 𝑤𝑗 to replace 𝑤𝑖 , we obtain the errors for the new classifier: Rnat ( ˜ 𝑓 ; −1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(−𝜂, 𝜎2− 1 ) + 𝑏 + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) > 0} Rnat ( ˜ 𝑓 ; +1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(+𝜂, 𝜎2 +1 ) + 𝑏 + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) < 0}. (A.2) By comparing the errors in Eq A.13 and Eq A.14, it can imply the classifier ˜ 𝑓 has smaller error in each class. Therefore, it contradicts with the assumption that 𝑓 is the optimal classifier with least error. Thus, we conclude for an optimal linear classifier in natural training, it must satisfies 𝑤1 = 𝑤2 = · · · = 𝑤𝑑 and 𝑤 = 1. Given the results in Lemma 1, we can calculate the errors of classifiers by only calculating the interception term 𝑏nat and 𝑏rob. Recall Theorem 1, we calculate the classwise errors of an optimal classifier with minimal average standard error. Theorem 1 For a data distribution D in Eq. 2.2, for the optimal linear classifier 𝑓nat which minimizes the average standard classification error, it has the intra-class standard error for the two classes: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} 100 where 𝐴 = 2 𝐾2−1 √ 𝑑𝜂 𝜎 and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾, As a result, the class “+1” has a larger standard error: Rnat ( 𝑓nat, −1) < Rnat ( 𝑓nat, +1). Proof 3 (Proof of Theorem 1) From the results in Lemma 1, we define our optimal linear classifier to be 𝑓nat (𝑥) = sign( Í𝑑 𝑖=1 𝑥𝑖 + 𝑏nat). Now, we calculate the optimal 𝑏nat which can minimize the average standard error: 𝑅nat ( 𝑓 ) = Pr.{ 𝑓 (𝑥) ≠ 𝑦} ∝ Pr{ 𝑓 (𝑥) = 1|𝑦 = −1} + Pr{ 𝑓 (𝑥) = −1|𝑦 = 1} = Pr.{ Õ𝑑 𝑖=1 𝑥𝑖 + 𝑏nat > 0|𝑦 = −1} + Pr{ Õ𝑑 𝑖=1 𝑥𝑖 + 𝑏nat < 0|𝑦 = +1} = Pr.{N(0, 1) < − √ 𝑑𝜂 𝜎 + 1 √ 𝑑𝜎 · 𝑏nat} + Pr.{N(0, 1) < − √ 𝑑𝜂 𝐾𝜎 − 1 𝐾 √ 𝑑𝜎 · 𝑏nat} (A.3) The optimal 𝑏nat to minimize Rnat ( 𝑓 ) is achieved at the point that 𝜕Rnat ( 𝑓 ) 𝜕𝑏nat = 0. Thus, we find the optimal 𝑏nat: 𝑏nat = 𝐾2 + 1 𝐾2 − 1 · 𝑑𝜂 − 𝐾 s 4𝑑2𝜂2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 (A.4) and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾. By incorporating the optimal 𝑏nat into Eq. A.6, we can get the classwise standard errors for the two classes: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} where 𝐴 = 2 𝐾2−1 √ 𝑑𝜂 𝜎 . Since 𝑞(𝐾) > 0, we have the direct conclusion that Rnat ( 𝑓 ; −1) < Rnat ( 𝑓 ; +1). 
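For intuition, the short script below numerically evaluates the two classwise standard errors in Theorem 1 for one illustrative choice of (𝑑, 𝜂, 𝜎, 𝐾); these parameter values are assumptions made only for this example. It computes the errors both from the optimal intercept 𝑏nat in Eq. A.4 and from the closed form above, and confirms that class "+1" has the larger standard error.

```python
# Numerical illustration of Theorem 1 (a sketch with assumed parameter values,
# not part of the original experiments): class "+1" is the higher-variance class
# (sigma_{+1} = K * sigma_{-1}) and the class means are +/- eta in each of d dims.
import numpy as np
from scipy.stats import norm

d, eta, sigma, K = 10, 0.5, 1.0, 2.0          # illustrative choices only
q = 2 * np.log(K) / (K**2 - 1)                # q(K) > 0
A = 2 * np.sqrt(d) * eta / ((K**2 - 1) * sigma)

# Optimal intercept of the natural classifier f_nat(x) = sign(sum_i x_i + b), Eq. A.4.
b_nat = (K**2 + 1) / (K**2 - 1) * d * eta \
        - K * np.sqrt(4 * d**2 * eta**2 / (K**2 - 1)**2 + q * d * sigma**2)

# Classwise standard errors computed directly from b_nat (as in Eq. A.3) ...
err_neg_direct = norm.cdf((b_nat - d * eta) / (np.sqrt(d) * sigma))
err_pos_direct = norm.cdf(-(d * eta + b_nat) / (K * np.sqrt(d) * sigma))

# ... and from the closed form stated in Theorem 1.
err_neg = norm.cdf(A - K * np.sqrt(A**2 + q))
err_pos = norm.cdf(-K * A + np.sqrt(A**2 + q))

print(err_neg_direct, err_neg)   # the two expressions agree
print(err_pos_direct, err_pos)
print(err_pos > err_neg)         # class "+1" is the harder class: True
```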
101 1.2.2 Proof of Theorem 2 In this subsection, we focus on calculating the errors of robust classifiers which minimize the average robust errors of the model. By comparing natural classifiers and robust classifiers, we get the conclusion that robust classifiers can further exacerbate the model’s performance on the “harder” class. Similar to Section A.2.1, we start from a Lemma to show an optimal robust classifier 𝑓rob has a weight vector 𝑤rob = 1. Lemma 8 (Weight of an Optimal Robust Classifier) For the data following the distribution D defined in Eq. 2.2, an optimal linear classifier 𝑓nat which minimizes the average standard classification error: 𝑓rob (𝑥) = sign(⟨𝑤rob, 𝑥⟩ + 𝑏rob) where 𝑤rob, 𝑏rob = arg min 𝑤,𝑏 Pr.(∃𝛿, ||𝛿|| ≤ 𝜖, s.t. sign(⟨𝑤, 𝑥 + 𝛿⟩ + 𝑏) ≠ 𝑦). has the optimal weight which satisfy: 𝑤rob = 1. We leave the detailed proof out in the paper because it can be proved in the similar way as the proof of Lemma 1. Recall Theorem 2, we formally calculate the standard errors of an optimal robust linear classifier. Theorem 2 For a data distribution D in Eq. 2.2, the optimal robust linear classifier 𝑓rob which minimizes the average robust error with perturbation margin 𝜖 < 𝜂, it has the intra-class standard error for the two classes: Rnat ( 𝑓rob, −1) = Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 } (A.5) where 𝐵 = 2 𝐾2−1 √ 𝑑(𝜂−𝜖) 𝜎 and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 is a positive constant and only depends on 𝐾, Proof 4 (Proof of Theorem 2) From the results in Lemma 2, we define our optimal linear robust classifier to be 𝑓rob (𝑥) = sign( Í𝑑 𝑖=1 𝑥𝑖 + 𝑏rob). Now, we calculate the optimal 𝑏rob which can mini- 102 Table A.4: Average & worst-class standard error, boundary error and robust error for various algorithms on CIFAR10 under WRN28. Avg. Std. Worst Std. Avg. Bndy. Worst Bndy. Avg. Rob. Worst Rob. PGD Adv. Training 14.0 29.3 38.1 53.0 52.2 78.8 TRADES(1/𝜆 = 1) 12.6 25.2 40.2 58.7 52.8 76.7 TRADES(1/𝜆 = 6) 15.5 29.1 31.8 45.7 47.3 71.8 Baseline Reweight 14.2 26.3 38.6 53.7 52.8 77.9 FRL(Reweight, 0.05) 14.5 23.2 40.0 53.3 54.4 76.8 FRL(Remargin, 0.05) 15.4 24.9 38.1 49.6 53.5 70.5 FRL(Reweight+Remargin, 0.05) 15.4 25.0 37.8 46.7 53.2 67.1 FRL(Reweight, 0.07) 14.1 23.8 39.5 54.1 53.6 77.0 FRL(Remargin, 0.07) 14.8 24.3 39.5 50.6 54.3 73.0 FRL(Reweight+Remargin, 0.07) 14.9 24.7 37.8 48.3 52.7 68.2 Table A.5: Average & worst-class standard error, boundary error and robust error for various algorithms on SVHN under WRN28. Avg. Std. Worst Std. Avg. Bndy. Worst Bndy. Avg. Rob. Worst Rob. PGD Adv. Training 8.1 16.8 38.5 57.3 46.7 71.2 TRADES(1/𝜆 = 1) 8.0 19.6 40.1 60.0 48.1 73.3 TRADES(1/𝜆 = 6) 10.6 23.1 32.1 52.5 42.7 70.6 Baseline Reweight 8.5 16.2 40.3 57.8 48.8 71.1 FRL(Reweight, 0.05) 7.8 13.4 38.9 56.9 46.7 70.7 FRL(Remargin, 0.05) 8.4 13.4 40.8 52.1 49.2 65.5 FRL(Reweight+Remargin, 0.05) 8.4 13.2 38.4 52.1 46.8 63.1 FRL(Reweight, 0.07) 8.2 13.5 41.2 56.3 49.4 69.8 FRL(Remargin, 0.07) 8.6 14.9 38.8 51.2 47.4 67.0 FRL(Reweight+Remargin, 0.07) 8.2 13.9 39.9 50.2 48.1 65.4 mize the average robust error: Rrob ( 𝑓 ) =Pr.(∃||𝛿|| ≤ 𝜖 s.t. 𝑓 (𝑥 + 𝛿) ≠ 𝑦) = max ||𝛿||≤𝜖 Pr.( 𝑓 (𝑥 + 𝛿) ≠ 𝑦) = 1 2Pr.( 𝑓 (𝑥 + 𝜖) ≠ −1|𝑦 = −1) + 1 2Pr.( 𝑓 (𝑥 − 𝜖) ≠ +1|𝑦 = +1) =Pr.{ Õ𝑑 𝑖=1 (𝑥𝑖 + 𝜖) + 𝑏rob > 0|𝑦 = −1} + Pr{ Õ𝑑 𝑖=1 (𝑥𝑖 − 𝜖) + 𝑏rob < 0|𝑦 = +1} =Pr.{N(0, 1) < − √ 𝑑(𝜂 − 𝜖) 𝜎 + 1 √ 𝑑𝜎 · 𝑏rob} + Pr.{N(0, 1) < − √ 𝑑(𝜂 − 𝜖) 𝐾𝜎 − 1 𝐾 √ 𝑑𝜎 · 𝑏rob} (A.6) 103 The optimal 𝑏rob to minimize Rrob ( 𝑓 ) is achieved at the point that 𝜕Rrob ( 𝑓 ) 𝜕𝑏rob = 0. 
Thus, we find the optimal 𝑏rob: 𝑏rob = 𝐾2 + 1 𝐾2 − 1 · 𝑑(𝜂 − 𝜖) − 𝐾 s 4𝑑2 (𝜂 − 𝜖)2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 (A.7) and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾. By incorporating the optimal 𝑏nat into Eq. A.6, we can get the classwise robust errors for the two classes: Rrob ( 𝑓rob, −1) = Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾)} Rrob ( 𝑓rob, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾)} where 𝐵 = 2 𝐾2−1 √ 𝑑(𝜂−𝜖) 𝜎 . As a direct result, the classwise standard errors for the two classes: Rnat ( 𝑓rob, −1) =Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) =Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 }. 1.2.3 Proof of Corollary 1 Giving the results in Theorem 1 and Theorem 2, we will show that a robust classifier will exacerbate the performance of the class “+1” which originally has higher error in a naturally trained model. In this way, we can get the conclusion that robust classifiers can cause strong disparity, because it exacerbates the “difficulty” difference among classes. Corollary 1 Adversarially Trained Models on D will increase the standard error for class “+1” and reduce the standard error for class “-1”: Rnat ( 𝑓rob, −1) < Rnat ( 𝑓nat, −1). Rnat ( 𝑓rob, +1) > Rnat ( 𝑓nat, +1). Proof 5 (Proof of Corollary 1) From the intermediate results in Eq.A.4 and Eq. A.7 in the proofs of Theorem 1 and Theorem 2, we find the only difference between a naturally trained model 𝑓nat 104 and a robust model 𝑓rob is about the interception term 𝑏nat and 𝑏rob. Specifically, we denote 𝑔(·) is the function of the interception term, then we have the results: 𝑏nat = 𝐾2 + 1 𝐾2 − 1 · 𝑑𝜂 − 𝐾 s 4𝑑2𝜂2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 := 𝑔(𝜂) 𝑏rob = 𝐾2 + 1 𝐾2 − 1 · 𝑑(𝜂 − 𝜖) − 𝐾 s 4𝑑2(𝜂 − 𝜖)2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 := 𝑔(𝜂 − 𝜖) Next, we show that the function 𝑔(·) is a monotone increasing function between 0 and 𝜂: 𝑑𝑔(𝜂) 𝑑𝜂 ≥ 𝐾2 + 1 𝐾2 − 1𝑑 − 𝐾 4 (𝐾2−1)2 𝑑2 · 2𝜂 2 q 4 (𝐾2−1)2 𝑑2𝜂2 = 𝐾 − 1 𝐾 + 1 𝑑 > 0 As a direct results, we have the interception terms: 𝑏nat > 𝑏rob. This will result a linear classifier 𝑓 (𝑥) = sign(⟨1𝑇 , 𝑥⟩ + 𝑏) present more samples in the overall distribution to be class “-1”. As a result, we have the conclusion: Rnat ( 𝑓rob, −1) < Rnat ( 𝑓nat, −1). Rnat ( 𝑓rob, +1) > Rnat ( 𝑓nat, +1). 1.3 Robust / Non-Robust Features In Section 2.4.1, we discussed a theoretical example where adversarial training will unequally treat the two classes in the distribution D, which will increase the standard error of one class and decrease the error for the other one. However, in the real applications of deep learning models, we always observe that each class’s error will increase after adversarial training. In this subsection, motivated by the work [105, 47], we extend the definition ofD, to split the features into two categories: robust features and non-robust features, where adversarial training will increase the standard errors for both classes. Formally, the data distribution D′ is defined as as following: 𝑦 𝑢.∼𝑎.𝑟 {−1, +1}, 𝜃 = ( dim = d z }| { 𝜂, ..., 𝜂, dim = m z }| { 𝛾, ..., 𝛾), 𝑥 ∼ 8>>>> < >>>>: N (𝜃, 𝜎2 +1𝐼) if 𝑦 = +1 N (−𝜃, 𝜎2− 1𝐼) if 𝑦 = −1 (A.8) 105 where in the center vector 𝜃, it includes robust features with scale 𝜂 > 𝜖, and non-robust features with scale 𝛾 < 𝜖. Here we specify that non-robust feature space has much higher dimension than robust feature space (𝑚 >> 𝑑) and there is a 𝐾-factor difference between the variance of two classes: 𝜎+1 = 𝐾 · 𝜎−1. From the main results in the work [105], it is easy to get that each class’s standard error will increase after adversarial training. 
In the following theory, we will show that in distribution D′ , adversarial training will increase the error for the class “+1” by a larger rate than the class “-1”. Theorem 3. For a data distribution D′ in Eq. A.8, the robust optimizer 𝑓rob increases the standard error of class “+1” by a larger rate than the increase of the standard error of class “-1’: Rnat ( 𝑓rob; +1) − Rnat ( 𝑓nat; +1) > Rnat ( 𝑓rob; −1) − Rnat ( 𝑓nat; −1) (A.9) Proof 6 (Proof of Theorem 3) The proof of Theorem 3 resembles the process of the proofs of Theorem 1 and Theorem 2, where we first calculate the classwise standard errors for each model. Note that from the work [105], an important conclusion is that a natural model 𝑓nat uses both robust and non-robust features for prediction. While, a robust model 𝑓rob only uses robust features for prediction (a detailed proof can be found in Section 2.1 in [105]). Therefore, we can calculate the classwise standard errors for both classes of a natural model 𝑓nat: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} where 𝐴 = 2 𝜎(𝐾2−1) p 𝑚𝛾2 + 𝑑𝜂2. The classwise standard errors of a robust model 𝑓rob are: Rnat ( 𝑓rob, −1) =Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) =Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 } 106 where 𝐵 = 2 𝜎(𝐾2−1) p 𝑑(𝜂 − 𝜖)2. Next, we compare the standard error increase after adversarial training between the two classes. We have the results: (Rnat ( 𝑓rob; +1) − Rnat ( 𝑓nat; +1))− (Rnat ( 𝑓rob; −1) − Rnat ( 𝑓nat; −1)) > (𝐾 + 1) ( (𝐴 − 𝐵) + ( q 𝐵2 + 𝑞(𝐾) − q 𝐴2 + 𝑞(𝐾))) ∝ ( q 𝐵2 + 𝑞(𝐾) − 𝐵) − ( q 𝐴2 + 𝑞(𝐾) − 𝐴) because 𝐴 includes high dimensional non-robust feature and 𝐴 >> 𝐵, the equation above is positive and we get the conclusion as in Eq. A.9. 1.4 Fairness Performance on WRN28 Table A.4 and Table A.5 presents the empirical results to validate the effectiveness of FRL algorithms under the WRN 28 model. The implementation details resemble those in Section 2.6.1. In the training, we start FRL from a pre-trained robust model (such as PGD-adversarial training), and run FRL with model parameter learning rate 1e-3 and hyperparameter learning rate 𝛼1 = 𝛼2 = 0.05 in the first 40 epochs. Then we decay the model parameter learning rate and the hyperparameter learning rate by 0.1 every 40 epochs. From the results, we have the similar observations as these for PreAct ResNet18 models, which is that FRL can help to improve the worst-class standard performance and robustness performance, such that the unfairness issue is mitigated. In particular, FRL (Reweight) is usually the most effective way to equalize the standard performance, but not sufficient to equalize the boundary errors and robust errors. FRL (Reweight + Remargin) is usually the most effective way to improve robustness for the worst class. 2 Supplementary to Chapter 3 2.1 Additional Introduction of Atypical Samples In this section, we provide additional introductions about the memorization effect and atypical samples in image classification tasks. We first introduce several strategies to identify atypical samples in common datasets. Then, we show several examples to illustrate the distribution & frequency in common datasets. 107 2.1.1 How to Estimate Whether One Sample is Atypical? Leave-out Method The work [40] proposes to examine which training samples can only be fitted by memorization, and measure each training sample’s “memorization value”. 
Formally, for a training algorithmA(i.e., ERM), the memorization value “mem(A,D, 𝑥𝑖)” of a training sample (𝑥𝑖 , 𝑦𝑖) ∈ D in training set D is defined as: mem(A,D, 𝑥𝑖) = Pr. 𝐹←A(D) (𝐹(𝑥𝑖) = 𝑦𝑖) − Pr. 𝐹←A(D\𝑥𝑖 ) (𝐹(𝑥𝑖) = 𝑦𝑖), which calculates the difference between the model 𝐹’s accuracy on 𝑥𝑖 with and without 𝑥𝑖 removed from the training set D of algorithm A. Note that for each sample 𝑥𝑖 , if its memorization value is high, it means that removing 𝑥𝑖 from training data will cause the model with a high possibility to wrongly classify itself, so 𝑥𝑖 is very likely to be fitted only by memorization and be atypical. Although promising, the computational efficiency of Leave-out method is potential to be low. For example, the Leave-Out method requires repeatedly training a DNN model for 1,000 trials to accurately estimate the memorization values. Ensemble Disagreement The work [14] provides an alternative way to efficiently figure out the examples which are poorly represented by other samples from the training dataset (atypical samples). They assume the samples which are well-represented in the distribution should be relatively easy for many types of DNN models to learn. Concretely, the ensemble agreement trains a series of models F = {𝐹𝑗 }𝑚𝑗 =1 with different architectures and initializations. Then, for each sample 𝑥𝑖 , it computes the disagreement of all models: ed(𝑥𝑖) = 1 𝑚2 Õ𝑚 𝑗=1 Õ𝑚 𝑝=1 JS-Divergence(𝐹𝑗 (𝑥𝑖), 𝐹𝑝 (𝑥𝑖)) Intuitively, if ed(𝑥𝑖) is large, different models will give different predictions to the sample 𝑥𝑖 . It suggest that the sample 𝑥𝑖 cannot be easily learned by most of models in F. Therefore, the sample 𝑥𝑖 is likely to be atypical. In practice, [15] trains multiple DNN models for 10 random initializations, or get 10 pretrained from public resources, which significantly improve the efficiency to find atypical samples. 108 Confidence-Based Method Similar to the ensemble disagreement method, the work [14] assumes that the models in F = {𝐹𝑗 }𝑚𝑗 =1 should be more confident on examples that are well-represented by the whole training dataset. Thus, the propose to measure the average confidence value of these models on sample 𝑥𝑖 : conf(𝑥𝑖) = 1 𝑚 Õ𝑚 𝑗=1 max(𝐹𝑗 (𝑥𝑖) [𝑦𝑖])) The smaller the confidence value of a sample 𝑥𝑖 , it is more likely to be atypical. Remarkably, the work [14] empirically demonstrates that the three metrics decribed above have good alignment for datasets such as MNIST, CIFAR10, CIFAR100 [56] and ImageNet [57]. Therefore, for practical application of our proposed method BAT, we can also substitute the leave-out method to ensemble disagreement and confidence-based methods, in order to to enhance the efficiency of BAT. 2.1.2 Distribution of Atypical Samples In Fig. A.5, we provide several examples of images from CIFAR10, CIFAR100 [56] and Tiny ImageNet [60] respectively, with different memorization value (as defined in Section 3) around 0.0, 0.5, 1.0. These examples suggest that if the memorization value of an image is large, this image is very likely to be “atypical”, as it presents very distinct semantic features with the images in the main distribution (with memorization value 0.0). 109 0.0 0.5 1.0 1.0 0.5 0.0 “Dog” in CIFAR10 “Plate” in CIFAR100 “Mountain Lion” in Tiny ImageNet 0.0 0.5 1.0 Figure A.5: Examples of Images with Different Memorization Values. In Fig. 3.5, we provide histograms to show the distribution of the estimated memorization values of all training samples from CIFAR10, CIFAR100 and Tiny ImageNet. From Fig. 
3.5, we can observe that atypical samples (with high memorization value > 0.15) consist of a significant fraction (over 40% & 50% respectively) in CIFAR100 and Tiny ImageNet. In CIFAR10, they also consist of a non-ignorable fraction which is over 10%. 2.2 Additional Results for Preliminary Study In this section, we provide the full results of the preliminary study in Section 3.4 on CIFAR10, CIFAR100 and Tiny ImageNet, to illustrate the distinct behaviors of the memorization effect between traditional ERMs and adversarial training. In both ERM and adversarial training, we train the models under ResNet18 and WideResNet28-10 (WRN28) architectures. In the experiments, we train the models for 200 epochs with learning rate 0.1, momentum 0.9, weight decay 5e-4, and decay the learning rate by 0.1 at the epoch 150 and 200. For adversarial training, we conduct experiments using PGD adversarial training [71] by default to defense against 𝑙∞-8/255 adversarial attack, with 110 the exception on Tiny ImageNet, which is against 𝑙∞-4/255 attack. For robustness evaluation, we conduct a 20-step PGD attack. 2.2.1 Additional Results for Preliminary Study - Section 3.4.1 In this subsection, we provide more experimental results to validate the statement in Section 3.4.1, where we state that fitting atypical samples in adversarial training can only improve the clean accuracy of test atypical samples, but hardly help their adversarial robustness. We provide full empirical results to show that: (i) In traditional ERM, fitting atypical samples improves the clean accuracy of test atypical samples. (ii) In adversarial training, fitting (adversarial) atypical samples improves the clean accuracy of test atypical samples but can hardly improve the adversarial robustness of them. The experimental setting follows Section 3.4.1, where we apply traditional ERM and adversarial training on original CIFAR10, CIFAR100, Tiny ImageNet datasets. We evaluate the model’s clean accuracy and adversarial accuracy on training atypical set Datyp = {𝑥𝑖 ∈ D : mem(𝑥𝑖) > 0.15} and its corresponding test atypical set D′ atyp = {𝑥′ 𝑗 ∈ D′ : infl(𝑥𝑖 , 𝑥′ 𝑗 ) > 0.15, for ∀𝑥𝑖 ∈ Datyp}. (i) Additional Results for Preliminary Study - Section 3.4.1 In Traditional ERM Fig. A.6, Fig. A.7 and Fig. A.8 report the performance (clean accuracy) of traditional ERM, which is evaluated on atypical sets under ResNet18 (left) and WRN28 (right) on CIFAR100, CIFAR10 and Tiny ImageNet. From the figures, we can obverse that fitting atypical samples during training can effectively help the models to achieve good clean accuracy on test atypical samples in all datasets. Note that here we only report clean accuracy as they are not robust against adversarial attacks. (ii) Additional Results for Preliminary Study - Section 3.4.1 In Adversarial Training Fig. A.9, Fig. A.10 and Fig. A.11 report the performance of adversarially trained models. We evaluate the clean accuracy and adversarial accuracy on the training atypical set Datyp and test atypical set D′ atyp. From the results, we can observe that although fitting atypical samples can help the model to have modest clean accuracy on test atypical samples, the adversarial robustness of them is constantly low and can hardly be improved during the whole training process. (i) Additional Results for Preliminary Study - Section 3.4.2 In Traditional ERM 111 Figure A.6: Clean Accuracy on Atypical Set of CIFAR100 on ResNet18 and WRN28. Figure A.7: Clean Accuracy on Atypical Set of CIFAR10 on ResNet18 and WRN28. 
Figure A.8: Clean Accuracy on Atypical Set of Tiny ImageNet on ResNet32 and WRN28.
Fig. A.12, Fig. A.13 and Fig. A.14 report the performance of traditional ERM trained on resampled datasets containing different amounts of atypical samples. The figures report the clean accuracy on the test typical sets of CIFAR100, CIFAR10 and Tiny ImageNet. We again leave the robustness performance out here, as the models are not robust to adversarial attacks. From the results, we can conclude that in traditional ERM, fitting atypical samples does not degrade the models' performance on typical samples. For example, on the CIFAR100 dataset, with 100% of atypical samples included (Atypical-100%), the accuracy on the test typical set is even slightly higher than that of the model trained without atypical samples (Atypical-0%). This conclusion is consistent across all three datasets and both model architectures.
Figure A.9: Clean Accuracy and Adversarial Accuracy on Atypical Set of CIFAR100 on ResNet18 and WRN28.
Figure A.10: Clean Accuracy and Adversarial Accuracy on Atypical Set of CIFAR10 on ResNet18 and WRN28.
Figure A.11: Clean Accuracy and Adversarial Accuracy on Atypical Set of Tiny ImageNet on ResNet18 and WRN28.
(ii) Additional Results for Preliminary Study - Section 3.4.2 In Adversarial Training
Fig. A.15, Fig. A.16 and Fig. A.17 report the performance of adversarial training on resampled datasets containing different amounts of atypical samples. The figures report both the clean and adversarial accuracy on the test typical sets of CIFAR100, CIFAR10 and Tiny ImageNet. Based on the experimental results, we find that including more atypical samples causes the model to have worse performance on typical samples in all three datasets. In datasets with a large portion of atypical samples, such as CIFAR100, the negative effects of atypical samples are more obvious: training on the dataset with 100% of atypical samples (Atypical-100%) causes the clean and adversarial accuracy to drop by ∼7% and ∼8%, respectively.
Figure A.12: Clean Accuracy on Typical Set of CIFAR100 on ResNet18 and WRN28.
Figure A.13: Clean Accuracy on Typical Set of CIFAR10 on ResNet18 and WRN28.
Figure A.14: Clean Accuracy on Typical Set of Tiny ImageNet on ResNet18 and WRN28.
2.2.2 An Ablation Study for "Harmful Atypical Samples"
In this subsection, we conduct additional experiments to verify the claims about "harmful atypical samples". In particular, we aim to demonstrate that atypical images with high "harmful scores" are likely to contain features which visually resemble images from a different (wrong) class, so fitting them is likely to degrade the model's performance. To validate this claim, we print out 200 atypical images from the CIFAR100 training set with the highest / lowest harmful scores as defined in Eq. (6) of the paper. Then, we ask two independent human annotators to label each image by choosing one of four options: (a) this image belongs to class 𝑦1; (b) this image belongs to class 𝑦2; (c) both class 𝑦1 and class 𝑦2 are likely; (d) neither class 𝑦1 nor class 𝑦2. Here, 𝑦1 is the ground-truth label of the sample; 𝑦2 is the class other than 𝑦1 on which the model has the maximal confidence, i.e., 𝑦2 = arg max𝑡≠𝑦1 𝐹𝑡 (𝑥), where the model 𝐹 is obtained via PGD adversarial training with all atypical samples removed from the training set. We report the percentage of each answer given by the human annotators.
Table A.6: Distribution of harmful scores.
Class 𝑦1 Class 𝑦2 Both Likely Neither. Total High Harmful Score 59.0 12.0 26.0 4.0 100.0 Low Harmful Score 96.0 0.0 1.5 2.5 100.0 From the results, we can see, for the samples with high Harmful scores, people are more likely to believe that ”the image is from 𝑦2” or ”the image is both likely from 𝑦1 or 𝑦2”. It is because the highly-harmful images can persist a lot of similar features from these two classes. It also worth to mention that for samples which people labeled them to 𝑦1, they usually also have features belonging to the class 𝑦2. For example, a ”bowl” which is round and with orange color. Although people believe it is a bowl confidently, it does have features such as ”round” and ”orange color”, which are also features of class ”orange”. 2.3 Proof of Theories In Section 3.4.2, we argue that the poisoning effect during adversarial training can be significantly stronger than traditional ERM. In this section, we provide detailed discussions about this 115 Figure A.15: Clean Accuracy and Adversarial Accuracy on Typical Set of CIFAR100 on ResNet18 and WRN28. Figure A.16: Clean Accuracy and Adversarial Accuracy on Typical Set of CIFAR10 on ResNet18 and WRN28. Figure A.17: Clean Accuracy and Adversarial Accuracy on Typical Set of Tiny ImageNet on ResNet18 and WRN28. claim as well as complete proofs. Our analysis is based on a binary classification task with a twodimensional Gaussian Mixture distribution: 𝑦 𝑢.∼𝑎.𝑟 {−1, +1}, 𝜃 = (𝜂, 0), 𝜂 ∈ R+, 𝑥 ∼ 8>>>> < >>>>: N (𝜃, 𝜎2𝐼) if 𝑦 = +1 N (−𝜃, 𝜎2𝐼) if 𝑦 = −1, (A.10) and we consider linear models 𝐹 = sign(𝑤𝑇 𝑥 + 𝑏), where 𝑤 and 𝑏 are the model weight and bias. Moreover, we assume there is an atypical training sample 𝑥0 ∈ R2 labeled as “−1”, with Euclidean distance to the center of each class ||𝑥0 − (−𝜃) ||2 = 𝑑−1, ||𝑥0 − 𝜃||2 = 𝑑+1. Intuitively, this setting of 𝑥0 resembles the cases in Figure 3.4 if 𝑑+1 is relatively small, since it has a short distance to the wrong class. We also constrain 𝑑−1 ≤ 2 × 𝜂 for 𝑥0 to preserve certain similarity to the class “-1”. The following theorem shows that there always exists a model which fits the sample 𝑥0, and achieve 116 optimal accuracy (i.e. > 99%) on both classes, although 𝑑+1 can be small. Theorem 3 For a data distribution in Eq. 2.2, given that 𝜂/𝜎 is large enough, we define 𝑙 = √ 2 · Φ−1(0.99) · 𝜎 · (1 +Φ−1(0.99) · 𝜎 𝜂 )−1/2. For a point 𝑥0 with 𝑑+1 ≤ 2𝜂, if 𝑑+1 ≥ 𝑙, then there exist a model 𝐹 fitting 𝑥0: 𝐹(𝑥0) = −1 and has classification error at most 1% on both classes. Proof 7 We prove this theorem by two main steps. Step 1. we show that for any 𝑥0 satisfying 𝑑−1 ≤ 2𝜂, 𝑑+1 ≥ 𝑙, at least one of 𝐹1, 𝐹2 and 𝐹3 fits 𝑥0: ∃ 𝑖 ∈ {1, 2}, 𝐹𝑖 (𝑥0) = −1, where: 𝐹1(𝑥) := sign( [1, 0]𝑇 𝑥 − (𝜂 − Φ−1(0.99) · 𝜎)) 𝐹2(𝑥) := sign( [𝜏, q 𝜂2 − 𝜏2]𝑇 · 𝑥) 𝐹3(𝑥) := sign( [𝜏, − q 𝜂2 − 𝜏2]𝑇 · 𝑥) In the following parts, we let 𝜏 = Φ−1 · 𝜎 for simplicity. Step 2. Both models 𝐹1 and 𝐹1 has a classification accuracy at least 99% for both classes. Proof of Step 1. For any given sample 𝑥0 = [𝑝, 𝑞], on one hand, if 𝑝 < 𝜂 − 𝜏, 𝐹1(𝑥0) = −1. On the other hand, if 𝑝 ≥ 𝜂−𝜏, we will show all samples 𝑥0 = [𝑝, 𝑞] which satisfies 𝐹2 ( [𝑝, 𝑞])) ≠ −1, it must have 𝑑+1 ≤ 𝑙, where 𝑙 = √ 2𝜎 · (1 + 𝜏/𝜂)−1/2. Here, we assume 𝑞 ≤ 0 without loss of generality. This can be treated as a quadratic maximization problem under a convex closed bound: max (𝑝 − 𝜂)2 + 𝑞2 subject to 𝐾1 := 𝜏 · 𝑝 + q 𝜂2 − 𝜏2 · 𝑞 ≥ 0 and 𝐾2 := 𝑝 ≥ 𝜂 − 𝜏; and 𝐾3 := (𝑝 − (−𝜂))2 + 𝑞2 ≤ (2𝜂)2 and 𝐾4 := 𝑞 ≤ 0 (A.11) Figure A.18 provides an illustration of the problem proposed in Eq. 
The optimum of Eq. A.11 lies on the boundary of the closed region (the shaded area). Next, we evaluate the objective (p − η)² + q² at the points S₁ and S₂, and compare the objective values at x₀ = S₁ and x₀ = S₂.

Figure A.18: Illustration of Eq. A.11 (the feasible region bounded by K₁, K₂, K₃ and K₄, the two candidate optima S₁ and S₂, and the class center θ = [η, 0]).

For x₀ = S₁, it satisfies

τ·p + √(η² − τ²)·q = 0  and  p = η − τ.

Solving these equations, we obtain (p − η)² + q² = (√2 · τ · (1 + τ/η)^(−1/2))².

For x₀ = S₂, it satisfies

τ·p + √(η² − τ²)·q = 0  and  (p − (−η))² + q² = (2η)².

Solving these equations, we obtain (p − η)² + q² = (τ · (1 − τ/η + Ω((τ/η)²))^(−1/2))².

It is not hard to see that, when η/σ is sufficiently large, the objective attains the larger value at S₁, and this optimum is not larger than l². This finishes the proof of Step 1.

Proof of Step 2. The distance from the center of class "+1" to the decision boundary of each of F₁, F₂ and F₃ is τ. Therefore, the error on class "+1" is:

Pr(F(x) ≠ +1 | y = +1) = Pr(N(0, σ²) > τ) = Pr(N(0, 1) > τ/σ) = Pr(N(0, 1) > Φ⁻¹(0.99)) = 1%.

For class "−1", since the decision boundaries of F₂ and F₃ pass through the origin, the center of class "−1" is also at distance τ from them, so these two classifiers likewise have 1% error on class "−1". For the classifier F₁, it is easy to see that its error on class "−1" is smaller than 1%.

Combining the results of Step 1 and Step 2 finishes the proof of Theorem 3.

Theorem 4. Consider the same data distribution as in Eq. (A.10) and the same definition l = √2 · Φ⁻¹(0.99) · σ · (1 + Φ⁻¹(0.99) · σ/η)^(−1/2). Assume l ≤ d_{+1} < k · l, where 1 ≤ k ≪ η/σ. Then any model F has classification error at least 50% on class "+1" whenever F fits the adversarial example of x₀, i.e., F(x₀ + δ) = −1 for all ||δ||₂ ≤ ε, with

ε ≥ σ · k · √2 · Φ⁻¹(0.99) · (1 + Φ⁻¹(0.99) · σ/η)^(−1/2).   (A.12)

Proof 8. During adversarial training, since ε ≥ k · l > d_{+1}, the adversarial perturbation set of the sample x₀ contains the center of class "+1", i.e., θ ∈ {x₀ + δ : ||δ||₂ ≤ ε}. Therefore, if a model fits every adversarial counterpart of the sample x₀ (predicting "−1" on this entire set), its decision boundary places θ on the "−1" side, and no such linear model can have an error rate smaller than 50% on class "+1". This finishes the proof of Theorem 4.
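To complement the proofs above, the construction in Theorem 3 can be sanity-checked numerically. The following minimal Monte Carlo sketch (not taken from the thesis) uses illustrative values η = 5 and σ = 1, samples the two-dimensional Gaussian mixture of Eq. (A.10), and verifies that the classifier F₁ fits an atypical point x₀ with d_{+1} ≥ l while keeping the error on class "+1" near 1% and the error on class "−1" near 0.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (assumed, not from the thesis); eta/sigma is large as required.
eta, sigma, n = 5.0, 1.0, 200_000
rng = np.random.default_rng(0)
tau = norm.ppf(0.99) * sigma                       # tau = Phi^{-1}(0.99) * sigma
l = np.sqrt(2) * tau * (1 + tau / eta) ** -0.5     # the threshold l from Theorem 3

# Sample the 2-D Gaussian mixture: x ~ N(y * theta, sigma^2 I) with theta = (eta, 0).
y = rng.choice([-1, 1], size=n)
x = y[:, None] * np.array([eta, 0.0]) + sigma * rng.standard_normal((n, 2))

# F1(x) = sign([1, 0]^T x - (eta - tau)): a threshold on the first coordinate.
pred = np.sign(x[:, 0] - (eta - tau))

# An atypical point labeled "-1" that is close to class "+1" (d_{+1} = 1.5*tau >= l)
# yet still lies on the "-1" side of F1's decision boundary.
x0 = np.array([eta - 1.5 * tau, 0.0])
print("d_{+1} =", 1.5 * tau, ">= l =", l, "; F1(x0) =", int(np.sign(x0[0] - (eta - tau))))

err_pos = np.mean(pred[y == +1] != +1)   # should be about 0.01
err_neg = np.mean(pred[y == -1] != -1)   # should be essentially 0
print(f"error on class +1: {err_pos:.4f}; error on class -1: {err_neg:.6f}")
```

With these illustrative values, the reported errors should come out at roughly 0.01 and 0, in line with the 1% bound in Theorem 3.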
3 Supplementary to Chapter 4

3.1 The Behavior of Adversarial Training
To examine the performance of PGD adversarial training under imbalanced scenarios, we adversarially train ResNet18 [45] models on multiple imbalanced training datasets constructed from the CIFAR10 dataset [56]. Similar to the observations discussed in Section 4.3.1, and as shown in Figure A.19, Figure A.20 and Figure A.21, adversarial training produces a larger performance gap between well-represented and under-represented classes than natural training does. In particular, in all imbalanced scenarios, adversarially trained models obtain very low robust accuracy on under-represented classes, which again shows that adversarial training cannot be applied directly in practical imbalanced scenarios.

Figure A.19: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Step-10". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.20: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Exp-100". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.21: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Exp-10". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.

3.2 Reweighting Strategy in Natural Training vs. in Adversarial Training
To explore whether the reweighting strategy can help adversarial training deal with imbalance issues, we evaluate the performance of adversarially trained models on several binary imbalanced training datasets with different weights assigned to the under-represented class. As shown in Figure A.22, Figure A.23 and Figure A.24, for adversarially trained models, increasing the weight assigned to the under-represented class improves the model's performance on that class; however, at the same time, the performance on the well-represented class drops drastically. As a comparison, adopting larger weights in naturally trained models also improves performance on the under-represented class, but results in only a slight drop in performance on the well-represented class. In other words, the reweighting strategy proposed for handling imbalance in natural training may provide only limited help in adversarial training, and hence new techniques are needed for adversarial training under imbalanced scenarios.
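As a toy illustration of this reweighting mechanism (natural training only, and not the thesis's CIFAR10 experiments), the following sketch fits a class-weighted logistic regression on a synthetic 10:1 imbalanced Gaussian dataset and reports per-class test accuracy for several minority-class weights; the data and weight values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the binary imbalanced experiments above (natural training only).
rng = np.random.default_rng(0)
dim = 10

def sample(n_pos, n_neg):
    X = np.vstack([rng.normal(+1.0, 3.0, (n_pos, dim)),   # well-represented class (+1)
                   rng.normal(-1.0, 3.0, (n_neg, dim))])  # under-represented class (-1)
    y = np.hstack([np.ones(n_pos, dtype=int), -np.ones(n_neg, dtype=int)])
    return X, y

X_train, y_train = sample(5000, 500)   # 10:1 imbalanced training set
X_test, y_test = sample(5000, 5000)    # balanced test set

for w_minor in [1, 5, 10, 50]:
    clf = LogisticRegression(class_weight={1: 1.0, -1: float(w_minor)}, max_iter=1000)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_maj = (pred[y_test == 1] == 1).mean()
    acc_min = (pred[y_test == -1] == -1).mean()
    print(f"minority weight {w_minor:>2}: well-represented acc {acc_maj:.3f}, "
          f"under-represented acc {acc_min:.3f}")
```

Increasing the minority weight shifts errors from the under-represented class to the well-represented class, which is the general trade-off that the figures above examine for natural versus adversarial training.

Figure A.22: Class-wise performance of reweighted natural & adversarial training in binary classification ("auto" as the well-represented class and "truck" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.23: Class-wise performance of reweighted natural & adversarial training in binary classification ("bird" as the well-represented class and "frog" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.24: Class-wise performance of reweighted natural & adversarial training in binary classification ("dog" as the well-represented class and "deer" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.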
3.3 Proofs of the Theorems in Section 5.4.2

3.3.1 Proof of Lemma 4.4
Proof 9 (Proof of Lemma 4.4). We first prove by contradiction that the optimal model f* has parameters w₁ = w₂ = · · · = w_d (i.e., w = 1). Define G = {1, 2, . . . , d} and suppose that, for the optimal w and b, there exist indices i ≠ j with i, j ∈ G such that w_i < w_j. Then this classifier f* with weight w has the following standard errors on the two classes:

Pr(f*(x) ≠ y | y = −1) = Pr{ Σ_{k≠i,j} w_k·N(−η, σ²) − b + w_i·N(−η, σ²) + w_j·N(−η, σ²) > 0 },
Pr(f*(x) ≠ y | y = +1) = Pr{ Σ_{k≠i,j} w_k·N(+η, σ²) − b + w_i·N(+η, σ²) + w_j·N(+η, σ²) < 0 }.   (A.13)

However, if we define a new classifier f̃ whose weight w̃ replaces w_i by w_j, we obtain the errors of the new classifier:

Pr(f̃(x) ≠ y | y = −1) = Pr{ Σ_{k≠i,j} w_k·N(−η, σ²) − b + w_j·N(−η, σ²) + w_j·N(−η, σ²) > 0 },
Pr(f̃(x) ≠ y | y = +1) = Pr{ Σ_{k≠i,j} w_k·N(+η, σ²) − b + w_j·N(+η, σ²) + w_j·N(+η, σ²) < 0 }.   (A.14)

Comparing the errors in Eq. (A.13) and Eq. (A.14) shows that the classifier f̃ has a smaller error on each class. This contradicts the assumption that f* is the optimal classifier with the smallest error. Thus, we conclude that an optimal linear classifier under natural training must satisfy w₁ = w₂ = · · · = w_d (i.e., w = 1), if we do not consider the scale of w.

Next, we calculate the optimal bias term b given w = 1, i.e., the b that minimizes the (reweighted) empirical risk:

Error_train(f*) = Pr(f*(x) ≠ y | y = −1) · Pr(y = −1) · ρ + Pr(f*(x) ≠ y | y = +1) · Pr(y = +1)
  ∝ Pr(f*(x) ≠ y | y = −1) · ρ + Pr(f*(x) ≠ y | y = +1) · K
  = ρ · Pr( Σ_{i=1}^d N(−η, σ²) − b > 0 ) + K · Pr( Σ_{i=1}^d N(η, σ²) − b < 0 )
  = ρ · Pr( N(0, 1) < (−b − dη)/(dσ) ) + K · Pr( N(0, 1) < (b − dη)/(dσ) ),

and we take the derivative with respect to b:

∂Error_train/∂b = (ρ/√(2π)) · (−1/(dσ)) · exp(−½ ((b + dη)/(dσ))²) + (K/√(2π)) · (1/(dσ)) · exp(−½ ((b − dη)/(dσ))²).

Setting ∂Error_train/∂b = 0, we obtain the optimal b that minimizes the empirical error:

b = ½ · log(ρ/K) · dσ²/η = ½ · log(ρ/K) · d/S.

3.3.2 Proof of Theorem 4.4
Proof 10 (Proof of Theorem 4.4). Without loss of generality, for distributions D₁ and D₂ with different mean-variance pairs (±η₁, σ₁²) and (±η₂, σ₂²), we only need to consider the case η₁ = η₂ = η and σ₁² < σ₂²; otherwise, we can rescale one of them to match the mean vector of the other without affecting the results. Under this setting, the optimal classifiers f₁* and f₂* have weight vectors w₁ = w₂ = 1 and bias terms b₁ and b₂, with the values given in Lemma 4.4. Next, we prove Theorem 4.4 in two steps.

Step 1. For the error on class "−1", we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = −1) = Pr( Σ_{i=1}^d N(−η, σ₁²) − b₁ > 0 )
  < Pr( Σ_{i=1}^d N(−η, σ₁²) − b₂ > 0 )   (because S₁ > S₂)
  < Pr( Σ_{i=1}^d N(−η, σ₂²) − b₂ > 0 )   (because σ₁² < σ₂²)
  = Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = −1).

Step 2. For the error on class "+1", we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) = Pr( Σ_{i=1}^d N(η, σ₁²) − b₁ < 0 ) = Pr( N(0, 1) < (b₁ − dη)/(dσ₁) ) = Pr( N(0, 1) < −log(K)·σ₁/(2η) − η/σ₁ ),   (A.15)

and similarly,

Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1) = Pr( N(0, 1) < −log(K)·σ₂/(2η) − η/σ₂ ).   (A.16)

Note that when K is large enough, i.e., log(K) > 2η²/(σ₁·σ₂), the Z-score in Eq. (A.15) is larger than that in Eq. (A.16). As a result, we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) > Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1).   (A.17)

Combining Step 1 and Step 2 gives the inequality in Theorem 4.4.

3.3.3 Proof of Theorem 4.4
Proof 11 (Proof of Theorem 4.4). We first show that, under both distributions D₁ and D₂, the optimal reweighting ratio ρ is equal to the imbalance ratio K. Based on the error expressions in Eq. (A.13) and the calculated model parameters w and b, the test error of the model trained with reweighting value ρ is:

Error_test(f*) = Pr(f*(x) ≠ y | y = −1) · Pr(y = −1) + Pr(f*(x) ≠ y | y = +1) · Pr(y = +1)
  ∝ Pr( N(0, 1) < (−b − dη)/(dσ) ) + Pr( N(0, 1) < (b − dη)/(dσ) )
  = Pr( N(0, 1) < −½·log(ρ/K)·σ/η − η/σ ) + Pr( N(0, 1) < ½·log(ρ/K)·σ/η − η/σ ).

This value attains its minimum when its derivative with respect to ρ equals 0, which gives ρ = K and a bias term b = 0. Note that the variances satisfy σ₁² < σ₂². Therefore, it is easy to see that:

Pr(f′₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) = Pr( Σ_{i=1}^d N(η, σ₁²) < 0 ) < Pr( Σ_{i=1}^d N(η, σ₂²) < 0 ) = Pr(f′₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1).   (A.18)

Combining the results in Eq. (A.17) and Eq. (A.18), we have proved the inequality in Theorem 4.4.
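As a small sanity check (not part of the original proofs), the closed-form bias of Lemma 4.4 can be verified numerically against the reweighted error expression written above; the parameter values below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Error(b) = rho * Phi((-b - d*eta)/(d*sigma)) + K * Phi((b - d*eta)/(d*sigma)),
# matching the reweighted empirical risk used in the proof of Lemma 4.4.
rho, K, d, eta, sigma = 1.0, 10.0, 5, 1.0, 1.0

def error(b):
    return (rho * norm.cdf((-b - d * eta) / (d * sigma))
            + K * norm.cdf((b - d * eta) / (d * sigma)))

b_numeric = minimize_scalar(error, bounds=(-50, 50), method="bounded").x
b_formula = 0.5 * np.log(rho / K) * d * sigma**2 / eta   # b from Lemma 4.4
print(f"numerical argmin: {b_numeric:.3f}, closed form: {b_formula:.3f}")
```

Both values should agree (about -5.76 for these illustrative parameters), matching b = ½·log(ρ/K)·dσ²/η.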
3.4 Algorithm of SRAT
The algorithm of our proposed SRAT framework is shown in Algorithm 5. Specifically, in each training iteration, we first generate adversarial examples using PGD for the examples in the current batch (Line 5). If the current training epoch has not reached a predefined starting reweighting epoch T_d, we assign the same weight, i.e., w_i = 1, to every adversarial example x′_i in the current batch (Line 6). Otherwise, the reweighting strategy is adopted in the final loss function (Line 15), where a specific weight w_i is assigned to each adversarial example x′_i if its corresponding clean example x_i comes from an under-represented class.

Algorithm 5 Separable Reweighted Adversarial Training (SRAT).
Input: imbalanced training dataset D = {(x_i, y_i)}_{i=1}^n, number of total training epochs T, starting reweighting epoch T_d, batch size N, number of batches M, learning rate γ
Output: an adversarially robust model f_θ
1: Initialize the model parameters θ randomly;
2: for epoch = 1, . . . , T_d − 1 do
3:   for mini-batch = 1, . . . , M do
4:     Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D;
5:     Generate an adversarial example x′_i for each x_i ∈ B;
6:     L(f_θ) = (1/N) Σ_{i=1}^N max_{∥x′_i − x_i∥_p ≤ ε} L(f_θ(x′_i), y_i) + λ·L_sep(x′_i)
7:     θ ← θ − γ∇_θ L(f_θ)
8:   end for
9:   Optional: γ ← γ/κ
10: end for
11: for epoch = T_d, . . . , T do
12:   for mini-batch = 1, . . . , M do
13:     Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D;
14:     Generate an adversarial example x′_i for each x_i ∈ B;
15:     L(f_θ) = (1/N) Σ_{i=1}^N max_{∥x′_i − x_i∥_p ≤ ε} w_i·L(f_θ(x′_i), y_i) + λ·L_sep(x′_i)
16:     θ ← θ − γ∇_θ L(f_θ)
17:   end for
18:   Optional: γ ← γ/κ
19: end for
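A minimal PyTorch-style sketch of the reweighted update in Lines 13-16 of Algorithm 5 is given below. This is an illustrative reconstruction, not the thesis's implementation: `pgd_attack` is a standard L∞ PGD helper written here only for self-containedness (its radius and step size are illustrative and assume inputs in [0, 1]), `class_weights` is a per-class weight vector (all ones before epoch T_d), and the feature-separation term L_sep is left as an optional placeholder `sep_loss` because its exact form is defined in Chapter 4.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD (Line 14): find x' that maximizes the loss inside the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def srat_step(model, optimizer, x, y, class_weights, lam=0.0, sep_loss=None):
    """One reweighted update (Lines 14-16). class_weights[c] = 1 before epoch T_d and
    is larger for under-represented classes afterwards; sep_loss stands in for L_sep."""
    x_adv = pgd_attack(model, x, y)
    per_example = F.cross_entropy(model(x_adv), y, reduction="none")
    w = class_weights[y]                        # weight of each example by its clean class
    loss = (w * per_example).mean()
    if sep_loss is not None:                    # optional feature-separation term (placeholder)
        loss = loss + lam * sep_loss(model, x_adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```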
3.5 Data Distribution of Imbalanced Training Datasets
In our experiments, we construct multiple imbalanced training datasets to simulate various kinds of imbalanced scenarios by combining different imbalance types (i.e., Exp and Step) with different imbalance ratios (i.e., K = 10 and K = 100). Figure A.25 and Figure A.26 show the data distributions of all ten-class imbalanced training datasets used in our preliminary studies and experiments, constructed from the CIFAR10 [56] and SVHN [80] datasets, respectively.

Figure A.25: Data distribution of the imbalanced training datasets constructed from the CIFAR10 dataset. (a) Step 10; (b) Step 100; (c) Exp 10; (d) Exp 100.
Figure A.26: Data distribution of the imbalanced training datasets constructed from the SVHN dataset. (a) Step 10; (b) Step 100; (c) Exp 10; (d) Exp 100.

3.6 Performance Comparison on Imbalanced SVHN Datasets
Table A.7 and Table A.8 show the performance comparison on various imbalanced SVHN datasets with different imbalance types and imbalance ratios. We use bold values to denote the highest accuracy among all methods, and underlined values to indicate the SRAT variants that achieve the highest accuracy among the corresponding baseline methods utilizing the same loss function for making predictions.

From Table A.7 and Table A.8, we obtain a similar observation: compared with the baseline methods, our proposed SRAT method produces a robust model with improved overall performance when the training dataset is imbalanced. In addition, based on the experimental results in Table 4.1 to Table A.8, we find that, compared with the improvement between DRCB-LDAM and SRAT-LDAM, the improvements between DRCB-CE and SRAT-CE and between DRCB-Focal and SRAT-Focal are more obvious. A possible reason for this phenomenon is that the LDAM loss can also implicitly produce a more separable feature space [13], while the CE loss and the Focal loss do not perform any specific operations on the latent feature space. Hence, the feature separation loss contained in SRAT-CE and SRAT-Focal can be more effective at learning a separable feature space and facilitating the CE and Focal losses in prediction. However, in SRAT-LDAM, the feature separation loss and the LDAM loss may affect each other when learning feature representations, and hence the effectiveness of the feature separation loss may be counteracted or weakened. In conclusion, the experiments conducted on multiple imbalanced datasets verify the effectiveness of our proposed SRAT method under various imbalanced scenarios.

Table A.7: Performance Comparison on Imbalanced SVHN Datasets (Imbalance Type: Step).

Imbalance Ratio          10                                   100
Metric           Standard Acc.     Robust Acc.       Standard Acc.     Robust Acc.
Method           Overall   Under   Overall   Under   Overall   Under   Overall   Under
CE                79.88    67.04    37.62    22.08    59.61    26.19    29.57     5.03
Focal             79.96    67.03    37.83    22.47    60.58    28.17    30.27     5.83
LDAM              84.55    74.96    45.80    31.23    65.61    37.13    33.34     8.36
CB-Reweight       79.48    66.07    37.38    21.66    60.23    27.68    29.54     5.32
CB-Focal          80.29    67.56    38.10    23.00    60.73    28.37    30.09     5.75
DRCB-CE           80.62    68.74    37.25    22.79    60.67    28.36    30.02     5.59
DRCB-Focal        79.11    65.72    37.01    22.02    61.65    30.29    30.78     7.06
DRCB-LDAM         87.83    82.63    46.45    35.15    63.78    33.99    33.60     7.28
SRAT-CE           82.89    72.79    38.23    24.70    63.39    33.85    29.64     6.11
SRAT-Focal        85.05    77.10    39.51    28.06    70.12    47.44    32.18    11.08
SRAT-LDAM         87.65    82.62    46.03    34.75    71.56    50.33    33.54    11.63

Table A.8: Performance Comparison on Imbalanced SVHN Datasets (Imbalance Type: Exp).

Imbalance Ratio          10                                   100
Metric           Standard Acc.     Robust Acc.       Standard Acc.     Robust Acc.
Method           Overall   Under   Overall   Under   Overall   Under   Overall   Under
CE                87.54    82.67    44.12    35.33    72.51    56.30    33.34    16.93
Focal             87.82    83.01    44.88    35.97    72.61    56.48    34.09    17.62
LDAM              90.06    86.69    51.84    43.73    79.11    66.86    40.42    25.18
CB-Reweight       87.66    82.79    44.39    35.53    72.25    55.97    33.36    17.16
CB-Focal          87.86    82.96    44.61    35.55    73.23    57.34    34.25    17.90
DRCB-CE           88.49    84.51    43.82    36.28    73.74    58.03    33.52    17.68
DRCB-Focal        87.47    82.78    42.52    34.31    71.95    55.11    33.43    17.63
DRCB-LDAM         91.24    89.65    52.39    46.71    80.29    69.23    40.16    24.64
SRAT-CE           88.70    84.94    44.54    36.59    77.11    64.47    34.48    19.91
SRAT-Focal        89.51    85.42    45.37    37.20    80.04    69.54    35.25    23.04
SRAT-LDAM         91.27    89.55    52.10    46.13    80.71    70.49    40.33    25.11

4 Supplementary to Chapter 5

4.1 Proof of Theorem
In this part, we provide the detailed proof of Theorem 5 in Section 5.4.

Theorem 4 (Recall of Theorem 5). Suppose a set of "clean" samples S_good of size n is i.i.d. sampled from a distribution N(μ, Σ), where Σ ⪯ σ²I. There is also a set of "bad" samples S_bad of size n_p = n/K, K > 1, whose center μ_p satisfies ||μ_p − μ||₂ = d with d = γ·σ. Then the average squared poisoning scores of the clean and bad samples satisfy:

E_{i∈S_bad}[q²(x_i)] − E_{i∈S_good}[q²(x_i)] ≥ ( ((K − 1)/(K + 1))·γ² − (K + 1) ) · σ².

Proof 12. We denote by S_good the set of clean samples and by S_bad the set of bad samples; their union forms the whole set S. To distinguish the centers of these sets, we use μ_g, μ_p and μ_s to denote the centers of the clean samples, the bad samples and the whole set, respectively. First, it is easy to see that μ_g, μ_p and μ_s lie on the same line. Thus, given n_g : n_p = K : 1, the center distances satisfy:

d = ||μ_g − μ_p|| = (1 + K) · ||μ_g − μ_s|| = ((K + 1)/K) · ||μ_p − μ_s||.

We first study the average squared poisoning score over the whole set S. Let V be the top right singular vector of the centered data matrix of S, so that the squared poisoning score of a sample x is q²(x) = (V · (x − μ_s))². For any unit vector v, we have (using E_{i∈S}[x] = μ_s):

E_{i∈S}[(v · (x − μ_s))²] = E_{i∈S}[(v · (x − μ_p) + v · (μ_p − μ_s))²]
  = E_{i∈S}[(v · (x − μ_p))²] + E_{i∈S}[2 (v · (x − μ_p)) (μ_p − μ_s)ᵀ vᵀ] + (v · (μ_p − μ_s))²
  = E_{i∈S}[(v · (x − μ_p))²] − (v · (μ_p − μ_s))².

Since V is the top right singular vector of the centered data matrix X_s − μ_s, we have E_{i∈S}[(V · (x − μ_s))²] ≥ E_{i∈S}[(v′ · (x − μ_s))²] for any unit vector v′. Choosing v′ = (μ_p − μ_s)/||μ_p − μ_s||, the unit vector in the direction of (μ_p − μ_s), we get:

E_{i∈S}[(V · (x − μ_s))²] ≥ E_{i∈S}[(v′ · (x − μ_p))²] − ||μ_p − μ_s||².

For the first term on the right-hand side of this inequality,

(K + 1) · E_{i∈S}[(v′ · (x − μ_p))²] ≥ K · E_{i∈S_good}[(v′ · (x − μ_p))²],

because all the squared scores are non-negative. Then we have:

(K + 1) · E_{i∈S}[(v′ · (x − μ_p))²] ≥ K · E_{i∈S_good}[(v′ · (x − μ_p))²]
  = K · E_{i∈S_good}[(v′ · (x − μ_g) + v′ · (μ_g − μ_p))²]
  = K · [ v′ · (X_g − μ_g)(X_g − μ_g)ᵀ · v′ᵀ + v′ · (μ_g − μ_p)(μ_g − μ_p)ᵀ · v′ᵀ ]
  ≥ K · (0 + ||μ_g − μ_p||²).

The first term is non-negative because the matrix (X_g − μ_g)(X_g − μ_g)ᵀ is positive semi-definite; the bound on the second term holds because v′ has the same direction as (μ_p − μ_s), and hence as (μ_g − μ_p), since μ_p, μ_g and μ_s lie on the same line. Therefore, the average squared poisoning score over the whole set S satisfies:

E_{i∈S}[(V · (x − μ_s))²] ≥ (K/(K + 1)) · ||μ_g − μ_p||² − ||μ_p − μ_s||² = (K/(1 + K)²) · d².

Next, we bound the average squared poisoning score over the good set S_good:

E_{i∈S_good}[(V · (x − μ_s))²] = V · (X_g − μ_g)(X_g − μ_g)ᵀ · Vᵀ + V · (μ_g − μ_s)(μ_g − μ_s)ᵀ · Vᵀ ≤ σ² + ||μ_g − μ_s||² ≤ σ² + (d/(K + 1))².

Based on the calculations above for the whole set and the good set, the average squared poisoning score over the bad set S_bad satisfies:

E_{i∈S_bad}[(V · (x − μ_s))²] = (1 + K) · E_{i∈S}[(V · (x − μ_s))²] − K · E_{i∈S_good}[(V · (x − μ_s))²] ≥ (K²/(K + 1)²) · d² − K·σ²,

and the difference between the two average scores is:

E_{i∈S_bad}[q²(x_i)] − E_{i∈S_good}[q²(x_i)] ≥ ((K − 1)/(K + 1)) · d² − (K + 1)·σ² = ( ((K − 1)/(K + 1))·γ² − (K + 1) ) · σ².
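The gap established by this theorem can be observed numerically. The sketch below (illustrative only, with assumed parameter values) takes the poisoning score of a sample to be its centered projection onto the top right singular vector, as in the proof above, and compares the average squared scores of a clean Gaussian set and a small shifted "bad" set against the theoretical lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, dim, sigma, gamma = 5000, 10, 20, 1.0, 8.0   # illustrative values
d = gamma * sigma
mu_bad = np.zeros(dim)
mu_bad[0] = d                                      # ||mu_p - mu|| = d = gamma * sigma

X_good = sigma * rng.standard_normal((n, dim))                       # clean samples, center 0
X_bad = mu_bad + 0.1 * sigma * rng.standard_normal((n // K, dim))    # a tight poisoned cluster
X = np.vstack([X_good, X_bad])

# Squared poisoning score: projection onto the top right singular vector of the centered data.
Xc = X - X.mean(axis=0)
V = np.linalg.svd(Xc, full_matrices=False)[2][0]
q2 = (Xc @ V) ** 2

gap = q2[n:].mean() - q2[:n].mean()
bound = ((K - 1) / (K + 1) * gamma**2 - (K + 1)) * sigma**2
print(f"observed gap = {gap:.2f} >= theoretical lower bound = {bound:.2f}")
```

For these values the observed gap (roughly 51) comfortably exceeds the lower bound (roughly 41), consistent with the theorem.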
4.2 More Experimental Details
In this part, we provide additional experimental details, such as the pre-processing procedure.

Adult Census Dataset. In this dataset, we use 5 numerical features ("age", "education-num", "hours-per-week", "capital-loss" and "capital-gain"), as well as categorical features such as "workclass", "education", "marital-status", "occupation", "relationship" and "race". For simplicity, we first transform the categorical features into dummy variables and apply Principal Component Analysis to project them onto the space spanned by the first 15 principal components. After the projection, we normalize all 20 features by centering and standardizing them. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix 4.3, we provide empirical results for an additional choice of d, i.e., d = 6.0.

COMPAS Dataset. In this dataset, we have the numerical features "age", "age_cat", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count", "days_b_screening_arrest", "decile_score", "c_jail_in" and "c_jail_out". We use "c_jail_out - c_jail_in" to obtain the number of days in jail and then exclude c_jail_out and c_jail_in. We also have the categorical features c_charge_degree and sex, and we use PCA to find their first two principal directions. Then, we standardize each feature. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix 4.3, we provide empirical results for an additional choice of d, i.e., d = 6.0.
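For concreteness, the Adult Census preprocessing described above could be reproduced roughly as follows. This is an illustrative sketch rather than the original code; the CSV file name and the exact column spellings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative reconstruction of the Adult Census preprocessing described above.
df = pd.read_csv("adult.csv")   # assumed file name

num_cols = ["age", "education-num", "hours-per-week", "capital-loss", "capital-gain"]
cat_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race"]

# Categorical features -> dummy variables -> first 15 principal components.
dummies = pd.get_dummies(df[cat_cols]).to_numpy(dtype=float)
cat_proj = PCA(n_components=15).fit_transform(dummies)

# Concatenate with the 5 numerical features (20 features in total),
# then center and standardize every feature.
X = np.hstack([df[num_cols].to_numpy(dtype=float), cat_proj])
X = StandardScaler().fit_transform(X)
print(X.shape)   # (n_samples, 20)
```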
Table A.9: RFC & Baseline Methods' Performance on Adult Census Dataset, for Equalized Treatment.

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.795  0.808   0.787  0.823     0.784  0.834     0.797  0.800     0.802  0.800
SEVER            0.772  0.829   0.754  0.823     0.757  0.801     0.785  0.795     0.781  0.790
RFC              0.781  0.803   0.791  0.820     0.799  0.830     0.788  0.806     0.783  0.823

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.779  0.834   0.787  0.810     0.781  0.795     0.785  0.823     0.790  0.822
SEVER            0.766  0.803   0.757  0.800     0.770  0.788     0.764  0.788     0.769  0.803
RFC              0.778  0.834   0.788  0.804     0.781  0.812     0.794  0.799     0.793  0.803

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.782  0.797   0.783  0.792     0.766  0.787     0.777  0.822     0.770  0.801
SEVER            0.775  0.769   0.766  0.792     0.741  0.803     0.755  0.773     0.766  0.770
RFC              0.786  0.806   0.794  0.812     0.776  0.815     0.790  0.822     0.786  0.823

Table A.10: RFC & Baseline Methods' Performance on Adult Census Dataset using [128].

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.815  0.944   0.813  0.962     0.804  0.954     0.814  0.942     0.811  0.931
SEVER            0.812  0.952   0.806  0.954     0.808  0.957     0.814  0.942     0.811  0.953
RFC              0.814  0.964   0.808  0.956     0.801  0.951     0.802  0.963     0.802  0.944

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.801  0.969   0.797  0.864     0.769  0.966     0.796  0.804     0.799  0.799
SEVER            0.800  0.960   0.802  0.954     0.788  0.945     0.786  0.937     0.779  0.956
RFC              0.802  0.967   0.811  0.946     0.803  0.950     0.808  0.952     0.809  0.951

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.784  0.953   0.797  0.841     0.700  0.943     0.755  0.821     0.798  0.800
SEVER            0.802  0.959   0.793  0.978     0.775  0.952     0.778  0.909     0.775  0.927
RFC              0.799  0.944   0.799  0.952     0.796  0.950     0.801  0.951     0.803  0.946

4.3 More experimental results
In this part, we provide additional empirical results to validate the effectiveness of our attack and defense methods. In detail, we consider further settings: (1) a different type of fairness criterion, such as "Equalized Treatment"; (2) a different fair classification method, namely the one in [128]; and (3) a different choice of the radius of the feasible injection space d, namely d = 6.0. Notably, in all experiments in the main paper we set d to the fixed value d = 9.0; here we also report results for the smaller option d = 6.0, because a larger d (i.e., d > 9.0) would make the poisoned samples easier for most defense methods to detect.

For the fairness criterion "Equalized Treatment" (Equalized Positive Rate, PR) on the Adult Census dataset, we set the desired fairness criterion to be |PR(z = 0) − PR(z = 1)| < 0.2. For the fairness criterion "Equalized TPR" on the Adult Census dataset, we set the unfairness criterion to be |TPR(z = 0) − TPR(z = 1)| < 0.05. For the fairness criteria "Equalized TPR" and "Equalized Treatment" on the COMPAS dataset, we set the desired fairness criteria to be max_{j∈Z} |TPR(z = j) − TP̂R| ≤ 0.15 and max_{j∈Z} |PR(z = j) − P̂R| ≤ 0.15.
Table A.11: RFC & Baseline Methods' Performance on Adult Census Dataset when d = 6.

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.796  0.970   0.803  0.955     0.762  0.977     0.803  0.778     0.811  0.723
SEVER            0.762  0.955   0.786  0.956     0.744  0.902     0.743  0.944     0.762  0.735
RFC              0.800  0.969   0.805  0.966     0.812  0.945     0.785  0.963     0.783  0.961

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.802  0.945   0.800  0.945     0.759  0.946     0.800  0.672     0.789  0.721
SEVER            0.758  0.960   0.743  0.980     0.724  0.961     0.726  0.855     0.724  0.756
RFC              0.805  0.955   0.801  0.967     0.775  0.953     0.801  0.955     0.774  0.945

Table A.12: RFC & Baseline Methods' Performance on COMPAS Dataset for Equalized Treatment.

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.671  0.845   0.663  0.854     0.655  0.845     0.659  0.873     0.670  0.845
SEVER            0.665  0.831   0.665  0.849     0.667  0.836     0.674  0.842     0.672  0.834
RFC              0.671  0.830   0.669  0.845     0.672  0.846     0.677  0.850     0.672  0.844

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.680  0.843   0.662  0.845     0.666  0.862     0.658  0.854     0.677  0.843
SEVER            0.662  0.836   0.659  0.839     0.646  0.850     0.648  0.825     0.671  0.842
RFC              0.677  0.838   0.676  0.841     0.675  0.852     0.674  0.848     0.668  0.845

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.629  0.866   0.668  0.845     0.669  0.820     0.662  0.855     0.675  0.855
SEVER            0.633  0.848   0.631  0.855     0.613  0.850     0.613  0.866     0.678  0.845
RFC              0.659  0.845   0.674  0.846     0.675  0.835     0.671  0.846     0.680  0.845

Table A.13: RFC & Baseline Methods' Performance on COMPAS Dataset using [128].

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.651  0.844   0.642  0.865     0.644  0.827     0.651  0.855     0.662  0.840
SEVER            0.647  0.846   0.6770 0.841     0.657  0.830     0.644  0.851     0.642  0.830
RFC              0.656  0.852   0.668  0.841     0.679  0.850     0.677  0.846     0.661  0.852

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.655  0.840   0.645  0.842     0.646  0.832     0.628  0.851     0.654  0.816
SEVER            0.645  0.860   0.640  0.861     0.628  0.840     0.619  0.852     0.659  0.831
RFC              0.655  0.858   0.662  0.854     0.663  0.850     0.660  0.851     0.659  0.852

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.659  0.820   0.631  0.851     0.649  0.800     0.621  0.851     0.658  0.815
SEVER            0.641  0.866   0.638  0.851     0.596  0.865     0.611  0.870     0.643  0.831
RFC              0.671  0.861   0.659  0.855     0.650  0.851     0.640  0.857     0.659  0.850
Table A.14: RFC & Baseline Methods' Performance on COMPAS Dataset when d = 6.

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.650  0.865   0.581  0.838     0.648  0.800     0.651  0.820     0.631  0.815
SEVER            0.668  0.853   0.643  0.788     0.652  0.800     0.656  0.802     0.644  0.823
RFC              0.670  0.850   0.651  0.835     0.658  0.820     0.663  0.822     0.665  0.835

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.621  0.854   0.617  0.859     0.515  0.980     0.614  0.890     0.640  0.801
SEVER            0.625  0.819   0.641  0.801     0.624  0.770     0.645  0.853     0.635  0.815
RFC              0.660  0.833   0.675  0.857     0.672  0.832     0.664  0.832     0.660  0.847
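The "Fair." values in the tables above are evaluated against criteria of the form given in Section 4.3. The following is a minimal sketch (illustrative only, not the thesis's evaluation code) of how the Equalized-Treatment and Equalized-TPR gaps can be computed from binary predictions and a sensitive attribute z.

```python
import numpy as np

def positive_rate_gap(y_pred, z):
    """Equalized-Treatment gap: |PR(z=0) - PR(z=1)| for predictions in {0, 1}."""
    return abs(y_pred[z == 0].mean() - y_pred[z == 1].mean())

def tpr_gap(y_true, y_pred, z):
    """Equalized-TPR gap: |TPR(z=0) - TPR(z=1)|."""
    tpr = lambda mask: y_pred[(y_true == 1) & mask].mean()
    return abs(tpr(z == 0) - tpr(z == 1))

# Illustrative usage with random placeholders for real model outputs.
rng = np.random.default_rng(0)
y_true, y_pred, z = (rng.integers(0, 2, 1000) for _ in range(3))
print(positive_rate_gap(y_pred, z) < 0.2, tpr_gap(y_true, y_pred, z) < 0.05)
```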