ON THE INTERACTION BETWEEN MACHINE LEARNING ROBUSTNESS AND FAIRNESS: PROBLEMS, THEORIES AND METHODS

By Han Xu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science — Doctor of Philosophy

2024

ABSTRACT

Machine Learning (ML) and Artificial Intelligence (AI) have become powerful tools for solving a wide range of tasks, from language processing and image recognition to more complex fields such as healthcare and finance. These advanced capabilities are transforming industries and improving efficiency and accuracy in unprecedented ways. However, alongside the remarkable benefits of ML, there are significant concerns about its trustworthiness. Key among these concerns are issues related to safety, fairness, privacy, and explainability. Safety, particularly in terms of robustness to perturbations, is a crucial aspect of ML trustworthiness. It refers to the reliability of a model's predictions when faced with small changes or noise in the input data. Fairness is another critical dimension of ML trustworthiness: ML models should provide equalized prediction quality across different groups of people, ensuring that no individual or group is unfairly disadvantaged by the model's outputs. Although various techniques have been proposed to address these issues individually, the internal relationship between robustness and fairness remains largely unexplored. For instance, it is unclear whether enhancing robustness will inadvertently compromise fairness, or vice versa. This is a crucial question for applications like facial recognition systems, where both safety and equitable prediction quality are essential. If we aim to achieve both robustness and fairness, what is the optimal strategy to ensure both? In this thesis, we conduct a comprehensive study to explore the interplay between robustness and fairness in ML models. Our findings reveal a strong tension between these two factors: enhancing robustness can negatively impact fairness, and improving fairness can reduce robustness. This trade-off poses significant challenges for developing AI systems that are both safe and equitable. Based on our studies, we provide new insights into ML models and propose promising strategies to balance robustness and fairness.

To my family and friends for the love and support.

ACKNOWLEDGEMENTS

First, I am deeply grateful to my advisor, Dr. Jiliang Tang, for his invaluable advice, inspiration, encouragement, and support throughout my Ph.D. journey. Throughout this period, I learned so much from him. I would like to express my deepest appreciation to my dissertation committee members, Dr. Anil K. Jain, Dr. Charu C. Aggarwal, Dr. Sijia Liu, and Dr. Yuying Xie, for their insightful comments and helpful suggestions. Besides, I was fortunate to work as an intern at VISA Research with amazing colleagues and mentors: Dr. Menghai Pan, Dr. Huiyuan Chen and Dr. Zhimeng Jiang. I enjoyed the wonderful and productive days with them. During my years at Michigan State University, I was fortunate to have supportive and encouraging colleagues. I am grateful for the academic and personal mentorship of Dr. Xiaorui Liu, Dr. Yao Ma, Dr. Zhiwei Wang, Dr. Wenqi Fan, Dr. Wei Jin and Dr. Tyler Derr. Moreover, it has also been a great pleasure to work, collaborate and debate with Pengfei He, Dr. Wentao Wang, Yaxin Li, Jie Ren, Yuxuan Wan, Yingqian Cui, Shenglai Zeng, Yuping Lin.
Moreover, Dr. Sandeep Kulkarni is the professor who saved me from quitting and taught me how to be kind to others. Finally, I would like to thank my family for their unconditional love and support, and my roommates Tanner and Hunter for their company during the pandemic.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION ........................................................... 1
CHAPTER 2  ENHANCING ROBUSTNESS MAY CAUSE CLASS-WISE UNFAIRNESS ................... 6
CHAPTER 3  ENHANCING ROBUSTNESS MAY AMPLIFY INNER-CLASS UNFAIRNESS ................ 24
CHAPTER 4  ENHANCING ROBUSTNESS MAY AMPLIFY UNFAIRNESS ON IMBALANCED DATA ........ 45
CHAPTER 5  ENHANCING FAIRNESS MAY THREATEN ROBUSTNESS AGAINST POISONING ATTACKS .. 64
CHAPTER 6  CONCLUSION AND FUTURE DIRECTIONS ...................................... 81
BIBLIOGRAPHY ..................................................................... 83
APPENDIX ......................................................................... 95

CHAPTER 1
INTRODUCTION

1.1 Introduction

1.1.1 Motivation

Machine Learning (ML) has been applied across a diverse array of applications, including computer vision, language processing, and numerous other domains. Despite their considerable potential, these technologies may encounter various trustworthiness challenges, particularly when implemented in high-stakes applications such as autonomous vehicles [57] or medical diagnosis and treatment [68]. Specifically, there are profound concerns regarding the (adversarial) robustness [126] and fairness [34, 42] of these models. In detail,

• Adversarial robustness is crucial for ensuring that a model provides correct prediction outcomes, even when the training or test data contain malicious perturbations. For example, an attacker might add doodles onto traffic signs to mislead the recognition system of an autonomous vehicle [76], leading to potentially dangerous situations. Formally, most studies [41, 71, 15] formulate the adversarial attack as an optimization problem: given a classification model 𝑓 and a test sample 𝑥 with true label 𝑦, an attacker aims to introduce a perturbation 𝛿 such that the model's prediction loss is maximized:

$$\max_{\|\delta\| \le \epsilon} \; \mathcal{L}(f(x+\delta), y) \qquad (1.1)$$

Meanwhile, the norm of the perturbation 𝛿 is constrained to keep the perturbation slight. Since the model's loss on 𝑥 + 𝛿 with respect to the true label 𝑦 is high, the model can produce an incorrect prediction. Empirically, numerous studies in the literature demonstrate that popular ML models are highly vulnerable to adversarial attacks. For example, on a benchmark image classification task on ImageNet [57], an attacker can cause the model to have almost 0% accuracy by introducing noise that is imperceptible to human eyes [89]. Therefore, as a countermeasure to protect ML models against these attacks, the adversarial training strategy [71, 135] is one of the most studied methods: it first generates an adversarial perturbation for each training sample by solving the problem in Eq. 1.1, and then trains the model to minimize the error even when these perturbations exist:

$$\min_f \; \mathbb{E}_x \Big[ \max_{\|\delta\| \le \epsilon} \mathcal{L}(f(x+\delta), y) \Big]. \qquad (1.2)$$

In this way, the trained model will give the correct prediction even when such perturbations exist.
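To make Eqs. (1.1) and (1.2) concrete, below is a minimal PyTorch sketch of an 𝑙∞ PGD attack and a single adversarial-training step. The function names, hyperparameter values (e.g., step size 2/255, 10 steps) and the assumption that inputs lie in [0, 1] are illustrative choices, not prescriptions from this thesis.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate the inner maximization of Eq. (1.1) with l_inf PGD."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend on the loss, then project back into the l_inf ball of radius eps.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()   # assumes inputs live in [0, 1]

def adversarial_training_step(model, optimizer, x, y, eps=8/255):
    """One stochastic step of the outer minimization of Eq. (1.2)."""
    model.eval()                                # freeze BN statistics while attacking
    x_adv = pgd_attack(model, x, y, eps=eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)     # minimize the loss on perturbed inputs
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full training run, `adversarial_training_step` would simply be called once per mini-batch inside the usual epoch loop.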
In addition to the case above, where the attacks happen during the inference phase, recent works have also found that slight perturbations in the training dataset of an ML model can greatly compromise the model's performance; such attacks are called data poisoning attacks [100, 46]. As countermeasures against data poisoning attacks, well-known defense strategies include data sanitization methods [29] and robust training [88].

• Fairness aims to ensure equitable treatment and consistent prediction quality across different individuals and groups. Thus, fair machine learning techniques are critical to promote equity in ML applications such as healthcare [72], hiring [96], lending [3], and law enforcement [83]. For linear ML models, improving fairness involves integrating fairness constraints into the optimization process and then solving constrained optimization problems [34, 128, 42]. In deep neural networks, techniques like fair representation learning and adversarial debiasing [129, 132] are used to create feature representations that are invariant to sensitive attributes.

Although multiple methods have been proposed to improve model robustness and fairness separately, it is still unknown whether these two properties interact with each other. Indeed, when we apply adversarial training to improve model robustness, will it impact the model's fairness and cause discrimination? This is a critical question for ML applications where both robustness and fairness are important. For instance, in human facial recognition, both robustness to perturbations and equalized prediction quality among different people are desired. In such an application, what is the optimal strategy to guarantee both robustness and fairness? This thesis aims to answer the questions above by conducting a comprehensive study on the interaction between ML robustness and fairness, uncovering the inner mechanisms of machine learning in terms of these two properties, and providing viable solutions to achieve both robustness and fairness.

1.1.2 Dissertation Contributions

This thesis comprehensively studies the interaction between ML robustness and fairness from different angles and perspectives. Overall, it first presents three works uncovering the impact of improving ML robustness on model fairness, where disparity is considered among different classes, among different sub-groups within each class, and in the scenario of imbalanced datasets. Then, the impact of improving ML fairness on model robustness, especially robustness to poisoning attacks, is discussed. Specifically,

• As one of the most popular and effective methods to improve ML robustness, adversarial training has been widely studied and applied. However, our work [121] has revealed that adversarial training tends to exaggerate class-wise disparity in terms of model accuracy and robustness (accuracy on adversarial examples), which has been validated on various benchmark image classification datasets. This empirical finding demonstrates that adversarial training tends to create obvious weak points in the model's accuracy and safety, which brings great concerns to the application of such methods. In the work [121], a further empirical analysis is conducted to validate that adversarial training naturally tends to sacrifice the performance on hard classes and favor easy classes.
This result and analysis not only uncovers a natural behavior of adversarial training, but also highlights the importance of debiasing when building adversarially robust machine learning models.

• Beyond the fairness / bias among different classes, a follow-up work [124] conducts a further analysis comparing the performance of adversarial training between different subgroups within each class of a classification task. Specifically, the work [124] considers the disparity between samples from the main subgroup (the data distribution from which most samples are drawn) and the rare subgroups (data samples that are not from the main distribution). Empirical results demonstrate that adversarial training tends to result in a greater discrepancy between the main subgroup and rare subgroups. This result shows that adversarial training also results in a greater bias compared to traditional training, which is consistent with the conclusion of the work [121].

• In machine learning practice, it frequently happens that the training data is imbalanced: some classes have many more training samples than others. In the work [111], a study of adversarial training on imbalanced datasets has been conducted. Overall, the work [111] finds that adversarial training tends to cause a greater performance disparity between well-represented classes (the classes with more training samples) and under-represented classes (the classes with fewer training samples). Moreover, the work [111] also demonstrates that adversarial training greatly sacrifices the performance on under-represented classes when applying common imbalanced training schemes such as re-weighting. These findings show that adversarial training also has severe bias / unfairness issues on imbalanced datasets.

• Conversely, fairness training techniques can also bring risks to ML safety. In our work [123], we conduct the first study to show that fair classification methods in linear models can increase vulnerability to data poisoning attacks, a type of adversarial attack that adds perturbations to the training samples of an ML model. Our work provides both experimental and theoretical studies to demonstrate why fair classification methods are more vulnerable to such attacks, and we propose corresponding defenses for safer fair classification.

1.1.3 Dissertation Structure

The remainder of this dissertation is organized as follows, and the overall picture is presented in Figure 1.1.

Figure 1.1: The overall outlook on the discussed topics about the interaction between ML robustness and fairness.

On the one hand, in Chapter 2, we systematically study the interaction between adversarial robustness and class-wise performance disparity (fairness). We provide comprehensive experimental and theoretical results to show that adversarial training can inherently introduce class-wise unfairness. In Chapter 3, we systematically study the interaction between adversarial robustness and performance disparity within each class of classification tasks. Indeed, we provide evidence demonstrating that adversarial training can cause greater unfairness between the subgroups within each class. In Chapter 4, we extend our discussion from class-balanced datasets to imbalanced datasets, where the numbers of training samples in each class are different.
Again, we provide detailed empirical evidence and theoretical analysis validating that improving adversarial robustness can introduce greater unfairness (between well-represented and under-represented classes). On the other hand, in Chapter 5, we conduct a study in the opposite direction to show that fair classification algorithms can hurt models' robustness to data poisoning attacks. Finally, we conclude the thesis by remarking on the importance of the strong tension between model robustness and fairness, and discussing future directions in Chapter 6.

CHAPTER 2
ENHANCING ROBUSTNESS MAY CAUSE CLASS-WISE UNFAIRNESS¹

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang, titled "To be robust or to be fair: Towards fairness in adversarial training", published at the International Conference on Machine Learning (2021) [121].

2.1 Introduction

The existence of adversarial examples [41] causes great concerns when applying deep neural networks to safety-critical tasks, such as autonomous driving vehicles and face identification [76, 99]. As countermeasures against adversarial examples, adversarial training algorithms aim to train a classifier that can classify input samples correctly even when they are adversarially perturbed. Namely, they optimize the model to minimize the adversarial risk, i.e., the risk that a sample can be perturbed to be wrongly classified:

$$\min_f \; \mathbb{E}_x \Big[ \max_{\|\delta\| \le \epsilon} \mathcal{L}(f(x+\delta), y) \Big]. \qquad (2.1)$$

These adversarial training methods [58, 71, 135] have been shown to be one of the most effective and reliable approaches to improve model robustness against adversarial attacks. Although promising for improving the model's robustness, we reveal an intriguing property of adversarial training algorithms: they usually result in a large disparity of accuracy and robustness among different classes. As a preliminary study in Section 2.3, we apply natural training and PGD adversarial training [71] on the CIFAR10 dataset [56] using a PreAct-ResNet18 [45] architecture. For a naturally trained model, the model performance in each class is similar. However, in the adversarially trained model, there is a severe class-wise performance discrepancy (in both accuracy and robustness). For example, the model has both low standard and robust errors on samples from the class "car", but it has a much larger error rate on "cat" images. Meanwhile, this fairness issue does not appear in natural models which are trained on clean data. Note that this phenomenon happens on the balanced CIFAR10 dataset and also exists in other datasets, model architectures and adversarial training algorithms. More details can be found in Section 2.3. If this phenomenon happens in real-world applications, it can raise huge concerns about safety. Imagine that a traffic sign recognizer has an overall high performance, but it is very inaccurate and vulnerable to perturbations only for some specific signs. The safety of such an autonomous driving car is still not guaranteed. Meanwhile, this phenomenon can also lead to issues from the social ethics perspective. For example, a robustly trained face identification system might provide different levels of safety for the services provided to different ethnic communities. Thus, there is a pressing need to study this phenomenon. In this work, we define this phenomenon as the "robust fairness" problem of adversarial training, and we aim to understand and mitigate this fairness problem.
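The class-wise errors discussed in this chapter (and measured in the preliminary study of Section 2.3 below) can be obtained with a per-class evaluation loop such as the sketch below. It reuses the illustrative `pgd_attack` helper sketched in Chapter 1 and assumes a 10-class dataset such as CIFAR10; it is a hypothetical evaluation protocol, not the exact code used in [121].

```python
import torch

def classwise_errors(model, loader, pgd_attack, num_classes=10, eps=8/255, device="cuda"):
    """Per-class standard and robust error rates (in %), measured with a PGD attack."""
    std_correct = torch.zeros(num_classes)
    rob_correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            std_pred = model(x).argmax(dim=1)
        x_adv = pgd_attack(model, x, y, eps=eps)    # the attack itself needs gradients
        with torch.no_grad():
            rob_pred = model(x_adv).argmax(dim=1)
        for c in range(num_classes):
            mask = (y == c)
            total[c] += mask.sum().item()
            std_correct[c] += (std_pred[mask] == c).sum().item()
            rob_correct[c] += (rob_pred[mask] == c).sum().item()
    std_err = 100.0 * (1.0 - std_correct / total)   # class-wise standard error
    rob_err = 100.0 * (1.0 - rob_correct / total)   # class-wise robust error
    return std_err, rob_err
```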
We first explore the question: "why does adversarial training have this accuracy/robustness disparity between classes while natural training does not present a similar issue?" To answer this question, we first study several rather simple classification problems with mixture Gaussian distributions, where we design the data in different classes to have different "difficulty levels" for classification. By studying the class-wise performance of naturally and adversarially trained models, we deepen our understanding of this problem. Compared to natural training, adversarial training has a stronger tendency to favor the accuracy of classes that are "easier" to predict. Meanwhile, adversarial training sacrifices the prediction performance of the "difficult" classes. As a result, there is a much more obvious class-wise performance discrepancy in adversarial training. Motivated by our empirical findings and theoretical understandings, we propose a dynamic debiasing framework, i.e., Fair Robust Learning (FRL), to mitigate the robust fairness issue in adversarial settings. Our main contributions are summarized as:

• We discover the robust fairness problem of adversarial training algorithms and empirically verify that this problem can be general;

• We build conceptual examples to understand the potential reasons that cause this fairness problem; and

• We propose a Fair Robust Learning (FRL) framework to mitigate the fairness issue in adversarial training.

2.2 Related Work

Adversarial Attacks and Adversarial Training. The existence of adversarial attacks [41, 102, 15] causes huge concerns when people adopt machine learning models in various application domains [126, 49]. As countermeasures against adversarial examples, adversarial training (robust optimization) algorithms [41, 71, 135, 98, 134] are formulated as a min-max problem that directly minimizes the model's risk on adversarial samples. Another main stream of defense methods is certified defense, which aims to provide provably robust DNNs under an 𝑙𝑝-norm bound [117, 24] and guarantee robustness.

Fairness in Machine Learning & Imbalanced Datasets. Fairness issues have recently drawn much attention from the machine learning community. These issues can generally be divided into two categories: (1) prediction outcome disparity [128]; and (2) prediction quality disparity [10]. Unlike existing works, this work is the first to study the unfairness issue in the adversarial setting. We also mention imbalanced data learning [43, 65] as a related topic of our work, since in our work (i.e., Figure 2.1) we show that prediction performance differences indeed exist between different classes. This phenomenon is also well studied in imbalanced data or long-tail distribution learning problems [113], where some classes have far fewer training samples than others. However, in our case, we show that this unfairness problem can generally happen in balanced datasets, so it calls for new perspectives and methods for further study.

Fairness in Robust Learning. A parallel and independent work [79] also finds that the phenomenon of class-wise unequal robustness can happen for many deep learning tasks in the wild, while our work focuses more on adversarial training algorithms and argues that adversarial training methods themselves can cause these fairness phenomena.
In addition, our work also discusses various potential mitigation methods to achieve more balanced robustness for adversarial training methods.

2.3 Preliminary Studies

In this section, we introduce our preliminary findings to show that adversarial training algorithms usually present the fairness issue, which is related to the strong disparity of standard accuracy and robustness among different classes. We first examine algorithms including PGD adversarial training [71] and TRADES [135] on the CIFAR10 dataset [56] under the 𝑙∞-norm adversarial attack (with perturbation bound 8/255). On CIFAR10, we both naturally and adversarially train PreAct-ResNet18 [43] models. The results are presented in Figure 2.1.

Figure 2.1: The class-wise standard / robust error of natural and adversarial training methods on CIFAR10, under PreAct-ResNet18. Panels: (a) Natural Training; (b) PGD Adv. Training; (c) TRADES (1/𝜆 = 1); (d) TRADES (1/𝜆 = 6).

From Figure 2.1, we can observe that, for the naturally trained models, every class has similar standard error (around 7±5%) and 100% robust error under the 8/255 PGD attack. However, for adversarially trained models, the disparity phenomenon is severe. For example, a PGD-adversarially trained model has 32.8% standard error and 82.4% robust error for samples in the class "cat", which are much larger than the model's average standard error of 15.5% and average robust error of 56.4%, while the best class, "car", only has 6.1% standard error and 34.3% robust error. These results suggest that adversarial training can cause strong disparities of standard / robustness performance between different classes, which are originally negligible in natural training.

To further support the claim that adversarial training amplifies the class-wise performance disparity, besides Figure 2.1, we provide additional statistical evidence to numerically compare each model's class-wise performance disparity. We measure each model's Standard Deviation of class-wise error (SD), Normalized Standard Deviation of class-wise error (NSD), and Normalized Standard Deviation of class-wise accuracy (NSD (Acc)).² Note that NSD and NSD (Acc) normalize out the scale of the average performance, which excludes the influence of the average performance on the disparity issue. From Table 2.1, we find that adversarial training methods have larger disparity than natural training across all three metrics, for both standard and robust performance. Thus, we can confirm that adversarial training does amplify the performance disparity between classes.

Table 2.1: The class-wise disparity on CIFAR10 (%).

Std. Error            SD     NSD    NSD (Acc)
Natural Train.        2.8    0.47   0.03
PGD Adv. Train.       9.2    0.57   0.11
TRADES (1/𝜆 = 1)      7.5    0.55   0.09
TRADES (1/𝜆 = 6)      9.6    0.49   0.12

Rob. Error            SD     NSD    NSD (Acc)
Natural Train.        0.0    0.0    -
PGD Adv. Train.       15.6   0.28   0.34
TRADES (1/𝜆 = 1)      16.8   0.29   0.40
TRADES (1/𝜆 = 6)      17.0   0.34   0.34

Moreover, we find that the reason for this fairness phenomenon might be the unequal influence of adversarial training on different classes. It tends to hurt the standard performance of classes which are intrinsically "harder" to classify, but does not effectively improve their robustness. In Table 2.2, we list the classes "dog" and "cat", which have the highest errors in natural training, as well as "car" and "ship", which have the lowest errors. We can observe that adversarial training increases the standard errors of "dog" and "cat" by a much larger margin than those of the classes "car" and "ship".
Similarly, adversarial training provides less help in reducing the robust errors of "dog" and "cat". In conclusion, we hypothesize that adversarial training tends to make the hard classes even harder to classify or to robustly classify. In Section 2.4, we will theoretically confirm this hypothesis. We further investigate other settings, including the WRN28 model architecture, the SVHN dataset, 𝑙2-norm adversarial training and the Randomized Smoothing [24] algorithm. Results can be found in Appendix 2.2, where we can make similar observations. These findings suggest that the observations from Figure 2.1, Table 2.1 and Table 2.2 are likely to generalize to other adversarial training algorithms, model architectures, datasets and other types of adversarial attacks.

²Normalized Standard Deviation: NSD = SD / mean. Note that NSD and NSD (Acc) have unequal "mean" values.

Table 2.2: The changes of standard & robust error in natural & adversarial training on CIFAR10.

Std. Error            Cat     Dog     Car     Ship
Nat. Train            11.3    10.0    1.8     3.5
PGD Adv. Train        34.8    26.9    6.1     6.4
Diff. (Adv. - Nat.)   23.5    16.9    4.3     2.9

Rob. Error            Cat     Dog     Car     Ship
Nat. Train            100     100     100     100
PGD Adv. Train        82.7    66.4    34.3    40.8
Diff. (Adv. - Nat.)   -17.3   -33.5   -65.7   -59.2

2.4 Theoretical Analysis

From our preliminary studies, we consistently observe that adversarially trained models have a great performance disparity (in both standard and robust errors) between different classes. Moreover, adversarial training tends to hurt the classes which are originally harder to classify in natural training. What is the reason that leads to this phenomenon? Is this "unfairness" property an inherent property of adversarial training methods? These questions are not trivial to answer because they are closely related to another property of adversarial training: it usually degrades the model's average accuracy. Given that naturally trained models already present slight disparities between classes (see Figure 2.1 (a)), is the larger class-wise accuracy disparity in adversarial training only a natural outcome of the model's worse average accuracy? To deepen our understanding of these questions, we theoretically study the effect of adversarial training on a binary classification task under a mixture Gaussian distribution. We design the two classes to have different "difficulties" to be classified. In this case, adversarial training does not significantly degrade the average standard error, but its decision boundary, compared to that of naturally trained models, is more biased to favor the "easier" class and hurt the "harder" class. From this theoretical study, we aim to validate that adversarial training methods can have an intrinsic tendency to exert unequal influence on different classes and consequently cause the fairness problem. In the following, we first introduce the necessary notations and definitions.

Notation. We use 𝑓 to denote the classification model, which is a mapping 𝑓 : X → Y from the input data space X to the output label space Y. It can be formulated as 𝑓 (𝑥) = sign(𝑤 · 𝑥 + 𝑏) with parameters 𝑤 and 𝑏. Generally, for a classifier 𝑓 , the overall standard error is defined as Rnat( 𝑓 ) = Pr( 𝑓 (𝑥) ≠ 𝑦); and its overall robust error is Rrob( 𝑓 ) = Pr(∃𝛿, ||𝛿|| ≤ 𝜖, s.t. 𝑓 (𝑥 + 𝛿) ≠ 𝑦), which is the probability that there exists a perturbation that makes the model give a wrong prediction.³ We use Rnat( 𝑓 ; 𝑦) to denote the standard error conditional on a specific class 𝑌 = 𝑦.

³In this section, we focus on 𝑙∞-norm bounded perturbations.
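Using the per-class errors just defined, the disparity statistics reported in Table 2.1 reduce to simple summaries of the class-wise error vector. The following small NumPy sketch (illustrative, not the authors' code) computes SD, NSD and NSD (Acc) from a list of class-wise error rates:

```python
import numpy as np

def disparity_metrics(class_errors):
    """SD, NSD and NSD (Acc) of a vector of class-wise error rates given in %."""
    err = np.asarray(class_errors, dtype=float)
    acc = 100.0 - err                       # class-wise accuracies
    sd = err.std()                          # standard deviation of class-wise errors
    nsd = sd / err.mean()                   # SD normalized by the mean error
    nsd_acc = acc.std() / acc.mean()        # SD of accuracies, normalized by mean accuracy
    return sd, nsd, nsd_acc
```

Note that when the average error in a row is zero (such as the natural model's robust-error row in Table 2.1), the normalized quantity is undefined, which is why that entry appears as "-" in the table.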
2.4.1 A Binary Classification Task

We start with a rather simple example of a binary classification task on Gaussian mixture data. Here, we aim to design the two classes with inherently different "difficulties" to be classified. Specifically, in the following definition, the data come from two classes Y = {−1, +1}, and the data of each class follow a Gaussian distribution centered at 𝜃 (for class "+1") and −𝜃 (for class "-1"), respectively. In our case, we specify a 𝐾-factor difference between the two classes' standard deviations: 𝜎+1 : 𝜎−1 = 𝐾 : 1 with 𝐾 > 1.

$$y \overset{u.a.r.}{\sim} \{-1, +1\}, \qquad \theta = (\underbrace{\eta, \ldots, \eta}_{\dim = d}), \qquad x \sim \begin{cases} \mathcal{N}(\theta, \; \sigma_{+1}^2 I) & \text{if } y = +1 \\ \mathcal{N}(-\theta, \; \sigma_{-1}^2 I) & \text{if } y = -1 \end{cases} \qquad (2.2)$$

Intuitively, the class "+1" is harder than class "-1" because it is less compact in the data space. In Theorem 1, we formally show that the class "+1" is indeed harder, because an optimal linear classifier gives a larger error for class "+1" than for class "-1".

Theorem 1 For the data distribution D in Eq. 2.2, the optimal linear classifier 𝑓nat which minimizes the average standard classification error,

$$f_{\text{nat}} = \arg\min_f \; \Pr(f(x) \neq y),$$

has the intra-class standard errors for the two classes:

$$\mathcal{R}_{\text{nat}}(f_{\text{nat}}, -1) = \Pr\big\{\mathcal{N}(0, 1) \le A - K \cdot \sqrt{A^2 + q(K)}\big\}$$
$$\mathcal{R}_{\text{nat}}(f_{\text{nat}}, +1) = \Pr\big\{\mathcal{N}(0, 1) \le -K \cdot A + \sqrt{A^2 + q(K)}\big\} \qquad (2.3)$$

where $A = \frac{2}{K^2 - 1} \cdot \frac{\sqrt{d}\,\eta}{\sigma}$ and $q(K) = \frac{2 \log K}{K^2 - 1}$, which is a positive constant depending only on 𝐾. As a result, the class "+1" has a larger standard error: Rnat( 𝑓nat, −1) < Rnat( 𝑓nat, +1).

A detailed proof of Theorem 1 can be found in Appendix 1.2. Theorem 1 demonstrates that class "+1" (with larger variance) is harder to classify than class "-1", because an optimal natural classifier presents a larger standard error on class "+1". Note that the class-wise difference is due to the positive term 𝑞(𝐾) in Eq. A.5, which depends on the variance ratio 𝐾. If the two classes' variances are equal, i.e., 𝐾 = 1, the standard errors of the two classes are the same. Next, we show that, in the setting of Eq. 2.2, an optimal robust classifier (adversarial training) gives a model which further benefits the easier class "-1" and hurts the harder class "+1".

2.4.2 Optimal Linear Model to Minimize Robust Error

In this subsection, we demonstrate that an adversarially trained model exacerbates the performance gap between these two classes, by giving a decision boundary which is closer to the samples of the harder class "+1" and farther from the class "-1". Similar to Theorem 1, we calculate the class-wise standard errors of robust classifiers.

Theorem 2 For the data distribution D in Eq. 2.2, the optimal robust linear classifier 𝑓rob which minimizes the average robust error,

$$f_{\text{rob}} = \arg\min_f \; \Pr(\exists \|\delta\| \le \epsilon \text{ s.t. } f(x + \delta) \neq y),$$

has the intra-class standard errors for the two classes:

$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, -1) = \Pr\Big\{\mathcal{N}(0, 1) \le B - K \cdot \sqrt{B^2 + q(K)} - \tfrac{\sqrt{d}}{\sigma}\epsilon\Big\}$$
$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, +1) = \Pr\Big\{\mathcal{N}(0, 1) \le -K \cdot B + \sqrt{B^2 + q(K)} - \tfrac{\sqrt{d}}{K\sigma}\epsilon\Big\} \qquad (2.4)$$

where $B = \frac{2}{K^2 - 1} \cdot \frac{\sqrt{d}(\eta - \epsilon)}{\sigma}$ and $q(K) = \frac{2 \log K}{K^2 - 1}$ is a positive constant depending only on 𝐾.

Note that we limit the perturbation margin 𝜖 to the region 0 < 𝜖 < 𝜂 to guarantee that the robust optimization gives a reasonable classifier in the setting of Eq. 2.2. The detailed proof is similar to that of Theorem 1 and is given in Appendix 1.2. From the results in Theorems 1 and 2, a corollary demonstrates that the robust classifier further hurts the "harder" class's performance.
Corollary 1 Adversarially trained models on D will increase the standard error of class "+1" and reduce the standard error of class "-1":

$$\mathcal{R}_{\text{nat}}(f_{\text{rob}}, -1) < \mathcal{R}_{\text{nat}}(f_{\text{nat}}, -1), \qquad \mathcal{R}_{\text{nat}}(f_{\text{rob}}, +1) > \mathcal{R}_{\text{nat}}(f_{\text{nat}}, +1).$$

Proof 1 A simplified proof sketch for Corollary 1 helps shed light on the reason for this behavior of adversarial training. First, from the definition of the distribution D, it is easy to obtain the intermediate result that the natural / robust classifiers have the same weight vector 𝑤nat = 𝑤rob = 1 (the all-ones vector). The only difference lies in their intercept terms 𝑏nat and 𝑏rob. In Appendix 1.2, we show that:

$$b_{\text{nat}} = \frac{K^2 + 1}{K^2 - 1} d\eta - K \sqrt{\frac{4}{(K^2 - 1)^2} \cdot d^2\eta^2 + d\sigma^2 q(K)} := g(\eta). \qquad (2.5)$$

In particular, for robust classifiers, if we look at the objective of robust classification,

$$\mathcal{R}_{\text{rob}}(f) = \Pr(\exists \|\delta\| \le \epsilon \text{ s.t. } f(x + \delta) \neq y) = \max_{\|\delta\| \le \epsilon} \Pr(f(x + \delta) \neq y) = \tfrac{1}{2}\Pr(f(x + \epsilon) \neq -1 \mid y = -1) + \tfrac{1}{2}\Pr(f(x - \epsilon) \neq +1 \mid y = +1),$$

the robust classifier 𝑓rob directly minimizes the standard error of the samples (𝑥 − 𝜖) for 𝑥 in class "+1", and it minimizes the error of the samples (𝑥 + 𝜖) for 𝑥 in class "-1". As a consequence, the robust model 𝑓rob minimizes the errors of samples whose centers are ±𝜃′ = ±(𝜂 − 𝜖, ..., 𝜂 − 𝜖), which are both 𝜖-distance closer to the origin compared to ±𝜃. Thus, we can obtain the intercept term of the robust model by simply replacing 𝜂 in 𝑏nat with (𝜂 − 𝜖): 𝑏rob = 𝑔(𝜂 − 𝜖). In Appendix 1.2, we show that 𝑔 is a monotonically increasing function on the interval from 0 to 𝜂; thus we have the relation 0 < 𝑏rob < 𝑏nat. This suggests that a robust classifier will predict more samples in D to be in the negative class "-1", and hence reduce the classification error of class "-1" but increase the error of class "+1".

Figure 2.2: Logistic Regression and Multi-layer Perceptron classifiers (natural and robust) on simulated binary data from Eq. 2.2. Panels: (a) Logistic Regression; (b) One-Layer Perceptron.

In Figure 2.2, we give an illustration of the theory in a 2-dimensional space to show the effect of adversarial training. In particular, we sample Gaussian data for class "+1" (blue), centered at 𝜃 = (2, 2) with variance 𝜎₊₁² = 2, and for class "-1" (yellow), centered at −𝜃 = (−2, −2) with variance 𝜎₋₁² = 1. We apply a logistic regression classifier with natural training (green line) and adversarial training (with perturbation bound 𝜖 = 0.5, red line). From Figure 2.2, we obtain results consistent with our theory: (1) more samples in class "+1" than in class "-1" are wrongly predicted by the natural classifier; (2) adversarial training further exacerbates the disparity by moving the decision boundary closer to the harder class "+1". In Figure 2.2 (right), we also show the results for a non-linear MLP classifier, where we make similar observations. These theoretical understandings of adversarial training suggest that adversarial training naturally brings unequal effects to different classes in the data distribution and consequently causes severe fairness issues.

2.5 Fair Robust Learning (FRL)

The observations from both the preliminary experiments and the theoretical understandings suggest that the fairness issue in adversarial training can be a general and serious concern. In this section, we first discuss the fairness requirements that an optimal robust model should satisfy, and then introduce algorithms to achieve these objectives.
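As a companion to the Gaussian example above (the setting of Eq. 2.2 and the illustration in Figure 2.2), the following is a minimal simulation sketch that fits a natural and an 𝑙∞-adversarially trained logistic regression on the two-dimensional mixture. The distribution parameters follow the values stated above (𝜃 = (2, 2), variances 2 and 1, 𝜖 = 0.5); the optimizer settings and sample size are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n = 2000
# Sample the 2-D mixture of Eq. (2.2): class +1 ~ N(theta, 2*I), class -1 ~ N(-theta, 1*I).
theta = torch.tensor([2.0, 2.0])
x_pos = theta + (2.0 ** 0.5) * torch.randn(n, 2)
x_neg = -theta + torch.randn(n, 2)
x = torch.cat([x_pos, x_neg])
y = torch.cat([torch.ones(n), -torch.ones(n)])

def train_linear(x, y, eps=0.0, steps=2000, lr=0.1):
    """Logistic regression; eps > 0 performs l_inf adversarial training.

    For a linear score y * (w.x + b), the worst-case l_inf perturbation is exactly
    x_adv = x - eps * y * sign(w), so the inner maximization is solved in closed form.
    """
    w = torch.zeros(2, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(steps):
        x_adv = x - eps * y[:, None] * torch.sign(w.detach())[None, :]
        margin = y * (x_adv @ w + b)
        loss = torch.nn.functional.softplus(-margin).mean()   # logistic loss
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach(), b.detach()

w_nat, b_nat = train_linear(x, y, eps=0.0)   # natural training
w_rob, b_rob = train_linear(x, y, eps=0.5)   # adversarial training
print("natural boundary:", w_nat.tolist(), b_nat.item())
print("robust  boundary:", w_rob.tolist(), b_rob.item())
```

In line with Corollary 1, the learned robust intercept should come out smaller than the natural one, i.e., the decision boundary shifts toward the harder class "+1".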
2.5.1 Objective of Robust Fairness

In this work, we desire an optimal robust model that achieves parity of both standard prediction accuracy and adversarial robustness across all classes 𝑦 ∈ 𝑌:

• Equalized Accuracy: the classifier 𝑓 's standard error is statistically independent of the ground-truth label 𝑦: Pr( 𝑓 (𝑥) ≠ 𝑦 | 𝑌 = 𝑦) ≈ Pr( 𝑓 (𝑥) ≠ 𝑦) for all 𝑦 ∈ 𝑌.

• Equalized Robustness: the classifier 𝑓 's robust error is statistically independent of the ground-truth label 𝑦: Pr(∃ ||𝛿|| ≤ 𝜖, 𝑓 (𝑥 + 𝛿) ≠ 𝑦 | 𝑌 = 𝑦) ≈ Pr(∃ ||𝛿|| ≤ 𝜖, 𝑓 (𝑥 + 𝛿) ≠ 𝑦) for all 𝑦 ∈ 𝑌.

The notion of "Equalized Accuracy" is well studied in traditional fairness research [10, 128], which desires the model to provide equal prediction quality for different groups of people. "Equalized Robustness" is our newly desired "fairness" property for robustly trained models: for every class, the model should provide equal robustness and safety against adversarial attacks. Therefore, the robust model will have high overall safety but no obvious "weakness".

Faced with the fairness objectives mentioned above, we propose a Fair Robust Learning (FRL) strategy to train robust models that have equalized accuracy and robustness performance for each class. Formally, we aim to train a classifier 𝑓 with minimal overall robust error Rrob( 𝑓 ), while constraining 𝑓 to satisfy a series of fairness constraints:

$$\min_f \; \mathcal{R}_{\text{rob}}(f) \quad \text{s.t.} \quad \begin{cases} \mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) \le \tau_1 \\ \mathcal{R}_{\text{rob}}(f, i) - \mathcal{R}_{\text{rob}}(f) \le \tau_2 \end{cases} \quad \text{for each } i \in Y \qquad (2.6)$$

where 𝜏1 and 𝜏2 are small, positive, predefined parameters. The constraints in Eq. 2.6 require that the model's error for each class 𝑖 ∈ 𝑌 (both the standard error Rnat( 𝑓 , 𝑖) and the robust error Rrob( 𝑓 , 𝑖)) does not exceed the average level (Rnat( 𝑓 ) and Rrob( 𝑓 )) by a large margin. Thus, there will be no obvious worst group in the whole dataset. One thing to note is that, in Eq. 2.6, the robust error is always strongly related to the standard error [135, 105] (see Eq. 2.7). Thus, during the debiasing process, we could exert an entangled influence on class 𝑖's standard and robust errors. For example, if we apply an importance weighting method to upweight the cost of Rrob( 𝑓 , 𝑖), we also implicitly upweight the cost of Rnat( 𝑓 , 𝑖). Therefore, inspired by [135], we propose to separate the robust error into the sum of a standard error and a boundary error:

$$\begin{aligned} \mathcal{R}_{\text{rob}}(f, i) &= \Pr\{\exists \delta \text{ s.t. } f(x + \delta) \neq y \mid y = i\} \\ &= \Pr\{f(x) \neq y \mid y = i\} + \Pr\{\exists \delta,\; f(x + \delta) \cdot f(x) \le 0 \mid y = i\} \\ &= \mathcal{R}_{\text{nat}}(f, i) + \mathcal{R}_{\text{bndy}}(f, i) \end{aligned} \qquad (2.7)$$

where Rbndy( 𝑓 , 𝑖) = Pr{∃𝛿, 𝑓 (𝑥 + 𝛿) · 𝑓 (𝑥) ≤ 0 | 𝑦 = 𝑖} represents the probability that a sample from class 𝑖 lies close to the decision boundary and can be attacked. By separating the standard error and boundary error during adversarial training, we are able to independently address the unfairness of both the standard error and the boundary error. Formally, we have the training objective:

$$\min_f \; \mathcal{R}_{\text{nat}}(f) + \mathcal{R}_{\text{bndy}}(f) \quad \text{s.t.} \quad \begin{cases} \mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) \le \tau_1 \\ \mathcal{R}_{\text{bndy}}(f, i) - \mathcal{R}_{\text{bndy}}(f) \le \tau_2 \end{cases} \quad \text{for each } i \in Y \qquad (2.8)$$

When optimizing the boundary errors during training, we borrow the idea from [135], which minimizes the KL-divergence between the output logits of clean samples and those of their adversarial counterparts. In the following subsections, we explore effective methods to solve the problem in Eq. 2.8.

2.5.2 Reweight for Robust Fairness
In order to solve the fair robust training problem in Eq. 2.8, we first follow the main pipeline of traditional machine learning debiasing works such as [1, 128], which reduce the problem in Eq. 2.6 to a series of cost-sensitive classification problems and continuously penalize the terms that violate the fairness constraints. We begin by introducing non-negative Lagrange multipliers 𝜙 = (𝜙ⁱ_nat, 𝜙ⁱ_bndy) for each constraint in Eq. 2.6 and form the Lagrangian:

$$L(f, \phi) = \mathcal{R}_{\text{nat}}(f) + \mathcal{R}_{\text{bndy}}(f) + \sum_{i=1}^{|Y|} \phi^{i}_{\text{nat}} \big(\mathcal{R}_{\text{nat}}(f, i) - \mathcal{R}_{\text{nat}}(f) - \tau_1\big) + \sum_{i=1}^{|Y|} \phi^{i}_{\text{bndy}} \big(\mathcal{R}_{\text{bndy}}(f, i) - \mathcal{R}_{\text{bndy}}(f) - \tau_2\big) \qquad (2.9)$$

Thus, the problem in Eq. 2.8 is equivalent to solving the max-min game between the two players 𝑓 and 𝜙:

$$\max_{\phi_{\text{nat}}, \phi_{\text{bndy}} \ge 0} \; \min_f \; L(f, \phi). \qquad (2.10)$$

We present the details of solving Eq. 2.10 in Algorithm 1. During the training process, we start from a pre-trained robust model and test its class-wise standard / boundary errors on a separate validation set (steps 5 and 6). We check whether the current model 𝑓 violates some constraints in Eq. 2.8. For example, if there exists a class 𝑖 whose standard error is higher than the average by a large margin, i.e., Rnat( 𝑓 , 𝑖) − Rnat( 𝑓 ) − 𝜏1 > 0, we adjust its multiplier 𝜙ⁱ_nat according to the extent of the violation (step 7). As a result, we increase the training weight of the standard error Rnat( 𝑓 , 𝑖) of class 𝑖. We also adjust the multipliers for the boundary errors at the same time (step 8). Next, fixing the multipliers 𝜙, the algorithm solves the inner minimization to optimize the model 𝑓 (step 9). By repeating steps 4-9, the model 𝑓 and the Lagrange multipliers 𝜙 are alternately updated toward the equilibrium, until we finally reach an optimal model that satisfies the fairness constraints. Note that we denote the framework with the proposed Reweight strategy as FRL (Reweight).

2.5.3 ReMargin for Robust Fairness

Upweighting the cost of Rnat( 𝑓 , 𝑖) has the potential to penalize a large Rnat( 𝑓 , 𝑖) and improve the standard performance of the worst groups. However, only upweighting the cost of one class's boundary error does not succeed in decreasing that class's boundary error. In Section 2.6.3, we show this fact for PGD adversarial training on CIFAR10, where we find that even giving a large weight ratio to a class's boundary error is not sufficient to reduce the boundary error of this class. Thus, we cannot mitigate the boundary error disparity and the robust error disparity. To solve this problem, we propose an alternative strategy that enlarges the perturbation margin 𝜖. It is evident from existing works [104, 31] that increasing the margin 𝜖 during adversarial training can effectively improve the model's robustness against attacks of the current intensity 𝜖. Therefore, enlarging the adversarial margin 𝜖 when generating adversarial examples during training, specifically for class 𝑖, has the potential to improve this class's robustness and reduce its large boundary error Rbndy( 𝑓 , 𝑖). In this work, we define this strategy as FRL (Remargin). FRL (Remargin) resembles the procedure in Algorithm 1, except for step 7, where we instead update the adversarial margin for the boundary errors. Specifically, we change the adversarial margin 𝜖𝑖 of class 𝑖 as follows:

$$\epsilon_i = \epsilon_i \cdot \exp\big(\alpha_2 \cdot (\mathcal{R}_{\text{bndy}}(f, i) - \tau_2)\big) \qquad (2.11)$$

Besides FRL (Reweight) and FRL (Remargin), we can combine Reweight and Remargin, where we jointly update the weights of the boundary errors and enlarge the margins.

Algorithm 1 The Fair Robust Learning (FRL) Algorithm.
1: Input: fairness constraints specified by 𝜏1 > 0 and 𝜏2 > 0, test-time attacking radius 𝜖, and hyperparameter update rates 𝛼1, 𝛼2
2: Output: a fairly robust neural network 𝑓
3: Initialize the network with a pre-trained robust model; set 𝜙ⁱ_nat = 0, 𝜙ⁱ_bndy = 0 and 𝜙 = (𝜙_nat, 𝜙_bndy)
4: repeat
5:   Rnat( 𝑓 ), Rnat( 𝑓 , 𝑖) = EVAL( 𝑓 )
6:   Rbndy( 𝑓 ), Rbndy( 𝑓 , 𝑖) = EVAL( 𝑓 , 𝜖)
7:   𝜙ⁱ_nat = 𝜙ⁱ_nat + 𝛼1 · (Rnat( 𝑓 , 𝑖) − Rnat( 𝑓 ) − 𝜏1)
8:   𝜙ⁱ_bndy = 𝜙ⁱ_bndy + 𝛼2 · (Rbndy( 𝑓 , 𝑖) − Rbndy( 𝑓 ) − 𝜏2)
9:   𝑓 ← TRAIN( 𝑓 , 𝜙, 𝜖)
10: until the model 𝑓 satisfies all constraints

2.6 Experiment

In this section, we present experimental results to validate the effectiveness of the proposed framework (FRL) in building fairly robust DNN models. We implement and compare our three proposed strategies (i.e., Reweight, Remargin and Reweight+Remargin) on real-world data and discuss their effectiveness.

2.6.1 Experimental Settings

We conduct our experiments on benchmark adversarial learning datasets, including CIFAR10 [56] and the SVHN dataset [80]. For both datasets, we study the algorithms under the model architectures PreAct-ResNet18 and WRN28. We only present the results of PreAct-ResNet18 in this section and leave the results of WRN28 to Appendix 1.2.2. As baselines, we present the original performance of two popular adversarial training algorithms [71, 135], and a debiasing method inherited from [1]. The latter is a traditional debiasing technique, which we directly apply to upweight the cost of the class with the highest robust error in the training data. Other existing debiasing methods, such as [128, 132], are not included in our experiments, because they are not proposed for deep learning models and share the idea of [1] to reweight the costs during training. In our implementation of the FRL methods, during the training process we split the training sets to obtain validation sets with 300 samples per class, which help us adjust the hyperparameters. For each variant of our proposed FRL method, we predefine the fairness constraints such that both 𝜏1 and 𝜏2 are no larger than 5% or 7%. In training, we start FRL from a pre-trained robust model (such as a PGD-adversarially trained model), and run FRL with a model parameter learning rate of 1e-3 and a hyperparameter learning rate of 𝛼1 = 𝛼2 = 0.05 for the first 40 epochs. Then we decay the model parameter learning rate and the hyperparameter learning rate by 0.1 every 40 epochs. During the evaluation phase, we report each trained model's average standard error rate, boundary error rate and robust error rate, as well as the worst intra-class error rates. Note that the boundary and robust errors are calculated with the PGD attack under the 𝑙∞-norm bound 8/255.

2.6.2 Fairness Performance

Table 2.3 shows each algorithm's performance on the CIFAR10 dataset, including each variant of FRL under the fairness constraints 𝜏1 = 𝜏2 = 5% and 𝜏1 = 𝜏2 = 7%. From the experimental results, we can see that all FRL algorithms reduce the worst-class standard errors and robust errors to different degrees.

Table 2.3: Average & worst-class standard error, boundary error and robust error for various algorithms on CIFAR10.

                                Avg. Std.  Worst Std.  Avg. Bndy.  Worst Bndy.  Avg. Rob.  Worst Rob.
PGD Adv. Training               15.5       33.8        40.9        55.9         56.4       82.7
TRADES (1/𝜆 = 1)                14.6       31.2        43.1        64.6         57.7       84.7
TRADES (1/𝜆 = 6)                19.6       39.1        29.9        49.5         49.3       77.6
Baseline Reweight               19.2       28.3        39.2        53.7         58.2       80.1
FRL (Reweight, 0.05)            16.0       22.5        41.6        54.2         57.6       73.3
FRL (Remargin, 0.05)            16.9       24.9        35.0        50.6         51.9       75.5
FRL (Reweight+Remargin, 0.05)   17.0       26.8        35.7        44.5         52.7       69.5
FRL (Reweight, 0.07)            16.1       23.8        38.4        55.2         54.0       75.2
FRL (Remargin, 0.07)            16.9       26.0        37.4        51.6         53.5       75.1
FRL (Reweight+Remargin, 0.07)   17.1       26.7        36.7        48.3         53.8       70.2

Table 2.4: Average & worst-class standard error, boundary error and robust error for various algorithms on SVHN.

                                Avg. Std.  Worst Std.  Avg. Bndy.  Worst Bndy.  Avg. Rob.  Worst Rob.
PGD Adv. Training               9.4        19.8        37.0        53.9         46.4       73.7
TRADES (1/𝜆 = 1)                9.9        18.6        39.1        60.6         48.0       78.3
TRADES (1/𝜆 = 6)                10.5       23.4        32.5        52.5         43.1       76.6
Baseline Reweight               8.8        17.4        39.3        54.7         48.2       72.1
FRL (Reweight, 0.05)            7.9        13.3        38.2        56.4         46.1       69.7
FRL (Remargin, 0.05)            9.2        17.4        39.7        49.6         48.9       67.0
FRL (Reweight+Remargin, 0.05)   7.7        12.8        36.2        51.2         43.9       64.0
FRL (Reweight, 0.07)            8.0        13.6        37.2        54.2         45.0       67.8
FRL (Remargin, 0.07)            8.5        14.2        36.9        50.6         45.5       64.8
FRL (Reweight+Remargin, 0.07)   8.3        15.4        36.7        51.4         45.0       64.9

FRL (Reweight) performs best at achieving the minimal worst-class standard error. Compared to vanilla methods, it gives around a 10% reduction in the worst class's standard error. However, it cannot equalize the boundary errors adequately. Alternatively, FRL (Remargin) is more effective than FRL (Reweight) at decreasing the worst-class boundary error. Furthermore, their combination, FRL (Reweight + Remargin), is the most effective at reducing the worst-class boundary and worst-class robust errors. It manages to train a robust classifier with a worst-class robust error of around 70%, which is 10 ∼ 15% lower than that of vanilla adversarial training methods. The baseline method (Baseline Reweight) [1] can only help decrease the worst-class standard error, but cannot equalize the boundary errors or robust errors. These results suggest that the FRL method can mitigate the fairness issues by improving the model's standard and robustness performance on the worst classes.

Notably, from the results in Table 2.3, we find that the methods which use the "Remargin" strategy usually have a 1 ∼ 2% larger average standard error compared to PGD adversarial training or TRADES (1/𝜆 = 1). This is because "Remargin" increases the perturbation margin 𝜖 for some classes during training, and according to existing works [104, 98], using a large perturbation margin might degrade the model's standard generalization. Thus, in this work we limit every training sample's perturbation margin to not exceed 16/255, to guarantee that the model gives an acceptable average standard performance. On the other hand, because the "Remargin" strategies increase the perturbation margin, their average boundary errors are smaller than those of the vanilla methods. As a result, their average robust errors are comparable with those of the vanilla methods, or even show a 2 ∼ 3% improvement.

In Table 2.4, we present the experimental results on the SVHN dataset. We make similar observations to those on CIFAR10: FRL (Reweight) achieves the minimal worst-class standard error, and FRL (Reweight+Remargin) gives the minimal worst-class robust error. One intriguing fact is that the FRL methods, including those which use "Remargin", do not cause an increase in the average standard error. Instead, each FRL method results in a 1 ∼ 2% decrease of the average standard error.
This might be because these methods improve the worst-class standard error by a large margin.

2.6.3 Ablation Study

From the results in Table 2.3, we find that FRL (Reweight) is not effective in improving the worst-class boundary error. As a result, it cannot sufficiently equalize the robustness performance between classes. In this subsection, we study the potential reasons behind this. To take a closer look at this issue, we implement adversarial training on the CIFAR10 dataset in two groups of experiments, following the basic settings described in Section 2.6.1. In each group, we only upweight a chosen class's boundary error by different ratios (from 1.0 to 4.5 times that of the other classes) and keep the other classes' weights fixed. Then, for each model, we present its standard / robustness performance on this class in Figure 2.3 (left). As a comparison, we also show the cases where we instead increase the perturbation margin (1.0 to 2.5 times the original margin) (Figure 2.3 (right)).

Figure 2.3: The effect of upweighting on boundary error (left) and standard error (right) for the class "deer". Panels: (a) Upweight Boundary Error; (b) Increase Perturbation Margin.

From Figure 2.3, we find that when we increase the weight only for the class "deer", the boundary error of this class decreases, but its standard error also increases. Thus, reweighting only trades off a class's boundary error against its standard error; it cannot improve the robustness of this class relative to other classes to resolve the fairness relationship between classes. In contrast, increasing the margin does not increase the standard error, while it effectively decreases this class's boundary error and robust error. Therefore, in the results of Tables 2.3 and 2.4, we observe that "Remargin"-based methods can successfully improve the worst-class robustness performance.

2.7 Conclusion

In this work, we first empirically and theoretically uncover one property of adversarial training algorithms: they can cause a serious disparity in both standard accuracy and adversarial robustness between different classes of the data. As a first attempt to mitigate the fairness issues arising from adversarial training, we propose the Fair Robust Learning (FRL) framework. We validate the effectiveness of FRL on benchmark datasets. In the future, we want to examine whether the fairness issue can be observed in other types of defense methods.

CHAPTER 3
ENHANCING ROBUSTNESS MAY AMPLIFY INNER-CLASS UNFAIRNESS¹

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Wentao Wang, Zitao Liu, Anil K. Jain, and Jiliang Tang, titled "How does the Memorization of Neural Networks Impact Adversarial Robust Models?", published at the Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining [124].

3.1 Introduction

Recent studies [39, 40, 4, 19, 77] uncover one key reason for the extraordinary performance of deep neural networks (DNNs), a.k.a. memorization (or benign overfitting). In detail, the works [39, 40] empirically show that popular image datasets, such as CIFAR10, CIFAR100 and ImageNet, always have very diverse data distributions and, in particular, contain a large fraction of "atypical" samples (see Fig. 3.1). These atypical samples are visually and statistically very different from the other (typical) samples in their class. For example, the atypical samples of "dog" in CIFAR10 can have very diverse appearances, species, camera angles and backgrounds, which make them visually distinct from typical "dogs". Besides, although the total amount of atypical samples is large, each atypical sample only appears in a low-frequency region of the whole data distribution. As a result, DNNs can only fit these atypical samples by "memorizing" their labels.
Surprisingly, the studies [39, 40] find that memorizing / fitting these atypical samples does not hurt the model performance on typical samples, but can boost DNNs' accuracy by correctly classifying the atypical samples in the test set. Thus, the capacity to memorize atypical samples is one key factor for DNNs to achieve better performance than shallow machine learning models. Similar to DNNs trained via empirical risk minimization (ERM), adversarial training methods [71, 58] are also devised to fit the whole training dataset. Specifically, adversarial training minimizes the model's error against adversarial perturbations [41, 102] by fitting the model on manually generated adversarial examples of all training data. Although adversarial training can fit and memorize all training data as well as their adversarially perturbed counterparts, the resulting models always suffer from both poor clean accuracy and poor adversarial accuracy (or robustness) on the test set [105, 94]. These facts suggest that memorization in adversarial training may behave very differently from memorization in traditional ERM. Therefore, it is natural to ask: What is the impact of memorization in adversarial training? Can memorizing the atypical samples and their adversarial counterparts benefit the model's accuracy and robustness?

To answer this question, we first conduct preliminary studies to explore whether memorizing atypical samples in adversarial training can benefit DNNs' test performance, especially on test atypical samples. We implement PGD adversarial training [71] on CIFAR100 under WideResNet28 [45] and fine-tune the models until they achieve the optimal training performance. In Fig. 3.2, we observe that memorization in adversarial training only benefits the clean accuracy on test atypical samples, not on their adversarial counterparts. In detail, when the DNNs fit / memorize the atypical samples, they can achieve clean accuracy close to ∼40% on the test atypical set. However, the adversarial accuracy on the test atypical set remains low (∼10%) during the whole training process, even though the models can fit almost all atypical (adversarial) training samples. Recalling that atypical samples occupy at least ∼40% of the whole CIFAR100 dataset, this fact suggests that failing on atypical adversarial examples at test time could be one of the key reasons for the poor robustness generalization of DNNs.

Next, we continue to study whether memorizing atypical samples in adversarial training can impact the model's performance on typical samples. We again implement PGD adversarial training [71] on CIFAR100 for multiple trials, trained with different amounts of atypical samples. From the results in Fig. 3.3, a model adversarially trained on the training set without any atypical samples has 95% clean accuracy and 55% adversarial accuracy on the test typical set, while the model trained with 100% of the atypical samples only has 90.2% / 50.4% clean / adversarial accuracy, respectively. This result shows that, in adversarial training, fitting atypical samples can even hurt DNNs' performance on "typical" samples.
In other words, atypical samples act more like "poisoning data" [6] that deteriorate model performance on typical samples. Notably, in Fig. 3.3, we also demonstrate that this poisoning effect is absent in traditional ERM: fitting atypical samples does not reduce the model accuracy in ERM. This fact highlights the uniqueness of memorization in adversarial training methods. In conclusion, these two facts about memorization uncover key differences between traditional ERM and adversarial training, and shed light on the reasons for the poor generalization of adversarial training.

Figure 3.1: Typical & atypical samples in CIFAR10 & CIFAR100, with different "memorization values". The higher the memorization value (= 1.0), the more likely the sample is atypical; the lower the memorization value (= 0.0), the more likely the sample is typical.

Motivated by our findings, we propose a novel algorithm called Benign Adversarial Training (BAT), which can eliminate the negative influence of memorizing the "harmful" atypical samples, while preserving the model's ability to memorize the "benign / useful" atypical samples. It is worth mentioning that, by fitting the "benign" atypical samples, the BAT method can achieve good clean accuracy on the atypical set; by eliminating the "harmful" atypical samples, the BAT method can improve the clean & adversarial accuracy on the typical set. In our experiments, we verify that BAT can benefit adversarial training by achieving a better accuracy vs. robustness trade-off.

3.2 Related Works

Memorization and atypical samples. The memorization effect of overparameterized DNNs has been extensively studied both empirically [133, 78] and theoretically [5]. From the traditional ML point of view, memorization can be harmful to model generalization, because it makes models easily fit outliers and noisy labels. However, recent studies point out the concept of "benign overfitting" [4, 39, 40], which suggests that the memorization effect is necessary for DNNs to achieve extraordinary performance on modern machine learning tasks. In particular, the recent work [40] empirically identifies the atypical / rare samples in benchmark datasets and shows the contribution of memorizing atypical samples to the DNN's performance. Besides the work [40], there are also other strategies [14] to find atypical samples in a training dataset.

Adversarial robustness. Adversarial training methods [71, 135] are considered one of the most reliable and effective methods to protect DNN models against adversarial attacks [41, 126]. However, adversarial training has several intrinsic properties that differ from traditional ERM and require deeper understanding. For example, adversarially trained models always suffer from poor robustness generalization [94, 89], and they always present a strong trade-off between clean accuracy and robustness [105, 135]. Our work aims to study the robust overfitting issue from the data perspective and demonstrate its critical connection to the memorization of DNNs.

Comparison to Existing Works. Notably, our work is not the first effort to study the influence of memorization on DNNs' adversarial robustness. A previous study [93] illustrates that memorizing mis-labeled samples might be one reason for DNNs' adversarial vulnerability. Similarly, the work [32] claims that fitting some "problematic" data can degrade the performance of adversarial training.
A concurrent work [33] finds that fitting some "hard" samples is also likely to hurt an adversarially trained model's performance. However, these works do not provide efficient strategies to trace the problematic / hard samples, which limits their empirical support. Our study, which focuses on the "atypical" samples, directly compares adversarial training and traditional ERM on a fixed set of samples (the atypical samples), which highlights the great importance of atypical samples to the problem of robust generalization. Moreover, we propose a novel defense method, BAT, to further improve the performance of adversarial training.

3.3 Definition and Notation

3.3.1 Atypical Samples and Memorization

In this section, we provide a formal definition of atypical samples, following the work [39]: atypical samples are the samples for which "a learning algorithm A cannot predict its label without observing it in the training dataset. Fitting it will require memorizing its label." To identify such atypical samples in practice, one representative strategy [40] proposes to examine which training samples can only be fitted by memorization, and measures each training sample's "memorization value". Formally, for a training algorithm A (i.e., ERM), the memorization value mem(A, D, 𝑥_𝑖) of a training sample (𝑥_𝑖, 𝑦_𝑖) ∈ D in training set D is defined as:

mem(A, D, 𝑥_𝑖) = Pr_{𝐹←A(D)}[𝐹(𝑥_𝑖) = 𝑦_𝑖] − Pr_{𝐹←A(D∖𝑥_𝑖)}[𝐹(𝑥_𝑖) = 𝑦_𝑖],   (3.1)

which is the difference between the probability that a model 𝐹 correctly predicts 𝑥_𝑖 when 𝑥_𝑖 is included in the training set D and when it is removed. In practice, [40] trains a DNN model using ERM for 1,000 trials, with each trial preserving 70% of the samples of the whole training dataset. Based on this metric, one finds that all common datasets have a large fraction of atypical samples. For example, CIFAR10, CIFAR100 and Tiny ImageNet have more than 11%, 40% and 49% of samples with a large memorization value (> 0.15), respectively. This strategy can also facilitate finding atypical samples in the test set, which are the samples that are strongly influenced by atypical training samples. In detail, for an atypical training sample (𝑥_𝑖, 𝑦_𝑖), we calculate the "influence value" on each test sample (𝑥′_𝑗, 𝑦′_𝑗) ∈ D′ in test set D′:

infl(A, D, 𝑥_𝑖, 𝑥′_𝑗) = Pr_{𝐹←A(D)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗] − Pr_{𝐹←A(D∖𝑥_𝑖)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗].   (3.2)

If the sample pair (𝑥_𝑖, 𝑥′_𝑗) has a high influence value, removing the atypical sample 𝑥_𝑖 from the training set will drastically decrease the model's accuracy on 𝑥′_𝑗. Remark: In this paper, we also use ERM as A to identify atypical samples, because ERM is computationally efficient and more accurate than other algorithms (such as adversarial training). In the appendix, we provide more discussion, including: (a) the frequency of atypical samples in common datasets and (b) alternative (more efficient) metrics to identify atypical samples.

3.3.2 Adversarial Training

Similar to classification models trained via empirical risk minimization (ERM), adversarial training methods [71, 58] are also devised to fit the whole dataset by training the model on manually generated adversarial examples. Formally, they minimize the adversarial loss:

min_𝐹 E_{(𝑥,𝑦)} [ max_{||𝛿||≤𝜖} L(𝐹(𝑥 + 𝛿), 𝑦) ],   (3.3)

which is the model 𝐹's average loss on data (𝑥, 𝑦) perturbed by adversarial noise 𝛿. These adversarial training methods have been shown to be among the most effective approaches to improve model robustness against adversarial attacks.
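For concreteness, the inner maximization in Eq. (3.3) is typically approximated with projected gradient descent (PGD) [71]. The following PyTorch-style sketch shows one adversarial-training step under an 𝑙∞ budget; the attack hyperparameters (step size, number of steps) and the use of eval mode during attack generation are illustrative choices, not the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, step=2/255, n_steps=10):
    """Approximate the inner max of Eq. (3.3) with l_inf-bounded PGD."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(n_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                     # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the l_inf ball
            x_adv = x_adv.clamp(0, 1)                              # stay in valid image range
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization of Eq. (3.3) on one mini-batch."""
    model.eval()                     # one common choice: do not update BN stats while attacking
    x_adv = pgd_perturb(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```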
Note that, similar to traditional ERM, adversarially trained models can also achieve very high training performance. For example, under WideResNet28-10 (WRN28) [45], PGD adversarial training can achieve 100% clean accuracy and 99% adversarial accuracy on the training set. This suggests that these DNN models have sufficient capacity to memorize the labels of the atypical samples and their adversarial counterparts. However, different from ERM, adversarially trained models usually suffer from poor generalization performance during test. For example, the WRN28 model only reaches 59% test clean accuracy and 24% test adversarial accuracy. These facts indicate that the memorization in adversarial training is probably not always beneficial to test performance and requires a deeper understanding.

3.4 Preliminary: Memorization in Adversarial Training

In this section, we attempt to understand the behavior of adversarial training by studying its relation to the memorization effect. The discussions are based on PGD adversarial training [71] on CIFAR100. Implementation details and similar experiments on CIFAR10 and Tiny ImageNet [60] are given in the appendix.

3.4.1 (a) Robustness of Atypical Samples Can Hardly Generalize

In this subsection, we first check whether fitting atypical samples in adversarial training can effectively help the model correctly and robustly predict the atypical samples in the test set. We apply PGD adversarial training [71] on the original CIFAR100 dataset for 200 epochs and evaluate the model's clean accuracy and adversarial accuracy on the training atypical set Datyp = {𝑥_𝑖 ∈ D : mem(𝑥_𝑖) > 0.15} and its corresponding test atypical set D′atyp = {𝑥′_𝑗 ∈ D′ : infl(𝑥_𝑖, 𝑥′_𝑗) > 0.15 for some 𝑥_𝑖 ∈ Datyp}. In Fig. 3.2, we report the algorithm's performance (clean & adversarial accuracy) on these atypical sets along the training process. From the results, we observe that the WRN28 model is capable of memorizing almost all atypical samples and their adversarial counterparts, since it achieves ≈100% clean & adversarial accuracy on the training atypical set. As training proceeds, the model's clean accuracy on the test atypical set gradually improves and finally approaches 41%. However, its adversarial robustness stays around 10% from the first epochs to the last, no matter how high the training performance is. These results suggest that memorizing atypical samples in adversarial training may only improve their test clean accuracy, but hardly helps their adversarial robustness.

Recall that in CIFAR100, the atypical set Datyp (with memorization value > 0.15) covers 40% of the samples of the whole dataset. Completely failing on the adversarial robustness of atypical samples could thus be one key reason for the poor robustness generalization of DNNs [89]. As a previous theoretical study [94] states, for a model to achieve good robustness generalization, it demands a training set with many more samples than a model that only needs good clean-accuracy generalization. In our case, the sub-population of each particular atypical sample appears with very low frequency in the training set and deviates from the main sub-populations. Within such a sub-population, learning is equivalent to a classification task on an extremely small dataset with only one or a few training samples. Thus, the robustness of atypical samples can be extremely hard to generalize.

Figure 3.2: Clean Acc. (left) & Adv. Acc. (right) on the Atypical Set of CIFAR100 under WRN28.
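The atypical sets used above are obtained by thresholding the memorization and influence values of Section 3.3.1. As a concrete illustration, the sketch below shows one way those Monte-Carlo estimates of Eqs. (3.1) and (3.2) could be computed from the leave-out retraining trials described in [40]; the array names and the placement of the 0.15 threshold are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def estimate_mem_and_infl(in_train, train_correct, test_correct):
    """Monte-Carlo estimates of memorization (Eq. 3.1) and influence (Eq. 3.2) values.

    in_train:      [T, n] bool, whether train sample i was in trial t's ~70% training subset
    train_correct: [T, n] bool, whether trial t's model labeled train sample i correctly
    test_correct:  [T, m] bool, whether trial t's model labeled test sample j correctly
    """
    in_f = in_train.astype(float)
    out_f = 1.0 - in_f
    train_f = train_correct.astype(float)
    test_f = test_correct.astype(float)
    # mem(x_i) ~= Pr[F(x_i)=y_i | x_i in training set] - Pr[F(x_i)=y_i | x_i held out]
    mem = (train_f * in_f).sum(0) / in_f.sum(0) - (train_f * out_f).sum(0) / out_f.sum(0)
    # infl(x_i, x'_j) ~= Pr[x'_j correct | x_i in] - Pr[x'_j correct | x_i out]
    p_in = (in_f.T @ test_f) / in_f.sum(0)[:, None]       # shape [n, m]
    p_out = (out_f.T @ test_f) / out_f.sum(0)[:, None]
    return mem, p_in - p_out

# Thresholding at 0.15 then yields the atypical sets used in this subsection:
#   D_atyp  = {i : mem[i] > 0.15}
#   D'_atyp = {j : infl[i, j] > 0.15 for some i in D_atyp}
```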
3.4.2 (b) Memorizing Atypical Samples Hurts Typical Samples' Performance

In this subsection, we find that fitting atypical samples can even bring negative effects on "typical" samples. Here, we define the "typical" samples as the subset of the training set D with low memorization values: Dtyp = {𝑥_𝑖 ∈ D : mem(𝑥_𝑖) < 0.02}. For the test typical set D′typ, we exclude all test samples that have a high influence value from any atypical training sample, and also exclude the samples that have a low success rate of being predicted by the ERM algorithm A (the samples which cannot be learned from D):

D′typ = D′ ∖ ( {𝑥′_𝑗 : infl(𝑥_𝑖, 𝑥′_𝑗) > 0.02 for some 𝑥_𝑖 ∈ Datyp} ∪ {𝑥′_𝑗 : Pr_{𝐹←A(D)}[𝐹(𝑥′_𝑗) = 𝑦′_𝑗] < 0.8} ).

We conduct PGD adversarial training [71] for several trials on resampled CIFAR100 datasets: each dataset is constructed with the whole training typical set Dtyp and a part of the training atypical set Datyp (randomly sampling 0%, 20% and 100% of Datyp). In Fig. 3.3, we report the adversarially trained models' clean and adversarial accuracy on the test typical set D′typ and check the impact of atypical samples on the typical samples. From the results, we find that the existence of atypical samples has a significant influence on the typical samples. For example, an adversarially trained model without atypical samples has 94.7% clean accuracy and 55.0% adversarial accuracy on the test typical samples (at the last epochs), while the model trained with all atypical samples only has 90.2% and 50.4% clean & adv. accuracy, respectively. These results suggest: the more atypical samples exist in the training set, the poorer the performance the model will have on D′typ. In other words, these atypical samples act like "poisoning" samples [6, 126] which can deteriorate the model's performance. It is also worth mentioning that this negative phenomenon does not appear in the traditional ERM algorithm. In Fig. 3.3c, we can see that the models have similar accuracy on the typical set when different numbers of atypical samples appear in the training set.

Figure 3.3: Adversarial Training vs. Traditional ERM on the Typical Set of CIFAR100 under WRN28: (a) Adv. Train (Clean Acc.); (b) Adv. Train (Adv. Acc.); (c) Traditional ERM (Clean Acc.).

Harmful Atypical Samples. A natural question is what kinds of atypical samples are likely to hurt model robustness, and why. Different from previous literature [61], which argues that mislabeled samples are very likely to degrade model performance, CIFAR100 is a clean dataset with no or very few wrong labels. Thus, we hypothesize that the "harmful" atypical samples are those which share similar features with a "wrong" class. Since atypical samples are always distinct from the main data distribution of their labeled class, it is likely that they are close to the distribution of a "wrong" class. For example, in Fig. 3.4, we identify an atypical "plate" in CIFAR100 (via the method in the next section) which has features, such as its shape, resembling images of "apple". If the model memorizes this atypical "plate" and predicts any samples with similar features to be "plate", the model cannot distinguish "apple" and "plate". Remarkably, these "harmful" atypical samples are still correctly labeled; they only share some similar features with other classes.

Figure 3.4: "Harmful" atypical samples: in the CIFAR100 dataset, there are atypical "plates" that look like "apples", and atypical "lamps" that look like "cups".

Why does this negative effect only happen during adversarial training, instead of traditional ERM?
To deepen our understanding of this question, we theoretically analyze the consequence of fitting "harmful" atypical samples for both algorithms. In particular, we formally show that the negative effect during adversarial training can be significantly stronger than in traditional ERM. Our analysis is based on a binary classification task with a two-dimensional Gaussian mixture distribution:

𝑦 ∼ u.a.r. {−1, +1},   𝜃 = [𝜂, 0], 𝜂 ∈ R+,   𝑥 ∼ N(𝜃, 𝜎²𝐼) if 𝑦 = +1;   𝑥 ∼ N(−𝜃, 𝜎²𝐼) if 𝑦 = −1,   (3.4)

and we consider linear models 𝐹 = sign(𝑤ᵀ𝑥 + 𝑏), where 𝑤 and 𝑏 are the model weight and bias. Moreover, we assume there is an atypical training sample 𝑥₀ ∈ R² labeled as "−1", with Euclidean distances to the centers of the two classes ||𝑥₀ − (−𝜃)||₂ = 𝑑₋₁ and ||𝑥₀ − 𝜃||₂ = 𝑑₊₁. Intuitively, this setting of 𝑥₀ resembles the cases in Fig. 3.4 if 𝑑₊₁ is relatively small, since 𝑥₀ then has a short distance to the wrong class. We also constrain 𝑑₋₁ ≤ 2𝜂 so that 𝑥₀ preserves a certain similarity to the class "−1". The following theorem shows that there always exists a model which fits the sample 𝑥₀ and achieves optimal accuracy (i.e., > 99%) on both classes, even though 𝑑₊₁ can be small.

Theorem 3 (Fitting Atypical Samples in Traditional ERM). For the data distribution in Eq. 3.4, given that 𝜂/𝜎 is large enough, define 𝑙 = √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2). For a point 𝑥₀ with 𝑑₋₁ ≤ 2𝜂, if 𝑑₊₁ ≥ 𝑙, then there exists a model 𝐹 fitting 𝑥₀, i.e., 𝐹(𝑥₀) = −1, with classification error at most 1% on both classes.

The detailed proof can be found in the appendix. Remarkably, for classification tasks where the two classes are well separated, i.e., 𝜎/𝜂 is sufficiently small, there exist models that achieve high performance even when the atypical sample 𝑥₀ is only about √2 · Φ⁻¹(0.99) · 𝜎 away from the center of class "+1", which is much smaller than the distance 2𝜂 between the two class centers. This suggests that for traditional ERM, the negative effect of fitting the atypical sample 𝑥₀ is negligible, even if 𝑥₀ is close to the wrong class "+1". Next, we show that under the same setting, any model which fits the adversarial counterpart of 𝑥₀ must have poor accuracy on the class "+1".

Theorem 4 (Fitting Atypical Samples in Adversarial Training). For the same data distribution as in Eq. 3.4 and the same definition 𝑙 = √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2), assume 𝑙 ≤ 𝑑₊₁ < 𝑘 · 𝑙, where 1 ≤ 𝑘 ≪ 𝜂/𝜎. Then any model 𝐹 has classification error at least 50% on class "+1" when 𝐹 fits the adversarial example of 𝑥₀, i.e., 𝐹(𝑥₀ + 𝛿) = −1 for all ||𝛿||₂ ≤ 𝜖, with

𝜖 ≥ 𝑘 · √2 · Φ⁻¹(0.99) · 𝜎 · (1 + Φ⁻¹(0.99) · 𝜎/𝜂)^(−1/2).   (3.5)

Notably, when the two classes are well separated, i.e., 𝜎/𝜂 is small, the adversarial training bound 𝜖 is also much smaller than 𝜂. The analysis suggests that when 𝑥₀ is close to the class "+1", all linear models which fit the adversarial example of 𝑥₀ will have a large error on class "+1", even if the adversarial training bound 𝜖 is small. Thus, in adversarial training, the negative effect of fitting atypical samples is significantly greater than in traditional ERM. These theorems highlight the key difference between adversarial training and traditional ERM when fitting atypical samples, which motivates us to devise strategies to eliminate this negative effect during adversarial training.
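As a quick numerical sanity check of the closed-form threshold in Theorems 3 and 4, the snippet below evaluates 𝑙 for illustrative parameter values (𝜂 = 1, 𝜎 = 0.1, chosen here only for demonstration) and compares it to the class-center distance 2𝜂; the point is simply that 𝑙 ≪ 2𝜂 whenever 𝜎/𝜂 is small.

```python
from statistics import NormalDist
from math import sqrt

def theorem_threshold(eta: float, sigma: float, target_acc: float = 0.99) -> float:
    """l = sqrt(2) * Phi^{-1}(target_acc) * sigma * (1 + Phi^{-1}(target_acc) * sigma / eta)^(-1/2)."""
    z = NormalDist().inv_cdf(target_acc)      # Phi^{-1}(0.99) is roughly 2.33
    return sqrt(2) * z * sigma * (1 + z * sigma / eta) ** -0.5

eta, sigma = 1.0, 0.1                         # illustrative, well-separated classes
l = theorem_threshold(eta, sigma)
print(f"l = {l:.3f}, class-center distance 2*eta = {2 * eta:.1f}")
# With these values l comes out to roughly 0.3, far smaller than 2*eta = 2.0:
# an atypical point can sit quite close to the wrong class and ERM can still fit it,
# while Theorem 4 says that adversarially fitting it already breaks class "+1".
```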
3.5 Methodology: Benign Adversarial Training

Based on the previous discussions, the effect of atypical samples in adversarial training can be briefly summarized as: (a) they can benefit the model's clean accuracy (especially on atypical samples), but hardly improve its robustness; (b) they hurt the performance on typical samples. In this section, we ask: can we eliminate the negative influence of fitting atypical samples during adversarial training? Admittedly, stopping adversarial training at an early epoch [89] could potentially achieve this goal, since the atypical samples tend to be fitted in the later stages of training. However, it is still unclear how to properly identify such an epoch in general scenarios. Moreover, the early-stop mechanism tends to ignore all atypical samples and significantly limits the model's ability to handle atypical samples, especially on complex datasets such as CIFAR100 and ImageNet with large fractions of atypical samples.

In this work, we propose a novel algorithm, Benign Adversarial Training (BAT). It is composed of two major components: (i) Reweighting, to mitigate the impact of harmful atypical samples while learning benign atypical samples; and (ii) a Discrimination Loss, to further protect the discrimination ability among different classes. Combining these two components, BAT improves the performance on typical samples while preserving the ability to fit the "useful" atypical samples. Compared to PGD adversarial training [71], BAT enjoys a better clean vs. adversarial accuracy trade-off. More results can be found in Section 3.6. Next, we detail these two components of BAT.

3.5.1 Reweighting & Harmful Score

Following PGD adversarial training [71], BAT starts by fitting manually generated adversarial examples using Projected Gradient Descent (PGD). During training, in order to identify which samples can hurt or degrade the model's performance, we define the "harmful score" for each training (clean) sample 𝑥_𝑖 as:

q(𝑥_𝑖) = max_{𝑡≠𝑦} 𝐹_𝑡(𝑥_𝑖),   (3.6)

which is the model 𝐹's largest prediction score (after softmax) of the clean sample 𝑥_𝑖 for a wrong class 𝑡 other than the true class 𝑦. A high harmful score indicates that the current model predicts the sample 𝑥_𝑖 to be a wrong class with high confidence. Note that atypical samples are usually fitted by the model later than typical samples. Under a model with only a few atypical samples fitted, if an atypical sample 𝑥_𝑖 has a high q(𝑥_𝑖), then 𝑥_𝑖 is very likely to be close to the distribution of a wrong class instead of its true class. In Figure 3.4, we present several atypical samples with harmful scores larger than 0.8. These samples present semantic features which are very similar to the wrong class that the model predicts. Therefore, fitting these atypical samples could cause the model to learn mixed concepts of features and degrade its performance. During the training process, we want the model to mitigate the influence of the atypical samples with large harmful scores, but still learn the other "useful / benign" atypical samples. Thus, we design a cost-sensitive reweighting strategy which downweights the cost of atypical samples with large harmful scores. In particular, we specify the weight value 𝑤_𝑖 for each training sample 𝑥^adv_𝑖 as:

𝑤_𝑖 = exp(−𝛼 · q(𝑥_𝑖))  if mem(𝑥_𝑖) > 𝜎 and argmax_𝑡 𝐹_𝑡(𝑥_𝑖) ≠ 𝑦_𝑖;   𝑤_𝑖 = 1  otherwise,   (3.7)

where 𝛼 ∈ R+ and 𝜎 control the size of the reweighted atypical set.
Since the function exp(−𝛼(·)) is decreasing and ranges from 1 to 0, the samples with large harmful scores are assigned weights close to 0. Thus, in the reweighting algorithm, we train the model to find the optimal model 𝐹rw under the weight vector 𝑤:

𝐹rw = argmin_𝐹 (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖).   (3.8)

During training, the (adversarial) training samples with large harmful scores are assigned small weights and their influence is correspondingly largely mitigated. Note that in Eq. 3.7, estimating mem(𝑥_𝑖) for each 𝑥_𝑖 via the leave-out retraining method described in Eq. 3.1 can be computationally costly. However, alternative efficient strategies have been proposed to estimate how likely a sample is to be atypical, including confidence-based strategies and ensemble disagreement [14]. These methods are shown to align well with the method in Eq. 3.1, so they can be leveraged to improve the efficiency of Eq. 3.7.

3.5.2 Discrimination Loss

In addition, we introduce the Discrimination Loss to further eliminate the negative effect of fitting atypical samples. Based on the theoretical analysis in Section 3.4.2, imagine that the atypical sample 𝑥₀ has a fixed distance 𝑑₋₁ to the center of its labeled class; one strategy to mitigate the negative effect of fitting 𝑥₀ is to increase the two classes' center distance 2𝜂 and hence increase the distance 𝑑₊₁ of the harmful example to the poisoned class "+1". Motivated by this point, we introduce L_DL to increase the distances of typical samples from different classes in the representation space:

L_DL(𝐹) = E_{(𝑥_𝑖,𝑦_𝑖),(𝑥_𝑗,𝑦_𝑗),{(𝑥_𝑘,𝑦_𝑘)}_{𝑘=1}^𝐾} [ − log ( exp(ℎ(𝑥^adv_𝑖) · ℎ(𝑥^adv_𝑗) / 𝜏) / Σ_{𝑘=1}^𝐾 exp(ℎ(𝑥^adv_𝑖) · ℎ(𝑥^adv_𝑘) / 𝜏) ) ],   (3.9)
where 𝑦_𝑖 = 𝑦_𝑗; 𝑦_𝑘 ≠ 𝑦_𝑖 for all 𝑘 = 1, ..., 𝐾; and mem(𝑥_𝑖), mem(𝑥_𝑗), mem(𝑥_𝑘) < 𝜎.

Specifically, ℎ(·) is the model 𝐹's penultimate-layer output representation, 𝜏 ∈ R+ is a temperature value, and 𝐾 ∈ Z+ is a fixed number of samples randomly chosen with labels different from 𝑦_𝑖.² Intuitively, constraining the Discrimination Loss encourages the model to output similar representations for typical sample pairs (𝑥^adv_𝑖, 𝑥^adv_𝑗) from the same class, and distinct representations for pairs (𝑥^adv_𝑖, 𝑥^adv_𝑘) from different classes. Similar approaches have been widely used in representation learning, such as the Triplet Loss [95] and contrastive learning methods [22], which emphasize desirable properties of DNNs' representation space. With the Discrimination Loss incorporated, the final objective of BAT is:

𝐹bat = argmin_𝐹 ( (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖) + 𝛽 · L_DL(𝐹) ),   (3.10)

where 𝛽 > 0 controls the intensity of the Discrimination Loss. The detailed training scheme of Benign Adversarial Training is presented in Algorithm 2.

3.5.3 The Detailed Training Scheme of BAT

In this subsection, we provide the detailed training scheme of BAT in Algorithm 2. The BAT algorithm starts from a randomly initialized neural network. On each mini-batch, it applies the PGD attack to generate (training) adversarial examples (Step 5). Following Eq. 3.6 and Eq. 3.7, BAT determines which samples are likely to be harmful atypical samples and computes their corresponding weight values (Step 6). Next, within the current mini-batch, BAT calculates the Discrimination Loss of the typical samples Dtyp (Step 7). Finally, BAT uses SGD to update the model parameters to minimize the reweighted adversarial loss regularized by 𝛽 times the Discrimination Loss (Step 8).

²In practice, under training algorithms using mini-batches, the set {𝑥_𝑘}_{𝑘=1}^𝐾 can be replaced by all (typical) samples in the mini-batch with labels different from 𝑦_𝑖.
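Before the formal procedure in Algorithm 2 below, the following PyTorch-style sketch illustrates how one BAT mini-batch objective (Eqs. 3.6-3.10) could be assembled. It assumes the adversarial examples x_adv have already been generated by PGD (Step 5), that per-sample memorization estimates mem_scores are precomputed (e.g., with the efficient proxies of [14]), and that a hypothetical model.features(·) returns penultimate-layer representations; the hyperparameter values are illustrative, and this is a sketch rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def bat_objective(model, x_clean, x_adv, y, mem_scores,
                  alpha=1.0, beta=0.2, mem_thresh=0.15, tau=0.1):
    """Reweighted adversarial loss (Eq. 3.8) plus beta times the Discrimination Loss (Eq. 3.9)."""
    with torch.no_grad():
        # Harmful score q(x) (Eq. 3.6): largest softmax score on a wrong class.
        probs = F.softmax(model(x_clean), dim=1)
        wrong = probs.clone()
        wrong[torch.arange(len(y)), y] = 0.0
        q = wrong.max(dim=1).values
        # Reweighting (Eq. 3.7): down-weight atypical samples that the current
        # model confidently assigns to a wrong class; everything else keeps weight 1.
        atypical = mem_scores > mem_thresh
        harmful = atypical & (probs.argmax(dim=1) != y)
        w = torch.ones_like(q)
        w[harmful] = torch.exp(-alpha * q[harmful])

    ce = F.cross_entropy(model(x_adv), y, reduction="none")
    adv_loss = (w * ce).sum() / w.sum()

    # Discrimination Loss (Eq. 3.9), computed on typical samples only.
    typ = ~atypical
    dl = ce.new_zeros(())
    if typ.sum() > 1:
        h = F.normalize(model.features(x_adv[typ]), dim=1)   # hypothetical feature hook
        y_t = y[typ]
        sim = h @ h.t() / tau
        eye = torch.eye(len(y_t), dtype=torch.bool, device=sim.device)
        pos = (y_t.unsqueeze(0) == y_t.unsqueeze(1)) & ~eye
        neg = y_t.unsqueeze(0) != y_t.unsqueeze(1)
        # -log( exp(s_ij) / sum over negatives k of exp(s_ik) ) for each positive pair (i, j)
        neg_lse = torch.logsumexp(sim.masked_fill(~neg, float("-inf")), dim=1, keepdim=True)
        pair_loss = neg_lse - sim
        valid = pos & neg.any(dim=1, keepdim=True)
        if valid.any():
            dl = pair_loss[valid].mean()

    return adv_loss + beta * dl
```

Steps 5-8 of Algorithm 2 then correspond to generating x_adv, computing the weights w, computing L_DL, and taking an SGD step on this objective.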
Algorithm 2 The Benign Adversarial Training (BAT) Algorithm.
1: Input: Training dataset D, with typical set Dtyp = {𝑥 ∈ D : mem(𝑥) ≤ 𝜎} and atypical set Datyp = {𝑥 ∈ D : mem(𝑥) > 𝜎}. Targeted type of adversarial attack: 𝑙∞-𝜖 attack. Hyperparameters 𝛼, 𝛽 ∈ R+.
2: Randomly initialize the network 𝐹.
3: repeat
4:   Fetch mini-batch data {(𝑥_𝑖, 𝑦_𝑖)} at the current epoch.
5:   Use PGD to generate adversarial training samples {(𝑥^adv_𝑖, 𝑦_𝑖)}.
6:   Calculate the harmful score q(𝑥_𝑖) and weight 𝑤_𝑖 as in Equation 3.6 and Equation 3.7.
7:   Calculate the Discrimination Loss L_DL(𝐹) within the current mini-batch, as in Equation 3.9.
8:   Update 𝐹 by SGD on the objective: L_BAT = (1 / Σ_𝑖 𝑤_𝑖) Σ_𝑖 𝑤_𝑖 · L(𝐹(𝑥^adv_𝑖), 𝑦_𝑖) + 𝛽 · L_DL(𝐹).
9: until end of training

3.6 Experiment

In this section, we present experimental results to validate the effectiveness of the proposed BAT algorithm on benchmark datasets and compare it with state-of-the-art baseline methods.

3.6.1 Experimental Setup

We conduct the experiments mainly on the benchmark datasets CIFAR10, CIFAR100 [56] and Tiny ImageNet [60].

Figure 3.5: Frequencies of training samples with different memorization values: (a) CIFAR100; (b) CIFAR10; (c) Tiny ImageNet.

Description of Datasets. In Fig. 3.6, we provide several pairs of images with a high influence value (as defined in Section 3.3.1) over 0.15. In each pair, the training sample also has a high memorization value over 0.15. These examples suggest that atypical samples exist in both the training and test sets of CIFAR10, CIFAR100 and Tiny ImageNet. A pair of atypical samples (one in the training set and one in the test set) with a high influence value are visually very similar. Moreover, since they have high influence values, removing the atypical samples from the training set is very likely to cause the model to fail on the corresponding test atypical samples. Therefore, without memorizing the atypical samples in the training set, the model can hardly predict the atypical samples in the test set. In Fig. 3.5, we also provide histograms of the estimated memorization values of all training samples from CIFAR10, CIFAR100 and Tiny ImageNet. From Fig. 3.5, we can observe that atypical samples (with memorization value > 0.15) constitute a significant fraction (over 40% and 50%, respectively) of CIFAR100 and Tiny ImageNet. In CIFAR10, they also constitute a non-negligible fraction of over 10%.

Figure 3.6: High-influence train/test pairs with influence value > 0.15, from CIFAR100, CIFAR10 and Tiny ImageNet.

Benchmark Defenses. For each dataset, we study the algorithms under the model architectures ResNet and WideResNet (WRN28, width = 10) [45]. In this section, we only present the results of the WRN models and leave the results on ResNet to the appendix. For a fair comparison with BAT, we implement baselines including PGD adversarial training [71] as well as its most popular variant TRADES [135]. In addition, we include several recent algorithms, MART [112] and GAIRAT [137], which also incorporate reweighting strategies into adversarial training. For BAT
and all baseline methods, we run the algorithms using SGD [7] for 160 epochs, where the learning rate starts at 0.1 and decays by a factor of 0.1 after epochs 80 and 120. We report the performance of all methods at the checkpoints where they reach the maximal PGD adversarial accuracy on a separate validation set. More implementation details are listed in the appendix.

Table 3.1: Performance of BAT vs. Baselines on CIFAR100 Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 59.7 (-2.3) | 34.9 (-3.7) | 24.7 (-3.8) | 24.2 (-2.3) | 22.5 (-2.3)
TRADES (1/𝜆 = 5) | 57.3 (-4.7) | 34.5 (-4.1) | 24.9 (-3.6) | 24.6 (-1.9) | 22.9 (-1.9)
MART [112] | 56.5 (-5.5) | 36.1 (-2.5) | 26.8 (-1.7) | 25.3 (-1.2) | 23.8 (-1.0)
GAIRAT [137] | 60.2 (-1.8) | 34.8 (-3.8) | 24.4 (-4.1) | 24.8 (-1.5) | 22.9 (-1.9)
BAT (𝛼 = 1, 𝛽 = 0.2) | 62.0 (+0.0) | 38.6 (+0.0) | 28.5 (+0.0) | 26.5 (+0.0) | 24.8 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 61.4 (-0.6) | 38.9 (+0.3) | 28.2 (-0.3) | 26.3 (-0.2) | 24.8 (-0.0)

Table 3.2: Performance of BAT vs. Baselines on Tiny ImageNet Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 58.9 (-3.4) | 35.7 (-5.9) | 31.7 (-4.1) | 30.0 (-3.7) | 30.0 (-4.3)
TRADES (1/𝜆 = 5) | 59.7 (-2.6) | 37.4 (-4.2) | 32.0 (-3.8) | 31.8 (-1.9) | 32.0 (-2.3)
MART [112] | 58.2 (-4.1) | 41.0 (-0.6) | 35.6 (-0.2) | 33.9 (+0.2) | 34.1 (-0.2)
GAIRAT [137] | 60.5 (-1.8) | 39.1 (-2.5) | 33.1 (-2.7) | 31.9 (-1.8) | 32.3 (-2.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 62.3 (+0.0) | 41.6 (+0.0) | 35.8 (+0.0) | 33.7 (+0.0) | 34.3 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 62.4 (+0.1) | 43.1 (+1.5) | 37.4 (+1.6) | 35.3 (+1.6) | 35.6 (+0.7)

Table 3.3: Performance of BAT vs. Baselines on CIFAR10 Under WRN28.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 85.6 (-0.4) | 59.6 (-1.5) | 52.2 (-1.0) | 50.7 (-0.7) | 49.4 (-1.1)
TRADES (1/𝜆 = 5) | 83.8 (-2.2) | 62.7 (+1.6) | 54.9 (+1.7) | 52.6 (+1.2) | 51.8 (+1.3)
MART [112] | 83.1 (-2.9) | 63.5 (+2.4) | 56.8 (+3.6) | 53.6 (+2.2) | 52.2 (+1.7)
GAIRAT [137] | 85.3 (-0.7) | 63.1 (+2.0) | 55.8 (+2.6) | 52.6 (+1.2) | 51.8 (+1.3)
BAT (𝛼 = 1, 𝛽 = 0.3) | 86.0 (+0.0) | 61.1 (+0.0) | 53.2 (+0.0) | 51.4 (+0.0) | 50.5 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 85.7 (-0.3) | 60.9 (-0.2) | 53.2 (+0.0) | 51.1 (-0.3) | 50.3 (-0.2)

Performance on CIFAR100. For a comprehensive comparison between the different methods on CIFAR100, in Table 3.1 we report the models' clean accuracy and adversarial accuracy against 𝑙∞-8/255 adversarial examples generated by various attack algorithms, including the FGSM attack [41], PGD attack [71], CW [15] and Auto-Attack [25]. For BAT, we report its performance under its optimal hyperparameters: 𝛼 = 1, 𝛽 = 0.2 and 𝛼 = 2, 𝛽 = 0.2. In the appendix, we discuss the impact of the selection of 𝛼 and 𝛽 on BAT and argue that both components are necessary for BAT to achieve the optimal performance. In Table 3.1, we report the performance difference between each baseline method and BAT (𝛼 = 1, 𝛽 = 0.2) in parentheses. From the results, we find that BAT enjoys a good clean & adversarial accuracy trade-off: it achieves the highest clean accuracy as well as the highest adversarial accuracy among all methods. Compared to PGD adversarial training [71], BAT has around 2 to 3% improvement on both clean accuracy and adversarial accuracy under strong attacks such as PGD, CW and AA. Compared to other baseline methods such as MART [112], it has more than 1% robustness improvement under strong attacks, as well as much higher clean accuracy (> 5%).

Performance on Tiny ImageNet. Tiny ImageNet [60] contains 200 classes of images from the original ImageNet [57] dataset, with 500 training images for each class and image size 64 × 64. In our experiments, we only choose the first 50 classes of Tiny ImageNet for training and prediction.
Since the image size is 64×64, for both training and robustness evaluation we consider adversarial attacks bounded by 𝑙∞-norm 4/255. In Table 3.2, we report the performance of BAT and the baseline methods. Similar to the conclusions drawn from CIFAR100, BAT achieves consistently better clean & adversarial accuracy than the baseline methods.

Performance on CIFAR10. On CIFAR10 (Table 3.3), BAT also (slightly) outperforms PGD adversarial training, with a 0.4% improvement on clean accuracy and better adversarial accuracy. However, the improvement is not as significant as on CIFAR100 and Tiny ImageNet, because CIFAR10 contains much fewer atypical samples. Remarkably, this experimental result on CIFAR10 also suggests that the atypical samples (or memorization) have a non-negligible influence on model performance, even in datasets which contain relatively few atypical samples. Compared to other baseline methods such as TRADES, MART and GAIRAT, BAT has lower adversarial accuracy but better clean accuracy.

Performance on ResNet18. We also compare the performance of BAT with the baselines under a smaller architecture, ResNet18, with the results reported in Table 3.4, Table 3.5 and Table 3.6. From the results, we can see that BAT does not have as significant an advantage over the baselines as under WRN28. This is because the capacity of ResNet18 is much smaller than that of WRN28, and it sometimes cannot achieve 100% training accuracy; therefore, the proposed BAT method cannot achieve the best performance among the baselines. However, BAT still achieves performance similar to PGD adversarial training in most cases.

Table 3.4: Performance of BAT vs. Baselines on CIFAR100 Under ResNet18.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 56.9 (-2.6) | 36.0 (-1.3) | 27.4 (+0.1) | 25.4 (-1.2) | 23.6 (-0.7)
TRADES (1/𝜆 = 5) | 56.6 (-2.9) | 36.5 (-0.8) | 26.9 (-0.4) | 25.3 (-1.3) | 23.9 (-0.4)
MART [112] | 51.8 (-7.7) | 36.1 (-1.2) | 30.4 (+3.1) | 25.8 (-0.8) | 24.4 (+0.1)
BAT (𝛼 = 1, 𝛽 = 0.2) | 59.5 (+0.0) | 37.3 (+0.0) | 27.3 (+0.0) | 26.6 (+0.0) | 24.3 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 59.3 (-0.2) | 37.1 (-0.2) | 27.4 (+0.1) | 26.5 (-0.1) | 24.0 (+0.7)

Table 3.5: Performance of BAT vs. Baselines on CIFAR10 Under ResNet18.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 81.9 (-0.3) | 59.1 (-0.2) | 51.8 (+0.3) | 50.2 (+0.2) | 48.1 (+0.1)
TRADES (1/𝜆 = 5) | 81.2 (-1.0) | 60.0 (+0.7) | 52.2 (+0.7) | 50.8 (+0.8) | 48.6 (+0.6)
MART [112] | 80.5 (-1.7) | 62.2 (+1.9) | 53.5 (+2.0) | 52.5 (+2.5) | 49.6 (+1.6)
BAT (𝛼 = 1, 𝛽 = 0.2) | 82.2 (+0.0) | 59.3 (+0.0) | 51.5 (+0.0) | 50.0 (+0.0) | 48.0 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.3) | 81.7 (-0.5) | 59.7 (+0.4) | 52.3 (+0.8) | 50.3 (+0.3) | 48.3 (+0.3)

Table 3.6: Performance of BAT vs. Baselines on Tiny ImageNet Under ResNet32.
Method | Clean Acc. | FGSM | PGD | CW | AA.
PGD Adv. Training | 56.3 (-3.1) | 37.5 (-2.9) | 32.3 (+0.3) | 29.8 (-1.7) | 29.8 (-2.2)
TRADES (1/𝜆 = 5) | 55.4 (-4.0) | 35.2 (-5.2) | 28.8 (-3.2) | 27.0 (-4.5) | 27.0 (-5.0)
MART [112] | 56.2 (-3.2) | 38.1 (-2.3) | 34.5 (+2.5) | 31.8 (+0.3) | 32.0 (+0.0)
BAT (𝛼 = 1, 𝛽 = 0.2) | 59.4 (+0.0) | 40.4 (+0.0) | 32.0 (+0.0) | 31.5 (+0.0) | 32.0 (+0.0)
BAT (𝛼 = 2, 𝛽 = 0.2) | 59.4 (+0.0) | 41.3 (+0.9) | 32.9 (+0.9) | 31.8 (+0.3) | 32.4 (+0.4)

Table 3.7: Human Evaluation on Atypical Samples.
 | Class 𝑦1 | Class 𝑦2 | Both Likely | Neither | Total
High Harmful Score | 59.0 | 12.0 | 26.0 | 4.0 | 100.0
Low Harmful Score | 96.0 | 0.0 | 1.5 | 2.5 | 100.0

3.6.2 Ablation Study

An ablation study on "harmful atypical samples". In this subsection, we conduct additional experiments to verify the claims about "harmful atypical samples".
In particular, we aim to demonstrate that the atypical images with high "harmful scores" are likely to contain features which visually resemble images from a different (wrong) class, so fitting them is likely to degrade the model performance. To validate this claim, we print out 200 atypical images from the CIFAR100 training set with the highest / lowest harmful scores as defined in Eq. 3.6. Then, we let two independent human annotators label each image by choosing one of 4 options: (a) this image belongs to class 𝑦1; (b) this image belongs to class 𝑦2; (c) both class 𝑦1 and 𝑦2 are likely; (d) neither class 𝑦1 nor 𝑦2. Here, 𝑦1 is the ground-truth label of the sample and 𝑦2 is the class other than 𝑦1 on which the model has the maximal confidence, 𝑦2 = argmax_{𝑡≠𝑦1} 𝐹_𝑡(𝑥), where the model 𝐹 is obtained via PGD adversarial training with all atypical samples removed from the training set. We report the percentage of the annotators' answers in Table 3.7. From the results, we can see that for the samples with high harmful scores, people are more likely to believe that "the image is from 𝑦2" or "the image is likely from either 𝑦1 or 𝑦2". This is because the highly harmful images contain many features shared between these two classes. It is also worth mentioning that even the samples annotators labeled as 𝑦1 usually also have features belonging to the class 𝑦2. For example, a "bowl" which is round and orange: although people confidently believe it is a bowl, it does have features, such as "round" and "orange color", which are also features of the class "orange".

An ablation study on the hyperparameters in BAT. In this subsection, we study the impact of the hyperparameters of BAT, namely 𝛼, which controls the reweighting process, and 𝛽, which controls the Discrimination Loss. In Figure 3.7, we conduct experiments on CIFAR100 with BAT, where 𝛼 is chosen from [0, 1, 2, 3] and 𝛽 from [0, 0.1, 0.2, 0.3]. Figure 3.7 shows each model's overall clean accuracy (left) and adversarial accuracy under PGD attack (right) along the Z-axis, while the X-axis and Y-axis indicate the model's corresponding values of (𝛼, 𝛽). Note that when 𝛼 = 𝛽 = 0, the BAT method reduces to the original PGD adversarial training. From the results, we can see that a positive pair of (𝛼, 𝛽) benefits both the model's clean and adversarial accuracy (under PGD attack). Therefore, both components of BAT (i.e., the reweighting and L_DL) are helpful and necessary. However, when 𝛼 or 𝛽 is too large, it hurts BAT's clean accuracy. Hence, on CIFAR100, BAT achieves the optimal performance when 𝛼 = 1 or 2 and 𝛽 = 0.2.

Figure 3.7: Clean Acc. (left) & Adv. Acc. (right) of BAT.

3.7 Conclusion

In this paper, we draw significant connections between the memorization effect of deep neural networks and the behaviors of adversarial training algorithms. Based on our findings, we devise a novel algorithm, BAT, to enhance the performance of adversarial training. The findings of this work can motivate future studies on building robust DNNs with more attention to the data perspective.

CHAPTER 4
ENHANCING ROBUSTNESS MAY AMPLIFY UNFAIRNESS ON IMBALANCED DATA¹

4.1 Introduction

The existence of adversarial samples [41] has raised serious concerns about applying deep neural network (DNN) models in security-critical applications, such as autonomous driving [21] and video surveillance systems [58].
As countermeasures against adversarial attacks, adversarial training [71, 135] has been empirically proven to be one of the most effective and reliable defense methods. In general, adversarial training can be formulated as minimizing the model's average error on adversarially perturbed input examples [71, 135, 89]. Although promising for improving model robustness, most existing adversarial training methods [135, 112] assume that the number of training examples from each class is equally distributed. However, datasets collected from real-world applications typically have imbalanced distributions [37, 66, 107]. Hence, it is natural to ask: what is the behavior of adversarial training under imbalanced scenarios? Can we directly apply existing imbalanced learning strategies from natural training to tackle the imbalance issue in adversarial training?

Recent studies find that adversarial training usually presents distinct properties from natural training [94, 122]. For example, compared to natural training, adversarially trained models suffer more from the overfitting issue [94]. Moreover, a recent study [122] shows that adversarially trained models tend to present strong class-wise performance disparities, even if the training examples are uniformly distributed over different classes. If the training data distribution is highly imbalanced, these properties of adversarial training can be greatly exaggerated and make it extremely difficult to apply in practice. Therefore, it is important but challenging to answer the aforementioned questions.

As an initial effort to study the imbalance problem in adversarial training, in this work we first investigate the performance of existing adversarial training under imbalanced settings. In a preliminary study presented in Section 4.3.1, we apply both natural training and PGD adversarial training [71] on multiple imbalanced image datasets constructed from the CIFAR10 dataset [56] using the ResNet18 architecture [45] and evaluate the trained models' performance on class-balanced test datasets. From the preliminary results, we observe that, compared to naturally trained models, adversarially trained models always present very low standard accuracy and robust accuracy² on under-represented classes. For example, a naturally trained model achieves around 40% and 60% standard accuracy on the under-represented classes "frog" and "truck", respectively, while an adversarially trained model gets 0% standard & robust accuracy on both of these classes. This observation suggests that adversarial training is more sensitive to imbalanced data distributions than natural training. Thus, when applying adversarial training in practice, imbalanced learning strategies should always be considered.

As a result, we explore potential solutions which can handle the imbalance issue for adversarial training. In this work, we focus on studying the behavior of the reweighting strategy [44] and leave other strategies, such as resampling [36], for future work. In Section 4.3.2, we apply the reweighting strategy to existing adversarial training with varied weights assigned to one under-represented class and evaluate the trained models' performance.

¹This chapter is based on previously published work, ©IEEE, reprinted with permission, from Han Xu, Wentao Wang, Xiaorui Liu, Yaxin Li, Bhavani Thuraisingham, Jiliang Tang, titled "Imbalanced adversarial training with reweighting", published at the International Conference on Data Mining (12/2022) [111].
²In this work, we denote standard accuracy as the model's accuracy on input samples without perturbations and robust accuracy as the model's accuracy on input samples which are adversarially perturbed. Unless otherwise specified, we consider perturbations constrained by 𝑙∞-norm 8/255.
From the results, we observe that, in adversarial training, increasing the weights for an under-represented class can substantially improve the standard & robust accuracy on this class, but drastically hurts the model's performance on the well-represented class. For example, the robust accuracy of the adversarially trained model on the under-represented class "horse" can be greatly improved when setting a relatively large weight, like 200, for its examples, but the model's robust accuracy on the well-represented class "cat" drops to even lower than that of the class "horse", and hence the overall robust performance of the model also decreases. These facts indicate that the performance of adversarially trained models is very sensitive to reweighting manipulations, and it could be very hard to find a reweighting strategy which is optimal for all classes. It is also worth noting that this phenomenon is absent in natural training under the same settings. In natural training, from the results in Section 4.3.2, we find that upweighting the under-represented class increases the model's standard accuracy on this class but only slightly hurts the accuracy on other classes, even when adopting a large weight for the under-represented class.

To further investigate the possible reasons leading to the different behaviors of the reweighting strategy in natural and adversarial training, we visualize their learned features via t-SNE [106]. As shown in Figure 4.3, we observe that the features of different classes learned by the adversarially trained model tend to mix together, while they are well separated for the naturally trained model. This observation motivates us to theoretically show that when the given data distribution has poor data separability, upweighting under-represented classes will hurt the model's performance on well-represented classes. Motivated by this theoretical understanding, we propose a novel algorithm, Separable Reweighted Adversarial Training (SRAT), to facilitate the reweighting strategy in imbalanced adversarial training by enhancing the separability of learned features. Through extensive experiments, we validate the effectiveness of SRAT.

4.2 Related Work

Adversarial Robustness. The vulnerability of DNN models to adversarial examples has been verified by many successful attack methods [41, 15, 71]. To improve model robustness against adversarial attacks, various defense methods have been proposed [41, 71, 85, 24]. Among them, adversarial training has been proven to be one of the most effective defense methods [2]. Adversarial training can be formulated as solving a min-max optimization problem, where the outer minimization enforces the model to be robust to the adversarial examples generated by the inner maximization via existing attack methods like PGD [71]. Based on adversarial training, several variants, such as TRADES [135], MART [112] and FAT [136], have been presented to further improve the model's performance. More details about adversarial robustness can be found in recent surveys [18, 125].
Since almost all studies of adversarial training focus on balanced datasets, it is worthwhile to investigate the performance of adversarial training methods on imbalanced training datasets.

Imbalanced Learning. Most existing works on imbalanced training can be roughly classified into two categories, i.e., re-sampling and reweighting. Re-sampling methods aim to reduce the level of imbalance by either over-sampling data examples from under-represented classes [9, 12] or under-sampling data examples from well-represented classes [48, 35, 43, 127]. Reweighting methods allocate different weights to different classes or even different data examples. For example, Focal loss [65] enlarges the weights of wrongly-classified examples while reducing the weights of well-classified examples in the standard cross-entropy loss, and LDAM loss [13] regularizes the under-represented classes more strongly than the over-represented classes to attain good generalization performance on under-represented classes. More information about imbalanced learning can be found in recent surveys [44, 51]. The majority of existing methods focus on the natural training scenario, and their trained models will break down when facing adversarial attacks [41]. Hence, in this paper, we develop a novel method that can defend against adversarial attacks and achieve satisfactory performance under the imbalanced setting.

Figure 4.1: Class-wise performance of natural & adversarial training under an imbalanced CIFAR10: (a) Natural Training Standard Acc.; (b) Adv. Train. Standard Acc.; (c) Adv. Train. Robust Acc.

4.3 Preliminary Study

4.3.1 The Behavior of Adversarial Training

In this subsection, we conduct preliminary studies to examine the performance of PGD adversarial training [71] on an imbalanced training dataset resampled from the CIFAR10 dataset [56]. Following previous imbalanced learning works [26, 13], we construct an imbalanced training dataset where each of the first 5 classes (well-represented classes) has 5,000 training examples, and each of the last 5 classes (under-represented classes) has 50 training examples. Figure 4.1 shows the performance of naturally and adversarially trained models under a ResNet18 [45] architecture. From the figure, we can observe that, compared with natural training, PGD adversarial training results in a larger performance gap between well-represented classes and under-represented classes. For example, in natural training, the ratio between the average standard accuracy of well-represented classes (brown) and under-represented classes (violet) is about 2:1, while in adversarial training this ratio expands to 16:1. Moreover, although adversarial training achieves good standard & robust accuracy on well-represented classes, it has extremely poor performance on under-represented classes: 3 out of the 5 under-represented classes have 0% standard & robust accuracy. In conclusion, the performance of adversarial training is more easily affected by an imbalanced distribution than natural training and suffers more on under-represented classes. In Appendix 3.1, we provide more implementation details of this experiment, as well as additional results of the same experiment under other imbalanced settings. The results in Appendix 3.1 further support our findings.

Figure 4.2: Class-wise performance of natural & adversarial training under an imbalanced CIFAR10: (a) Natural Training Standard Acc.; (b) Adv. Train. Standard Acc.; (c) Adv. Train. Robust Acc.
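As a concrete illustration of the setup above, the following sketch builds the step-imbalanced CIFAR10 training set (5,000 examples for each of the first five classes, 50 for each of the last five) with torchvision; the random seed and the omission of data augmentation are illustrative simplifications rather than the exact preprocessing used in our experiments.

```python
import numpy as np
from torch.utils.data import Subset
from torchvision.datasets import CIFAR10

def step_imbalanced_cifar10(root="./data", n_well=5000, n_under=50, seed=0):
    """Step-imbalanced CIFAR10: classes 0-4 keep n_well examples, classes 5-9 keep n_under."""
    full = CIFAR10(root=root, train=True, download=True)
    targets = np.array(full.targets)
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(10):
        idx = np.where(targets == c)[0]
        k = n_well if c < 5 else n_under
        keep.extend(rng.choice(idx, size=k, replace=False))
    return Subset(full, sorted(int(i) for i in keep))
```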
4.3.2 The Reweighting Strategy in Natural Training vs. Adversarial Training

The preliminary study in Section 4.3.1 demonstrates that it is highly demanding to adjust the original adversarial training methods to accommodate class-imbalanced data. Therefore, in this subsection, we investigate the effectiveness of existing imbalanced learning strategies from natural training when adopted in adversarial training. In this paper, we focus on the reweighting strategy [44] as the initial effort to study this problem and leave other methods, such as resampling [20], for future investigation.

In this subsection, we conduct experiments under a binary classification problem, where the training dataset contains two classes randomly selected from the CIFAR10 dataset, with the two classes having 5,000 and 50 training examples, respectively. On this training dataset, we arrange multiple trials of (reweighted) natural training and (reweighted) adversarial training, with the weight ratio between the under-represented class and the well-represented class ranging from 1:1 to 200:1. Figure 4.2a shows the experimental results with data from the classes "horse" and "cat". As demonstrated in Figure 4.2a, increasing the weight of the under-represented class drastically increases the model's performance on the under-represented class, while also severely decreasing the performance on the well-represented class. For example, when increasing the weight ratio between the two classes from 1:1 to 150:1, the under-represented class's standard accuracy can be improved from 0% to ∼60% and its robust accuracy from 0% to ∼50%. However, the standard & robust accuracy of the well-represented class also decreases drastically. For instance, the well-represented class's standard accuracy drops from 100% to 60%, and its robust accuracy drops from 100% to 50%. These results illustrate that adversarial training's performance can be significantly affected by the reweighting strategy. As a result, the reweighting strategy in this setting can hardly improve the overall performance no matter which weight ratio is chosen, because the model's performance always presents a strong tension between these two classes. As a comparison, for the naturally trained models (Figure 4.2a), increasing the weights for the under-represented examples only slightly decreases the performance on the well-represented class. More experiments using different binary imbalanced datasets are reported in Appendix 3.2, where we have similar observations.

4.4 Theoretical Analysis

In Section 4.3.2, we observe that in natural training, the reweighting strategy only has a small impact on the two classes' performance. This phenomenon has been extensively studied by recent works [12, 119], which investigate the decision boundaries of perfectly fitted DNNs. In particular, they consider the case where the data is linearly (or nonlinearly) separable and study the behavior of linear (or nonlinear) models optimized by reweighted SGD algorithms. Interestingly, they conclude that over the course of training, these models' decision boundaries eventually converge to weight-agnostic solutions. For example, a linear classifier optimized by SGD on linearly separable data converges to the solution of the hard-margin support vector machine [81]. In other words, as long as the data can be well separated, reweighting will not have a large influence on the finally trained models, which is consistent with what we observed above.
Although these studies only focus on natural training, their interpretations and conclusions motivate our hypothesis for adversarial training. For adversarial training, we conjecture that it is because the models separate the data poorly that their performance is highly sensitive to the reweighting strategy. As a direct validation of this hypothesis, in Figure 4.3 we visualize the learned (penultimate-layer) features of the imbalanced training examples used in the binary classification problem in Section 4.3.2. We find that adversarially trained models indeed present obviously poorer separability of the learned features. This suggests that, compared to naturally trained models, adversarially trained models have a weaker ability to separate training data, which could make them sensitive to reweighting. Next, we theoretically analyze the impact of reweighting on linear models optimized on poorly separable data. Since our empirical study shows that adversarially trained models usually separate the data poorly (see Figure 4.3), the analysis can hopefully shed light on the behavior of reweighting in adversarially trained models in practice.

Binary Classification Problem. To construct the theoretical study, we focus on a binary classification problem with a Gaussian mixture distribution D, defined as:

𝑦 ∼ {−1, +1},   𝑥 ∼ N(𝜇, 𝜎²𝐼) if 𝑦 = +1;   𝑥 ∼ N(−𝜇, 𝜎²𝐼) if 𝑦 = −1,   where 𝜇 = (𝜂, ..., 𝜂) ∈ R^𝑑,   (4.1)

where the two classes' centers are ±𝜇 ∈ R^𝑑, and each dimension has mean value ±𝜂 (𝜂 > 0) and variance 𝜎². Formally, we define the data separability as 𝑆 = 𝜂/𝜎². Intuitively, a larger separability term 𝑆 suggests that the two classes are farther apart or that the data examples of each class are more concentrated, so the two classes can be well separated. Previous works [12] have also closely studied this term to describe data separability. Besides, we define the imbalanced training dataset to satisfy the condition Pr(𝑦 = +1) = 𝐾 · Pr(𝑦 = −1) with 𝐾 > 1, which is the imbalance ratio between the two classes. During test, we assume that the two classes appear with equal probability. Under the data distribution D, we discuss the performance of linear classifiers 𝑓(𝑥) = sign(𝑤ᵀ𝑥 − 𝑏), where 𝑤 and 𝑏 are the weight and bias terms of the model 𝑓. If a reweighting strategy is involved, we assume the model upweights the under-represented class "−1" by 𝜌. In the following lemma, we first derive the solution of the optimal linear classifier 𝑓 trained on this imbalanced dataset. Then we extend the result of Lemma 4.4 to analyze the impact of data separability on the performance of the model 𝑓.

Under the data distribution D defined in Eq. (4.1), with an imbalance ratio 𝐾 and a reweighting ratio 𝜌, the optimal classifier which minimizes the (reweighted) empirical risk:

𝑓* = argmin_𝑓 [ Pr(𝑓(𝑥) ≠ 𝑦 | 𝑦 = −1) · Pr(𝑦 = −1) · 𝜌 + Pr(𝑓(𝑥) ≠ 𝑦 | 𝑦 = +1) · Pr(𝑦 = +1) ]   (4.2)

has the solution 𝑤 = 1 and 𝑏 = (1/2) log(𝜌/𝐾) · 𝑑𝜎²/𝜂 = (1/2) log(𝜌/𝐾) · 𝑑/𝑆.

The proof of Lemma 4.4 can be found in Appendix 3.3.1. Note that the final optimized classifier has a weight vector equal to 1, and its bias term 𝑏 only depends on 𝐾, 𝜌 and the data separability 𝑆. In the following, our first theorem focuses on the special setting 𝜌 = 1, which corresponds to the original ERM model without reweighting.
Specifically, Theorem 4.4 calculates and compares the model's performance under two data distributions: D₁ (with a higher separability 𝑆₁) and D₂ (with a lower separability 𝑆₂). From Theorem 4.4, we aim to compare the behavior of linear models when they separate the data poorly (like adversarially trained models) and when they separate the data well (like naturally trained models).

Under two data distributions (𝑥⁽¹⁾, 𝑦⁽¹⁾) ∈ D₁ and (𝑥⁽²⁾, 𝑦⁽²⁾) ∈ D₂ with separability 𝑆₁ > 𝑆₂, let 𝑓₁* and 𝑓₂* be the optimal non-reweighted classifiers (𝜌 = 1) under D₁ and D₂, respectively. Given that the imbalance ratio 𝐾 is large enough, we have:

Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = −1) − Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1)
< Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = −1) − Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1).   (4.3)

The proof of Theorem 4.4 is provided in Appendix 3.3.2. Intuitively, Theorem 4.4 suggests that when the data separability 𝑆 is low (as in D₂), the optimized classifier (without reweighting) intrinsically has a larger error difference between the under-represented class "−1" and the well-represented class "+1". Similar to the observations in Section 4.3.1 and Figure 4.3, adversarially trained models also present a weak ability to separate data, and they also present a strong performance gap between the well-represented class and the under-represented class. Conclusively, Theorem 4.4 indicates that a poor ability to separate the training data can be one important reason leading to the strong performance gap of adversarially trained models. Next, we consider the case when the reweighting strategy is applied. Similar to Theorem 4.4, we calculate the models' class-wise error under D₁ and D₂ with different levels of separability. In particular, Theorem 4.4 focuses on the well-represented class "+1" and calculates its error increase when upweighting the under-represented class "−1" by 𝜌. Through this analysis, we compare the impact of upweighting the under-represented class on the performance of the well-represented class.

Under two data distributions (𝑥⁽¹⁾, 𝑦⁽¹⁾) ∈ D₁ and (𝑥⁽²⁾, 𝑦⁽²⁾) ∈ D₂ with different separability 𝑆₁ > 𝑆₂, let 𝑓₁* and 𝑓₂* be the optimal non-reweighted classifiers (𝜌 = 1) under D₁ and D₂, respectively, and let 𝑓′₁* and 𝑓′₂* be the optimal reweighted classifiers under D₁ and D₂ given the optimal reweighting ratio (𝜌 = 𝐾). Given that the imbalance ratio 𝐾 is large enough, we have:

Pr(𝑓′₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1) − Pr(𝑓₁*(𝑥⁽¹⁾) ≠ 𝑦⁽¹⁾ | 𝑦⁽¹⁾ = +1)
< Pr(𝑓′₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1) − Pr(𝑓₂*(𝑥⁽²⁾) ≠ 𝑦⁽²⁾ | 𝑦⁽²⁾ = +1).   (4.4)

We detail the proof of Theorem 4.4 in Appendix 3.3.3. The theorem shows that, when the data distribution has poorer data separability, such as D₂, upweighting the under-represented class causes greater harm to the performance of the well-represented class. This is also consistent with our empirical findings about adversarially trained models: since adversarially trained models separate the data poorly (Figure 4.3), upweighting the under-represented class always drastically decreases the performance of the well-represented class (Section 4.3.2).

Figure 4.3: t-SNE visualization of penultimate-layer features: (a) Natural Training; (b) Adversarial Training.

Through the discussions of both theorems, we can conclude that poor separability can be one important reason which makes it extremely difficult for adversarial training and its reweighted variants to achieve good performance under imbalanced data distributions.
Therefore, in the next section, we explore potential solutions which can facilitate the reweighting strategy in adversarial training.

4.5 Separable Reweighted Adversarial Training (SRAT)

The observations from both the preliminary studies and the theoretical understanding indicate that more separable data will advance the reweighting strategy in adversarial training under imbalanced scenarios. Thus, in this section, we present a framework, Separable Reweighted Adversarial Training (SRAT), that enables the effectiveness of the reweighting strategy in adversarial training under imbalanced scenarios by increasing the separability of the learned latent feature space.

4.5.1 Reweighted Adversarial Training

Given an input example (x, 𝑦), adversarial training [71] aims to obtain a robust model 𝑓_𝜃 that makes the same prediction 𝑦 for an adversarial example x′, generated by applying an adversarial perturbation to x. The adversarial perturbations are typically bounded by a small value 𝜖 under the 𝐿_𝑝-norm, i.e., ∥x′ − x∥_𝑝 ≤ 𝜖. More formally, adversarial training can be formulated as solving a min-max optimization problem, where a DNN model is trained to minimize the prediction error on adversarial examples generated by iteratively maximizing some loss function. As indicated in Section 4.3.1, adversarial training cannot be applied in imbalanced scenarios directly, as it presents very low performance on under-represented classes. To tackle this problem, a natural idea is to integrate existing imbalanced learning strategies proposed in natural training, such as reweighting, into adversarial training to improve the trained model's performance on those under-represented classes. Hence, reweighted adversarial training can be defined as:

min_𝜃 (1/𝑛) Σ_{𝑖=1}^𝑛 max_{∥x′_𝑖 − x_𝑖∥_𝑝 ≤ 𝜖} 𝑤_𝑖 L(𝑓_𝜃(x′_𝑖), 𝑦_𝑖),   (4.5)

where 𝑤_𝑖 is a reweighting value assigned to each input sample (x_𝑖, 𝑦_𝑖) based on the example size of the class (x_𝑖, 𝑦_𝑖) belongs to or on some properties of (x_𝑖, 𝑦_𝑖). In most existing adversarial training methods [71, 135, 112], the cross-entropy (CE) loss is adopted as the loss function L(·, ·). However, the CE loss could be suboptimal in imbalanced settings, and some loss functions designed specifically for imbalanced settings, such as Focal loss [65] and LDAM loss [13], have been proven superior in natural training. Hence, besides the CE loss, Focal loss and LDAM loss can also be adopted as the loss function L(·, ·) in Eq. (4.5).

4.5.2 Increasing Feature Separability

Our preliminary study indicates that reweighted adversarial training alone cannot work well under imbalanced scenarios. Moreover, the reweighting strategy behaves very differently between natural training and adversarial training. Meanwhile, our theoretical analysis suggests that the poor separability of the feature space produced by the adversarially trained model can be one reason behind these observations. Hence, in order to facilitate the reweighting strategy in adversarial training under imbalanced scenarios, we equip our SRAT method with a feature separation loss. We aim to make the learned feature space as separable as possible. More specifically, the goal of the feature separation loss is to make (1) the learned features of examples from the same class well clustered, and (2) the features of examples from different classes well separated. By achieving this goal, the model is able to learn more discriminative features for each class.
With such discriminative features, adjusting the decision boundary via the reweighting strategy to better fit the under-represented classes' examples will not drastically hurt the well-represented classes. The feature separation loss is formally defined as:

$\mathcal{L}_{sep}(x'_i) = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z'_i \cdot z'_p / \tau)}{\sum_{a \in A(i)} \exp(z'_i \cdot z'_a / \tau)},$   (4.6)

where $z'_i$ is the feature representation of the adversarial example $x'_i$ of $x_i$, $\tau \in \mathbb{R}^+$ is a scalar temperature parameter, $P(i)$ denotes the set of input examples belonging to the same class as $x_i$, and $A(i)$ denotes the set of all input examples except $x'_i$. When minimizing the feature separation loss during training, the learned features of examples from the same class tend to aggregate together in the latent feature space, resulting in a more separable latent feature space. Our proposed feature separation loss $\mathcal{L}_{sep}(\cdot)$ is inspired by the supervised contrastive loss proposed in [52]. The main difference is that, instead of applying data augmentation techniques to generate two different views of each data example and feeding the model with the augmented examples, our feature separation loss directly takes the adversarial example $x'_i$ of each data example $x_i$ as input.

4.5.3 Training Schedule

By combining the feature separation loss with reweighted adversarial training, the final objective function of Separable Reweighted Adversarial Training (SRAT) can be defined as:

$\min_\theta \frac{1}{n} \sum_{i=1}^{n} \max_{\|x'_i - x_i\|_p \le \epsilon} w_i \, \mathcal{L}(f_\theta(x'_i), y_i) + \lambda \, \mathcal{L}_{sep}(x'_i),$   (4.7)

where we use a hyper-parameter $\lambda$ to balance the contributions of the reweighted adversarial training and the feature separation loss. In practice, in order to better take advantage of the reweighting strategy in our SRAT method, we adopt a deferred reweighting training schedule [13]. Specifically, before annealing the learning rate, our SRAT method first trains a model guided by Eq. (4.7) without introducing the reweighting strategy, i.e., setting $w_i = 1$ for every input example $x'_i$, and then applies reweighting in the model training process with a smaller learning rate. Since our SRAT method enables the model to learn a more separable feature space, compared with applying the reweighting strategy from the beginning of training, this deferred re-balancing training schedule allows the reweighting strategy to obtain more benefits from SRAT and, as a result, boosts the performance of our SRAT method with the help of the reweighting strategy. The detailed training algorithm of SRAT is shown in Appendix 3.4.

4.6 Experiment

In this section, we perform comprehensive experiments to validate the effectiveness of our proposed SRAT method. We first compare our method with several representative imbalanced learning methods in adversarial training under various imbalanced scenarios and then conduct an ablation study to understand our method more deeply.

4.6.1 Experimental Settings

Datasets. We conduct experiments on multiple imbalanced training datasets artificially created from two benchmark image datasets, CIFAR10 [56] and SVHN [80], with diverse imbalanced distributions. Specifically, we consider two types of imbalance: Exponential (Exp) imbalance [26] and Step imbalance [9]. For Exp imbalance, the number of training examples of each class is reduced according to an exponential function $n = n_i \tau^i$, where $i$ is the class index, $n_i$ is the number of training examples for class $i$ in the original CIFAR10/SVHN training set, and $\tau \in (0, 1)$.
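As an illustration of this construction, the per-class sample counts can be computed as follows. This is a minimal sketch: the mapping from the imbalance ratio $K$ to $\tau$ (namely $\tau = K^{-1/(C-1)}$ for $C$ classes) and the "reduce half of the classes" rule for Step imbalance are assumptions used only to make the example concrete.

```python
# Hypothetical helpers for building per-class sample counts; illustrative only.
def exp_imbalance_counts(n_per_class=5000, num_classes=10, imbalance_ratio=100):
    # Assume tau is chosen so the rarest class keeps n_per_class / K samples.
    tau = imbalance_ratio ** (-1.0 / (num_classes - 1))
    return [round(n_per_class * tau ** i) for i in range(num_classes)]

def step_imbalance_counts(n_per_class=5000, num_classes=10, imbalance_ratio=100):
    # Assume half of the classes stay at full size; the other half are reduced by K.
    half = num_classes // 2
    return [n_per_class if i < half else n_per_class // imbalance_ratio
            for i in range(num_classes)]

print(exp_imbalance_counts())   # counts decaying from 5000 down to 50 when K = 100
print(step_imbalance_counts())  # [5000]*5 + [50]*5 when K = 100
```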
We categorize the five most frequent classes in the constructed imbalanced training dataset as well-represented classes and the remaining five classes as under-represented classes. For Step imbalance, we follow the same process adopted in Section 4.3.1 to construct imbalanced training datasets based on CIFAR10 and SVHN, separately. Moreover, for both imbalance types, we denote the imbalance ratio $K$ as the ratio between the training example sizes of the most frequent and the least frequent class. In our experiments, we construct four different imbalanced datasets, named "Step-100", "Step-10", "Exp-100", and "Exp-10", by adopting different imbalance types (Step or Exp) with different imbalance ratios ($K = 100$ or $K = 10$) to train models, and we evaluate the models' performance on the original uniformly distributed test sets of CIFAR10 and SVHN, respectively. More detailed information about the imbalanced training sets used in our experiments can be found in Appendix 3.5.

Baseline methods. We implement several representative and state-of-the-art imbalanced learning methods (or their combinations) in adversarial training as baseline methods. These methods include: (1) Focal loss (Focal); (2) LDAM loss (LDAM); (3) Class-balanced reweighting (CB-Reweight) [26], where each example is reweighted proportionally to the inverse of the effective number³ of its class; (4) Class-balanced Focal loss (CB-Focal) [26], a combination of the Class-balanced method [26] and the Focal loss [65], where well-classified examples are down-weighted while hard-classified examples are up-weighted, controlled by their corresponding effective numbers; (5) deferred reweighted CE loss (DRCB-CE), where a deferred reweighting training schedule is applied based on the CE loss; (6) deferred reweighted Class-balanced Focal loss (DRCB-Focal), where a deferred reweighting training schedule is applied based on the CB-Focal loss; (7) deferred reweighted Class-balanced LDAM loss (DRCB-LDAM) [13], where a deferred reweighting training schedule is applied based on the CB-LDAM loss. In addition, we also include the original PGD adversarial training method using the cross entropy loss (CE) in our experiments.

Our proposed methods. We evaluate three variants of our proposed SRAT method with different implementations of the prediction loss $\mathcal{L}(\cdot, \cdot)$ in Eq. (4.5), i.e., CE loss, Focal loss, and LDAM loss. The variant utilizing the CE loss is denoted as SRAT-CE, and, similarly, the other two variants are denoted as SRAT-Focal and SRAT-LDAM, respectively. For all three variants, the Class-balanced method [26] is adopted to set reweighting values within the deferred reweighting training schedule.

Implementation details. We implement all methods used in our experiments based on the PyTorch library DeepRobust [62]. For CIFAR10-based imbalanced datasets, the adversarial examples used in training are calculated by PGD-10, with a perturbation budget $\epsilon = 8/255$ and step size $\gamma = 2/255$. For robustness evaluation, we report robust accuracy under $l_\infty$-norm 8/255 attacks generated by PGD-20 on ResNet-18 [45] models. For SVHN-based imbalanced datasets, the setting is similar to that for the CIFAR10-based datasets, except that we set the step size $\gamma$ to 1/255 in both the training and test phases, as suggested in [118].
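Putting Eqs. (4.6)–(4.7) together, a single SRAT loss evaluation can be sketched as below. This is a minimal PyTorch-style sketch: the batch-level contrast, the feature normalization, and the temperature value are assumptions rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def feature_separation_loss(z_adv, y, tau=0.1):
    """Eq. (4.6) computed within a batch: z_adv are features of the adversarial
    examples (assumed L2-normalized here), y are their labels."""
    z = F.normalize(z_adv, dim=1)                          # cosine similarities (assumption)
    sim = z @ z.t() / tau                                  # z'_i . z'_a / tau
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = sim.masked_fill(self_mask, float("-inf"))     # A(i): everything but x'_i
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0 on the diagonal
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # P(i): same-class examples
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss_i = -(log_prob * pos_mask.float()).sum(dim=1) / pos_count
    return loss_i.mean()

def srat_loss(logits_adv, z_adv, y, weights, lam=1.0):
    """Weighted prediction loss plus lambda * feature separation loss, as in Eq. (4.7)."""
    ce = F.cross_entropy(logits_adv, y, reduction="none")
    return (weights * ce).mean() + lam * feature_separation_loss(z_adv, y)
```

Here `z_adv` would be the features of the PGD adversarial examples (assumed to be penultimate-layer features, matching Figure 4.3), and the weights are all ones before the deferred reweighting stage described next.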
For the deferred reweighting training schedule used in our methods and some baseline methods, we set the number of training epochs to 200 and the initial learning rate to 0.1, and then decay the learning rate at epochs 160 and 180 with a ratio of 0.01. The reweighting strategy is applied starting from epoch 160.

³The effective number is defined as the volume of examples and can be calculated by $(1 - \beta^{n_i})/(1 - \beta)$, where $\beta \in [0, 1)$ is a hyperparameter and $n_i$ denotes the number of examples of class $i$.

4.6.2 Performance Comparison

Tables 4.1 and 4.2 show the performance comparison on various imbalanced CIFAR10 datasets with different imbalance types and imbalance ratios. In these two tables, we use bold values to denote the highest accuracy among all methods, and underlined values to indicate the SRAT variants that achieve the highest accuracy among their corresponding baseline methods utilizing the same loss function for making predictions. Due to space limitations, we report the performance comparison on SVHN-based imbalanced datasets in Appendix 3.6.

From Table 4.1 and Table 4.2, we can make the following observations. First, compared to the baseline methods, our SRAT methods obtain improved performance in terms of both overall standard and robust accuracy under almost all imbalanced settings. More importantly, our SRAT methods make significant improvements on the under-represented classes, especially under the extremely imbalanced setting. For example, on the Step-imbalanced dataset with imbalance ratio $K = 100$, our SRAT-Focal method improves the standard accuracy on under-represented classes from 21.81%, achieved by the best baseline method utilizing the Focal loss, to 51.83%, and the robust accuracy from 3.24% to 15.89%. These results demonstrate that our proposed SRAT method achieves better robustness under imbalanced settings. Second, the performance gap among the three variants SRAT-CE, SRAT-Focal, and SRAT-LDAM is mainly caused by the gap between the loss functions used in these methods. As shown in Tables 4.1 and 4.2, DRCB-LDAM typically performs better than DRCB-CE and DRCB-Focal, and similarly, SRAT-LDAM outperforms SRAT-CE and SRAT-Focal under the corresponding imbalanced settings.

4.6.3 Ablation Study

In this subsection, we provide an ablation study to understand our SRAT method more comprehensively.

Feature space visualization. In order to facilitate the reweighting strategy in adversarial training under the imbalanced setting, we present a feature separation loss in our SRAT method.

Table 4.1: Performance comparison on imbalanced CIFAR10 datasets (Imbalance Type: Step).

                     Imbalance Ratio 10                    Imbalance Ratio 100
              Standard Acc.     Robust Acc.         Standard Acc.     Robust Acc.
Method        Overall  Under    Overall  Under      Overall  Under    Overall  Under
CE            63.26    40.62    36.96    14.23      47.29     9.03    30.39     1.62
Focal         63.57    41.17    36.89    14.25      47.36     9.03    30.12     1.45
LDAM          57.08    31.09    37.18    12.44      42.49     0.85    30.80     0.05
CB-Reweight   73.30    74.80    41.34    42.15      37.68    19.64    25.58    10.33
CB-Focal      73.47    73.69    41.19    41.02      15.44     0.00    14.46     0.00
DRCB-CE       75.89    70.55    39.93    33.33      53.40    22.86    28.31     3.35
DRCB-Focal    74.61    67.06    37.91    29.50      52.75    21.81    27.78     3.24
DRCB-LDAM     72.95    75.42    45.23    44.98      61.60    50.69    31.37    16.25
SRAT-CE       76.32    73.20    41.71    37.86      59.10    40.24    30.02    11.72
SRAT-Focal    75.41    74.91    42.05    41.28      62.93    51.83    28.38    15.89
SRAT-LDAM     73.99    76.63    45.60    45.96      63.13    52.73    33.51    18.89
Table 4.2: Performance comparison on imbalanced CIFAR10 datasets (Imbalance Type: Exp).

                     Imbalance Ratio 10                    Imbalance Ratio 100
              Standard Acc.     Robust Acc.         Standard Acc.     Robust Acc.
Method        Overall  Under    Overall  Under      Overall  Under    Overall  Under
CE            71.95    64.09    37.94    26.79      48.40    23.04    26.94     6.17
Focal         72.06    63.99    37.62    26.27      49.16    23.69    26.84     5.88
LDAM          67.39    58.01    41.35    28.65      48.39    25.69    29.51     8.95
CB-Reweight   75.17    76.87    41.02    41.67      57.49    56.47    29.01    26.53
CB-Focal      74.73    76.67    38.86    42.41      50.35    60.05    27.15    33.56
DRCB-CE       76.25    75.83    40.02    37.93      57.30    37.90    26.97    10.57
DRCB-Focal    75.36    72.72    37.76    33.83      54.76    31.79    25.24     7.81
DRCB-LDAM     73.92    78.53    46.29    48.81      62.65    57.19    31.66    22.11
SRAT-CE       76.94    79.50    41.50    43.08      64.93    64.34    29.68    25.42
SRAT-Focal    75.26    80.52    42.37    47.22      62.57    64.88    30.34    28.66
SRAT-LDAM     74.63    79.82    46.72    50.38      63.11    65.60    34.22    32.55

The main goal of the feature separation loss is to make the learned feature space as separable as possible. To check whether the feature separation loss works as expected, we apply t-SNE [106] to visualize the latent feature space learned by our SRAT-LDAM method in Figure 4.4. As a comparison, we also provide the visualization of the feature spaces learned by the original PGD adversarial training method (CE) and the DRCB-LDAM method. As shown in Figure 4.4, the feature space learned by our SRAT-LDAM method is more separable than those of the two baseline methods. This observation demonstrates that, with our proposed feature separation loss, the adversarially trained model is able to learn much better features, and thus our SRAT method can achieve superior performance.

Figure 4.4: t-SNE feature visualization of training examples learned by SRAT and two baseline methods using the imbalanced training dataset "Step-100". (a) CE. (b) DRCB-LDAM. (c) SRAT-LDAM.

Impact of reweighting values. In all SRAT variants, we adopt the Class-balanced method [26] to assign different weights to different classes based on their effective number. To explore how the assigned weights impact the performance of our proposed SRAT method, we conduct experiments on a Step-imbalanced CIFAR10 dataset with imbalance ratio $K = 100$ to see how the model's performance changes with different reweighting values. In our experiments, we assign the five well-represented classes weight 1 and vary the weight of the remaining five under-represented classes from 10 to 200. The experimental results are shown in Figure 4.5. Here, we use the integer 78 to approximate the weight calculated by the Class-balanced method when the imbalance ratio equals 100. From Figure 4.5, we can observe that, for all SRAT variants, the model's standard accuracy increases as the weights assigned to the under-represented classes increase. However, the robust accuracy of the three methods does not change synchronously with the standard accuracy. When increasing the weights for under-represented classes, the robust accuracy of SRAT-LDAM is almost unchanged, while the robust accuracy of SRAT-CE and SRAT-Focal even decreases slightly. As a trade-off, using a relatively large weight, such as 78 (the value given by the Class-balanced method) or 100 (equal to the imbalance ratio $K$), allows our SRAT method to obtain satisfactory performance in both standard and robust accuracy.

Figure 4.5: The impact of reweighting values using the imbalanced training dataset "Step-100".

Figure 4.6: The impact of the hyper-parameter $\lambda$ using the imbalanced training datasets "Step-100" and "Exp-100". (a) Step-100. (b) Exp-100.

Impact of hyper-parameter $\lambda$.
In our proposed SRAT method, the contributions of the feature separation loss and the prediction loss are controlled by the hyper-parameter $\lambda$. In this part, we study how this hyper-parameter affects the performance of our SRAT method. In our experiments, we evaluate the performance of all SRAT variants with different values of $\lambda$ used in the training process on both the Step-imbalanced and Exp-imbalanced CIFAR10 datasets with imbalance ratio $K = 100$. As shown in Figure 4.6, the performance of all SRAT variants is not very sensitive to the choice of $\lambda$. However, a large value of $\lambda$, such as 8, may hurt the model's performance.

4.7 Conclusion

In this work, we first empirically investigate the behavior of adversarial training under imbalanced settings and explore potential solutions to assist adversarial training in tackling imbalance issues. As neither adversarial training itself nor adversarial training with a reweighting strategy works well under imbalanced scenarios, we further theoretically verify that poor data separability is one key reason for the failure of adversarial training based methods in these scenarios. Based on our findings, we propose the Separable Reweighted Adversarial Training (SRAT) framework to facilitate the reweighting strategy in imbalanced adversarial training by enhancing the separability of the learned features. Through extensive experiments, we validate the effectiveness of SRAT. In the future, we plan to examine how other types of defense methods perform under imbalanced scenarios and how other balanced learning strategies from natural training behave under adversarial training.

CHAPTER 5
ENHANCING FAIRNESS MAY THREATEN ROBUSTNESS AGAINST POISONING ATTACKS¹

5.1 Introduction

Data poisoning attacks [6, 23, 100] have brought huge safety concerns to machine learning systems that are trained on data collected from public resources [55, 115]. For example, studies [6, 74, 11, 100] have shown that an attacker can inject only a small fraction of fake data into the training pool of a classification model and intensely degrade its accuracy. As countermeasures against data poisoning attacks, there are defense methods [100, 29] which can successfully identify the poisoning samples and sanitize the training dataset. Recently, in addition to model safety, people have also paid significant attention to fairness. They stress that machine learning models should provide "equalized" treatment or "equalized" prediction quality among groups of the population [42, 1, 34, 128]. Since fair classification problems are human-society related, it is highly likely that training data is provided by humans, which makes it easy for adversarial attackers to inject malicious data. Therefore, fair classification algorithms are also prone to being threatened by poisoning attacks. Since fair classification problems have optimization objectives and optimization processes distinct from traditional classification, a natural question is: Can we protect fair classification from data poisoning attacks? In other words, are existing defenses sufficient to defend fair classification models? To answer these questions, we first conduct a preliminary study on the Adult Census Dataset to explore whether existing defenses can protect fair classification algorithms (see Section 3.2). In this work, we focus on representative defense methods including the k-NN Defense [53] and SEVER [29].
To fully exploit their vulnerability to poisoning attacks, we introduce a new attacking algorithm, F-Attack, where the attacker aims to cause the failure of fair classification. In detail, by injecting the poisoning samples, the attacker aims to mislead the trained classifier such that it cannot achieve good accuracy, or cannot satisfy the fairness constraints. From the preliminary results, we find that both the k-NN defense and SEVER suffer an obvious accuracy or fairness degradation after the attack. Moreover, we also compare F-Attack with one of the strongest poisoning attacks, the Min-Max Attack [100] (which is devised for traditional classification). The result demonstrates that our proposed F-Attack has a better attacking effect than the Min-Max Attack. In conclusion, our preliminary study highlights the vulnerability of fair classification to poisoning attacks, especially to F-Attack.

¹This chapter is based on previously published work by Han Xu, Xiaorui Liu, Yuxuan Wan, Jiliang Tang titled "Towards Fair Classification against Poisoning Attacks", submitted to the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining [123].

In this paper, we further propose a defense framework, Robust Fair Classification (RFC), to improve the robustness of fair classification against poisoning attacks. Different from existing defenses, our method aims to scout abnormal samples within each individual sensitive subgroup of each class. To achieve this goal, RFC first applies a strategy similar to the works [30, 29] to find abnormal data samples which significantly deviate from the distribution of the other (clean) samples. Based on a theoretical analysis, we verify that RFC can exclude more poisoning samples than clean samples in each step. Moreover, to further avoid removing too many clean samples, we introduce an Online-Data-Sanitization process: in each iteration, we remove a possible poisoning set from a single subgroup of a single class and test the retrained model's performance on a clean validation set. This helps us locate the subgroup which contains the most poisoning samples. Through extensive experiments on two benchmark datasets, the Adult Census Dataset and COMPAS, we validate the effectiveness of our defense. Our key contributions are summarized as follows:

• We devise a strong attack method to poison fair classification and demonstrate the vulnerability of fair classification, even under the protection of traditional defenses against poisoning attacks.

• We propose an efficient and principled framework, Robust Fair Classification (RFC). Extensive experiments and theoretical analysis demonstrate the effectiveness and reliability of the proposed framework.

5.2 Problem Statement and Notations

In this section, we formally define the setting of our studied problem and the necessary notations.

Fair Classification. In this paper, we focus on classification problems which incorporate group-level fairness criteria. First, let $x \in \mathbb{R}^d$ be a random vector denoting the (non-sensitive) features, with a label $y \in \mathcal{Y} = \{Y_1, ..., Y_m\}$ with $m$ classes and a sensitive attribute $z \in \mathcal{Z} = \{Z_1, Z_2, ..., Z_k\}$ with $k$ groups. Let $f(x, w)$ represent a classifier with parameters $w \in \mathcal{W}$. Then, a fair classification problem can be defined as:

$\min_{w} \; \mathbb{E}\big[\, l(f(x, w), y) \,\big] \quad \text{s.t.} \quad g_j(w) \le \tau, \;\; \forall j \in \mathcal{Z}$   (5.1)

where $\mathbb{E}[l(\cdot)]$ is the expected loss on the test distribution and $\tau$ is the unfairness tolerance.
The constraint function $g_j(w) = \mathbb{E}\big[\, h(w, x, y) \mid z = j \,\big]$ represents the desired fairness metric for each group $j \in \mathcal{Z}$. For example, in binary classification problems, we use $f(x, w) > 0$ to indicate a positive classification outcome. Then, $h(w, x, y) = \mathbf{1}(f(x, w) > 0) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0)\big]$ refers to equalized positive rates in the equalized treatment criterion [73]. Similarly, $h(w, x, y) = \mathbf{1}(f(x, w) > 0 \mid y = \pm 1) - \mathbb{E}\big[\mathbf{1}(f(x, w) > 0 \mid y = \pm 1)\big]$ is for equalizing true / false positive rates in equalized odds [42]. Given any training dataset $D$, we define the empirical loss $L(D, w)$ as the average loss value of the model on $D$, and $g_j(D, w)$ as the empirical fairness constraint function. In our paper, we assume that the clean training samples are sampled from the true distribution $\mathcal{D}$ with density $P(x, y, z)$. We also use $\mathcal{D}_{u,v}$ to denote the distribution of (clean) samples given $y = Y_u$ and $z = Z_v$, which has density $P(x, y, z \mid y = Y_u, z = Z_v)$.

Poisoning Attack ($\epsilon$-poisoning model). In our paper, we consider the poisoning attack in the following scenario. Given a fair classification task, the poisoned dataset is generated as follows: first, $n$ clean samples $D_C = \{(x_i, y_i, z_i)\}_{i=1}^{n}$ are drawn from $\mathcal{D}$ to form the clean training set. Then, an adversary is allowed to insert an $\epsilon$ fraction of $D_C$ with arbitrary choices of $D_P = \{(x_i, y_i, z_i)\}_{i=1}^{\epsilon n}$. We refer to such a poisoned training set $D_C \cup D_P$ as an $\epsilon$-poisoning model.

5.3 Fair Classification is Vulnerable to Poisoning Attacks

In this section, we first introduce the algorithm of our proposed attack F-Attack under the $\epsilon$-poisoning model. Then, we conduct empirical studies to evaluate the robustness of fair classification algorithms (and popular defenses) against F-Attack and baseline attacks.

5.3.1 F-Attack: Poisoning Attacks for Fair Classification

Given a specific fair classification task, we consider an attacker who aims to contaminate the training set such that applying existing algorithms cannot successfully fulfill the fair classification goal. Note that for fair classification tasks, both accuracy and fairness are desired properties and they always have strong tension in practice [75]. Therefore, in our attack, we consider misleading the training algorithms such that at least one of the two criteria is unsatisfied. Formally, we define the attacker's objective as Eq. (5.2), where the attacker inserts a poisoning set $D_P$ with size $\epsilon n$ in the feasible injection space $\mathcal{F}$ to achieve:

$\max_{D_P \subseteq \mathcal{F}} \; \mathbb{E}\big[\, l(f(x, w^*), y) \,\big] \quad \text{s.t.} \quad w^* = \arg\min_{w \in \mathcal{H}_{fair}} L(D_C \cup D_P, w).$   (5.2)

It means that if the classifier $w^*$, which is trained on $D_C \cup D_P$ and has a low empirical loss $L(D_C \cup D_P, w^*)$, falls in the space $\mathcal{H}_{fair}$, then it will have a large expected loss (on the test distribution $\mathcal{D}$). Here, $\mathcal{H}_{fair}$ is the space of models (with a limited norm) that satisfy the fairness criteria on the clean distribution $\mathcal{D}$. To take a closer look at Eq. (5.2), we discuss it case by case. Suppose we obtain $w^*$ by fair classification on the set $D_C \cup D_P$; there are two cases:

1. $w^* \notin \mathcal{H}_{fair}$: The fairness criteria (on the test set) are not satisfied.
2. $w^* \in \mathcal{H}_{fair}$: Since $w^*$ is trained on $D_C \cup D_P$ to have low $L(D_C \cup D_P, w^*)$, $w^*$ will have high test error.

In either case, the model $w^*$ will have either unsatisfactory accuracy or unsatisfactory fairness. Next, we simplify the objective and constraints in Eq. (5.2) to transform it into a solvable problem.
We first conduct relaxations of the objective in Eq. (5.2) (similar to the works [100, 53]):

$\mathbb{E}\big[\, l(f(x, w), y) \,\big] \;\overset{(i)}{\approx}\; L(D_C, w) \;\overset{(ii)}{\le}\; L(D_C, w) + \epsilon L(D_P, w) = (1 + \epsilon) L(D_C \cup D_P, w)$

Specifically, approximation (i) holds if the clean training data $D_C$ has sufficient samples and is close to the test distribution $\mathcal{D}$, and the model $w$ is appropriately regularized. The upper bound (ii) holds because of the non-negativity of the loss values, and it can be tight if the fraction of poisoning samples $\epsilon$ is small. Thus, we transform Eq. (5.2) into a bi-level optimization problem between $w$ and $D_P$. If the model $f(\cdot)$ and loss $l(\cdot)$ are convex, we can further swap the two levels to get a min-max form:

$\max_{D_P \subseteq \mathcal{F}} \min_{w \in \mathcal{H}_{fair}} L(D_C \cup D_P, w) \;\;\rightarrow\;\; \min_{w \in \mathcal{H}_{fair}} \max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w)$   (5.3)

Our proposed F-Attack solves Eq. (5.3) as shown in Algorithm 3. It solves a saddle point problem by alternately finding the worst attack point $(x, y, z)$ w.r.t. the current model and then updating the model in the direction of the attack point. In detail, in Step (1), given the current model $w$, we solve the inner maximization problem to maximize $L(D_C \cup D_P, w)$, which is equal to finding the sample $(x, y, z)$ with the maximal loss:

$\max_{D_P \subseteq \mathcal{F}} L(D_C \cup D_P, w) = L(D_C, w) + \epsilon \cdot \max_{(x, y, z) \in \mathcal{F}} l(f(x, w), y).$   (5.4)

In Step (2), we update $w$ to minimize $L(D_C \cup D_P, w)$. Note that in Step (2) we should also constrain the model $w$ to fall into the fair model space $\mathcal{H}_{fair}$. Thus, when we update $w$ in Step (2), we also penalize the fairness violation of $w$. Here, we calculate the fairness violation as $(g_j(D_C, w) - \tau)_+$ (with weight parameter $\lambda > 0$) on the clean set $D_C$ to approximate the fairness violation on the real data $\mathcal{D}$.

Algorithm 3 Algorithm of F-Attack.
1: Input: Clean data $D_C$, number of poisoned samples $\epsilon n$, feasible set $\mathcal{F}$, fairness constraint functions $g_j(\cdot)$ and tolerances $\tau_j$ ($j = 1, ..., |\mathcal{Z}|$), $\lambda > 0$, $\eta > 0$, warm-up steps $n_{burn}$.
2: Output: A poisoning set $D_P$.
3: repeat
4:   Solve the inner maximization: $(x, y, z) = \arg\max_{(x, y, z) \in \mathcal{F}} l(f(x, w), y)$
5:   Solve the outer minimization: $w = w - \eta \cdot \nabla_w \big( L(D_C \cup D_P, w) + \lambda \sum_j (g_j(D_C, w) - \tau)_+ \big)$
6: until the algorithm converges

5.3.2 Preliminary Study on Adult Census Dataset

Adult Census Dataset. In this subsection, we conduct an experiment on the Adult Census Dataset [54] to test whether F-Attack can poison fair classification methods and whether existing defense methods can resist F-Attack. Here, we focus on the fairness criterion Equalized True Positive Rate (TPR) [42] between the genders, and we apply the constrained optimization method [34] to train linear classifiers to fulfill the fair classification objective. It is worth mentioning that this dataset contains many categorical features, such as marital-status, occupation, etc. For simplicity, we preprocess the dataset by transforming the categorical features into a continuous space spanned by the first 15 principal directions of the training (categorical) data. More details of the pre-processing procedure can be found in Appendix 4.2.

Figure 5.1: PCA visualization of Clean Samples and Poisoning Samples.

Defense Methods. Besides naive fair classification, we mainly consider two representative data-sanitization defenses, which are existing popular methods to defend against poisoning attacks:

• k-NN Defense [53]. This method removes the samples that are far from their k nearest neighbors. In detail, the k-NN defense calculates the "abnormal" score as $q_i = \|x_i - x_i^{(k,y)}\|_2$, where $x_i^{(k,y)}$ is the k-th nearest neighbor of sample $x_i$ in class $y$.
In this paper, we set $k = 5$.

• SEVER [29]. This method aims to find abnormal samples by tracing abnormal gradients. In each iteration, we first train a fair classifier $f$ (with fixed $\tau$) and calculate the gradient of the loss w.r.t. the weight $w$ for each training sample $(x_i, y_i)$, obtaining the normalized gradient matrix $Q = \big[ \nabla_w l(f(x_i, w), y_i) - \frac{1}{n} \sum_{j=1}^{n} \nabla_w l(f(x_j, w), y_j) \big]_{i=1,\dots,n}$. SEVER flags the samples with large "abnormal" score $q_i = (Q_i \cdot v)^2$ as abnormal samples, where $v$ is the top right singular vector of $Q$. Intuitively, the "abnormal" samples contribute greatly to the variation of the gradient matrix $Q$, which suggests that their gradients significantly deviate from the gradients of the other samples.

Results. In our experiments, we insert 10% poisoning samples into the training set and define the feasible injection set $\mathcal{F}$ to be $\{(x, y, z) : \|x - \mu_y\| \le d\}$, where $d$ is a fixed radius. This constrains the inserted samples to not be too far from the center of their labeled class, in order to evade potential defenses. Since $\mathcal{F}$ is not related to the sensitive attribute $z$, during F-Attack we generate poisoning samples with $z$ fixed to 0 (female) or 1 (male). During fairness training, we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following [59]). Then, we report the test performance of the model with the best validation performance (which considers both accuracy and fairness; see Section 5.4, Eq. (5.9) for more details). In Table 5.1, we report the performance² of the defense methods. From the results, we can see that all training methods suffer a significant performance degradation under F-Attack. For example, under F-Attack (z = 0), the SEVER defense has an ≈ 4% accuracy drop and a 2% fairness drop. This suggests that defenses such as SEVER and k-NN can be greatly vulnerable to poisoning attacks in fair classification. Moreover, we compare F-Attack with the baseline attack method Min-Max [100, 53], which is one of the strongest attacks for traditional classification. It also solves Eq. (5.3) but does not constrain $w \in \mathcal{H}_{fair}$. From Table 5.1, we can see that Min-Max causes a milder performance degradation and thus has a weaker attacking effect than F-Attack. This result highlights the threat of F-Attack to fair classification.

²We report the "goodness of fairness" as 1 − Unfair, i.e., 1 − |TPR(z = 0) − TPR(z = 1)|, in Table 5.1.

Table 5.1: F-Attack vs. Min-Max on Adult Census Dataset.

              No Attack      Min-Max (z=0)   Min-Max (z=1)   F-Attack (z=0)  F-Attack (z=1)
              Acc.   Fair.   Acc.   Fair.    Acc.   Fair.    Acc.   Fair.    Acc.   Fair.
No Defense    0.811  0.962   0.801  0.958    0.799  0.851    0.768  0.956    0.793  0.803
k-NN          0.794  0.952   0.681  0.936    0.680  0.904    0.655  0.951    0.695  0.895
SEVER         0.812  0.969   0.798  0.958    0.797  0.967    0.773  0.942    0.772  0.943

Discussion. To gain a deeper understanding of the behavior of F-Attack, in Figure 5.1 we visualize the clean samples and poisoning samples (via F-Attack and Min-Max Attack) in a 2-dimensional projected space (via PCA). From the figure, we can see that, compared to the Min-Max Attack (red), the samples obtained by F-Attack (yellow) have a smaller distance to the clean samples of their labeled class $y = 1$, although they are constrained in the same feasible injection set $\mathcal{F}$. This is because F-Attack aims to find samples with maximal loss (Eq. (5.4)) for fair classifiers, so the generated samples do not have the maximal loss for traditional classifiers. Thus, the poisoning samples from F-Attack stay closer to their labeled class.
This fact helps explain why F-Attack is more insidious than the Min-Max Attack under the detection of traditional defenses, such as SEVER.

5.4 Robust Fair Classification (RFC)

Motivated by the studies in Section 5.3, new defenses are desired to protect fair classification against poisoning attacks, especially against F-Attack. In this section, we first introduce a novel defense framework called Robust Fair Classification (RFC), and we provide a theoretical study to further understand the mechanism of RFC. In Section 5.5, we conduct empirical studies to validate the robustness of RFC in practice.

5.4.1 Robust Fair Classification (RFC)

Based on the discussion in Section 5.3, F-Attack can evade traditional defenses such as SEVER and the k-NN Defense because the generated poisoning samples are close to the clean data of their labeled class. However, we assume that they may deviate from the distribution of clean samples in their labeled subgroup (the data distribution $\mathcal{D}_{y,z}$ given $y$ and $z$). Refer to Figure 5.2, which shows the locations of poisoning samples generated via F-Attack (z = 0) and of clean samples in the subgroup (y = 1, z = 0) in the 2D projected space. It suggests that the poisoning samples greatly contaminate the information of the distribution $P(x, y, z \mid y = 1, z = 0)$ in the training data. Thus, the injected poisoning samples will not only confuse the original prediction task from $x$ to $y$, but also greatly disturb the fairness constraints (Eq. (5.1)). This observation motivates us to propose a new defense method that can scout abnormal samples from each individual subgroup in each class. Next, we introduce the details of our proposed defense RFC.

Figure 5.2: PCA visualization.

Centered Data Matrix & Alignment Score. Our method shares a similar high-level idea as [30, 29]: to find data points that systematically deviate from the distribution of the other (clean) samples. Specifically, in our method, given a (poisoned) dataset $D$, we repeatedly scout the poisoning samples from each subgroup $D(x, y, z \mid y = Y_u, z = Z_v)$, where we use $(u, v)$ to denote the index of each class and subgroup. In the later parts, we use $D_{u,v}$ to denote the samples in $D(x, y, z \mid y = Y_u, z = Z_v)$ for simplicity. Then, we define:

$Q^{u,v} = \Big[\, x_i - \frac{1}{n_{u,v}} \sum_{j=1}^{n_{u,v}} x_j \,\Big]_{(x_i, y_i, z_i) \in D_{u,v}},$   (5.5)

to be the centered data matrix of the training samples $D_{u,v}$, where $n_{u,v}$ is the size of the set $D_{u,v}$. For each $Q^{u,v}$, the top right singular vector $V^{u,v}$ of $Q^{u,v}$ is the direction which explains the variation of the data distribution in $(Y_u, Z_v)$. Similar to the studies in [30, 29], we conjecture that the poisoning samples deviate from the clean samples, so they take the major responsibility for the variation of the data matrix $Q^{u,v}$; in this way, they will have high alignments with the direction of $V^{u,v}$. Thus, we define the Alignment Score $q^{u,v}$ of each training sample $(x_i, y_i, z_i)$ in $D_{u,v}$ as:

$q^{u,v}(x_i) = Q^{u,v}_i \cdot (V^{u,v})^T.$   (5.6)

Notably, the poisoning samples may align with either the positive or the negative direction of the top right singular vector $V^{u,v}$, but they should share one common direction among themselves.
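As a concrete reference, the centered data matrix and alignment scores of Eqs. (5.5)–(5.6) for one subgroup can be computed as follows. This is a minimal NumPy sketch; the use of `numpy.linalg.svd` and the array layout are implementation assumptions, not the authors' released code.

```python
import numpy as np

def alignment_scores(X_subgroup):
    """X_subgroup: (n_uv, d) array of features of all samples with y = Y_u, z = Z_v.
    Returns the alignment score q^{u,v}(x_i) of Eq. (5.6) for each sample."""
    Q = X_subgroup - X_subgroup.mean(axis=0, keepdims=True)   # Eq. (5.5): centered matrix
    _, _, vt = np.linalg.svd(Q, full_matrices=False)          # SVD of the centered matrix
    v_top = vt[0]                                             # top right singular vector V^{u,v}
    return Q @ v_top                                          # Q_i . V^{u,v} for each row i
```

Samples whose scores have a large magnitude (of either sign) are the candidates collected into the proposed poisoning sets defined next.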
Because of this sign ambiguity, in our method we define two Proposed Poisoning Sets for each $(u, v)$, so that one of the two sets is likely to contain the poisoning samples:

$P^{u,v}_+ = \{ x_i \mid q^{u,v}(x_i) > \gamma_+, \; q^{u,v}(x_i) > 0 \}; \quad P^{u,v}_- = \{ x_i \mid -q^{u,v}(x_i) > \gamma_-, \; q^{u,v}(x_i) < 0 \};$   (5.7)

In Eq. (5.7), we set $\gamma_+$ (or $\gamma_-$) to be the q-th ($q = 90$) percentile of all the alignment scores (or negative alignment scores) in $D_{u,v}$, so that each proposed poisoning set only contains a small portion of $D_{u,v}$. In practice, we repeatedly test whether removing the proposed poisoning sets helps improve the retrained model's performance (both accuracy and fairness) on a clean validation set. This helps decide whether the proposed poisoning set contains poisoning samples. In Section 5.4.2, we further conduct a theoretical analysis to show that the poisoning samples are more likely to have higher poisoning scores.

Fair Classification by Excluding Poisoning Sets. Finally, given a (poisoned) training set $D$ as well as the proposed poisoning sets, we can keep retraining fair classifiers by excluding the proposed poisoning sets:

$w_+ = \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_+) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_+; w) \le \tau_j, \; \forall j \in \mathcal{Z}$
$w_- = \arg\min_{w \in \mathcal{W}} L(f(X, w), Y; D \setminus P^{u,v}_-) \quad \text{s.t.} \quad g_j(D \setminus P^{u,v}_-; w) \le \tau_j, \; \forall j \in \mathcal{Z}$   (5.8)

In practice, our proposed RFC method repeatedly proposes potential poisoning sets for each individual subgroup in each class $D_{u,v}$ until finding the best poisoning set among all choices. Algorithm 4 provides a detailed description of the RFC procedure, which is an Online Data Sanitization process. Specifically, during each iteration of RFC, for each $D_{u,v}$, we first calculate the poisoning scores and the proposed poisoning sets (Steps (1) & (2)). Then, we remove the proposed poisoning set from the dataset $D$ and conduct fair classification on $D$ (Step (3)). In Step (4), we evaluate the retrained classifier on a separate clean validation set and find the proposed poisoning set which results in the highest validation performance. Notably, we measure the validation performance by considering both the accuracy and the fairness criteria, by defining:

$\text{ValScore} = \Pr\big(f(x, w) = y\big) - \sum_{j \in \mathcal{Z}} \lambda \cdot \big(g_j(D_{Val}, w) - \tau\big)_+,$   (5.9)

where $\tau$ is an unfairness tolerance threshold and $\lambda$ is a positive number (we set $\lambda = 3$ in this paper). The second term penalizes the models if some subgroup's unfairness violation exceeds $\tau$. Finally, we remove the proposed poisoning set which results in the highest ValScore and conduct the next round of searching (Step (6)).

Algorithm 4 Robust Fair Classification (RFC) - An Online Data Sanitization Algorithm.
1: Input: An $\epsilon$-poisoning model with dataset $D = \{(x_i, y_i, z_i)\}_{i=1,2,...,n}$, number of RFC iterations $T$.
2: Output: A fair classifier.
3: repeat
4:   repeat
5:     Get the Centered Data Matrix $Q^{u,v}$ and Poisoning Scores $q^{u,v}_i$ following Eq. (5.5) and Eq. (5.6)
6:     Get the Proposed Poisoning Sets $P^{u,v}_+$ and $P^{u,v}_-$ following Eq. (5.7)
7:     Conduct fair classification by removing each Proposed Poisoning Set via Eq. (5.8) and get $w^{u,v}_\pm$.
8:     Record the ValScore on a separate (clean) validation set for each $w^{u,v}_\pm$.
9:   until all subgroups are checked
10:  Get the best proposed poisoning set $P^*$ which achieves the highest ValScore across all $(u, v)$ and $P^{u,v}_\pm$.
11:  Remove the best proposed poisoning set from $D$ and set $D = D \setminus P^*$
12:  t = t + 1
13: until t is larger than T

Remarkably, it is also worth mentioning that the RFC framework can be extended to various model architectures, such as Deep Neural Networks (DNNs), for robust fair classification. For example, we can apply Energy-based Out-of-Distribution Detection [69] to find the abnormal samples in each $D_{u,v}$ and then follow a procedure similar to RFC to propose poisoning sets and conduct fair classification. We leave the study of DNNs for future exploration.

5.4.2 Theoretical Analysis

In this subsection, we conduct a theoretical analysis to further help understand the behavior of RFC, especially the role of the poisoning scores in finding poisoning samples. In particular, we consider a simple theoretical setting where the clean samples from each group $\mathcal{D}_{u,v}$ follow a distinct Gaussian distribution $\mathcal{D}_{u,v} \sim \mathcal{N}(\mu_{u,v}, \Sigma_{u,v})$, with center $\mu_{u,v}$ and covariance matrix $\Sigma_{u,v}$. In the following theorem, we show that when there are (poisoning) samples that deviate from the clean samples of $\mathcal{D}_{u,v}$ by having a center $\mu$ far from the center of the clean samples, they will have larger squared poisoning scores (Eq. (5.6)) than the clean samples. Thus, the proposed poisoning sets (Eq. (5.7)) are likely to contain more poisoning samples than clean samples. For simplicity, we use $\mathcal{D}$ and $\mathcal{N}(\mu, \Sigma)$ to denote the clean distribution of a given group $D_{u,v}$.

Theorem 5. Suppose that a set of "clean" samples $\mathcal{S}_{good}$ with size $n$ is i.i.d. sampled from the distribution $\mathcal{N}(\mu, \Sigma)$, where $\Sigma \preceq \sigma^2 I$. There is a set of "bad" samples $\mathcal{S}_{bad}$ with size $n_p = n/K$, $K > 1$, whose center satisfies $\|\mu_p - \mu\|_2 = d$ with $d = \gamma \cdot \sigma$. Then, the average squared poisoning scores of the clean samples and the bad samples satisfy:

$\mathbb{E}_{i \in \mathcal{S}_{bad}}\big[ q^2(x_i) \big] - \mathbb{E}_{i \in \mathcal{S}_{good}}\big[ q^2(x_i) \big] \ge \Big( \frac{K-1}{K+1} \cdot \gamma^2 - (K+1) \Big) \sigma^2$

Theorem 5 suggests that the difference between the average (squared) poisoning scores of $\mathcal{S}_{bad}$ and $\mathcal{S}_{good}$ is controlled by $\gamma$ and the sample ratio $K$. Since $K > 1$, if $\gamma^2 > \frac{(K+1)^2}{K-1}$ (which means the poisoning samples are sufficiently far from the clean distribution), we can conclude that the difference is positive. Thus, removing the samples with the highest positive (or lowest negative) poisoning scores (as in Eq. (5.7)) will help eliminate more poisoning samples than clean samples. If $\gamma^2$ is small, the poisoning samples are close to the true distribution, which also means they have limited influence on the model performance. The detailed proof of Theorem 5 is deferred to Appendix 4.1. In our RFC algorithm, we alternately check each proposed poisoning set and see whether removing it helps improve the retrained model's performance. This also avoids removing too many clean samples.

5.5 Experiment

5.5.1 Experimental Setup

In this section, we conduct comprehensive experiments to validate the effectiveness of our proposed attack and defense on two benchmark datasets, the Adult Census Dataset and the COMPAS Dataset. In this part, we only consider the Equalized True Positive Rate (TPR) between different sensitive subgroups, which is optimized via the fair classification method [34]. When applying [34], we train multiple models with various hyperparameters to control the unfairness tolerance on the training set (following [59]). Then, we report the test performance of the model with the best validation performance (which considers both accuracy and fairness; see Section 5.4, Eq. (5.9)).
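For reference, the validation score in Eq. (5.9) that drives this model selection can be sketched as follows. This is a minimal sketch; the function and argument names, and the default tolerance value, are illustrative assumptions.

```python
import numpy as np

def val_score(y_pred, y_true, group_violations, tau=0.05, lam=3.0):
    """Eq. (5.9): validation accuracy minus a penalty on fairness violations.
    group_violations[j] plays the role of g_j(D_Val, w) for each sensitive group j."""
    acc = np.mean(np.asarray(y_pred) == np.asarray(y_true))
    penalty = lam * np.sum(np.maximum(np.asarray(group_violations) - tau, 0.0))
    return acc - penalty
```

A higher ValScore indicates a retrained classifier that is both accurate and within the unfairness tolerance, which is how RFC (and the model selection above) picks among candidates.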
In Appendix 4.3, we provide additional results for a different type of fairness, "Equalized Treatment", and a different fair classification method [128].

Attacks. We consider that the training set can be contaminated by Label Flipping [84] and Sensitive Attribute Flipping [110]. We also consider the attack methods Min-Max and F-Attack, which are introduced in Section 5.3. Notably, for each method, we assume the poisoning samples are constrained in the same feasible injection set $\mathcal{F} = \{(x, y, z) : \|x - \mu_y\| \le d\}$, which limits the poisoning samples' distance to the class center. Thus, for Min-Max and F-Attack, we assign the generated samples a pre-defined sensitive attribute $z = 0$ or $z = 1$. Furthermore, we introduce an additional attack method, "F-Attack*", which has the same algorithm as F-Attack but a different feasible injection set: $\mathcal{F} = \{(x, y, z) : \|x - \mu_{y,z}\| \le d\}$, where $\mu_{y,z}$ is the center of the group $\mathcal{D}_{y,z}$. Because this feasible injection set is related to the sensitive attribute $z$, we do not need to pre-define $z$ during F-Attack*. Remarkably, this attack aims to test the robustness of RFC, because the major goal of RFC is to find samples in each group $D_{y,z}$ which are far from $\mu_{y,z}$. Thus, F-Attack* may evade RFC by constraining the poisoning samples' distance to $\mu_{y,z}$. In Appendix 4.3, we also report the performance of all attacks and defenses under different choices of the radius $d$.

Baseline Defenses. To validate the effectiveness of RFC, we include the following baseline defense methods: (1) the naive method, which does not apply any defense strategy; (2) SEVER [29], a representative defense for traditional classification tasks, where we apply [34] on the dataset sanitized by SEVER; (3) [90], a method to defend fair classification against label flipping attacks, which leverages an adversarial training strategy [132]; and (4) the method [110], which applies Distributionally Robust Optimization (DRO) to improve robustness when labels and sensitive attributes are contaminated. For the baseline methods, we report their performance with the choice of hyperparameter that achieves the optimal ValScore (Eq. (5.9)) on a clean validation set.

Table 5.2: RFC & Baseline Methods' Performance against Poisoning Attacks on Adult Census Dataset.

              No Attack      Label Flip (10%)  Label Flip (20%)  Attr. Flip (10%)  Attr. Flip (20%)
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.817  0.963   0.805  0.966      0.800  0.968      0.812  0.945      0.804  0.951
SEVER         0.812  0.960   0.805  0.967      0.803  0.960      0.813  0.937      0.803  0.944
Wang          0.809  0.958   0.793  0.970      0.779  0.961      0.809  0.945      0.813  0.935
Roh           0.805  0.948   0.798  0.937      0.788  0.939      0.800  0.950      0.795  0.943
RFC           0.811  0.959   0.807  0.973      0.796  0.966      0.805  0.967      0.802  0.965

(ε = 10%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.801  0.969   0.797  0.864      0.769  0.966      0.796  0.804      0.799  0.799
SEVER         0.800  0.960   0.795  0.954      0.778  0.945      0.773  0.937      0.772  0.956
Wang          0.782  0.976   0.783  0.968      0.779  0.956      0.791  0.948      0.792  0.945
Roh           0.780  0.963   0.782  0.961      0.765  0.959      0.766  0.956      0.776  0.944
RFC           0.802  0.967   0.811  0.946      0.803  0.950      0.808  0.952      0.809  0.951

(ε = 15%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.792  0.961   0.787  0.837      0.686  0.953      0.775  0.779      0.788  0.760
SEVER         0.792  0.959   0.783  0.978      0.765  0.952      0.755  0.909      0.765  0.927
Wang          0.777  0.970   0.779  0.965      0.738  0.955      0.762  0.960      0.711  0.955
Roh           0.726  0.948   0.748  0.957      0.722  0.962      0.764  0.929      0.774  0.929
RFC           0.801  0.954   0.796  0.941      0.800  0.963      0.795  0.947      0.802  0.945

5.5.2 Experimental Results

Adult Census Dataset. We first show the results on the Adult Census Dataset in Table 5.2. To further guarantee that the comparison between different defenses is fair, we use a balanced clean training dataset in which each class has an equal number of samples, since baseline methods such as SEVER can be affected by class imbalance. On this dataset, we set the desired fairness criterion to be |TPR(z = 0) − TPR(z = 1)| < 0.05. In Table 5.2, we also mark the cases (with brown color) where the algorithms output models with much poorer fairness than desired. From Table 5.2, we can see that RFC achieves good accuracy and fairness under the different types of dataset contamination. Especially under strong attacks such as F-Attack, the accuracy and fairness are only slightly degraded after injecting poisoning samples. However, baseline methods such as [110] and [90] have a clear performance (especially accuracy) degradation under F-Attack. Notably, the attack method F-Attack* has attacking performance similar to F-Attack (z = 1). This is because, on this dataset, F-Attack* also generates samples with z = 1.

COMPAS Dataset. On the COMPAS dataset [8], we consider the same type of fairness criterion, Equalized TPR. In this dataset, equity is desired among races, namely "Caucasian (z = 0), African-American (z = 1) and Hispanic (z = 2)", and we follow a preprocessing procedure similar to that for the Adult Census Dataset. The number of samples in the group "Hispanic" is much smaller than in the other two groups, so we only consider injecting poisoning samples with z = 0 or z = 1. In Table 5.3, we report the performance of the studied attacks and defenses, and we use $1 - \max_{j \in \mathcal{Z}} |\text{TPR}(z = j) - \widehat{\text{TPR}}|$ to measure the "goodness" of fairness, where $\widehat{\text{TPR}}$ is the average TPR over the whole dataset. During training, we set the desired fairness criterion to be $\max_{j \in \mathcal{Z}} |\text{TPR}(z = j) - \widehat{\text{TPR}}| \le 0.15$. From the results in Table 5.3, we can see that RFC is the only method that consistently preserves model accuracy and fairness after poisoning samples are injected into the dataset.

5.6 Related Works

Poisoning Attacks. In this section, we introduce related work and discuss how this work differs from prior studies. Data poisoning attacks [6] refer to the scenario in which models are threatened by adversaries who insert malicious training samples in order to take control of the trained model's behavior [64, 97]. In this work, we concentrate on untargeted poisoning attacks [6, 53], where the attacker aims to degrade the overall performance of the trained model. To defend against poisoning attacks, well-established methods [116, 91, 100, 29, 103, 114] have been proposed to defend efficiently and effectively in various scenarios. This paper is within the scope of linear classification problems, and we leave the study of DNN models for future work.

Fair Classification. Fairness issues have recently drawn much attention from the machine learning community. Fairness issues for common classification problems can be generally divided into two categories: (1) equalized treatment [128] (or "Statistical Rate"); and (2) equalized prediction quality [42].
Table 5.3: RFC & Baseline Methods' Performance against Poisoning Attacks on COMPAS Dataset.

              No Attack      Label Flip (10%)  Label Flip (20%)  Attr. Flip (10%)  Attr. Flip (20%)
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.656  0.865   0.655  0.866      0.664  0.836      0.677  0.839      0.665  0.847
SEVER         0.650  0.854   0.677  0.833      0.674  0.835      0.674  0.835      0.667  0.835
Wang          0.662  0.847   0.650  0.869      0.631  0.863      0.643  0.861      0.659  0.862
Roh           0.654  0.851   0.646  0.891      0.663  0.823      0.621  0.834      0.615  0.845
RFC           0.661  0.850   0.676  0.859      0.682  0.841      0.685  0.850      0.667  0.856

(ε = 10%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.649  0.845   0.645  0.837      0.665  0.832      0.630  0.860      0.661  0.826
SEVER         0.634  0.865   0.630  0.859      0.634  0.845      0.627  0.840      0.664  0.826
Wang          0.648  0.851   0.620  0.855      0.639  0.865      0.624  0.871      0.644  0.868
Roh           0.644  0.880   0.631  0.865      0.659  0.865      0.633  0.825      0.640  0.858
RFC           0.656  0.863   0.661  0.834      0.668  0.853      0.667  0.866      0.673  0.845

(ε = 15%)     Min-Max (z=0)  Min-Max (z=1)     F-Attack (z=0)    F-Attack (z=1)    F-Attack*
              Acc.   Fair.   Acc.   Fair.      Acc.   Fair.      Acc.   Fair.      Acc.   Fair.
No Defense    0.676  0.802   0.644  0.863      0.665  0.802      0.630  0.860      0.642  0.831
SEVER         0.621  0.880   0.620  0.880      0.585  0.890      0.606  0.887      0.663  0.845
Wang          0.645  0.865   0.631  0.852      0.606  0.842      0.611  0.829      0.624  0.841
Roh           0.655  0.847   0.628  0.877      0.625  0.865      0.631  0.849      0.621  0.877
RFC           0.676  0.852   0.659  0.843      0.653  0.848      0.645  0.847      0.672  0.841

For classification models to satisfy these fairness criteria, popular methods including [128, 34, 1] solve constrained optimization problems, and [132] applies the adversarial training [71] method.

Comparison to Prior Works. There are recent works that test the robustness of fair classification methods by manipulating their training sets; they have also proposed possible strategies to defend against such perturbations. For example, the works [110, 59, 16, 17] consider injecting naturally / adversarially generated noise only on the sensitive attributes. Another line of research [90, 109] considers the vulnerability of fairness training to (coordinated) label-flipping attacks [84]. As countermeasures against their proposed perturbations, representative works such as [90] propose an adversarial training framework [132] to train the model to distinguish clean samples from poisoning samples while preserving model fairness, and the work [110] solves robust optimization problems by assigning soft sensitive attributes. In our work, in terms of attack, we consider a stronger attacker, who can insert sophisticatedly calculated features and sensitive attributes to fully exploit the vulnerability of fairness training methods.

5.7 Conclusion

In this work, we study the problem of poisoning attacks on fair classification. We propose a strong attack method that can evade the defenses of most existing methods. Then, we propose an effective strategy to greatly improve the robustness of fair classification methods. In the future, we aim to examine whether our findings can be generalized to other machine learning tasks and other machine learning models, such as Deep Neural Networks (DNNs).

CHAPTER 6
CONCLUSION AND FUTURE DIRECTIONS

6.1 Conclusion

In this thesis, we systematically study the interaction between machine learning robustness and fairness and demonstrate the negative impact they can have on each other.
On one hand, we show that adversarial training, as one of the most popular and successful methods to improve ML robustness, can result in unfairness from different perspectives. Specifically, in Chapter 2, Chapter 3, and Chapter 4, we observe that adversarial training either causes or exacerbates class-wise unfairness, inner-class unfairness, and unfairness on imbalanced datasets, respectively. On the other hand, we show in Chapter 5 that fair machine learning techniques can also make the model more vulnerable to data poisoning attacks. Overall, these results demand careful examination when applying techniques to improve either ML robustness or fairness, especially in applications where both are essential. Moreover, these empirical and theoretical results in classification tasks also motivate future investigations into: (1) the interaction between ML robustness and ML fairness in various types of models; and (2) the interaction of ML robustness or fairness with other ML properties, such as privacy and explainability.

6.2 Future Directions

6.2.1 ML Robustness vs. ML Fairness in Generative Models

In this thesis, most of the discussed works focus on image classification tasks and classification tasks on tabular data. However, there are other data domains and other types of machine learning tasks beyond classification that are worth studying. For example, previous works such as [120] focus on adversarial robustness in language processing tasks. Meanwhile, works such as [50, 70] focus on adversarial robustness in graph neural networks. Different from image or tabular data classification tasks, the machine learning models and training algorithms in these domains may exhibit different mechanisms and behaviors. Therefore, it is also important to extend the previous studies into these domains to obtain new findings.

6.2.2 ML Robustness / Fairness vs. ML Privacy / Copyright / Explainability

Beyond machine learning robustness and fairness, there are also other trustworthiness concerns from society about the application of ML / AI models. For example, people are concerned that ML models can release private information about their users [82, 130, 131]. Besides, machine learning models can also be misused to infringe the copyright of visual artworks [86], articles [92], or music [101]. Moreover, for machine learning techniques applied in high-stakes applications, such as healthcare or financial prediction, people also desire to understand and explain the predictions of ML models. Thus, methodologies have also been proposed to improve ML explainability [67]. Given these trustworthiness problems from various facets, it is also worthwhile to study the inner relationships among them. As one of the pioneering works, [38] finds a positive interaction between ML robustness and explainability in graph neural networks [50]. Moreover, there is a line of research showing that techniques for finding adversarial examples can naturally serve as tools for data protection against privacy and copyright infringement. Specifically, the works [46, 87, 63, 108] aim to add imperceptible perturbations to images such that machine learning models cannot effectively extract information from them, so that private information and copyright can be protected.
Similarly, there are also watermarking strategies [28, 27] trying to add perturbations to insert particular “fingerprint” information to the data which they aim to protect. 82 BIBLIOGRAPHY [1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60–69. PMLR, 2018. [2] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018. [3] Golnoosh Babaei and Paolo Giudici. How fair is machine learning in credit lending? Quality and Reliability Engineering International, 2024. [4] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. [5] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. [6] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012. [7] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. [8] Tim Brennan, William Dieterich, and Beate Ehret. Evaluating the predictive validity of the compas risk and needs assessment system. Criminal Justice and behavior, 36(1):21–40, 2009. [9] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018. [10] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91, 2018. [11] Cody Burkard and Brent Lagesse. Analysis of causative attacks against svms learning from data streams. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, pages 31–36, 2017. [12] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881. PMLR, 2019. 83 [13] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019. [14] Nicholas Carlini, Ulfar Erlingsson, and Nicolas Papernot. Distribution density, tails, and outliers in machine learning: Metrics and applications. arXiv preprint arXiv:1910.13427, 2019. [15] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pages 39–57. IEEE, 2017. [16] L Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K Vishnoi. Fair classification with noisy protected attributes: A framework with provable guarantees. In International Conference on Machine Learning, pages 1349–1361. PMLR, 2021. [17] L Elisa Celis, Anay Mehrotra, and Nisheeth K Vishnoi. Fair classification with adversarial perturbations. arXiv preprint arXiv:2106.05964, 2021. [18] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018. 
[19] Niladri S Chatterji and Philip M Long. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. arXiv preprint arXiv:2004.12019, 2020. [20] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321–357, 2002. [21] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE international conference on computer vision, pages 2722–2730, 2015. [22] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020. [23] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017. [24] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019. [25] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an en- 84 semble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206–2216. PMLR, 2020. [26] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019. [27] Yingqian Cui, Jie Ren, Yuping Lin, Han Xu, Pengfei He, Yue Xing, Wenqi Fan, Hui Liu, and Jiliang Tang. Ft-shield: A watermark against unauthorized fine-tuning in text-to-image diffusion models. arXiv preprint arXiv:2310.02401, 2023. [28] Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, and Jiliang Tang. Diffusionshield: A watermark for copyright protection against generative diffusion models. arXiv preprint arXiv:2306.04642, 2023. [29] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019. [30] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017. [31] Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. Max-margin adversarial (mma) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018. [32] Chengyu Dong, Liyuan Liu, and Jingbo Shang. Data profiling for adversarial training: On the ruin of problematic data. arXiv preprint arXiv:2102.07437, 2021. [33] Yinpeng Dong, Ke Xu, Xiao Yang, Tianyu Pang, Zhijie Deng, Hang Su, and Jun Zhu. Exploring memorization in adversarial training. arXiv preprint arXiv:2106.01606, 2021. [34] Michele Donini, Luca Oneto, Shai Ben-David, John Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. arXiv preprint arXiv:1802.08626, 2018. [35] Chris Drummond, Robert C Holte, et al. C4. 5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, volume 11, pages 1–8. Citeseer, 2003. 
[36] Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1):18–36, 2004. [37] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zis- 85 serman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. [38] Wenqi Fan, Yao Ma, Han Xu, Xiaorui Liu, Jianping Wang, Qing Li, and Jiliang Tang. Deep adversarial canonical correlation analysis. In Proceedings of the 2020 SIAM international conference on data mining, pages 352–360. SIAM, 2020. [39] Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020. [40] Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. arXiv preprint arXiv:2008.03703, 2020. [41] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [42] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29:3315–3323, 2016. [43] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009. [44] Haibo He and Yunqian Ma. Imbalanced learning: foundations, algorithms, and applications. 2013. [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [46] Pengfei He, Han Xu, Jie Ren, Yingqian Cui, Hui Liu, Charu C Aggarwal, and Jiliang Tang. Sharpness-aware data poisoning attack. arXiv preprint arXiv:2305.14851, 2023. [47] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136, 2019. [48] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002. [49] Wei Jin, Yaxin Li, Han Xu, Yiqi Wang, and Jiliang Tang. Adversarial attacks and defenses on graphs: A review and empirical study. arXiv preprint arXiv:2003.00653, 2020. [50] Wei Jin, Yaxing Li, Han Xu, Yiqi Wang, Shuiwang Ji, Charu Aggarwal, and Jiliang Tang. Adversarial attacks and defenses on graphs. ACM SIGKDD Explorations Newsletter, 22(2):19– 86 34, 2021. [51] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019. [52] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020. [53] Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data sanitization defenses. Machine Learning, pages 1–47, 2021. [54] Ron Kohavi et al. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Kdd, volume 96, pages 202–207, 1996. [55] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. 
[56] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. [57] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012. [58] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016. [59] Alexandre Louis Lamy, Ziyuan Zhong, Aditya Krishna Menon, and Nakul Verma. Noisetolerant fair classification. arXiv preprint arXiv:1901.10837, 2019. [60] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7:7, 2015. [61] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 4313–4324. PMLR, 2020. [62] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. Deeprobust: A pytorch library for adversarial attacks and defenses. arXiv preprint arXiv:2005.06149, 2020. [63] Yaxin Li, Jie Ren, Han Xu, and Hui Liu. Neural style protection: Counteracting unauthorized neural style transfer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3966–3975, 2024. 87 [64] Yiming Li, Baoyuan Wu, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. arXiv preprint arXiv:2007.08745, 2020. [65] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. [67] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1):18, 2020. [68] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017. [69] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020. [70] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34:9720–9733, 2021. [71] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [72] Melissa D McCradden, Shalmali Joshi, Mjaye Mazwi, and James A Anderson. Ethical limitations of algorithmic fairness solutions in health care machine learning. The Lancet Digital Health, 2(5):e221–e223, 2020. [73] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6):1– 35, 2021. [74] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. 
[75] Aditya Krishna Menon and Robert C Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107–118. PMLR, 2018. [76] Nir Morgulis, Alexander Kreines, Shachar Mendelowitz, and Yuval Weisglass. Fooling a 88 real car with adversarial traffic signs. arXiv preprint arXiv:1907.00374, 2019. [77] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 1(1):67–83, 2020. [78] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019. [79] Vedant Nanda, Samuel Dooley, Sahil Singla, Soheil Feizi, and John P Dickerson. Fairness through robustness: Investigating robustness disparity in deep learning. arXiv preprint arXiv:2006.12621, 2020. [80] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. [81] William S Noble. What is a support vector machine? Nature biotechnology, 24(12):1565– 1567, 2006. [82] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016. [83] Ioannis Pastaltzidis, Nikolaos Dimitriou, Katherine Quezada-Tavarez, Stergios Aidinlis, Thomas Marquenie, Agata Gurzawska, and Dimitrios Tzovaras. Data augmentation for fairness-aware machine learning: Preventing algorithmic bias in law enforcement systems. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2302–2314, 2022. [84] Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C Lupu. Detection of adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint arXiv:1802.03041, 2018. [85] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018. [86] Jie Ren, Han Xu, Pengfei He, Yingqian Cui, Shenglai Zeng, Jiankun Zhang, Hongzhi Wen, Jiayuan Ding, Hui Liu, Yi Chang, et al. Copyright protection in generative ai: A technical perspective. arXiv preprint arXiv:2402.02333, 2024. [87] Jie Ren, Han Xu, Yuxuan Wan, Xingjun Ma, Lichao Sun, and Jiliang Tang. Transferable unlearnable examples. arXiv preprint arXiv:2210.10114, 2022. [88] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International conference on machine learning, pages 4334–4343. 89 PMLR, 2018. [89] Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pages 8093–8104. PMLR, 2020. [90] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. Fr-train: A mutual informationbased approach to fair and robust training. In International Conference on Machine Learning, pages 8147–8157. PMLR, 2020. [91] Benjamin IP Rubinstein, Blaine Nelson, Ling Huang, Anthony D Joseph, Shing-hon Lau, Satish Rao, Nina Taft, and J Doug Tygar. Antidote: understanding and defending against poisoning of anomaly detectors. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement, pages 1–14, 2009. [92] Matthew Sag. Copyright safety for generative ai. Forthcoming in the Houston Law Review, 2023. 
[93] Amartya Sanyal, Puneet K Dokania, Varun Kanade, and Philip HS Torr. How benign is benign overfitting? arXiv preprint arXiv:2007.04028, 2020. [94] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018. [95] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015. [96] Candice Schumann, Jeffrey Foster, Nicholas Mattei, and John Dickerson. We need fairness and explainability in algorithmic hiring. In International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2020. [97] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. Poison frogs! targeted clean-label poisoning attacks on neural networks. arXiv preprint arXiv:1804.00792, 2018. [98] Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Advances in Neural Information Processing Systems, pages 3358–3369, 2019. [99] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 acm sigsac conference on computer and communications security, pages 1528–1540, 2016. 90 [100] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. Certified defenses for data poisoning attacks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3520–3532, 2017. [101] Bob LT Sturm, Maria Iglesias, Oded Ben-Tal, Marius Miron, and Emilia Gómez. Artificial intelligence and music: open questions of copyright law and engineering praxis. In Arts, volume 8, page 115. MDPI, 2019. [102] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. [103] Lue Tao, Lei Feng, Jinfeng Yi, Sheng-Jun Huang, and Songcan Chen. Better safe than sorry: Preventing delusive adversaries with adversarial training. Advances in Neural Information Processing Systems, 34, 2021. [104] Florian Tramèr, Jens Behrmann, Nicholas Carlini, Nicolas Papernot, and Jörn-Henrik Jacobsen. Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. In International Conference on Machine Learning, pages 9561–9571. PMLR, 2020. [105] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018. [106] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008. [107] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017. [108] Yuxuan Wan, Han Xu, Xiaorui Liu, Jie Ren, Wenqi Fan, and Jiliang Tang. Defense against gradient leakage attacks via learning to obscure data. arXiv preprint arXiv:2206.00769, 2022. [109] Jialu Wang, Yang Liu, and Caleb Levy. Fair classification with group-dependent label noise. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 526–536, 2021. 
[110] Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, and Michael I Jordan. Robust optimization for fairness with noisy protected groups. arXiv preprint arXiv:2002.09343, 2020. [111] Wentao Wang, Han Xu, Xiaorui Liu, Yaxin Li, Bhavani Thuraisingham, and Jiliang Tang. Imbalanced adversarial training with reweighting. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1209–1214. IEEE, 2022. 91 [112] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019. [113] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In Advances in Neural Information Processing Systems, pages 7029–7039, 2017. [114] Yunjuan Wang, Poorya Mianjy, and Raman Arora. Robust learning for data poisoning attacks. In International Conference on Machine Learning, pages 10859–10869. PMLR, 2021. [115] Susan C Weller and A Kimball Romney. Systematic data collection, volume 10. Sage publications, 1988. [116] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic press, 2011. [117] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286– 5295. PMLR, 2018. [118] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33, 2020. [119] Da Xu, Yuting Ye, and Chuanwei Ruan. Understanding the role of importance weighting for deep learning. arXiv preprint arXiv:2103.15209, 2021. [120] Han Xu, Pengfei He, Jie Ren, Yuxuan Wan, Zitao Liu, Hui Liu, and Jiliang Tang. Probabilistic categorical adversarial attack and adversarial training. In International Conference on Machine Learning, pages 38428–38442. PMLR, 2023. [121] Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In International Conference on Machine Learning, pages 11492–11501. PMLR, 2021. [122] Han Xu, Xiaorui Liu, Yaxin Li, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. arXiv preprint arXiv:2010.06121, 2020. [123] Han Xu, Xiaorui Liu, Yuxuan Wan, and Jiliang Tang. Towards fair classification against poisoning attacks. arXiv preprint arXiv:2210.09503, 2022. [124] Han Xu, Xiaorui Liu, Wentao Wang, Wenbiao Ding, Zhongqin Wu, Zitao Liu, Anil Jain, and Jiliang Tang. Towards the memorization effect of neural networks in adversarial training. arXiv preprint arXiv:2106.04794, 2021. [125] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Ji-Liang Tang, and Anil K Jain. 92 Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020. [126] Han Xu, Yao Ma, Haochen Liu, Debayan Deb, Hui Liu, Jiliang Tang, and Anil Jain. Adversarial attacks and defenses in images, graphs and text: A review. arXiv preprint arXiv:1909.08072, 2019. [127] Show-Jane Yen and Yue-Shi Lee. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3):5718–5727, 2009. [128] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. 
In Proceedings of the 26th international conference on world wide web, pages 1171–1180, 2017. [129] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013. [130] Shenglai Zeng, Yaxin Li, Jie Ren, Yiding Liu, Han Xu, Pengfei He, Yue Xing, Shuaiqiang Wang, Jiliang Tang, and Dawei Yin. Exploring memorization in fine-tuned language models. arXiv preprint arXiv:2310.06714, 2023. [131] Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, et al. The good and the bad: Exploring privacy issues in retrieval-augmented generation (rag). arXiv preprint arXiv:2402.16893, 2024. [132] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018. [133] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016. [134] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2(3), 2019. [135] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019. [136] Jingfeng Zhang, Xilie Xu, Bo Han, Gang Niu, Lizhen Cui, Masashi Sugiyama, and Mohan Kankanhalli. Attacks which do not kill training make adversarial learning stronger. In International Conference on Machine Learning, pages 11278–11287. PMLR, 2020. 93 [137] Jingfeng Zhang, Jianing Zhu, Gang Niu, Bo Han, Masashi Sugiyama, and Mohan Kankanhalli. Geometry-aware instance-reweighted adversarial training. arXiv preprint arXiv:2010.01736, 2020. [138] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5810–5818, 2017. 94 APPENDIX 1 Supplementary to Chapter 2 1.1 More Results of Preliminary Studies In this section, we show more preliminary results on the robust fairness phenomenon of adversarial training in various settings. In addition to the results shown in Section 2.3, we present the results in the settings with one more architecture (WRN28), one more type of adversarial attack (𝑙2- norm attack), one more defense method (Randomized Smoothing) and one more dataset (SVHN). From all these settings, we observe the similar phenomenon as in Section 2.3, which show that the fairness phenomenon can be generally happening in adversarial training under different scenarios and can be a common concern during its application. Furthermore, we also present the detailed results as in Table 2.2, to show the fact that adversarial training usually gives an unequal influence on different classes, which can be a reason that causes this fairness phenomenon. In detail, for each dataset under PreAct ResNet18 architecture, for each adversarial training algorithm (including PGD-adversarial training [71] and TRADES [138]), we train the models following that as suggested by the original papers. We train the models for 200 epochs with learning rate 0.1 and decay the learning rate at the epoch 100 and 150 by factor 0.1. 
During the evaluation phase, we report the trained model's classwise standard error rate and robust error rate. Unless explicitly mentioned otherwise, we study the models' robustness against 𝑙∞-norm adversarial attacks with budget 8/255, and we implement a 20-step PGD attack for robustness evaluation.
1.1.1 Robust Fairness in WRN28 in CIFAR10
Figure A.2 presents the robust fairness issues on the CIFAR10 dataset under WRN28 models. Note that in Section 2.3 we also presented the corresponding results under PreAct ResNet18 models in Figure 2.1. We observe a similar robust fairness phenomenon under both models. Moreover, as clear evidence of the unequal effect of adversarial training among different classes, in Table A.1 and Table A.2 we compare the classwise standard error and robust error between natural training and PGD adversarial training. From the experimental results, we conclude that adversarial training usually increases the error rate more for classes, such as "dog" and "cat", which originally have larger errors in natural training. Similarly, adversarial training also gives less help in reducing the robust errors of these classes.
1.1.2 Robust Fairness in 𝑙2-norm Adversarial Training
Figure A.3 presents the robust fairness issues of adversarial training methods that target 𝑙2-norm attacks on the CIFAR10 dataset, which further confirms the existence of robust unfairness in adversarial training methods. In Figure A.3, we present the classwise standard errors and robust errors against an 𝑙2-norm 0.5 adversarial attack. During the robustness evaluation, we implement the PGD attack with step size 0.1 for 20 steps.
1.1.3 Robust Fairness for Certified Defenses
Certified defenses are another main type of effective defense strategy. Even though certified defenses do not train in the same way as traditional adversarial training methods, which train the models on generated adversarial examples, they minimize the probability that adversarial examples exist near the input data. This process also implicitly minimizes the model's overall robust error. Thus, in this section we study whether such certified defenses also have robust fairness issues. As a representative, we implement Randomized Smoothing [24], which is one of the state-of-the-art methods to certifiably defend against 𝑙2-norm adversarial attacks. In this experiment, we run Randomized Smoothing against 𝑙2-norm 0.5 attacks on the CIFAR10 dataset and report its class-wise certified standard error and certified robust error under different intensities. The results also suggest that the Randomized Smoothing certified defense presents a similar disparity phenomenon as the traditional adversarial training methods. Moreover, it preserves a similar classwise performance relationship, i.e., it has both high standard and robust error on the classes "cat" and "dog", but relatively low errors on "car" and "ship".
Figure A.1: Randomized Smoothing on CIFAR10.
1.1.4 Robust Fairness on SVHN Dataset
Figure A.4 presents the robust fairness issues of adversarial training methods on the SVHN dataset under the PreAct ResNet18 model. From the experimental results, we also observe a strong disparity of classwise standard errors and robust errors, which does not exist in natural training. In particular, the classes "3" and "8" have the largest standard error in a naturally trained model.
After adversarial training, these two classes also have the largest standard error increases among all classes, as well as the least robust error decreases. As a result, there is also a significant disparity of the standard / robustness performance among the classes. The full results are shown in Table A.3. (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 1) (d) TRADES (1/𝜆 = 6) Figure A.2: The class-wise performance of natural / adversarial training (target on 𝑙∞-norm 8/255 attack) on CIFAR10 under WRN28. 1.2 Theoretical Proof of Section 2.4 In this section, we formally calculate the classwise standard & robust errors in an optimal linear classifier and an optimal linear robust classifier. Then, we present our main conclusion that robust optimization will unequally influence the performance of the two classes and therefore result in a severe performance disparity. 97 (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 6) (d) TRADES (1/𝜆 = 6) Figure A.3: The class-wise performance of natural / adversarial training (target on 𝑙2-norm 0.5 attack) on CIFAR10 under PreActResNet18. (a) Natural Training (b) PGD Adversarial Training (c) TRADES (1/𝜆 = 1) (d) TRADES (1/𝜆 = 6) Figure A.4: The class-wise performance of natural / adversarial training (target on 𝑙∞-norm 8/255 attack) on SVHN under PreActResNet18. Table A.1: The Changes of Standard & Robust Error in Natural & Adversarial Training in CIFAR10 on PreAct ResNet18. Std. Error plane car bird cat deer dog frog horse ship truck Natural Training 4.0 1.8 6.4 11.3 6.1 10.0 5.1 5.2 3.5 4.3 PGD Adv. Training 11.7 6.1 23.3 34.8 20.8 26.9 12.6 9.8 6.4 9.5 Diff. (Adv. - Nat.) 7.7 4.3 16.9 23.5 14.6 16.9 7.5 4.6 2.9 5.2 Rob. Error plane car bird cat deer dog frog horse ship truck Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. Training 44.9 34.3 68.4 82.7 74.7 66.4 51.5 47.0 40.8 42.3 Diff. (Adv. - Nat.) -55.1 -65.7 -31.6 -17.3 -25.3 -33.5 -48.5 -53.0 -59.2 -57.7 1.2.1 Proof of Theorem 2 In this subsection, we study an optimal linear classifier which minimizes the average standard error. By calculating its standard errors, we can get the conclusion that the class “+1” in distribution D is indeed harder than class “-1”. We first start from a lemma to calculate the weight vector of an optimal linear model. Lemma 6 (Weight Vector of an Optimal Classifier) For the data following the distributionDde- 98 Table A.2: The Changes of Standard & Robust Error in Natural & Adversarial Training in CIFAR10 on WRN28. Std. Error plane car bird cat deer dog frog horse ship truck Natural Training 3.9 1.9 6.4 10.0 6.2 9.0 5.0 4.6 3.3 4.2 PGD Adv. Training 10.0 4.9 21.4 24.6 17.4 26.2 12.4 9.4 6.3 9.2 Diff. (Adv. - Nat.) 6.1 3.0 15.0 14.6 11.2 17.2 7.4 4.8 3.0 5.0 Rob. Error plane car bird cat deer dog frog horse ship truck Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. Training 41.4 29.3 65.8 77.2 75.5 61.6 60.9 40.7 40.8 39.4 Diff. (Adv. - Nat.) -58.6 -70.7 -34.2 -22.8 -24.5 -38.4 -39.1 -59.3 -59.2 -60.6 Table A.3: The Changes of Standard & Robust Error in Natural & Adversarial Training in SVHN in PreAct ResNet18. Std. Error “0” “1” “2” “3” “4” “5” “6” “7” “8” “9” Natural Training 3.6 2.7 3.2 5.8 4.0 5.1 3.6 4.6 6.1 5.3 PGD Adv. Training 8.8 6.2 7.9 15.8 6.9 13.2 13.3 13.4 19.8 11.4 Diff. (Adv. - Nat.) 5.2 3.5 4.8 10.1 4.9 8.1 9.6 8.9 13.6 6.4 Std. Error “0” “1” “2” “3” “4” “5” “6” “7” “8” “9” Natural Training 100 100 100 100 100 100 100 100 100 100 PGD Adv. 
Training 47.2 38.8 49.0 63.4 41.9 57.1 62.6 55.2 73.7 57.1 Diff. (Adv. - Nat.) -52.8 -61.2 -51.1 -36.6 -58.1 -42.9 -37.4 -44.8 -26.3 -42.9 fined in Eq. 2.2, an optimal linear classifier 𝑓nat which minimizes the average standard classification error: 𝑓nat (𝑥) = sign(⟨𝑤nat, 𝑥⟩ + 𝑏nat) where 𝑤nat, 𝑏nat = arg min 𝑤,𝑏 Pr.(sign(⟨𝑤, 𝑥⟩ + 𝑏) ≠ 𝑦) has the optimal weight that satisfy: 𝑤nat = 1. Proof 2 (Proof of Lemma 8) In the proof we will use 𝑤 = 𝑤nat and 𝑏 = 𝑏nat for simplicity. Next, we will prove 𝑤1 = 𝑤2 = · · · = 𝑤𝑑 by contradiction. We define 𝐺 = {1, 2, . . . , 𝑑} and make the following assumption: for the optimal 𝑤 and 𝑏, we assume if there exist 𝑤𝑖 < 𝑤𝑗 for 𝑖 ≠ 𝑗 and 𝑖, 𝑗 ∈ 𝐺. Then we obtain the following standard errors for two classes of this classifier with weight 99 𝑤: Rnat ( 𝑓 ; −1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(−𝜂, 𝜎2− 1 ) + 𝑏 + 𝑤𝑖N(−𝜂, 𝜎2− 1 ) + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) > 0} Rnat ( 𝑓 ; +1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(+𝜂, 𝜎2 +1 ) + 𝑏 + 𝑤𝑖N(+𝜂, 𝜎2 +1 ) + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) < 0} (A.1) However, if we define a new classier 𝑓˜ whose weight 𝑤˜ uses 𝑤𝑗 to replace 𝑤𝑖 , we obtain the errors for the new classifier: Rnat ( ˜ 𝑓 ; −1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(−𝜂, 𝜎2− 1 ) + 𝑏 + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) + 𝑤𝑗N(−𝜂, 𝜎2− 1 ) > 0} Rnat ( ˜ 𝑓 ; +1) = Pr{ Õ 𝑘≠𝑖,𝑘≠𝑗 𝑤𝑘N(+𝜂, 𝜎2 +1 ) + 𝑏 + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) + 𝑤𝑗N(+𝜂, 𝜎2 +1 ) < 0}. (A.2) By comparing the errors in Eq A.13 and Eq A.14, it can imply the classifier ˜ 𝑓 has smaller error in each class. Therefore, it contradicts with the assumption that 𝑓 is the optimal classifier with least error. Thus, we conclude for an optimal linear classifier in natural training, it must satisfies 𝑤1 = 𝑤2 = · · · = 𝑤𝑑 and 𝑤 = 1. Given the results in Lemma 1, we can calculate the errors of classifiers by only calculating the interception term 𝑏nat and 𝑏rob. Recall Theorem 1, we calculate the classwise errors of an optimal classifier with minimal average standard error. Theorem 1 For a data distribution D in Eq. 2.2, for the optimal linear classifier 𝑓nat which minimizes the average standard classification error, it has the intra-class standard error for the two classes: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} 100 where 𝐴 = 2 𝐾2−1 √ 𝑑𝜂 𝜎 and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾, As a result, the class “+1” has a larger standard error: Rnat ( 𝑓nat, −1) < Rnat ( 𝑓nat, +1). Proof 3 (Proof of Theorem 1) From the results in Lemma 1, we define our optimal linear classifier to be 𝑓nat (𝑥) = sign( Í𝑑 𝑖=1 𝑥𝑖 + 𝑏nat). Now, we calculate the optimal 𝑏nat which can minimize the average standard error: 𝑅nat ( 𝑓 ) = Pr.{ 𝑓 (𝑥) ≠ 𝑦} ∝ Pr{ 𝑓 (𝑥) = 1|𝑦 = −1} + Pr{ 𝑓 (𝑥) = −1|𝑦 = 1} = Pr.{ Õ𝑑 𝑖=1 𝑥𝑖 + 𝑏nat > 0|𝑦 = −1} + Pr{ Õ𝑑 𝑖=1 𝑥𝑖 + 𝑏nat < 0|𝑦 = +1} = Pr.{N(0, 1) < − √ 𝑑𝜂 𝜎 + 1 √ 𝑑𝜎 · 𝑏nat} + Pr.{N(0, 1) < − √ 𝑑𝜂 𝐾𝜎 − 1 𝐾 √ 𝑑𝜎 · 𝑏nat} (A.3) The optimal 𝑏nat to minimize Rnat ( 𝑓 ) is achieved at the point that 𝜕Rnat ( 𝑓 ) 𝜕𝑏nat = 0. Thus, we find the optimal 𝑏nat: 𝑏nat = 𝐾2 + 1 𝐾2 − 1 · 𝑑𝜂 − 𝐾 s 4𝑑2𝜂2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 (A.4) and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾. By incorporating the optimal 𝑏nat into Eq. A.6, we can get the classwise standard errors for the two classes: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} where 𝐴 = 2 𝐾2−1 √ 𝑑𝜂 𝜎 . Since 𝑞(𝐾) > 0, we have the direct conclusion that Rnat ( 𝑓 ; −1) < Rnat ( 𝑓 ; +1). 
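For intuition, the short script below numerically evaluates the two classwise standard errors in Theorem 1 for one illustrative choice of (𝑑, 𝜂, 𝜎, 𝐾); these parameter values are assumptions made only for this example. It computes the errors both from the optimal intercept 𝑏nat in Eq. A.4 and from the closed form above, and confirms that class "+1" has the larger standard error.

```python
# Numerical illustration of Theorem 1 (a sketch with assumed parameter values,
# not part of the original experiments): class "+1" is the higher-variance class
# (sigma_{+1} = K * sigma_{-1}) and the class means are +/- eta in each of d dims.
import numpy as np
from scipy.stats import norm

d, eta, sigma, K = 10, 0.5, 1.0, 2.0          # illustrative choices only
q = 2 * np.log(K) / (K**2 - 1)                # q(K) > 0
A = 2 * np.sqrt(d) * eta / ((K**2 - 1) * sigma)

# Optimal intercept of the natural classifier f_nat(x) = sign(sum_i x_i + b), Eq. A.4.
b_nat = (K**2 + 1) / (K**2 - 1) * d * eta \
        - K * np.sqrt(4 * d**2 * eta**2 / (K**2 - 1)**2 + q * d * sigma**2)

# Classwise standard errors computed directly from b_nat (as in Eq. A.3) ...
err_neg_direct = norm.cdf((b_nat - d * eta) / (np.sqrt(d) * sigma))
err_pos_direct = norm.cdf(-(d * eta + b_nat) / (K * np.sqrt(d) * sigma))

# ... and from the closed form stated in Theorem 1.
err_neg = norm.cdf(A - K * np.sqrt(A**2 + q))
err_pos = norm.cdf(-K * A + np.sqrt(A**2 + q))

print(err_neg_direct, err_neg)   # the two expressions agree
print(err_pos_direct, err_pos)
print(err_pos > err_neg)         # class "+1" is the harder class: True
```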
101 1.2.2 Proof of Theorem 2 In this subsection, we focus on calculating the errors of robust classifiers which minimize the average robust errors of the model. By comparing natural classifiers and robust classifiers, we get the conclusion that robust classifiers can further exacerbate the model’s performance on the “harder” class. Similar to Section A.2.1, we start from a Lemma to show an optimal robust classifier 𝑓rob has a weight vector 𝑤rob = 1. Lemma 8 (Weight of an Optimal Robust Classifier) For the data following the distribution D defined in Eq. 2.2, an optimal linear classifier 𝑓nat which minimizes the average standard classification error: 𝑓rob (𝑥) = sign(⟨𝑤rob, 𝑥⟩ + 𝑏rob) where 𝑤rob, 𝑏rob = arg min 𝑤,𝑏 Pr.(∃𝛿, ||𝛿|| ≤ 𝜖, s.t. sign(⟨𝑤, 𝑥 + 𝛿⟩ + 𝑏) ≠ 𝑦). has the optimal weight which satisfy: 𝑤rob = 1. We leave the detailed proof out in the paper because it can be proved in the similar way as the proof of Lemma 1. Recall Theorem 2, we formally calculate the standard errors of an optimal robust linear classifier. Theorem 2 For a data distribution D in Eq. 2.2, the optimal robust linear classifier 𝑓rob which minimizes the average robust error with perturbation margin 𝜖 < 𝜂, it has the intra-class standard error for the two classes: Rnat ( 𝑓rob, −1) = Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 } (A.5) where 𝐵 = 2 𝐾2−1 √ 𝑑(𝜂−𝜖) 𝜎 and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 is a positive constant and only depends on 𝐾, Proof 4 (Proof of Theorem 2) From the results in Lemma 2, we define our optimal linear robust classifier to be 𝑓rob (𝑥) = sign( Í𝑑 𝑖=1 𝑥𝑖 + 𝑏rob). Now, we calculate the optimal 𝑏rob which can mini- 102 Table A.4: Average & worst-class standard error, boundary error and robust error for various algorithms on CIFAR10 under WRN28. Avg. Std. Worst Std. Avg. Bndy. Worst Bndy. Avg. Rob. Worst Rob. PGD Adv. Training 14.0 29.3 38.1 53.0 52.2 78.8 TRADES(1/𝜆 = 1) 12.6 25.2 40.2 58.7 52.8 76.7 TRADES(1/𝜆 = 6) 15.5 29.1 31.8 45.7 47.3 71.8 Baseline Reweight 14.2 26.3 38.6 53.7 52.8 77.9 FRL(Reweight, 0.05) 14.5 23.2 40.0 53.3 54.4 76.8 FRL(Remargin, 0.05) 15.4 24.9 38.1 49.6 53.5 70.5 FRL(Reweight+Remargin, 0.05) 15.4 25.0 37.8 46.7 53.2 67.1 FRL(Reweight, 0.07) 14.1 23.8 39.5 54.1 53.6 77.0 FRL(Remargin, 0.07) 14.8 24.3 39.5 50.6 54.3 73.0 FRL(Reweight+Remargin, 0.07) 14.9 24.7 37.8 48.3 52.7 68.2 Table A.5: Average & worst-class standard error, boundary error and robust error for various algorithms on SVHN under WRN28. Avg. Std. Worst Std. Avg. Bndy. Worst Bndy. Avg. Rob. Worst Rob. PGD Adv. Training 8.1 16.8 38.5 57.3 46.7 71.2 TRADES(1/𝜆 = 1) 8.0 19.6 40.1 60.0 48.1 73.3 TRADES(1/𝜆 = 6) 10.6 23.1 32.1 52.5 42.7 70.6 Baseline Reweight 8.5 16.2 40.3 57.8 48.8 71.1 FRL(Reweight, 0.05) 7.8 13.4 38.9 56.9 46.7 70.7 FRL(Remargin, 0.05) 8.4 13.4 40.8 52.1 49.2 65.5 FRL(Reweight+Remargin, 0.05) 8.4 13.2 38.4 52.1 46.8 63.1 FRL(Reweight, 0.07) 8.2 13.5 41.2 56.3 49.4 69.8 FRL(Remargin, 0.07) 8.6 14.9 38.8 51.2 47.4 67.0 FRL(Reweight+Remargin, 0.07) 8.2 13.9 39.9 50.2 48.1 65.4 mize the average robust error: Rrob ( 𝑓 ) =Pr.(∃||𝛿|| ≤ 𝜖 s.t. 𝑓 (𝑥 + 𝛿) ≠ 𝑦) = max ||𝛿||≤𝜖 Pr.( 𝑓 (𝑥 + 𝛿) ≠ 𝑦) = 1 2Pr.( 𝑓 (𝑥 + 𝜖) ≠ −1|𝑦 = −1) + 1 2Pr.( 𝑓 (𝑥 − 𝜖) ≠ +1|𝑦 = +1) =Pr.{ Õ𝑑 𝑖=1 (𝑥𝑖 + 𝜖) + 𝑏rob > 0|𝑦 = −1} + Pr{ Õ𝑑 𝑖=1 (𝑥𝑖 − 𝜖) + 𝑏rob < 0|𝑦 = +1} =Pr.{N(0, 1) < − √ 𝑑(𝜂 − 𝜖) 𝜎 + 1 √ 𝑑𝜎 · 𝑏rob} + Pr.{N(0, 1) < − √ 𝑑(𝜂 − 𝜖) 𝐾𝜎 − 1 𝐾 √ 𝑑𝜎 · 𝑏rob} (A.6) 103 The optimal 𝑏rob to minimize Rrob ( 𝑓 ) is achieved at the point that 𝜕Rrob ( 𝑓 ) 𝜕𝑏rob = 0. 
Thus, we find the optimal 𝑏rob: 𝑏rob = 𝐾2 + 1 𝐾2 − 1 · 𝑑(𝜂 − 𝜖) − 𝐾 s 4𝑑2 (𝜂 − 𝜖)2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 (A.7) and 𝑞(𝐾) = 2 log 𝐾 𝐾2−1 which is a positive constant and only depends on 𝐾. By incorporating the optimal 𝑏nat into Eq. A.6, we can get the classwise robust errors for the two classes: Rrob ( 𝑓rob, −1) = Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾)} Rrob ( 𝑓rob, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾)} where 𝐵 = 2 𝐾2−1 √ 𝑑(𝜂−𝜖) 𝜎 . As a direct result, the classwise standard errors for the two classes: Rnat ( 𝑓rob, −1) =Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) =Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 }. 1.2.3 Proof of Corollary 1 Giving the results in Theorem 1 and Theorem 2, we will show that a robust classifier will exacerbate the performance of the class “+1” which originally has higher error in a naturally trained model. In this way, we can get the conclusion that robust classifiers can cause strong disparity, because it exacerbates the “difficulty” difference among classes. Corollary 1 Adversarially Trained Models on D will increase the standard error for class “+1” and reduce the standard error for class “-1”: Rnat ( 𝑓rob, −1) < Rnat ( 𝑓nat, −1). Rnat ( 𝑓rob, +1) > Rnat ( 𝑓nat, +1). Proof 5 (Proof of Corollary 1) From the intermediate results in Eq.A.4 and Eq. A.7 in the proofs of Theorem 1 and Theorem 2, we find the only difference between a naturally trained model 𝑓nat 104 and a robust model 𝑓rob is about the interception term 𝑏nat and 𝑏rob. Specifically, we denote 𝑔(·) is the function of the interception term, then we have the results: 𝑏nat = 𝐾2 + 1 𝐾2 − 1 · 𝑑𝜂 − 𝐾 s 4𝑑2𝜂2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 := 𝑔(𝜂) 𝑏rob = 𝐾2 + 1 𝐾2 − 1 · 𝑑(𝜂 − 𝜖) − 𝐾 s 4𝑑2(𝜂 − 𝜖)2 (𝐾2 − 1)2 + 𝑞(𝐾)𝑑𝜎2 := 𝑔(𝜂 − 𝜖) Next, we show that the function 𝑔(·) is a monotone increasing function between 0 and 𝜂: 𝑑𝑔(𝜂) 𝑑𝜂 ≥ 𝐾2 + 1 𝐾2 − 1𝑑 − 𝐾 4 (𝐾2−1)2 𝑑2 · 2𝜂 2 q 4 (𝐾2−1)2 𝑑2𝜂2 = 𝐾 − 1 𝐾 + 1 𝑑 > 0 As a direct results, we have the interception terms: 𝑏nat > 𝑏rob. This will result a linear classifier 𝑓 (𝑥) = sign(⟨1𝑇 , 𝑥⟩ + 𝑏) present more samples in the overall distribution to be class “-1”. As a result, we have the conclusion: Rnat ( 𝑓rob, −1) < Rnat ( 𝑓nat, −1). Rnat ( 𝑓rob, +1) > Rnat ( 𝑓nat, +1). 1.3 Robust / Non-Robust Features In Section 2.4.1, we discussed a theoretical example where adversarial training will unequally treat the two classes in the distribution D, which will increase the standard error of one class and decrease the error for the other one. However, in the real applications of deep learning models, we always observe that each class’s error will increase after adversarial training. In this subsection, motivated by the work [105, 47], we extend the definition ofD, to split the features into two categories: robust features and non-robust features, where adversarial training will increase the standard errors for both classes. Formally, the data distribution D′ is defined as as following: 𝑦 𝑢.∼𝑎.𝑟 {−1, +1}, 𝜃 = ( dim = d z }| { 𝜂, ..., 𝜂, dim = m z }| { 𝛾, ..., 𝛾), 𝑥 ∼ 8>>>> < >>>>: N (𝜃, 𝜎2 +1𝐼) if 𝑦 = +1 N (−𝜃, 𝜎2− 1𝐼) if 𝑦 = −1 (A.8) 105 where in the center vector 𝜃, it includes robust features with scale 𝜂 > 𝜖, and non-robust features with scale 𝛾 < 𝜖. Here we specify that non-robust feature space has much higher dimension than robust feature space (𝑚 >> 𝑑) and there is a 𝐾-factor difference between the variance of two classes: 𝜎+1 = 𝐾 · 𝜎−1. From the main results in the work [105], it is easy to get that each class’s standard error will increase after adversarial training. 
In the following theory, we will show that in distribution D′ , adversarial training will increase the error for the class “+1” by a larger rate than the class “-1”. Theorem 3. For a data distribution D′ in Eq. A.8, the robust optimizer 𝑓rob increases the standard error of class “+1” by a larger rate than the increase of the standard error of class “-1’: Rnat ( 𝑓rob; +1) − Rnat ( 𝑓nat; +1) > Rnat ( 𝑓rob; −1) − Rnat ( 𝑓nat; −1) (A.9) Proof 6 (Proof of Theorem 3) The proof of Theorem 3 resembles the process of the proofs of Theorem 1 and Theorem 2, where we first calculate the classwise standard errors for each model. Note that from the work [105], an important conclusion is that a natural model 𝑓nat uses both robust and non-robust features for prediction. While, a robust model 𝑓rob only uses robust features for prediction (a detailed proof can be found in Section 2.1 in [105]). Therefore, we can calculate the classwise standard errors for both classes of a natural model 𝑓nat: Rnat ( 𝑓nat, −1) = Pr.{N(0, 1) ≤ 𝐴 − 𝐾 · q 𝐴2 + 𝑞(𝐾)} Rnat ( 𝑓nat, +1) = Pr.{N(0, 1) ≤ −𝐾 · 𝐴 + q 𝐴2 + 𝑞(𝐾)} where 𝐴 = 2 𝜎(𝐾2−1) p 𝑚𝛾2 + 𝑑𝜂2. The classwise standard errors of a robust model 𝑓rob are: Rnat ( 𝑓rob, −1) =Pr.{N(0, 1) ≤ 𝐵 − 𝐾 · q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝜎 𝜖 } Rnat ( 𝑓rob, +1) =Pr.{N(0, 1) ≤ −𝐾 · 𝐵 + q 𝐵2 + 𝑞(𝐾) − √ 𝑑 𝐾𝜎 𝜖 } 106 where 𝐵 = 2 𝜎(𝐾2−1) p 𝑑(𝜂 − 𝜖)2. Next, we compare the standard error increase after adversarial training between the two classes. We have the results: (Rnat ( 𝑓rob; +1) − Rnat ( 𝑓nat; +1))− (Rnat ( 𝑓rob; −1) − Rnat ( 𝑓nat; −1)) > (𝐾 + 1) ( (𝐴 − 𝐵) + ( q 𝐵2 + 𝑞(𝐾) − q 𝐴2 + 𝑞(𝐾))) ∝ ( q 𝐵2 + 𝑞(𝐾) − 𝐵) − ( q 𝐴2 + 𝑞(𝐾) − 𝐴) because 𝐴 includes high dimensional non-robust feature and 𝐴 >> 𝐵, the equation above is positive and we get the conclusion as in Eq. A.9. 1.4 Fairness Performance on WRN28 Table A.4 and Table A.5 presents the empirical results to validate the effectiveness of FRL algorithms under the WRN 28 model. The implementation details resemble those in Section 2.6.1. In the training, we start FRL from a pre-trained robust model (such as PGD-adversarial training), and run FRL with model parameter learning rate 1e-3 and hyperparameter learning rate 𝛼1 = 𝛼2 = 0.05 in the first 40 epochs. Then we decay the model parameter learning rate and the hyperparameter learning rate by 0.1 every 40 epochs. From the results, we have the similar observations as these for PreAct ResNet18 models, which is that FRL can help to improve the worst-class standard performance and robustness performance, such that the unfairness issue is mitigated. In particular, FRL (Reweight) is usually the most effective way to equalize the standard performance, but not sufficient to equalize the boundary errors and robust errors. FRL (Reweight + Remargin) is usually the most effective way to improve robustness for the worst class. 2 Supplementary to Chapter 3 2.1 Additional Introduction of Atypical Samples In this section, we provide additional introductions about the memorization effect and atypical samples in image classification tasks. We first introduce several strategies to identify atypical samples in common datasets. Then, we show several examples to illustrate the distribution & frequency in common datasets. 107 2.1.1 How to Estimate Whether One Sample is Atypical? Leave-out Method The work [40] proposes to examine which training samples can only be fitted by memorization, and measure each training sample’s “memorization value”. 
Formally, for a training algorithmA(i.e., ERM), the memorization value “mem(A,D, 𝑥𝑖)” of a training sample (𝑥𝑖 , 𝑦𝑖) ∈ D in training set D is defined as: mem(A,D, 𝑥𝑖) = Pr. 𝐹←A(D) (𝐹(𝑥𝑖) = 𝑦𝑖) − Pr. 𝐹←A(D\𝑥𝑖 ) (𝐹(𝑥𝑖) = 𝑦𝑖), which calculates the difference between the model 𝐹’s accuracy on 𝑥𝑖 with and without 𝑥𝑖 removed from the training set D of algorithm A. Note that for each sample 𝑥𝑖 , if its memorization value is high, it means that removing 𝑥𝑖 from training data will cause the model with a high possibility to wrongly classify itself, so 𝑥𝑖 is very likely to be fitted only by memorization and be atypical. Although promising, the computational efficiency of Leave-out method is potential to be low. For example, the Leave-Out method requires repeatedly training a DNN model for 1,000 trials to accurately estimate the memorization values. Ensemble Disagreement The work [14] provides an alternative way to efficiently figure out the examples which are poorly represented by other samples from the training dataset (atypical samples). They assume the samples which are well-represented in the distribution should be relatively easy for many types of DNN models to learn. Concretely, the ensemble agreement trains a series of models F = {𝐹𝑗 }𝑚𝑗 =1 with different architectures and initializations. Then, for each sample 𝑥𝑖 , it computes the disagreement of all models: ed(𝑥𝑖) = 1 𝑚2 Õ𝑚 𝑗=1 Õ𝑚 𝑝=1 JS-Divergence(𝐹𝑗 (𝑥𝑖), 𝐹𝑝 (𝑥𝑖)) Intuitively, if ed(𝑥𝑖) is large, different models will give different predictions to the sample 𝑥𝑖 . It suggest that the sample 𝑥𝑖 cannot be easily learned by most of models in F. Therefore, the sample 𝑥𝑖 is likely to be atypical. In practice, [15] trains multiple DNN models for 10 random initializations, or get 10 pretrained from public resources, which significantly improve the efficiency to find atypical samples. 108 Confidence-Based Method Similar to the ensemble disagreement method, the work [14] assumes that the models in F = {𝐹𝑗 }𝑚𝑗 =1 should be more confident on examples that are well-represented by the whole training dataset. Thus, the propose to measure the average confidence value of these models on sample 𝑥𝑖 : conf(𝑥𝑖) = 1 𝑚 Õ𝑚 𝑗=1 max(𝐹𝑗 (𝑥𝑖) [𝑦𝑖])) The smaller the confidence value of a sample 𝑥𝑖 , it is more likely to be atypical. Remarkably, the work [14] empirically demonstrates that the three metrics decribed above have good alignment for datasets such as MNIST, CIFAR10, CIFAR100 [56] and ImageNet [57]. Therefore, for practical application of our proposed method BAT, we can also substitute the leave-out method to ensemble disagreement and confidence-based methods, in order to to enhance the efficiency of BAT. 2.1.2 Distribution of Atypical Samples In Fig. A.5, we provide several examples of images from CIFAR10, CIFAR100 [56] and Tiny ImageNet [60] respectively, with different memorization value (as defined in Section 3) around 0.0, 0.5, 1.0. These examples suggest that if the memorization value of an image is large, this image is very likely to be “atypical”, as it presents very distinct semantic features with the images in the main distribution (with memorization value 0.0). 109 0.0 0.5 1.0 1.0 0.5 0.0 “Dog” in CIFAR10 “Plate” in CIFAR100 “Mountain Lion” in Tiny ImageNet 0.0 0.5 1.0 Figure A.5: Examples of Images with Different Memorization Values. In Fig. 3.5, we provide histograms to show the distribution of the estimated memorization values of all training samples from CIFAR10, CIFAR100 and Tiny ImageNet. From Fig. 
3.5, we can observe that atypical samples (with high memorization value > 0.15) consist of a significant fraction (over 40% & 50% respectively) in CIFAR100 and Tiny ImageNet. In CIFAR10, they also consist of a non-ignorable fraction which is over 10%. 2.2 Additional Results for Preliminary Study In this section, we provide the full results of the preliminary study in Section 3.4 on CIFAR10, CIFAR100 and Tiny ImageNet, to illustrate the distinct behaviors of the memorization effect between traditional ERMs and adversarial training. In both ERM and adversarial training, we train the models under ResNet18 and WideResNet28-10 (WRN28) architectures. In the experiments, we train the models for 200 epochs with learning rate 0.1, momentum 0.9, weight decay 5e-4, and decay the learning rate by 0.1 at the epoch 150 and 200. For adversarial training, we conduct experiments using PGD adversarial training [71] by default to defense against 𝑙∞-8/255 adversarial attack, with 110 the exception on Tiny ImageNet, which is against 𝑙∞-4/255 attack. For robustness evaluation, we conduct a 20-step PGD attack. 2.2.1 Additional Results for Preliminary Study - Section 3.4.1 In this subsection, we provide more experimental results to validate the statement in Section 3.4.1, where we state that fitting atypical samples in adversarial training can only improve the clean accuracy of test atypical samples, but hardly help their adversarial robustness. We provide full empirical results to show that: (i) In traditional ERM, fitting atypical samples improves the clean accuracy of test atypical samples. (ii) In adversarial training, fitting (adversarial) atypical samples improves the clean accuracy of test atypical samples but can hardly improve the adversarial robustness of them. The experimental setting follows Section 3.4.1, where we apply traditional ERM and adversarial training on original CIFAR10, CIFAR100, Tiny ImageNet datasets. We evaluate the model’s clean accuracy and adversarial accuracy on training atypical set Datyp = {𝑥𝑖 ∈ D : mem(𝑥𝑖) > 0.15} and its corresponding test atypical set D′ atyp = {𝑥′ 𝑗 ∈ D′ : infl(𝑥𝑖 , 𝑥′ 𝑗 ) > 0.15, for ∀𝑥𝑖 ∈ Datyp}. (i) Additional Results for Preliminary Study - Section 3.4.1 In Traditional ERM Fig. A.6, Fig. A.7 and Fig. A.8 report the performance (clean accuracy) of traditional ERM, which is evaluated on atypical sets under ResNet18 (left) and WRN28 (right) on CIFAR100, CIFAR10 and Tiny ImageNet. From the figures, we can obverse that fitting atypical samples during training can effectively help the models to achieve good clean accuracy on test atypical samples in all datasets. Note that here we only report clean accuracy as they are not robust against adversarial attacks. (ii) Additional Results for Preliminary Study - Section 3.4.1 In Adversarial Training Fig. A.9, Fig. A.10 and Fig. A.11 report the performance of adversarially trained models. We evaluate the clean accuracy and adversarial accuracy on the training atypical set Datyp and test atypical set D′ atyp. From the results, we can observe that although fitting atypical samples can help the model to have modest clean accuracy on test atypical samples, the adversarial robustness of them is constantly low and can hardly be improved during the whole training process. (i) Additional Results for Preliminary Study - Section 3.4.2 In Traditional ERM 111 Figure A.6: Clean Accuracy on Atypical Set of CIFAR100 on ResNet18 and WRN28. Figure A.7: Clean Accuracy on Atypical Set of CIFAR10 on ResNet18 and WRN28. 
Figure A.8: Clean Accuracy on Atypical Set of Tiny ImageNet on ResNet32 and WRN28.
Fig. A.12, Fig. A.13 and Fig. A.14 report the performance of traditional ERM trained on resampled datasets containing different amounts of atypical samples. The figures report the clean accuracy on the test typical sets of CIFAR100, CIFAR10 and Tiny ImageNet. We again leave the robustness performance out here, as the models are not robust to adversarial attacks. From the results, we can conclude that in traditional ERM, fitting atypical samples does not degrade the models' performance on typical samples. For example, on the CIFAR100 dataset, with 100% of atypical samples included (Atypical-100%), the accuracy on the test typical set is even slightly higher than that of the model trained without atypical samples (Atypical-0%). This conclusion is consistent across all three datasets and both model architectures.
Figure A.9: Clean Accuracy and Adversarial Accuracy on Atypical Set of CIFAR100 on ResNet18 and WRN28.
Figure A.10: Clean Accuracy and Adversarial Accuracy on Atypical Set of CIFAR10 on ResNet18 and WRN28.
Figure A.11: Clean Accuracy and Adversarial Accuracy on Atypical Set of Tiny ImageNet on ResNet18 and WRN28.
(ii) Additional Results for Preliminary Study - Section 3.4.2 In Adversarial Training
Fig. A.15, Fig. A.16 and Fig. A.17 report the performance of adversarial training on resampled datasets containing different amounts of atypical samples. The figures report both the clean and adversarial accuracy on the test typical sets of CIFAR100, CIFAR10 and Tiny ImageNet. Based on the experimental results, we find that including more atypical samples causes the model to have worse performance on typical samples in all three datasets. In datasets with a large portion of atypical samples, such as CIFAR100, the negative effects of atypical samples are more obvious: training on the dataset with 100% of atypical samples (Atypical-100%) causes the clean and adversarial accuracy to drop by ∼7% and ∼8%, respectively.
Figure A.12: Clean Accuracy on Typical Set of CIFAR100 on ResNet18 and WRN28.
Figure A.13: Clean Accuracy on Typical Set of CIFAR10 on ResNet18 and WRN28.
Figure A.14: Clean Accuracy on Typical Set of Tiny ImageNet on ResNet18 and WRN28.
2.2.2 An Ablation Study for "Harmful Atypical Samples"
In this subsection, we conduct additional experiments to verify the claims about "harmful atypical samples". In particular, we aim to demonstrate that atypical images with high "harmful scores" are likely to contain features which visually resemble images from a different (wrong) class, so fitting them is likely to degrade the model's performance. To validate this claim, we print out 200 atypical images from the CIFAR100 training set with the highest / lowest harmful scores as defined in Eq. (6) of the paper. Then, we ask two independent human annotators to label each image by choosing one of four options: (a) this image belongs to class 𝑦1; (b) this image belongs to class 𝑦2; (c) both class 𝑦1 and class 𝑦2 are likely; (d) neither class 𝑦1 nor class 𝑦2. Here, 𝑦1 is the ground-truth label of the sample; 𝑦2 is the class other than 𝑦1 on which the model has the maximal confidence, i.e., 𝑦2 = arg max𝑡≠𝑦1 𝐹𝑡 (𝑥), where the model 𝐹 is obtained via PGD adversarial training with all atypical samples removed from the training set. We report the percentage of each answer given by the human annotators.
Table A.6: Distribution of harmful scores.
Class 𝑦1 Class 𝑦2 Both Likely Neither. Total High Harmful Score 59.0 12.0 26.0 4.0 100.0 Low Harmful Score 96.0 0.0 1.5 2.5 100.0 From the results, we can see, for the samples with high Harmful scores, people are more likely to believe that ”the image is from 𝑦2” or ”the image is both likely from 𝑦1 or 𝑦2”. It is because the highly-harmful images can persist a lot of similar features from these two classes. It also worth to mention that for samples which people labeled them to 𝑦1, they usually also have features belonging to the class 𝑦2. For example, a ”bowl” which is round and with orange color. Although people believe it is a bowl confidently, it does have features such as ”round” and ”orange color”, which are also features of class ”orange”. 2.3 Proof of Theories In Section 3.4.2, we argue that the poisoning effect during adversarial training can be significantly stronger than traditional ERM. In this section, we provide detailed discussions about this 115 Figure A.15: Clean Accuracy and Adversarial Accuracy on Typical Set of CIFAR100 on ResNet18 and WRN28. Figure A.16: Clean Accuracy and Adversarial Accuracy on Typical Set of CIFAR10 on ResNet18 and WRN28. Figure A.17: Clean Accuracy and Adversarial Accuracy on Typical Set of Tiny ImageNet on ResNet18 and WRN28. claim as well as complete proofs. Our analysis is based on a binary classification task with a twodimensional Gaussian Mixture distribution: 𝑦 𝑢.∼𝑎.𝑟 {−1, +1}, 𝜃 = (𝜂, 0), 𝜂 ∈ R+, 𝑥 ∼ 8>>>> < >>>>: N (𝜃, 𝜎2𝐼) if 𝑦 = +1 N (−𝜃, 𝜎2𝐼) if 𝑦 = −1, (A.10) and we consider linear models 𝐹 = sign(𝑤𝑇 𝑥 + 𝑏), where 𝑤 and 𝑏 are the model weight and bias. Moreover, we assume there is an atypical training sample 𝑥0 ∈ R2 labeled as “−1”, with Euclidean distance to the center of each class ||𝑥0 − (−𝜃) ||2 = 𝑑−1, ||𝑥0 − 𝜃||2 = 𝑑+1. Intuitively, this setting of 𝑥0 resembles the cases in Figure 3.4 if 𝑑+1 is relatively small, since it has a short distance to the wrong class. We also constrain 𝑑−1 ≤ 2 × 𝜂 for 𝑥0 to preserve certain similarity to the class “-1”. The following theorem shows that there always exists a model which fits the sample 𝑥0, and achieve 116 optimal accuracy (i.e. > 99%) on both classes, although 𝑑+1 can be small. Theorem 3 For a data distribution in Eq. 2.2, given that 𝜂/𝜎 is large enough, we define 𝑙 = √ 2 · Φ−1(0.99) · 𝜎 · (1 +Φ−1(0.99) · 𝜎 𝜂 )−1/2. For a point 𝑥0 with 𝑑+1 ≤ 2𝜂, if 𝑑+1 ≥ 𝑙, then there exist a model 𝐹 fitting 𝑥0: 𝐹(𝑥0) = −1 and has classification error at most 1% on both classes. Proof 7 We prove this theorem by two main steps. Step 1. we show that for any 𝑥0 satisfying 𝑑−1 ≤ 2𝜂, 𝑑+1 ≥ 𝑙, at least one of 𝐹1, 𝐹2 and 𝐹3 fits 𝑥0: ∃ 𝑖 ∈ {1, 2}, 𝐹𝑖 (𝑥0) = −1, where: 𝐹1(𝑥) := sign( [1, 0]𝑇 𝑥 − (𝜂 − Φ−1(0.99) · 𝜎)) 𝐹2(𝑥) := sign( [𝜏, q 𝜂2 − 𝜏2]𝑇 · 𝑥) 𝐹3(𝑥) := sign( [𝜏, − q 𝜂2 − 𝜏2]𝑇 · 𝑥) In the following parts, we let 𝜏 = Φ−1 · 𝜎 for simplicity. Step 2. Both models 𝐹1 and 𝐹1 has a classification accuracy at least 99% for both classes. Proof of Step 1. For any given sample 𝑥0 = [𝑝, 𝑞], on one hand, if 𝑝 < 𝜂 − 𝜏, 𝐹1(𝑥0) = −1. On the other hand, if 𝑝 ≥ 𝜂−𝜏, we will show all samples 𝑥0 = [𝑝, 𝑞] which satisfies 𝐹2 ( [𝑝, 𝑞])) ≠ −1, it must have 𝑑+1 ≤ 𝑙, where 𝑙 = √ 2𝜎 · (1 + 𝜏/𝜂)−1/2. Here, we assume 𝑞 ≤ 0 without loss of generality. This can be treated as a quadratic maximization problem under a convex closed bound: max (𝑝 − 𝜂)2 + 𝑞2 subject to 𝐾1 := 𝜏 · 𝑝 + q 𝜂2 − 𝜏2 · 𝑞 ≥ 0 and 𝐾2 := 𝑝 ≥ 𝜂 − 𝜏; and 𝐾3 := (𝑝 − (−𝜂))2 + 𝑞2 ≤ (2𝜂)2 and 𝐾4 := 𝑞 ≤ 0 (A.11) Figure A.18 provides an illustration of the problem proposed in Eq. 
The optimum of Eq. A.11 lies on the boundary of the closed region (the shaded area). Next, we evaluate the objective (p − η)² + q² at the points S₁ and S₂, and compare the objective values at x₀ = S₁ and x₀ = S₂.

Figure A.18: Illustration of Eq. A.11 (the feasible region bounded by K₁, K₂, K₃ and K₄, the two candidate optima S₁ and S₂, and the class center θ = [η, 0]).

For x₀ = S₁, it satisfies

τ·p + √(η² − τ²)·q = 0  and  p = η − τ.

Solving these equations, we obtain (p − η)² + q² = (√2 · τ · (1 + τ/η)^(−1/2))².

For x₀ = S₂, it satisfies

τ·p + √(η² − τ²)·q = 0  and  (p − (−η))² + q² = (2η)².

Solving these equations, we obtain (p − η)² + q² = (τ · (1 − τ/η + Ω((τ/η)²))^(−1/2))².

It is not hard to see that, when η/σ is sufficiently large, the objective attains the larger value at S₁, and this optimum is not larger than l². This finishes the proof of Step 1.

Proof of Step 2. The distance from the center of class "+1" to the decision boundary of each of F₁, F₂ and F₃ is τ. Therefore, the error on class "+1" is:

Pr(F(x) ≠ +1 | y = +1) = Pr(N(0, σ²) > τ) = Pr(N(0, 1) > τ/σ) = Pr(N(0, 1) > Φ⁻¹(0.99)) = 1%.

For class "−1", since the decision boundaries of F₂ and F₃ pass through the origin, the center of class "−1" is also at distance τ from them, so these two classifiers likewise have 1% error on class "−1". For the classifier F₁, it is easy to see that its error on class "−1" is smaller than 1%.

Combining the results of Step 1 and Step 2 finishes the proof of Theorem 3.

Theorem 4. Consider the same data distribution as in Eq. (A.10) and the same definition l = √2 · Φ⁻¹(0.99) · σ · (1 + Φ⁻¹(0.99) · σ/η)^(−1/2). Assume l ≤ d_{+1} < k · l, where 1 ≤ k ≪ η/σ. Then any model F has classification error at least 50% on class "+1" whenever F fits the adversarial example of x₀, i.e., F(x₀ + δ) = −1 for all ||δ||₂ ≤ ε, with

ε ≥ σ · k · √2 · Φ⁻¹(0.99) · (1 + Φ⁻¹(0.99) · σ/η)^(−1/2).   (A.12)

Proof 8. During adversarial training, since ε ≥ k · l > d_{+1}, the adversarial perturbation set of the sample x₀ contains the center of class "+1", i.e., θ ∈ {x₀ + δ : ||δ||₂ ≤ ε}. Therefore, if a model fits every adversarial counterpart of the sample x₀ (predicting "−1" on this entire set), its decision boundary places θ on the "−1" side, and no such linear model can have an error rate smaller than 50% on class "+1". This finishes the proof of Theorem 4.
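To complement the proofs above, the construction in Theorem 3 can be sanity-checked numerically. The following minimal Monte Carlo sketch (not taken from the thesis) uses illustrative values η = 5 and σ = 1, samples the two-dimensional Gaussian mixture of Eq. (A.10), and verifies that the classifier F₁ fits an atypical point x₀ with d_{+1} ≥ l while keeping the error on class "+1" near 1% and the error on class "−1" near 0.

```python
import numpy as np
from scipy.stats import norm

# Illustrative values (assumed, not from the thesis); eta/sigma is large as required.
eta, sigma, n = 5.0, 1.0, 200_000
rng = np.random.default_rng(0)
tau = norm.ppf(0.99) * sigma                       # tau = Phi^{-1}(0.99) * sigma
l = np.sqrt(2) * tau * (1 + tau / eta) ** -0.5     # the threshold l from Theorem 3

# Sample the 2-D Gaussian mixture: x ~ N(y * theta, sigma^2 I) with theta = (eta, 0).
y = rng.choice([-1, 1], size=n)
x = y[:, None] * np.array([eta, 0.0]) + sigma * rng.standard_normal((n, 2))

# F1(x) = sign([1, 0]^T x - (eta - tau)): a threshold on the first coordinate.
pred = np.sign(x[:, 0] - (eta - tau))

# An atypical point labeled "-1" that is close to class "+1" (d_{+1} = 1.5*tau >= l)
# yet still lies on the "-1" side of F1's decision boundary.
x0 = np.array([eta - 1.5 * tau, 0.0])
print("d_{+1} =", 1.5 * tau, ">= l =", l, "; F1(x0) =", int(np.sign(x0[0] - (eta - tau))))

err_pos = np.mean(pred[y == +1] != +1)   # should be about 0.01
err_neg = np.mean(pred[y == -1] != -1)   # should be essentially 0
print(f"error on class +1: {err_pos:.4f}; error on class -1: {err_neg:.6f}")
```

With these illustrative values, the reported errors should come out at roughly 0.01 and 0, in line with the 1% bound in Theorem 3.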
3 Supplementary to Chapter 4

3.1 The Behavior of Adversarial Training
To examine the performance of PGD adversarial training under imbalanced scenarios, we adversarially train ResNet18 [45] models on multiple imbalanced training datasets constructed from the CIFAR10 dataset [56]. Similar to the observations discussed in Section 4.3.1, and as shown in Figure A.19, Figure A.20 and Figure A.21, adversarial training produces a larger performance gap between well-represented and under-represented classes than natural training does. In particular, in all imbalanced scenarios, adversarially trained models obtain very low robust accuracy on under-represented classes, which again shows that adversarial training cannot be applied directly in practical imbalanced scenarios.

Figure A.19: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Step-10". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.20: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Exp-100". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.21: Class-wise performance of natural & adversarial training under the imbalanced CIFAR10 dataset "Exp-10". (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.

3.2 Reweighting Strategy in Natural Training vs. in Adversarial Training
To explore whether the reweighting strategy can help adversarial training deal with imbalance issues, we evaluate the performance of adversarially trained models on several binary imbalanced training datasets with different weights assigned to the under-represented class. As shown in Figure A.22, Figure A.23 and Figure A.24, for adversarially trained models, increasing the weight assigned to the under-represented class improves the model's performance on that class; however, at the same time, the performance on the well-represented class drops drastically. As a comparison, adopting larger weights in naturally trained models also improves performance on the under-represented class, but results in only a slight drop in performance on the well-represented class. In other words, the reweighting strategy proposed for handling imbalance in natural training may provide only limited help in adversarial training, and hence new techniques are needed for adversarial training under imbalanced scenarios.
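As a toy illustration of this reweighting mechanism (natural training only, and not the thesis's CIFAR10 experiments), the following sketch fits a class-weighted logistic regression on a synthetic 10:1 imbalanced Gaussian dataset and reports per-class test accuracy for several minority-class weights; the data and weight values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the binary imbalanced experiments above (natural training only).
rng = np.random.default_rng(0)
dim = 10

def sample(n_pos, n_neg):
    X = np.vstack([rng.normal(+1.0, 3.0, (n_pos, dim)),   # well-represented class (+1)
                   rng.normal(-1.0, 3.0, (n_neg, dim))])  # under-represented class (-1)
    y = np.hstack([np.ones(n_pos, dtype=int), -np.ones(n_neg, dtype=int)])
    return X, y

X_train, y_train = sample(5000, 500)   # 10:1 imbalanced training set
X_test, y_test = sample(5000, 5000)    # balanced test set

for w_minor in [1, 5, 10, 50]:
    clf = LogisticRegression(class_weight={1: 1.0, -1: float(w_minor)}, max_iter=1000)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_maj = (pred[y_test == 1] == 1).mean()
    acc_min = (pred[y_test == -1] == -1).mean()
    print(f"minority weight {w_minor:>2}: well-represented acc {acc_maj:.3f}, "
          f"under-represented acc {acc_min:.3f}")
```

Increasing the minority weight shifts errors from the under-represented class to the well-represented class, which is the general trade-off that the figures above examine for natural versus adversarial training.

Figure A.22: Class-wise performance of reweighted natural & adversarial training in binary classification ("auto" as the well-represented class and "truck" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.23: Class-wise performance of reweighted natural & adversarial training in binary classification ("bird" as the well-represented class and "frog" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.
Figure A.24: Class-wise performance of reweighted natural & adversarial training in binary classification ("dog" as the well-represented class and "deer" as the under-represented class). (a) Natural training standard accuracy; (b) adversarial training standard accuracy; (c) adversarial training robust accuracy.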
3.3 Proofs of the Theorems in Section 5.4.2

3.3.1 Proof of Lemma 4.4
Proof 9 (Proof of Lemma 4.4). We first prove by contradiction that the optimal model f* has parameters w₁ = w₂ = · · · = w_d (i.e., w = 1). Define G = {1, 2, . . . , d} and suppose that, for the optimal w and b, there exist indices i ≠ j with i, j ∈ G such that w_i < w_j. Then this classifier f* with weight w has the following standard errors on the two classes:

Pr(f*(x) ≠ y | y = −1) = Pr{ Σ_{k≠i,j} w_k·N(−η, σ²) − b + w_i·N(−η, σ²) + w_j·N(−η, σ²) > 0 },
Pr(f*(x) ≠ y | y = +1) = Pr{ Σ_{k≠i,j} w_k·N(+η, σ²) − b + w_i·N(+η, σ²) + w_j·N(+η, σ²) < 0 }.   (A.13)

However, if we define a new classifier f̃ whose weight w̃ replaces w_i by w_j, we obtain the errors of the new classifier:

Pr(f̃(x) ≠ y | y = −1) = Pr{ Σ_{k≠i,j} w_k·N(−η, σ²) − b + w_j·N(−η, σ²) + w_j·N(−η, σ²) > 0 },
Pr(f̃(x) ≠ y | y = +1) = Pr{ Σ_{k≠i,j} w_k·N(+η, σ²) − b + w_j·N(+η, σ²) + w_j·N(+η, σ²) < 0 }.   (A.14)

Comparing the errors in Eq. (A.13) and Eq. (A.14) shows that the classifier f̃ has a smaller error on each class. This contradicts the assumption that f* is the optimal classifier with the smallest error. Thus, we conclude that an optimal linear classifier under natural training must satisfy w₁ = w₂ = · · · = w_d (i.e., w = 1), if we do not consider the scale of w.

Next, we calculate the optimal bias term b given w = 1, i.e., the b that minimizes the (reweighted) empirical risk:

Error_train(f*) = Pr(f*(x) ≠ y | y = −1) · Pr(y = −1) · ρ + Pr(f*(x) ≠ y | y = +1) · Pr(y = +1)
  ∝ Pr(f*(x) ≠ y | y = −1) · ρ + Pr(f*(x) ≠ y | y = +1) · K
  = ρ · Pr( Σ_{i=1}^d N(−η, σ²) − b > 0 ) + K · Pr( Σ_{i=1}^d N(η, σ²) − b < 0 )
  = ρ · Pr( N(0, 1) < (−b − dη)/(dσ) ) + K · Pr( N(0, 1) < (b − dη)/(dσ) ),

and we take the derivative with respect to b:

∂Error_train/∂b = (ρ/√(2π)) · (−1/(dσ)) · exp(−½ ((b + dη)/(dσ))²) + (K/√(2π)) · (1/(dσ)) · exp(−½ ((b − dη)/(dσ))²).

Setting ∂Error_train/∂b = 0, we obtain the optimal b that minimizes the empirical error:

b = ½ · log(ρ/K) · dσ²/η = ½ · log(ρ/K) · d/S.

3.3.2 Proof of Theorem 4.4
Proof 10 (Proof of Theorem 4.4). Without loss of generality, for distributions D₁ and D₂ with different mean-variance pairs (±η₁, σ₁²) and (±η₂, σ₂²), we only need to consider the case η₁ = η₂ = η and σ₁² < σ₂²; otherwise, we can rescale one of them to match the mean vector of the other without affecting the results. Under this setting, the optimal classifiers f₁* and f₂* have weight vectors w₁ = w₂ = 1 and bias terms b₁ and b₂, with the values given in Lemma 4.4. Next, we prove Theorem 4.4 in two steps.

Step 1. For the error on class "−1", we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = −1) = Pr( Σ_{i=1}^d N(−η, σ₁²) − b₁ > 0 )
  < Pr( Σ_{i=1}^d N(−η, σ₁²) − b₂ > 0 )   (because S₁ > S₂)
  < Pr( Σ_{i=1}^d N(−η, σ₂²) − b₂ > 0 )   (because σ₁² < σ₂²)
  = Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = −1).

Step 2. For the error on class "+1", we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) = Pr( Σ_{i=1}^d N(η, σ₁²) − b₁ < 0 ) = Pr( N(0, 1) < (b₁ − dη)/(dσ₁) ) = Pr( N(0, 1) < −log(K)·σ₁/(2η) − η/σ₁ ),   (A.15)

and similarly,

Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1) = Pr( N(0, 1) < −log(K)·σ₂/(2η) − η/σ₂ ).   (A.16)

Note that when K is large enough, i.e., log(K) > 2η²/(σ₁·σ₂), the Z-score in Eq. (A.15) is larger than that in Eq. (A.16). As a result, we have:

Pr(f₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) > Pr(f₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1).   (A.17)

Combining Step 1 and Step 2 gives the inequality in Theorem 4.4.

3.3.3 Proof of Theorem 4.4
Proof 11 (Proof of Theorem 4.4). We first show that, under both distributions D₁ and D₂, the optimal reweighting ratio ρ is equal to the imbalance ratio K. Based on the error expressions in Eq. (A.13) and the calculated model parameters w and b, the test error of the model trained with reweighting value ρ is:

Error_test(f*) = Pr(f*(x) ≠ y | y = −1) · Pr(y = −1) + Pr(f*(x) ≠ y | y = +1) · Pr(y = +1)
  ∝ Pr( N(0, 1) < (−b − dη)/(dσ) ) + Pr( N(0, 1) < (b − dη)/(dσ) )
  = Pr( N(0, 1) < −½·log(ρ/K)·σ/η − η/σ ) + Pr( N(0, 1) < ½·log(ρ/K)·σ/η − η/σ ).

This value attains its minimum when its derivative with respect to ρ equals 0, which gives ρ = K and a bias term b = 0. Note that the variances satisfy σ₁² < σ₂². Therefore, it is easy to see that:

Pr(f′₁*(x^{(1)}) ≠ y^{(1)} | y^{(1)} = +1) = Pr( Σ_{i=1}^d N(η, σ₁²) < 0 ) < Pr( Σ_{i=1}^d N(η, σ₂²) < 0 ) = Pr(f′₂*(x^{(2)}) ≠ y^{(2)} | y^{(2)} = +1).   (A.18)

Combining the results in Eq. (A.17) and Eq. (A.18), we have proved the inequality in Theorem 4.4.
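As a small sanity check (not part of the original proofs), the closed-form bias of Lemma 4.4 can be verified numerically against the reweighted error expression written above; the parameter values below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

# Error(b) = rho * Phi((-b - d*eta)/(d*sigma)) + K * Phi((b - d*eta)/(d*sigma)),
# matching the reweighted empirical risk used in the proof of Lemma 4.4.
rho, K, d, eta, sigma = 1.0, 10.0, 5, 1.0, 1.0

def error(b):
    return (rho * norm.cdf((-b - d * eta) / (d * sigma))
            + K * norm.cdf((b - d * eta) / (d * sigma)))

b_numeric = minimize_scalar(error, bounds=(-50, 50), method="bounded").x
b_formula = 0.5 * np.log(rho / K) * d * sigma**2 / eta   # b from Lemma 4.4
print(f"numerical argmin: {b_numeric:.3f}, closed form: {b_formula:.3f}")
```

Both values should agree (about -5.76 for these illustrative parameters), matching b = ½·log(ρ/K)·dσ²/η.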
3.4 Algorithm of SRAT
The algorithm of our proposed SRAT framework is shown in Algorithm 5. Specifically, in each training iteration, we first generate adversarial examples using PGD for the examples in the current batch (Line 5). If the current training epoch has not reached a predefined starting reweighting epoch T_d, we assign the same weight, i.e., w_i = 1, to every adversarial example x′_i in the current batch (Line 6). Otherwise, the reweighting strategy is adopted in the final loss function (Line 15), where a specific weight w_i is assigned to each adversarial example x′_i if its corresponding clean example x_i comes from an under-represented class.

Algorithm 5 Separable Reweighted Adversarial Training (SRAT).
Input: imbalanced training dataset D = {(x_i, y_i)}_{i=1}^n, number of total training epochs T, starting reweighting epoch T_d, batch size N, number of batches M, learning rate γ
Output: an adversarially robust model f_θ
1: Initialize the model parameters θ randomly;
2: for epoch = 1, . . . , T_d − 1 do
3:   for mini-batch = 1, . . . , M do
4:     Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D;
5:     Generate an adversarial example x′_i for each x_i ∈ B;
6:     L(f_θ) = (1/N) Σ_{i=1}^N max_{∥x′_i − x_i∥_p ≤ ε} L(f_θ(x′_i), y_i) + λ·L_sep(x′_i)
7:     θ ← θ − γ∇_θ L(f_θ)
8:   end for
9:   Optional: γ ← γ/κ
10: end for
11: for epoch = T_d, . . . , T do
12:   for mini-batch = 1, . . . , M do
13:     Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D;
14:     Generate an adversarial example x′_i for each x_i ∈ B;
15:     L(f_θ) = (1/N) Σ_{i=1}^N max_{∥x′_i − x_i∥_p ≤ ε} w_i·L(f_θ(x′_i), y_i) + λ·L_sep(x′_i)
16:     θ ← θ − γ∇_θ L(f_θ)
17:   end for
18:   Optional: γ ← γ/κ
19: end for
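A minimal PyTorch-style sketch of the reweighted update in Lines 13-16 of Algorithm 5 is given below. This is an illustrative reconstruction, not the thesis's implementation: `pgd_attack` is a standard L∞ PGD helper written here only for self-containedness (its radius and step size are illustrative and assume inputs in [0, 1]), `class_weights` is a per-class weight vector (all ones before epoch T_d), and the feature-separation term L_sep is left as an optional placeholder `sep_loss` because its exact form is defined in Chapter 4.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD (Line 14): find x' that maximizes the loss inside the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def srat_step(model, optimizer, x, y, class_weights, lam=0.0, sep_loss=None):
    """One reweighted update (Lines 14-16). class_weights[c] = 1 before epoch T_d and
    is larger for under-represented classes afterwards; sep_loss stands in for L_sep."""
    x_adv = pgd_attack(model, x, y)
    per_example = F.cross_entropy(model(x_adv), y, reduction="none")
    w = class_weights[y]                        # weight of each example by its clean class
    loss = (w * per_example).mean()
    if sep_loss is not None:                    # optional feature-separation term (placeholder)
        loss = loss + lam * sep_loss(model, x_adv)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```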
3.5 Data Distribution of Imbalanced Training Datasets
In our experiments, we construct multiple imbalanced training datasets to simulate various kinds of imbalanced scenarios by combining different imbalance types (i.e., Exp and Step) with different imbalance ratios (i.e., K = 10 and K = 100). Figure A.25 and Figure A.26 show the data distributions of all ten-class imbalanced training datasets used in our preliminary studies and experiments, constructed from the CIFAR10 [56] and SVHN [80] datasets, respectively.

Figure A.25: Data distribution of the imbalanced training datasets constructed from the CIFAR10 dataset. (a) Step 10; (b) Step 100; (c) Exp 10; (d) Exp 100.
Figure A.26: Data distribution of the imbalanced training datasets constructed from the SVHN dataset. (a) Step 10; (b) Step 100; (c) Exp 10; (d) Exp 100.

3.6 Performance Comparison on Imbalanced SVHN Datasets
Table A.7 and Table A.8 show the performance comparison on various imbalanced SVHN datasets with different imbalance types and imbalance ratios. We use bold values to denote the highest accuracy among all methods, and underlined values to indicate the SRAT variants that achieve the highest accuracy among the corresponding baseline methods utilizing the same loss function for making predictions.

From Table A.7 and Table A.8, we obtain a similar observation: compared with the baseline methods, our proposed SRAT method produces a robust model with improved overall performance when the training dataset is imbalanced. In addition, based on the experimental results in Table 4.1 to Table A.8, we find that, compared with the improvement between DRCB-LDAM and SRAT-LDAM, the improvements between DRCB-CE and SRAT-CE and between DRCB-Focal and SRAT-Focal are more obvious. A possible reason for this phenomenon is that the LDAM loss can also implicitly produce a more separable feature space [13], while the CE loss and the Focal loss do not perform any specific operations on the latent feature space. Hence, the feature separation loss contained in SRAT-CE and SRAT-Focal can be more effective at learning a separable feature space and facilitating the CE and Focal losses in prediction. However, in SRAT-LDAM, the feature separation loss and the LDAM loss may affect each other when learning feature representations, and hence the effectiveness of the feature separation loss may be counteracted or weakened. In conclusion, the experiments conducted on multiple imbalanced datasets verify the effectiveness of our proposed SRAT method under various imbalanced scenarios.

Table A.7: Performance Comparison on Imbalanced SVHN Datasets (Imbalance Type: Step).

Imbalance Ratio          10                                   100
Metric           Standard Acc.     Robust Acc.       Standard Acc.     Robust Acc.
Method           Overall   Under   Overall   Under   Overall   Under   Overall   Under
CE                79.88    67.04    37.62    22.08    59.61    26.19    29.57     5.03
Focal             79.96    67.03    37.83    22.47    60.58    28.17    30.27     5.83
LDAM              84.55    74.96    45.80    31.23    65.61    37.13    33.34     8.36
CB-Reweight       79.48    66.07    37.38    21.66    60.23    27.68    29.54     5.32
CB-Focal          80.29    67.56    38.10    23.00    60.73    28.37    30.09     5.75
DRCB-CE           80.62    68.74    37.25    22.79    60.67    28.36    30.02     5.59
DRCB-Focal        79.11    65.72    37.01    22.02    61.65    30.29    30.78     7.06
DRCB-LDAM         87.83    82.63    46.45    35.15    63.78    33.99    33.60     7.28
SRAT-CE           82.89    72.79    38.23    24.70    63.39    33.85    29.64     6.11
SRAT-Focal        85.05    77.10    39.51    28.06    70.12    47.44    32.18    11.08
SRAT-LDAM         87.65    82.62    46.03    34.75    71.56    50.33    33.54    11.63

Table A.8: Performance Comparison on Imbalanced SVHN Datasets (Imbalance Type: Exp).

Imbalance Ratio          10                                   100
Metric           Standard Acc.     Robust Acc.       Standard Acc.     Robust Acc.
Method           Overall   Under   Overall   Under   Overall   Under   Overall   Under
CE                87.54    82.67    44.12    35.33    72.51    56.30    33.34    16.93
Focal             87.82    83.01    44.88    35.97    72.61    56.48    34.09    17.62
LDAM              90.06    86.69    51.84    43.73    79.11    66.86    40.42    25.18
CB-Reweight       87.66    82.79    44.39    35.53    72.25    55.97    33.36    17.16
CB-Focal          87.86    82.96    44.61    35.55    73.23    57.34    34.25    17.90
DRCB-CE           88.49    84.51    43.82    36.28    73.74    58.03    33.52    17.68
DRCB-Focal        87.47    82.78    42.52    34.31    71.95    55.11    33.43    17.63
DRCB-LDAM         91.24    89.65    52.39    46.71    80.29    69.23    40.16    24.64
SRAT-CE           88.70    84.94    44.54    36.59    77.11    64.47    34.48    19.91
SRAT-Focal        89.51    85.42    45.37    37.20    80.04    69.54    35.25    23.04
SRAT-LDAM         91.27    89.55    52.10    46.13    80.71    70.49    40.33    25.11

4 Supplementary to Chapter 5

4.1 Proof of Theorem
In this part, we provide the detailed proof of Theorem 5 in Section 5.4.

Theorem 4 (Recall of Theorem 5). Suppose a set of "clean" samples S_good of size n is i.i.d. sampled from a distribution N(μ, Σ), where Σ ⪯ σ²I. There is also a set of "bad" samples S_bad of size n_p = n/K, K > 1, whose center μ_p satisfies ||μ_p − μ||₂ = d with d = γ·σ. Then the average squared poisoning scores of the clean and bad samples satisfy:

E_{i∈S_bad}[q²(x_i)] − E_{i∈S_good}[q²(x_i)] ≥ ( ((K − 1)/(K + 1))·γ² − (K + 1) ) · σ².

Proof 12. We denote by S_good the set of clean samples and by S_bad the set of bad samples; their union forms the whole set S. To distinguish the centers of these sets, we use μ_g, μ_p and μ_s to denote the centers of the clean samples, the bad samples and the whole set, respectively. First, it is easy to see that μ_g, μ_p and μ_s lie on the same line. Thus, given n_g : n_p = K : 1, the center distances satisfy:

d = ||μ_g − μ_p|| = (1 + K) · ||μ_g − μ_s|| = ((K + 1)/K) · ||μ_p − μ_s||.

We first study the average squared poisoning score over the whole set S. Let V be the top right singular vector of the centered data matrix of S, so that the squared poisoning score of a sample x is q²(x) = (V · (x − μ_s))². For any unit vector v, we have (using E_{i∈S}[x] = μ_s):

E_{i∈S}[(v · (x − μ_s))²] = E_{i∈S}[(v · (x − μ_p) + v · (μ_p − μ_s))²]
  = E_{i∈S}[(v · (x − μ_p))²] + E_{i∈S}[2 (v · (x − μ_p)) (μ_p − μ_s)ᵀ vᵀ] + (v · (μ_p − μ_s))²
  = E_{i∈S}[(v · (x − μ_p))²] − (v · (μ_p − μ_s))².

Since V is the top right singular vector of the centered data matrix X_s − μ_s, we have E_{i∈S}[(V · (x − μ_s))²] ≥ E_{i∈S}[(v′ · (x − μ_s))²] for any unit vector v′. Choosing v′ = (μ_p − μ_s)/||μ_p − μ_s||, the unit vector in the direction of (μ_p − μ_s), we get:

E_{i∈S}[(V · (x − μ_s))²] ≥ E_{i∈S}[(v′ · (x − μ_p))²] − ||μ_p − μ_s||².

For the first term on the right-hand side of this inequality,

(K + 1) · E_{i∈S}[(v′ · (x − μ_p))²] ≥ K · E_{i∈S_good}[(v′ · (x − μ_p))²],

because all the squared scores are non-negative. Then we have:

(K + 1) · E_{i∈S}[(v′ · (x − μ_p))²] ≥ K · E_{i∈S_good}[(v′ · (x − μ_p))²]
  = K · E_{i∈S_good}[(v′ · (x − μ_g) + v′ · (μ_g − μ_p))²]
  = K · [ v′ · (X_g − μ_g)(X_g − μ_g)ᵀ · v′ᵀ + v′ · (μ_g − μ_p)(μ_g − μ_p)ᵀ · v′ᵀ ]
  ≥ K · (0 + ||μ_g − μ_p||²).

The first term is non-negative because the matrix (X_g − μ_g)(X_g − μ_g)ᵀ is positive semi-definite; the bound on the second term holds because v′ has the same direction as (μ_p − μ_s), and hence as (μ_g − μ_p), since μ_p, μ_g and μ_s lie on the same line. Therefore, the average squared poisoning score over the whole set S satisfies:

E_{i∈S}[(V · (x − μ_s))²] ≥ (K/(K + 1)) · ||μ_g − μ_p||² − ||μ_p − μ_s||² = (K/(1 + K)²) · d².

Next, we bound the average squared poisoning score over the good set S_good:

E_{i∈S_good}[(V · (x − μ_s))²] = V · (X_g − μ_g)(X_g − μ_g)ᵀ · Vᵀ + V · (μ_g − μ_s)(μ_g − μ_s)ᵀ · Vᵀ ≤ σ² + ||μ_g − μ_s||² ≤ σ² + (d/(K + 1))².

Based on the calculations above for the whole set and the good set, the average squared poisoning score over the bad set S_bad satisfies:

E_{i∈S_bad}[(V · (x − μ_s))²] = (1 + K) · E_{i∈S}[(V · (x − μ_s))²] − K · E_{i∈S_good}[(V · (x − μ_s))²] ≥ (K²/(K + 1)²) · d² − K·σ²,

and the difference between the two average scores is:

E_{i∈S_bad}[q²(x_i)] − E_{i∈S_good}[q²(x_i)] ≥ ((K − 1)/(K + 1)) · d² − (K + 1)·σ² = ( ((K − 1)/(K + 1))·γ² − (K + 1) ) · σ².
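The gap established by this theorem can be observed numerically. The sketch below (illustrative only, with assumed parameter values) takes the poisoning score of a sample to be its centered projection onto the top right singular vector, as in the proof above, and compares the average squared scores of a clean Gaussian set and a small shifted "bad" set against the theoretical lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, dim, sigma, gamma = 5000, 10, 20, 1.0, 8.0   # illustrative values
d = gamma * sigma
mu_bad = np.zeros(dim)
mu_bad[0] = d                                      # ||mu_p - mu|| = d = gamma * sigma

X_good = sigma * rng.standard_normal((n, dim))                       # clean samples, center 0
X_bad = mu_bad + 0.1 * sigma * rng.standard_normal((n // K, dim))    # a tight poisoned cluster
X = np.vstack([X_good, X_bad])

# Squared poisoning score: projection onto the top right singular vector of the centered data.
Xc = X - X.mean(axis=0)
V = np.linalg.svd(Xc, full_matrices=False)[2][0]
q2 = (Xc @ V) ** 2

gap = q2[n:].mean() - q2[:n].mean()
bound = ((K - 1) / (K + 1) * gamma**2 - (K + 1)) * sigma**2
print(f"observed gap = {gap:.2f} >= theoretical lower bound = {bound:.2f}")
```

For these values the observed gap (roughly 51) comfortably exceeds the lower bound (roughly 41), consistent with the theorem.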
4.2 More Experimental Details
In this part, we provide additional experimental details, such as the pre-processing procedure.

Adult Census Dataset. In this dataset, we use 5 numerical features ("age", "education-num", "hours-per-week", "capital-loss" and "capital-gain"), as well as categorical features such as "workclass", "education", "marital-status", "occupation", "relationship" and "race". For simplicity, we first transform the categorical features into dummy variables and apply Principal Component Analysis to project them onto the space spanned by the first 15 principal components. After the projection, we normalize all 20 features by centering and standardizing them. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix 4.3, we provide empirical results for an additional choice of d, i.e., d = 6.0.

COMPAS Dataset. In this dataset, we have the numerical features "age", "age_cat", "juv_fel_count", "juv_misd_count", "juv_other_count", "priors_count", "days_b_screening_arrest", "decile_score", "c_jail_in" and "c_jail_out". We use "c_jail_out - c_jail_in" to obtain the number of days in jail and then exclude c_jail_out and c_jail_in. We also have the categorical features c_charge_degree and sex, and we use PCA to find their first two principal directions. Then, we standardize each feature. On this dataset, when running the Min-Max and F-Attack attacks, we set the radius of the feasible injection set to d = 9.0. In Appendix 4.3, we provide empirical results for an additional choice of d, i.e., d = 6.0.
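For concreteness, the Adult Census preprocessing described above could be reproduced roughly as follows. This is an illustrative sketch rather than the original code; the CSV file name and the exact column spellings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative reconstruction of the Adult Census preprocessing described above.
df = pd.read_csv("adult.csv")   # assumed file name

num_cols = ["age", "education-num", "hours-per-week", "capital-loss", "capital-gain"]
cat_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race"]

# Categorical features -> dummy variables -> first 15 principal components.
dummies = pd.get_dummies(df[cat_cols]).to_numpy(dtype=float)
cat_proj = PCA(n_components=15).fit_transform(dummies)

# Concatenate with the 5 numerical features (20 features in total),
# then center and standardize every feature.
X = np.hstack([df[num_cols].to_numpy(dtype=float), cat_proj])
X = StandardScaler().fit_transform(X)
print(X.shape)   # (n_samples, 20)
```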
Table A.9: RFC & Baseline Methods' Performance on Adult Census Dataset, for Equalized Treatment.

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.795  0.808   0.787  0.823     0.784  0.834     0.797  0.800     0.802  0.800
SEVER            0.772  0.829   0.754  0.823     0.757  0.801     0.785  0.795     0.781  0.790
RFC              0.781  0.803   0.791  0.820     0.799  0.830     0.788  0.806     0.783  0.823

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.779  0.834   0.787  0.810     0.781  0.795     0.785  0.823     0.790  0.822
SEVER            0.766  0.803   0.757  0.800     0.770  0.788     0.764  0.788     0.769  0.803
RFC              0.778  0.834   0.788  0.804     0.781  0.812     0.794  0.799     0.793  0.803

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.782  0.797   0.783  0.792     0.766  0.787     0.777  0.822     0.770  0.801
SEVER            0.775  0.769   0.766  0.792     0.741  0.803     0.755  0.773     0.766  0.770
RFC              0.786  0.806   0.794  0.812     0.776  0.815     0.790  0.822     0.786  0.823

Table A.10: RFC & Baseline Methods' Performance on Adult Census Dataset using [128].

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.815  0.944   0.813  0.962     0.804  0.954     0.814  0.942     0.811  0.931
SEVER            0.812  0.952   0.806  0.954     0.808  0.957     0.814  0.942     0.811  0.953
RFC              0.814  0.964   0.808  0.956     0.801  0.951     0.802  0.963     0.802  0.944

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.801  0.969   0.797  0.864     0.769  0.966     0.796  0.804     0.799  0.799
SEVER            0.800  0.960   0.802  0.954     0.788  0.945     0.786  0.937     0.779  0.956
RFC              0.802  0.967   0.811  0.946     0.803  0.950     0.808  0.952     0.809  0.951

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.784  0.953   0.797  0.841     0.700  0.943     0.755  0.821     0.798  0.800
SEVER            0.802  0.959   0.793  0.978     0.775  0.952     0.778  0.909     0.775  0.927
RFC              0.799  0.944   0.799  0.952     0.796  0.950     0.801  0.951     0.803  0.946

4.3 More experimental results
In this part, we provide additional empirical results to validate the effectiveness of our attack and defense methods. In detail, we consider further settings: (1) a different type of fairness criterion, such as "Equalized Treatment"; (2) a different fair classification method, namely the one in [128]; and (3) a different choice of the radius of the feasible injection space d, namely d = 6.0. Notably, in all experiments in the main paper we set d to the fixed value d = 9.0; here we also report results for the smaller option d = 6.0, because a larger d (i.e., d > 9.0) would make the poisoned samples easier for most defense methods to detect.

For the fairness criterion "Equalized Treatment" (Equalized Positive Rate, PR) on the Adult Census dataset, we set the desired fairness criterion to be |PR(z = 0) − PR(z = 1)| < 0.2. For the fairness criterion "Equalized TPR" on the Adult Census dataset, we set the unfairness criterion to be |TPR(z = 0) − TPR(z = 1)| < 0.05. For the fairness criteria "Equalized TPR" and "Equalized Treatment" on the COMPAS dataset, we set the desired fairness criteria to be max_{j∈Z} |TPR(z = j) − TP̂R| ≤ 0.15 and max_{j∈Z} |PR(z = j) − P̂R| ≤ 0.15.
Table A.11: RFC & Baseline Methods' Performance on Adult Census Dataset when d = 6.

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.796  0.970   0.803  0.955     0.762  0.977     0.803  0.778     0.811  0.723
SEVER            0.762  0.955   0.786  0.956     0.744  0.902     0.743  0.944     0.762  0.735
RFC              0.800  0.969   0.805  0.966     0.812  0.945     0.785  0.963     0.783  0.961

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.802  0.945   0.800  0.945     0.759  0.946     0.800  0.672     0.789  0.721
SEVER            0.758  0.960   0.743  0.980     0.724  0.961     0.726  0.855     0.724  0.756
RFC              0.805  0.955   0.801  0.967     0.775  0.953     0.801  0.955     0.774  0.945

Table A.12: RFC & Baseline Methods' Performance on COMPAS Dataset for Equalized Treatment.

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.671  0.845   0.663  0.854     0.655  0.845     0.659  0.873     0.670  0.845
SEVER            0.665  0.831   0.665  0.849     0.667  0.836     0.674  0.842     0.672  0.834
RFC              0.671  0.830   0.669  0.845     0.672  0.846     0.677  0.850     0.672  0.844

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.680  0.843   0.662  0.845     0.666  0.862     0.658  0.854     0.677  0.843
SEVER            0.662  0.836   0.659  0.839     0.646  0.850     0.648  0.825     0.671  0.842
RFC              0.677  0.838   0.676  0.841     0.675  0.852     0.674  0.848     0.668  0.845

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.629  0.866   0.668  0.845     0.669  0.820     0.662  0.855     0.675  0.855
SEVER            0.633  0.848   0.631  0.855     0.613  0.850     0.613  0.866     0.678  0.845
RFC              0.659  0.845   0.674  0.846     0.675  0.835     0.671  0.846     0.680  0.845

Table A.13: RFC & Baseline Methods' Performance on COMPAS Dataset using [128].

                 No Attack      Label Flip(10%)  Label Flip(20%)  Attr. Flip(10%)  Attr. Flip(20%)
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.651  0.844   0.642  0.865     0.644  0.827     0.651  0.855     0.662  0.840
SEVER            0.647  0.846   0.6770 0.841     0.657  0.830     0.644  0.851     0.642  0.830
RFC              0.656  0.852   0.668  0.841     0.679  0.850     0.677  0.846     0.661  0.852

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.655  0.840   0.645  0.842     0.646  0.832     0.628  0.851     0.654  0.816
SEVER            0.645  0.860   0.640  0.861     0.628  0.840     0.619  0.852     0.659  0.831
RFC              0.655  0.858   0.662  0.854     0.663  0.850     0.660  0.851     0.659  0.852

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.659  0.820   0.631  0.851     0.649  0.800     0.621  0.851     0.658  0.815
SEVER            0.641  0.866   0.638  0.851     0.596  0.865     0.611  0.870     0.643  0.831
RFC              0.671  0.861   0.659  0.855     0.650  0.851     0.640  0.857     0.659  0.850
Table A.14: RFC & Baseline Methods' Performance on COMPAS Dataset when d = 6.

(ε = 10%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.650  0.865   0.581  0.838     0.648  0.800     0.651  0.820     0.631  0.815
SEVER            0.668  0.853   0.643  0.788     0.652  0.800     0.656  0.802     0.644  0.823
RFC              0.670  0.850   0.651  0.835     0.658  0.820     0.663  0.822     0.665  0.835

(ε = 15%)        Min-Max(z=0)   Min-Max(z=1)     F-Attack(z=0)    F-Attack(z=1)    F-Attack*
                 Acc.   Fair.   Acc.   Fair.     Acc.   Fair.     Acc.   Fair.     Acc.   Fair.
No Defense       0.621  0.854   0.617  0.859     0.515  0.980     0.614  0.890     0.640  0.801
SEVER            0.625  0.819   0.641  0.801     0.624  0.770     0.645  0.853     0.635  0.815
RFC              0.660  0.833   0.675  0.857     0.672  0.832     0.664  0.832     0.660  0.847
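The "Fair." values in the tables above are evaluated against criteria of the form given in Section 4.3. The following is a minimal sketch (illustrative only, not the thesis's evaluation code) of how the Equalized-Treatment and Equalized-TPR gaps can be computed from binary predictions and a sensitive attribute z.

```python
import numpy as np

def positive_rate_gap(y_pred, z):
    """Equalized-Treatment gap: |PR(z=0) - PR(z=1)| for predictions in {0, 1}."""
    return abs(y_pred[z == 0].mean() - y_pred[z == 1].mean())

def tpr_gap(y_true, y_pred, z):
    """Equalized-TPR gap: |TPR(z=0) - TPR(z=1)|."""
    tpr = lambda mask: y_pred[(y_true == 1) & mask].mean()
    return abs(tpr(z == 0) - tpr(z == 1))

# Illustrative usage with random placeholders for real model outputs.
rng = np.random.default_rng(0)
y_true, y_pred, z = (rng.integers(0, 2, 1000) for _ in range(3))
print(positive_rate_gap(y_pred, z) < 0.2, tpr_gap(y_true, y_pred, z) < 0.05)
```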