LEARNING FROM IMBALANCED DATA DISTRIBUTION

By Wentao Wang

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2023

ABSTRACT

As a prominent component of artificial intelligence (AI), machine learning (ML) techniques play a significant role in the stunning achievements obtained by AI technologies in human society. ML techniques enable computers to leverage collected data to tackle various kinds of tasks in practice. However, more and more studies reveal that the capability of an ML model decreases dramatically if the distribution of the data collected to train it is imbalanced. As imbalanced data distributions are widespread in many real-world applications, improving the performance of ML models under imbalanced data distributions has attracted considerable attention. While a growing number of related works have been proposed to make ML models learn from imbalanced data more effectively, the study of this topic is far from complete. In this dissertation, I propose several studies to fill the gaps in this direction. First, most existing data-generation-based works consider only the local distribution information within classes, while the global distribution is ignored entirely. I demonstrate that both global and local distribution information are important for producing high-quality synthetic data samples to balance the data distribution. Second, almost all existing studies assume that collected data samples are associated with noise-free labels, and hence they cannot work well when annotated labels are noisy. I investigate the problem of learning from imbalanced crowdsourced labeled data and propose a novel framework as a solution with satisfactory performance. Third, research investigating the impact of imbalanced data distribution on the robustness of ML models is currently rather limited. To this end, I empirically verify that the adversarial training (AT) approach alone cannot bring enough robustness to ML models under imbalanced scenarios, while integrating a reweighting strategy with AT can be very helpful. In addition, I propose an effective data-augmentation-based framework to benefit AT under imbalanced scenarios.

Copyright by WENTAO WANG 2023

To my parents and entire family for their love and support.

ACKNOWLEDGEMENTS

This dissertation would not have been possible without the invaluable help, support, and guidance I received from many great people during my Ph.D. study. First and foremost, I would like to express my heartfelt thanks to my advisor, Dr. Jiliang Tang. I am very grateful to him for providing me with a valuable opportunity to start a new journey in my life. During this unforgettable journey, I have learned so much from him, ranging from discovering significant research problems, writing insightful papers, giving attractive presentations, and collaborating with teammates, to establishing valuable career goals. The priceless knowledge and experience I gained from him have benefited me greatly and allowed me to achieve things I had never imagined. Dr. Tang is a role model for me to learn from, and I feel honored to have worked with him. I also would like to convey my deepest appreciation to the other members of my dissertation committee, Dr. Hui Liu, Dr. Pang-Ning Tan and Dr. Yuying Xie, for their insightful comments and helpful suggestions. During my Ph.D. study, I have had the pleasure and fortune of having many close friends and colleagues.
I would like to thank all of my colleagues from the Data Science and Engineering Lab for their selfless support and constant encouragement: Zhikai Chen, Yingqian Cui, Dr. Jamell Dacon, Dr. Tyler Derr, Jiayuan Ding, Dr. Wenqi Fan, Kai Guo, Haoyu Han, Pengfei He, Dr. Jiangtao Huang, Dr. Wei Jin, Dr. Hamid Karimi, Hang Li, Juanhui Li, Yaxin Li, Dr. Haochen Liu, Hua Liu, Remy Liu, Dr. Xiaorui Liu, Dr. Yao Ma, Haitao Mao, Jie Ren, Harry Shomer, Wenzhuo Tang, Yuxuan Wan, Dr. Xiaoyang Wang, Dr. Xin Wang, Dr. Yiqi Wang, Dr. Zhiwei Wang, Hongzhi Wen, Han Xu, Kaiqi Yang and Dr. Xiangyu Zhao. I am also thankful for the collaboration of colleagues from outside the Data Science and Engineering Lab: Wenbiao Ding, Yan Huang, Dr. Guoliang Li, Dr. Zitao Liu, Dr. Joseph Thekinen, Dr. Bhavani Thuraisingham, Dr. Suhang Wang and Guowei Xu. Finally, I would like to thank my family for their love and support. I also dedicate this dissertation to Yiwei Ma for supporting me all the way.

TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Contributions

CHAPTER 2 GLOBAL-AND-LOCAL AWARE DATA GENERATION FROM IMBALANCED DATA DISTRIBUTION
  2.1 Chapter Introduction
  2.2 Related Work
  2.3 The Proposed Framework
  2.4 Experiment
  2.5 Case Study
  2.6 Chapter Conclusion

CHAPTER 3 LEARNING FROM IMBALANCED CROWDSOURCED LABELED DATA
  3.1 Chapter Introduction
  3.2 The Proposed Framework
  3.3 Experiment
  3.4 Related Work
  3.5 Chapter Conclusion

CHAPTER 4 IMBALANCED ADVERSARIAL TRAINING WITH REWEIGHTING
  4.1 Chapter Introduction
  4.2 Preliminary Study
  4.3 Theoretical Analysis
  4.4 Separable Reweighted Adversarial Training
  4.5 Experiment
  4.6 Related Work
  4.7 Chapter Conclusion

CHAPTER 5 MIX-UP STRATEGY TO ENHANCE ADVERSARIAL TRAINING WITH IMBALANCED DATA
  5.1 Chapter Introduction
  5.2 Related Work
  5.3 The Proposed Framework
  5.4 Regularization Effect Of Imb-Mix
  5.5 Experiment
  5.6 Chapter Conclusion

CHAPTER 6 CONCLUSIONS
  6.1 Dissertation Summary
  6.2 Future Work

BIBLIOGRAPHY

CHAPTER 1
INTRODUCTION

1.1 Motivation

As a prominent component of artificial intelligence (AI), machine learning techniques play an important role in the great successes achieved by AI technologies in the human world. The core objective of machine learning is to instruct computers to leverage collected data to tackle diverse tasks [75]. Machine learning techniques have been applied to a wide range of applications, which greatly facilitate people's daily lives while also effectively improving productivity in various sectors. For instance, in e-commerce, recommender systems [6] can provide personalized product recommendations to customers, which helps customers find products of interest more efficiently and hence brings more profit to e-commerce platforms; in information security, face recognition systems [53] are able to confirm people's identities by matching their faces against faces stored in a database, which accelerates the verification process when people access confidential data and at the same time provides enhanced security for protecting those data; in healthcare, medical image analysis methods [65] make diagnoses by analyzing medical images, which reduces costs for patients and improves efficiency for healthcare organizations.

Despite the huge potential of developing advanced machine learning algorithms to handle more complex tasks and enhance the power of AI in the human world, recent studies [8] reveal that the capability of a machine learning model decreases dramatically if the quality of the data used to train it is low. For supervised classification algorithms, which utilize data samples with corresponding class labels to train a model to predict the class information of target data samples, an imbalanced data distribution can be regarded as one common type of low-quality data. An imbalanced data distribution refers to the case in which some classes in the data set have an exceedingly large number of data samples while others have very few. Since most supervised classification algorithms are developed under the common assumption that each class in the training data set has a relatively equal number of data samples, their trained models tend to be overwhelmed by classes with many training samples while ignoring classes with few training samples [16] during the training phase, and hence cannot obtain satisfactory classification performance on these ignored classes in the inference phase.

Considering that imbalanced data distributions are widespread in many applications, along with the aforementioned negative impacts they bring to machine learning algorithms, many efforts have been devoted over the past few decades to making models learn from imbalanced data more effectively.
Although many related works have been proposed, the study of this topic is far from complete. First, when generating synthetic data samples, most existing data-generation-based works consider only the local distribution information within classes with a small amount of data samples, while the global distribution is ignored entirely. Hence, the quality of the synthetic data samples generated by these works cannot be guaranteed. Second, almost all existing studies assume that collected data samples are associated with noise-free labels. Therefore, they cannot handle a more complicated but realistic scenario in which the annotated labels are noisy. For example, multiple crowd workers may be invited to annotate labels for collected data samples, and hence the annotated labels can be very inconsistent and noisy. Third, the majority of previous studies focus only on improving the prediction accuracy of models under imbalanced data distributions, while research investigating the impact of imbalanced data distribution on the robustness of trained models is rather limited. Recently, the robustness of machine learning models has attracted increasing attention as such models are deployed in real-world applications.

In this dissertation, I present several studies to fill the gaps in the three aforementioned perspectives. First, I focus on mitigating imbalanced data distribution by generating synthetic data samples. I will demonstrate that both global and local distribution information are important for producing high-quality synthetic data samples. Second, I investigate the problem of learning from imbalanced crowdsourced labeled data, which is a more realistic and challenging scenario in the real world. I will propose a novel framework for training a discriminative model directly on imbalanced crowdsourced labeled data with satisfactory prediction performance. Third, I study the model robustness problem under imbalanced data distribution. I will empirically verify that the adversarial training approach alone cannot bring enough robustness to models under imbalanced scenarios, while integrating a reweighting strategy with adversarial training can be very helpful. In addition, I will present an effective data-augmentation-based framework to benefit adversarial training under imbalanced scenarios.

1.2 Contributions

The major contributions of this dissertation are summarized as follows:

• I conduct research on three important but less-explored topics about learning from imbalanced data distribution: (1) generating high-quality synthetic data, (2) learning from imbalanced crowdsourced labeled data, and (3) improving model robustness;

• In Chapter 2, I first identify the importance of both global and local distribution information in data generation approaches for imbalanced data and then propose a novel framework to generate more realistic synthetic data samples under imbalanced data distribution by utilizing both global and local distribution information;

• In Chapter 3, I present a novel framework to obtain a discriminative model that can achieve good prediction performance on all classes involved in the data set by training directly on imbalanced crowdsourced labeled data;

• In Chapter 4, I first empirically discover two major differences between naturally trained models and models trained by the adversarial training approach under imbalanced data distribution and then propose a new framework to improve model robustness under imbalanced data distribution;
• In Chapter 5, I present an effective framework to augment imbalanced training data so that the robustness of models can be boosted by applying adversarial training in the training phase.

CHAPTER 2
GLOBAL-AND-LOCAL AWARE DATA GENERATION FROM IMBALANCED DATA DISTRIBUTION

In many real-world classification applications such as fake news detection, the training data can be extremely imbalanced, which brings challenges to existing classifiers because the majority classes dominate their loss functions. Oversampling techniques such as SMOTE are effective approaches for tackling the class imbalance problem by producing more synthetic minority samples. Despite their success, most existing oversampling methods consider only local data distributions when generating minority samples, which can result in noisy minority samples that do not fit the global data distribution or that interleave with majority classes. Hence, in this chapter, we study the class imbalance problem by simultaneously exploring local and global data information, since (i) the local data distribution gives detailed information for generating minority samples, and (ii) the global data distribution provides guidance to avoid generating outliers or samples that interleave with majority classes. Specifically, we propose a novel framework, GL-GAN, which leverages the SMOTE method to explore the local distribution in a learned latent space and employs a GAN to capture the global information, so that synthetic minority samples can be generated even under extremely imbalanced scenarios. Experimental results on diverse real data sets demonstrate the effectiveness of our GL-GAN framework in producing realistic and discriminative minority samples that improve the classification performance of various classifiers trained on imbalanced data.

2.1 Chapter Introduction

Classification performance relies heavily on the quality and quantity of the training data [49]. However, in many real-world applications, due to practical concerns such as privacy and time cost, only limited labeled data can be collected. Meanwhile, such data can be imbalanced: some classes have a significantly larger number of data samples while others have a very limited amount, which is called the class imbalance problem [44]. For instance, in fake news detection [88], the majority of news items in the collected data are true news while only a small portion are fake. Imbalanced data negatively affects classifier training, since standard classifiers tend to be overwhelmed by the majority classes while ignoring the minority classes [16]. Furthermore, even though minority classes may account for an extremely small portion of a data set, in some applications, such as medical diagnosis, misclassifying a minority-class sample is usually more severe than misclassifying a majority one [68].

Oversampling has been proven to be an effective way to alleviate the class imbalance problem by adding minority samples to the imbalanced data set [70]. As one of the most popular oversampling methods, the Synthetic Minority Over-sampling Technique (SMOTE) [14] generates new synthetic minority samples by performing linear interpolation between existing minority samples and their nearest neighbors within the same class. As shown in Figure 2.1, by applying the SMOTE method, new synthetic minority samples are generated along the linear interpolation between two existing minority samples.
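To make the interpolation concrete, the following minimal NumPy sketch performs the SMOTE generation step just described; the function name, the k and n_new parameters, and the brute-force neighbor search are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=None):
    """Generate synthetic minority samples by SMOTE-style linear interpolation.

    X_min: (m, d) array of minority-class samples.
    Returns an (n_new, d) array of synthetic samples.
    """
    rng = np.random.default_rng(seed)
    m = len(X_min)
    # Pairwise squared distances, used to find each sample's k nearest
    # minority-class neighbors (the sample itself is excluded).
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]          # (m, k) neighbor indices
    synth = []
    for _ in range(n_new):
        i = rng.integers(m)                     # pick a minority sample x_i
        j = nn[i, rng.integers(k)]              # pick one of its k neighbors
        eta = rng.random()                      # interpolation weight in [0, 1]
        synth.append(X_min[i] + eta * (X_min[j] - X_min[i]))
    return np.stack(synth)
```

Every synthetic point therefore lies on the segment between a minority sample and one of its minority-class neighbors, which is exactly the purely local behavior, blind to the global distribution, discussed next.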
Despite the success of SMOTE and its variants [35, 39], they still face some challenges. First, SMOTE-based methods consider only the local neighbor relationship of each minority sample, while the global distribution is ignored entirely. Without considering the global distribution of the given data, the generated minority samples may not fit the real data distribution. For instance, the generated samples in Figure 2.1 are either located in the null space of the given data samples or interleaved with majority data samples. Second, the interpolation performed by these methods in the raw feature space may not generate realistic data samples. For instance, for text data, which lies in a discrete space, SMOTE-based methods cannot guarantee that the generated texts are readable.

Therefore, in this chapter, we study the class imbalance problem by simultaneously exploring both global and local information. The local data distribution provides detailed local information for generating minority samples, and the global data distribution provides guidance from a global view to avoid generating samples that interleave with majority samples or fall in the null space of the given data. We are faced with two challenges: (i) how to explore the global data distribution for minority sample generation; and (ii) how to simultaneously leverage global and local distribution information to generate realistic and discriminative synthetic minority samples.

Figure 2.1 An example of imbalanced data and the SMOTE method. The synthetic minority samples are generated along the dashed line between two minority samples.

Recently, generative adversarial learning [31] has shown promising results in generating realistic data samples [31, 74] by estimating the latent global data distribution, which paves a way for us to solve these two challenges. Hence, we propose a novel framework that leverages oversampling techniques to capture the local data structure and generative adversarial learning to explore the global data distribution. The contributions of this chapter are summarized below:

• We identify the importance of both global and local distribution information in tackling the class imbalance problem.
• We propose a novel generative adversarial framework, GL-GAN, to generate realistic and discriminative minority samples by exploring both global and local distributions.
• We conduct extensive experiments on diverse real data sets to demonstrate the effectiveness of GL-GAN in alleviating the class imbalance problem.

2.2 Related Work

In many real-world applications, we are faced with the class imbalance problem. Its prevalence has attracted increasing attention, and various kinds of effective approaches have been proposed in the last few decades. Existing works for tackling the class imbalance problem can be roughly classified into three categories [55]: (i) data-level methods, which modify the class distribution by adding or removing samples from the training set; (ii) algorithm-level methods, which modify existing algorithms to adapt to imbalanced scenarios; and (iii) hybrid methods, which combine the advantages of the two previous categories. Our GL-GAN is a data-level method.

Undersampling [110, 68] and oversampling [14, 35, 39] are two fundamental data-level solutions. Briefly, undersampling approaches downsize the majority class by removing majority samples, while oversampling approaches upsize the minority class by generating minority samples [66].
Oversampling with replacement, also called random oversampling [30], is the simplest oversampling approach: it randomly duplicates existing minority samples to augment the minority class. However, random oversampling often shrinks the decision boundary of the classifier and causes it to overfit [35]. As an improvement, SMOTE [14] inflates the minority class by producing synthetic minority samples instead of duplicating existing ones. Building on SMOTE, several variants, such as borderline-SMOTE1 and borderline-SMOTE2 [35] and ADASYN [39], have been proposed in the past few years to achieve better performance. Different from SMOTE-based methods, which use Euclidean distance to perform interpolation, some recent works [2, 86] introduced the Mahalanobis distance into the synthetic minority sample generation process and achieved good performance in classifier training.

Recently, more and more researchers have been attracted to generative adversarial learning due to its great power in generating various kinds of realistic synthetic data samples. The pioneering work of [31] presented Generative Adversarial Networks (GANs), which learn the real data distribution through a minimax game between a generator 𝐺 and a discriminator 𝐷. The generator 𝐺 produces synthetic samples to fool the discriminator 𝐷, while the discriminator 𝐷 judges whether the input samples come from the generator or from the real data set. These two components fight against each other and improve themselves gradually [29]. In the perfect equilibrium, the generator 𝐺 captures the global distribution of the real training data and generates synthetic samples following this distribution [20]. Owing to this ability, some recent research has applied generative adversarial learning to the class imbalance problem. For instance, a conditional GAN [74] is adopted in [24] to produce minority samples effectively. BAGAN [72] is a data augmentation model that alleviates the class imbalance problem by modifying the discriminator 𝐷 of the traditional GAN. However, all the aforementioned models used for the class imbalance problem take random noise as the input of the generator 𝐺, which may introduce a lot of uncertainty during model training. Moreover, these models do not explore the local structure of the minority training samples, so some generated synthetic samples may lie close to the decision boundary and thus be hard to use for training a classifier.

Our GL-GAN is inherently different from existing works. We simultaneously explore the global and local information by combining local-based oversampling techniques with generative adversarial learning models. Therefore, GL-GAN can overcome the drawbacks of the oversampling techniques and utilize the power of generative adversarial models to produce more realistic and discriminative synthetic minority data samples.

2.3 The Proposed Framework

In this chapter, we focus on the binary class imbalance problem.
Given an imbalanced sample set $\mathcal{X}_{org}$ containing a majority sample set $\mathcal{X}_{maj}$ and a minority sample set $\mathcal{X}_{min}$, our goal is to generate a set of realistic and discriminative synthetic minority samples $\mathcal{X}_{syn}$ so that, compared with training only on the original imbalanced sample set $\mathcal{X}_{org}$, the classification performance of classifiers can be greatly improved by training on the balanced augmented sample set $\mathcal{X}_{org} \cup \mathcal{X}_{syn}$. As shown in Figure 2.2, our GL-GAN is composed of two modules: local structure exploration and global distribution learning. The former is designed to generate latent representations of minority samples by exploring the local distribution information, and the latter aims to produce realistic and discriminative minority samples that fit the global distribution. Next, we introduce the details of each module.

Figure 2.2 An overview of GL-GAN.

2.3.1 Local Structure Exploration

The local structure exploration module consists of two components, an encoder 𝐸 and a local data representation interpolator 𝐼, each specified for a different task.

2.3.1.1 Discriminative Representation Learning

In many cases, directly generating synthetic data samples in the raw feature space with local-based oversampling techniques such as SMOTE may cause several problems. First, as demonstrated above, these methods cannot generate realistic synthetic samples for some specific data types such as text data. Second, the generated minority data samples may interleave with majority samples. This motivates us to first learn discriminative latent representations of the raw data and then exploit the local data structure in the learned latent space. The advantages of doing so are as follows: (i) by learning a low-dimensional latent representation, we can preserve the most important information of the data while dropping some noisy information; and (ii) during the latent representation learning process, we can enforce the latent representations of data samples belonging to the same class to be close to each other.

Deep autoencoders have been proven to be an effective way to extract important information from high-dimensional data using low-dimensional representations [84]. Typically, an autoencoder consists of two components: an encoder 𝐸 and a decoder 𝑄. The encoder 𝐸 takes the high-dimensional data as input and maps it to the corresponding latent representations. The decoder 𝑄 recovers these learned latent representations back to the raw feature space. The goal of training an autoencoder is to minimize the reconstruction error between the input data and the reconstructed data produced by the decoder 𝑄, which can be defined as

$$\mathcal{L}_{rec}\big(Q(E(\mathcal{X}_{org})), \mathcal{X}_{org}\big) = \frac{1}{|\mathcal{X}_{org}|} \sum_{x_i \in \mathcal{X}_{org}} \big\| Q(E(x_i)) - x_i \big\|_2^2. \tag{2.1}$$

In our GL-GAN framework, we propose to embed the given real data samples into a latent space with the majority samples in one cluster and the minority samples in another, where these two clusters should be far away from each other. By doing so, we aim to reduce the interleaving between the generated synthetic minority samples and the given majority samples. Formally, this process can be described by

$$\mathcal{L}_{clu} = \frac{1}{|\mathcal{X}_{maj}|} \sum_{x_i \in \mathcal{X}_{maj}} \| E(x_i) - z_{maj} \|_2^2 + \frac{\lambda_1}{|\mathcal{X}_{min}|} \sum_{x_i \in \mathcal{X}_{min}} \| E(x_i) - z_{min} \|_2^2 - \lambda_2 \| z_{maj} - z_{min} \|_2^2, \tag{2.2}$$

where $z_{maj}$ and $z_{min}$ are the means of the latent representations of the majority sample set $\mathcal{X}_{maj}$ and the minority sample set $\mathcal{X}_{min}$, respectively, and $\lambda_1$ and $\lambda_2$ are two hyper-parameters controlling the weights.
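As a concrete reading of the two objectives above, the following PyTorch-style sketch computes Eqs. (2.1) and (2.2); the function signatures are assumptions, and the cluster centers are estimated from the current mini-batches rather than from the full majority and minority sets.

```python
import torch

def reconstruction_loss(E, Q, x_org):
    # Eq. (2.1): mean squared reconstruction error over the given samples.
    return ((Q(E(x_org)) - x_org) ** 2).sum(dim=1).mean()

def clustering_loss(E, x_maj, x_min, lam1=1.0, lam2=1.0):
    # Eq. (2.2): pull each class toward its own latent centroid while
    # pushing the two centroids away from each other.
    z_maj, z_min = E(x_maj), E(x_min)
    c_maj, c_min = z_maj.mean(dim=0), z_min.mean(dim=0)
    return (((z_maj - c_maj) ** 2).sum(dim=1).mean()
            + lam1 * ((z_min - c_min) ** 2).sum(dim=1).mean()
            - lam2 * ((c_maj - c_min) ** 2).sum())
```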
From here on, we use Λ or 𝜆 to represent hyper-parameters. The autoencoder in our GL-GAN can thus be trained by minimizing the following loss function:

$$\mathcal{L}_A = \mathcal{L}_{rec} + \Lambda_1 \mathcal{L}_{clu} + \Lambda_2 R(\theta), \tag{2.3}$$

where $R(\theta)$ is the regularizer of the model parameters $\theta$. Once the autoencoder is well trained, the latent representation of a sample $x_i$ is given by $z_i = E(x_i)$.

2.3.1.2 Local-based Data Generation

With the learned latent representations, we can generate synthetic minority samples in the latent space by exploring the local structure of the sample set $\mathcal{Z}_{min}$, the latent embedding of the minority sample set $\mathcal{X}_{min}$. In our GL-GAN, we adopt SMOTE as the implementation of the local data interpolator 𝐼 because of its simplicity.

For any minority sample $z_i \in \mathcal{Z}_{min}$, SMOTE (1) discovers the $k$ nearest neighbors $\{z_i^1, z_i^2, \dots, z_i^k\}$ of $z_i$ within the same minority class set $\mathcal{Z}_{min}$, and (2) randomly picks one nearest neighbor $z_i^n$ ($n \in [1, k]$) from the set $\{z_i^1, z_i^2, \dots, z_i^k\}$ and chooses a random number $\eta \in [0, 1]$; a new synthetic minority sample $z_i'$ is then created as $z_i' = z_i + \eta (z_i^n - z_i)$. The second step can be repeated $N$ times, so that, finally, $N \times |\mathcal{Z}_{min}|$ synthetic minority samples are generated when the same process is executed on every minority sample in $\mathcal{Z}_{min}$. After the synthetic minority sample set $\mathcal{Z}_{syn}$ is obtained, we have a balanced augmented sample set $\mathcal{Z} = \mathcal{Z}_{maj} \cup \mathcal{Z}_{min} \cup \mathcal{Z}_{syn}$ in the latent space.

2.3.2 Global Distribution Learning

To make the generated minority samples in $\mathcal{Z}_{syn}$ more realistic and discriminative, we introduce a generative adversarial learning model to learn the global information of the given samples and modify the samples in $\mathcal{Z}_{syn}$ accordingly.

2.3.2.1 Discriminator 𝐷

The role of the discriminator 𝐷 is to differentiate whether a data sample is real or fake. A data sample that comes from the given real data set is labeled by the discriminator 𝐷 as real; a data sample synthetically generated by the generator 𝐺 is classified as fake. The discriminator 𝐷 and the generator 𝐺 fight against each other and improve themselves gradually. The loss function for training the discriminator 𝐷 can be written as

$$\mathcal{L}_D = \frac{1}{|\mathcal{X}_{org}|} \sum_{x_i \in \mathcal{X}_{org}} \| D(x_i) - 1 \|_2^2 + \frac{\lambda}{|\mathcal{Z}|} \sum_{z_i \in \mathcal{Z}} \| D(G(z_i)) - 0 \|_2^2. \tag{2.4}$$

In equilibrium, the discriminator 𝐷 cannot tell the difference between real and synthetic samples, which means the quality of the synthetic data generated by the generator 𝐺 approximates that of the real data.

2.3.2.2 Classifier 𝐶

To make sure that the generated data samples have the expected labels, we introduce a classifier 𝐶 into our GL-GAN. Specifically, the classifier 𝐶 also takes both real samples and synthetic samples generated by the generator 𝐺 as input. Since the input of 𝐺 in GL-GAN is the balanced augmented sample set $\mathcal{Z}$, every output of the generator 𝐺, i.e., $G(z_i)$, has a corresponding label. The classifier 𝐶 works on labeled data samples and classifies them. The loss function for training the classifier 𝐶 in our GL-GAN is

$$\mathcal{L}_C = \frac{1}{|\mathcal{X}_{org}|} \sum_{x_i \in \mathcal{X}_{org}} \| C(x_i) - \Gamma_{x_i} \|_2^2 + \frac{\lambda}{|\mathcal{Z}|} \sum_{z_i \in \mathcal{Z}} \| C(G(z_i)) - \Gamma_{z_i} \|_2^2, \tag{2.5}$$

where $\Gamma_{x_i}$ and $\Gamma_{z_i}$ are the true labels of the real sample $x_i$ and the latent representation $z_i$, respectively. By introducing the classifier 𝐶 into the traditional GAN, the generator 𝐺 is forced to produce synthetic samples that can be correctly classified by 𝐶.
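The least-squares objectives of Eqs. (2.4) and (2.5) translate almost directly into code. The sketch below is a simplified PyTorch rendering that uses batch averages in place of the full-set sums and assumes one-hot label targets; it is illustrative, not the exact implementation.

```python
import torch

def discriminator_loss(D, G, x_org, z_all, lam=1.0):
    # Eq. (2.4): D should output 1 on real samples and 0 on generated ones.
    real = ((D(x_org) - 1.0) ** 2).mean()
    fake = (D(G(z_all).detach()) ** 2).mean()   # detach: G is not updated here
    return real + lam * fake

def classifier_loss(C, G, x_org, y_org, z_all, y_z, lam=1.0):
    # Eq. (2.5): C should predict the known (one-hot) label of both real
    # samples and samples decoded from the balanced latent set Z.
    real = ((C(x_org) - y_org) ** 2).mean()
    fake = ((C(G(z_all).detach()) - y_z) ** 2).mean()
    return real + lam * fake
```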
2.3.2.3 Generator 𝐺

Different from a traditional generator, which takes random noise following some prior distribution as input, the generator 𝐺 in our GL-GAN is fed with the balanced augmented sample set $\mathcal{Z}$ during training. Since there are two types of latent representations in $\mathcal{Z}$, namely the latent representations of real samples in $\mathcal{Z}_{maj}$ and $\mathcal{Z}_{min}$, denoted $\mathcal{Z}_{org} = \mathcal{Z}_{maj} \cup \mathcal{Z}_{min}$, and the latent representations of synthetic samples in $\mathcal{Z}_{syn}$, the generator 𝐺 should be able to project the latent representations $\mathcal{Z}_{org}$ back to the raw feature space as well as produce synthetic data samples that can fool the discriminator 𝐷. Therefore, the loss for training the generator 𝐺 includes three terms: the reconstruction loss $\mathcal{L}_{rec}$ for mapping the latent representations $\mathcal{Z}_{org}$ back to the raw feature space; the discriminator loss $\mathcal{L}_{(G,D)}$ produced by the discriminator 𝐷, which evaluates the difference between the real data samples and the data samples generated by 𝐺; and the classifier loss $\mathcal{L}_{(G,C)}$ given by the classifier 𝐶, which classifies the generated data samples of 𝐺. Formally, the loss function for training the generator 𝐺 in our GL-GAN can be defined as

$$\begin{aligned} \mathcal{L}_G &= \mathcal{L}_{rec}(G(\mathcal{Z}_{org}), \mathcal{X}_{org}) + \lambda_1 \mathcal{L}_{(G,D)} + \lambda_2 \mathcal{L}_{(G,C)} \\ &= \frac{1}{|\mathcal{X}_{org}|} \sum_{x_i \in \mathcal{X}_{org},\, z_i \in \mathcal{Z}_{org}} \| G(z_i) - x_i \|_2^2 + \frac{\lambda_1}{|\mathcal{Z}|} \sum_{z_i \in \mathcal{Z}} \| D(G(z_i)) - 1 \|_2^2 + \frac{\lambda_2}{|\mathcal{Z}|} \sum_{z_i \in \mathcal{Z}} \| C(G(z_i)) - \Gamma_{z_i} \|_2^2. \end{aligned} \tag{2.6}$$

After the whole framework is trained, the generator 𝐺 is able to produce a set of realistic and discriminative synthetic minority samples.

Algorithm 2.1 The algorithm of GL-GAN.
Input: an imbalanced sample set $\mathcal{X}_{org}$
1: Initialize the parameters of the autoencoder.
2: Pre-train the autoencoder to obtain the latent representations $\mathcal{Z}_{org} = \mathcal{Z}_{maj} \cup \mathcal{Z}_{min}$ of $\mathcal{X}_{org}$.
3: Apply the SMOTE method to $\mathcal{Z}_{min}$ to get the synthetic minority sample set $\mathcal{Z}_{syn}$.
4: Form a balanced augmented sample set $\mathcal{Z} = \mathcal{Z}_{maj} \cup \mathcal{Z}_{min} \cup \mathcal{Z}_{syn}$ in the latent space.
5: repeat
6:   for discriminator-epochs do
7:     Train the discriminator 𝐷 with the augmented latent sample set $\mathcal{Z}$ and the real sample set $\mathcal{X}_{org}$ (Sec. 2.3.2.1).
8:   end for
9:   for classifier-epochs do
10:    Train the classifier 𝐶 with the augmented latent sample set $\mathcal{Z}$ and the real sample set $\mathcal{X}_{org}$ (Sec. 2.3.2.2).
11:  end for
12:  for generator-epochs do
13:    Train the generator 𝐺 (Sec. 2.3.2.3).
14:  end for
15: until model convergence

2.3.3 Objective Function of GL-GAN

With the local structure exploration module and the global distribution learning module introduced above, the final objective function of GL-GAN is given as:

$$\min_{\theta_G, \theta_C} \max_{\theta_D} \; \mathcal{L}_{rec}(G(\mathcal{Z}_{org}), \mathcal{X}_{org}) + \Lambda_1 \mathcal{L}_{(G,D)} + \Lambda_2 \mathcal{L}_{(G,C)}, \tag{2.7}$$

where $\theta_G$, $\theta_C$, and $\theta_D$ are the parameters of the generator 𝐺, the classifier 𝐶, and the discriminator 𝐷, respectively.

2.3.4 Algorithm

We summarize our GL-GAN framework in Algorithm 2.1. We first train the autoencoder to make sure it can map the input data samples into two far-away clusters in the latent space. After pre-training the autoencoder, we utilize the encoder 𝐸 to obtain the latent representations of the input data samples. Then, the local data interpolator 𝐼 is applied in the learned latent space to generate a set of synthetic minority samples within the minority cluster. In order to train the generative adversarial learning part more effectively, we use the knowledge learned by the pre-trained autoencoder to initialize the generative model. Specifically, the discriminator 𝐷 and the classifier 𝐶 have the same architecture as the encoder 𝐸, except that both 𝐷 and 𝐶 have one more layer.
The last layer of the discriminator 𝐷 is a dense layer with a softmax activation function that produces binary outputs, and the last layer of the classifier 𝐶 is a dense layer that produces classification results. The parameters learned by the encoder 𝐸 are used to initialize the discriminator 𝐷 and the classifier 𝐶 during the generative model training phase. Similarly, the generator 𝐺 is initialized with the weight parameters learned by the decoder 𝑄, since they have the same architecture.

2.4 Experiment

In this section, we conduct experiments to verify the effectiveness of our proposed GL-GAN framework. We aim at answering the following two questions:

• Can the proposed GL-GAN framework generate discriminative minority samples that improve the classification performance on imbalanced data?
• What is the impact of each module of GL-GAN?

We begin by introducing the data sets and experimental settings; then we compare GL-GAN with several state-of-the-art related methods on the classification task to answer the first question. We then analyze the impact of each module of GL-GAN to answer the second question.

2.4.1 Experimental Settings

To test how well the generated synthetic samples alleviate the binary class imbalance problem, we use the classification performance of different classifiers trained on various augmented sample sets as the evaluation indicator.

2.4.1.1 Data Sets

The experiments are conducted on five real data sets: USPS, Sensorless Drive Diagnosis, Gas Sensor Array Drift, Madelon, and Gisette. Sensorless Drive Diagnosis and Gas Sensor Array Drift are publicly available from the UCI data repository1, and the remaining three can be obtained from the Feature Selection data repository2. Since all five data sets are class-balanced, we construct an imbalanced data set from each of them according to the following three steps. First, we randomly choose one class as the majority class and another as the minority class to obtain a balanced binary data set. Then we set aside 80% of the samples of this balanced binary data set as the candidate set and the rest as the test set. Lastly, we artificially imbalance the candidate set using a predefined imbalance ratio 𝑟. For instance, if 𝑟 = 0.01, then 99% of the minority samples are removed from the candidate set, so that the ratio of minority samples to majority samples in the imbalanced data set is 0.01. Table 2.1 provides the statistics of the five imbalanced data sets obtained by these three steps when the imbalance ratio is 𝑟 = 0.01.

1 https://archive.ics.uci.edu/ml/index.php
2 http://featureselection.asu.edu/datasets.php

Table 2.1 Statistical information of imbalanced data sets.

Data Set                    | # Features | # Majority | # Minority
USPS                        |        256 |        744 |          7
Sensorless Drive Diagnosis  |         48 |       4256 |         42
Gas Sensor Array Drift      |        128 |       1549 |         15
Madelon                     |        500 |       1040 |         10
Gisette                     |       5000 |       2800 |         28

2.4.1.2 Classifiers

Since our goal is to generate synthetic minority samples that improve classification performance, we introduce several classifiers to help evaluate the quality of the generated samples. Three representative classifiers, Multi-layer Perceptron (MLPClassifier), Linear Support Vector Classification (LinearSVC), and AdaBoost, are adopted in our experiments. We train these classifiers on the training sets augmented with the synthetic minority samples generated by our model or the baselines and test them on the corresponding test sets.
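This evaluation protocol is straightforward to express with scikit-learn; a minimal sketch follows, using the metrics defined in the next subsection. The stand-in data set is only there to make the snippet runnable; in the experiments, the training split would be the imbalanced set augmented with generated minority samples.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score, matthews_corrcoef

# Placeholder data; X_aug/y_aug stand for the augmented training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_aug, X_test, y_aug, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in (MLPClassifier(), LinearSVC(), AdaBoostClassifier()):
    clf.fit(X_aug, y_aug)
    pred = clf.predict(X_test)
    print(type(clf).__name__,
          "macro F1 = %.4f" % f1_score(y_test, pred, average="macro"),
          "micro F1 = %.4f" % f1_score(y_test, pred, average="micro"),
          "MCC = %.4f" % matthews_corrcoef(y_test, pred))
```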
All three classifiers are implemented with the scikit-learn package3 in Python, and we use their default settings in all experiments.

3 https://scikit-learn.org/stable/index.html

2.4.1.3 Evaluation Metrics

To measure the classification performance of the classifiers, we introduce three metrics into our experiments: macro F1-score, micro F1-score, and the Matthews correlation coefficient (MCC) [73]. The value of MCC lies in the range [−1, 1], where MCC = 1 indicates a perfect prediction, MCC = 0 means the prediction made by a classifier is no better than a random prediction, and MCC = −1 represents total disagreement between the prediction and the observation.

Table 2.2 Classification performance of classifiers on the USPS data set.

Classifier     | Metric   | Imbalanced | Random | SMOTE  | MDO    | NRAS   | SWIM   | BAGAN  | GL-GAN
MLPClassifier  | macro F1 | 0.7712     | 0.8840 | 0.8825 | 0.8914 | 0.8415 | 0.6443 | 0.8208 | 0.8937
MLPClassifier  | micro F1 | 0.7969     | 0.8899 | 0.8887 | 0.8947 | 0.8503 | 0.6595 | 0.8356 | 0.8985
MLPClassifier  | MCC      | 0.6232     | 0.7839 | 0.7823 | 0.7858 | 0.7022 | 0.4731 | 0.6872 | 0.7990
LinearSVC      | macro F1 | 0.8473     | 0.8580 | 0.8580 | 0.8834 | 0.8408 | 0.6184 | 0.8836 | 0.8912
LinearSVC      | micro F1 | 0.8589     | 0.8681 | 0.8680 | 0.8865 | 0.8497 | 0.6380 | 0.8896 | 0.8957
LinearSVC      | MCC      | 0.7346     | 0.7510 | 0.7510 | 0.7683 | 0.7010 | 0.4440 | 0.7838 | 0.7913
AdaBoost       | macro F1 | 0.8024     | 0.7878 | 0.8036 | 0.8436 | 0.7883 | 0.8900 | 0.7848 | 0.8920
AdaBoost       | micro F1 | 0.8218     | 0.8095 | 0.8221 | 0.8558 | 0.8098 | 0.8906 | 0.8077 | 0.8966
AdaBoost       | MCC      | 0.6689     | 0.6441 | 0.6662 | 0.7291 | 0.6444 | 0.7865 | 0.6433 | 0.7942

Table 2.3 Classification performance of classifiers on the Sensorless Drive Diagnosis data set.

Classifier     | Metric   | Imbalanced | Random | SMOTE  | MDO    | NRAS   | SWIM   | BAGAN  | GL-GAN
MLPClassifier  | macro F1 | 0.7697     | 0.8642 | 0.8700 | 0.8580 | 0.8510 | 0.8439 | 0.9078 | 0.9334
MLPClassifier  | micro F1 | 0.7809     | 0.8666 | 0.8722 | 0.8608 | 0.8542 | 0.8476 | 0.9086 | 0.9338
MLPClassifier  | MCC      | 0.6251     | 0.7608 | 0.7699 | 0.7513 | 0.7406 | 0.7299 | 0.8310 | 0.8755
LinearSVC      | macro F1 | 0.8195     | 0.8425 | 0.8425 | 0.8455 | 0.8420 | 0.8435 | 0.9281 | 0.8714
LinearSVC      | micro F1 | 0.8250     | 0.8462 | 0.8462 | 0.8490 | 0.8457 | 0.8471 | 0.9285 | 0.8735
LinearSVC      | MCC      | 0.6939     | 0.7277 | 0.7277 | 0.7322 | 0.7269 | 0.7292 | 0.8659 | 0.7721
AdaBoost       | macro F1 | 0.8962     | 0.9683 | 0.9686 | 0.9955 | 0.9835 | 0.9924 | 0.9817 | 0.9959
AdaBoost       | micro F1 | 0.8973     | 0.9683 | 0.9687 | 0.9955 | 0.9835 | 0.9924 | 0.9817 | 0.9959
AdaBoost       | MCC      | 0.8119     | 0.9385 | 0.9392 | 0.9910 | 0.9676 | 0.9849 | 0.9638 | 0.9918

Table 2.4 Classification performance of classifiers on the Gas Sensor Array Drift data set.

Classifier     | Metric   | Imbalanced | Random | SMOTE  | MDO    | NRAS   | SWIM   | BAGAN  | GL-GAN
MLPClassifier  | macro F1 | 0.3879     | 0.7880 | 0.8116 | 0.6540 | 0.7182 | 0.6974 | 0.8891 | 0.8881
MLPClassifier  | micro F1 | 0.5376     | 0.8003 | 0.8201 | 0.6833 | 0.7424 | 0.6985 | 0.8908 | 0.8884
MLPClassifier  | MCC      | 0.1077     | 0.6518 | 0.6832 | 0.4819 | 0.5592 | 0.3967 | 0.7933 | 0.7782
LinearSVC      | macro F1 | 0.8580     | 0.9270 | 0.9296 | 0.6588 | 0.9216 | 0.3822 | 0.9290 | 0.9694
LinearSVC      | micro F1 | 0.8619     | 0.9270 | 0.9296 | 0.6866 | 0.9216 | 0.5126 | 0.9296 | 0.9695
LinearSVC      | MCC      | 0.7511     | 0.8560 | 0.8610 | 0.4871 | 0.8445 | 0.1606 | 0.8666 | 0.9391
AdaBoost       | macro F1 | 0.3825     | 0.4620 | 0.4739 | 0.5299 | 0.5295 | 0.4795 | 0.4995 | 0.5976
AdaBoost       | micro F1 | 0.5255     | 0.5418 | 0.5352 | 0.5963 | 0.6082 | 0.5445 | 0.5352 | 0.6082
AdaBoost       | MCC      | 0.0689     | 0.0940 | 0.0690 | 0.3423 | 0.3174 | 0.0954 | 0.0653 | 0.2181

2.4.2 Effectiveness Evaluation

To evaluate the effectiveness of our GL-GAN framework in alleviating the binary class imbalance problem, we compare the quality of the synthetic samples generated by GL-GAN with several representative and state-of-the-art oversampling methods, including: 1) Imbalanced, which
directly uses the original imbalanced data sets without adding minority samples; 2) Random [30], i.e., random oversampling, which inflates the minority class by duplicating existing minority samples; 3) SMOTE [14], which generates minority samples by performing linear interpolation between minority samples and their nearest neighbors; 4) MDO [2], which produces minority samples that have the same Mahalanobis distance from the considered class mean as existing minority samples; 5) NRAS [81], which first performs a noise removal process on the minority class and then constructs synthetic samples from the remaining samples; 6) SWIM [86], which utilizes the distribution information of the majority class to generate minority samples located at the same Mahalanobis distance from the majority class; and 7) BAGAN [72], which takes random noise as input and produces synthetic samples to balance the imbalanced data set. We adopt the implementations of the Random and SMOTE methods provided by [60] and of the MDO and NRAS methods provided by [54] in all experiments with default settings. BAGAN is developed upon its public source code4.

4 https://github.com/IBM/BAGAN

For each imbalanced data set, we apply the baselines and our model to generate synthetic minority data samples and then form different augmented data sets for training the classifiers. Table 2.2 to Table 2.6 list the classification performance of the three classifiers on the five test data sets. We run each experiment ten times and report average results.

Table 2.5 Classification performance of classifiers on the Madelon data set.

Classifier     | Metric   | Imbalanced | Random  | SMOTE   | MDO     | NRAS    | SWIM    | BAGAN  | GL-GAN
MLPClassifier  | macro F1 | 0.3376     | 0.4088  | 0.4440  | 0.4513  | 0.3346  | 0.3333  | 0.4821 | 0.3364
MLPClassifier  | micro F1 | 0.5019     | 0.5128  | 0.5213  | 0.4931  | 0.5006  | 0.5000  | 0.5260 | 0.5008
MLPClassifier  | MCC      | 0.0439     | 0.0465  | 0.0628  | -0.0165 | 0.0132  | 0.0     | 0.0640 | 0.0120
LinearSVC      | macro F1 | 0.3376     | 0.3894  | 0.4306  | 0.4643  | 0.3333  | 0.3324  | 0.4390 | 0.3367
LinearSVC      | micro F1 | 0.5019     | 0.5058  | 0.5092  | 0.4942  | 0.5000  | 0.4981  | 0.5212 | 0.5000
LinearSVC      | MCC      | 0.0439     | 0.0237  | 0.0276  | -0.0131 | 0.0     | -0.0439 | 0.0657 | 0.0
AdaBoost       | macro F1 | 0.4860     | 0.3325  | 0.4504  | 0.3529  | 0.3325  | 0.3354  | 0.3612 | 0.3400
AdaBoost       | micro F1 | 0.4981     | 0.4942  | 0.4962  | 0.4885  | 0.4981  | 0.5000  | 0.5019 | 0.5000
AdaBoost       | MCC      | -0.0439    | -0.0324 | -0.0233 | -0.0094 | -0.0439 | 0.0034  | 0.0092 | 0.0

Table 2.6 Classification performance of classifiers on the Gisette data set. ("-" indicates no result.)

Classifier     | Metric   | Imbalanced | Random | SMOTE  | MDO     | NRAS   | SWIM | BAGAN  | GL-GAN
MLPClassifier  | macro F1 | 0.4618     | 0.6051 | 0.6161 | 0.5888  | 0.6317 | -    | 0.4462 | 0.8552
MLPClassifier  | micro F1 | 0.5685     | 0.6528 | 0.6604 | 0.6426  | 0.6710 | -    | 0.5558 | 0.8568
MLPClassifier  | MCC      | 0.2423     | 0.4246 | 0.4371 | 0.4066  | 0.4486 | -    | 0.2418 | 0.7277
LinearSVC      | macro F1 | 0.6053     | 0.6115 | 0.6115 | 0.6718  | 0.8616 | -    | 0.3521 | 0.8636
LinearSVC      | micro F1 | 0.6529     | 0.6571 | 0.6571 | 0.7011  | 0.8636 | -    | 0.5086 | 0.8650
LinearSVC      | MCC      | 0.4248     | 0.4318 | 0.4318 | 0.7482  | 0.5018 | -    | 0.0930 | 0.7457
AdaBoost       | macro F1 | 0.5226     | 0.5874 | 0.5669 | 0.5718  | 0.3361 | -    | 0.5949 | 0.6271
AdaBoost       | micro F1 | 0.5993     | 0.6400 | 0.6271 | 0.4850  | 0.6300 | -    | 0.6199 | 0.6679
AdaBoost       | MCC      | 0.3320     | 0.4001 | 0.3817 | -0.0935 | 0.3848 | -    | 0.2770 | 0.4476

From these tables, we make the following observations: (i) compared with the imbalanced set, the classification performance generally increases with the oversampling techniques, which shows the importance of oversampling; (ii) in most cases, the classification performance of classifiers with GL-GAN outperforms that with the baselines, which implies the high quality of the synthetic minority samples generated by GL-GAN.
This is because local-based oversampling methods like SMOTE may produce some synthetic minority samples that interleave with existing majority samples or lie in the null space of the given data set, while BAGAN explores only the global distribution information, so its generated synthetic samples may overlook the local structure of the given minority samples.

2.4.3 Components Analysis

To investigate the impact of each module in our GL-GAN framework, we implement two models that employ part of the components contained in GL-GAN to generate synthetic minority samples and compare the quality of the generated samples with GL-GAN. First, we combine the autoencoder component and the local data interpolator component to obtain a new model called Auto-only. In Auto-only, the encoder 𝐸 maps all given data samples into a latent space and the new synthetic minority samples are generated by the local data interpolator 𝐼. This procedure is the same as the first module of GL-GAN. However, instead of feeding all latent representations into the generator 𝐺, the Auto-only model employs the decoder 𝑄 to project all latent representations back to the raw feature space. Second, to study the functionality of the GAN-based module, we adopt a conditional GAN [74] to generate synthetic minority samples. The conditional GAN takes random noise as input and produces synthetic samples with the minority class label.

We conduct experiments on two real data sets and display the experimental results in Figure 2.3 and Figure 2.4, respectively.

Figure 2.3 MCC score of AdaBoost on the Gisette data set, for imbalance ratios (a) 𝑟 = 0.01, (b) 𝑟 = 0.05, (c) 𝑟 = 0.1, (d) 𝑟 = 0.2, and (e) 𝑟 = 0.4.

Figure 2.4 MCC score of AdaBoost on the USPS data set, for imbalance ratios (a) 𝑟 = 0.01, (b) 𝑟 = 0.05, (c) 𝑟 = 0.1, (d) 𝑟 = 0.2, and (e) 𝑟 = 0.4.

Here we can see that, although the autoencoder (Auto-only) and the conditional GAN (cGAN) can also produce synthetic minority samples for training a classifier, the quality of the generated synthetic samples is not good enough, especially in the extremely imbalanced scenario (𝑟 = 0.01). However, since our GL-GAN simultaneously explores the global and local information by combining the advantages of local-based oversampling techniques and generative adversarial learning, the synthetic samples produced by GL-GAN are more helpful for training a better classifier.

2.5 Case Study

To verify whether GL-GAN can produce more realistic synthetic minority samples, we visualize the synthetic samples generated on the handwritten digits data set MNIST5. Here we randomly choose images of "4" as the majority class and images of "7" as the minority class, and form the imbalanced data set as described in Sec. 2.4.1.1.

2.5.1 Functionality of Autoencoder

As mentioned before, in order to avoid minority samples being generated in the null space of the given sample set or interleaved with majority samples, we require that the encoder in our GL-GAN framework be able to map the given sample set $\mathcal{X}_{org}$ into two far-away clusters in the latent space, which is achieved by Eq. (2.2). Here we utilize the MNIST data set to verify the usefulness of this design.
5 http://yann.lecun.com/exdb/mnist/

Figure 2.5 Images generated by different autoencoders: (a) traditional AE; (b) AE used in GL-GAN.

In Figure 2.5, the right figure shows a snippet of images generated by the Auto-only method, in which the setting of the encoder is exactly the same as the encoder 𝐸 in our GL-GAN. As a comparison, we remove the loss function defined in Eq. (2.2) from the final loss function of the autoencoder, i.e., Eq. (2.3), and again apply SMOTE to generate synthetic minority samples in the latent space. In other words, the majority samples and the minority samples are not required to be mapped far away from each other in the latent space learned by the encoder, which is the common setting of a traditional autoencoder (AE). As shown in the left figure, under this setting the quality of the generated synthetic minority samples "7" is worse than that of the samples generated by the Auto-only method. The reason is that, in the latent space, the synthetic minority samples generated by SMOTE may still interleave with majority samples if the majority cluster and the minority cluster are not far away from each other; hence, the generated synthetic images may not be good enough. In short, these two figures demonstrate that the loss function described by Eq. (2.2) is useful and indispensable.

2.5.2 Quality of Generated Image Data

We also visualize the synthetic minority samples generated by SMOTE and by our proposed GL-GAN framework on the MNIST data set.

Figure 2.6 Images generated by (a) SMOTE and (b) GL-GAN.

In the left figure of Figure 2.6, several synthetic samples produced by SMOTE look like intermediates between the majority class "4" and the minority class "7". As discussed before, because only local neighbor relationships are utilized in SMOTE and the global information is ignored entirely, SMOTE cannot avoid producing outliers or samples that interleave with majority samples. On the contrary, our GL-GAN framework is able to generate more realistic synthetic minority samples. Since both global and local distributions are simultaneously explored in GL-GAN, the drawbacks of SMOTE can be overcome and high-quality synthetic minority samples can be generated.

2.6 Chapter Conclusion

In this chapter, we propose a novel framework that solves the class imbalance problem by generating synthetic data samples for the minority class. Different from local-based oversampling methods, which explore only the local structure of minority samples, and generative adversarial learning models, which utilize only the global distribution information of all given samples, our GL-GAN framework considers both the global and the local information of the given data in the synthetic minority sample generation process.
Extensive experimental results demonstrate that, compared with existing baselines, our model can produce more realistic and discriminative synthetic minority samples, which are helpful for training better classifiers. In the future, we will extend our GL-GAN framework to the multi-class imbalance problem as well as to specific imbalanced application scenarios such as credit fraud detection.

CHAPTER 3
LEARNING FROM IMBALANCED CROWDSOURCED LABELED DATA

Crowdsourcing has proven to be a cost-effective way to meet the demand for labeled training data in supervised deep learning models. However, crowdsourced labels are often inconsistent and noisy due to cognitive and expertise differences among crowd workers. Existing approaches either infer latent true labels from noisy crowdsourced labels or learn a discriminative model directly from the crowdsourced labeled data, assuming the latent true label distribution is class-balanced. Unfortunately, in many real-world applications, the true label distribution is typically imbalanced across classes. Therefore, in this chapter, we address the problem of learning from crowdsourced labeled data with an imbalanced true label distribution. We propose a new framework, named "Learning from Imbalanced Crowdsourced Labeled Data" (ICED), which simultaneously infers true labels from imbalanced crowdsourced labeled data and achieves high accuracy on downstream tasks such as classification. The ICED framework consists of two modules, a true label inference module and a synthetic data generation module, which augment each other iteratively. Extensive experiments conducted on both synthetic and real-world data sets demonstrate the effectiveness of the ICED framework.

3.1 Chapter Introduction

The success of supervised deep learning models in many real-world applications, such as image classification [57, 89, 42] and speech recognition [33, 1, 38], is inseparable from the availability of large-scale labeled training data. However, obtaining a large amount of labeled data is often challenging. Annotating certain types of data samples, such as medical images, requires specific domain knowledge [93], while other types of data, such as videos or audio, are expensive in terms of time [101]. By inviting multiple crowd workers to annotate labels for data samples simultaneously or sequentially, modern crowdsourcing platforms such as Amazon Mechanical Turk1 offer a cost-effective way to collect large-scale labeled data [82].

1 https://www.mturk.com

Although crowdsourcing alleviates the label shortage problem to some extent, the annotated labels can be very inconsistent
Hence, those approaches perform poorly when training on imbalanced datasets. There have been many attempts to address the challenges brought by imbalanced datasets, such as re-sampling approaches [67, 15, 35] and re-weighting approaches [21, 10]. These approaches require determinate noise-free training labels and are not able to handle data with crowdsourced labels. Therefore, there is a need for a new approach to address both imbalanced and noisy data in the crowdsourcing settings. To address this need, we study the problem of learning from imbalanced crowdsourced labeled data in this chapter. To the best of our knowledge, this is the first work to learn an effective discriminative model on crowdsourced labels when the latent true label distribution is imbalanced. Our goal is simultaneously obtaining accurate supervised information by inferring true labels from crowdsourced labels and ensuring good prediction performance of the classifier on all classes in the balanced test set. In this chapter, we propose a novel framework ICED (Learning from Imbalanced Crowdsourced labEled Data). The ICED framework consists of two modules. One module uses generated synthetic data for minority classes to improve the true label inference process. Another module uses the inferred true labels to improve the quality of generated synthetic data. These two modules augment each other and improve themselves iteratively. After training, ICED can learn a classifier with good prediction performance on all classes uniformly distributed in the test set. The main contributions 23 of this chapter are summarized below: • We are the first one to address the problem of learning from imbalanced crowdsourced labeled data, a more realistic scenario in the real world. • We present a novel framework ICED, which can simultaneously infer true labels from imbal- anced crowdsourced labeled data and achieve good prediction performance on all classes. • We conduct extensive experiments on both synthetic and real datasets to demonstrate the effectiveness of ICED on the classification task. 3.2 The Proposed Framework In this section, we first formulate the problem we studied in this chapter and then introduce our proposed ICED framework in detail. 3.2.1 Problem Formulation Definition 3.2.1 (Learning from Imbalanced Crowdsourced Labeled Data). Given a set of data X = {x1, x2, . . . , x𝑛}, 𝑊 crowd workers are invited to annotate every sample in X to produce a crowd label set Y = {y1, y2, . . . , y𝑛} and y𝑖 = {(𝑦𝑖,1, 𝑤𝑖,1), (𝑦𝑖,2, 𝑤𝑖,2) . . . , (𝑦𝑖,𝑊 , 𝑤𝑖,𝑊 )}, where each annotation pair (𝑦𝑖,𝑢, 𝑤𝑖,𝑢) represents label 𝑦𝑖,𝑢 provided by worker 𝑤𝑢 for sample x𝑖 from 𝐶 classes, our goal is to obtain a deep neural network based classifier F , which can achieve good prediction performance on uniformly distributed 𝐶-classes test data based on the data set X and corresponding crowdsourced label set Y. Note that, for each data sample x𝑖 ∈ X, we assume it has 𝑊 annotated labels. Moreover, as the true labels for sample data set X is unknown, we denote the estimated true labels inferred by our proposed framework for X as T = {𝑡1, 𝑡2, . . . , 𝑡𝑛}. In this chapter, our focus is on the classification task for binary classes, i.e, there are two classes in the data set X and, one is the majority class and the other is the minority class. Note that our proposed ICED framework is also suitable for multi-class classification tasks with slight modifications and we will leave it as one future work. 24 Figure 3.1 An overview of ICED. 
The solid yellow arrow and red arrow indicate the outputs of the synthetic data generation module and the true label inference module in the current training iteration, respectively. The red dashed arrow and black dashed arrow represent the inferred labels obtained and the synthetic data samples generated in the previous iteration, respectively.

3.2.2 Framework Overview

To tackle the problem of learning from imbalanced crowdsourced labeled data, we propose a novel framework ICED as shown in Figure 3.1. The main structure of ICED is a deep neural network based classifier F consisting of a feature extractor G and a fully connected layer (FC). When training the classifier F, ICED introduces two modules: a true label inference module and a synthetic data generation module. The true label inference module estimates determinate true labels from the given crowdsourced labeled data, and the synthetic data generation module generates synthetic data samples for the minority class using the estimated true labels. These two modules augment each other and improve themselves iteratively. Furthermore, to give our ICED framework a better initial learning ability at the beginning of the training phase, ICED also includes a warm-up training strategy specifically designed for crowdsourced labeled data. Next, we introduce the details of each component.

3.2.3 True Label Inference

Many classical approaches that infer true labels from crowdsourced labels ignore the correlation between data samples and the cognitive differences between individual crowd workers. For example, some workers tend to mistakenly judge class c_α as class c_β due to their cognitive differences. Therefore, to overcome the aforementioned shortcomings, our ICED framework adopts an EM approach [22] in the true label inference module to estimate determinate labels from the given crowdsourced labeled data.

To capture the annotation behaviors of crowd workers, we define Ψ_{w_u}(c_α, c_β) as the probability that worker w_u will annotate data samples with true label c_α as class c_β. Therefore, \(\sum_{c_\beta \neq c_\alpha} \Psi_{w_u}(c_\alpha, c_\beta)\) represents the annotation error rate of worker w_u when the true label of samples is c_α. Let T be the random variable representing the true label of the sample set X (similarly T_i for sample x_i) and Φ_{c_α} = p(T = c_α) = p(T_i = c_α) be the prior of class c_α, in the absence of any observations. Our task is to estimate the probability of each label c_α (c_α ∈ [C]) being the latent true label for each sample x_i based on the crowdsourced labels Y, i.e., p(T_i = c_α | Y). The label with maximal probability is then chosen as the current estimated true label to train the deep neural network based classifier F. The steps in the EM algorithm to estimate true labels are:

• E-step: compute the likelihood function of the observed crowdsourced labels Y based on the current estimated true labels T and parameters Ψ = {Ψ_{w_u}(c_α, c_β) | w_u ∈ [W], c_α, c_β ∈ [C]} and Φ = {Φ_{c_α} | c_α ∈ [C]};

• M-step: update the parameters by maximizing the likelihood function and refine the estimated true labels with the new parameters.

In detail, we assume the labels provided by crowd workers are independently distributed. Given the current estimated true labels T and parameters Ψ and Φ, the likelihood of the observed crowdsourced labels Y can be obtained by:

\[ Q(Y \mid \Psi, \Phi, T) \propto \prod_{i \in [n]} p(T_i = t_i) \prod_{u \in [W]} \Psi_{w_u}(t_i, y_{i,u}), \tag{3.1} \]

where i and u are the indices of data samples and crowd workers, respectively; [n] and [W] denote the sets {1, 2, . . . , n} and {1, 2, . . . , W}, respectively.
The parameters in Ψ and Φ are updated by maximizing the above likelihood function. Specifically, Ψ can be computed as

\[ \Psi_{w_u}(c_\alpha, c_\beta) = \frac{d(w_u, c_\alpha, c_\beta)}{d(w_u, c_\alpha)}, \]

where d(w_u, c_α, c_β) represents the number of samples labeled as c_β by worker w_u when their current estimated true label is c_α, and d(w_u, c_α) represents the number of samples labeled by worker w_u when their current estimated true label is c_α. In addition, Φ can be computed as

\[ \Phi_{c_\alpha} = \frac{\#\,\text{samples whose true label is estimated as } c_\alpha}{\#\,\text{samples in data set } X}. \]

Based on these updates, we can refine the estimation of true labels by Bayes' theorem:

\[ p(T_i = c_\alpha \mid Y, \Psi, \Phi) \propto p(Y \mid \Psi, \Phi, T_i = c_\alpha)\, p(T_i = c_\alpha) \propto p(T_i = c_\alpha) \prod_{u \in [W]} \Psi_{w_u}(c_\alpha, y_{i,u}), \tag{3.2} \]

and choose the label c_α with the highest probability as the current estimated true label for data sample x_i. We repeat the E-step and M-step iteratively until convergence.

In summary, the true label inference module provides two important pieces of information for our ICED framework: 1) an estimation of the latent true labels T, which can be used as supervised label information to train the deep neural network based classifier F; 2) the marginal distribution p(T = c_α), which reveals the data imbalance level between classes and thus guides the synthetic data generation module to augment a balanced synthetic data set. Moreover, we can also obtain a by-product from the EM algorithm, i.e., the annotation error rate of each worker w_u derived from Ψ_{w_u}(·, ·), which can potentially be used to eliminate or penalize unqualified workers whose error rate is relatively high, depending on the application scenario.

3.2.4 Synthetic Data Generation

The performance of the EM approach adopted in the true label inference module depends on the choice of the prior probability, e.g., Φ_{c_α}, for initialization. Conventionally, a uniform prior is used for initialization, resulting in poor performance on imbalanced crowdsourced labeled data sets. Motivated by over-sampling approaches as an effective solution for imbalanced data sets, we integrate a synthetic data generation module in our ICED framework to balance the training data set. As shown in Figure 3.1, we first apply the true label inference module to obtain the estimated true labels T. We then use the feature extractor G in the deep neural network based classifier F to map all data samples in X from the raw data space into a latent embedding space. Finally, we apply the following synthetic data generation process in the embedding space.

Let z be the embedding of the data sample x in the learned latent space. Based on the information in the estimated true labels T, all embeddings z_i whose corresponding determinate label t_i belongs to the minority class are selected as candidate embeddings to help generate synthetic minority samples. After that, we utilize the linear interpolation operations adopted in the SMOTE [15] approach to create synthetic minority sample embeddings. Specifically, for any candidate embedding z_i, we (i) discover its k nearest neighbors {z_i^1, z_i^2, . . . , z_i^k}; and (ii) randomly pick one nearest neighbor z_i^r from the set {z_i^1, z_i^2, . . . , z_i^k} to create a synthetic minority sample embedding z'_i as follows:

\[ z'_i = z_i + \delta \left( z_i^r - z_i \right), \tag{3.3} \]

where δ is a scalar in the range [0, 1]. Step (ii) can be repeated R times, so that R × m synthetic minority sample embeddings are generated when the same process is executed on all m selected candidate embeddings.
Since the true label inference module cannot guarantee 100% accuracy in estimating latent true labels from crowdsourced labels, the estimated determinate labels still have a chance to be opposite to the latent true labels. Hence, to reduce the adverse effects of possible wrong inference, different from the SMOTE approach, which chooses δ randomly, we assign the value of δ based on the label certainty scores of a sample x_i and its selected neighbor x_i^r.

Definition 3.2.2 (Label Certainty Score). Given a data example x_i, we assume that its crowdsourced labels {y_{i,u} | u ∈ [W]} follow a multinomial distribution. The label certainty score S(x_i) is defined as the inverse variance of this distribution and is computed as:

\[ S(\mathbf{x}_i) = \frac{1}{\mathbb{E}_u \| y_i - \mathbb{E}_u(y_i) \|^2 + \epsilon}, \tag{3.4} \]

where E_u is the expectation over the crowdsourced labels {y_{i,u} | u ∈ [W]} for sample x_i, and ε is a small constant to avoid numerical issues.

The label certainty score measures the degree of agreement among crowd workers. It reaches its minimum value when a tie happens and reaches its maximum value when all annotated labels for a data sample are consistent. Therefore, given sample x_i and its neighbor x_i^r, δ can be calculated by:

\[ \delta = S(\mathbf{x}_i^r) / \left( S(\mathbf{x}_i) + S(\mathbf{x}_i^r) \right) + \eta, \tag{3.5} \]

where η is sampled from a uniform distribution to add some randomness to the scalar δ. With the help of Eq. (3.5), the generated synthetic embedding z'_i will be closer to the candidate embedding with the larger label certainty score, so that the probability of the generated z'_i being located in the minority embedding clusters is increased and, finally, the imbalance issue can be alleviated via the aforementioned generation process.

After the synthetic minority sample embeddings are generated, we introduce the k-NN approach in the latent embedding space to construct synthetic crowdsourced labels for the generated sample embeddings. More specifically, for any generated minority embedding z'_i, we collect the crowdsourced labels of its k nearest neighbor embeddings of real data samples and then determine its synthetic crowdsourced labels by simulating the annotation behavior of each crowd worker in these collected k crowdsourced labels.

When we obtain the synthetic minority sample embeddings and the corresponding synthetic crowdsourced labels, as shown in Figure 3.1, we use a pre-trained decoder Q to map the synthetic embeddings back to the raw data space and use the augmented balanced training set to update the parameters of the deep neural network based classifier F.
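As a concrete illustration of the generation process in Eqs. (3.3)-(3.5), the following is a minimal NumPy sketch of the certainty-weighted interpolation. The use of Euclidean nearest neighbors, the scale of the uniform random term η (`eta_scale`), and the clipping of δ to [0, 1] are illustrative assumptions rather than the exact implementation used in ICED.

```python
import numpy as np

def label_certainty_score(crowd_labels, eps=1e-6):
    """Eq. (3.4): inverse variance of one sample's crowdsourced labels.

    crowd_labels: (W,) array with the label each of the W workers assigned."""
    y = np.asarray(crowd_labels, dtype=float)
    return 1.0 / (np.mean((y - y.mean()) ** 2) + eps)

def generate_minority_embeddings(Z_min, Y_crowd_min, k=5, R=4, eta_scale=0.1, seed=0):
    """Certainty-weighted SMOTE in the embedding space (Eqs. 3.3 and 3.5).

    Z_min: (m, d) candidate minority embeddings.
    Y_crowd_min: (m, W) crowdsourced labels of the candidate samples."""
    rng = np.random.default_rng(seed)
    scores = np.array([label_certainty_score(y) for y in Y_crowd_min])
    synthetic = []
    for i, z in enumerate(Z_min):
        # (i) k nearest neighbors of z among minority embeddings, excluding itself.
        dists = np.linalg.norm(Z_min - z, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        for _ in range(R):
            # (ii) pick one neighbor and interpolate.
            r = rng.choice(neighbors)
            # Eq. (3.5): delta pulls z' toward the more certain endpoint.
            delta = scores[r] / (scores[i] + scores[r]) + eta_scale * rng.uniform()
            delta = np.clip(delta, 0.0, 1.0)
            synthetic.append(z + delta * (Z_min[r] - z))  # Eq. (3.3)
    return np.asarray(synthetic)  # (R * m, d) synthetic minority embeddings
```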
In summary, the synthetic data generation module in our ICED framework addresses the issues caused by the imbalanced training data set by generating sufficient synthetic minority samples with synthetic crowdsourced labels, which benefits both the true label inference process and the training of the deep neural network.

3.2.5 Warm-up Training

Recent studies have discovered that deep neural networks can learn even from noisy labeled data [76, 61]. Hence, a warm-up training phase is an effective strategy to initialize supervised deep learning models. Existing literature [59, 36] uses all available data in the warm-up training phase. Different from the existing literature, in our ICED framework, we design a new warm-up training strategy specifically for crowdsourced labeled data.

Algorithm 3.1 The algorithm of warm-up training.
Input: sample set X, crowdsourced label set Y
1: Calculate the label certainty score S(x_i) for each sample x_i ∈ X.
2: Obtain the estimated true label t_i for each y_i ∈ Y using MV.
3: Randomly initialize the parameters of classifier F.
4: Divide the sample set X and the estimated true label set T into four different groups based on their label certainty scores.
5: for warm-up epochs do
6:   Train the classifier F using the data samples and determinate labels in the third-highest certainty group.
7: end for
8: for warm-up epochs do
9:   Train the classifier F using the data samples and determinate labels in the second-highest certainty group.
10: end for
11: for warm-up epochs do
12:   Train the classifier F using the data samples and determinate labels in the highest certainty group.
13: end for

As shown in Algorithm 3.1, given the data set X and the crowdsourced label set Y, we first calculate the label certainty score for each data sample x_i. After gathering the label certainty scores for all data samples, we apply majority voting (MV) on the crowdsourced label set Y to obtain an estimated true label set T. Each element t_i in T is obtained by aggregating the corresponding crowdsourced labels y_i using MV. We then divide the sample set X and the corresponding true label set T into four different subgroups based on the label certainty scores: the low certainty group, the third-highest certainty group, the second-highest certainty group, and the highest certainty group. We use all data samples except those in the low certainty group, together with their associated determinate labels, to initially train the deep neural network based classifier F in a supervised way. In general, the determinate label t_i obtained by MV has a higher probability of being the same as the latent true label when sample x_i has a higher label certainty score S(x_i). Our new warm-up training strategy is thus similar to first using noisy labeled data to provide an initial ability for the deep neural network based classifier F and then using clean labeled data to fine-tune F. After the warm-up training phase, the ICED framework obtains a better initial prediction ability.

3.2.6 Algorithm

We present our complete ICED framework for learning from imbalanced crowdsourced labeled data in Algorithm 3.2.

Algorithm 3.2 The algorithm of ICED.
Input: sample set X, crowdsourced label set Y
1: Conduct warm-up training as described in Algorithm 3.1.
2: repeat
3:   Generate synthetic minority samples and corresponding synthetic crowdsourced labels as described in Sec. 3.2.4.
4:   Obtain the inferred true label set T′ for the augmented crowdsourced labels as described in Sec. 3.2.3.
5:   Train the classifier F using the augmented data samples and the inferred determinate labels T′.
6: until the model converges or the maximum training epoch is reached

As shown in Algorithm 3.2, we first apply our designed warm-up training strategy so that the deep neural network based classifier F obtains a better initial ability. Then, in each training epoch, we apply the synthetic data generation module to produce synthetic minority samples with synthetic crowdsourced labels to balance the training data set. After that, the true label inference module is used to infer the latent true labels for the augmented crowdsourced labels. The parameters of the classifier F are then updated based on the augmented balanced data samples and the corresponding inferred determinate labels in a supervised way. We continuously conduct this iterative process until F converges or the maximum training epoch is reached.
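To make the inference step of Algorithm 3.2 concrete, below is a minimal NumPy sketch of the EM procedure from Section 3.2.3 (the D&S-style estimator [22]). The dense (n, W) label matrix, the majority-voting initialization, and the fixed iteration cap are illustrative assumptions, not the exact implementation used in ICED.

```python
import numpy as np

def em_true_label_inference(labels, C=2, n_iters=50, tol=1e-6):
    """EM estimation of true labels from crowdsourced labels.

    labels: (n, W) integer array; labels[i, u] is the class worker u gave sample i.
    Returns the posterior p(T_i = c | Y) with shape (n, C)."""
    n, W = labels.shape
    # Initialize posteriors with (soft) majority voting.
    post = np.zeros((n, C))
    for c in range(C):
        post[:, c] = (labels == c).mean(axis=1)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # M-step: class priors Phi and per-worker confusion matrices Psi,
        # where Psi[u, a, b] = p(worker u annotates b | true label a).
        Phi = post.mean(axis=0)
        Psi = np.zeros((W, C, C))
        for u in range(W):
            for b in range(C):
                Psi[u, :, b] = post[labels[:, u] == b].sum(axis=0)
            Psi[u] /= Psi[u].sum(axis=1, keepdims=True) + 1e-12
        # E-step: refine the posteriors via Bayes' theorem (Eq. 3.2).
        log_post = np.log(Phi + 1e-12)[None, :].repeat(n, axis=0)
        for u in range(W):
            log_post += np.log(Psi[u][:, labels[:, u]].T + 1e-12)
        new_post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        new_post /= new_post.sum(axis=1, keepdims=True)
        if np.abs(new_post - post).max() < tol:
            post = new_post
            break
        post = new_post
    return post  # argmax over axis 1 gives the estimated true label set T
```

The argmax of the returned posterior gives the estimated true label set T, and each worker's off-diagonal confusion entries give the per-worker annotation error rates mentioned in Section 3.2.3.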
3.3 Experiment

In this section, we conduct experiments to verify the effectiveness of our proposed ICED framework by answering the following three questions:

1. Can the proposed framework obtain good prediction performance on the balanced test data?

2. Does the generated synthetic data improve the accuracy of the true label inference process?

3. Does our newly designed warm-up training strategy improve over existing warm-up training strategies?

To answer the first question, we compare the performance of ICED with several state-of-the-art crowdsourced label processing approaches on the classification task. For the second question, we compare the accuracy of true label inference with and without the synthetic data generation module on two synthetic datasets. Finally, we compare the prediction performance of the deep neural network based classifier F trained with our designed warm-up training strategy and with traditional warm-up training strategies to answer the third question.

3.3.1 Data Sets

3.3.1.1 Synthetic Data Sets

We conduct experiments on three synthetic data sets and one real-world data set. Table 3.1 summarizes the key statistics of these four data sets. The three synthetic imbalanced crowdsourced labeled data sets are constructed based on three widely used data sets: Gisette, USPS, and Gas Sensor Array Drift (GSAD). Specifically, the Gisette and USPS datasets are from the Feature Selection data repository (http://featureselection.asu.edu/datasets.php) and the GSAD data set is from the UCI data repository (https://archive.ics.uci.edu/ml/index.php).

Table 3.1 Statistics of data sets. The entries in "# majority class" and "# minority class" represent the number of samples we used for those classes, respectively, to construct a synthetic training data set.

Statistic item     Gisette-Syn  USPS-Syn  GSAD-Syn  Emotion
# features         5,000        256       128       1,582
# training data    3,080        734       2,575     3,027
# majority class   2,800        668       2,341     -
# minority class   280          66        234       -
# crowd worker     7            9         11        5
# test data        1,400        332       1,170     900

Next, using the USPS data set as an illustrative example, we describe how we construct a synthetic imbalanced crowdsourced labeled data set, USPS-Syn. First, we randomly choose one class as the majority class and another as the minority class from the ten classes contained in the USPS data set to obtain a balanced binary data set. Then we split 80% of the data samples in this balanced binary data set into a candidate training set and use the rest as the test set. Note that the test set is class-balanced. Different from previous crowdsourced label processing approaches that randomly assign a mislabeling probability for each worker over all data samples [3, 50], in this chapter, we present a new way to synthesize crowdsourced labels that considers the difficulties of data samples. Intuitively, it should be easier to infer true labels for data samples with a higher label certainty score (e.g., when all W crowd workers annotate the same label for the same data sample). Motivated by this intuition, we introduce a deep neural network to evaluate the identification difficulty of each sample. Specifically, we train a deep neural network on the candidate training set and stop the training process when the training accuracy is higher than 98%. Then, for each class, we use the softmax outputs of the trained deep neural network to indicate the identification difficulties of data samples. Then, we assign a mislabeling probability to each data sample based on the relative ease of inferring its label.
The higher the difficulty of inferring the label of a data sample, the higher the mislabeling probability for all crowd workers. Finally, we remove 90% of the minority samples from the candidate training set based on their ground truth labels to obtain the imbalanced crowdsourced labeled data set USPS-Syn. The other two synthetic data sets, Gisette-Syn and GSAD-Syn, are constructed in a similar way. Once again, we emphasize that only the training set in these three synthetic data sets is an imbalanced crowdsourced labeled data set, while the test set is a class-balanced data set with determinate labels.

3.3.1.2 Real Data Set

We collected a real-world imbalanced crowdsourced labeled data set, Emotion, from our educational practice. The data samples in the Emotion dataset are 1-minute audio tracks collected from multiple teachers who teach courses such as Mathematics and English in primary school. We split all audio tracks in Emotion into a training set and a test set with sample sizes of 3,027 and 900, respectively. Five teaching professionals were invited to annotate every audio track in the training set as either high emotion arousal or low emotion arousal to assess the teaching effect of the courses, and the annotation results provided by one teaching expert for the audio tracks in the test set are adopted as the ground truth labels. As the original data samples in the Emotion data set are audio tracks, neither our ICED framework nor the baseline methods can process them directly. To address this issue, we apply OpenSmile (https://www.audeering.com/opensmile/) to extract 1,582 acoustic features, such as signal energy, loudness, MFCC features, etc., from the collected audio tracks. Since the Emotion dataset is collected from our educational practice, there is no guarantee that the latent true label distribution in the training set is class-balanced. Indeed, based on the label inference results produced by majority voting, among the total of 3,027 samples in the training set, 1,911 samples are in one class and 1,116 samples are in the other. For experimental purposes, we maintained the same number of data samples in each class of the test set.

3.3.2 Performance Comparison

3.3.2.1 Baseline Methods

To evaluate the effectiveness of our proposed ICED framework on the problem of learning from imbalanced crowdsourced labeled data, we compare the performance of ICED with several representative state-of-the-art crowdsourced label processing approaches on the classification task, including:

• Majority Voting (MV), which infers determinate labels based on the majority of annotated labels.

• D&S [22], which infers determinate labels by estimating the error rate of each crowd worker.

• Crowd-Layer [82], which is an end-to-end deep neural network containing a novel crowd layer to learn from crowdsourced labeled data directly.

• MBEM [51], which learns from crowdsourced labeled data by jointly modeling latent true labels and crowd worker qualifications.

• CPC [48], which improves the performance of the classifier by jointly learning the parameters of the classifier and clusters of crowd workers.

As MV and D&S can only infer determinate labels instead of learning a classifier from crowdsourced labels, we introduce two classifiers, logistic regression (LR) and deep neural networks (DNN). Specifically, we train LR and DNN on the same datasets with determinate labels inferred by MV and D&S individually and use them as baseline methods.
We denote these baseline methods as MV+LR, MV+DNN, D&S+LR and D&S+DNN.

Table 3.2 shows the classification performance of our ICED framework compared against the seven baseline methods on three synthetic data sets and one real data set.

Table 3.2 Classification performance of our ICED framework and baseline methods on four data sets.

             Gisette-Syn        USPS-Syn           GSAD-Syn           Emotion
Methods      Accuracy F1-score  Accuracy F1-score  Accuracy F1-score  Accuracy F1-score
MV+LR        0.8277   0.8175    0.8289   0.8480    0.7094   0.6872    0.8494   0.8179
MV+DNN       0.7901   0.7979    0.7944   0.8732    0.8333   0.8327    0.8735   0.8000
D&S+LR       0.8294   0.7592    0.8311   0.8136    0.6974   0.6716    0.8193   0.7671
D&S+DNN      0.7749   0.7562    0.7811   0.7386    0.8085   0.8082    0.7530   0.7636
Crowd-Layer  0.8273   0.8167    0.8300   0.8998    0.8342   0.8295    0.9006   0.8200
MBEM         0.5324   0.4211    0.6344   0.5334    0.6826   0.5577    0.7813   0.6967
CPC          -        0.8020    -        0.8311    0.6154   0.5917    0.8313   0.8021
ICED         0.8640   0.8512    0.8644   0.9030    0.8872   0.8865    0.9036   0.8521

Based on this table, we have the following observations. First, the classification performance of both LR and DNN, measured in terms of accuracy and F1-score, is higher when using MV instead of D&S to infer determinate labels. D&S, as an EM-based approach, assumes a uniform label distribution, while MV aggregates the annotated labels of each crowdsourced label independently. Hence, given an imbalanced crowdsourced labeled data set, the performance of MV on the true label inference task is not affected by the imbalanced true label distribution, whereas D&S may perform poorly due to its inaccurate uniformity assumption. Second, our ICED framework achieves the best classification performance on all four data sets compared with several representative state-of-the-art crowdsourced label processing approaches. We believe there are three reasons behind this phenomenon. First, even though the D&S approach assumes a uniform label distribution, ICED generates synthetic data to mitigate the imbalance between classes in the training set. The resulting training set approximates a uniform distribution, which enhances the performance of the D&S approach. Second, the more accurate determinate labels inferred by the true label inference module improve the synthetic data generation module, because the synthetic data generation module can use the inferred determinate labels to differentiate minority data samples from majority ones when generating synthetic samples for minority classes. As a result, the data samples produced by the synthetic data generation module have a higher probability of belonging to the minority classes. Third, the generated synthetic data can also help the classifier F in ICED obtain better generalization ability during the model training phase by augmenting the imbalanced training set. In summary, compared with several representative state-of-the-art crowdsourced label processing approaches, our ICED framework is more effective at tackling the problem of learning from imbalanced crowdsourced labeled data.

3.3.3 Ablation Study

As mentioned before, the D&S approach assumes a uniform label distribution as prior knowledge for initialization. Therefore, the true label inference performance of the D&S approach is lower than that of the MV approach on an imbalanced crowdsourced labeled dataset. The ICED framework addresses this issue in the D&S approach by integrating a synthetic data generation module.

Figure 3.2 Accuracy of true label inference using MV and the true label inference module (D&S) in ICED. (a) Gisette-Syn. (b) GSAD-Syn.
The synthetic data generation module balances the imbalanced training set by generating synthetic data samples for the minority classes. The resulting augmented dataset better fits the prior knowledge used in D&S.

To verify whether and how the synthetic data generation module benefits the true label inference module in our ICED framework, we compare the true label inference accuracy of the D&S approach adopted in ICED with that of MV. We show the comparison on two synthetic datasets, Gisette-Syn and GSAD-Syn, because the ground truth labels are available for these two datasets. In our experiments, we record the true label inference accuracy of D&S for three cases: 1) before introducing the synthetic data generation module, 2) after applying the synthetic data generation module once, and 3) after completing the training procedure of ICED. We denote these three cases as D&S-orig, D&S-first, and D&S-final, respectively. In Figure 3.2, we find that the performance of D&S varies widely across these cases. Take the experimental results obtained on the training set of Gisette-Syn as an example. As shown in Figure 3.2a, before introducing the synthetic data generation module, the true label inference accuracy of D&S is below 60%, which is much worse than MV. Surprisingly, after conducting the synthetic data generation process just once, the label inference accuracy of D&S rises above 80%. After finishing the training procedure of ICED, i.e., after repeating the synthetic data generation process multiple times, D&S achieves higher than 90% true label inference accuracy, which is a significant improvement in comparison to a naive application of D&S on the imbalanced crowdsourced labeled dataset. In conclusion, the synthetic data generation module significantly enhances the performance of the true label inference module in ICED.

3.3.4 Effectiveness of Warm-up Training

In this subsection, we test the effectiveness of our designed warm-up training strategy. Given a set of crowdsourced labeled data, the warm-up training strategy adopted in our ICED framework first calculates the label certainty score for each data sample based on its corresponding crowdsourced labels. Then it divides the data samples into different groups based on their label certainty scores. Data samples in the third-highest certainty group are fed to the classifier F in ICED first, with their corresponding determinate labels produced by MV. Data samples in the highest certainty group train F after those in the second-highest group have been used. In the experiments, we denote our designed warm-up training strategy as ICED-w. As a comparison, we implement one common warm-up training strategy used in the literature for learning from noisy labeled data, which uses all available data simultaneously to warm up the model. We denote this warm-up strategy as Trad-I. Another warm-up training strategy, Trad-II, is the same as Trad-I except that it only uses data samples in the highest, second-highest, and third-highest certainty groups rather than all the available data samples.

Table 3.3 Performance of different warm-up strategies.

Dataset      Method   # samples  # epochs  Accuracy  F1-score
Gisette-Syn  Trad-I   3,080      15        0.7989    0.8014
             Trad-II  2,043      15        0.5372    0.6079
             ICED-w   2,043      5 × 3     0.8126    0.8186
USPS-Syn     Trad-I   734        6         0.7333    0.7500
             Trad-II  103        6         0.7828    0.7922
             ICED-w   103        2 × 3     0.8329    0.8373
GSAD-Syn     Trad-I   2,575      6         0.3142    0.4581
             Trad-II  507        6         0.3105    0.4504
             ICED-w   507        2 × 3     0.8332    0.8376
In other words, Trad-II chooses the same data samples adopted in our designed warm-up training strategy ICED-w but feeds them all to F at the same time. For evaluation, we report the classification performance of F trained with the different warm-up training strategies in Table 3.3. We observe that the classifier F trained with ICED-w achieves the best classification performance, compared with Trad-I and Trad-II, on all datasets. Thus, our designed warm-up training strategy initializes ICED more effectively.

3.4 Related Work

3.4.1 Processing Crowdsourced Labels

Inferring true labels from crowdsourced labels is challenging, as crowd workers have diverse expertise [113]. A naive approach to infer true labels is majority voting (MV), which uses the majority of annotated labels as the true label. The MV approach often performs poorly in practice because it ignores the diverse expertise and reliability of crowd workers. An Expectation-Maximization (EM) approach [22] addresses the differences between crowd workers by estimating the error rate of each crowd worker from the crowd labels and, therefore, achieves higher accuracy than MV in inferring true labels. Inspired by this, Whitehill et al. [104] used an iterative approach considering both sample difficulty and crowd worker reliability to infer true labels. The above approaches focus only on inferring true labels. Some recent works integrate true label inference with downstream tasks. Kajino et al. [48] developed a clustered personal classifier method that simultaneously trains a classifier and estimates clusters of workers. Rodrigues et al. [83] generalized Gaussian process classification to account for crowd workers with diverse expertise. Raykar et al. [79] designed an EM-based approach to jointly learn a crowd worker noise model and a regression model. Khetan et al. [51] proposed another EM-based approach for learning from crowdsourced labeled data by jointly modeling latent true labels and crowd worker qualifications. Guan et al. [34] modeled information from each worker and then learned combination weights via back-propagation. As all the above approaches assume a uniform label distribution as prior knowledge for initialization, they cannot achieve good generalization when the given training set has an imbalanced true label distribution.

3.4.2 Handling Imbalanced Data

The performance of a classifier heavily relies on the quality and quantity of the training data [49]. Since the majority classes in an imbalanced training set can dominate the training loss, classifiers trained on imbalanced data often generalize poorly. Existing approaches to handling imbalanced data mainly fall into two categories: re-sampling and re-weighting. Re-sampling approaches balance the imbalanced data by under-sampling data from majority classes [110, 67] or over-sampling data from minority classes [15, 100]. As under-sampling approaches discard a portion of the available data, over-sampling approaches are often preferred in practice. The Synthetic Minority Over-sampling Technique (SMOTE) [15] is a well-accepted over-sampling approach. Instead of duplicating existing minority data samples to inflate minority classes, SMOTE produces unseen synthetic minority samples by applying linear interpolation between a specific minority sample and one of its nearest neighbors within the same class. Several variants of SMOTE [35, 39] further improve the prediction performance of classifiers trained on imbalanced datasets.
Re-weighting approaches allocate different weights to different classes or even different data samples. For example, Lin et al. [63] proposed the Focal loss to reshape the standard cross entropy loss such that it down-weights the loss assigned to well-classified data samples. Cui et al. [21] proposed to utilize a data overlap measure to quantify the effective number of samples for each class and to re-weight each class by the inverse of its effective number of samples. Existing imbalanced data handling approaches assume that the given labels are determinate and noise-free, which is not the case in crowdsourcing settings. Therefore, learning from imbalanced crowdsourced labels needs to be addressed.

3.5 Chapter Conclusion

In this chapter, we investigate the problem of learning from imbalanced crowdsourced labeled data. We present a novel ICED framework to deal with the imbalanced true label distribution and noisy crowdsourced labels. The ICED framework alleviates the negative impacts of the imbalanced true label distribution while exploiting the supervised information in the crowdsourced labels. To evaluate the performance of the ICED framework, we apply ICED to a classification task by training on both synthetic and real imbalanced crowdsourced labeled datasets and comparing its performance with several representative crowdsourced label processing approaches. Extensive experimental results demonstrate the effectiveness of our proposed framework ICED for learning from imbalanced crowdsourced labeled data.

CHAPTER 4 IMBALANCED ADVERSARIAL TRAINING WITH REWEIGHTING

Adversarial training has been empirically proven to be one of the most effective and reliable defense methods against adversarial attacks. However, the majority of existing studies focus on balanced data sets, where each class has a similar number of training examples. Research on adversarial training with imbalanced training data sets is rather limited. As an initial effort to investigate this problem, we reveal that adversarially trained models exhibit two behaviors distinct from naturally trained models on imbalanced data sets: (1) Compared to natural training, adversarially trained models can suffer much worse performance on under-represented classes when the training data set is extremely imbalanced. (2) Traditional reweighting strategies which assign large weights to under-represented classes will drastically hurt the model's performance on well-represented classes. In this chapter, to further understand these observations, we theoretically show that poor data separability is one key reason causing this strong tension between under-represented and well-represented classes. Motivated by this finding, we propose the Separable Reweighted Adversarial Training (SRAT) framework to facilitate adversarial training under imbalanced scenarios by learning more separable features for different classes. Extensive experiments on various data sets verify the effectiveness of the proposed framework.

4.1 Chapter Introduction

The existence of adversarial samples [91, 32] has raised serious concerns about applying deep neural network (DNN) models in security-critical applications, such as autonomous driving [17] and video surveillance systems [58]. As a countermeasure against adversarial attacks, adversarial training [71, 111, 103] has been empirically proven to be one of the most effective and reliable defense methods.
In general, it can be formulated as minimizing the model's average error on adversarially perturbed input examples [71]. Although promising for improving the model's robustness, most existing adversarial training methods assume that the number of training examples from each class is equally distributed. However, datasets collected from real-world applications typically have imbalanced distributions [27, 64]. Hence, it is natural to ask: What is the behavior of adversarial training under imbalanced scenarios? Can we directly apply existing imbalanced learning strategies from natural training to tackle the imbalance issue for adversarial training? Recent studies find that adversarial training usually presents distinct properties from natural training. For example, compared to natural training, adversarially trained models suffer more from the overfitting issue [85], and they tend to present strong class-wise performance disparities even if the training examples are uniformly distributed over different classes [108]. If the training data distribution is highly imbalanced, these properties of adversarial training can be greatly exaggerated, making it extremely difficult to apply in practice. Therefore, it is necessary but challenging to answer the aforementioned questions.

As the initial effort to study the imbalance problem in adversarial training, in this work, we first investigate the performance of existing adversarial training under imbalanced settings. In the preliminary study in Section 4.2.1, we apply both natural training and PGD adversarial training [71] on multiple imbalanced training datasets constructed from the CIFAR10 training dataset [56] and evaluate the trained models' performance on a class-balanced test dataset. Throughout this chapter, we denote standard/robust accuracy as the model's accuracy on input examples without/with perturbations, respectively; unless otherwise specified, we consider perturbations constrained by an l∞-norm bound of 8/255. From the preliminary results, we observe that, compared to naturally trained models, adversarially trained models always present very low standard & robust accuracy on under-represented classes. This observation suggests that adversarial training is more sensitive to imbalanced data distributions than natural training. Thus, when applying adversarial training in practice, imbalanced learning strategies should be considered.

As a result, we explore potential solutions which can handle the imbalance issue for adversarial training. In this chapter, we focus on studying the behavior of the reweighting strategy [41] and leave other strategies such as resampling [26] for future work. In Section 4.2.2, we apply the reweighting strategy to adversarial training with varied weights assigned to one under-represented class and evaluate the trained models' performance. From the results, we observe that, in adversarial training, increasing the weights for an under-represented class can substantially improve the standard & robust accuracy on this class, but drastically hurt the model's performance on the well-represented class. This finding indicates that the performance of adversarially trained models is very sensitive to reweighting manipulations, and it can be very hard to figure out an eligible reweighting strategy that is optimal for all classes.
It is also worth noting that, in natural training, we find that upweighting the under-represented class increases the model's standard accuracy on this class but only slightly hurts the accuracy on the well-represented class, even when adopting a large weight for the under-represented class.

To further investigate the possible reasons leading to the different behaviors of the reweighting strategy in natural and adversarial training, we visualize their learned features (in Figure 4.3) and observe that the features of different classes learned by the adversarially trained model tend to mix together, while they are well separated for the naturally trained model. This observation motivates us to theoretically show that when the given data distribution has poor data separability, upweighting under-represented classes will hurt the model's performance on well-represented classes. Motivated by our theoretical understanding, we propose a novel framework SRAT (Separable Reweighted Adversarial Training) to facilitate the reweighting strategy in imbalanced adversarial training by enhancing the separability of the learned features. Through experiments, we validate the effectiveness of SRAT. The main contributions of this chapter include:

• We empirically discover two major differences between naturally trained models and adversarially trained models under imbalanced settings, which reveal that adversarial training alone cannot work well given an imbalanced training dataset.

• We theoretically verify that poor data separability is one key reason causing the failure of adversarial training based methods under imbalanced settings.

• We propose a novel framework SRAT to facilitate the reweighting strategy in imbalanced adversarial training and demonstrate the effectiveness of SRAT via extensive experiments.

Figure 4.1 Class-wise performance of natural & adversarial training using an imbalanced CIFAR10. (a) Natural Training Standard Acc. (b) Adv. Training Standard Acc. (c) Adv. Training Robust Acc.

4.2 Preliminary Study

4.2.1 The Behavior of Adversarial Training

In this subsection, we conduct preliminary studies to examine the performance of PGD adversarial training [71]. Following previous works [21, 10], we construct an imbalanced CIFAR10 [56] training dataset, where each of the first 5 classes (a.k.a. well-represented classes) has 5,000 training examples and each of the last 5 classes (a.k.a. under-represented classes) has 50 training examples. Figure 4.1 shows the performance of naturally and adversarially trained models using a ResNet18 [42] architecture. From the figure, we can observe that, compared with natural training, PGD adversarial training results in a larger performance gap between well-represented classes and under-represented classes. For example, in natural training, the ratio between the average standard accuracy of well-represented classes (brown) and under-represented classes (violet) is about 2:1, while in adversarial training, this ratio expands to 16:1. Moreover, adversarial training has extremely poor performance on under-represented classes: 3 out of the 5 under-represented classes end up with 0% standard & robust accuracy. In conclusion, the performance of adversarial training is more easily affected by an imbalanced distribution than that of natural training, and it suffers more on under-represented classes. We also conduct more experiments under various imbalanced settings and have similar findings.

4.2.2 The Reweighting Strategy in Natural Training vs. in Adversarial Training
The preliminary study in Section 4.2.1 demonstrates that it is highly demanding to adjust the original adversarial training methods to accommodate imbalanced data distributions. Next, we investigate the effectiveness of adopting the reweighting strategy [41] in adversarial training. Our experiments are conducted under a binary classification setting, where the training dataset contains two classes randomly selected from the CIFAR10 dataset, with the two classes having 5,000 and 50 training examples, respectively. Based on this training dataset, we arrange multiple trials of (reweighted) natural training and (reweighted) adversarial training, with the weight ratio between the under-represented class and the well-represented class ranging from 1:1 to 200:1.

Figure 4.2 Class-wise performance of reweighted natural & adversarial training in binary classification. (a) Natural Training Standard Acc. (b) Adv. Training Standard Acc. (c) Adv. Training Robust Acc.

Figure 4.2 shows the experimental results with training data sampled from the classes "cat" and "horse". As demonstrated in Figure 4.2, increasing the weight for the under-represented class (horse) drastically increases the model's performance on this class, while also immensely decreasing the performance on the well-represented class (cat). For example, when increasing the weight ratio from 1:1 to 150:1, the standard accuracy of the under-represented class is improved from 0% to ∼60% and its robust accuracy from 0% to ∼50%. However, the standard accuracy on the well-represented class drops from 100% to 60%, and its robust accuracy drops from 100% to 50%. These results illustrate that adversarial training's performance can be significantly affected by the reweighting strategy. As a result, the reweighting strategy in this setting can hardly help improve the overall performance no matter which weight ratio is chosen, because the model's performance always presents a strong tension between these two classes. We also conduct more experiments using different binary imbalanced datasets and have similar observations.

Figure 4.3 t-SNE visualization of learned features. (a) Natural Training. (b) Adversarial Training.

4.3 Theoretical Analysis

In Section 4.2.2, we observed that in natural training, the reweighting strategy only makes a small impact on the two classes' performance. This phenomenon has been extensively studied by recent works [9, 107], which find that a linear classifier optimized by SGD on linearly separable data will converge to the solution of the hard-margin support vector machine [77]. In other words, as long as the data can be well separated, reweighting will not have a huge influence on the finally trained model. Inspired by these conclusions, we hypothesize that, as adversarially trained models separate the data poorly, their performance is highly sensitive to the reweighting strategy. As a direct validation of our hypothesis, in Figure 4.3, we visualize the learned (penultimate-layer) features of the imbalanced training examples used in the binary classification problem in Section 4.2.2. We find that adversarially trained models do present obviously poorer separability of the learned features. Next, we theoretically analyze the impact of reweighting on linear models optimized on poorly separable data.

Binary Classification Problem.
To construct the theoretical study, we focus on a binary classification problem with a Gaussian mixture distribution D defined as:

\[ y \sim \{-1, +1\}, \qquad x \sim \begin{cases} \mathcal{N}(\mu, \sigma^2 I), & \text{if } y = +1 \\ \mathcal{N}(-\mu, \sigma^2 I), & \text{if } y = -1 \end{cases} \tag{4.1} \]

where the two classes' centers ±μ ∈ R^d have mean value ±η (η > 0) in each dimension and variance σ². Formally, we define the data separability as S = η/σ². Intuitively, a larger S suggests that the two classes are well separated. Previous work [9] also closely studied this term to describe data separability. Besides, we assume the imbalanced training dataset satisfies the condition Pr.(y = +1) = K · Pr.(y = −1) with K > 1, where K indicates the imbalance ratio between the two classes. At test time, we assume the two classes appear with equal probability. Under the data distribution D, we discuss the performance of linear classifiers f(x) = sign(w^T x − b), where w and b are the weight and bias terms of the model f. If a reweighting strategy is involved, we say the model upweights the under-represented class "−1" by ρ. In the following lemma, we first derive the solution of the optimal linear classifier f trained on this imbalanced dataset. We then extend the result of Lemma 4.3.1 to analyze the impact of data separability on the performance of the model f.

Lemma 4.3.1. Under the data distribution D defined in Eq. (4.1), with an imbalance ratio K and a reweighting ratio ρ, the optimal classifier which minimizes the (reweighted) empirical risk

\[ f^* = \arg\min_f \big( \Pr.(f(x) \neq y \mid y = -1) \cdot \Pr.(y = -1) \cdot \rho + \Pr.(f(x) \neq y \mid y = +1) \cdot \Pr.(y = +1) \big) \tag{4.2} \]

has the solution: w = 1 (the all-ones vector) and

\[ b = \frac{1}{2} \log\Big(\frac{\rho}{K}\Big) \frac{d\sigma^2}{\eta} = \frac{1}{2} \log\Big(\frac{\rho}{K}\Big) \frac{d}{S}. \]

Proof. We first prove that the optimal model f* has parameters w_1 = w_2 = · · · = w_d (i.e., w = 1) by contradiction. We define G = {1, 2, . . . , d} and make the following assumption: for the optimal w and b, there exist w_i < w_j for some i ≠ j with i, j ∈ G. Then we obtain the following standard errors for class "−1" and class "+1" of this classifier f with weight w:

\[ \Pr.(f^*(x) \neq y \mid y = -1) = \Pr.(w^T \mathcal{N}(-\eta, \sigma^2) - b > 0) = \Pr.\Big\{ \sum_{k \neq i, k \neq j} w_k \mathcal{N}(-\eta, \sigma^2) + w_i \mathcal{N}(-\eta, \sigma^2) + w_j \mathcal{N}(-\eta, \sigma^2) - b > 0 \Big\}, \]
\[ \Pr.(f^*(x) \neq y \mid y = +1) = \Pr.(w^T \mathcal{N}(+\eta, \sigma^2) - b < 0) = \Pr.\Big\{ \sum_{k \neq i, k \neq j} w_k \mathcal{N}(+\eta, \sigma^2) + w_i \mathcal{N}(+\eta, \sigma^2) + w_j \mathcal{N}(+\eta, \sigma^2) - b < 0 \Big\}. \tag{4.3} \]

However, if we define a new classifier f̃ whose weight w̃ uses w_j in place of w_i, we obtain the errors for the new classifier:

\[ \Pr.(\tilde{f}(x) \neq y \mid y = -1) = \Pr.\Big\{ \sum_{k \neq i, k \neq j} w_k \mathcal{N}(-\eta, \sigma^2) + w_j \mathcal{N}(-\eta, \sigma^2) + w_j \mathcal{N}(-\eta, \sigma^2) - b > 0 \Big\}, \]
\[ \Pr.(\tilde{f}(x) \neq y \mid y = +1) = \Pr.\Big\{ \sum_{k \neq i, k \neq j} w_k \mathcal{N}(+\eta, \sigma^2) + w_j \mathcal{N}(+\eta, \sigma^2) + w_j \mathcal{N}(+\eta, \sigma^2) - b < 0 \Big\}. \tag{4.4} \]

Comparing the errors in Eq. (4.3) and Eq. (4.4), as w_i < w_j, the classifier f̃ has a smaller standard error in each class. This contradicts the assumption that f is the optimal classifier with the smallest error. Thus, we conclude that an optimal linear classifier in natural training must satisfy w_1 = w_2 = · · · = w_d (i.e., w = 1) if we do not consider the scale of w.
Next, we calculate the optimal bias term b given w = 1, where the optimal b minimizes the (reweighted) empirical risk:

\[ \text{Error}_{\text{train}}(f^*) = \Pr.(f^*(x) \neq y \mid y = -1) \cdot \Pr.(y = -1) \cdot \rho + \Pr.(f^*(x) \neq y \mid y = +1) \cdot \Pr.(y = +1) \]
\[ \propto \Pr.(f^*(x) \neq y \mid y = -1) \cdot \rho + \Pr.(f^*(x) \neq y \mid y = +1) \cdot K \]
\[ = \rho \cdot \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(-\eta, \sigma^2) - b > 0 \Big) + K \cdot \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(\eta, \sigma^2) - b < 0 \Big) \]
\[ = \rho \cdot \Pr.\Big( \mathcal{N}(0,1) < -\frac{b + d\eta}{d\sigma} \Big) + K \cdot \Pr.\Big( \mathcal{N}(0,1) < \frac{b - d\eta}{d\sigma} \Big), \]

and we take the derivative with respect to b:

\[ \frac{\partial \text{Error}_{\text{train}}}{\partial b} = \frac{\rho}{\sqrt{2\pi}} \cdot \Big( -\frac{1}{d\sigma} \Big) \exp\Big( -\frac{1}{2} \Big( \frac{b + d\eta}{d\sigma} \Big)^2 \Big) + \frac{K}{\sqrt{2\pi}} \cdot \Big( \frac{1}{d\sigma} \Big) \exp\Big( -\frac{1}{2} \Big( \frac{b - d\eta}{d\sigma} \Big)^2 \Big). \]

Setting ∂Error_train/∂b = 0, we obtain the optimal b that gives the minimum value of the empirical error:

\[ b = \frac{1}{2} \log\Big( \frac{\rho}{K} \Big) \frac{d\sigma^2}{\eta} = \frac{1}{2} \log\Big( \frac{\rho}{K} \Big) \frac{d}{S}. \qquad \square \]

Lemma 4.3.1 indicates that the optimal classifier has a weight vector equal to 1 and a bias term b that only depends on K, ρ, and the data separability S. In the following, we first focus on the special setting ρ = 1, i.e., the original ERM model without reweighting. Specifically, we aim to compare the behavior of linear models when they separate the data poorly (like adversarially trained models) versus when they separate the data well (like naturally trained models).

Theorem 4.3.2. Under two data distributions (x⁽¹⁾, y⁽¹⁾) ∈ D₁ and (x⁽²⁾, y⁽²⁾) ∈ D₂ with different separabilities S₁ > S₂, let f₁* and f₂* be the optimal non-reweighted classifiers (ρ = 1) under D₁ and D₂, respectively. Given that the imbalance ratio K is large enough, we have:

\[ \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = -1) - \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) < \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = -1) - \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1). \tag{4.5} \]

Proof. Without loss of generality, for distributions D₁ and D₂ with different mean-variance pairs (±η₁, σ₁²) and (±η₂, σ₂²), we need only consider the case η₁ = η₂ = η and σ₁² < σ₂²; otherwise, we can simply rescale one of them to match the mean vector of the other without impacting the results. Under this definition, the optimal classifiers f₁* and f₂* have weight vector w = 1 and bias terms b₁ and b₂, with the values given in Lemma 4.3.1. We prove Theorem 4.3.2 in two steps.

Step 1. For the error on class "−1", we have:

\[ \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = -1) = \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(-\eta, \sigma_1^2) - b_1 > 0 \Big) < \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(-\eta, \sigma_1^2) - b_2 > 0 \Big) < \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(-\eta, \sigma_2^2) - b_2 > 0 \Big) = \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = -1). \]

Step 2. For the error on class "+1", we have:

\[ \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) = \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(\eta, \sigma_1^2) - b_1 < 0 \Big) = \Pr.\Big( \mathcal{N}(0,1) < \frac{b_1 - d\eta}{d\sigma_1} \Big) = \Pr.\Big( \mathcal{N}(0,1) < -\frac{\log(K)\,\sigma_1}{2\eta} - \frac{\eta}{\sigma_1} \Big), \tag{4.6} \]

and similarly,

\[ \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1) = \Pr.\Big( \mathcal{N}(0,1) < -\frac{\log(K)\,\sigma_2}{2\eta} - \frac{\eta}{\sigma_2} \Big). \tag{4.7} \]

Note that when K is large enough, i.e., log(K) > 2η²/(σ₁ · σ₂), the Z-score in Eq. (4.6) is larger than that in Eq. (4.7). As a result, we have:

\[ \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) > \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1). \tag{4.8} \]

Combining Step 1 and Step 2 yields the inequality in Theorem 4.3.2. □

Intuitively, Theorem 4.3.2 suggests that when the data separability S is low (as in D₂), the optimal classifier (without reweighting) intrinsically has a larger error difference between the under-represented class "−1" and the well-represented class "+1".
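The classwise errors in the proof above can be checked numerically. The following sketch evaluates them with SciPy, following the chapter's Z-score convention Z = (b ± dη)/(dσ) and the bias formula from Lemma 4.3.1; the parameter values (d, K, η, σ) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def classwise_errors(eta, sigma, d, K, rho):
    """Classwise errors of the optimal linear classifier in Lemma 4.3.1,
    using the chapter's convention Z = (b +/- d*eta) / (d*sigma)."""
    S = eta / sigma ** 2                               # data separability
    b = 0.5 * np.log(rho / K) * d / S                  # optimal bias (Lemma 4.3.1)
    err_neg = norm.cdf(-(b + d * eta) / (d * sigma))   # error on class "-1"
    err_pos = norm.cdf((b - d * eta) / (d * sigma))    # error on class "+1"
    return err_neg, err_pos

# Same class centers, different separability: S1 = 1.0 vs. S2 = 1/9.
d, K, eta = 10, 10.0, 1.0
for sigma in (1.0, 3.0):
    e_neg, e_pos = classwise_errors(eta, sigma, d, K, rho=1.0)  # no reweighting
    print(f"sigma={sigma}: err(-1) - err(+1) = {e_neg - e_pos:.4f}")
# The classwise error gap comes out larger for the poorly separable
# distribution, as Theorem 4.3.2 states.
```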
Similar to the observations in Section 4.2.1 and Figure 4.3, adversarially trained models present a weak ability to separate data, and they also present a strong performance gap between the well-represented class and the under-represented class. Conclusively, Theorem 4.3.2 indicates that the poor ability to separate the training data can be one important reason leading to the strong performance gap of adversarially trained models.

Next, we consider the case when the reweighting strategy is applied. Similar to Theorem 4.3.2, we calculate the models' classwise errors under D₁ and D₂ with different levels of separability. In particular, Theorem 4.3.3 focuses on the well-represented class "+1" and calculates its error increase when upweighting the under-represented class "−1" by ρ. Through the analysis in Theorem 4.3.3, we compare the impact of upweighting the under-represented class on the performance of the well-represented class.

Theorem 4.3.3. Under two data distributions (x⁽¹⁾, y⁽¹⁾) ∈ D₁ and (x⁽²⁾, y⁽²⁾) ∈ D₂ with different separabilities S₁ > S₂, let f₁* and f₂* be the optimal non-reweighted classifiers (ρ = 1) under D₁ and D₂, respectively, and let f′₁* and f′₂* be the optimal reweighted classifiers under D₁ and D₂ given the optimal reweighting ratio (ρ = K). Given that the imbalance ratio K is large enough, we have:

\[ \Pr.(f_1'^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) - \Pr.(f_1^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) < \Pr.(f_2'^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1) - \Pr.(f_2^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1). \tag{4.9} \]

Proof. We first show that under both distributions D₁ and D₂, the optimal reweighting ratio ρ equals the imbalance ratio K. Based on the results in Eq. (4.3) and the calculated model parameters w and b, the test error of the model trained with reweighting value ρ is:

\[ \text{Error}_{\text{test}}(f^*) = \Pr.(f^*(x) \neq y \mid y = -1) \cdot \Pr.(y = -1) + \Pr.(f^*(x) \neq y \mid y = +1) \cdot \Pr.(y = +1) \]
\[ \propto \Pr.\Big( \mathcal{N}(0,1) < -\frac{b + d\eta}{d\sigma} \Big) + \Pr.\Big( \mathcal{N}(0,1) < \frac{b - d\eta}{d\sigma} \Big) \]
\[ = \Pr.\Big( \mathcal{N}(0,1) < -\frac{1}{2} \log\Big( \frac{\rho}{K} \Big) \frac{\sigma}{\eta} - \frac{\eta}{\sigma} \Big) + \Pr.\Big( \mathcal{N}(0,1) < \frac{1}{2} \log\Big( \frac{\rho}{K} \Big) \frac{\sigma}{\eta} - \frac{\eta}{\sigma} \Big). \]

This value attains its minimum when its derivative with respect to ρ equals 0, from which we obtain ρ = K and the bias term b = 0. Note that the variances satisfy σ₁² < σ₂². Therefore, it is easy to get:

\[ \Pr.(f_1'^*(x^{(1)}) \neq y^{(1)} \mid y^{(1)} = +1) = \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(\eta, \sigma_1^2) < 0 \Big) < \Pr.\Big( \sum_{i=1}^{d} \mathcal{N}(\eta, \sigma_2^2) < 0 \Big) = \Pr.(f_2'^*(x^{(2)}) \neq y^{(2)} \mid y^{(2)} = +1). \tag{4.10} \]

Combining the results in Eq. (4.8) and Eq. (4.10) proves the inequality in Theorem 4.3.3. □

As Theorem 4.3.3 shows, when the data distribution has poorer data separability (as in D₂), upweighting the under-represented class causes greater harm to the performance of the well-represented class. This is also consistent with our empirical findings on adversarially trained models: since adversarially trained models separate the data poorly (Figure 4.3), upweighting the under-represented class always drastically decreases the performance of the well-represented class (Section 4.2.2). Through the discussions in both Theorem 4.3.2 and Theorem 4.3.3, we conclude that poor separability can be one important reason that makes it extremely difficult for adversarial training and its reweighted variants to achieve good performance under imbalanced data distributions. Therefore, in the next section, we explore potential solutions which can facilitate the reweighting strategy in adversarial training.
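Before moving on, the prediction of Theorem 4.3.3 can be checked numerically as well, reusing the `classwise_errors` helper from the sketch after Theorem 4.3.2; the parameter values are again illustrative assumptions.

```python
# Reusing classwise_errors() from the sketch above: error increase on the
# well-represented class "+1" when upweighting class "-1" from rho = 1 to
# the optimal rho = K (which sets the bias b to 0), per Theorem 4.3.3.
d, K, eta = 10, 10.0, 1.0
for sigma in (1.0, 3.0):                         # S1 = 1.0 vs. S2 = 1/9
    _, e_pos_plain = classwise_errors(eta, sigma, d, K, rho=1.0)
    _, e_pos_reweighted = classwise_errors(eta, sigma, d, K, rho=K)
    print(f"sigma={sigma}: increase on '+1' = {e_pos_reweighted - e_pos_plain:.4f}")
# The increase is larger for the poorly separable distribution D2.
```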
4.4 Separable Reweighted Adversarial Training

The observations from both the preliminary study and the theoretical understanding indicate that more separable data will advance the reweighting strategy in adversarial training under imbalanced scenarios. Thus, in this section, we present a framework, Separable Reweighted Adversarial Training (SRAT), which facilitates the reweighting strategy in adversarial training under imbalanced scenarios by increasing the separability of the learned feature space.

4.4.1 Reweighted Adversarial Training

Given an input example (x, y), adversarial training [71] aims to obtain a robust model f_θ that makes the same prediction y for an adversarial example x′ generated by applying an adversarial perturbation to x. The adversarial perturbations are typically bounded by a small value ε under the L_p-norm, i.e., ‖x′ − x‖_p ≤ ε.

As indicated in Section 4.2.1, adversarial training cannot be applied in imbalanced scenarios directly, as it presents very low performance on under-represented classes. To tackle this problem, a natural idea is to integrate existing imbalanced learning strategies proposed for natural training, such as reweighting, into adversarial training to improve the trained model's performance on those under-represented classes. Hence, reweighted adversarial training can be defined as

\[ \min_\theta \frac{1}{n} \sum_{i=1}^{n} \max_{\|x'_i - x_i\|_p \leq \epsilon} w_i \mathcal{L}(f_\theta(x'_i), y_i), \tag{4.11} \]

where w_i is a weight value assigned to each input sample (x_i, y_i) based on the example size of the class (x_i, y_i) belongs to or on some properties of (x_i, y_i). In most existing adversarial training methods [71, 111, 103], the cross entropy (CE) loss is adopted as the loss function L(·, ·). However, the CE loss can be suboptimal in imbalanced scenarios, and some loss functions designed specifically for imbalanced learning, such as the Focal loss [63] and the LDAM loss [10], have proven their superiority in natural training. Hence, besides the CE loss, the Focal loss and the LDAM loss can also be adopted as the loss function L(·, ·) in Eq. (4.11).

4.4.2 Increasing Feature Separability

Our preliminary study indicates that reweighted adversarial training alone cannot work well under imbalanced scenarios. Moreover, the reweighting strategy behaves very differently between natural training and adversarial training. Meanwhile, our theoretical analysis suggests that the poor separability of the feature space produced by the adversarially trained model can be one reason behind these observations. Hence, in order to facilitate the reweighting strategy in adversarial training under imbalanced scenarios, we equip our SRAT method with a feature separation loss, aiming to make the learned feature space as separable as possible. More specifically, the goal of the feature separation loss is to make (1) the learned features of examples from the same class well clustered, and (2) the features of examples from different classes well separated. By achieving this goal, the model is able to learn more discriminative features for each class, and correspondingly, adjusting the decision boundary via the reweighting strategy to better fit the under-represented classes' examples will not drastically hurt the well-represented classes.
4.4.2 Increasing Feature Separability

Our preliminary study indicates that reweighted adversarial training alone cannot work well under imbalanced scenarios. Moreover, the reweighting strategy behaves very differently between natural training and adversarial training. Meanwhile, our theoretical analysis suggests that the poor separability of the feature space produced by the adversarially trained model can be one reason behind these observations. Hence, in order to facilitate the reweighting strategy in adversarial training under imbalanced scenarios, we equip our SRAT method with a feature separation loss. We aim to make the learned feature space as separable as possible. More specifically, the goal of the feature separation loss is to make (1) the learned features of examples from the same class well clustered, and (2) the features of examples from different classes well separated. By achieving this goal, the model is able to learn more discriminative features for each class. Consequently, adjusting the decision boundary via the reweighting strategy to better fit the under-represented classes' examples will not drastically hurt the well-represented classes.

The feature separation loss is formally defined as:

$$\mathcal{L}_{sep}(x'_i) = -\frac{1}{|P(i)|}\sum_{p \in P(i)} \log \frac{\exp(z'_i \cdot z'_p / \tau)}{\sum_{a \in A(i)} \exp(z'_i \cdot z'_a / \tau)}, \qquad (4.12)$$

where z'_i is the feature representation of the adversarial example x'_i of x_i, τ ∈ R+ is a scalar temperature parameter, P(i) denotes the set of input examples belonging to the same class as x_i, and A(i) denotes the set of all input examples except x'_i. When minimizing the feature separation loss during training, the learned features of examples from the same class will tend to aggregate together in the latent feature space and, hence, result in a more separable latent feature space.

Our proposed feature separation loss L_sep(·) is inspired by the supervised contrastive loss proposed in [52]. The main difference is that, instead of applying data augmentation techniques to generate two different views of each data example and feeding the model with the augmented data examples, our feature separation loss directly takes the adversarial example x'_i of each data example x_i as input.
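The feature separation loss in Eq. (4.12) can be sketched in PyTorch as a supervised-contrastive-style loss over the features of adversarial examples. This is a minimal sketch under assumptions: the features are l2-normalized before the dot products, and the current batch plays the role of the sets P(i) and A(i).

```python
# A minimal sketch of the feature separation loss (Eq. (4.12)).
import torch
import torch.nn.functional as F

def feature_separation_loss(z_adv, y, tau=0.1):
    """z_adv: (N, d) features of adversarial examples; y: (N,) labels."""
    z = F.normalize(z_adv, dim=1)                    # assumption: unit-norm features
    sim = z @ z.t() / tau                            # pairwise z'_i . z'_a / tau
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float("-inf"))       # A(i) excludes i itself
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # P(i)
    n_pos = pos_mask.sum(dim=1).clamp(min=1)         # guard classes with one sample
    loss = -(log_prob.masked_fill(self_mask, 0.0) * pos_mask).sum(dim=1) / n_pos
    return loss.mean()
```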
4.4.3 Training Schedule

By combining the feature separation loss with the reweighted adversarial training, the final objective function of Separable Reweighted Adversarial Training (SRAT) is defined as:

$$\min_\theta \frac{1}{n}\sum_{i=1}^{n}\max_{\|x'_i - x_i\|_p \le \epsilon} w_i\,\mathcal{L}\big(f_\theta(x'_i), y_i\big) + \lambda\,\mathcal{L}_{sep}(x'_i), \qquad (4.13)$$

where the hyper-parameter λ balances the contributions of the reweighted adversarial training loss and the feature separation loss.

In practice, in order to better take advantage of the reweighting strategy in our SRAT method, we adopt a deferred reweighting training schedule [10]. Specifically, before annealing the learning rate, our SRAT method first trains a model guided by Eq. (4.13) without the reweighting strategy, i.e., setting w_i = 1 for every input example x'_i, and then applies reweighting to the model training process with a smaller learning rate. Since SRAT learns a more separable feature space, this deferred re-balancing training schedule, compared with applying the reweighting strategy from the beginning of training, enables the reweighting strategy to obtain more benefit from SRAT and, as a result, boosts the final performance.

4.4.4 Algorithm

The algorithm of our proposed SRAT framework is shown in Algorithm 4.1.

Algorithm 4.1 Separable Reweighted Adversarial Training.
Input: imbalanced training dataset D = {(x_i, y_i)}_{i=1}^n, number of total training epochs T, starting reweighting epoch T_d, batch size N, number of batches M, learning rate γ
Output: adversarially robust model f_θ
1:  Initialize the model parameters θ randomly.
2:  for epoch = 1, ..., T_d − 1 do
3:      for mini-batch = 1, ..., M do
4:          Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D.
5:          Generate adversarial example x'_i for each x_i ∈ B.
6:          L(f_θ) = (1/N) Σ_{i=1}^N max_{||x'_i − x_i||_p ≤ ε} L(f_θ(x'_i), y_i) + λ L_sep(x'_i)
7:          θ ← θ − γ∇_θ L(f_θ)
8:      end for
9:      Optional: γ ← γ/κ
10: end for
11: for epoch = T_d, ..., T do
12:     for mini-batch = 1, ..., M do
13:         Sample a mini-batch B = {(x_i, y_i)}_{i=1}^N from D.
14:         Generate adversarial example x'_i for each x_i ∈ B.
15:         L(f_θ) = (1/N) Σ_{i=1}^N max_{||x'_i − x_i||_p ≤ ε} w_i L(f_θ(x'_i), y_i) + λ L_sep(x'_i)
16:         θ ← θ − γ∇_θ L(f_θ)
17:     end for
18:     Optional: γ ← γ/κ
19: end for

Specifically, in each training iteration, we first generate adversarial examples using PGD for the examples in the current batch (Line 5). If the current training epoch has not reached the predefined starting reweighting epoch T_d, we assign the same weight, i.e., w_i = 1, to every adversarial example x'_i in the current batch (Line 6). Otherwise, the reweighting strategy is adopted in the final loss function (Line 15), where a specific weight w_i is assigned to each adversarial example x'_i if its corresponding clean example x_i comes from an under-represented class.

4.5 Experiment

In this section, we perform experiments to validate the effectiveness of our SRAT method. We first compare SRAT with several representative imbalanced learning methods in adversarial training under various imbalanced scenarios, and then conduct an ablation study to understand SRAT more deeply.

4.5.1 Experimental Settings

Datasets. We conduct experiments on multiple imbalanced training datasets artificially created from two benchmark image datasets, CIFAR10 and CIFAR100 [56], with diverse imbalanced distributions. Specifically, we consider two different imbalance types: Exponential (Exp) imbalance [21] and Step imbalance [7]. For Exp imbalance, the number of training examples of each class is reduced according to an exponential function n = n_i τ^i, where i is the class index, n_i is the number of training examples of class i in the original training dataset, and τ ∈ (0, 1). We categorize the half of the classes with the largest example sizes in the imbalanced training dataset as well-represented classes and the remaining half as under-represented classes. For Step imbalance, we follow the process adopted in Section 4.2.1. Moreover, we denote the imbalance ratio K as the ratio between the training example sizes of the most frequent and the least frequent class. We construct different imbalanced datasets, "Step-10", "Step-100", "Exp-10" and "Exp-100", by adopting different imbalance types (Step or Exp) with different imbalance ratios (K = 10 or K = 100) to train models, and evaluate each model's performance on the original uniformly distributed test datasets of CIFAR10 and CIFAR100 correspondingly.

Baseline methods. We implement several representative and state-of-the-art imbalanced learning methods (or their combinations) into adversarial training as baseline methods. These methods include: (1) Focal loss (Focal); (2) LDAM loss (LDAM); (3) Class-balanced reweighting (CB-Reweight) [21], where each example is reweighted proportionally by the inverse of the effective number² of its class; (4) Class-balanced Focal loss (CB-Focal) [21], a combination of the Class-balanced method and the Focal loss, where well-classified examples are downweighted while hard-classified examples are upweighted, controlled by their corresponding effective number; (5) deferred reweighted CE loss (DRCB-CE), where a deferred reweighting training schedule is applied based on the CE loss; (6) deferred reweighted Class-balanced Focal loss (DRCB-Focal), where a deferred reweighting training schedule is applied based on the CB-Focal loss; (7) deferred reweighted Class-balanced LDAM loss (DRCB-LDAM) [10], where a deferred reweighting training schedule is applied based on the CB-LDAM loss. We also include the original PGD adversarial training method using the cross entropy loss (CE) in our experiments.

² The effective number is defined as the volume of examples and can be calculated by (1 − β^{n_i})/(1 − β), where β ∈ [0, 1) is a hyperparameter and n_i denotes the number of examples of class i.
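As a small illustration of the class-balanced weighting used by several of the baselines above, the following sketch computes per-class weights from the effective number defined in the footnote. The choice β = 0.9999 and the normalization are common conventions, assumed here rather than taken from the dissertation.

```python
# Class-balanced weights via the effective number E_i = (1 - beta^{n_i}) / (1 - beta);
# each class is weighted by 1 / E_i (rescaled so the weights average to 1).
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.9999):
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(n) / weights.sum()

# e.g. a Step-100 CIFAR10 split: five classes with 5000 samples, five with 50
print(class_balanced_weights([5000] * 5 + [50] * 5))
```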
Our proposed methods. We evaluate three variants of our proposed SRAT method with different implementations of the prediction loss L(·, ·) in Eq. (4.11), i.e., the CE loss, the Focal loss, and the LDAM loss. The variant utilizing the CE loss is denoted as SRAT-CE, and, similarly, the other two variants are denoted as SRAT-Focal and SRAT-LDAM, respectively. For all three variants, the Class-balanced method [21] is adopted to set weight values within the deferred reweighting training schedule.

Implementation details. All aforementioned methods are implemented using the Pytorch library DeepRobust [62]. For the CIFAR10/CIFAR100 based datasets, the adversarial examples used in training are calculated by PGD-10, with a perturbation budget ε = 8/255 and step size γ = 2/255; in evaluation, we report robust accuracy under l∞-norm 8/255 attacks generated by PGD-20 on Resnet-18 [42] models. We set the total number of training epochs to 200 and the initial learning rate to 0.1, and decay the learning rate at epochs 160 and 180 with the ratio 0.01. The deferred reweighting strategy is applied starting from epoch 160.

Table 4.1 Performance comparison on the CIFAR10 Step-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 63.26 ± 0.59 | 40.62 ± 1.10 | 36.96 ± 0.36 | 14.23 ± 0.83 |
| Focal | 63.57 ± 0.92 | 41.17 ± 2.07 | 36.89 ± 0.36 | 14.25 ± 0.97 |
| LDAM | 57.08 ± 1.16 | 31.09 ± 2.20 | 37.18 ± 0.56 | 12.44 ± 0.93 |
| CB-Reweight | 73.30 ± 0.30 | 74.80 ± 0.88 | 41.34 ± 0.42 | 42.15 ± 1.42 |
| CB-Focal | 73.42 ± 0.29 | 74.35 ± 1.39 | 41.34 ± 0.23 | 41.80 ± 1.24 |
| DRCB-CE | 75.89 ± 0.23 | 70.55 ± 1.10 | 39.93 ± 0.24 | 33.33 ± 1.42 |
| DRCB-Focal | 74.61 ± 0.35 | 67.06 ± 1.37 | 37.91 ± 0.24 | 29.50 ± 1.31 |
| DRCB-LDAM | 72.95 ± 0.08 | 75.42 ± 1.83 | 45.23 ± 0.19 | 44.98 ± 1.90 |
| SRAT-CE | 76.69 ± 0.33 | 73.07 ± 0.63 | 41.02 ± 0.49 | 36.57 ± 0.92 |
| SRAT-Focal | 75.41 ± 0.69 | 74.91 ± 0.70 | 42.05 ± 0.52 | 41.28 ± 0.82 |
| SRAT-LDAM | 73.99 ± 0.52 | 76.63 ± 0.39 | 45.60 ± 0.18 | 45.95 ± 0.51 |

Table 4.2 Performance comparison on the CIFAR10 Step-100 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 47.29 ± 0.32 | 9.03 ± 0.99 | 30.39 ± 0.24 | 1.62 ± 0.41 |
| Focal | 47.36 ± 0.19 | 9.03 ± 0.52 | 30.12 ± 0.31 | 1.45 ± 0.12 |
| LDAM | 42.49 ± 0.62 | 0.85 ± 0.46 | 30.80 ± 0.31 | 0.05 ± 0.06 |
| CB-Reweight | 37.68 ± 1.18 | 19.64 ± 1.82 | 25.58 ± 0.62 | 10.33 ± 0.82 |
| CB-Focal | 15.44 ± 3.85 | 0.00 ± 0.00 | 14.46 ± 3.16 | 0.00 ± 0.00 |
| DRCB-CE | 53.40 ± 1.20 | 22.86 ± 3.03 | 28.31 ± 0.59 | 3.35 ± 0.56 |
| DRCB-Focal | 52.75 ± 0.96 | 21.81 ± 2.27 | 27.78 ± 0.49 | 3.24 ± 0.57 |
| DRCB-LDAM | 61.60 ± 0.44 | 50.69 ± 2.27 | 31.37 ± 0.45 | 16.25 ± 2.04 |
| SRAT-CE | 60.04 ± 1.16 | 41.71 ± 2.07 | 30.00 ± 0.80 | 12.25 ± 1.43 |
| SRAT-Focal | 62.93 ± 1.10 | 51.83 ± 3.33 | 28.38 ± 1.00 | 15.89 ± 3.15 |
| SRAT-LDAM | 63.13 ± 1.17 | 52.73 ± 3.23 | 33.51 ± 0.68 | 18.89 ± 0.59 |

Table 4.3 Performance comparison on the CIFAR10 Exp-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 71.95 ± 0.52 | 64.09 ± 0.44 | 37.94 ± 0.19 | 26.79 ± 0.51 |
| Focal | 72.06 ± 0.78 | 63.99 ± 1.15 | 37.62 ± 0.34 | 26.27 ± 1.04 |
| LDAM | 67.39 ± 1.00 | 58.01 ± 2.26 | 41.35 ± 0.32 | 28.65 ± 0.83 |
| CB-Reweight | 75.17 ± 0.15 | 76.87 ± 0.69 | 41.02 ± 0.39 | 41.67 ± 0.89 |
| CB-Focal | 74.73 ± 0.41 | 76.67 ± 0.26 | 38.86 ± 0.67 | 42.41 ± 0.56 |
| DRCB-CE | 76.25 ± 0.09 | 75.83 ± 0.49 | 40.02 ± 0.45 | 37.93 ± 0.65 |
| DRCB-Focal | 75.36 ± 0.40 | 72.72 ± 0.94 | 37.76 ± 0.54 | 33.83 ± 0.68 |
| DRCB-LDAM | 73.92 ± 0.31 | 78.53 ± 1.24 | 46.29 ± 0.46 | 48.81 ± 0.54 |
| SRAT-CE | 76.74 ± 0.15 | 78.61 ± 0.63 | 42.39 ± 0.71 | 43.37 ± 0.38 |
| SRAT-Focal | 75.26 ± 0.00 | 80.52 ± 0.00 | 42.37 ± 0.00 | 47.22 ± 0.00 |
| SRAT-LDAM | 74.63 ± 0.00 | 79.82 ± 0.00 | 46.72 ± 0.00 | 50.38 ± 0.00 |

Table 4.4 Performance comparison on the CIFAR10 Exp-100 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 48.40 ± 0.59 | 23.04 ± 1.15 | 26.94 ± 0.84 | 6.17 ± 0.86 |
| Focal | 49.16 ± 0.61 | 23.69 ± 1.15 | 26.84 ± 0.59 | 5.88 ± 0.48 |
| LDAM | 48.39 ± 0.99 | 25.69 ± 1.35 | 29.51 ± 0.27 | 8.95 ± 0.45 |
| CB-Reweight | 57.49 ± 0.58 | 56.47 ± 1.67 | 29.01 ± 0.30 | 26.53 ± 1.27 |
| CB-Focal | 50.35 ± 0.44 | 60.05 ± 0.53 | 27.15 ± 0.20 | 33.56 ± 0.35 |
| DRCB-CE | 57.30 ± 0.30 | 37.90 ± 1.23 | 26.97 ± 0.55 | 10.57 ± 1.03 |
| DRCB-Focal | 54.76 ± 0.30 | 31.79 ± 1.30 | 25.24 ± 0.39 | 7.81 ± 0.87 |
| DRCB-LDAM | 62.65 ± 0.50 | 57.19 ± 2.10 | 31.66 ± 0.56 | 22.11 ± 1.70 |
| SRAT-CE | 64.29 ± 0.46 | 61.81 ± 1.83 | 29.99 ± 0.43 | 24.09 ± 0.98 |
| SRAT-Focal | 62.57 ± 0.47 | 64.88 ± 0.81 | 30.34 ± 0.67 | 28.66 ± 1.60 |
| SRAT-LDAM | 63.11 ± 0.08 | 65.60 ± 1.94 | 34.22 ± 0.41 | 32.55 ± 1.70 |
4.5.2 Performance Comparison

Tables 4.1–4.6 show the performance comparison on several different imbalanced CIFAR10 and CIFAR100 datasets. In these tables, we use bold values to denote the highest accuracy among all methods and underlined values to indicate the SRAT variants that achieve the highest accuracy among the baseline methods utilizing the same loss function for making predictions.

Table 4.5 Performance comparison on the CIFAR100 Step-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 39.90 ± 0.11 | 17.90 ± 0.38 | 17.88 ± 0.32 | 6.40 ± 0.60 |
| Focal | 40.10 ± 0.27 | 17.99 ± 0.75 | 17.67 ± 0.30 | 6.40 ± 0.18 |
| LDAM | 39.34 ± 0.54 | 17.57 ± 0.94 | 20.95 ± 0.20 | 7.41 ± 0.37 |
| CB-Reweight | 38.88 ± 0.40 | 30.73 ± 0.49 | 16.67 ± 0.58 | 11.71 ± 0.62 |
| CB-Focal | 39.49 ± 0.30 | 28.96 ± 0.14 | 16.55 ± 0.39 | 11.09 ± 0.33 |
| DRCB-CE | 45.21 ± 0.11 | 33.26 ± 0.09 | 18.36 ± 0.33 | 11.15 ± 0.48 |
| DRCB-Focal | 44.28 ± 0.15 | 30.57 ± 0.22 | 17.30 ± 0.39 | 9.73 ± 0.18 |
| DRCB-LDAM | 44.70 ± 0.46 | 35.90 ± 0.92 | 21.80 ± 0.12 | 15.19 ± 0.36 |
| SRAT-CE | 47.17 ± 0.26 | 37.81 ± 0.38 | 21.36 ± 0.31 | 15.41 ± 0.19 |
| SRAT-Focal | 46.83 ± 0.28 | 38.10 ± 0.58 | 21.66 ± 0.32 | 16.52 ± 0.32 |
| SRAT-LDAM | 45.41 ± 0.55 | 36.39 ± 0.65 | 23.15 ± 0.15 | 16.84 ± 0.08 |

Table 4.6 Performance comparison on the CIFAR100 Exp-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 41.88 ± 0.36 | 31.30 ± 0.57 | 16.62 ± 0.03 | 11.22 ± 0.21 |
| Focal | 41.64 ± 0.51 | 31.02 ± 0.71 | 16.29 ± 0.18 | 10.97 ± 0.34 |
| LDAM | 41.55 ± 0.60 | 31.74 ± 0.91 | 20.20 ± 0.20 | 14.71 ± 0.51 |
| CB-Reweight | 41.82 ± 0.11 | 34.37 ± 0.31 | 17.05 ± 0.35 | 13.53 ± 0.57 |
| CB-Focal | 40.86 ± 0.13 | 32.21 ± 0.01 | 16.08 ± 0.41 | 12.30 ± 0.59 |
| DRCB-CE | 43.89 ± 0.26 | 37.28 ± 0.29 | 16.90 ± 0.19 | 13.62 ± 0.14 |
| DRCB-Focal | 43.38 ± 0.30 | 36.17 ± 0.57 | 16.04 ± 0.18 | 12.56 ± 0.27 |
| DRCB-LDAM | 43.36 ± 0.48 | 39.27 ± 0.72 | 20.36 ± 0.30 | 17.63 ± 0.38 |
| SRAT-CE | 45.84 ± 0.18 | 41.72 ± 0.53 | 21.20 ± 0.15 | 19.23 ± 0.36 |
| SRAT-Focal | 46.38 ± 0.28 | 42.53 ± 0.79 | 20.09 ± 0.25 | 17.83 ± 0.56 |
| SRAT-LDAM | 44.98 ± 0.33 | 40.39 ± 0.69 | 21.83 ± 0.33 | 18.99 ± 0.59 |

From these tables, we can make the following observations. First, compared to the baseline methods, our SRAT method obtains improved performance in terms of both overall standard and robust accuracy under almost all imbalanced settings. More importantly, SRAT makes significant improvements on the under-represented classes, especially under the extremely imbalanced settings.
For example, on the CIFAR10 Step-100 dataset, our SRAT-Focal method improves the standard accuracy on under-represented classes from the 21.81% achieved by the best baseline method utilizing the Focal loss to 51.83%, and the robust accuracy from 3.24% to 15.89%. These results demonstrate that SRAT is able to obtain more robustness under imbalanced settings. Second, the performance gap among the three SRAT variants is mainly caused by the gap between the loss functions used in these methods. As shown in these tables, DRCB-LDAM typically performs better than DRCB-CE and DRCB-Focal, and similarly, SRAT-LDAM outperforms SRAT-CE and SRAT-Focal under the same settings.

4.5.3 Ablation Study

In this subsection, we provide an ablation study to understand our SRAT method more comprehensively.

Feature space visualization. In order to facilitate the reweighting strategy in adversarial training under the imbalanced setting, we introduce a feature separation loss in our SRAT method. The main goal of the feature separation loss is to make the learned feature space as separable as possible. To check whether the feature separation loss works as expected, we apply t-SNE [94] to visualize the latent feature space learned by our SRAT-LDAM method as well as by the original PGD adversarial training method (CE) and the DRCB-LDAM method in Figure 4.4.

Figure 4.4 t-SNE visualization of different learned features: (a) CE; (b) DRCB-LDAM; (c) SRAT-LDAM.

As shown in Figure 4.4, the feature space learned by our SRAT-LDAM method is more separable than those of the two baseline methods, which demonstrates that, with our feature separation loss, the adversarially trained model is able to learn much better features, and thus SRAT can achieve better performance.

Impact of weight values. In all SRAT variants, we adopt the Class-balanced method [21] to assign different weights to different classes. To explore how the assigned weights impact the performance of SRAT, we conduct experiments on the CIFAR10 Step-100 dataset to observe the change in the model's performance under different reweighting values. Specifically, we assign well-represented classes the weight 1 and vary the weight for under-represented classes from 10 to 200. The experimental results are shown in Figure 4.5. Here, we use the approximated value 78 to denote the weight calculated by the Class-balanced method when the imbalance ratio equals 100.

Figure 4.5 The impact of weights.

From Figure 4.5, we can observe that, for all SRAT variants, the model's standard accuracy increases as the weights for the under-represented classes increase. However, the robust accuracy of these three methods does not move in step with their standard accuracy. When increasing the weights for the under-represented classes, the robust accuracy of SRAT-LDAM stays almost unchanged, and that of SRAT-CE and SRAT-Focal even decreases slightly. As a trade-off, using a relatively large weight, such as 78 or 100, allows SRAT to obtain satisfactory performance in terms of both standard and robust accuracy.

Impact of hyper-parameter λ. In our SRAT method, the contributions of the feature separation loss and the prediction loss are balanced by a hyper-parameter λ. In this part, we study how this hyper-parameter affects the performance of SRAT. In the experiments, we evaluate the performance of all SRAT variants trained with different values of λ on the CIFAR10 Step-100 dataset.

Figure 4.6 The impact of λ.

As shown in Figure 4.6, the performance of all SRAT variants is not very sensitive to the choice of λ.
However, a large value of λ, such as 8, may hurt the model's performance.

Impact of imbalance ratio K. In the previous experiments, we evaluated the effectiveness of our SRAT method on various imbalanced datasets with imbalance ratio K = 10 or K = 100. To investigate the performance of our SRAT method more comprehensively, in this part, we test it on more imbalanced datasets with diverse imbalance ratios. Specifically, we construct a series of "Step" imbalanced CIFAR10 datasets by varying the imbalance ratio K from 5 to 100. For comparison, we apply both the DRCB-Focal method and our SRAT-Focal variant to train models on those imbalanced datasets and test the trained models' performance on the original uniformly distributed CIFAR10 test dataset. The experimental results are shown in Figure 4.7.

Figure 4.7 The impact of K.

From Figure 4.7, we can observe that, under different imbalanced scenarios, the model trained by our SRAT-Focal method always achieves better performance than the one trained by the DRCB-Focal method. In other words, the effectiveness of our SRAT method is not affected by the imbalance ratio K, which determines the data distribution of the imbalanced training dataset.

4.5.4 Performance under l2 Threat Model

To further evaluate the effectiveness of our SRAT method, we also adversarially train Resnet-18 [42] models on the CIFAR10 Step-100 dataset under l2 attack. We follow the same settings as in [105], where the perturbation budget ε = 128/255 and the step size γ = 15/255. As shown in Table 4.7, SRAT outperforms all baseline methods by a large margin, which verifies the effectiveness of SRAT.

Table 4.7 Performance comparison on the CIFAR10 Step-100 dataset under the $l_2$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Under) | Robust (Overall) | Robust (Under) |
|---|---|---|---|---|
| CE | 65.01 ± 1.84 | 35.79 ± 3.72 | 52.22 ± 1.99 | 20.07 ± 3.48 |
| Focal | 66.15 ± 2.75 | 38.77 ± 5.80 | 55.02 ± 3.13 | 24.67 ± 5.99 |
| LDAM | 57.35 ± 2.47 | 20.25 ± 4.37 | 52.11 ± 2.14 | 15.49 ± 3.44 |
| CB-Reweight | 64.32 ± 0.85 | 40.75 ± 2.47 | 52.72 ± 0.71 | 27.47 ± 1.75 |
| CB-Focal | 65.89 ± 0.82 | 45.65 ± 2.01 | 55.81 ± 1.31 | 33.85 ± 2.64 |
| DRCB-CE | 70.78 ± 1.84 | 48.57 ± 4.01 | 56.00 ± 2.00 | 30.39 ± 4.01 |
| DRCB-Focal | 71.59 ± 1.21 | 50.85 ± 2.60 | 57.89 ± 1.88 | 34.14 ± 3.31 |
| DRCB-LDAM | 71.51 ± 1.32 | 50.99 ± 1.85 | 64.68 ± 1.15 | 40.55 ± 1.75 |
| SRAT-CE | 76.27 ± 1.46 | 60.76 ± 3.04 | 61.83 ± 1.53 | 42.72 ± 2.93 |
| SRAT-Focal | 73.73 ± 0.48 | 54.68 ± 1.06 | 60.12 ± 0.54 | 38.01 ± 1.35 |
| SRAT-LDAM | 73.89 ± 0.78 | 57.09 ± 2.43 | 67.38 ± 0.92 | 47.45 ± 2.75 |

4.6 Related Work

Adversarial Robustness. The vulnerability of DNN models to adversarial examples has been verified by many existing successful attack methods [32, 11]. To improve model robustness against adversarial attacks, various defense methods have been proposed [71, 78, 19]. Among them, adversarial training has been proven to be one of the most effective [4]. Adversarial training can be formulated as solving a min-max optimization problem, where the outer minimization process enforces the model to be robust to adversarial examples generated by the inner maximization process via some existing attack methods like PGD [71]. Based on adversarial training, several variants, such as TRADES [111] and MART [103], have been presented to improve the model's performance further. More details about adversarial robustness can be found in recent surveys [13, 109].
Since almost all studies of adversarial training focus on balanced datasets, it is worthwhile to investigate the performance of adversarial training methods on imbalanced training datasets.

Imbalanced Learning. Most existing works on imbalanced training can be roughly classified into two categories, i.e., re-sampling and reweighting. Re-sampling methods aim to reduce the imbalance level through either over-sampling examples from under-represented classes [7, 9] or under-sampling examples from well-represented classes [46, 25, 40]. Reweighting methods allocate different weights to different classes or even different examples. For example, the Focal loss [63] enlarges the weights of wrongly-classified examples while reducing the weights of well-classified examples in the standard cross entropy loss, and the LDAM loss [10] regularizes the under-represented classes more strongly than the well-represented classes to attain good generalization on under-represented classes. More details about imbalanced learning can be found in recent surveys [41, 47]. The majority of existing methods focus on the natural training scenario, and their trained models collapse when facing adversarial attacks [91, 32]. Hence, in this chapter, we develop a novel method that can defend against adversarial attacks and achieve satisfactory performance under imbalanced scenarios.

4.7 Chapter Conclusion

In this chapter, we first empirically investigate the behavior of adversarial training under imbalanced scenarios and explore potential solutions to assist adversarial training in tackling the imbalance issue. As neither adversarial training itself nor adversarial training with reweighting can work well under imbalanced scenarios, we further theoretically verify that poor data separability is one key reason causing the failure of adversarial training based methods. Based on our findings, we propose the Separable Reweighted Adversarial Training (SRAT) method to facilitate the reweighting strategy in imbalanced adversarial training. We validate the effectiveness of SRAT via extensive experiments. In the future, we plan to examine how other types of defense methods perform under imbalanced scenarios and how other types of imbalanced learning methods behave under adversarial training.

CHAPTER 5
MIX-UP STRATEGY TO ENHANCE ADVERSARIAL TRAINING WITH IMBALANCED DATA

Adversarial training has been proven to be one of the most effective techniques to defend against adversarial examples. The majority of existing adversarial training methods assume that every class in the training data is equally distributed. However, in reality, some classes often have a large number of training examples while others have only a very limited amount. Recent studies have shown that the performance of adversarial training degrades drastically if the training data is imbalanced. In this chapter, we propose a simple yet effective framework to enhance the robustness of DNN models under imbalanced scenarios. Our framework, Imb-Mix, first augments the training dataset by generating multiple adversarial examples for samples in the minority classes. This is done by adding random noise to the original adversarial examples created by one specific adversarial attack method. It then constructs Mixup-mimic mixed examples upon the augmented dataset used by adversarial training. In addition, we theoretically prove the regularization effect of our Mixup-mimic mixed example generation technique in Imb-Mix.
Extensive experiments on various imbalanced datasets verify the effectiveness of the proposed framework.

5.1 Chapter Introduction

Deep neural networks (DNNs) have been successfully applied in a wide range of real-world applications, such as computer vision [42], natural language processing [96] and speech recognition [1]. However, DNNs are highly vulnerable to adversarial examples [32, 11]. By adding an imperceptible amount of noise to benign examples, manually crafted adversarial examples can mislead a well-trained DNN based classifier, causing it to misclassify, with high confidence, benign samples that it previously classified correctly. Due to the large threat of adversarial examples, considerable efforts have been made to improve the robustness of DNNs. Among them, adversarial training [71, 111] has been empirically proven to be one of the most effective and reliable defense methods. Generally, adversarial training can be formulated as a min-max optimization problem, where the inner maximization process generates adversarial examples that can mostly fool the model, and the outer minimization process reduces the model's average classification error on the generated adversarial examples.

Although adversarial training methods have been shown to improve the robustness of DNNs, most of them assume that the number of training examples from each class is balanced. However, this assumption does not hold in many real-world applications, where some classes can have a notably larger presence than other classes [95, 69]. Hence, the training data is typically imbalanced among classes. Very recently, there have been works [106, 102] that examine adversarial training under imbalanced scenarios. They have shown that, in such situations, adversarial training leads to a huge performance discrepancy between classes with more training examples (i.e., majority classes) and classes with fewer training examples (i.e., minority classes), and that it cannot provide satisfactory robustness for those minority classes. Therefore, it is natural to ask: how can we improve adversarial training under imbalanced scenarios? Since imbalanced training data often causes the trained classifier to be overwhelmed by the majority classes and to ignore the minority classes, two common ways to alleviate the negative impacts are re-sampling and re-weighting. Re-sampling attempts to balance the data distribution [15, 26] by upsampling minority class samples or downsampling majority class samples. Re-weighting assigns higher weights in the loss to samples from the minority classes to make the trained model biased toward the minority classes [63, 21, 10]. The majority of existing works only consider the imbalanced problem within the natural training paradigm, where the ultimate goal is to improve the model's standard accuracy under imbalanced scenarios. However, few studies focus on improving the model's robust accuracy under the adversarial training paradigm¹. In addition, as demonstrated in [102], some effective techniques for handling the imbalanced problem in the natural training paradigm are not applicable to the adversarial training paradigm. Hence, it is worthwhile to investigate new approaches to boost model robustness under imbalanced scenarios.

¹ In this chapter, we denote standard accuracy as the model's prediction accuracy on input examples without adversarial perturbations, and robust accuracy as the model's prediction accuracy on perturbed input examples constrained by the l∞-norm 8/255.
Recently, data augmentation techniques have been proven to be an effective way to improve model robustness with respect to noisy inputs like blurred images [43, 80]. This provides a potential way to solve the imbalanced problem within the adversarial training paradigm. Therefore, we propose Imb-Mix, a novel data augmentation based framework to advance model robustness under imbalanced scenarios. Imb-Mix first generates multiple adversarial examples for the minority classes by adding random noise to the original adversarial examples created by the PGD adversarial attack [71]; this balances the imbalanced data distribution between classes. Next, to increase the generalization ability of the trained model, we construct Mixup-mimic mixed examples upon the augmented dataset used by adversarial training. Moreover, to further improve the model performance, we introduce the stochastic model weight averaging (SWA) [45] technique into our proposed framework. SWA has been shown to be effective in improving the performance of DNNs with almost no extra computational overhead. The contributions of this chapter include:

• We introduce a simple yet effective data augmentation based framework into the adversarial training paradigm to benefit adversarial training under imbalanced scenarios.
• We theoretically prove the regularization effect of our Mixup-mimic mixed example generation technique. This provides an understanding of why this process can be effective under imbalanced scenarios.
• We conduct extensive experiments on multiple datasets with various imbalanced scenarios to verify the effectiveness of our proposed framework.

5.2 Related Work

5.2.1 Adversarial Robustness

The existence of successful adversarial attacks [32, 11] reveals the vulnerability of DNN models. As a countermeasure against adversarial attacks, many defense methods [71, 78, 19] have been proposed to improve model robustness. Among them, adversarial training has been proven to be one of the more effective methods [4]. Generally, adversarial training aims to solve a min-max optimization problem, where the inner maximization process utilizes some existing attack methods, such as PGD [71], to generate adversarial examples that can best mislead the current model, and the outer minimization process enforces the model to be robust to the generated adversarial examples. Because of its effectiveness, many variants of adversarial training have been proposed to further improve the robustness of models under various settings [111, 103, 108]. More details about adversarial robustness can be obtained in related surveys [13, 109]. Note that the majority of existing adversarial training based methods only focus on balanced datasets. As such, the performance of these methods decreases drastically when the training dataset is imbalanced [106, 102].

5.2.2 Imbalanced Learning

Due to its commonality in many real-world applications [44], learning from imbalanced datasets has been widely investigated in the past few decades. Most existing works can be roughly classified into two categories, i.e., re-sampling and re-weighting.
The re-sampling methods focus on balancing the data distribution through either upsizing the minority classes [7, 9] or downsizing the majority classes [46, 25, 40]. The re-weighting methods assign different weights to different classes or even different examples. For instance, the Focal loss [63] allocates larger weights to wrongly-classified examples while giving smaller weights to well-classified examples, based on the standard cross entropy loss. The LDAM loss [10] regularizes the minority classes more strongly than the majority classes to achieve good generalization performance on minority classes. More details about imbalanced learning can be obtained in related surveys [41, 47]. Note that most existing imbalanced learning methods focus on the natural training paradigm, and their trained models collapse when facing adversarial attacks [91, 32]. Therefore, in this chapter, we present a novel framework that is able to improve model robustness against adversarial attacks under imbalanced scenarios.

5.2.3 Data Augmentation

Data augmentation methods have been empirically shown to be an effective way of improving the generalization ability of DNN models. For instance, random flipping and cropping are two of the most commonly used techniques in image classification tasks [42]. Some random occlusion techniques such as Cutout [23] can also help models obtain better standard classification accuracy on images. Besides applying operations to every single image, Mixup [112] adopts a pair-wise linear combination of two images to create a mixed image along with a mixed corresponding label. Although simple, experimental results verify that Mixup is able to bring much better generalization performance to DNN models. Variants of Mixup have been proposed for multiple domains, including NLP [90], computer vision [97], and graphs [37, 87]. Furthermore, Mixup strategies have also been proposed to further improve the standard classification accuracy of models under different scenarios. AUGMIX [43] demonstrates that randomly mixing generated augmentations instead of original input images can improve DNN models' robustness against noisy images (e.g., blurred images). Although various kinds of data augmentation methods have been proposed, no existing data augmentation method considers the problem of improving adversarial training on imbalanced data distributions.

5.3 The Proposed Framework

In this section we present our proposed framework. We first introduce the basic idea of adversarial training in Section 5.3.1. In Section 5.3.2 we present our proposed data augmentation method Imb-Mix. In Section 5.3.3 we introduce the stochastic model weight averaging (SWA) technique used in our framework. Lastly, in Section 5.3.4 we detail the full training algorithm of Imb-Mix.

5.3.1 Adversarial Training

In order to improve model robustness against adversarial attacks, previous works propose to include adversarial examples generated by adversarial attack methods in the model training process, which teaches the trained model to recognize adversarial examples correctly [71, 111]. Specifically, given an input example (x, y), PGD adversarial training [71] aims to obtain a robust model f_θ whose (correct) prediction y is the same for the original sample x and the adversarial example x', where x' is generated by applying an adversarial perturbation to x.
The adversarial perturbations are typically bounded by a small value ε under the L_p-norm, i.e., ||x' − x||_p ≤ ε. Formally, PGD adversarial training on a dataset X can be defined as

$$\min_\theta \frac{1}{|X|}\sum_{i=1}^{|X|}\max_{\|x'_i - x_i\|_p \le \epsilon} \mathcal{L}\big(f_\theta(x'_i), y_i\big). \qquad (5.1)$$

Based on PGD adversarial training, many variants have been proposed to further improve model robustness against adversarial attacks from different aspects [111, 103]. Most existing adversarial training based methods assume that the number of training examples from each class is equally distributed. However, as pointed out by a few recent works [106, 102], these methods cannot achieve satisfactory performance when the training data distribution is imbalanced.

5.3.2 Imb-Mix

To facilitate adversarial training under imbalanced scenarios, we propose a simple yet effective data augmentation based framework to balance the training data distribution. Inspired by SMOTE [15], a classical method that generates synthetic training examples for minority classes, we focus on creating more adversarial examples for the minority classes to balance the imbalanced data distribution. This helps the model learn more useful information from the minority classes, thereby improving the trained model's performance on those classes. Specifically, our proposed framework Imb-Mix contains two main procedures: (1) supplementary adversarial example creation and (2) generated adversarial example Mixup. In the rest of this section we detail both procedures.

Figure 5.1 A toy example of creating augmented adversarial examples by our Imb-Mix framework. The blue and red circles represent data examples from one majority class and one minority class, respectively. The solid lines denote the process of producing adversarial examples through the inner maximization process described in Eq. (5.1), and the dashed lines denote the process of generating various adversarial examples for the minority class.

5.3.2.1 Supplementary Adversarial Examples Creation

Given a data example x_i from the original imbalanced training dataset X_org, Imb-Mix first generates its adversarial counterpart x'_i using the inner maximization process described in Eq. (5.1). Then, for any data example belonging to a minority class, Imb-Mix produces multiple adversarial examples based on its original adversarial example x'_i through the following process:

$$\hat{x}'_i = x_i + \alpha \times (x'_i - x_i) \times \mathcal{N}(0, 1), \qquad (5.2)$$

where α is a hyper-parameter that determines the level of perturbation added to the data example x_i. This step is repeated t times to obtain t different adversarial examples. Here the hyper-parameter t can be determined by the user's domain knowledge of the application setting. The main idea behind this design is to obtain a larger number of diverse adversarial examples for minority classes to balance the original imbalanced dataset. If a data example x_i belongs to a majority class, we set x̂'_i = x'_i, as no additional samples are needed. After the supplementary adversarial example creation procedure, we obtain an augmented adversarial example set X_adv.
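A minimal PyTorch sketch of Eq. (5.2) is given below. The default values of α and t are illustrative assumptions, and `x`/`x_adv` stand for a batch of minority-class examples and their PGD counterparts produced by any PGD implementation.

```python
# Sketch of supplementary adversarial example creation (Eq. (5.2)):
# draw t noisy variants x_hat = x + alpha * (x' - x) * N(0, 1).
import torch

def supplementary_adv_examples(x, x_adv, alpha=1.0, t=4):
    """x, x_adv: (N, C, H, W) clean / adversarial minority-class examples."""
    variants = []
    for _ in range(t):
        noise = torch.randn_like(x)                    # elementwise N(0, 1)
        variants.append(x + alpha * (x_adv - x) * noise)
    return torch.stack(variants, dim=0)                # shape (t, N, C, H, W)
```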
Although the aforementioned data augmentation method is able to produce t adversarial examples for every input example in the minority classes, we empirically find that the improvement in the model's robustness on the minority classes is limited. We argue that this is caused by a lack of diversity among the generated adversarial examples for each minority data example: the augmented adversarial examples do not contain sufficient information about the minority classes to enhance the learned model. In addition, in many applications, like fraud detection and medical diagnosis, misclassifying a minority class sample is usually more severe than misclassifying one from a majority class [68]. Therefore, we further use the idea of a re-balanced version of Mixup [112] to generate additional adversarial examples. We later show that this works as a form of regularization, thereby improving the model performance on the minority classes.

5.3.2.2 Generated Adversarial Example Mixup

To adapt Mixup to imbalanced scenarios, Remix [18] relaxes Mixup's formulation and enables the mixing factors of data and labels to be disentangled. In our Imb-Mix framework, we follow a similar idea. Specifically, for any two adversarial examples x̂'_i and x̂'_j sampled from the augmented adversarial example set X_adv, the mixed adversarial example x̂'_mix and its corresponding label y_mix can be obtained by:

$$\hat{x}'_{mix,i} = \lambda_x \hat{x}'_i + (1 - \lambda_x)\hat{x}'_j, \qquad y_{mix,i} = \lambda_y y_i + (1 - \lambda_y) y_j, \qquad (5.3)$$

where λ_x is sampled from a Beta distribution B(γ, γ) (typically we choose γ = 1.0) and λ_y is defined as

$$\lambda_y = \begin{cases} 0, & n_i/n_j \ge \kappa \text{ and } \lambda_x < \tau; \\ 1, & n_i/n_j \le 1/\kappa \text{ and } 1 - \lambda_x < \tau; \\ \lambda_x, & \text{otherwise}. \end{cases} \qquad (5.4)$$

Here n_i and n_j represent the number of adversarial examples of class i and class j, respectively, and κ and τ are two hyper-parameters. We follow the default settings κ = 3 and τ = 0.5 adopted by Remix [18] in our implementation.

By applying the aforementioned mixup procedure to all generated adversarial examples, we obtain a set of mixed data-label pairs (x̂'_mix,i, y_mix,i), denoted as X_mix. Finally, the DNN classifier f_θ is trained by minimizing the model's cross entropy loss on the elements of X_mix, instead of on the original imbalanced dataset X_org. Formally, the final objective function of our Imb-Mix framework can be described as

$$\min_\theta \frac{1}{|X_{mix}|}\sum_{i=1}^{|X_{mix}|} l_\theta\big(f_\theta(\hat{x}'_{mix,i}),\, y_{mix,i}\big). \qquad (5.5)$$

To better illustrate the data augmentation process in our proposed framework, Figure 5.1 provides a toy example of applying Imb-Mix to a binary imbalanced classification problem. As shown in this example, adversarial examples for the data examples x_i, x_j, and x_k are first generated using a PGD attack [71]. Then our Imb-Mix framework produces several different adversarial examples x̂'_j for the minority data example x_j by adding random noise to its original adversarial counterpart x'_j. Finally, Mixup-mimic mixed examples are created based on x̂'_j and the adversarial examples x̂'_i and x̂'_k of the majority class.
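The disentangled mixing of Eqs. (5.3)-(5.4) can be sketched for a single pair as follows, assuming one-hot label vectors; the defaults γ = 1.0, κ = 3, and τ = 0.5 follow the settings stated above.

```python
# Sketch of the Remix-style mixing (Eqs. (5.3)-(5.4)) for one example pair.
import torch

def remix_pair(x_i, x_j, y_i, y_j, n_i, n_j, gamma=1.0, kappa=3.0, tau=0.5):
    """y_i, y_j: one-hot label vectors; n_i, n_j: per-class example counts."""
    lam_x = torch.distributions.Beta(gamma, gamma).sample().item()
    if n_i / n_j >= kappa and lam_x < tau:
        lam_y = 0.0          # label collapses to the minority label y_j
    elif n_i / n_j <= 1.0 / kappa and 1.0 - lam_x < tau:
        lam_y = 1.0          # label collapses to the minority label y_i
    else:
        lam_y = lam_x
    x_mix = lam_x * x_i + (1.0 - lam_x) * x_j
    y_mix = lam_y * y_i + (1.0 - lam_y) * y_j
    return x_mix, y_mix
```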
5.3.3 Stochastic Model Weight Averaging

DNNs are typically trained by minimizing a loss function with stochastic gradient descent (SGD), an iterative method for optimizing model parameters. Recently, some existing works [45, 5] discovered that simply averaging multiple points along the trajectory of SGD can lead to better generalization ability. This kind of averaging strategy is called stochastic model weight averaging (SWA). Formally, it can be defined as

$$\theta_{\mathrm{SWA}} \leftarrow \frac{\theta_{\mathrm{SWA}} \times n_{\mathrm{models}} + \theta}{n_{\mathrm{models}} + 1},$$

where θ denotes the model parameters obtained by SGD and n_models is the number of models used for averaging the parameters. At the beginning of applying SWA, θ_SWA = θ. In addition to its effectiveness, SWA does not add any additional cost to the model training process and can be easily integrated with any other optimization method besides SGD. Therefore, to further improve model robustness under imbalanced scenarios, we adopt SWA in our Imb-Mix framework. We empirically find that SWA makes a visible contribution to the performance of our Imb-Mix framework; more results can be found in Section 5.5.

5.3.4 Algorithm

The overall algorithm of our proposed framework Imb-Mix is shown in Algorithm 5.1. Given an imbalanced training dataset X_org, in each iteration the Imb-Mix framework first obtains the augmented adversarial example set X_adv. Using this, it produces the set of mixed adversarial example-label pairs X_mix based on X_adv. The parameters of the DNN classifier f_θ are then updated by minimizing the model's empirical loss on X_mix. If the training iteration reaches a pre-defined value, SWA is introduced to update the model parameters θ.

Algorithm 5.1 The algorithm of Imb-Mix.
Input: an imbalanced training dataset X_org
Output: a trained DNN classifier f_θ
1:  Initialize the parameters θ of the DNN classifier f_θ.
2:  repeat
3:      Obtain an augmented adversarial example set X_adv based on the PGD attack and Eq. (5.2).
4:      Get a mixed data-label pair set X_mix based on Eqs. (5.3)-(5.4).
5:      Optimize the final objective function Eq. (5.5).
6:      Update the parameters θ ← θ − δ ∗ ∇_θ l_θ.
7:      if swa-epochs then
8:          Apply SWA as described in Section 5.3.3.
9:      end if
10: until model convergence
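The SWA update in Line 8 can be sketched in a few lines of PyTorch; this is a minimal illustration of the running average θ_SWA ← (θ_SWA · n_models + θ)/(n_models + 1), and note that torch.optim.swa_utils.AveragedModel provides equivalent functionality off the shelf.

```python
# Minimal running-average sketch of SWA (parameters only; BatchNorm buffers
# would need a separate update in a full implementation).
import copy

class WeightAverager:
    def __init__(self, model):
        self.avg_model = copy.deepcopy(model)   # theta_SWA = theta at the start
        self.n_models = 0

    def update(self, model):
        self.n_models += 1
        for p_avg, p in zip(self.avg_model.parameters(), model.parameters()):
            # theta_SWA <- (theta_SWA * (n - 1) + theta) / n after the n-th model
            p_avg.data.mul_(self.n_models - 1).add_(p.data).div_(self.n_models)
```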
5.4 Regularization Effect Of Imb-Mix

In this section, we examine the properties of our proposed framework, Imb-Mix. We theoretically prove that the Mixup-mimic mixed example generation technique adopted in Imb-Mix can be formulated as a form of regularization on the minority class examples. To simplify our analysis, we consider a binary imbalanced setting, where only two classes are involved. Furthermore, we assume the dataset is imbalanced such that there is a majority class C_m and a minority class C_n.

Recall that when performing linear interpolation on any two data examples x_i and x_j, Imb-Mix assigns them two different mixing factors, λ_x in the data space and λ_y in the label space. The mixing factor λ_x is sampled from a Beta distribution B(γ, γ). The factor λ_y is determined by the ratio between the example size of the class that x_i belongs to and the example size of the class that x_j belongs to, together with the value of λ_x and the two hyper-parameters κ and τ, as shown in Eq. (5.4). More specifically, the value of λ_y for data examples x_i and x_j is determined as follows: 1) if both x_i and x_j are sampled from the majority class C_m, then λ_y = λ_x; 2) if both x_i and x_j are sampled from the minority class C_n, then λ_y = λ_x; and 3) if x_i and x_j are sampled from different classes, then λ_y can be either 0, 1, or λ_x. In our binary imbalanced case, we further assume the ratio between the example sizes of the two classes C_m and C_n satisfies |C_m|/|C_n| ≥ κ, and we set τ = 0.5 as adopted in Remix [18]. Hence, if x_i is sampled from the majority class C_m and x_j is sampled from the minority class C_n, then λ_y = 0 when λ_x < 0.5; otherwise λ_y = λ_x. Note that we omit the scenario where x_i is sampled from the minority class C_n and x_j is sampled from the majority class C_m, as it is equivalent to the discussed scenario. In conclusion, λ_y can be either λ_x or 0. Next, we analyze the regularization effect of Imb-Mix in these two cases separately.

Case 1: λ_y = 0. In this case, Imb-Mix only conducts linear interpolation of two data examples in the data space and then assigns the label of the minority class C_n to the mixed data example. Hence, the loss function on all mixed data examples, denoted as L_Imb-Mix(θ), can be defined as

$$\mathcal{L}_{Imb\text{-}Mix}(\theta) = \frac{1}{|C_m|\,|C_n|}\sum_{i=1}^{|C_m|}\sum_{j=1}^{|C_n|} \mathbb{E}_{\lambda}\, l_\theta\Big(y_j,\, f_\theta\big(\lambda x_i + (1-\lambda)x_j\big)\Big), \qquad (5.6)$$

where the loss function l_θ represents the binary cross-entropy loss, y_j = C_n, and λ ∼ β_{[0,0.5]}(γ, γ). For simplicity, we use λ to represent λ_x in the following. Following [12], we define λ̄ = E_λ[λ] and introduce a random perturbation δ_i formulated as δ_i = (λ − λ̄)x_i + (1 − λ)x_j − (1 − λ̄)x̄. Then Eq. (5.6) can be rewritten as

$$\mathcal{L}_{Imb\text{-}Mix}(\theta) = \frac{1}{|C_m|}\sum_{i=1}^{|C_m|} \mathbb{E}_{\lambda, j}\, l_\theta\big(y_j,\, f_\theta(\tilde{x}_i + \delta_i)\big), \qquad (5.7)$$

where j ∼ Uniform(X_{C_n}), X_{C_n} is a set containing all minority examples and mixed samples, and x̃_i = x̄ + λ̄(x_i − x̄). For the loss function L_Imb-Mix(θ) described in Eq. (5.7), we have the following theorem.

Theorem 5.4.1. The Imb-Mix loss function L_Imb-Mix(θ) defined in Eq. (5.7) can be rewritten as

$$\mathcal{L}_{Imb\text{-}Mix}(\theta) = \frac{1}{|C_m|}\sum_{i=1}^{|C_m|} l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big) + R_1(\theta) + R_2(\theta), \qquad (5.8)$$

where

$$R_1(\theta) = \frac{1}{2|C_m|}\sum_{i=1}^{|C_m|} \Big\| \big(\nabla^2_{uu} l_\theta(y_j, f(\tilde{x}_i))\big)^{\frac{1}{2}} \big(\nabla f_\theta(\tilde{x}_i) - J^{(i)}\big)^\top \Big\|^2_{\Sigma^{(i)}_{\tilde{x}\tilde{x}}},$$

$$R_2(\theta) = \frac{1}{2|C_m|}\sum_{i=1}^{|C_m|} \Big\langle \Sigma^{(i)}_{\tilde{x}\tilde{x}},\ \nabla_u l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\nabla^2 f_\theta(\tilde{x}_i) \Big\rangle,$$

and, for any i ∈ {1, 2, ..., |C_m|},

$$J^{(i)} = -\big(\nabla^2_{uu} l_\theta(y_j, f_\theta(\tilde{x}_i))\big)^{-1} \nabla_u l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\,\Sigma^{(i)}_{y_j\tilde{x}}\big(\Sigma^{(i)}_{\tilde{x}\tilde{x}}\big)^{-1}.$$

Proof. Inspired by recent work [12], we examine the regularizing effect of the mixing factor λ by approximating the loss function l_θ with a second-order (quadratic) Taylor expansion around each mixed example pair (x̃_i, y_j). Assuming l_θ is twice differentiable and expressing the derivatives of l(y, f(x)) in terms of the derivatives of l(y, u) and f(x), we have

$$\mathbb{E}_{\lambda,j}\, l_\theta\big(y_j, f_\theta(\tilde{x}_i + \delta_i)\big) = l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big) + \frac{1}{2}\Big\langle \mathbb{E}_{\lambda,j}\big[\delta_i\delta_i^\top\big],\ \nabla f_\theta(\tilde{x}_i)^\top \nabla^2_{uu} l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\nabla f(\tilde{x}_i) + \nabla_u l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\nabla^2 f_\theta(\tilde{x}_i) \Big\rangle.$$

Replacing the expectations in the above equation by their values given by Lemma 2 in [12], we obtain

$$\mathbb{E}_{\lambda,j}\, l_\theta\big(y_j, f_\theta(\tilde{x}_i + \delta_i)\big) = l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big) + \frac{1}{2}\Big\langle \Sigma^{(i)}_{\tilde{x}\tilde{x}},\ \nabla f_\theta(\tilde{x}_i)^\top \nabla^2_{uu} l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\nabla f_\theta(\tilde{x}_i) \Big\rangle + \frac{1}{2}\Big\langle \Sigma^{(i)}_{\tilde{x}\tilde{x}},\ \nabla_u l_\theta\big(y_j, f_\theta(\tilde{x}_i)\big)\nabla^2 f_\theta(\tilde{x}_i) \Big\rangle,$$

where, for any i ∈ {1, 2, ..., |C_m|},

$$\Sigma^{(i)}_{\tilde{x}\tilde{x}} = \sigma^2(\tilde{x}_i - \bar{x})(\tilde{x}_i - \bar{x})^\top + \frac{\xi^2\,\Sigma_{\tilde{x}\tilde{x}}}{\bar{\lambda}^2}, \qquad \Sigma^{(i)}_{\tilde{x}y_j} = \frac{\xi^2\,\Sigma_{\tilde{x}y_j}}{\bar{\lambda}^2}.$$

Here λ̄ and σ² are the mean and variance of a β_{[0,0.5]}(γ, γ)-distributed random variable, respectively, and ξ² = σ² + (1 − λ̄)². Summing over i finally yields the decomposition in Eq. (5.8), with R_1(θ), R_2(θ), and J^{(i)} exactly as defined in the statement of the theorem. □
As shown in Eq. (5.8), the loss function of Imb-Mix consists of three parts: l_θ(y_j, f_θ(x̃_i)), which denotes the loss on the mixed data examples with the minority label y_j, and the two additional terms R_1(θ) and R_2(θ), which act as a regularization effect on the mixed data examples. From this we can see the advantages of Imb-Mix. On the one hand, Imb-Mix generates more training examples intentionally assigned to the minority class, which can help learn a better decision boundary between the majority and minority classes and lead to better model generalization. On the other hand, similar to Mixup, Imb-Mix can be rewritten as an empirical risk on perturbed data examples (i.e., (x̃_i, y_j)). This allows us to interpret Imb-Mix as a form of regularization, which helps the model avoid simply memorizing the data examples in the original training dataset and improves the generalization ability of the model.

Case 2: λ_y = λ_x. In this case, Imb-Mix performs exactly the same as Mixup [112]. Hence, based on [12], we have

$$\mathcal{L}_{Imb\text{-}Mix}(\theta) = \mathcal{L}_{Mixup}(\theta) = \frac{1}{W}\sum_{i=1}^{W} l_\theta\big(\tilde{y}_i, f_\theta(\tilde{x}_i)\big) + R_1(\theta) + R_2(\theta) + R_3(\theta) + R_4(\theta). \qquad (5.9)$$

Here we use W to denote the number of example pairs used in Mixup, and ỹ_i = ȳ + λ̄(y_i − ȳ). Similarly, there are four regularization terms, i.e., R_1(θ), R_2(θ), R_3(θ), and R_4(θ), in the loss function of Mixup, which can effectively improve the generalization ability of the trained model. For the details of these four regularization terms, please refer to [12].

5.5 Experiment

In this section, we conduct various experiments to validate the effectiveness of our Imb-Mix framework. We aim to answer the following two questions:

• Can the proposed framework Imb-Mix boost adversarial training under various imbalanced scenarios?
• What is the impact of each component on Imb-Mix?

We begin by introducing the experimental settings, including dataset construction and implementation details. Next, we compare Imb-Mix with several representative methods to answer the first question. Then we analyze the impact of each component on Imb-Mix to answer the second question.

5.5.1 Experimental Settings

Datasets. We create several imbalanced training datasets based on the two benchmark image datasets CIFAR10 and CIFAR100 [56] with diverse imbalanced distributions. Specifically, following existing imbalanced learning works, we consider two different imbalance types: Exponential (Exp) imbalance [21] and Step imbalance [7]. For Exp imbalance, the number of training examples of each class is reduced according to an exponential function n = n_i τ^i, where i is the class index, n_i is the number of training examples of class i in the original training dataset, and τ ∈ (0, 1). We categorize the half of the classes with the largest example sizes in the imbalanced training dataset as majority classes and the remaining half as minority classes. For Step imbalance, we equally split the classes into majority and minority classes, where the number of training examples is equal within each of the majority/minority groups. Moreover, we denote the imbalance ratio K as the ratio between the training example sizes of the most frequent and the least frequent class. We construct different imbalanced datasets, "Step-10", "Step-100", "Exp-10" and "Exp-100", by adopting different imbalance types (Step or Exp) with different imbalance ratios (K = 10 or K = 100) to train models. We evaluate each model's performance on the original uniformly distributed test datasets of CIFAR10 and CIFAR100 correspondingly.
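As an illustration of the Exp imbalance construction described above, the following sketch subsamples per-class indices so that class i keeps roughly n_i τ^i examples, with τ chosen so that the most/least frequent ratio equals K. This is a minimal sketch under assumptions; the exact construction used in the experiments may differ in details such as the random selection.

```python
# Sketch of building an Exp-imbalanced index set for a C-class dataset.
import numpy as np

def exp_imbalanced_indices(labels, num_classes=10, K=100, seed=0):
    labels = np.asarray(labels)
    n_max = np.bincount(labels, minlength=num_classes).max()
    tau = (1.0 / K) ** (1.0 / (num_classes - 1))   # n_max * tau^(C-1) = n_max / K
    rng = np.random.default_rng(seed)
    keep = []
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        n_c = max(int(n_max * tau ** c), 1)        # class c keeps ~n_max * tau^c
        keep.extend(rng.choice(idx, size=min(n_c, len(idx)), replace=False))
    return np.array(keep)
```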
Baseline methods. We implement several representative imbalanced learning methods (or their combinations) into adversarial training as baseline methods. These methods include: (1) the original PGD adversarial training (vanilla adv.); (2) PGD adversarial training with re-sampling (adv. + resample), where the probability of each example being selected in each training batch is proportional to the inverse of the effective number of its class; (3) PGD adversarial training with reweighting (adv. + reweight), where each example is reweighted proportionally by the inverse of the effective number of its class; (4) PGD adversarial training with Mixup (mixup + adv.), where we apply Mixup [112] to pair-wise adversarial examples generated by the PGD attack; and (5) PGD adversarial training with Remix [18] (remix + adv.), where we apply Remix to pair-wise adversarial examples generated by the PGD attack.

Implementation details. All aforementioned methods are implemented using the Pytorch library DeepRobust [62]. For the CIFAR10 and CIFAR100 based datasets, the adversarial examples used in training are calculated by PGD-10, with a perturbation budget ε = 8/255 and step size γ = 2/255. In evaluation, we report robust accuracy under l∞-norm 8/255 attacks generated by PGD-20 on Resnet-18 [42] models. We set the total number of training epochs to 250 and the initial learning rate to 0.1, and decay the learning rate at epochs 160 and 180 with the ratio 0.01.
5.5.2 Performance Comparison

Tables 5.1–5.6 report the performance comparison on multiple imbalanced datasets with various imbalanced scenarios. The highest accuracy achieved among all methods is denoted by bold values.

Table 5.1 Experiment results on the CIFAR10 Exp-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 72.42 ± 0.52 | 64.15 ± 0.55 | 33.18 ± 0.60 | 22.57 ± 0.60 |
| adv. + resample | 68.72 ± 0.97 | 57.57 ± 1.86 | 31.68 ± 0.19 | 20.90 ± 0.77 |
| adv. + reweight | 72.91 ± 0.26 | 65.50 ± 0.24 | 33.22 ± 0.40 | 24.29 ± 0.53 |
| mixup + adv. | 75.26 ± 0.43 | 67.91 ± 0.26 | 37.23 ± 0.15 | 26.23 ± 0.08 |
| remix + adv. | 75.61 ± 0.60 | 69.34 ± 0.91 | 37.39 ± 0.54 | 27.45 ± 0.75 |
| Imb-Mix | 78.17 ± 0.18 | 73.03 ± 0.44 | 39.00 ± 0.27 | 30.52 ± 0.37 |

Table 5.2 Experiment results on the CIFAR10 Exp-100 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 49.30 ± 1.13 | 23.40 ± 1.71 | 24.29 ± 0.38 | 4.67 ± 0.18 |
| adv. + resample | 45.61 ± 0.44 | 18.38 ± 1.86 | 23.12 ± 0.32 | 4.41 ± 0.82 |
| adv. + reweight | 50.89 ± 0.69 | 25.69 ± 1.26 | 24.35 ± 0.47 | 5.73 ± 0.42 |
| mixup + adv. | 50.33 ± 1.22 | 25.67 ± 1.75 | 26.47 ± 0.29 | 5.47 ± 0.55 |
| remix + adv. | 51.35 ± 1.45 | 27.13 ± 2.07 | 26.64 ± 0.27 | 6.18 ± 0.76 |
| Imb-Mix | 55.15 ± 0.42 | 32.93 ± 0.63 | 26.87 ± 0.34 | 7.95 ± 0.55 |

Table 5.3 Experiment results on the CIFAR10 Step-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 66.09 ± 0.25 | 45.07 ± 0.50 | 32.71 ± 0.40 | 12.81 ± 0.29 |
| adv. + resample | 59.90 ± 0.77 | 34.81 ± 0.96 | 31.92 ± 0.17 | 10.29 ± 0.47 |
| adv. + reweight | 67.47 ± 0.14 | 48.71 ± 0.22 | 33.12 ± 0.16 | 14.90 ± 0.06 |
| mixup + adv. | 66.99 ± 1.34 | 46.20 ± 2.28 | 36.08 ± 0.15 | 14.17 ± 1.17 |
| remix + adv. | 67.78 ± 1.07 | 47.91 ± 1.65 | 36.37 ± 0.13 | 14.86 ± 0.62 |
| Imb-Mix | 71.70 ± 0.18 | 55.31 ± 0.30 | 37.89 ± 0.20 | 18.49 ± 0.45 |

Table 5.4 Experiment results on the CIFAR10 Step-100 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 47.59 ± 0.40 | 7.61 ± 1.06 | 28.14 ± 0.16 | 0.82 ± 0.11 |
| adv. + resample | 43.49 ± 0.08 | 2.97 ± 0.50 | 28.75 ± 0.45 | 0.50 ± 0.17 |
| adv. + reweight | 46.98 ± 0.42 | 9.44 ± 0.25 | 28.47 ± 0.12 | 1.38 ± 0.16 |
| mixup + adv. | 44.58 ± 0.74 | 2.27 ± 1.55 | 29.61 ± 0.66 | 0.19 ± 0.17 |
| remix + adv. | 47.08 ± 1.19 | 7.24 ± 2.09 | 29.69 ± 0.61 | 0.81 ± 0.37 |
| Imb-Mix | 48.05 ± 0.20 | 9.81 ± 0.46 | 30.25 ± 0.20 | 1.29 ± 0.28 |

Table 5.5 Experiment results on the CIFAR100 Exp-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 41.36 ± 0.25 | 30.84 ± 0.20 | 15.06 ± 0.15 | 9.95 ± 0.09 |
| adv. + resample | 38.49 ± 0.18 | 25.88 ± 0.30 | 14.81 ± 0.07 | 9.49 ± 0.34 |
| adv. + reweight | 39.85 ± 0.40 | 28.23 ± 0.39 | 14.70 ± 0.04 | 9.59 ± 0.21 |
| mixup + adv. | 46.39 ± 0.29 | 34.21 ± 0.39 | 18.84 ± 0.21 | 12.35 ± 0.21 |
| remix + adv. | 47.01 ± 0.37 | 35.77 ± 0.33 | 18.82 ± 0.25 | 12.96 ± 0.28 |
| Imb-Mix | 51.22 ± 0.49 | 41.32 ± 1.00 | 19.14 ± 0.21 | 13.99 ± 0.41 |

Table 5.6 Experiment results on the CIFAR100 Step-10 dataset under the $l_\infty$ threat model (accuracy in %).

| Method | Standard (Overall) | Standard (Minority) | Robust (Overall) | Robust (Minority) |
|---|---|---|---|---|
| vanilla adv. | 39.68 ± 0.33 | 18.46 ± 0.24 | 15.43 ± 0.07 | 5.13 ± 0.30 |
| adv. + resample | 36.16 ± 0.39 | 12.07 ± 0.44 | 15.51 ± 0.13 | 4.14 ± 0.31 |
| adv. + reweight | 38.33 ± 0.47 | 15.73 ± 0.60 | 15.12 ± 0.13 | 4.78 ± 0.24 |
| mixup + adv. | 42.62 ± 0.28 | 17.28 ± 0.61 | 19.35 ± 0.17 | 5.77 ± 0.28 |
| remix + adv. | 43.90 ± 0.17 | 20.59 ± 0.42 | 19.60 ± 0.10 | 6.51 ± 0.35 |
| Imb-Mix | 46.79 ± 0.67 | 26.33 ± 1.13 | 19.94 ± 0.31 | 7.84 ± 0.52 |

From these tables, we have the following observations. First, compared to the baseline methods, Imb-Mix obtains improved performance in terms of both overall standard accuracy and robust accuracy under almost all imbalanced scenarios. This suggests that Imb-Mix is able to facilitate adversarial training under imbalanced scenarios. Second, Imb-Mix obtains significant improvements on the under-represented classes by a large margin. For instance, on the CIFAR10 Exp-10 dataset, Imb-Mix improves the standard accuracy on minority classes from the 69.34% achieved by the best baseline method to 73.03%, and the robust accuracy from 27.45% to 30.52%. These results demonstrate that Imb-Mix is able to obtain more robustness under imbalanced settings. In addition, we find that the baseline method adv. + resample always achieves the worst performance among all methods. This demonstrates that simply combining adversarial training with re-sampling techniques cannot improve the model's robustness under imbalanced scenarios. In other words, novel data augmentation methods, such as our proposed framework Imb-Mix, are necessary.

Figure 5.2 Performance on the CIFAR10 Exp-100 dataset: (a) standard accuracy; (b) robust accuracy.
Figure 5.3 Performance on the CIFAR10 Step-100 dataset: (a) standard accuracy; (b) robust accuracy.

5.5.3 Ablation Studies

In this subsection, we investigate how each component, including our proposed data augmentation method as well as the SWA technique, contributes to Imb-Mix, and we further explore the performance under various imbalanced scenarios. To achieve this goal, we first implemented a variant of Imb-Mix which only integrates our proposed data augmentation method with adversarial training and adopts stochastic gradient descent (SGD) to optimize the loss function.

Figure 5.4 Performance on the CIFAR10 Exp-10 dataset: (a) standard accuracy; (b) robust accuracy.
Figure 5.5 Performance on the CIFAR10 Step-10 dataset: (a) standard accuracy; (b) robust accuracy.
5.5.2 Performance Comparison

Tables 5.1-5.6 report the performance comparison on multiple imbalanced datasets covering various imbalanced scenarios. For each metric, the highest accuracy achieved among all methods is marked with an asterisk (*).

Table 5.1 Experiment results on the CIFAR10 Exp-10 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 72.42 ± 0.52 | 68.72 ± 0.97    | 72.91 ± 0.26    | 75.26 ± 0.43 | 75.61 ± 0.60 | 78.17 ± 0.18*
Standard Acc. (Minority) | 64.15 ± 0.55 | 57.57 ± 1.86    | 65.50 ± 0.24    | 67.91 ± 0.26 | 69.34 ± 0.91 | 73.03 ± 0.44*
Robust Acc. (Overall)    | 33.18 ± 0.60 | 31.68 ± 0.19    | 33.22 ± 0.40    | 37.23 ± 0.15 | 37.39 ± 0.54 | 39.00 ± 0.27*
Robust Acc. (Minority)   | 22.57 ± 0.60 | 20.90 ± 0.77    | 24.29 ± 0.53    | 26.23 ± 0.08 | 27.45 ± 0.75 | 30.52 ± 0.37*

Table 5.2 Experiment results on the CIFAR10 Exp-100 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 49.30 ± 1.13 | 45.61 ± 0.44    | 50.89 ± 0.69    | 50.33 ± 1.22 | 51.35 ± 1.45 | 55.15 ± 0.42*
Standard Acc. (Minority) | 23.40 ± 1.71 | 18.38 ± 1.86    | 25.69 ± 1.26    | 25.67 ± 1.75 | 27.13 ± 2.07 | 32.93 ± 0.63*
Robust Acc. (Overall)    | 24.29 ± 0.38 | 23.12 ± 0.32    | 24.35 ± 0.47    | 26.47 ± 0.29 | 26.64 ± 0.27 | 26.87 ± 0.34*
Robust Acc. (Minority)   | 4.67 ± 0.18  | 4.41 ± 0.82     | 5.73 ± 0.42     | 5.47 ± 0.55  | 6.18 ± 0.76  | 7.95 ± 0.55*

Table 5.3 Experiment results on the CIFAR10 Step-10 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 66.09 ± 0.25 | 59.90 ± 0.77    | 67.47 ± 0.14    | 66.99 ± 1.34 | 67.78 ± 1.07 | 71.70 ± 0.18*
Standard Acc. (Minority) | 45.07 ± 0.50 | 34.81 ± 0.96    | 48.71 ± 0.22    | 46.20 ± 2.28 | 47.91 ± 1.65 | 55.31 ± 0.30*
Robust Acc. (Overall)    | 32.71 ± 0.40 | 31.92 ± 0.17    | 33.12 ± 0.16    | 36.08 ± 0.15 | 36.37 ± 0.13 | 37.89 ± 0.20*
Robust Acc. (Minority)   | 12.81 ± 0.29 | 10.29 ± 0.47    | 14.90 ± 0.06    | 14.17 ± 1.17 | 14.86 ± 0.62 | 18.49 ± 0.45*

Table 5.4 Experiment results on the CIFAR10 Step-100 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 47.59 ± 0.40 | 43.49 ± 0.08    | 46.98 ± 0.42    | 44.58 ± 0.74 | 47.08 ± 1.19 | 48.05 ± 0.20*
Standard Acc. (Minority) | 7.61 ± 1.06  | 2.97 ± 0.50     | 9.44 ± 0.25     | 2.27 ± 1.55  | 7.24 ± 2.09  | 9.81 ± 0.46*
Robust Acc. (Overall)    | 28.14 ± 0.16 | 28.75 ± 0.45    | 28.47 ± 0.12    | 29.61 ± 0.66 | 29.69 ± 0.61 | 30.25 ± 0.20*
Robust Acc. (Minority)   | 0.82 ± 0.11  | 0.50 ± 0.17     | 1.38 ± 0.16*    | 0.19 ± 0.17  | 0.81 ± 0.37  | 1.29 ± 0.28

Table 5.5 Experiment results on the CIFAR100 Exp-10 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 41.36 ± 0.25 | 38.49 ± 0.18    | 39.85 ± 0.40    | 46.39 ± 0.29 | 47.01 ± 0.37 | 51.22 ± 0.49*
Standard Acc. (Minority) | 30.84 ± 0.20 | 25.88 ± 0.30    | 28.23 ± 0.39    | 34.21 ± 0.39 | 35.77 ± 0.33 | 41.32 ± 1.00*
Robust Acc. (Overall)    | 15.06 ± 0.15 | 14.81 ± 0.07    | 14.70 ± 0.04    | 18.84 ± 0.21 | 18.82 ± 0.25 | 19.14 ± 0.21*
Robust Acc. (Minority)   | 9.95 ± 0.09  | 9.49 ± 0.34     | 9.59 ± 0.21     | 12.35 ± 0.21 | 12.96 ± 0.28 | 13.99 ± 0.41*

Table 5.6 Experiment results on the CIFAR100 Step-10 dataset under 𝑙∞ threat model.

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 39.68 ± 0.33 | 36.16 ± 0.39    | 38.33 ± 0.47    | 42.62 ± 0.28 | 43.90 ± 0.17 | 46.79 ± 0.67*
Standard Acc. (Minority) | 18.46 ± 0.24 | 12.07 ± 0.44    | 15.73 ± 0.60    | 17.28 ± 0.61 | 20.59 ± 0.42 | 26.33 ± 1.13*
Robust Acc. (Overall)    | 15.43 ± 0.07 | 15.51 ± 0.13    | 15.12 ± 0.13    | 19.35 ± 0.17 | 19.60 ± 0.10 | 19.94 ± 0.31*
Robust Acc. (Minority)   | 5.13 ± 0.30  | 4.14 ± 0.31     | 4.78 ± 0.24     | 5.77 ± 0.28  | 6.51 ± 0.35  | 7.84 ± 0.52*

From these tables, we have the following observations. First, compared to the baseline methods, Imb-Mix obtains improved performance in terms of both overall standard accuracy and robust accuracy under almost all imbalanced scenarios. This suggests that Imb-Mix is able to facilitate adversarial training under imbalanced scenarios. Second, Imb-Mix improves performance on the under-represented classes by a large margin. For instance, on the CIFAR10 Exp-10 dataset, Imb-Mix improves the standard accuracy on minority classes from 69.34%, achieved by the best baseline method, to 73.03%, and the robust accuracy from 27.45% to 30.52%. These results demonstrate that Imb-Mix is able to obtain more robustness under imbalanced settings. In addition, we find that the baseline method adv. + resample almost always achieves the worst performance among all methods. This demonstrates that simply combining adversarial training with re-sampling techniques cannot improve models’ robustness under imbalanced scenarios; in other words, novel data augmentation methods, such as our proposed framework Imb-Mix, are necessary.

5.5.3 Ablation Studies

In this subsection, we investigate how each component contributes to Imb-Mix, including our proposed data augmentation method as well as the SWA technique, and we further explore the performance under various imbalanced scenarios. To achieve this goal, we first implemented a variant of Imb-Mix, adv. + data_aug, which only integrates our proposed data augmentation method with adversarial training and adopts stochastic gradient descent (SGD) to optimize the loss function. We then compared the performance of this variant with vanilla adversarial training and our full Imb-Mix framework on our constructed imbalanced datasets. Figures 5.2-5.7 show both the standard accuracy and the robust accuracy achieved by the aforementioned three methods on the different imbalanced datasets.

Figure 5.2 Performance on the CIFAR10 Exp-100 dataset: (a) Standard Accuracy; (b) Robust Accuracy.
Figure 5.3 Performance on the CIFAR10 Step-100 dataset: (a) Standard Accuracy; (b) Robust Accuracy.
Figure 5.4 Performance on the CIFAR10 Exp-10 dataset: (a) Standard Accuracy; (b) Robust Accuracy.
Figure 5.5 Performance on the CIFAR10 Step-10 dataset: (a) Standard Accuracy; (b) Robust Accuracy.
Figure 5.6 Performance on the CIFAR100 Exp-10 dataset: (a) Standard Accuracy; (b) Robust Accuracy.
Figure 5.7 Performance on the CIFAR100 Step-10 dataset: (a) Standard Accuracy; (b) Robust Accuracy.

From these figures, we can make the following observations. First, our proposed data augmentation method indeed benefits adversarial training under imbalanced scenarios: compared to vanilla adversarial training, adversarial training combined with our data augmentation method achieves significant improvements on both clean examples and adversarial examples. For example, on the CIFAR100 Step-10 dataset, adversarial training combined with our data augmentation method obtains a standard accuracy of 50% and a robust accuracy of 20%, whereas vanilla adversarial training only achieves a standard accuracy of 40% and a robust accuracy of 15%. Second, our Imb-Mix framework achieves the best performance among the three methods. This verifies the effectiveness of the SWA technique adopted by our framework, as the only difference between Imb-Mix and adv. + data_aug is that the former applies SWA while the latter uses plain SGD during model training (see the sketch below). To sum up, the experimental results reported in Figures 5.2-5.7 demonstrate the contribution of each component to our framework.
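Since the only difference between the two variants is the SWA step, the contrast can be made concrete with PyTorch’s built-in SWA utilities, as sketched below. The SWA start epoch and SWA learning rate are illustrative assumptions, and the plain cross-entropy step stands in for the adversarial training step used in practice.

```python
import torch
import torch.nn.functional as F
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet18

model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
swa_model = AveragedModel(model)               # running average of the weights
swa_scheduler = SWALR(optimizer, swa_lr=0.01)  # assumed SWA learning rate
swa_start, total_epochs = 150, 250             # assumed epoch to begin averaging

# Stand-in data; in practice this is the augmented imbalanced training set.
loader = DataLoader(TensorDataset(torch.rand(64, 3, 32, 32),
                                  torch.randint(0, 10, (64,))), batch_size=32)

for epoch in range(total_epochs):
    for x, y in loader:
        optimizer.zero_grad()
        # In our setting this is the loss on the mixed adversarial examples.
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    if epoch >= swa_start:
        swa_model.update_parameters(model)     # accumulate the average
        swa_scheduler.step()

update_bn(loader, swa_model)  # recompute BN statistics for the averaged weights
# Evaluate swa_model (Imb-Mix) instead of model (adv. + data_aug).
```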
5.5.4 Robustness against 𝑙2 Attack

To further evaluate the effectiveness of our Imb-Mix framework, we also adversarially train ResNet-18 [42] models on the CIFAR100 Exp-10 dataset under the 𝑙2 threat model. We follow the same settings as in [105], with a perturbation budget of 𝜖 = 128/255 and a step size of 𝛾 = 15/255. As shown in Table 5.7, Imb-Mix outperforms all baseline methods by a large margin, which further verifies its effectiveness.

Table 5.7 Experiment results on the CIFAR100 Exp-10 dataset under 𝑙2 threat model (the best result in each row is marked with *).

Metric                   | vanilla adv. | adv. + resample | adv. + reweight | mixup + adv. | remix + adv. | Imb-Mix
Standard Acc. (Overall)  | 58.17 ± 0.07 | 55.77 ± 0.22    | 57.70 ± 0.22    | 62.60 ± 0.06 | 63.22 ± 0.44 | 63.97 ± 0.54*
Standard Acc. (Minority) | 47.81 ± 0.29 | 44.25 ± 0.73    | 47.43 ± 0.46    | 51.45 ± 0.16 | 53.49 ± 0.66 | 54.05 ± 0.97*
Robust Acc. (Overall)    | 47.22 ± 0.31 | 45.84 ± 0.36    | 47.13 ± 0.33    | 52.47 ± 0.47 | 52.67 ± 0.49 | 53.51 ± 0.16*
Robust Acc. (Minority)   | 37.81 ± 0.76 | 35.18 ± 0.71    | 37.87 ± 0.75    | 41.35 ± 0.37 | 42.97 ± 0.79 | 43.38 ± 0.47*

5.6 Chapter Conclusion

In this chapter, we propose a novel data augmentation based framework, Imb-Mix, to facilitate the adversarial training method under imbalanced scenarios. Imb-Mix first generates adversarial examples for the minority classes to balance the dataset. It then constructs Mixup-mimic mixed examples as inputs during the model training process. In addition, stochastic model weight averaging is also included in our framework and helps achieve better performance. We validate the effectiveness of Imb-Mix via comprehensive experiments. In the future, we plan to investigate more advanced data augmentation methods to further improve model robustness under imbalanced scenarios.

CHAPTER 6

CONCLUSIONS

In this chapter, I summarize the research efforts described in this dissertation and discuss promising research directions.

6.1 Dissertation Summary

In this dissertation, I introduced my studies on learning from imbalanced data distribution under various kinds of settings. Specifically, I presented several effective solutions for (1) generating high-quality synthetic data to balance data distribution, (2) learning from imbalanced crowdsourced labeled data, and (3) improving model robustness given imbalanced training data.

To generate more realistic and discriminative data samples for minority classes, in Chapter 2, I first pointed out the importance of both local and global data distribution information in generating high-quality synthetic minority samples to tackle the class imbalance problem. Based on that, I proposed GL-GAN [100], a novel data generation framework utilizing both global and local information of the given imbalanced data in the synthetic minority sample generation process. Compared with related works that consider only local data distribution information when generating synthetic minority samples, GL-GAN is able to produce more realistic and discriminative synthetic minority samples by taking global data distribution information into consideration, as shown in the experimental results.

To learn useful information from imbalanced crowdsourced labeled data, in Chapter 3, I proposed a deep neural network based classifier, ICED [99]. During training, a true label inference module equipped in ICED estimates determinate true labels from the given crowdsourced labeled data, while a synthetic data generation module generates synthetic data samples for the minority class using the estimated determinate true labels. These two modules augment each other and improve themselves iteratively. With their help, ICED is able to infer true labels from imbalanced crowdsourced labeled data and achieve high accuracy on the classification task simultaneously. I conducted a series of experiments to verify the effectiveness of ICED.

To improve model robustness under imbalanced scenarios, I explored several solutions from different perspectives. In Chapter 4, I demonstrated that adversarial training alone cannot effectively improve the robustness of models under imbalanced scenarios, because adversarially trained models suffer much worse performance on minority classes, as observed in empirical studies; simply combining adversarial training with reweighting strategies also cannot work well, due to the poor data separability brought by adversarial training, as proven by theoretical analysis. Based on these findings, I proposed a novel method, SRAT [102], to boost the reweighting strategy in adversarial training under imbalanced scenarios, and I validated its effectiveness in various kinds of experiments. In Chapter 5, I focused on boosting adversarial training under imbalanced scenarios by augmenting the imbalanced training data. The proposed framework, Imb-Mix [98], generates multiple adversarial examples for minority classes by first adding random noise to the original adversarial examples created by one specific adversarial attack method and then constructing Mixup-mimic mixed examples upon the augmented dataset used by adversarial training. I also theoretically proved the regularization effect of the Mixup-mimic mixed example generation technique adopted in Imb-Mix. Experimental results demonstrated that data augmentation can also be an effective way to benefit adversarial training under imbalanced scenarios.
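As a rough illustration of this augmentation pipeline, the sketch below expands each minority-class adversarial example into several noisy copies and then forms Mixup-style convex combinations over the enlarged batch; the noise scale, number of copies, and Beta-distributed mixing coefficient are assumptions for illustration rather than the exact settings of Imb-Mix [98].

```python
import torch

def imbmix_augment(x_adv, y, n_copies=3, sigma=0.01, alpha=1.0):
    """Sketch: balance the batch by replicating minority adversarial
    examples with small random noise, then build Mixup-mimic mixed
    examples over the enlarged batch [112]."""
    noisy = [torch.clamp(x_adv + sigma * torch.randn_like(x_adv), 0.0, 1.0)
             for _ in range(n_copies)]
    x_aug = torch.cat([x_adv] + noisy)
    y_aug = y.repeat(n_copies + 1)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x_aug.size(0))
    x_mix = lam * x_aug + (1.0 - lam) * x_aug[perm]
    return x_mix, y_aug, y_aug[perm], lam

# Training then minimizes lam * CE(model(x_mix), y_a)
#                      + (1 - lam) * CE(model(x_mix), y_b).
```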
6.2 Future Work

In addition to the achievements obtained by my studies, I also plan to explore the following research directions in the future:

• Multi-label Imbalanced Classification. The multi-label classification task is omnipresent in many real-world applications, such as annotating the categories of a given movie and creating a profile for a customer. Different from the common multi-class classification task I investigated in this dissertation, in the multi-label classification task each data sample is typically associated with a series of labels instead of one, and there is no constraint on how many labels one data sample can be assigned. Hence, imbalanced data distribution is almost unavoidable in the multi-label classification task, as it is very hard to guarantee that each label occurs the same number of times. I plan to explore how to learn from imbalanced multi-label data more effectively to obtain satisfactory performance on the multi-label classification task.

• Learning from Imbalanced Text Data. Most existing approaches for handling imbalanced data distribution mainly focus on continuous data such as images, while research on addressing this problem for non-continuous data, such as text, is rather limited. Considering that text data is everywhere in human society, this direction deserves more attention. Hence, as one future work, I plan to investigate the negative impacts brought by imbalanced data distribution in various text related applications, such as text classification and sentiment analysis, and explore effective solutions to mitigate these negative impacts.

BIBLIOGRAPHY

[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.

[2] Lida Abdi and Sattar Hashemi. To combat multi-class imbalanced problems by means of over-sampling techniques. IEEE Transactions on Knowledge and Data Engineering, (1):238–251, 2016.

[3] Shadi Albarqouni, Christoph Baur, Felix Achilles, Vasileios Belagiannis, Stefanie Demirci, and Nassir Navab. Aggnet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, 35(5):1313–1321, 2016.

[4] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018.

[5] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. arXiv preprint arXiv:1806.05594, 2018.

[6] Jesús Bobadilla, Fernando Ortega, Antonio Hernando, and Abraham Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.

[7] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 106:249–259, 2018.

[8] Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Hazar Harmouch, and Felix Naumann. The effects of data quality on machine learning performance. arXiv preprint arXiv:2207.14529, 2022.

[9] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning?
In International Conference on Machine Learning, pages 872–881. PMLR, 2019.

[10] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.

[11] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

[12] Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization. arXiv preprint arXiv:2006.06049, 2020.

[13] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018.

[14] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[15] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[16] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

[17] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 2722–2730, 2015.

[18] Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, and Da-Cheng Juan. Remix: Rebalanced mixup. In European Conference on Computer Vision, pages 95–110. Springer, 2020.

[19] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.

[20] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.

[21] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9268–9277, 2019.

[22] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.

[23] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

[24] Georgios Douzas and Fernando Bacao. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Systems with Applications, 91:464–471, 2018.

[25] Chris Drummond, Robert C Holte, et al. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, volume 11, pages 1–8. Citeseer, 2003.

[26] Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1):18–36, 2004.

[27] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (VOC) challenge. International Journal of
Computer Vision, 88(2):303–338, 2010.

[28] Ju Fan, Guoliang Li, Beng Chin Ooi, Kian-Lee Tan, and Jianhua Feng. iCrowd: An adaptive crowdsourcing framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1015–1030, 2015.

[29] Zhe Gan, Liqun Chen, Weiyao Wang, Yuchen Pu, Yizhe Zhang, Hao Liu, Chunyuan Li, and Lawrence Carin. Triangle generative adversarial networks. In Advances in Neural Information Processing Systems, pages 5247–5256, 2017.

[30] Vaishali Ganganwar. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2(4):42–47, 2012.

[31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[32] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[33] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649. IEEE, 2013.

[34] Melody Guan, Varun Gulshan, Andrew Dai, and Geoffrey Hinton. Who said what: Modeling individual labelers improves classification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3109–3118, 2018.

[35] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.

[36] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 5138–5147, 2019.

[37] Xiaotian Han, Zhimeng Jiang, Ninghao Liu, and Xia Hu. G-mixup: Graph data augmentation for graph classification. In International Conference on Machine Learning, pages 8230–8248. PMLR, 2022.

[38] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.

[39] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IJCNN), pages 1322–1328. IEEE, 2008.

[40] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

[41] Haibo He and Yunqian Ma. Imbalanced Learning: Foundations, Algorithms, and Applications. 2013.

[42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[43] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019.

[44] Yueh-Min Huang, Chun-Min Hung, and Hewijin Christine Jiau. Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications, 7(4):720–747, 2006.
[45] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[46] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.

[47] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019.

[48] Hiroshi Kajino, Yuta Tsuboi, and Hisashi Kashima. Clustering crowds. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 1120–1127, 2013.

[49] Taskin Kavzoglu. Increasing the accuracy of neural network classification using refined training data. Environmental Modelling & Software, 24(7):850–858, 2009.

[50] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.

[51] Ashish Khetan, Zachary C Lipton, and Animashree Anandkumar. Learning from noisy singly-labeled data. In International Conference on Learning Representations, 2018.

[52] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.

[53] Yassin Kortli, Maher Jridi, Ayman Al Falou, and Mohamed Atri. Face recognition systems: A survey. Sensors, 20(2):342, 2020.

[54] György Kovács. smote-variants: A python implementation of 85 minority oversampling techniques. Neurocomputing, 2019.

[55] Bartosz Krawczyk. Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, 5(4):221–232, 2016.

[56] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[57] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[58] Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[59] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. Cleannet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018.

[60] Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.

[61] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Learning to learn from noisy labeled data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5051–5059, 2019.

[62] Yaxin Li, Wei Jin, Han Xu, and Jiliang Tang. Deeprobust: A PyTorch library for adversarial attacks and defenses. arXiv preprint arXiv:2005.06149, 2020.

[63] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

[64] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[65] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.

[66] Alexander Liu, Joydeep Ghosh, and Cheryl E Martin. Generative oversampling for mining imbalanced datasets. In DMIN, pages 66–72, 2007.

[67] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2008.

[68] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2009.

[69] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.

[70] Rushi Longadge and Snehalata Dongre. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707, 2013.

[71] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[72] Giovanni Mariani, Florian Scheidegger, Roxana Istrate, Costas Bekas, and Cristiano Malossi. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655, 2018.

[73] Brian W Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.

[74] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[75] Iqbal Muhammad and Zhu Yan. Supervised machine learning approaches: A survey. ICTACT Journal on Soft Computing, 5(3), 2015.

[76] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.

[77] William S Noble. What is a support vector machine? Nature Biotechnology, 24(12):1565–1567, 2006.

[78] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

[79] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(4), 2010.

[80] Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann. Data augmentation can improve robustness. Advances in Neural Information Processing Systems, 34:29935–29948, 2021.

[81] William Rivera. Noise reduction a priori synthetic over-sampling for class imbalanced data sets. Information Sciences, 408:146–161, 2017.

[82] Filipe Rodrigues and Francisco Pereira. Deep learning from crowds. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1611–1618, 2018.

[83] Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Gaussian process classification and active learning with multiple annotators. In International Conference on Machine Learning, pages 433–441, 2014.

[84] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[85] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv preprint arXiv:1804.11285, 2018.

[86] Shiven Sharma, Colin Bellinger, Bartosz Krawczyk, Osmar Zaiane, and Nathalie Japkowicz. Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance. In 2018 IEEE International Conference on Data Mining, pages 447–456. IEEE, 2018.

[87] Harry Shomer, Wei Jin, Wentao Wang, and Jiliang Tang. Toward degree bias in embedding-based knowledge graph completion. In Proceedings of the ACM Web Conference 2023, pages 705–715, 2023.

[88] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36, 2017.

[89] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[90] Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, S Yu Philip, and Lifang He. Mixup-transformer: Dynamic data augmentation for NLP tasks. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3436–3440, 2020.

[91] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[92] Ryutaro Tanno, Ardavan Saeedi, Swami Sankaranarayanan, Daniel C Alexander, and Nathan Silberman. Learning from noisy labels by regularized estimation of annotator confusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11244–11253, 2019.

[93] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Discriminative cue integration for medical image annotation. Pattern Recognition Letters, 29(15):1996–2002, 2008.

[94] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

[95] Grant Van Horn and Pietro Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.

[96] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[97] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In International Conference on Machine Learning, pages 6438–6447. PMLR, 2019.

[98] Wentao Wang, Harry Shomer, Yuxuan Wan, Yaxin Li, Jiangtao Huang, and Hui Liu. A mix-up strategy to enhance adversarial training with imbalanced data. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2637–2645, 2023.

[99] Wentao Wang, Joseph Thekinen, Xiaorui Liu, Zitao Liu, and Jiliang Tang. Learning from imbalanced crowdsourced labeled data. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 594–602. SIAM, 2022.

[100] Wentao Wang, Suhang Wang, Wenqi Fan, Zitao Liu, and Jiliang Tang. Global-and-local aware data generation for the class imbalance problem. In Proceedings of the 2020 SIAM International Conference on Data Mining, pages 307–315. SIAM, 2020.

[101] Wentao Wang, Guowei Xu, Wenbiao Ding, Yan Huang, Guoliang Li, Jiliang Tang, and Zitao Liu.
Representation learning from limited educational data with crowdsourced labels. IEEE Transactions on Knowledge and Data Engineering, 2020.

[102] Wentao Wang, Han Xu, Xiaorui Liu, Yaxin Li, Bhavani Thuraisingham, and Jiliang Tang. Imbalanced adversarial training with reweighting. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1209–1214. IEEE, 2022.

[103] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019.

[104] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.

[105] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33, 2020.

[106] Tong Wu, Ziwei Liu, Qingqiu Huang, Yu Wang, and Dahua Lin. Adversarial robustness under long-tailed distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8659–8668, 2021.

[107] Da Xu, Yuting Ye, and Chuanwei Ruan. Understanding the role of importance weighting for deep learning. arXiv preprint arXiv:2103.15209, 2021.

[108] Han Xu, Xiaorui Liu, Yaxin Li, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. arXiv preprint arXiv:2010.06121, 2020.

[109] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Ji-Liang Tang, and Anil K Jain. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020.

[110] Show-Jane Yen and Yue-Shi Lee. Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Intelligent Control and Automation, pages 731–740. Springer, 2006.

[111] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pages 7472–7482. PMLR, 2019.

[112] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[113] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541–552, 2017.