Learning from Imbalanced Data Distribution

As a prominent component of artificial intelligence (AI), machine learning (ML) techniques play a significant role in the remarkable achievements of AI technologies in human society. ML techniques enable computers to leverage collected data to tackle a wide variety of practical tasks. However, a growing number of studies reveal that the capability of an ML model decreases dramatically if the distribution of the data used to train it is imbalanced. Because imbalanced data distributions are widespread in real-world applications, improving the performance of ML models under imbalanced data distributions has attracted considerable attention. While a growing number of works have been proposed to help ML models learn from imbalanced data more effectively, research on this topic is far from complete. In this dissertation, I present several studies that fill gaps in this direction. First, most existing data-generation-based works consider only the local distribution information within classes and ignore the global distribution entirely. I demonstrate that both global and local distribution information are important for producing high-quality synthetic samples that balance the data distribution. Second, almost all existing studies assume that the collected samples carry noise-free labels and hence do not work well when the annotated labels are noisy. I investigate the problem of learning from imbalanced crowdsourced labeled data and propose a novel framework that achieves satisfactory performance. Third, research on how imbalanced data distribution affects the robustness of ML models is currently rather limited. To this end, I empirically verify that adversarial training (AT) alone cannot provide sufficient robustness for ML models under imbalanced scenarios, whereas integrating a reweighting strategy with AT can be very helpful. In addition, I propose an effective data-augmentation-based framework to benefit AT under imbalanced scenarios.
