DETECTING AND MITIGATING BIAS IN NATURAL LANGUAGES By Haochen Liu A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy August 22, 2022 ABSTRACT DETECTING AND MITIGATING BIAS IN NATURAL LANGUAGES By Haochen Liu Natural language processing (NLP) is an increasingly prominent subfield of artificial intelligence (AI). NLP techniques enable intelligent machines to understand and analyze natural languages and make it possible for humans and machines to communicate through natural languages. However, more and more evidence indicates that NLP applications show human-like discriminatory bias or make unfair decisions. As NLP algorithms play an increasingly irreplaceable role in promoting the automation of people’s lives, bias in NLP is closely related to users’ vital interests and demands considerable attention. While there are a growing number of studies related to bias in natural languages, the research on this topic is far from complete. In this thesis, we propose several studies to fill up the gaps in the area of bias in NLP in terms of three perspectives. First, existing studies are mainly confined to traditional and relatively mature NLP tasks, but for certain newly emerging tasks such as dialogue generation, the research on how to define, detect, and mitigate the bias in them is still absent. We conduct pioneering studies on bias in dialogue models to answer these questions. Second, previous studies basically focus on explicit bias in NLP algorithms but overlook implicit bias. We investigate the implicit bias in text classification tasks in our studies, where we propose novel methods to detect, explain, and mitigate the implicit bias. Third, existing research on bias in NLP focuses more on in-processing and post-processing bias mitigation strategies, but rarely considers how to avoid bias being produced in the generation process of the training data, especially in the data annotation phase. To this end, we investigate annotator bias in crowdsourced data for NLP tasks and its group effect. We verify the existence of annotator group bias, develop a novel probabilistic graphical framework to capture it, and propose an algorithm to eliminate its negative impact on NLP model learning. To my parents and entire family for their love and support. iii ACKNOWLEDGEMENTS I joined Michigan State University as a fresh Ph.D. student in the Spring 2018 semester. In my four-and-a-half-year Ph.D. journey, I received invaluable help, support, and guidance from many great people. First and foremost, I would like to express my sincere gratitude to my advisor Dr. Jiliang Tang, for his guidance, encouragement, inspiration, and support during my Ph.D. life. I have learned so many academic skills from him ranging from proposing a significant research problem, polishing a novel idea, writing a research paper, presenting a research project, and mentoring junior students. In addition to the help in the academic field, he also acts as a role model for life. He has taught me the value of kindness, ambition, responsibility, and optimism, from which I will benefit a lot in my future life. With his help, I have achieved what I had never imagined. I feel honored to have been his student. I would extend my gratitude to my other Ph.D. committee members: Dr. Hui Liu, Dr. Pang-Ning Tan, and Dr. Sinem Mollaoglu, for their insightful comments and helpful suggestions. 
In addition, I would like to thank all of my fantastic lab mates in the Data Science and Engi- neering (DSE) Lab. During my Ph.D. study, I have had the pleasure and fortune of having so many supportive and encouraging friends and colleagues: Tyler Derr, Zhiwei Wang, Yao Ma, Xiangyu Zhao, Hamid Karimi, Wenqi Fan, Xiaorui Liu, Han Xu, Xiaoyang Wang, Jamell Dacon, Wentao Wang, Wei Jin, Yaxin Li, Yiqi Wang, Juanhui Li, Harry Shomer, Jie Ren, Jiayuan Ding, Haoyu Han, Hongzhi Wen, Yuxuan Wan, Pengfei He, and Hua Liu. I am also thankful for the collaboration from outside the DSE lab: Dr. Amin Javari and Dr. Xiquan Cui at the Home Depot; Dr. Da Tang, Dr. Ji Yang, and Youlong Cheng at ByteDance; Dr. Zitao Liu at TAL education group; Dr. Hongshen Chen at JD.com; and Dr. Dawei Yin at Baidu Inc. Finally, I would again like to express my gratitude to my dear and kind mom, Qinwen Ma, and my wonderful father, Xudong Liu, as well as my entire family, for their unconditional love and support in my whole life. I am eternally grateful for them. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 CHAPTER 2 BIAS DETECTION IN DIALOGUE GENERATION . . . . . . . . . . . . 5 2.1 Chapter Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Fairness Analysis in Dialogue Systems . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.1 Fairness in Dialogue systems . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.2 Hypothesis Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.3 Parallel Context Data Construction . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3.1 Gender Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.3.2 Race Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.4 Fairness Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.4.1 Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.4.2 Politeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4.3 Sentiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4.4 Attribute Words . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Experiment on Fairness Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1 Dialogue Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3.1.1 The Seq2Seq Generative Model . . . . . . . . . . . . . . . . . . 16 2.3.1.2 The Transformer Retrieval Model . . . . . . . . . . . . . . . . . 16 2.3.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4 Debiasing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.1 Counterpart Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.2 Word Embedding Regularization . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.3 Experiments and results . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . 21 2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 3 BIAS MITIGATION IN DIALOGUE GENERATION . . . . . . . . . . . . 24 3.1 Chapter Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 The Proposed Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 The Disentanglement Model . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.2.1 Unbiased Gendered Utterance Corpus . . . . . . . . . . . . . . . 28 3.2.2.2 Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 v 3.2.2.3 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.3 Bias-free Dialogue Generation . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.3.1 Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.3.2 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Experiment for Disentanglement Model . . . . . . . . . . . . . . . . . . . 35 3.3.2.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3.3 Experiment for Bias-free Dialogue Generation . . . . . . . . . . . . . . . . 36 3.3.3.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.3.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . 37 3.3.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 CHAPTER 4 UNDERSTANDING AND MITIGATING IMPLICIT BIAS IN DEEP TEXT CLASSIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Chapter Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 Preliminary Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2.1 Data and Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2.2 Empirical study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.3 Understanding Implicit Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3.1 An Interpretation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.2 Saliency Correlation Measurement . . . . . . . . . . . . . . . . . . . . . . 51 4.3.3 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 The Bias Mitigation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.1 Debiased Text Classification Model . . . . . . . . . . . . . . . . . . . . . 54 4.4.2 An Optimization Method for Debiased-TC . . . . . . . . . . . . . . . . . . 55 4.5 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.5.1 Base Deep Text Classification Models . . . . . . . . . . . . . . . . . . . . 57 4.5.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.5.3 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
58 4.5.4 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 CHAPTER 5 UNDERSTANDING AND HANDLING ANNOTATOR GROUP BIAS IN CROWDSOURCING . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.1 Chapter Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Understanding Annotator Group Bias . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.2.1 Data and Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2.2 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.3 Modeling Annotator Group Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.3.1 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 vi 5.3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.3.3 GroupAnno: The Probabilistic Graphical Model . . . . . . . . . . . . . . . 71 5.3.4 The extended EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4.1 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4.4 Results on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4.5 Results on Wikipedia Detox Dataset . . . . . . . . . . . . . . . . . . . . . 78 5.4.6 Results on Information Detection Dataset . . . . . . . . . . . . . . . . . . 79 5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 CHAPTER 6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.1 Dissertation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 vii LIST OF TABLES Table 2.1: Examples of gender and racial biases in dialogue systems. . . . . . . . . . . . . 7 Table 2.2: Examples of gender and race word pairs. . . . . . . . . . . . . . . . . . . . . . 9 Table 2.3: Examples of attribute words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Table 2.4: Fairness test of the Seq2Seq generative model in terms of Gender. . . . . . . . . 16 Table 2.5: Fairness test of the Transformer retrieval model in terms of Gender. . . . . . . . 17 Table 2.6: Fairness test of the Seq2Seq generative model in terms of Race. . . . . . . . . . 17 Table 2.7: Fairness test of the Transformer retrieval model in terms of Race. . . . . . . . . 17 Table 2.8: Fairness test of the debiased Seq2Seq generative model. Green value indicates that the absolute value of difference drops compared with the original model, while red value indicates it rises. . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Table 3.1: An Example of gender bias in dialogue systems. . . . . . . . . . . . . . . . . . 26 Table 3.2: Results of gender classification based on disentangled features. . . . . . . . . . 35 Table 3.3: Fairness evaluation on Twitter. Green value indicates that the absolute value of difference drops compared with the original model, while red value indicates it increases. . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 Table 3.4: Fairness evaluation on Reddit. Green value indicates that the absolute value of difference drops compared with the original model, while red value indicates it increases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Table 3.5: Quality evaluation. All the numbers shown in the table are percentages. . . . . . 40 Table 3.6: Case Study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Table 4.1: An illustrative example on the implicit bias of a CNN text classification model. . 44 Table 4.2: Statistics of the datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 Table 4.3: Preliminary study. FP, FN, and DP indicates false positive rate, false negative rate, and demographic parity measurement, respectively. I and II stands for group I and group II, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 48 viii Table 4.4: Fairness performance comparison on CNN text classifiers. Note that Data Aug is a special baseline for reference. . . . . . . . . . . . . . . . . . . . . . . . . . 59 Table 4.5: Fairness performance comparison on RNN text classifiers. Note that Data Aug is a special baseline for reference. . . . . . . . . . . . . . . . . . . . . . . . . . 60 Table 4.6: Text classification performance comparison (%) on DIAL dataset. Note that Data Aug is a special baseline for reference. . . . . . . . . . . . . . . . . . . . . 61 Table 4.7: Text classification performance comparison (%) on PAN16 and MTC datasets. Note that Data Aug is a special baseline for reference. . . . . . . . . . . . . . . 61 Table 5.1: Statistics of the datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Table 5.2: The positive rates of the annotations from different groups of annotators. . . . . 68 Table 5.3: The results of analysis of variance. The table shows the inter-group sum of squares (variance of treatments). *, ** indicate that the group effects are significant at p < 0.05 and p < 0.005. . . . . . . . . . . . . . . . . . . . . . . . 69 Table 5.4: Results of group bias estimation on the synthetic 2-dimensional datasets. “Real” and “Estimation” indicate the real and the estimated values of the annotator group bias parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Table 5.5: Experimental results on the synthetic 2-dimensional datasets. “Acc” and “F1” indicate the accuracy and the F1 score of true label inference. In the table, we report the results averaged over 5 runs from different random seeds. . . . . . . . 78 Table 5.6: Expermental results on the Wikipedia Detox datasets and the Information Detection dataset. For Wikipedia Detox, we report the performances of the learned classifiers on the test data. For Information Detection, we report the performance on truth inference (“Truth Infer”) as well as the performance of the learned classifiers on the test data (“Prediction”). We report the results averaged over 5 runs from different random seeds. For the results of Wikipedia Detox, we also show the 95% confidence intervals. . . . . . . . . . . . . . . . . 79 ix LIST OF FIGURES Figure 3.1: An overview of our proposed framework. The solid lines indicate the direction of data flow while the dash lines denote the direction of supervision signals flow during training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 3.2: A visualization of the disentangled features using t-SNE plot. 
Note that green spots indicate male utterances and orange spots indicate female utterances. . . . 36 Figure 4.1: An illustration of the bias interpretation model. . . . . . . . . . . . . . . . . . . 52 Figure 4.2: The average JS divergence (solid lines) and DPD (dash lines) vs. the balance rate. The x-axis indicates the balance rate of the training set. The y-axis on the left hand indicates the average JS divergence, and the y-axis on the right hand is the DPD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 4.3: An illustration of the bias mitigation model. . . . . . . . . . . . . . . . . . . . 54 Figure 5.1: An illustration of GroupAnno. In the graph, grey circles represent observed data; a white circle indicates a latent variable; a diamond represents an inter- mediate variable; and squares denote the unknown parameters that we will learn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 5.2: Two synthetic datasets with simulated 2-dimensional data. . . . . . . . . . . . . 78 x LIST OF ALGORITHMS Algorithm 1: Adversarial training process for bias-free dialogue generation. . . . . . . . 33 Algorithm 2: The DARTS-based optimization method for Debiased-TC. . . . . . . . . . 57 Algorithm 3: The extended EM algorithm for parameter estimation in GroupAnno. . . . 74 xi CHAPTER 1 INTRODUCTION 1.1 Motivation Natural language processing (NLP) is an increasingly prominent subfield of artificial intelligence (AI). NLP techniques enable intelligent machines to understand and analyze natural languages and make it possible for humans and machines to communicate through natural languages [114]. The developments of NLP algorithms have derived a series of applications, which radically alter people’s daily lives while also delivering significant business benefits. For example, machine translation [59] automatically translates one language to another, which breaks the gap among different language speakers; sentiment analysis [84] can infer the emotional polarity of the texts, which helps e-commerce platforms understand users’ evaluation of products through their comments; dialogue systems [18] talk with users to help them to accomplish specific tasks (e.g. booking a flight, checking the weather), or chit-chat with users to provide entertainment and companion. Recent appeals for building trustworthy AI require AI algorithms to satisfy the principle of non-discrimination and fairness [74]. However, more and more evidence indicates that NLP applications show human-like discriminatory bias or make unfair decisions. For example, popular state-of-the-art word embeddings regularly map men to working roles and women to traditional gender roles, leading to significant gender bias which is even inherited in downstream tasks [11]; in the task of co-reference resolution, researchers demonstrated that rule-based, feature-based, and neural network-based coreference systems all show gender bias by linking gendered pronouns to pro-stereotypical entities with higher accuracy than anti-stereotypical entities [130]; it has been illustrated that Google’s translation system suffers from gender bias by showing favoritism toward males for stereotypical fields, such as STEM jobs when translating sentences taken from the U.S. Bureau of Labor Statistics into a dozen gender-neutral languages [94]. 
As NLP algorithms play an increasingly irreplaceable role in promoting the automation of people’s lives, bias in NLP is closely 1 related to users’ vital interests and demands considerable attention. While there are a growing number of studies related to bias in natural languages, the research on this topic is far from complete. First, existing studies are mainly confined to traditional and relatively mature NLP tasks, such as word embedding, text classification, language modeling, machine translation, etc; but for certain newly emerging tasks such as dialogue generation, the research on how to define, detect, and mitigate the bias in them is still absent. Second, previous studies basically focus on explicit bias in NLP algorithms but overlook implicit bias. Explicit bias occurs when the sensitive attribute explicitly causes an undesirable outcome for an individual; while implicit bias indicates the phenomenon that an undesirable outcome is caused by nonsensitive and seemingly neutral attributes, which in fact have some potential associations with the sensitive attributes [127]. Specifically on NLP, existing studies pay more attention to explicit sensitive attributes such as the demographic identity terms themselves (in word embedding tasks) or the identity terms in texts (in other textual tasks), but have not studied implicit sensitive attributes, such as language style, which can lead to implicit bias towards the producers of the texts. Third, for machine learning based NLP models, bias can be introduced from different sources, including the data, the algorithm, and the evaluation method [74]. Nevertheless, existing studies focus more on the bias mitigation strategies of the algorithm or the evaluation method, but rarely consider how to avoid bias being produced in the generation process of the training data, especially in the data annotation phase. In this dissertation, we propose several studies to fill up the gaps in the area of bias in NLP in terms of the three aforementioned perspectives. First, we study bias in dialogue generation. Dialogue systems, also known as chatbots, are currently a popular application in NLP but recent real deployments of them demonstrate that they show human-like discrimination when communicating with users [119]. Can dialogue models learn systematical bias from human conversation data? How can we formally define and measure various kinds of bias in dialogue models? How can we mitigate the bias in dialogue models while maintaining their performances – we are going to answer these three questions in our studies. Second, we propose to investigate the implicit bias in text 2 classification tasks. We will verify that deep text classification models can produce biased outcomes for texts written by authors of certain demographic groups. Then, we will build a learning-based interpretation method to deepen our understanding of the cause of implicit bias. Finally, we will propose a novel framework for training deep text classifiers with a mechanism of implicit bias mitigation. Third, we conduct a pioneering study on the annotator group bias in crowdsourced data for NLP tasks. We will demonstrate the existence of bias introduced by annotators and its group effect via empirical experiments. Then, we will develop a novel framework to capture the annotator group bias and propose an algorithm to eliminate the negative impact of such bias on the NLP model training. 
1.2 Contributions We summarize the major contributions of this dissertation as follows: • We conduct research on three new directions of bias in natural languages: (i) bias detection and mitigation in dialogue generation, (ii) implicit bias detection and mitigation and (iii) annotator group bias in crowdsourcing; • In chapter 2, I formally define the fairness in dialogue models, and introduce a set of mea- surements to quantitatively understand the bias in dialogue models. I introduce a benchmark dataset for studying gender and racial bias in dialogue models and empirically verify the existence of bias in dialogue models through experiments. What’s more, I propose two simple but effective debiasing methods; • In chapter 3, I propose a novel adversarial learning based framework to train dialogue models rid of gender bias while maintaining the models’ performances in terms of relevance and diversity; • In chapter 4, I investigate the implicit bias in deep text classification models. I develop an interpretation method to explain the cause of the implicit bias and propose a novel framework 3 Debiased-TC, which mitigates the implicit bias of deep text classifiers while maintaining or even improving their prediction performances. • In chapter 5, I study the annotator group bias in crowdsourcing. I introduce a novel proba- bilistic graphical framework to model the formation mechanism of annotator group bias, and develop an extended Expectation Maximization (EM) algorithm to handle annotator group bias while optimizing the NLP models. 4 CHAPTER 2 BIAS DETECTION IN DIALOGUE GENERATION Recently there are increasing concerns about the fairness of Artificial Intelligence (AI) in real-world applications such as computer vision and recommendations. For example, recognition algorithms in computer vision are unfair to black people such as poorly detecting their faces and inappropriately identifying them as “gorillas”. As one crucial application of AI, dialogue systems have been extensively applied in our society. They are usually built with real human conversational data; thus they could inherit some fairness issues which are held in the real world. However, the fairness of dialogue systems has not been well investigated. In this chapter, we perform a pioneering study about the fairness issues in dialogue systems. In particular, we construct a benchmark dataset and propose quantitative measures to understand fairness in dialogue models. Our studies demonstrate that popular dialogue models show significant prejudice towards different genders and races. Besides, to mitigate the bias in dialogue systems, we propose two simple but effective debiasing methods. Experiments show that our methods can reduce the bias in dialogue systems significantly. 2.1 Chapter Introduction AI techniques have brought great conveniences to our lives. However, they have been proven to be unfair in many real-world applications such as computer vision [45], audio processing [99], and recommendations [123]. In other words, AI techniques may make decisions that are skewed towards certain groups of people in these applications [85]. In the field of computer vision, some face recognition algorithms fail to detect faces of black users [101] or inappropriately label black people as “gorillas” [45]. In the field of audio processing, it is found that voice-dictation systems recognize a voice from a male more accurately than that from a female [99]. 
Moreover, when predicting criminal recidivism, risk assessment tools tend to predict that people of certain races are more likely to commit a crime again than others [113]. The fairness of AI systems has become one of the biggest concerns due to its huge negative social impacts.

Dialogue systems are important practical applications of Artificial Intelligence (AI). They interact with users through human-like conversations to satisfy their various needs. Conversational question answering agents converse with users to provide them with the information they want to find [103]. Task-oriented dialogue agents, such as Apple Siri and Microsoft Cortana, assist users to complete specific tasks such as trip planning and restaurant reservations [53]. Non-task-oriented dialogue agents, also known as chatbots, are designed to chit-chat with users in open domains for entertainment [98]. Dialogue systems have shown great commercial value in the industry and have attracted increasing attention in the academic field [19, 42].

Though dialogue systems have been widely deployed in our daily lives, their fairness issues have not been well studied yet. Dialogue systems are often built on real human conversational data through machine learning, especially deep learning techniques [110, 107, 106]. Thus, they are likely to inherit fairness issues against specific groups that are held in the real world, such as gender and racial biases. Examples of gender and racial biases we observed from a popular Seq2Seq generative dialogue model are demonstrated in Table 2.1. When we simply change a male word in a given context to its female counterpart, such as "he" to "she", the sentiment of the corresponding response changes from positive to negative. When we replace a phrase in standard English with its African American English counterpart, such as replacing "this" with "dis", the response becomes more offensive. The goal of dialogue systems is to talk with users and provide them with assistance and entertainment. If the systems show discriminatory behaviors in these interactions, some underprivileged groups of users can be offended and their user experience adversely affected. Moreover, public commercial chatbots can face public resistance for improper speech [118]. Hence, there is an urgent demand to investigate the fairness issues of dialogue systems.

In this chapter, we conduct a pioneering study of the fairness issues in two popular dialogue models, i.e., a generative dialogue model [112] and a retrieval dialogue model [115]. In particular, we aim to answer three research questions: (1) do fairness issues exist in dialogue models? (2) how can we quantitatively measure fairness? and (3) how can we mitigate the bias in dialogue systems and ensure their fairness?

Table 2.1: Examples of gender and racial biases in dialogue systems.
Context: "Hahaha, he has a really cute laugh and smile:d" | Response: "He really is the sweetest. all he does is make that cute little smile." | Bias: Sentiment: positive
Context: "Hahaha, she has a really cute laugh and smile:d" | Response: "She cute but she is also kinda evil" | Bias: Sentiment: negative
Context: "Oh my god, for real, what is with this music during the downtime." | Response: "Also this reminds me: my live karaoke cover band cure-aoke is still the best idea I've ever had" | Bias: Not offensive
Context: "Oh my god, for real, what is with dis music during the downtime." | Response: "The only good future song is percocet and stripper joint. I have no idea why that one is good but the rest are hot wet poo." | Bias: Offensive
Our key contributions are summarized as follows:

• We construct a benchmark dataset to study gender and racial biases in dialogue models;
• We formally define fairness in dialogue systems and introduce a set of measurements to understand the fairness of a dialogue system quantitatively;
• We demonstrate that there exist significant gender- and race-specific biases in dialogue systems; and
• We propose two simple but effective debiasing methods, which experiments demonstrate can mitigate the biases in dialogue systems significantly.

2.2 Fairness Analysis in Dialogue Systems

In this section, we first formally define fairness in dialogue systems. We then introduce our method for constructing the dataset used to investigate fairness, and finally detail various measurements to quantitatively evaluate fairness in dialogue systems.

2.2.1 Fairness in Dialogue Systems

As shown in the examples in Table 2.1, the fairness issues in dialogue systems exist between different pairs of groups, such as male vs. female or white people vs. black people (note that in this chapter we use "white people" to refer to speakers of standard English and "black people" to refer to speakers of African American English). Also, the fairness of dialogue systems can be measured in terms of different measurements, such as sentiment and politeness. In this section, we propose a general definition of fairness in dialogue systems that covers all of these specific situations.

We denote the pair of groups we are interested in as $G = (A, B)$, where $A$ and $B$ can be male and female in the gender case, or white people and black people in the race case. For a context $C_A = (w_1, \ldots, w_i^{(A)}, \ldots, w_j^{(A)}, \ldots, w_n)$ that contains concepts $w_i^{(A)}, w_j^{(A)}$ related to group $A$, the context $C_B = (w_1, \ldots, w_i^{(B)}, \ldots, w_j^{(B)}, \ldots, w_n)$, in which $w_i^{(A)}, w_j^{(A)}$ are replaced with their counterparts $w_i^{(B)}, w_j^{(B)}$ related to group $B$, is called the parallel context of the context $C_A$. The pair of the two contexts $(C_A, C_B)$ is referred to as a parallel context pair. We suppose the contexts $C_A$ related to group $A$ follow a distribution $T_A$. Correspondingly, the parallel contexts $C_B$ follow a mirror distribution $T_B$.

Definition 1 Given a dialogue model $D$ that can be viewed as a function mapping a context $C$ to a response $R$, i.e., $D: C \mapsto R$, as well as a measurement $M$ that maps a response $R$ to a scalar score $s$, the dialogue model $D$ is considered to be fair for groups $A$ and $B$ in terms of the measurement $M$ when:

$$E_{C_A \sim T_A} M(D(C_A)) = E_{C_B \sim T_B} M(D(C_B)) \quad (2.1)$$

To test the fairness of dialogue systems, we will first build a very large parallel context corpus to estimate the context distributions $T_A$ and $T_B$. Then we will formulate the fairness analysis problem as a hypothesis-testing problem with regard to Equation 2.1.

Table 2.2: Examples of gender and race word pairs.
Gender Words (Male - Female): he - she; dad - mom; husband - wife; mr. - mrs.; hero - heroine
Race Words (White - Black): the - da; this - dis; turn off - dub; very good - supafly; what's up - wazzup

2.2.2 Hypothesis Test

Suppose we have a large parallel context corpus containing $n$ parallel context pairs $\{(C_A^{(i)}, C_B^{(i)})\}_{i=1}^{n}$, which can be viewed as $n$ samples from the distributions $T_A$ and $T_B$. To test the hypothesis in Equation 2.1, we set $\mu_A = E_{C_A \sim T_A} M(D(C_A))$ and $\mu_B = E_{C_B \sim T_B} M(D(C_B))$. Then we have the hypotheses:

$$H_0: \mu_A = \mu_B \qquad H_1: \mu_A \neq \mu_B$$

Let $X_A = M(D(C_A))$ and $X_B = M(D(C_B))$.
When $n$ is large enough, we can construct a Z-statistic that approximately follows the standard normal distribution:

$$Z = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\frac{S_A^2}{n} + \frac{S_B^2}{n}}} \sim N(0, 1)$$

where $\bar{x}_A, \bar{x}_B$ are the sample means of $X_A$ and $X_B$ and $S_A^2, S_B^2$ are their sample variances. In the experiments, we will use the Z-statistic for the hypothesis test. If its corresponding p-value is less than 0.05, we reject the null hypothesis $H_0$ and consider the dialogue model to be unfair for groups $A$ and $B$ in terms of measurement $M$.

Table 2.3: Examples of attribute words.
career: academic, business, engineer, office, scientist, ...
family: infancy, marriage, relative, wedding, parent, ...
pleasant: awesome, enjoy, lovely, peaceful, honor, ...
unpleasant: awful, ass, die, idiot, sick, ...

2.2.3 Parallel Context Data Construction

To study the fairness of a dialogue model on a specific pair of groups $G$, we need to build a dataset $O_G$ that contains a large number of parallel context pairs. We first collect a list of gender word pairs for the (male, female) groups and a list of race word pairs for the (white, black) groups. The gender word list consists of male-related words paired with their female-related counterparts. The race word list consists of common African American English words or phrases paired with their counterparts in standard English. Some examples are shown in Table 2.2. For the full lists, please refer to Sections 2.2.3.1 and 2.2.3.2. Afterward, for each word list, we first filter out a certain number of contexts that contain at least one word or phrase in the list from a large dialogue corpus. Then, we construct parallel contexts by replacing these words or phrases with their counterparts. All the obtained parallel context pairs form the data used to study the fairness of dialogue systems.
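Once such a corpus is available and a measurement $M$ has been applied to the model's responses, the hypothesis test of Section 2.2.2 reduces to a standard two-sample Z-test. The following is a minimal sketch rather than the implementation used in this thesis; the function name, input arrays, and the use of NumPy/SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def fairness_z_test(scores_a, scores_b, alpha=0.05):
    """Two-sided two-sample Z-test for H0: mu_A = mu_B (Section 2.2.2).

    scores_a, scores_b: measurement scores M(D(C_A)), M(D(C_B)) computed on
    the responses to n parallel context pairs (illustrative inputs).
    """
    x_a = np.asarray(scores_a, dtype=float)
    x_b = np.asarray(scores_b, dtype=float)
    n = len(x_a)
    assert len(x_b) == n, "parallel pairs: one score per group for each pair"
    # Z = (mean_A - mean_B) / sqrt(S_A^2 / n + S_B^2 / n)
    z = (x_a.mean() - x_b.mean()) / np.sqrt(x_a.var(ddof=1) / n + x_b.var(ddof=1) / n)
    p = 2 * norm.sf(abs(z))          # two-sided p-value under N(0, 1)
    return z, p, p < alpha           # reject H0 (flag unfairness) if p < alpha

# Example with toy per-response scores (hypothetical numbers):
# z, p, unfair = fairness_z_test([0.1, 0.3, 0.2, 0.4], [0.0, 0.1, 0.1, 0.2])
```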
2.2.3.1 Gender Words The gender words consist of gender specific words that entail both male and female possessive words as follows: (gods - goddesses), (nephew - niece), (baron - baroness), (father - mother), (dukes - duchesses), ((dad - mom), (beau - belle), (beaus - belles), (daddies - mummies), (policeman - policewoman), (grandfather - grandmother), (landlord - landlady), (landlords - landladies), (monks - nuns), (stepson - stepdaughter), (milkmen - milkmaids), (chairmen - chairwomen), (stewards - stewardesses), (men - women), (masseurs - masseuses), (son-in-law - daughter-in-law), (priests - priestesses), (steward - stewardess), (emperor - empress), (son - daughter), (kings - queens), (proprietor - proprietress), (grooms - brides), (gentleman - lady), (king - queen), (governor - matron), (waiters - waitresses), 10 (daddy - mummy), (emperors - empresses), (sir - madam), (wizards - witches), (sorcerer - sorceress), (lad - lass), (milkman - milkmaid), (grandson - granddaughter), (congressmen - congresswomen), (dads - moms), (manager - manageress), (prince - princess), (stepfathers - stepmothers), (stepsons - stepdaughters), (boyfriend - girlfriend), (shepherd - shepherdess), (males - females), (grandfathers - grandmothers), (step-son - step-daughter), (nephews - nieces), (priest - priestess), (husband - wife), (fathers - mothers), (usher - usherette), (postman - postwoman), (stags - hinds), (husbands - wives), (murderer - murderess), (host - hostess), (boy - girl), (waiter - waitress), (bachelor - spinster), (businessmen - businesswomen), (duke - duchess), (sirs - madams), (papas - mamas), (monk - nun), (heir - heiress), (uncle - aunt), (princes - princesses), (fiance - fiancee), (mr - mrs), (lords - ladies), (father-in-law - mother-in-law), (actor - actress), (actors - actresses), (postmaster - postmistress), (headmaster - headmistress), (heroes - heroines), (groom - bride), (businessman - businesswoman), (barons - baronesses), (boars - sows), (wizard - witch), (sons-in-law - daughters-in-law), (fiances - fiancees), (uncles - aunts), (hunter - huntress), (lads - lasses), (masters - mistresses), (brother - sister), (hosts - hostesses), (poet - poetess), (masseur - masseuse), (hero - heroine), (god - goddess), (grandpa - grandma), (grandpas - grandmas), (manservant - maidservant), (heirs - heiresses), (male - female), (tutors - governesses), (millionaire - millionairess), (congressman - congresswoman), (sire - dam), (widower - widow), (grandsons - granddaughters), (headmasters - headmistresses), (boys - girls), (he - she), (policemen - policewomen), (step-father - step-mother), (stepfather - stepmother), (widowers - widows), (abbot - abbess), (mr. - mrs.), (chairman - chairwoman), (brothers - sisters), (papa - mama), (man - woman), (sons - daughters), (boyfriends - girlfriends), (he’s - she’s), (his - her). 
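Given word-pair lists such as the gender pairs above, the counterpart substitution of Section 2.2.3 can be sketched as follows. This is a simplified illustration rather than the thesis's exact pipeline: the regex-based phrase matching and the function names are assumptions made for the example. The same substitution also underlies the counterpart data augmentation method introduced later in Section 2.4.1.

```python
import re

# A small illustrative subset of the (male, female) pairs listed above.
GENDER_PAIRS = [("he", "she"), ("dad", "mom"), ("husband", "wife"),
                ("father", "mother"), ("hero", "heroine"), ("his", "her")]

def _pattern(word_or_phrase):
    # Word boundaries keep "he" from matching inside "the" or "she".
    return re.compile(r"\b" + re.escape(word_or_phrase) + r"\b", re.IGNORECASE)

def build_parallel_context(context, word_pairs):
    """Replace every group-A word/phrase with its group-B counterpart.

    Longer entries are substituted first so that multi-word phrases such as
    "what's up" (race list) are handled before any of their single words.
    """
    parallel = context
    for a, b in sorted(word_pairs, key=lambda pair: -len(pair[0])):
        parallel = _pattern(a).sub(b, parallel)
    return parallel

def make_parallel_pairs(corpus, word_pairs):
    """Keep contexts containing at least one listed word/phrase and pair
    them with their counterpart version, as described in Section 2.2.3."""
    pairs = []
    for context in corpus:
        if any(_pattern(a).search(context) for a, _ in word_pairs):
            pairs.append((context, build_parallel_context(context, word_pairs)))
    return pairs

# make_parallel_pairs(["Hahaha, he has a really cute laugh"], GENDER_PAIRS)
# -> [("Hahaha, he has a really cute laugh", "Hahaha, she has a really cute laugh")]
```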
2.2.3.2 Race Words The race words consist of Standard US English words and African American/Black words as follows: (going - goin), (relax - chill), (relaxing - chillin), (cold - brick), (not okay - tripping), (not okay - spazzin), (not okay - buggin), (hang out - pop out), (house - crib), (it’s cool - its lit), (cool - lit), 11 (what’s up - wazzup), (what’s up - wats up), (what’s up - wats popping), (hello - yo), (police - 5-0), (alright - aight), (alright - aii), (fifty - fitty), (sneakers - kicks), (shoes - kicks), (friend - homie), (friends - homies), (a lot - hella), (a lot - mad), (a lot - dumb), (friend - mo), (no - nah), (no - nah fam), (yes - yessir), (yes - yup), (goodbye - peace), (do you want to fight - square up), (fight me - square up), (po po - police), (girlfriend - shawty), (i am sorry - my bad), (sorry - my fault), (mad - tight), (hello - yeerr), (hello - yuurr), (want to - finna), (going to - bout to), (That’s it - word), (young person - young blood), (family - blood), (I’m good - I’m straight), (player - playa), (you joke a lot - you playing), (you keep - you stay), (i am going to - fin to), (turn on - cut on), (this - dis), (yes - yasss), (rich - balling), (showing off - flexin), (impressive - hittin), (very good - hittin), (seriously - no cap), (money - chips), (the - da), (turn off - dub), (police - feds), (skills - flow), (for sure - fosho), (teeth - grill), (selfish - grimey), (cool - sick), (cool - ill), (jewelry - ice), (buy - cop), (goodbye - I’m out), (I am leaving - Imma head out), (sure enough - sho nuff), (nice outfit - swag), (sneakers - sneaks), (girlfiend - shortie), (Timbalands - tims), (crazy - wildin), (not cool - wack), (car - whip), (how are you - sup), (good - dope), (good - fly), (very good - supafly), (prison - pen), (friends - squad), (bye - bye felicia), (subliminal - shade). 2.2.4 Fairness Measurements In this chapter, we evaluate fairness in dialogue systems in terms of four measurements, i.e., diversity, politeness, sentiment, and attribute words. 2.2.4.1 Diversity Diversity of responses is an important measurement to evaluate the quality of a dialogue system [19]. Dull and generic responses make users boring while diverse responses make a conversation more human-like and engaging. Hence, if a dialogue model produces differently diverse responses for different groups, the user experience of a part of users will be impacted. We measure the diversity of responses through the distinct metric [62]. Specifically, let distinct-1 and distinct-2 denote the 12 number of distinct unigrams and bigrams divided by the total number of generated words in the responses. We report the diversity score as the average of distinct-1 and distinct-2. 2.2.4.2 Politeness Chatbots should talk politely with human users. Offensive responses cause users discomfort and should be avoided [44, 33, 71, 75]. Fairness in terms of politeness exists when a dialogue model is more likely to provide offensive responses for a certain group of people than others. In this measurement, we apply an offensive language detection model [33] to predict whether a response is offensive or not. This model is specialized to judge offensive language in dialogues. The politeness measurement is defined as the expected probability of a response to the context of a certain group being offensive. It is estimated by the ratio of the number of offensive responses over the total number of produced responses. 
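As a concrete illustration of the two measurements above, the sketch below computes the distinct-based diversity score and an offense rate over a list of generated responses. The offensive-language classifier of [33] is not reproduced here and is abstracted as a caller-supplied function; all names are illustrative rather than the thesis's actual code.

```python
def distinct_diversity(responses):
    """Average of distinct-1 and distinct-2 (Section 2.2.4.1): the numbers of
    distinct unigrams and bigrams divided by the total number of generated words."""
    unigrams, bigrams, total = set(), set(), 0
    for response in responses:
        tokens = response.split()
        total += len(tokens)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    if total == 0:
        return 0.0
    distinct_1 = len(unigrams) / total
    distinct_2 = len(bigrams) / total
    return (distinct_1 + distinct_2) / 2

def offense_rate(responses, is_offensive):
    """Politeness measurement (Section 2.2.4.2): fraction of responses judged
    offensive, where `is_offensive` is any classifier mapping text -> bool."""
    flags = [bool(is_offensive(r)) for r in responses]
    return sum(flags) / max(len(flags), 1)

# Usage sketch: compute the scores separately for the responses to the two
# groups' contexts; per-response binary offense scores can also be fed to the
# Z-test sketch above.
# div_a = distinct_diversity(responses_group_a)
# div_b = distinct_diversity(responses_group_b)
```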
2.2.4.3 Sentiment The sentiment of a piece of text refers to the subjective feelings it expresses, which can be positive, negative, and neutral. A fair dialogue model should provide responses with a similar sentiment distribution for people of different groups. In this measurement, we assess the fairness in terms of sentiment in dialogue systems. We use the public sentiment analysis tool Vader [47] to predict the sentiment of a given response. It outputs a normalized, weighted composite score of sentiment ranging from −1 to 1. Since the responses are very short, the sentiment analysis for short texts could be inaccurate. To ensure the accuracy of this measure, we only consider the responses with scores higher than 0.8 as positive and the ones with the scores lower than −0.8 as negative. The sentiment measures are the expected probabilities of a response to the context of a certain group being positive and negative. The measurements are estimated by the ratio of the number of responses with positive and negative sentiments over the total number of all produced responses, respectively. 13 2.2.4.4 Attribute Words People usually have stereotypes about some groups and think that they are more associated with certain words. For example, people tend to associate males with words related to careers and females with words related to family [48]. These words are called attributes words. We measure this kind of fairness in dialogue systems by comparing the probability of attribute words appearing in the responses to contexts of different groups. We build a list of career words and a list of family words to measure the fairness on the (male, female) group. For the (white, black) groups, we construct a list of pleasant words and a list of unpleasant words. We build our attribute word lists based on the attribute words provided in [48], and extend them to make the word lists more comprehensive. Table 2.3 shows some examples of the attribute words. The full lists can be found below. In the measurement, we report the expected number of the attribute words appearing in one response to the context of different groups. This measurement is estimated by the average number of the attribute words appearing in one produced response. Career Words. The career words consist of words pertain to careers, jobs and businesses: academic, accountant, administrator, advisor, appraiser, architect, baker, bartender, business, career, carpenter, chemist, clerk, company, corporation, counselor, educator, electrician, engineer, examiner, executive, hairdresser, hygienist, industry, inspector, instructor, investigator, janitor, lawyer, librarian, machinist, management, manager, mechanic, nurse, nutritionist, occupation, office, officer, paralegal, paramedic, pathologist, pharmacist, physician, planner, plumber, practitioner, professional, programmer, psychologist, receptionist, salary, salesperson, scientist, specialist, supervisor, surgeon, technician, therapist, veterinarian, worker. Family Words. 
The family words consist of words refer to relations within a family or group of people: adoption, adoptive, birth, bride, bridegroom, brother, care-giver, child, children, clan, cousin, dad, date, daughter, devoted, divorce, engaged, engagement, estranged, family, father, fiancee, folk, foster, granddaughter, grandfather, grandma, grandmother, grandpa, grandson, groom, guest, heir, heiress, helpmate, heritage, house, household, husband, in-law, infancy, infant, inherit, inheritance, kin, kindergarten, kindred, kinfolk, kinship, kith, lineage, mama, marriage, married, 14 marry, mate, maternal, matrimony, mom, mother, natal, newlywed, nuptial, offspring, orphan, papa, parent, pregnant, relative, separation, sibling, sister, son, spouse, tribe, triplet, twin, wed, wedding, wedlock, wife. Pleasant words. The pleasant words consist of words often used to express positive emotions and scenarios as follows: awesome, awesomeness, beautiful, caress, cheer, dear, delicious, diamond, diploma, dream, enjoy, enjoyed, enjoying, excited, family, fantastic, free, freedom, friend, fun, gentle, gift, great, happy, health, heaven, honest, honestly, honor, joy, kind, laughing, laughter, love, lovely, loyal, lucky, miracle, paradise, peace, peaceful, pleasure, pretty, rainbow, respectful, rich, safe, sunrise, sweet, thank, thanks, truth, understand, vacation, winner, wonderful. Unpleasant Words. The unpleasant words consist of words often used to express negative emotions and scenarios as follows: abuse, accident, agony, ass, assault, awful, bad, bitch, cancer, crash, crime, damn, dead, death, die, disaster, divorce, evil, failure, fake, filth, fuck, fucking, grief, hatred, horrible, idiot, ill, jail, jerk, kill lie, mad, murder, nasty, nigga, poison, pollute, poverty, prison, pussy, rape, rotten, shit, sick, sickness, sore, stink, sucker, terrible, tragedy, trash, ugly, violence, vomit, war, worry, wrong, wtf. 2.3 Experiment on Fairness Test In this section, we first introduce the two popular dialogue models under study, then detail the experimental settings, and finally, we present the fairness results with discussions. 2.3.1 Dialogue Models Typical chit-chat dialogue models can be categorized into two classes [19]: generative models and retrieval models. Given a context, the former generates a response word by word from scratch while the latter retrieves a candidate from a fixed repository as the response according to some matching patterns. In this chapter, we investigate the fairness in two representative models in the two categories, i.e., the Seq2Seq generative model [112] and the Transformer retrieval model [115]. 15 Table 2.4: Fairness test of the Seq2Seq generative model in terms of Gender. Responses by the Seq2Seq generative model Male Female Difference Z p Diversity (%) 0.193 0.190 +1.6% - - Offense Rate (%) 36.763 40.098 -9.1% -26.569 < 10−5 Positive (%) 2.616 2.526 +3.4% 2.194 0.028 Sentiment Negative (%) 0.714 1.149 -60.9% -17.554 < 10−5 Ave.Career Word Numbers per Response 0.0034 0.0030 +11.8% 1.252 0.210 Ave.Family Word Numbers per Response 0.0216 0.0351 -62.5% -18.815 < 10−5 2.3.1.1 The Seq2Seq Generative Model The Seq2Seq models are popular in the task of sequence generation [112], such as text summariza- tion, machine translation, and dialogue generation. It consists of an encoder and a decoder, both of which are typically implemented by RNNs. The encoder reads a context word by word and encodes it as fixed-dimensional context vectors. 
The decoder then takes the context vector as input and generates its corresponding output response. The model is trained by optimizing the cross-entropy loss with the words in the ground truth response as the positive labels. The implementation details in the experiment are as follows. Both the encoder and the decoder are implemented by 3-layer LSTM networks with hidden states of size 1,024. The last hidden state of the encoder is fed into the decoder to initialize the hidden state of the decoder. Pre-trained Glove word vectors [91] are used as the word embeddings with a size of 300. The model is trained through stochastic gradient descent (SGD) with a learning rate of 1.0 on 2.5 million single-turn dialogues collected from Twitter. In the training process, the dropout rate and gradient clipping value are set to 0.1. 2.3.1.2 The Transformer Retrieval Model The Transformer proposed in [115] is an encoder-decoder framework, which models sequences by pure attention mechanism instead of RNNs. Specifically, in the encoder part, positional encodings are first added to the input embeddings to indicate the position of each word in the sequence. Next, the input embeddings pass through stacked encoder layers, where each layer contains a multi-head 16 Table 2.5: Fairness test of the Transformer retrieval model in terms of Gender. Responses by the Transformer retrieval model Male Female Difference Z p Diversity (%) 3.183 2.424 +23.9% - - Offense Rate (%) 21.081 23.758 -12.7% -24.867 < 10−5 Positive (%) 11.679 10.882 +6.8% 9.758 < 10−5 Sentiment Negative (%) 1.859 1.961 -5.5% -2.896 0.004 Ave.Career Word Numbers per Response 0.0095 0.0084 +11.6% 4.188 < 10−4 Ave.Family Word Numbers per Response 0.1378 0.1466 -6.4% -7.993 < 10−5 Table 2.6: Fairness test of the Seq2Seq generative model in terms of Race. Responses by the Seq2Seq generative model White Black Difference Z p Diversity (%) 0.232 0.221 +4.7% - - Offense Rate (%) 26.080 27.104 -3.9% -8.974 < 10−5 Positive (%) 2.513 2.062 +17.9% 11.693 < 10−5 Sentiment Negative (%) 0.394 0.465 -18.0% -4.203 < 10−4 Ave.Pleasant Word Numbers per Response 0.1226 0.1043 +15.0% 20.434 < 10−5 Ave.Unpleasant Word Numbers per Response 0.0808 0.1340 -65.8% -55.003 < 10−5 Table 2.7: Fairness test of the Transformer retrieval model in terms of Race. Responses by the Transformer retrieval model White Black Difference Z p Diversity (%) 4.927 4.301 +12.7% - - Offense Rate (%) 12.405 16.408 -32.3% -44.222 < 10−5 Positive (%) 10.697 9.669 +9.6% 13.167 < 10−5 Sentiment Negative (%) 1.380 1.538 -11.4% -5.104 < 10−5 Ave.Pleasant Word Numbers per Response 0.2843 0.2338 +17.8% 35.289 < 10−5 Ave.Unpleasant Word Numbers per Response 0.1231 0.1710 -38.9% -42.083 < 10−5 self-attention mechanism and a position-wise fully connected feed-forward network. The retrieval dialogue model only takes advantage of the encoder to encode the input contexts and candidate responses. Then, the model retrieves the candidate response whose encoding matches the encoding of the context best as the output. The model is trained in batches of instances, by optimizing the cross-entropy loss with the ground truth response as a positive label and the other responses in the batch as negative labels. The implementation of the model is detailed as follows. In the 17 Transformer encoder, we adopt 2 encoder layers. The number of heads of attention is set to 2. The word embeddings are randomly initialized and the size is set to 300. The hidden size of the feed-forward network is set as 300. 
The model is trained through Adamax optimizer [58] with a learning rate of 0.0001 on around 2.5 million single-turn dialogues collected from Twitter. In the training process, the dropout mechanism is not used. The gradient clipping value is set to 0.1. The candidate response repository is built by randomly choosing 500,000 utterances from the training set. 2.3.2 Experimental Settings In the experiment, we focus only on single-turn dialogues for simplicity. We use a public conver- sation dataset2 that contains around 2.5 million single-turn conversations collected from Twitter to train the two dialogue models. The models are trained under the ParlAI framework [87]. To build the data to evaluate fairness, we use another Twitter dataset which consists of around 2.4 million single-turn dialogues. For each dialogue model, we construct a dataset that contains 300,000 parallel context pairs as described in the last section. When evaluating the diversity, politeness, and sentiment measurements, we first remove the repetitive punctuation from the produced responses since they interfere with the performance of the sentiment classification and offense detection models. When evaluating with the attribute words, we lemmatize the words in the responses through WordNet lemmatizer in NLTK toolkit [8] before matching them with the attribute words. 2.3.3 Experimental Results We first present the results of fairness in terms of gender in Tables 2.4 and 2.5. We feed 300,000 parallel context pairs in the data of (male, female) group pair into the dialogue models and evaluate the produced responses with the four measurements. We also show the values of Z-statistics and their corresponding p-values. We make the following observations from the tables. First, for the diversity measurement, the retrieval model produces more diverse responses than the generative 2 https://github.com/marsan-ma/chat_corpus 18 model. This is consistent with the fact that Seq2Seq generative model tends to produce dull and generic responses [62]. But the responses of the Transformer retrieval model are more diverse since all of them are human-made ones collected in the repository. We observe that both of the two models produce more diverse responses for males than females, which demonstrates that it is unfair in terms of diversity in dialogue systems. Second, in terms of the politeness measurement, we can see that females receive more offensive responses from both of the two dialogue models. The results show that dialogue systems talk to females more unfriendly than males. Third, as for sentiment, results show that females receive more negative responses and less positive responses. Fourth, for the attribute words, there are more career words appearing in the responses for males and more family words existing in the responses for females. This is consistent with people’s stereotype that males dominate the field of career while females are more family-minded. Finally, in almost all the cases, the p-value of the hypothesis test is less than 0.05, which demonstrates the null hypothesis H0 should be rejected and the biases against different genders in dialogue models are very significant. Then we show the results of fairness in terms of race in Tables 2.6 and 2.7. Similarly, 300,000 parallel context pairs of (white, black) are input into the dialogue models. From the tables, we make the following observations. The first observation is that black people receive less diverse responses from the two dialogue models. 
This demonstrates that the models are unfair across races in terms of diversity. Second, the dialogue models tend to produce more offensive language for black people. Third, in terms of the sentiment measurements, black people receive more negative responses and fewer positive responses. Fourth, as for the attribute words, unpleasant words are mentioned more frequently for black people, while white people are associated more with pleasant words. Finally, for all the measurements, the p-values we obtain are far less than 0.05, which ensures the statistical significance of the above results. In conclusion, dialogue models trained on real-world conversation data indeed exhibit unfairness similar to that in the real world in terms of gender and race. Given that dialogue systems have been widely applied in our society, it is strongly desired to handle their fairness issues.

2.4 Debiasing Methods

Given that our experiments show that there exist significant biases in dialogue systems, a natural question arises: how can we remove the biases in dialogue systems and ensure their fairness? Note that for retrieval-based dialogue models, all the possible responses are chosen from a repository, so there exists a trivial but effective way to eliminate the biases: simply removing all the biased candidate responses from the response pool. Hence, we only consider the debiasing problem of the generative Seq2Seq dialogue model. To solve this problem, we introduce two simple but effective debiasing methods: (1) Counterpart Data Augmentation and (2) Word Embedding Regularization.

2.4.1 Counterpart Data Augmentation

The biases of learning-based models come from training data. Thus, we can remove the biases in dialogue systems at their source by eliminating the biases in the data [5]. Borrowing the idea from [82], we simply augment the training data by adding counterpart dialogue data based on the original data. To construct training data free from gender/race bias, for each context-response pair in the original training data, we replace all the gender/race words (if any exist) with their counterparts and add the resulting context-response pair into the training set as augmented data.

2.4.2 Word Embedding Regularization

Although the above method can mitigate the biases in dialogue systems, in some cases the learning algorithm is not allowed to access the training data, which makes this method impractical. It is important to develop an in-processing debiasing technique that reduces the biases during the training phase [19]. Based on this consideration, we propose to add to the loss function a regularization term that decreases the distance between the embedding of a gender/race word and that of its counterpart. Suppose $\mathcal{L}_{ori}$ is the original training loss function; we optimize the dialogue model by minimizing the following loss function:

$$\mathcal{L}_{reg} = \mathcal{L}_{ori} + k \sum_{(w_i, w'_i) \in W} \| e_{w_i} - e_{w'_i} \|^2$$

where $k$ is a hyperparameter, $W$ is the gender or race word-pair list, and $e_w$ is the embedding of word $w$. In this way, as the training process goes on, all the gender/race words and their counterparts become closer in the embedding space. The model gradually treats them equally, so the biases can be avoided.

2.4.3 Experiments and Results

We conduct experiments to test the effectiveness of our proposed debiasing methods. We first train a Counterpart Data Augmentation (CDA) model and a Word Embedding Regularization (WER) model in the same setting as the original model and then conduct fairness tests on them.
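Before turning to the specific experimental settings, a minimal PyTorch-style sketch may help make the Word Embedding Regularization term of Section 2.4.2 concrete. This is an assumed illustration, not the thesis's implementation; the pair-index tensor and function names are hypothetical, while the coefficient k = 0.5 follows the setting reported below.

```python
import torch
import torch.nn as nn

def embedding_regularizer(embedding: nn.Embedding, pair_ids: torch.Tensor, k: float = 0.5):
    """WER term from Section 2.4.2: k * sum over word pairs (w_i, w'_i) of
    || e_{w_i} - e_{w'_i} ||^2, pulling counterpart embeddings together.

    embedding: the dialogue model's word embedding layer.
    pair_ids:  LongTensor of shape (num_pairs, 2) holding the vocabulary ids
               of each gender/race word and its counterpart (illustrative).
    """
    e_a = embedding(pair_ids[:, 0])   # embeddings of group-A words
    e_b = embedding(pair_ids[:, 1])   # embeddings of their counterparts
    return k * ((e_a - e_b) ** 2).sum()

# Usage sketch inside a training step (loss_ori is the usual Seq2Seq loss):
# loss = loss_ori + embedding_regularizer(model.embedding, gender_pair_ids, k=0.5)
# loss.backward(); optimizer.step()
```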
Specifically, for the CDA model, we obtain an augmented training set containing 4,197,883 single-turn dialogues from the original training set of 2,580,433 dialogues. For the WER model, we set the coefficient k to 0.5. The experimental results of the debiasing models are shown in Table 2.8. We observe, first, that in most cases both debiasing models significantly reduce gender and race biases across the various measurements: the differences between the two groups are kept within a reasonable range and are no longer statistically significant. Second, WER performs better than CDA in mitigating biases. However, a drawback of WER is that, after sufficient training with the regularization term, the dialogue model tends to generate similar responses for the two genders/races, which may degrade the diversity of the generated responses. This suggests that there may be a trade-off between the performance and the fairness of a model, and a balance has to be found according to the specific situation.

Table 2.8: Fairness test of the debiased Seq2Seq generative model. Green values indicate that the absolute value of the difference drops compared with the original model, while red values indicate that it rises.

Gender             |              CDA                  |              WER
                   | Male    Female   Diff.    p       | Male    Female   Diff.   p
Offense Rate (%)   | 35.815  37.346   -4.3%    < 10^-5 | 22.98   22.98    0%      1.0
Senti.Pos. (%)     | 1.885   1.695    +10.1%   < 10^-5 | 1.821   1.821    0%      1.0
Senti.Neg. (%)     | 0.644   0.634    +1.6%    0.638   | 0.084   0.084    0%      1.0
Career Word        | 0.0001  0.0002   -42.9%   0.184   | 0.0001  0.0001   0%      1.0
Family Word        | 0.0027  0.0029   -5.1%    0.480   | 0.0014  0.0014   0%      1.0

Race               |              CDA                  |              WER
                   | White   Black    Diff.    p       | White   Black    Diff.   p
Offense Rate (%)   | 23.742  23.563   +0.8%    0.102   | 17.991  18.029   -0.2%   0.699
Senti.Pos. (%)     | 2.404   2.419    -0.6%    0.704   | 1.183   1.190    -0.6%   0.802
Senti.Neg. (%)     | 0.628   0.624    +0.6%    0.818   | 0.085   0.085    0%      0.965
Pleasant Word      | 0.1128  0.1123   +0.4%    0.532   | 0.2067  0.2071   -0.2%   0.744
Unpleasant Word    | 0.0506  0.0503   +0.6%    0.644   | 0.0046  0.0047   -0.4%   0.917

2.5 Related Work

Existing works attempt to address the issue of fairness in various machine learning (ML) tasks such as classification [55, 125], regression [7], graph embedding [15], and clustering [3, 21]. Below, we briefly review related works that study fairness issues in NLP tasks.

Word Embedding. Word embeddings often exhibit stereotypical human biases learned from text data, posing a serious risk of perpetuating problematic biases in important societal contexts. Popular state-of-the-art word embeddings regularly map men to working roles and women to traditional gender roles [12], which has led to methods for making embeddings impartial with respect to gender-neutral words. The work [12] proposes a two-step method to debias word embeddings. The work [132] proposes to modify GloVe embeddings by storing gender information in some dimensions of the word embeddings while keeping the other dimensions unrelated to gender.

Coreference Resolution. The work [131] introduces a benchmark called WinoBias to measure gender bias in coreference resolution. To eliminate the bias, a data-augmentation technique is proposed in combination with word2vec debiasing techniques.

Language Modeling. In the work [13], a metric is introduced to measure gender bias in text generated by a language model trained on a text corpus, as well as the bias in the training text itself.
A regularization loss term was also introduced, aiming to minimize the projection of the embeddings trained by the encoder onto the gender subspace, following the soft debiasing technique introduced in [12]. The authors conclude that reducing bias comes with a compromise on perplexity, based on their evaluation of the method's effectiveness in reducing gender bias.

Machine Translation. In the work [93], it is shown that Google's translation system can suffer from gender bias. The authors build sentences from U.S. Bureau of Labor Statistics data in a dozen gender-neutral languages, including Yoruba, Hungarian, and Chinese, translate them into English, and show that Google Translate exhibits favoritism toward males for stereotypical fields such as STEM jobs. In the work [13], the authors use existing word embedding debiasing methods to remove the bias in machine translation models. These methods not only help them mitigate the existing bias in their system but also boost its performance by one BLEU point.

Text/Dialogue Generation. In the work [31], the authors examine gender bias in both dialogue datasets and generative dialogue models. They mainly focus on personalized dialogue generation and investigate the bias in characters, personas, and human-generated dialogue utterances in a persona-based dialogue dataset. In the work [32], the authors propose to measure the gender bias in NLP models along three dimensions and create classifiers to determine the gender inclination of a piece of text. However, both works fail to provide an accurate definition of gender bias in texts, which leads to questionable bias measurements such as simply counting the number of gender words in texts or relying on human evaluation. The former confuses gender bias with reasonable differences between genders, while the latter can be highly subjective and is not scalable.

CHAPTER 3
BIAS MITIGATION IN DIALOGUE GENERATION

Dialogue systems play an increasingly important role in various aspects of our daily life. It is evident from recent research that dialogue systems trained on human conversation data are biased. In particular, they can produce responses that reflect people's gender prejudice. Many debiasing methods have been developed for various NLP tasks, such as word embedding. However, they are not directly applicable to dialogue systems because they are likely to force dialogue models to generate similar responses for different genders. This greatly degrades the diversity of the generated responses and severely hurts the performance of the dialogue models. In this chapter, we propose a novel adversarial learning framework, Debiased-Chat, to train dialogue models that are free from gender bias while keeping their performance. Extensive experiments on two real-world conversation datasets show that our framework significantly reduces gender bias in dialogue models while maintaining response quality.

3.1 Chapter Introduction

The elimination of discrimination is an important issue that our society is facing. Machine learning algorithms, learning from human behaviors, have been shown to inherit the prejudices of humans [86]. A variety of AI applications have demonstrated common prejudices towards particular groups of people [99, 45, 101, 123, 113]. It is evident from recent research that learning-based dialogue systems also suffer from discrimination problems [70, 31].
Dialogue models show significant prejudices towards certain groups of people by producing biased responses to messages related to different genders [70]. A biased dialogue system will produce improper utterances, which can result in bad experiences for users or even cause negative social impacts [118, 72, 75]. Thus, with the increasing demand for dialogue agents in our daily lives, it is highly desirable to take the fairness issue into consideration when developing dialogue systems.

The gender bias in dialogues (we focus on two genders, i.e., male and female, in this work; it is straightforward to extend this work to other genders) comes from different dimensions: the gender of the person the speakers are talking about (speaking-about), the gender of the speaker (speaking-as), and the gender of the addressee (speaking-to) [32]. In this chapter, we focus on mitigating the gender bias in the speaking-about dimension. It is the most common form of gender bias in dialogues and exists in both the speaker-given dialogue scenario, where the personas of the speaker or the addressee are known [63, 128], and the speaker-agnostic dialogue scenario, where the information of the speakers is unknown. Given messages with the same content for different genders, dialogue models can produce biased responses, which have been measured in terms of their politeness and sentiment, as well as the presence of biased words [70]. Table 3.1 shows one example from a generative dialogue model trained on the Twitter dialogue corpus. When we change the words in the message from "he" to "she", the responses produced by the dialogue model are quite different. In particular, the dialogue model generates responses with negative sentiment for females.

Table 3.1: An example of gender bias in dialogue systems.
Message: "Really wishes he could take at least one step on this husker floor..."    Response: "I'm sure he's going to be a great guest."
Message: "Really wishes she could take at least one step on this husker floor..."   Response: "I'm sure she's a little jealous."

There are debiasing methods in NLP such as data augmentation [31] and word embedding regularization [70]. Directly applying these methods to mitigate the bias could encourage dialogue models to produce the same response for different genders. Such a strategy can lead to unreasonable responses such as "he gave birth to a baby" and also reduces the diversity of the generated responses. For different genders, the desired dialogue model should produce responses that are not only bias-free but also contain reasonable gender features. In other words, we should build a fair dialogue model without sacrificing its performance. To achieve this goal, we face three key challenges. First, dialogues contain various gender-related contents; in order to mitigate the bias, dialogue models should learn to distinguish biased contents from unbiased ones, and there is no trivial solution since bias can be expressed in many forms and has complicated patterns. Second, eliminating biased contents in the responses of dialogue models remains hard. Third, while removing the gender bias in generated responses, we also have to keep the reasonable unbiased gender features in them to avoid homogeneous responses for both genders.

In this chapter, we propose a novel framework, Debiased-Chat, to train bias-free generative dialogue models. We first introduce the concepts of unbiased and biased gender features in dialogues.
The former are treated as the reasonable gender information that should be kept in the responses, while the latter reflect gender bias and should be mitigated. Second, we propose a disentanglement model that learns to separate the unbiased gender features from the biased gender features of a gender-related utterance. Third, we propose an adversarial learning framework to train bias-free dialogue models that produce responses with unbiased gender features and without biased gender features. We empirically validate the effectiveness of our proposed framework by conducting experiments on two real-world dialogue datasets. The results demonstrate that our method significantly mitigates the gender bias in generative dialogue models while maintaining their ability to produce engaging and diverse responses with reasonable gender features.

3.2 The Proposed Framework

In this section, we detail the proposed framework. Note that in this chapter, we focus on the classical generative Seq2Seq dialogue model for single-turn dialogue generation, and we leave other settings such as the multi-turn case as future work. We first define two key concepts. We refer to the reasonable and fair gender features in a response as the unbiased gender features of the response. They include gendered terms and words or phrases specifically used to describe one gender. For example, in the response "she is an actress and famous for her natural beauty", "actress" is an unbiased gender feature for females. We refer to the unreasonable and discriminatory gender features in a response as the biased gender features. According to the definition of bias in dialogue models in [70], any offensive or sentimental expressions and biased words correlated with one gender are considered its biased gender features. For instance, given the same message with different genders as shown in Table 3.1, in the response to females, "I'm sure she's a little jealous", the word "jealous" is a biased gender feature in this context.

3.2.1 An Overview

With the aforementioned definitions, our proposed dialogue model aims to produce responses with unbiased gender features but free from biased gender features. Next, we give an overview of the proposed framework along with the design intuitions, which aim to address the challenges mentioned in the introduction section. The first challenge is how to distinguish biased gender features from unbiased ones. Given that the forms of gender bias in natural languages are complex, it is not feasible to manually design rules to recognize biased content in texts. To tackle this challenge, we adopt an automatic strategy, following the idea of adversarial learning. We propose a disentanglement model (right of Figure 3.1) that learns to separate the unbiased gender features f^(u) and the semantic features f^(s) of a gender-related utterance. The semantic features include all information of the utterance except the unbiased gender features, i.e., the content information and possibly biased gender features. We collect a set of unbiased gendered utterances and train the disentanglement model with the objectives that a discriminator can infer the gender of the utterance from the extracted unbiased gender features but not from the remaining semantic features. Thus, all the information for inferring the gender of the utterance comes from the unbiased gender features. With the above objectives, the model learns to disentangle the unbiased gender features from the other features.
When we apply the model to a biased utterance, it can automatically extract the unbiased gender features and leave the biased ones in the remaining semantic features. To address the second challenge (removing biased gender features from dialogues) and the third challenge (preserving unbiased gender features in dialogues), we propose our framework to train bias-free dialogue models (left of Figure 3.1). We adopt an adversarial learning scheme similar to that of the disentanglement model. Given a response from the dialogue model, its two disentangled feature vectors are fed into two discriminators, D1 and D2, respectively, to predict the gender of the dialogue (we assume that the message and the response of a single-turn dialogue are always related to the same gender, which we call the gender of the dialogue). For the dialogue model, the objective of adversarial training is to produce an unbiased response such that 1) its unbiased gender features can be used by D1 to correctly predict the gender of the dialogue, and 2) D2 cannot distinguish the gender. The intuition behind this design is as follows. With the first objective, the model is encouraged to produce responses with distinctive unbiased gender features. Moreover, if the dialogue model produces biased responses for one gender, D2 can easily learn to judge the gender from the co-occurrence of the biased gender features and that gender; with the second objective, we can therefore eliminate responses with biased gender features. We detail the disentanglement model and the bias-free dialogue generation process in the following subsections.

3.2.2 The Disentanglement Model

3.2.2.1 Unbiased Gendered Utterance Corpus

Given the dialogue corpus D, we collect all the gender-related utterances from it. Each of the utterances can be a message or a response that contains at least one male word but no female word, or vice versa. Then, we filter out all utterances that could be biased. Following the bias measurements in [70], we remove all the utterances which 1) are offensive, 2) show strong positive or negative sentiment polarity, or 3) contain career or family words. The remaining utterances form an Unbiased Gendered Utterance Corpus U = {(U_i, g_i)}_{i=1}^{M}, where U_i is the i-th utterance and g_i is its gender label. The corpus is used to train the disentanglement model.

3.2.2.2 Model Design

The illustration of the disentanglement model is shown on the right of Figure 3.1.

Figure 3.1: An overview of our proposed framework. The solid lines indicate the direction of data flow, while the dashed lines denote the direction of supervision signals during training.

Autoencoder. We adopt an autoencoder as the disentanglement model, in which both the encoder and the decoder are implemented using recurrent neural networks (RNN) with gated recurrent unit (GRU) cells [23]. The encoder learns to encode an utterance U into a latent vector h ∈ R^d. The latent vector h is then mapped into the space of unbiased gender features R^u and the space of semantic features R^s by two one-layer feedforward networks, respectively, to obtain the unbiased gender features f^(u) and the semantic features f^(s). The concatenation of the unbiased gender and semantic features, f = [f^(u) : f^(s)], is then fed into the decoder to reconstruct the original utterance U.
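For concreteness, the following is a minimal PyTorch-style sketch of such a disentangling autoencoder. It is only an illustration of the structure described above, with default dimensions matching the settings reported in Section 3.3.2.1; it is not the thesis implementation.

```python
import torch
import torch.nn as nn


class DisentanglingAutoencoder(nn.Module):
    """Sketch of the disentanglement model: encoder -> (f_u, f_s) -> decoder."""

    def __init__(self, vocab_size, emb_dim=300, hidden=1000, d_u=200, d_s=800):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.to_gender = nn.Linear(hidden, d_u)     # f^(u): unbiased gender features
        self.to_semantic = nn.Linear(hidden, d_s)   # f^(s): semantic features
        self.decoder = nn.GRU(emb_dim, d_u + d_s, batch_first=True)
        self.out = nn.Linear(d_u + d_s, vocab_size)

    def forward(self, utterance, shifted_targets):
        # Encode the utterance into a single latent vector h.
        _, h = self.encoder(self.embedding(utterance))
        h = h.squeeze(0)
        f_u, f_s = self.to_gender(h), self.to_semantic(h)
        # The concatenated features f = [f^(u) : f^(s)] initialize the decoder
        # state, which reconstructs the original utterance.
        state = torch.cat([f_u, f_s], dim=-1).unsqueeze(0)
        dec_out, _ = self.decoder(self.embedding(shifted_targets), state)
        return self.out(dec_out), f_u, f_s
```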
Discriminators. In the autoencoder, to disentangle the latent representation h into the unbiased gender features f^(u) and the semantic features f^(s), we take advantage of the idea of adversarial learning. We first train two discriminators, D_1^(det) and D_2^(det), to distinguish whether the utterance U is related to male or female based on the unbiased gender features f^(u) and the semantic features f^(s), respectively. The discriminators are implemented via one-layer feedforward neural networks, which predict the probability distributions over the genders, p^(u) ∈ R^2 and p^(s) ∈ R^2, based on f^(u) and f^(s), respectively.

Adversarial Training. In the adversarial training process, we want the discriminator D_1^(det) to make correct predictions, while D_2^(det) should not. The outputs of the discriminators are used as signals to train the disentanglement model so that it assigns the gender-related information to the unbiased gender features f^(u) while ensuring that the semantic features f^(s) do not include any gender information. Thus, we define two losses in terms of the discriminators D_1^(det) and D_2^(det):

\mathcal{L}_{D_1^{(det)}} = -\big( \mathbb{I}\{g = 0\} \log p_0^{(u)} + \mathbb{I}\{g = 1\} \log p_1^{(u)} \big)    (3.1)

\mathcal{L}_{D_2^{(det)}} = -\big( p_0^{(s)} \log p_0^{(s)} + p_1^{(s)} \log p_1^{(s)} \big)    (3.2)

where g is the gender label of the utterance and p_i^(u), p_i^(s) are the i-th elements of p^(u) and p^(s), respectively. \mathcal{L}_{D_1^{(det)}} is the cross-entropy loss on p^(u); minimizing it forces D_1^(det) to make correct predictions. \mathcal{L}_{D_2^{(det)}} is the entropy of the predicted distribution p^(s); minimizing it pushes p^(s) toward a uniform distribution, so that D_2^(det) tends to make random predictions.

To further ensure that only f^(s) encodes the content information of the utterance, following [51], we add two more discriminators, D_3^(det) and D_4^(det), and assign them to predict the bag-of-words (BoW) features of the utterance based on f^(u) and f^(s), respectively. Given an utterance, we first remove all stopwords and gender words in it (we use the stopword list provided by the Natural Language Toolkit (NLTK) [77] and a pre-defined vocabulary of gender words released in the appendix of [70]; the vocabulary contains gender-specific pronouns, possessive words, occupation words, kinship words, etc., such as "his", "her", "waiter", "waitress", "brother", "sister"). Then, its BoW feature is represented as a sparse vector B = { #count(w_i) / L }_{i=1}^{|V|} of length equal to the vocabulary size |V|, where #count(w_i) is the frequency of w_i in the utterance and L is the length of the utterance after removal. The discriminators D_3^(det) and D_4^(det) are also implemented via one-layer feedforward neural networks, which produce the predicted BoW distributions p̃^(u) ∈ R^{|V|} and p̃^(s) ∈ R^{|V|} based on f^(u) and f^(s), respectively. Similar to Eqs. (3.1) and (3.2), we optimize the disentanglement model with two additional losses:

\mathcal{L}_{D_3^{(det)}} = -\sum_{i=1}^{|V|} \tilde{p}_i^{(u)} \log \tilde{p}_i^{(u)}

\mathcal{L}_{D_4^{(det)}} = -\sum_{i=1}^{|V|} B_i \log \tilde{p}_i^{(s)}

where B_i, p̃_i^(u), and p̃_i^(s) are the i-th elements of B, p̃^(u), and p̃^(s), respectively. We denote the reconstruction loss of the autoencoder as \mathcal{L}_{rec}. The final objective function for optimizing the disentanglement model is then

\mathcal{L}^{(det)} = \mathcal{L}_{rec} + k_1 \mathcal{L}_{D_1^{(det)}} + k_2 \mathcal{L}_{D_2^{(det)}} + k_3 \mathcal{L}_{D_3^{(det)}} + k_4 \mathcal{L}_{D_4^{(det)}}

where k_1, ..., k_4 are hyper-parameters that adjust the contributions of the corresponding losses.
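To make the objective concrete, here is a minimal PyTorch-style sketch of the four losses above; tensor shapes and variable names are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn.functional as F


def disentanglement_losses(p_u, p_s, bow_u, bow_s, bow_true, gender):
    """p_u, p_s:     (batch, 2) gender distributions from D1^(det) and D2^(det)
    bow_u, bow_s:    (batch, |V|) BoW distributions from D3^(det) and D4^(det)
    bow_true:        (batch, |V|) normalized bag-of-words vectors B
    gender:          (batch,) gender labels g in {0, 1}
    """
    eps = 1e-8
    # Eq. (3.1): cross-entropy -- D1^(det) should predict the gender from f^(u).
    l_d1 = F.nll_loss(torch.log(p_u + eps), gender)
    # Eq. (3.2): entropy of p^(s) -- push D2^(det) toward random guessing.
    l_d2 = -(p_s * torch.log(p_s + eps)).sum(dim=-1).mean()
    # Entropy of the BoW prediction from f^(u): f^(u) should not encode content.
    l_d3 = -(bow_u * torch.log(bow_u + eps)).sum(dim=-1).mean()
    # Cross-entropy with the true BoW vector: f^(s) should encode content.
    l_d4 = -(bow_true * torch.log(bow_s + eps)).sum(dim=-1).mean()
    return l_d1, l_d2, l_d3, l_d4

# Total objective (with the reconstruction loss l_rec of the autoencoder):
# L_det = l_rec + k1 * l_d1 + k2 * l_d2 + k3 * l_d3 + k4 * l_d4
```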
3.2.2.3 Training Process

We train the discriminators and the disentanglement model DET alternately for n_epoch epochs. On each batch of training data, we first update the discriminators D_2^(det) and D_3^(det) on their corresponding cross-entropy losses to train them to make correct predictions. Then we optimize DET together with D_1^(det) and D_4^(det) on the loss \mathcal{L}^{(det)}. The reason why D_2^(det) and D_3^(det) are trained independently while D_1^(det) and D_4^(det) are trained together with DET is that the training objectives of the former are adversarial to those of DET, whereas the training objectives of the latter are consistent with those of DET.

3.2.3 Bias-free Dialogue Generation

3.2.3.1 Model Design

As shown on the left of Figure 3.1, the dialogue model is treated as the generator in adversarial learning. Given a message, it generates a response. The response is projected into its unbiased gender feature vector f^(u) and its semantic feature vector f^(s) through the disentanglement model. The two feature vectors are fed into two discriminators, D1 and D2, respectively, to predict the gender of the dialogue. Both D1 and D2 are implemented as three-layer feedforward neural networks with ReLU activations. We train the dialogue model with two objectives: 1) D1 can successfully predict the gender, and 2) D2 fails to predict the gender correctly. Hence, we define two additional losses, \mathcal{L}_{D_1} and \mathcal{L}_{D_2}, in the same form as \mathcal{L}_{D_1^{(det)}} and \mathcal{L}_{D_2^{(det)}} (Eqs. (3.1) and (3.2)), respectively.

3.2.3.2 Training Process

The optimization process is detailed in Algorithm 1. We first pre-train the dialogue model G with the original MLE loss on the complete training set. Then, we train the dialogue model and the two discriminators alternately. In each loop, we first train the discriminator D2 for D_steps steps (lines 2 to 7). At each step, we sample a batch of examples {(X_i, Y_i, g_i)}_{i=1}^{n} from a gendered dialogue corpus D^(g) = {(X_i, Y_i, g_i)}_{i=1}^{N^(g)}, which contains N^(g) message-response pairs (X_i, Y_i) whose messages contain at least one male word but no female word, or vice versa; each dialogue is assigned a gender label g_i. Given the message X_i, we sample a response Ŷ_i from G. We update D2 by optimizing the cross-entropy (CE) loss to force D2 to correctly classify the sampled response Ŷ_i as g_i. Then we update the dialogue model G along with D1 (lines 8 to 14) by optimizing the compound loss:

\mathcal{L} = \mathcal{L}_{MLE} + k_1' \mathcal{L}_{D_1} + k_2' \mathcal{L}_{D_2}

where \mathcal{L}_{MLE} is the MLE loss on {(X_i, Y_i)}_{i=1}^{n}. To calculate the losses \mathcal{L}_{D_1} and \mathcal{L}_{D_2}, we sample a response Ŷ_i for the message X_i from the dialogue model G and pass Ŷ_i through \mathcal{L}_{D_1} and \mathcal{L}_{D_2}. However, the sampling operation is not differentiable, so gradients cannot be back-propagated to G. To address this problem, we take advantage of the Gumbel-Softmax trick [49, 60] to approximate the sampling operation. Besides, it has been pointed out that the teacher forcing strategy can effectively alleviate the instability problem in adversarial text generation [64]. We also need to maintain the performance of the dialogue model on gender-unrelated dialogues. Thus, we train the dialogue model G on a neutral dialogue corpus D^(n) by optimizing the MLE loss for G_teach_steps steps in each loop (lines 15 to 19). The neutral dialogue corpus D^(n) = {(X_i, Y_i)}_{i=1}^{N^(n)} is also a subset of the dialogue corpus D and contains gender-unrelated dialogues whose messages have no gender words.
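As noted above, the sampling operation Ŷ_i ∼ G(·|X_i) is approximated with the Gumbel-Softmax trick so that gradients can reach G. The following is a minimal illustrative sketch of that approximation, assuming decoder logits and an embedding layer shared with the discriminators' input; it is not the thesis code.

```python
import torch.nn.functional as F


def sample_soft_tokens(logits, embedding, tau):
    """Differentiable approximation of sampling a response token by token.

    logits:    (batch, seq_len, vocab) decoder outputs of the dialogue model G
    embedding: nn.Embedding used to feed the sampled response to the discriminators
    tau:       Gumbel-Softmax temperature (annealed during training)
    Returns soft word embeddings through which gradients can flow back to G.
    """
    # One relaxed (approximately one-hot) sample per position.
    y_soft = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
    # Soft embedding: expected embedding under the relaxed sample.
    return y_soft @ embedding.weight
```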
We stop the training process once the dialogue model passes the fairness test on the fairness validation corpus F, which is constructed following [70].

Algorithm 1: Adversarial training process for bias-free dialogue generation.
Input: Gendered dialogue corpus D^(g), neutral dialogue corpus D^(n), fairness test corpus F, pre-trained dialogue model G, disentanglement model DET, hyper-parameters k_0', k_1', k_2' and D_steps, G_steps, G_teach_steps.
Output: a bias-free dialogue model G.
1:  repeat
2:    for D_steps do
3:      Sample {(X_i, Y_i, g_i)}_{i=1}^{n} from D^(g)
4:      Sample Ŷ_i ∼ G(·|X_i)
5:      Calculate the CE loss on {(Ŷ_i, g_i)}_{i=1}^{n}
6:      Update D2 by optimizing the CE loss
7:    end for
8:    for G_steps do
9:      Sample {(X_i, Y_i, g_i)}_{i=1}^{n} from D^(g)
10:     Calculate the loss L_MLE on {(X_i, Y_i)}_{i=1}^{n}
11:     Sample Ŷ_i ∼ G(·|X_i)
12:     Calculate the additional losses L_{D_1} and L_{D_2} on {(Ŷ_i, g_i)}_{i=1}^{n}
13:     Update G together with D1 by optimizing the loss L
14:    end for
15:    for G_teach_steps do
16:      Sample {(X_i, Y_i)}_{i=1}^{n} from D^(n)
17:      Calculate the MLE loss on {(X_i, Y_i)}_{i=1}^{n}
18:      Update G by optimizing the MLE loss
19:    end for
20: until G passes the fairness test on F

3.2.4 Discussion

As mentioned before, in this chapter we follow the definitions and measurements of gender bias in dialogues in [70]. One can extend the bias definitions to other forms, and one can extend the bias measurements by expanding the list of biased attribute words or by including new aspects of a response that may reflect bias beyond politeness, sentiment, etc. It is worth noting that our framework is flexible with respect to any definition and measurement. To accommodate a new definition or measurement, one only needs to follow it to build a new unbiased gendered utterance corpus. Trained on that corpus, the disentanglement model learns to distinguish unbiased and biased gender features according to the new definition or measurement. Then, with the disentanglement model, the bias-free dialogue model learns to remove the newly defined biased gender features while preserving the unbiased gender features.

3.3 Experiment

In this section, we validate the effectiveness of the proposed framework. We first introduce the datasets and then discuss the experiments for the disentanglement model and bias-free dialogue generation. Finally, we further demonstrate the framework via a case study.

3.3.1 Datasets

Twitter Conversation Dataset. The Twitter conversation dataset (available at https://github.com/Marsan-Ma/chat_corpus/) is a public human conversation dataset collected from the Twitter platform. The training set, validation set, and test set contain 2,580,433, 10,405, and 10,405 single-turn dialogues, respectively.

Reddit Movie Dialogue Dataset. The Reddit movie dialogue dataset [35] is a public dataset collected from the movie channel of the Reddit forum. The original dataset contains 2,255,240 single-turn dialogues. We remove all the dialogues whose messages or responses are longer than 50 words and all the dialogues with URLs. From the remaining data, we randomly keep 500,000 dialogues for training, 8,214 for validation, and 8,289 for testing.

Table 3.2: Results of gender classification based on disentangled features.
             Twitter                 Reddit
             Gender    Semantics     Gender    Semantics
Accuracy     0.9708    0.6804        0.9996    0.5996

3.3.2 Experiment for Disentanglement Model

3.3.2.1 Experimental Settings

In the autoencoder, both the encoder and the decoder are implemented as one-layer GRU networks with a hidden size of 1,000. The word embedding size is set to 300. The sizes of the unbiased gender features and the semantic features are set to 200 and 800, respectively. The vocabulary size is 30,000.
We set k_1 = 10, k_2 = 1, k_3 = 1, and k_4 = 3. The unbiased gendered utterance corpus used to train the disentanglement model is constructed from the training set of the dialogue dataset, as described in Section 3.2.2. We obtain 288,255 and 57,598 unbiased gendered utterances for Twitter and Reddit, respectively. We hold out 5,000 utterances for testing, and the rest are used for training. We train the disentanglement model for 20 epochs with a batch size of 32.

3.3.2.2 Experimental Results

We design an experiment to explore whether the disentanglement model learns to separate the unbiased gender features from the semantic features successfully. We train two linear classifiers with the same structure as the discriminators D_1^(det) and D_2^(det) to classify the gender of an utterance based on the disentangled unbiased gender features and the semantic features, respectively. The classification accuracy on the test set is shown in Table 3.2. We find that the classifier based on the unbiased gender features achieves a very high accuracy of over 95%, while the performance of the classifier based on the semantic features is only slightly higher than random guessing. This indicates that gender-related information is predominantly encoded in the unbiased gender features while being largely excluded from the semantic features. These observations suggest that our disentanglement model can successfully disentangle the unbiased gender features from the semantic features.

Figure 3.2: A visualization of the disentangled features using t-SNE. Green points indicate male utterances and orange points indicate female utterances.

We randomly sample 400 male and 400 female utterances from the test set and pass them through the disentanglement model to obtain their unbiased gender features and semantic features. We reduce their dimensionality with t-distributed Stochastic Neighbor Embedding (t-SNE) [79] and show the results in two plots. As shown in Figure 3.2, the unbiased gender features are clearly separated into two areas, while the semantic features are evenly mixed together. This further verifies that the disentanglement model works as expected.

3.3.3 Experiment for Bias-free Dialogue Generation

3.3.3.1 Baselines

We directly apply two existing debiasing methods to dialogue models as baselines.

Counterpart Data Augmentation (CDA). This method tries to mitigate the gender bias in dialogue models by augmenting the training data [70, 31].
For each message-response pair in the original training set that contains gender words, we replace all the gender words with their counterparts (e.g., "he" and "she", "man" and "woman") and obtain a parallel dialogue, which is added to the training set as augmented data.

Word Embedding Regularization (WER). In this method [70], besides the original MLE loss, we train the dialogue model with an auxiliary regularization loss that reduces the difference between the embeddings of the gender words and those of their counterparts. We empirically set the weight of the regularization term to k = 0.25.

Table 3.3: Fairness evaluation on Twitter. Green values indicate that the absolute value of the difference drops compared with the original model, while red values indicate that it increases.
Twitter                               Male      Female    Diff.      p
Original Model    Offense Rate (%)    17.457    22.290    -27.7%     < 10^-5
                  Senti.Pos. (%)      12.160    4.633     +61.9%     < 10^-5
                  Senti.Neg. (%)      0.367     1.867     -408.7%    < 10^-5
                  Career Word         0.0136    0.0019    +85.8%     < 10^-5
                  Family Word         0.0317    0.1499    -372.4%    < 10^-5
CDA               Offense Rate (%)    30.767    32.073    -4.2%      < 10^-3
                  Senti.Pos. (%)      3.013     2.840     +5.7%      0.208
                  Senti.Neg. (%)      0.593     0.543     +8.4%      0.415
                  Career Word         6.7e-05   1.7e-04   -149.3%    0.491
                  Family Word         0.0038    0.0051    -34.5%     0.107
WER               Offense Rate (%)    24.147    24.140    +0.03%     0.985
                  Senti.Pos. (%)      5.207     5.210     -0.06%     0.985
                  Senti.Neg. (%)      0.080     0.080     0.0%       1.0
                  Career Word         0.0005    0.0005    0.0%       1.0
                  Family Word         0.0071    0.0071    0.0%       1.0
Debiased-Chat     Offense Rate (%)    12.797    13.273    -3.7%      0.083
                  Senti.Pos. (%)      3.283     2.907     +11.5%     0.008
                  Senti.Neg. (%)      0.077     0.070     +9.1%      0.763
                  Career Word         0.0006    0.0004    +27.8%     0.398
                  Family Word         0.0035    0.0038    -8.6%      0.568

Table 3.4: Fairness evaluation on Reddit. Green values indicate that the absolute value of the difference drops compared with the original model, while red values indicate that it increases.
Reddit                                Male      Female    Diff.       p
Original Model    Offense Rate (%)    21.343    27.323    -28.0%      < 10^-5
                  Senti.Pos. (%)      0.340     0.237     +30.3%      0.018
                  Senti.Neg. (%)      0.047     0.180     -283.0%     < 10^-5
                  Career Word         0.202     0.138     +31.6%      < 10^-5
                  Family Word         3.67e-4   7.67e-4   -109.0%     0.045
CDA               Offense Rate (%)    38.317    52.900    -38.1%      < 10^-5
                  Senti.Pos. (%)      0.347     0.413     -19.0%      0.184
                  Senti.Neg. (%)      0.010     0.007     +30%        0.655
                  Career Word         0.321     0.797     -148.0%     < 10^-5
                  Family Word         1.67e-4   2.07e-3   -1137.7%    < 10^-5
WER               Offense Rate (%)    48.057    48.057    0.0%        1.0
                  Senti.Pos. (%)      2.473     2.473     0.0%        1.0
                  Senti.Neg. (%)      0.130     0.130     0.0%        1.0
                  Career Word         0.402     0.402     0.0%        1.0
                  Family Word         3.3e-05   3.3e-05   0.0%        1.0
Debiased-Chat     Offense Rate (%)    17.383    17.823    -2.5%       0.157
                  Senti.Pos. (%)      0.750     0.770     -2.7%       0.451
                  Senti.Neg. (%)      0.030     0.033     -10%        0.639
                  Career Word         0.150     0.113     +24.7%      0.216
                  Family Word         0.0       3.3e-05   /           0.317

3.3.3.2 Experimental Settings

For the Seq2Seq dialogue models, the encoder and the decoder are implemented as three-layer LSTM networks with a hidden size of 1,024. The word embedding size is set to 300, and the vocabulary size is 30,000. The original model is trained with the standard stochastic gradient descent (SGD) algorithm with a learning rate of 1.0. In the adversarial training process of Debiased-Chat, both the dialogue model and the discriminators are trained with the Adam optimizer [58] with an initial learning rate of 0.001. The temperature value τ for Gumbel-Softmax is initialized to 1.0 and is divided by 1.1 every 200 iterations; it stops decreasing when τ < 0.3. Hyper-parameters are empirically set as k_1' = k_2' = 1, D_steps = 2, G_steps = 2, and G_teach_steps = 1. All the models are trained on NVIDIA Tesla K80 GPUs.
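For reference, the temperature schedule described above can be written as a small sketch (an illustrative approximation of the schedule, not the thesis code):

```python
def gumbel_temperature(iteration, tau0=1.0, decay=1.1, every=200, floor=0.3):
    """Temperature tau starts at tau0 and is divided by `decay` every `every`
    iterations; it is clamped at `floor`, approximating the rule that the
    schedule stops decreasing once tau falls below 0.3."""
    tau = tau0 / (decay ** (iteration // every))
    return max(tau, floor)
```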
3.3.3.3 Experimental Results

We first conduct a fairness test on the baselines and our model to compare their debiasing ability, and then compare the quality of the responses they generate in terms of relevance and diversity.

Fairness Evaluation. Following [70], we formulate the fairness analysis as a hypothesis test problem. We test whether a dialogue model is fair for males and females in terms of various measurements: offense, sentiment, career words, and family words. We construct fairness test corpora containing 30,000 parallel message pairs, as described in [70], from the Twitter dataset and the Reddit dataset, respectively. Each parallel message pair consists of a male-related message and a female-related message; the two messages have the same content, and only the gender words in them differ. In Table 3.3 and Table 3.4, we report the results of the fairness evaluation. "Offense Rate" is the offense rate of the produced responses to male- and female-related messages; "Senti.Pos/Neg" indicates the rate of responses with positive and negative sentiment; and "Career Word" and "Family Word" indicate the average number of career and family words appearing in one response. We also report the difference in the measurements between the two genders, as well as the p-value. We consider the dialogue model unfair to the two genders in terms of a measurement if p < 0.05. We make the following observations. First, the original model shows significant gender bias: female-related messages tend to receive more offensive responses, fewer positive responses, and more negative responses; career words are more likely to appear in the responses to male-related messages, while family words are more likely to appear in the responses to female-related messages. Second, CDA mitigates the bias to some degree, but its performance is not stable, and in some cases the bias is even amplified. Third, WER seems to eliminate the bias completely, but in fact it generates almost identical responses to male- and female-related messages, which hurts the quality of the responses, as shown below. Finally, our proposed framework steadily reduces the gender bias in a dialogue model to a reasonable level.

Quality Evaluation. We then evaluate the quality of the responses generated by the original and debiased dialogue models in terms of relevance and diversity. We perform the evaluation on the test sets of the two dialogue datasets. For relevance, we report the BLEU score between generated responses and ground truths. For diversity, we report the "Distinct" metric proposed in [62]. The results are shown in Table 3.5. From the table, we observe that in terms of relevance, our model behaves comparably to the original model. This means that while our method reduces bias, it does not hurt the quality of the responses. Besides, since our model encourages the responses to be reasonably different for male- and female-related messages, our model achieves better performance than the original model and the baseline models in terms of diversity.
Table 3.5: Quality evaluation. All the numbers shown in the table are percentages.
                                     Relevance                        Diversity
Dataset    Model             BLEU-1    BLEU-2    BLEU-3     Distinct-1    Distinct-2
Twitter    Original Model    7.401     2.107     1.004      0.760         2.904
           CDA               7.150     1.875     0.803      0.376         1.278
           WER               6.896     2.174     1.029      0.516         1.911
           Debiased-Chat     7.652     2.010     0.872      0.961         3.459
Reddit     Original Model    11.918    2.735     0.823      0.158         0.514
           CDA               11.385    2.598     0.804      0.106         0.302
           WER               12.040    2.832     0.833      0.227         0.834
           Debiased-Chat     12.793    2.952     0.935      0.344         0.923

Table 3.6: Case study. Each model's responses are listed as: response to the male-related message / response to the female-related message.
Case 1
  Message (male):    "He ain't cooking, that's the problem!"
  Message (female):  "She ain't cooking, that's the problem!"
  Original Model:    "He's a real one." / "She's a bitch."
  CDA:               "I'm not sure what you mean by that." / "I'm not sure what you mean by that."
  WER:               "I know right ?!?!" / "I know right ?!?!"
  Debiased-Chat:     "I know right?" / "I was just thinking about how much I love her."
Case 2
  Message (male):    "This poor boy is sick I feel so bad u may not try and get with his."
  Message (female):  "This poor girl is sick I feel so bad u may not try and get with her."
  Original Model:    "I'm sorry to hear that." / "She's a good person."
  CDA:               "I'm so sorry. I'm so sorry." / "I'm so sorry. I'm so sorry."
  WER:               "I don't think she is. I just don't think she is." / "I don't think she is. I just don't think she is."
  Debiased-Chat:     "He is a very handsome man." / "I love her and she is a beautiful woman and she is a beautiful woman."

3.3.4 Case Study

To further demonstrate the effectiveness of the proposed framework, we show two pairs of parallel messages and the responses produced by various dialogue models in Table 3.6. In the first case, the responses generated by the original model show bias. Among the debiased dialogue models, the CDA model and the WER model generate the same response for the two messages, which shows that both of them mitigate bias crudely by producing responses with similar content. Our model generates responses that are free from bias, and the responses for the two genders are different. In the second case, the CDA model and the WER model still produce identical, dull responses for the two messages. However, our model produces responses with distinct gender features. The words "handsome", "man" and "beautiful", "woman" are recognized by the disentanglement model as unbiased gender features for males and females, respectively, and they are encouraged to appear in the responses to male- and female-related messages. The two examples demonstrate that our model increases the diversity of responses for different genders while mitigating gender bias.

3.4 Related Work

Fairness problems in natural language processing have received increasing attention [86]. Word embeddings exhibit human bias learned from text data. Researchers have found that in word embeddings trained on large-scale real-world text data, the word "man" is mapped to "programmer" while "woman" is mapped to "homemaker" [12]. They propose a two-step method for debiasing word embeddings. Some works extend the research on bias in word embeddings to sentence embeddings. In [83], the authors propose the Sentence Encoder Association Test (SEAT) based on the Word Embedding Association Test (WEAT) [48]. They examine popular sentence encoding models from CBoW and GPT to ELMo and BERT and show that various sentence encoders inherit human prejudices from the training data. For the task of coreference resolution, a benchmark named WinoBias is proposed in [131] to measure gender bias; this work also provides a debiasing method based on data augmentation. The work [13] first explores gender bias in language models. The authors propose a measurement to evaluate the bias in well-trained language models as well as in the training corpus.
They propose to add a regularization term to the loss function to minimize the projection of word embeddings onto the gender subspace. Dialogue systems have been shown to be sensitive to the input messages [89, 129, 122]. They can produce very different responses to messages with the same content but different gender terms, which may reflect the social bias of humans. The work [70] first studies the bias in dialogue systems: it defines measurements to evaluate the fairness of a dialogue model and shows that significant gender and race bias exists in popular dialogue models. The paper [31] analyzes gender bias in persona-based dialogue models and proposes a combined debiasing method. Since their debiasing method involves human labor and is therefore not easy to reproduce, we only compare our method with their objective data augmentation technique. While the authors of [31] encourage dialogue models to produce responses whose gender is indistinguishable, our proposed model tries to produce responses whose gender can be recognized from unbiased gender features rather than from biased gender features.

CHAPTER 4
UNDERSTANDING AND MITIGATING IMPLICIT BIAS IN DEEP TEXT CLASSIFICATION

It is evident that deep text classification models trained on human data can be biased. In particular, they produce biased outcomes for texts that explicitly include identity terms of certain demographic groups. We refer to this type of bias as explicit bias, which has been extensively studied. However, deep text classification models can also produce biased outcomes for texts written by authors of certain demographic groups. We refer to such bias as implicit bias, of which we still have a rather limited understanding. In this chapter, we first demonstrate that implicit bias exists in different text classification tasks for different demographic groups. Then, we build a learning-based interpretation method to deepen our knowledge of implicit bias. Specifically, we verify that classifiers learn to make predictions based on language features that are related to the demographic attributes of the authors. Next, we propose a framework, Debiased-TC, to train deep text classifiers to make predictions based on the right features and consequently mitigate implicit bias. We conduct extensive experiments on three real-world datasets. The results show that the text classification models trained under our proposed framework outperform traditional models significantly in terms of fairness, and also slightly in terms of classification performance.

4.1 Chapter Introduction

Many recent studies have suggested that machine learning algorithms can learn social prejudices from data produced by humans and thereby show systemic bias in performance towards specific demographic groups or individuals [86, 9, 109]. As one machine learning application, text classification has been proven to be discriminatory towards certain groups of people [34, 14]. Text classification applications such as sentiment analysis and hate speech detection are common and widely used in our daily lives. If a biased hate speech detection model is deployed by a social media service provider to filter users' comments, the comments related to different demographic groups can have uneven chances of being recognized and removed.
Such a case will cause unfairness and bring negative experiences to users. Thus, it is highly desirable to mitigate the bias in text classification.

Most existing studies on bias and fairness in text classification focus on the bias towards the individuals mentioned in the text content. For example, in [34, 90, 126], it is investigated how text classification models perform unfairly on texts containing demographic identity terms such as "gay" and "muslim". In such scenarios, the demographic attributes of the individuals subject to bias explicitly exist in the text. In this chapter, we refer to this kind of bias as explicit bias. Bias in texts, however, can be reflected more subtly and insidiously. While a text may not contain any reference to a specific group or individual, its content can be revealing of the demographic information of the author. As shown in [27, 95], the language style (e.g., wording and tone) of a text can be highly correlated with its author's demographic attributes (e.g., age, gender, and race). We find that a text classifier can learn to associate the content with demographic information and consequently make unfair decisions towards certain groups. We refer to such bias as implicit bias.

Table 4.1: An illustrative example of the implicit bias of a CNN text classification model.
Author              Text                                                                     Label       Prediction
White American      "Can't wait to visit your new home. Yes, I going to be a great guest!"   positive    positive
African American    "Can't wait to visit your new home. Yup, I goin to be a great guest!"    positive    negative

Table 4.1 demonstrates an example of implicit bias. There are two short texts, where the first text is written by a white American and the second one is written by an African American. The task is to predict the sentiment of a text with a convolutional neural network (CNN) model. Words with a red background indicate those with salient predictive capability for the model, where the darker the color, the more salient the word. The words "yup" and "goin" in the second text are commonly used by African Americans [70] and are irrelevant to the sentiment. However, the CNN model relies on them and consequently predicts a positive text to be negative. In this chapter, we aim to understand and mitigate implicit bias in deep text classification models.

One key source of bias is the imbalance of training data [34, 90]. Thus, existing debiasing methods mainly focus on balancing the training data, such as adding new training data [34] and augmenting data based on identity-term swaps [90]. However, these methods cannot be directly applied to mitigate implicit bias. Obtaining new texts from authors of various demographic groups is very expensive and requires heavy human labor. Meanwhile, given that there is no explicit demographic information in texts, identity-term swap data augmentation is not applicable. Thus, we propose to enhance deep text classification models to mitigate implicit bias during the training process. To achieve this goal, we face substantial challenges. First, to mitigate the implicit bias, we have to understand how deep models behave, for example, how they correlate implicit features in text with demographic attributes and how they make biased predictions. Second, we need to design new mechanisms that take advantage of this understanding to mitigate the implicit bias in deep text classifiers.

To address the above challenges, in this chapter we first propose an interpretation method that sheds light on the formation mechanism of implicit bias in deep text classification models. We show that the implicit bias is caused by the fact that the models make predictions based on incorrect language features in the texts.
Second, based on this finding, we propose a novel framework, Debiased-TC (Debiased Text Classification), to mitigate the implicit bias of deep text classifiers. More specifically, we equip deep classifiers with an additional saliency selection layer that first determines the correct language features on which the model should base its predictions. We also propose an optimization method to train the classifiers with the saliency selection layer. Note that both our proposed interpretation method and the learning framework are model-agnostic, which means that they can be applied to any deep text classifier. We evaluate the framework with two popular deep text classification models across various text classification tasks on three public datasets. The experimental results demonstrate that our method significantly mitigates the implicit bias in the classification models while maintaining or even improving their prediction performance.

4.2 Preliminary Study

In this section, we perform a preliminary study to validate the existence of implicit bias in deep text classification models. We first introduce the data and text classification tasks and then present the empirical results.

4.2.1 Data and Tasks

In the preliminary study, we investigate different text classification tasks and various demographic groups to validate the implicit bias. We use three datasets: the DIAL and PAN16 datasets processed by [38] and the Multilingual Twitter Corpus (MTC) introduced in [46]. The statistics of these datasets are shown in Table 4.2. In the table, the "Task" column shows the text classification tasks included in a dataset; "Sentiment" is short for sentiment analysis, "Mention" is short for mention detection, and "Hate Speech" is short for hate speech detection. The "Demog." column indicates the demographic attribute of the tweet authors collected in a dataset. The "Size" column shows the total number of instances in a dataset, where each instance is a tweet. The "Avg.Len." column shows the average number of words per instance in a dataset.

Table 4.2: Statistics of the datasets.
Dataset    Task           Demog.    Size       Avg.Len.
DIAL       Sentiment      Race      317,151    11.20
           Mention        Race      400,000    10.56
PAN16      Mention        Gender    175,871    14.64
           Mention        Age       175,471    14.55
MTC        Hate Speech    Race      47,627     19.60

The DIAL dataset contains dialectal texts collected from Twitter. Each tweet is associated with the race of its author as the demographic attribute, denoted as "white" or "black". The dataset is annotated for two classification tasks: sentiment analysis and mention detection. The sentiment analysis task aims to categorize a text as "happy" or "sad". The mention detection task tries to determine whether a tweet mentions another user, which can also be viewed as distinguishing conversational tweets from non-conversational ones. The dataset is annotated based on the dialectal tweet corpus [10], which contains 59.2 million tweets from 2.8 million users. The race attribute is annotated by an automated probabilistic inference method based on the geolocation information of the user and the tweet text. Given that geolocation information (residence) is highly associated with the race of a user, the method can make accurate predictions. To further ensure accuracy, DIAL only keeps annotations with confidence above 80%.

The PAN16 dataset [96] consists of tweets. For each tweet, the age and gender of its author have been manually labelled.
The demographic attribute age has two categories, "18-34" and "≥ 35", and gender has "male" and "female". This dataset is also annotated for the mention detection task described above. It contains 436 Twitter users, each of whom has up to 1,000 tweets. The age and gender of the users are manually annotated by referring to their LinkedIn profiles: annotators judge the gender based on the user's name and profile photo, and the age is inferred from the user's birth date or degree starting date.

The MTC dataset [46] contains multilingual tweets for the hate speech detection task. Each tweet is annotated as "hate speech" or "non hate speech" and is associated with four demographic attributes of its author: race, gender, age, and country. The dataset is annotated based on 7 published Twitter hate speech datasets in five languages. In our experiments, we focus only on the English corpus and the attribute race, which has two categories, i.e., "white" and "nonwhite". The race of a user is inferred by the computer vision API Face++ (https://www.faceplusplus.com/) based on the profile photo.

4.2.2 Empirical Study

In this subsection, we empirically study whether text classification models make predictions that depend on the demographic attributes of the authors of the texts. The explicit bias in text classification tasks stems from the imbalance of training data [34, 90]. For example, when there are more negative examples from one group in the training data, the model learns to correlate that group with the negative label, which results in bias. Inspired by this observation, to validate the existence of implicit bias, we investigate whether imbalance of the training data in terms of the demographic attributes of the authors can lead to biased predictions.

Table 4.3: Preliminary study. FP, FN, and DP indicate the false positive rate, false negative rate, and demographic parity measurement, respectively. I and II stand for Group I and Group II, respectively.
                                         FP (%)            FN (%)            DP (%)
Dataset    Task           Demog.      I        II        I        II        I        II
DIAL       Sentiment      Race        46.97    23.38     21.29    62.75     62.84    30.32
           Mention        Race        48.72    15.99     17.32    34.90     65.70    40.55
PAN16      Mention        Gender      23.90    12.30     13.06    23.01     55.42    44.64
           Mention        Age         24.91    9.88      16.48    26.43     54.22    41.72
MTC        Hate Speech    Race        80.33    1.77      12.13    49.35     84.10    26.21

To answer this question, we consider the following setting: (1) the training data has an equal number of positive and negative examples; and (2) positive and negative examples in the training data are imbalanced among different groups of authors according to their demographic attributes. Intuitively, if the predictions are independent of the demographic attributes of authors, the model should still perform similarly for different groups. For each task and demographic attribute of authors, we consider two labels (i.e., positive and negative) and two demographic groups (i.e., Group I and Group II). For each dataset, we follow the aforementioned setting to build a training set. We make the training set overall balanced in terms of the labels and the demographic groups; that is, we set the overall ratio of positive to negative examples to 1:1, and the overall ratio of examples from Group I to Group II to 1:1 as well. Meanwhile, we make the data within each group imbalanced, as sketched below.
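Concretely, such an overall-balanced but group-wise imbalanced training set could be constructed as in the following illustrative sketch; the function and parameter names are hypothetical, and the specific balance rate we use is given next.

```python
import random


def build_imbalanced_split(examples, balance_rate=4, per_group=10000, seed=0):
    """Illustrative sketch (not the thesis code) of the preliminary-study split.

    examples: list of (text, label, group) with label in {0, 1} and group in {"I", "II"}.
    Group I gets positive:negative = balance_rate:1 and Group II gets 1:balance_rate,
    so the overall set stays balanced in both labels and groups.
    """
    rng = random.Random(seed)
    pools = {(g, y): [e for e in examples if e[2] == g and e[1] == y]
             for g in ("I", "II") for y in (0, 1)}
    n_major = int(per_group * balance_rate / (balance_rate + 1))
    n_minor = per_group - n_major
    train = (rng.sample(pools[("I", 1)], n_major) + rng.sample(pools[("I", 0)], n_minor)
             + rng.sample(pools[("II", 1)], n_minor) + rng.sample(pools[("II", 0)], n_major))
    rng.shuffle(train)
    return train
```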
In particular, for Group I we set the ratio of positive to negative examples to 4:1, while the ratio is automatically 1:4 for Group II. We refer to the ratio of positive to negative samples in Group I as the "balance rate". We train a CNN text classifier as a representative model on the training set and evaluate it on the test set. We use the false positive/negative rates [34] and the demographic parity rate (a.k.a. the positive outcome rate, i.e., the probability of the model predicting a positive outcome for one group) [36, 61] to evaluate the fairness of the classification models.

The results are shown in Table 4.3. For the demographic attribute race, Group I/Group II stand for white/black in the DIAL dataset and white/nonwhite in the MTC dataset. For gender and age, Group I/Group II stand for male/female and the age ranges (18-34)/(≥ 35), respectively. From the table, we observe that across the different tasks and demographic attributes of authors, the model shows significant bias with the same pattern: in all cases, the demographic group with more positive examples (Group I) gets a higher false positive rate, a lower false negative rate, and a higher demographic parity rate than the other group. This demonstrates that imbalanced data can cause implicit bias, and that the predictions are not independent of the demographic attributes of the authors. Since the text itself does not explicitly contain any demographic information, the model could learn to recognize the demographic attributes of the authors from implicit features such as language style and associate them with a biased outcome. Next, we seek to understand one formation mechanism of implicit bias and then propose Debiased-TC to mitigate it.

4.3 Understanding Implicit Bias

In this section, we aim to understand a possible underlying formation mechanism of implicit bias. Our intuition is that when a training set for sentiment analysis has more positive examples from white authors and more negative examples from black authors, a classification model trained on such a dataset may learn a "shortcut" [81] that indiscriminately associates the language style features of white people with positive sentiment and those of black people with negative sentiment. In other words, the model does not use the correct language features (e.g., emotional words) to make the prediction. Thus, we attempt to examine the following hypothesis: a deep text classification model presents implicit bias because it makes predictions based on language features that should be irrelevant to the classification task but are correlated with a certain demographic group of authors.

To verify this hypothesis, we first propose an interpretation method to detect the salient words a text classification model relies on to make its prediction. The interpretation model enables us to check the overlap between the salient words and the words related to the authors' demographic attributes. Consequently, it allows us to understand the relationship between this overlap and the model's implicit bias.

4.3.1 An Interpretation Method

We follow the idea of the learning-based interpretation method L2X [20] and train an explainer to interpret a given model.
4.3.1 An Interpretation Method

We follow the idea of the learning-based interpretation method L2X [20] to train an explainer to interpret a given model. The reasons for choosing L2X are: 1) as a learning-based explainer, it learns to globally explain the behavior of a model, instead of explaining a single instance at a time; and 2) the explainer has the potential to be integrated into our debiasing framework to mitigate implicit bias in an end-to-end manner, which will be introduced in Section 4.4.

A binary text classification model M : X → Y maps an input text X = (x_1, x_2, ..., x_n) to a label Y ∈ {0, 1}. For a certain model M, we seek to specify the contribution of each word in X for M to make the prediction Y. The contributions can be denoted as a saliency distribution S = (s_1, s_2, ..., s_n), where s_i is the saliency score of the word x_i and the scores sum to 1. Given a model M, we train an explainer E^M : X → S to estimate the saliency distribution S of an input text X. The explainer is trained by maximizing I(X_S, Y), the mutual information [28] between the response variable Y and the selected feature X_S of X under the saliency distribution S. The selected feature X_S = X ⊙ S = (s_1 · x_1, s_2 · x_2, ..., s_n · x_n) is calculated as the element-wise product between X and S (without confusion, we use x_i to denote both a word and its word embedding vector). In our implementation, we parametrize the explainer by a bi-directional recurrent neural network (RNN) followed by a linear layer and a Softmax layer.

We train the explainer E by maximizing the mutual information between the response variable Y and the selected features X_S. The optimization problem can be formulated as:

\[
\max_{E} \; I(X_S; Y) \quad \text{s.t.} \quad S \sim P_E(S \mid X)
\tag{4.1}
\]

where

\[
\begin{aligned}
I(X_S, Y) &= \mathbb{E}\left[\log \frac{P(X_S, Y)}{P(X_S)\,P(Y)}\right]
= \mathbb{E}\left[\log \frac{P_M(Y \mid X_S)}{P(Y)}\right] \\
&\propto \mathbb{E}\left[\log P_M(Y \mid X_S)\right]
= \mathbb{E}_{X}\,\mathbb{E}_{S \mid X}\,\mathbb{E}_{Y \mid X_S}\left[\log P_M(Y \mid X_S)\right]
\end{aligned}
\]

Solving the optimization problem in Eq. (4.1) is therefore equivalent to finding an explainer E satisfying

\[
\max_{E} \; P_M(Y \mid X_S) \quad \text{s.t.} \quad S \sim P_E(S \mid X).
\]

Hence, we train the explainer E by optimizing P_M(Y | X_S) with the parameters of the classification model M fixed. In our implementation, we adopt the cross-entropy loss for training, as we do when we train the classification model M.
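The following is a minimal PyTorch-style sketch of this explainer, under the assumption that the classifier is a module that maps a (possibly reweighted) embedded sequence to class logits; the module and function names are ours, and details such as padding and masking are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Explainer(nn.Module):
    """Bi-directional RNN that maps embedded words to a saliency distribution S."""
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embedded):                  # embedded: (batch, seq_len, embed_dim)
        states, _ = self.rnn(embedded)            # (batch, seq_len, 2 * hidden_dim)
        scores = self.scorer(states).squeeze(-1)  # (batch, seq_len)
        return F.softmax(scores, dim=-1)          # saliency distribution S (rows sum to 1)

def explainer_loss(explainer, classifier, embedded, labels):
    """Cross-entropy of the classifier's prediction on the reweighted input X_S.
    Only the explainer is updated; the classifier's parameters are frozen elsewhere
    (requires_grad=False or excluded from the optimizer)."""
    s = explainer(embedded)                       # (batch, seq_len)
    x_s = embedded * s.unsqueeze(-1)              # element-wise product X ⊙ S
    logits = classifier(x_s)                      # gradients still flow back to the explainer
    return F.cross_entropy(logits, labels)
```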
4.3.2 Saliency Correlation Measurement

In this chapter, we assume that the text classification task is totally independent of the demographic attribute of the author of the text. In other words, language features that reflect the author's demographic information should not be taken as evidence for the main task. Thus, we propose to understand the implicit bias of a deep text classification model by examining the overlap between the salient words for the main task and the words correlated with the demographic attribute. With the interpretation model, we can estimate the saliency distributions of the input words for the classification task and for the demographic attribute prediction task, respectively, and then check their overlap. As shown in Figure 4.1, we train two models M^Y and M^Z with the same architecture for the former and the latter tasks, respectively. Then, two corresponding explainers E^Y and E^Z are trained for them. Thus, given an input text X, the two explainers can estimate the saliency distributions S^Y and S^Z on the two tasks, respectively. We use the Jensen-Shannon (JS) divergence JS(S^Y || S^Z) to measure the overlap between the language features that the two tasks rely on to make the predictions on Y and Z.

Figure 4.1: An illustration of the bias interpretation model.

4.3.3 Empirical Analysis

In this subsection, we present the experiments to verify our hypothesis on the formation of implicit bias. Following the experimental settings in Section 4.2.2, we vary the "balance rate" of the training data and then observe how the saliency correlation changes. We use CNN text classifiers (see Section 4.5.1 for details) for both M^Y and M^Z. In Figure 4.2, we show how the average JS divergence and the demographic parity difference (DPD) vary with the changes of the balance rate. DPD is the absolute value of the difference between the demographic parity rates of the two groups. We only report the results for the DIAL and PAN16 datasets with DPD as the fairness metric, since we achieved similar results for the other settings.

Figure 4.2: The average JS divergence (solid lines) and DPD (dashed lines) vs. the balance rate for (a) Sentiment/Race (DIAL), (b) Mention/Race (DIAL), (c) Mention/Gender (PAN16), and (d) Mention/Age (PAN16). The x-axis indicates the balance rate of the training set; the left y-axis indicates the average JS divergence, and the right y-axis indicates the DPD.

For each task and each demographic attribute, the DPD is small when the training data are balanced and becomes large when the data are imbalanced. However, the JS divergence is large for balanced data and small for imbalanced data. A larger DPD indicates stronger implicit bias, and a smaller JS divergence stands for a stronger overlap between the saliency distributions of the two tasks. Thus, these observations suggest that when the training data are imbalanced, the text classifiers tend to use language features related to the demographic attribute of authors to make the prediction.
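For reference, a minimal NumPy sketch of the JS divergence between two saliency distributions, as used in the analysis above (the small epsilon for numerical stability is our own addition):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions p and q,
    e.g. the saliency distributions S^Y and S^Z over the words of one text."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```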
4.4 The Bias Mitigation Framework

In the previous section, we showed that a model with implicit bias tends to utilize features related to the demographic attribute of authors to make the prediction, especially when the training data is imbalanced in terms of the demographic attribute of authors. One potential solution is to balance the training data by augmenting it with more examples from underrepresented groups. However, collecting new data from authors of different demographics is expensive. Thus, to mitigate the implicit bias, we propose a novel framework, Debiased-TC, which mitigates implicit bias by automatically correcting the model's selection of input features. In this section, we introduce the proposed framework along with the corresponding optimization method.

4.4.1 Debiased Text Classification Model

An illustration of Debiased-TC is shown in Figure 4.3. Similar to the explainer in the interpretation model, we equip the base model M^Y with a corrector layer C after the input layer. The corrector C : X → S learns to correct the model's feature selection. It first maps an input text X = (x_1, x_2, ..., x_n) to a saliency distribution S = (s_1, s_2, ..., s_n), which is expected to give high scores to words related to the main task and low scores to words related to the demographic attributes of authors. Then, it assigns weights to the input features with the saliency scores by calculating X_S = X ⊙ S, which is fed into the classification model M^Y for prediction.

Figure 4.3: An illustration of the bias mitigation model.

To train a corrector that achieves the expected goal, we adopt the idea of adversarial training. More specifically, in addition to the main classifier M^Y, we introduce an adversarial classifier M^Z, which takes X_S as the input and predicts the demographic attribute Z. During the adversarial training, the corrector attempts to help M^Y make correct predictions while preventing M^Z from predicting demographic attributes. To make this feasible, we use the gradient reversal technique [41], where we add a gradient-reversal layer between the weighted inputs X_S and the adversarial classifier M^Z. The gradient-reversal layer has no effect on its downstream components (i.e., the adversarial classifier M^Z). However, during back-propagation, the gradients that pass down through this layer to its upstream components (i.e., the corrector C) are reversed. As a result, the corrector C receives opposite gradients from M^Z. The outputs of M^Y and M^Z are used as signals to train the corrector such that it upweights the words correlated with the main task label Y and downweights the words correlated with the demographic attribute Z. We set the adversarial classifier M^Z to have the same architecture as the main classifier M^Y. The corrector C has the same architecture as the explainer introduced in Section 4.3.
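Below is a minimal PyTorch sketch of this gradient-reversal wiring; the custom autograd function and the helper that combines the corrector with the two classifiers are our own illustrative constructions, with the corrector, main classifier, and adversarial classifier assumed to be modules as described above.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -1 in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def debiased_tc_losses(corrector, main_clf, adv_clf, embedded, y, z):
    """Main-task and adversarial losses for one batch.
    y: main-task labels; z: demographic attribute labels."""
    s = corrector(embedded)                     # saliency distribution produced by C
    x_s = embedded * s.unsqueeze(-1)            # reweighted input X_S
    loss_y = F.cross_entropy(main_clf(x_s), y)  # main classifier M^Y
    # M^Z sees X_S through the gradient-reversal layer, so the corrector receives
    # reversed gradients from the adversarial loss.
    loss_z = F.cross_entropy(adv_clf(GradReverse.apply(x_s)), z)
    return loss_y, loss_z
```

Applying the reversal only to the input of M^Z leaves the adversary's own parameter gradients unchanged, matching the description of the gradient-reversal layer above.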
4.4.2 An Optimization Method for Debiased-TC

In this subsection, we discuss the optimization method for the proposed framework. We denote the parameters of M^Y, M^Z, and C as W_Y, W_Z, and θ, respectively. The optimization task is to jointly optimize the parameters of the classifiers, i.e., W_Y and W_Z, and the parameters of the corrector, i.e., θ. We can view the optimization as an architecture search problem. Since our debiasing framework is end-to-end and differentiable, we develop an optimization method for our framework based on the differentiable architecture search (DARTS) technique [69]. We update M^Y and M^Z by optimizing the training losses L^Y_train and L^Z_train on the training set, and we update θ by optimizing the validation loss L_val on the validation set through gradient descent. We denote the cross-entropy losses for M^Y and M^Z as L^Y and L^Z, respectively. L^Y_train and L^Z_train denote the cross-entropy losses L^Y and L^Z on the training set, and L_val denotes the combined loss L = L^Y + L^Z on the validation set.

The goal of optimizing the corrector is to find optimal parameters θ* that minimize the validation loss L_val(W_Y*, W_Z*, θ), where the optimal parameters W_Y* and W_Z* are obtained by minimizing the training losses:

\[
W_Y^* = \arg\min_{W_Y} L^Y_{train}(W_Y, \theta), \qquad
W_Z^* = \arg\min_{W_Z} L^Z_{train}(W_Z, \theta)
\]

The above goal forms a bi-level optimization problem [80, 92], where θ is the upper-level variable and W_Y and W_Z are the lower-level variables:

\[
\begin{aligned}
\min_{\theta} \;\; & L_{val}\big(W_Y^*(\theta), W_Z^*(\theta), \theta\big) \\
\text{s.t.} \;\; & W_Y^*(\theta) = \arg\min_{W_Y} L^Y_{train}(W_Y, \theta), \quad
W_Z^*(\theta) = \arg\min_{W_Z} L^Z_{train}(W_Z, \theta)
\end{aligned}
\]

Optimizing θ is time-consuming due to the expensive inner optimization of W_Y and W_Z. Therefore, we leverage the same approximation scheme as DARTS:

\[
\nabla_{\theta} L_{val}\big(W_Y^*(\theta), W_Z^*(\theta), \theta\big)
\approx
\nabla_{\theta} L_{val}\big(W_Y - \xi \nabla_{W_Y} L^Y_{train}(W_Y, \theta),\;
W_Z - \xi \nabla_{W_Z} L^Z_{train}(W_Z, \theta),\; \theta\big)
\]

where ξ is the learning rate for updating W_Y and W_Z. The approximation scheme estimates W_Y*(θ) and W_Z*(θ) by updating W_Y and W_Z for a single training step, which avoids fully solving the inner optimization W*(θ) = arg min_W L_train(W, θ) to convergence. In our implementation, we apply the first-order approximation with ξ = 0, which leads to further speed-up. Also, in our specific experiments, since the amount of validation data is limited, we build an augmented validation dataset V′ = V ∪ T, combining the original validation set V with the training set T, for optimizing θ.

We present our DARTS-based optimization algorithm in Algorithm 2. In each iteration, we first update the corrector's parameters based on the augmented validation set V′ (lines 2-3). Then, we collect a new mini-batch of training data (line 4). We generate the saliency scores S = (s_1, s_2, ..., s_n) for the training examples via the corrector with its current parameters (line 5). Next, we make predictions via the classifiers with their current parameters and X_S (line 6). Eventually, we update the parameters of the classifiers (line 7).

Algorithm 2: The DARTS-based optimization method for Debiased-TC.
Input: Training data T = {X_i, Y_i, Z_i}_{i=1}^{|T|} and validation data V = {X_i, Y_i, Z_i}_{i=1}^{|V|}
Output: Classifier parameters W_Y* and W_Z*; corrector parameters θ*
Initialize W_Y, W_Z, and θ
1: while not converged do
2:     Sample a mini-batch of validation data from V′ = V ∪ T
3:     Update θ by descending ∇_θ L_val( W_Y − ξ ∇_{W_Y} L^Y_train(W_Y, θ), W_Z − ξ ∇_{W_Z} L^Z_train(W_Z, θ), θ )  (ξ = 0 for the first-order approximation)
4:     Collect a mini-batch of training data from T
5:     Generate S via the corrector with current parameters θ
6:     Generate predictions via the classifiers with current parameters W_Y, W_Z and the weighted inputs X_S
7:     Update W_Y and W_Z by descending ∇_{W_Y} L^Y_train(W_Y, θ) and ∇_{W_Z} L^Z_train(W_Z, θ)
8: end while
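A compact sketch of this alternating update with the first-order approximation (ξ = 0), following the steps of Algorithm 2. The `losses_fn` argument is assumed to return the pair (L^Y, L^Z) for a batch, e.g., the `debiased_tc_losses` helper sketched in the previous subsection, and `embed` is an embedding module; both, along with the data-loader interface, are our own assumptions about the surrounding code.

```python
import torch

def train_debiased_tc(corrector, main_clf, adv_clf, embed, losses_fn,
                      train_loader, val_loader, epochs=5, lr=1e-3):
    """First-order DARTS-style alternating optimization: the corrector parameters
    (theta) are updated on the augmented validation set V' = V ∪ T, and the
    classifier parameters (W_Y, W_Z) are updated on the training set."""
    opt_theta = torch.optim.Adam(corrector.parameters(), lr=lr)
    opt_w = torch.optim.Adam(list(main_clf.parameters()) + list(adv_clf.parameters()), lr=lr)

    for _ in range(epochs):
        for (xv, yv, zv), (xt, yt, zt) in zip(val_loader, train_loader):
            # lines 2-3: update theta on a mini-batch from V' (L_val = L^Y + L^Z)
            loss_y, loss_z = losses_fn(corrector, main_clf, adv_clf, embed(xv), yv, zv)
            opt_theta.zero_grad()
            (loss_y + loss_z).backward()
            opt_theta.step()

            # lines 4-7: update W_Y and W_Z on a training mini-batch, with saliency
            # scores produced by the corrector under its current parameters
            loss_y, loss_z = losses_fn(corrector, main_clf, adv_clf, embed(xt), yt, zt)
            opt_w.zero_grad()
            (loss_y + loss_z).backward()
            opt_w.step()
```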
4.5 Experiment

In this section, we conduct experiments to evaluate our proposed debiasing framework. Through the experiments, we try to answer two questions: 1) Does our framework effectively mitigate the implicit bias in various deep text classification models? and 2) Does our framework maintain the performance of the original models (without debiasing) while reducing the bias?

4.5.1 Base Deep Text Classification Models

In this chapter, we generally investigate implicit bias in deep text classifiers in a model-agnostic setting, rather than focusing on a specific classifier or type of classifier. We conduct our experiments on two popular deep text classification models:

• CNN. Following [57], we build a Convolutional Neural Network (CNN) text classifier. We use 100 filters with three different kernel sizes (3, 4, and 5) in the convolution layer, where we use a Rectified Linear Unit (ReLU) as the non-linear activation function. Each obtained feature map is processed by a max-pooling layer. Then, the features are concatenated and fed into a linear prediction layer to get the final predictions. A dropout with a rate of 0.3 is applied before the linear prediction layer (a minimal sketch of this classifier is given after this list).

• RNN. We build a Recurrent Neural Network (RNN) text classifier [26] with Gated Recurrent Units (GRU). We use a unidirectional RNN with one layer. The hidden size is set to 300. The last hidden state of the RNN is fed into a linear prediction layer to get the final predictions. We apply a dropout with a rate of 0.2 before the linear prediction layer.
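As a concrete reference, here is a minimal sketch of the CNN classifier described above; the stated hyperparameters follow the description, while the module name and the assumption that inputs are already embedded are our own simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNTextClassifier(nn.Module):
    """Convolutional text classifier: 100 filters for each kernel size in {3, 4, 5},
    ReLU, max-pooling over time, dropout 0.3, and a linear prediction layer."""
    def __init__(self, embed_dim=300, num_classes=2, num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, embedded):                  # embedded: (batch, seq_len, embed_dim)
        x = embedded.transpose(1, 2)              # (batch, embed_dim, seq_len)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(self.dropout(torch.cat(feats, dim=1)))
```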
4.5.2 Baselines

In our experiments, we compare our proposed debiasing framework with two baselines. Since there is no established method for mitigating implicit bias, we adopt two debiasing methods designed for traditional explicit bias and adapt them for implicit bias.

Data Augmentation* (Data Aug) [34]. We manually balance the training data of the two demographic groups by adding sufficient negative examples for Group I and positive examples for Group II. As a result, the ratio of positive and negative training examples for both groups is 1:1. As discussed in the introduction, obtaining additional labeled data from specific authors is very expensive. In this chapter, we seek to develop a bias mitigation methodology without extra data. Since Data Aug introduces more training data, it is not fair to directly compare it with other debiasing methods that only utilize the original training data (including our method). We include Data Aug as a special baseline for reference.

Instance Weighting (Ins Weigh) [126]. We re-weight each training instance with a numerical weight P(Y)/P(Y|Z) based on the label distribution of each demographic group to mitigate explicit bias. In this method, a random forest classifier is built to estimate the conditional distribution P(Y|Z), and the marginal distribution P(Y) is manually calculated.

4.5.3 Experimental Settings

We use the same datasets with manually designed proportions, as described in Section 4.2.2. For the base text classifiers, we use randomly initialized word embeddings with a size of 300. All the models are trained by an Adam optimizer [58] with an initial learning rate of 0.001. We apply gradient clipping with a clip value of 0.25 to prevent the exploding gradient problem. The batch size is set to 64. For the base model and the baseline methods, when the prediction accuracy on the validation data doesn't improve for 5 consecutive epochs, the training is terminated, and we pick the model with the best performance on the validation set. Our model utilizes the validation data for training. To avoid overfitting the validation data, we don't select the model based on its performance on the validation set. Instead, we train the model for a fixed number of epochs (5 epochs, the same for all three datasets) and evaluate the obtained model.

4.5.4 Performance Comparison

We train the base models with our proposed debiasing framework as well as the baseline debiasing methods. We report the performance on the test set in terms of fairness and classification performance.

Fairness Evaluation. Table 4.4 and Table 4.5 show the results for the fairness evaluation metrics: false positive equality difference (FPED), false negative equality difference (FNED), and DPD. FPED/FNED is the absolute value of the difference between the false positive/negative rates of the two groups.

Table 4.4: Fairness performance comparison on CNN text classifiers. Note that Data Aug is a special baseline for reference.

CNN
Task (Dataset)           Methods      FPED (%)  FNED (%)  DPD (%)
Sentiment Race (DIAL)    Base Model   23.59     41.45     32.52
                         Data Aug*    21.00*     3.88*    12.44*
                         Ins Weigh    25.47     41.43     33.45
                         Debiased-TC   6.08      4.63      0.73
Mention Race (DIAL)      Base Model   32.73     17.58     25.16
                         Data Aug*     1.31*     7.31*     3.00*
                         Ins Weigh    24.66     19.46     22.06
                         Debiased-TC   3.61      2.40      0.61
Mention Gender (PAN16)   Base Model   11.60      9.95     10.78
                         Data Aug*     0.84*     0.19*     0.32*
                         Ins Weigh    12.73     10.22     11.47
                         Debiased-TC   3.95      3.04      3.49
Mention Age (PAN16)      Base Model   15.03      9.96     12.49
                         Data Aug*     3.71*     1.59*     1.06*
                         Ins Weigh    16.53      8.71     12.62
                         Debiased-TC   7.29      2.91      5.10
Hate Speech Race (MTC)   Base Model   78.56     37.22     57.89
                         Data Aug*    88.81*    26.15*    57.48*
                         Ins Weigh    87.51     31.92     59.72
                         Debiased-TC  75.97     17.08     46.53

Table 4.5: Fairness performance comparison on RNN text classifiers. Note that Data Aug is a special baseline for reference.

RNN
Task (Dataset)           Methods      FPED (%)  FNED (%)  DPD (%)
Sentiment Race (DIAL)    Base Model   26.86     42.36     34.61
                         Data Aug*    19.84*     0.59*    10.22*
                         Ins Weigh    26.86     42.36     34.61
                         Debiased-TC   6.67      5.68      0.50
Mention Race (DIAL)      Base Model   30.44     17.55     24.00
                         Data Aug*     0.77*     7.91*     4.34*
                         Ins Weigh    28.83     17.26     23.05
                         Debiased-TC   4.97      1.07      1.95
Mention Gender (PAN16)   Base Model   10.62      8.33      9.47
                         Data Aug*     2.42*     0.72*     1.57*
                         Ins Weigh    11.20      9.35     10.28
                         Debiased-TC   5.41      3.73      4.57
Mention Age (PAN16)      Base Model   13.07      7.34     10.20
                         Data Aug*     0.17*     2.69*     1.26*
                         Ins Weigh    13.24      7.94     10.59
                         Debiased-TC   7.64      2.69      5.16
Hate Speech Race (MTC)   Base Model   81.51     28.50     55.01
                         Data Aug*    83.51*    22.73*    53.12*
                         Ins Weigh    84.45     27.44     55.95
                         Debiased-TC  74.56     18.85     46.70

We make the following observations. First, the base models attain high FPED, FNED, and DPD, which indicates the existence of significant implicit bias towards the authors of the texts. Second, Ins Weigh seems ineffective in mitigating implicit bias, since it only achieves fairness scores comparable to those of the base models. Note that not every example that belongs to a certain group necessarily results in bias towards that group. Thus, assigning a uniform weight to all examples with the same label Y and demographic attribute Z is not a proper way to reduce implicit bias. Third, both Data Aug and Debiased-TC can mitigate the implicit bias by achieving lower equality and demographic parity differences. However, compared to Data Aug, Debiased-TC has two advantages. First, Data Aug needs to add more training data while Debiased-TC does not. Debiased-TC can locate the main source of implicit bias by analyzing how it forms in a deep text classification model. Due to the proposed corrector model, it can make a classification model focus on the relevant features for predictions and discard the features that may lead to implicit bias. Second, Debiased-TC is more stable than Data Aug. For the sentiment classification task with race as the demographic attribute, the CNN and RNN classifiers trained on augmented data still result in high FPED and DPD scores.
This suggests that balancing the training data cannot always mitigate implicit bias. In fact, only training examples with demographic language features can contribute to the implicit bias. Since some texts in the training set do not contain any language features belonging to a demographic group, they do not help balance the data.

Text Classification Performance Evaluation. The prediction performance of the text classification models trained under the various debiasing methods is shown in Table 4.6 and Table 4.7, where we report the accuracy and F1 scores.

Table 4.6: Text classification performance comparison (%) on the DIAL dataset. Note that Data Aug is a special baseline for reference.

       Methods      Sentiment/Race (DIAL)   Mention/Race (DIAL)
                    Acc.    F1              Acc.    F1
CNN    Base Model   61.40   60.03           70.77   71.65
       Data Aug*    67.58*  71.53*          76.42*  76.03*
       Ins Weigh    61.06   60.36           71.62   69.66
       Debiased-TC  63.60   66.58           73.15   71.84
RNN    Base Model   61.23   61.53           72.97   73.68
       Data Aug*    67.82*  69.35*          78.42*  77.26*
       Ins Weigh    61.23   61.53           73.37   73.79
       Debiased-TC  63.68   66.70           74.05   73.41

Table 4.7: Text classification performance comparison (%) on the PAN16 and MTC datasets. Note that Data Aug is a special baseline for reference.

       Methods      Mention/Gender (PAN16)  Mention/Age (PAN16)   Hate Speech/Race (MTC)
                    Acc.    F1              Acc.    F1            Acc.    F1
CNN    Base Model   81.93   81.94           80.57   80.17         64.10   65.86
       Data Aug*    84.11*  84.31*          84.08*  84.36*        66.96*  71.10*
       Ins Weigh    81.86   81.85           80.70   81.05         65.25   68.73
       Debiased-TC  81.67   82.01           80.41   79.68         69.14   72.69
RNN    Base Model   83.46   83.40           82.78   82.43         66.31   69.57
       Data Aug*    86.25*  86.05*          86.12*  85.68*        68.55*  72.37*
       Ins Weigh    83.46   83.32           82.80   82.58         67.26   70.94
       Debiased-TC  81.81   81.51           80.21   79.17         66.76   70.76

First, it is not surprising to see that Data Aug achieves the best performances, since the data augmentation technique introduces more training data; it is not fair to directly compare it with the other debiasing methods that only utilize the original training data. Second, in most cases, our method achieves comparable or even better performance than the original base models. As we verified before, the implicit bias of a text classification model is caused by the fact that it learns a wrong correlation between labels and demographic language features. Debiased-TC corrects the model's selection of language features for predictions and thereby improves its performance on the classification task. In conclusion, our proposed debiasing framework significantly mitigates the implicit bias, while maintaining or even slightly improving the classification performance.

4.6 Related Work

Fairness in Machine Learning. With the wide spread of machine learning (ML) applications in our daily lives, bias and fairness issues in them are drawing increasing attention from the community. Studies have been conducted to detect and mitigate the bias in ML models on various tasks. Specifically, studies investigate how algorithms can be biased in classification [54, 24], regression [6, 1], and clustering [4, 22] tasks. In the domain of computer vision, researchers show that ML-based face recognition [17] and object detection [102] models perform unfairly for different demographic groups. Besides, a lot of works examine the bias in language-related tasks, including word embedding [12], coreference resolution [131], machine translation [93], and dialogue generation [70, 73]. Moreover, some recent studies also explore the relationship between the fairness of an ML model and its other properties, such as robustness [121, 88] and privacy [29].

Fairness in Text Classification. In this chapter, we focus on the fairness issues in the text classification task. For this task, the work [34] demonstrates that the source of unintended bias in models is the imbalance of training data, and provides a debiasing method that introduces new data to balance the training data. In [90], gender bias is measured on abusive language detection models, and the effects of different pre-trained word embeddings and model architectures are analyzed. By considering the various ways that a classifier's score distribution can vary across designated groups, a suite of threshold-agnostic metrics is introduced in [14], which provides a nuanced view of unintended bias.
Furthermore, the work [126] proposes to debias text classification models using instance weighting, i.e., different weights are assigned to the training samples involving different demographic groups. The works discussed above focus on explicit bias, where the demographic attributes are explicitly expressed in the text. However, works studying implicit bias are rather limited. The paper [46] introduces the first multilingual hate speech dataset with inferred author demographic attributes. Through experiments on this dataset, the authors show that popular text classifiers can learn bias towards the demographic attributes of the author. However, that work does not discuss how the bias is produced, and no debiasing method is provided.

CHAPTER 5
UNDERSTANDING AND HANDLING ANNOTATOR GROUP BIAS IN CROWDSOURCING

Crowdsourcing has emerged as a popular approach for collecting annotated data to train supervised machine learning models. However, annotator bias can lead to defective annotations. Though there are a few works investigating individual annotator bias, the group effects in annotators are largely overlooked. In this chapter, we reveal that annotators within the same demographic group tend to show consistent group bias in annotation tasks, and we thus conduct an initial study on annotator group bias. We first empirically verify the existence of annotator group bias in various real-world crowdsourcing datasets. Then, we develop a novel probabilistic graphical framework GroupAnno to capture annotator group bias with an extended Expectation Maximization (EM) algorithm. We conduct experiments on both synthetic and real-world datasets. Experimental results demonstrate the effectiveness of our model in modeling annotator group bias in label aggregation and model learning over competitive baselines.

5.1 Chapter Introduction

The performance of supervised machine learning algorithms heavily relies on the quality of the annotated training data. Due to the heavy workload of annotation tasks, researchers and practitioners typically take advantage of crowdsourcing platforms to obtain cost-effective annotation data [111, 16]. However, the labels collected from multiple crowdsourcing annotators could be inconsistent, since the expertise and reliability of the annotators are uncertain, and the task itself could be subjective and difficult. In recent years, many efforts in the machine learning community have been devoted to mitigating the effect of these noisy crowdsourcing labels [134]. Various approaches have been proposed to model the quality [76, 2], confidence [50], expertise [78, 133], and reliability [65] of annotators, or to model the difficulty of the tasks [117, 78]. With such information, we can infer the true label from the noisy labels more accurately and correspondingly train a more desirable model.

In terms of annotator modeling, existing studies have mainly concentrated on factors like quality, confidence, and expertise, which could affect the annotation results. Besides, the bias held by the annotators can also lead to defective annotations [104], which is, however, rarely studied. In addition, studies in social science [37] suggest that people from different demographic groups tend to apply different standards to evaluate the same thing due to their different experiences, which causes group bias. We observe that annotators in different demographic groups tend to show different bias in annotation tasks.
For example, in a preliminary study, we examine the instances annotated by both groups of annotators in the Wikipedia Toxicity dataset [120]. We observe that native speakers of English rate 5.1% more comments as toxic than non-native speakers. Similarly, annotators over 30 years old rate 2.5% more comments as toxic than younger annotators. More details of the preliminary study can be found in Section 5.2. Thus, a thorough investigation of such annotator group bias is desired. Similar to existing studies, by considering the effect of annotator group bias, we have the potential to achieve a more accurate inference of true labels and train a better model. Meanwhile, it is often hard to estimate the individual bias of one annotator with limited annotation data. With annotator group bias as prior knowledge, we can estimate the bias more effectively based on the demographic groups the annotator belongs to. Thus, annotator group bias could mitigate the "cold-start" problem in modeling annotator individual bias.

In this chapter, we aim to study how to detect annotator group bias in text classification tasks, and how to mitigate the detrimental effects of annotator group bias on model training. We face several challenges. First, given noisy annotated data without the true labels, how should we detect the annotator bias? We first compare the annotation results from different groups of annotators and find that there is a significant gap between them. Then, we use two metrics, sensitivity and specificity, to measure the annotator bias, and conduct an analysis of variance (ANOVA) which demonstrates that the bias of each individual annotator shows obvious group effects in terms of its demographic attributes. Second, how can we estimate the annotator group bias, and perform label aggregation and model training with the knowledge of annotator group bias? Following the traditional probabilistic approaches for label aggregation [97, 100, 65], we propose a novel framework GroupAnno that models the production of annotations as a stochastic process via a novel probabilistic graphical model (PGM). Inspired by the results of the ANOVA, we assume that the bias of an annotator can be viewed as a superposition of the effects of annotator group bias and its individual bias. We thereby extend the original PGM for label aggregation with additional variables representing annotator group bias. By learning the PGM, we estimate the annotator group bias, infer the true labels, and optimize our classification model simultaneously. Third, how can we learn this PGM effectively? With the unknown true label as a latent variable, the typical maximum likelihood estimation (MLE) method cannot be directly applied to estimate the parameters. To address this challenge, we propose an extended EM algorithm for GroupAnno to effectively learn all the parameters in it, including the parameters of the classifier and the newly introduced variables for modeling annotator group bias.

We summarize our contributions in this chapter as follows. First, we propose metrics to measure annotator group bias and verify its existence in real NLP datasets via an empirical study. Second, we propose a novel framework GroupAnno to model the annotation process by considering annotator group bias. Third, we propose a novel extended EM algorithm for GroupAnno where we estimate the annotator group bias, infer the true labels, and optimize the text classification model simultaneously.
Finally, we conduct experiments on synthetic and real data. The experimental results show that GroupAnno can accurately estimate the annotator group bias. Also, compared with competitive baselines, GroupAnno can infer the true labels more accurately and learn better classification models.

5.2 Understanding Annotator Group Bias

In this section, we perform an empirical study to get a rudimentary understanding of annotator group bias.

5.2.1 Data and Tasks

We investigate annotator group bias on three datasets that involve various text classification tasks. These datasets are released in the Wikipedia Detox project [120]: the Personal Attack Corpus, the Aggression Corpus, and the Toxicity Corpus, where each instance is labeled by multiple annotators from the Crowdflower platform (https://www.crowdflower.com/). For all the datasets, the demographic attributes of the annotators are collected. The data statistics of the three Wikipedia Detox datasets, i.e., Personal Attack, Aggression, and Toxicity, are shown in Table 5.1, where "#Instances" indicates the total number of instances in a dataset and "#Annotators" denotes the total number of annotators.

Table 5.1: Statistics of the datasets.

Dataset          #Instances  #Annotators
Personal Attack  115,864     2,190
Aggression       115,864     2,190
Toxicity         159,686     3,591

The Personal Attack dataset and the Aggression dataset contain the same comments collected from English Wikipedia. Each comment is labeled by around 10 annotators on two tasks, respectively. The task of the former dataset is to determine whether the comment contains any form of personal attack, while the task of the latter dataset is to judge whether the comment is aggressive or not. For each annotator, four demographic categories are collected: gender, age, language, and education. Although the original dataset provides more fine-grained partitions, for simplicity, we divide the annotators into only two groups in terms of each demographic category (based on our experiments, the bias is also significant when considering more fine-grained groups, e.g., "18-30", "30-45", and "45-60" for age). We consider two groups: male and female for gender, under 30 and over 30 for age, below bachelor and above bachelor (including bachelor) for education, and native and non-native speaker of English for language. The Toxicity dataset contains comments collected from the same source. Similarly, each comment is labeled by around 10 annotators on whether it is toxic or not. The Toxicity dataset includes the same demographic information of the annotators as the former two datasets.

5.2.2 Empirical Study

To investigate whether the annotators from different groups behave differently in annotation tasks, we first perform a comparison of the annotation results from different annotator groups. For each demographic category, we collect the instances which are labeled by annotators from both groups, and report the proportion of instances that are classified as positive. The results are shown in Table 5.2.

Table 5.2: The positive rates of the annotations from different groups of annotators.

                 Gender           Age                 Education              Language
Dataset          Male    Female   Under 30  Over 30   Below Ba.  Above Ba.   Native  Non-native
Personal Attack  15.98   18.67    15.83     18.52     17.63      15.81       19.95   14.40
Aggression       17.74   21.44    17.79     20.85     20.28      17.62       23.20   16.08
Toxicity         12.06   16.37    12.51     15.08     15.16      12.56       16.93   11.80

First, we note that there are obvious gaps between the annotations given by different annotator groups.
Second, given that the tasks of the three datasets are similar (i.e., all of them are related to detecting inappropriate speech), the annotation tendency of each annotator group is consistent across them. For example, young and non-native speaker annotators are less likely to annotate a comment as attacking, aggressive, or toxic. Third, in terms of different demographic categories, the gaps between the annotations from the two groups are different. For example, compared with other group pairs, the annotations provided by native speakers and non-native speakers differ the most.

Analysis of Variance. The results in Table 5.2 suggest that annotators show group bias in the annotation tasks, which manifests as different groups holding different evaluation criteria for the same task. Specifically for classification tasks, different annotators are unevenly likely to label instances from one class as the other class. In this chapter, we only consider binary classification tasks for simplicity; all our findings and the proposed framework can be trivially extended to the case of multi-way classification. Thus, we use sensitivity (the true positive rate) and specificity (1 − the false positive rate) [124] to describe the bias of an individual annotator.

Next, we seek to verify the existence of annotator group bias. We are interested in whether the demographic category of an individual annotator has a significant impact on its bias. Thus, we first estimate the bias (i.e., sensitivity and specificity) of each individual annotator from its annotation data. Since we don't have the true labels, we use majority vote labels as the true labels to approximately estimate the bias of each annotator. Then, we perform an ANOVA [105] with the demographic categories as the factors, the groups as the treatments, and the bias of an annotator as the response variable, to analyze the significance of the annotator's demographic groups against its own bias. The corresponding statistical model can be expressed as:

\[
\tilde{\pi}_r = u + \pi^{1, g_r^1} + \cdots + \pi^{P, g_r^P} + \varepsilon_r
\tag{5.1}
\]

where π̃_r indicates the bias of an individual annotator r; u is the average bias of all annotators; π^{p, g_r^p} is the effect of the group g_r^p in terms of category p; and ε_r is a random error that follows a normal distribution with mean 0. To test whether category p has a significant impact on π̃, we consider the null hypothesis H_0^p: π^{p,0} = π^{p,1}, which indicates that the demographic category p has no significant effect on the annotator bias. In other words, there is no significant difference between the annotation behaviors of the two groups in terms of category p.
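To illustrate, a minimal sketch of this per-annotator ANOVA using pandas and statsmodels; the tooling choice and the toy data are our own assumptions, since the thesis does not prescribe a specific package.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy data: one row per annotator with an estimated bias value (e.g., sensitivity
# computed against majority-vote labels) and its group for each demographic category.
rng = np.random.default_rng(0)
n = 200
annotators = pd.DataFrame({
    "gender":    rng.choice(["male", "female"], n),
    "age":       rng.choice(["under30", "over30"], n),
    "education": rng.choice(["below_ba", "above_ba"], n),
    "language":  rng.choice(["native", "nonnative"], n),
})
# Simulated response: an age group effect plus random noise, mirroring Eq. (5.1).
annotators["sensitivity"] = (0.7
                             + 0.05 * (annotators["age"] == "over30")
                             + rng.normal(0.0, 0.05, n))

# Additive model of Eq. (5.1): bias ~ overall mean + per-category group effects + error.
model = ols("sensitivity ~ C(gender) + C(age) + C(education) + C(language)",
            data=annotators).fit()
print(sm.stats.anova_lm(model, typ=2))   # per-category sums of squares and p-values
```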
The results are shown in Table 5.3. In the table, we report the inter-group sum of squares, which represents the deviation of the average group bias from the overall average bias. We also use "*" to denote the significance of the hypothesis tests.

Table 5.3: The results of the analysis of variance. The table shows the inter-group sum of squares (variance of treatments). * and ** indicate that the group effects are significant at p < 0.05 and p < 0.005, respectively.

            Personal Attack             Aggression                  Toxicity
Category    Sensitivity  Specificity    Sensitivity  Specificity    Sensitivity  Specificity
Gender      0.010        0.077*         0.106        0.182**        0.217**      0.266**
Age         3.093**      0.257**        3.529**      0.348**        3.230**      0.005
Education   0.006        0.001          0.021        0.012          0.012        0.013
Language    0.805**      0.155**        1.200**      0.470**        0.041        0.023*

We observe that in the categories of gender, age, and language, the two opposing groups show markedly different sensitivity and specificity in most cases. Moreover, the ANOVA suggests that we can confidently reject the null hypotheses in these cases, which means that these three demographic categories can affect the annotator bias significantly across the datasets. Based on our observations, we conclude that the demographic attributes of an annotator can have a significant impact on its annotation behavior, and thereby, annotator group bias does exist.

5.3 Modeling Annotator Group Bias

In this section, we discuss our approaches for annotator group bias estimation, as well as bias-aware label aggregation and model training. We first introduce the metrics for measuring annotator group bias, and then present the problem statement. Next, we detail GroupAnno, the probabilistic graphical model for modeling the production of annotations. Finally, we describe our extended EM algorithm for learning the proposed model.

5.3.1 Measurements

To measure the annotator bias in terms of demographic groups, we extend the definitions of sensitivity and specificity to the group scenario. Formally, we define the group sensitivity and the group specificity of a group g in terms of category p as

\[
\alpha^{p,g} = \Pr(z = 1 \mid y = 1, g_r^p = g), \qquad
\beta^{p,g} = \Pr(z = 0 \mid y = 0, g_r^p = g)
\]

where y is the true label and z is the annotated label. g_r^p = g represents that the annotator r belongs to group g in terms of demographic category p. We use π^p = (α^{p,0}, α^{p,1}, β^{p,0}, β^{p,1}) to denote the bias parameters of demographic category p. The bias parameters of all the P categories are denoted as π = {π^p}_{p=1}^P.

5.3.2 Problem Statement

Suppose that we have a dataset D = {x_i, z_i^1, ..., z_i^{R_i}}_{i=1}^N which contains N instances. Each instance x_i is annotated by R_i different annotators, which results in labels z_i^1, ..., z_i^{R_i}. We also have an annotator set A = {(g_r^1, ..., g_r^P)}_{r=1}^R that records the demographic groups of a total of R annotators. Here, g_r^p ∈ {0, 1} indicates the group that the r-th annotator belongs to in terms of the p-th demographic category. We consider P demographic categories for each annotator, and we have two groups (i.e., 0 and 1) for each category. Given D and A, we seek to (1) estimate the annotator group bias π; (2) estimate the true label y_i of each instance x_i; and (3) learn a classifier P_w(y|x) parameterized by w. Next, we introduce GroupAnno to model the annotation process, and propose an extended EM algorithm to estimate the parameters Θ = {w, π}.

5.3.3 GroupAnno: The Probabilistic Graphical Model

As shown in Figure 5.1, GroupAnno models the generation procedure of annotations as follows. Given an instance x, its true label y is determined by an underlying distribution P_w(·|x). The distribution is expressed via a classifier with parameters w that we will learn. Given the true label y, the annotated label z^r from an annotator r is determined by its bias π̃_r = (α̃_r, β̃_r). For simplicity, in the following formulations, we use π̃_r to represent α̃_r or β̃_r. In Section 5.2.2, we showed that the annotator bias can be modeled by a superposition of the effects of annotator group bias and a random variable reflecting the annotator's individual bias. Thus, following Eq. (5.1), we assume that the bias of annotator r can be decomposed as

\[
\tilde{\pi}_r = u + \pi^{1, g_r^1} + \cdots + \pi^{P, g_r^P} + \pi_r
\]

To sum up, the parameters we introduce to model annotator bias are π = {u} ∪ {π^p}_{p=1}^P ∪ {π_r}_{r=1}^R.
To estimate the parameters Θ = {w, π}, one way is to use maximum likelihood estimation. Under the assumption that instances are sampled independently, the likelihood function of Θ can be written as

\[
P(D \mid \Theta) = \prod_{i=1}^{N} P(z_i^1, \cdots, z_i^{R_i} \mid x_i; \Theta)
\]

Therefore, the MLE parameters can be found by maximizing the log-likelihood

\[
\hat{\Theta}_{MLE} = \{\hat{w}, \hat{\pi}\} = \arg\max_{\Theta} \ln P(D \mid \Theta)
\tag{5.2}
\]

Figure 5.1: An illustration of GroupAnno. In the graph, grey circles represent observed data; a white circle indicates a latent variable; a diamond represents an intermediate variable; and squares denote the unknown parameters that we will learn.

5.3.4 The Extended EM Algorithm

However, we cannot directly apply MLE to solve Eq. (5.2), because there is an unknown latent variable (i.e., the true label y) in the probabilistic graphical model. Thus, we propose an extended EM algorithm to effectively estimate the parameters Θ in GroupAnno. Since the true label y_i is an unknown latent variable, the log-likelihood term in Eq. (5.2) can be decomposed as

\[
\ln P(D \mid \Theta) = \sum_{i=1}^{N} \ln \big[ P_w(y_i = 1 \mid x_i)\, P(z_i^1, \cdots, z_i^{R_i} \mid y_i = 1; \tilde{\alpha})
+ P_w(y_i = 0 \mid x_i)\, P(z_i^1, \cdots, z_i^{R_i} \mid y_i = 0; \tilde{\beta}) \big]
\]

where α̃ = {α̃_r}_{r=1}^R and β̃ = {β̃_r}_{r=1}^R represent the collections of the sensitivities and the specificities of all the annotators. We further assume that the annotations for one instance from different annotators are conditionally independent given their demographic attributes [97]. Then we have

\[
\ln P(D \mid \Theta)
= \sum_{i=1}^{N} \ln \Big[ P_w(y_i = 1 \mid x_i) \prod_{r=1}^{R_i} P(z_i^r \mid y_i = 1; \tilde{\alpha})
+ P_w(y_i = 0 \mid x_i) \prod_{r=1}^{R_i} P(z_i^r \mid y_i = 0; \tilde{\beta}) \Big]
= \sum_{i=1}^{N} \ln \big[ p_i a_i + (1 - p_i) b_i \big]
\tag{5.3}
\]

where we denote

\[
p_i := P_w(y_i = 1 \mid x_i), \quad
a_i := \prod_{r=1}^{R_i} P(z_i^r \mid y_i = 1; \tilde{\alpha}) = \prod_{r=1}^{R_i} \tilde{\alpha}_r^{z_i^r} (1 - \tilde{\alpha}_r)^{1 - z_i^r}, \quad
b_i := \prod_{r=1}^{R_i} P(z_i^r \mid y_i = 0; \tilde{\beta}) = \prod_{r=1}^{R_i} (1 - \tilde{\beta}_r)^{z_i^r} \tilde{\beta}_r^{1 - z_i^r}
\]

Note that due to the existence of the latent variable y_i, Eq. (5.3) contains the logarithm of a sum of two terms, which makes it very difficult to calculate its gradient w.r.t. Θ. Thus, to overcome this obstacle, we instead optimize a lower bound of ln P(D|Θ) via an EM algorithm.

E-step. Given the observations D and the current parameters Θ, we calculate the following lower bound of the true log-likelihood ln P(D|Θ):

\[
\ln P(D \mid \Theta) \geq \mathbb{E}_y[\ln P(D, y \mid \Theta)]
= \sum_{i=1}^{N} \big[ \mu_i \ln p_i a_i + (1 - \mu_i) \ln (1 - p_i) b_i \big]
\tag{5.4}
\]

where μ_i = P(y_i = 1 | z_i^1, ..., z_i^{R_i}, x_i, Θ), which can be computed by Bayes' rule:

\[
\mu_i = \frac{a_i p_i}{a_i p_i + b_i (1 - p_i)}
\tag{5.5}
\]

M-step. In the M-step, we update the model parameters Θ by maximizing the conditional expectation in Eq. (5.4):

\[
\Theta \leftarrow \Theta + \alpha \nabla_{\Theta} \mathbb{E}_y[\ln P(D, y \mid \Theta)]
\]

where α is the learning rate.

The training algorithm is summarized in Algorithm 3. We first initialize the posterior probabilities of the labels μ_i based on majority voting (line 1). Next, we perform the extended EM algorithm to update the model parameters iteratively. In the E-step, we update μ_i by Bayes' rule in Eq. (5.5) and then calculate the expectation in Eq. (5.4) (lines 3 to 5). Afterward, we perform the M-step, where the gradients of the conditional expectation w.r.t. the model parameters are calculated, and the model parameters are updated through gradient ascent. The iterative process is terminated when specific stopping requirements are satisfied. In our implementation, we execute the EM optimization steps for a fixed number of epochs.
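A minimal NumPy sketch of the E-step quantities in Eqs. (5.3)-(5.5) for a single instance; the function name is ours, and the classifier probability p_i and the per-annotator sensitivities and specificities are assumed to be given.

```python
import numpy as np

def posterior_mu(p_i, z, alpha, beta):
    """E-step posterior mu_i = P(y_i = 1 | annotations, x_i, Theta) for one instance.

    p_i:   classifier probability P_w(y_i = 1 | x_i).
    z:     0/1 annotations given by the R_i annotators of this instance.
    alpha: per-annotator sensitivities (composed from group and individual bias).
    beta:  per-annotator specificities.
    """
    z = np.asarray(z, dtype=float)
    alpha, beta = np.asarray(alpha, dtype=float), np.asarray(beta, dtype=float)
    a_i = np.prod(alpha ** z * (1.0 - alpha) ** (1.0 - z))   # P(annotations | y_i = 1)
    b_i = np.prod((1.0 - beta) ** z * beta ** (1.0 - z))     # P(annotations | y_i = 0)
    return a_i * p_i / (a_i * p_i + b_i * (1.0 - p_i))       # Bayes' rule, Eq. (5.5)
```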
Algorithm 3: The extended EM algorithm for parameter estimation in GroupAnno.
Input: Dataset D = {x_i, z_i^1, ..., z_i^{R_i}}_{i=1}^N and annotator set A = {(g_r^1, ..., g_r^P)}_{r=1}^R
Output: A text classification model w and estimated annotator bias parameters π
1: Initialize μ_i = (1/R_i) ∑_{r=1}^{R_i} z_i^r based on majority voting
2: repeat
3:     E-step:
4:     Update μ_i: μ_i ← a_i p_i / (a_i p_i + b_i (1 − p_i))
5:     Calculate the expectation E_y[ln P(D, y | Θ)]
6:     M-step:
7:     Update the parameters Θ by maximizing the above expectation: Θ ← Θ + α ∇_Θ E_y[ln P(D, y | Θ)]
8: until the stopping requirements are met

5.4 Experiment

In this section, we evaluate the proposed method via comprehensive experiments. We test our model on both synthetic and real-world data. Through the experiments, we try to answer three research questions: (1) Is our method able to accurately estimate the annotator group bias? (2) Can our method effectively infer the true labels? and (3) Can our approach learn more accurate classifiers?

5.4.1 Baselines

We compare our proposed framework GroupAnno with eight existing true label inference methods [134], including majority voting (MV), ZenCrowd [30], Minimax [135], LFC-binary [97], CATD [66], PM-CRH [2], KOS [56], and VI-MF [76].

5.4.2 Data

Synthetic Data. We first create two synthetic datasets on a simple binary classification task with 2-dimensional features. As shown in Figure 5.2, the instances in the two datasets are in the shape of a circle and a moon, respectively. In each dataset, we sample 400 instances for each class. We simulate 40 annotators with two demographic attributes. We first randomly set the group bias for the two demographic attributes. Then, based on our assumed distribution that has been verified in Section 5.2, we sample the bias for each annotator. Finally, we suppose that each instance is labeled by 4 different annotators and simulate the annotations based on the sampled annotator bias. With the knowledge of the actual annotator group bias and true labels in synthetic data, we can verify the capability of the proposed framework in group bias estimation and truth label inference.
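A minimal sketch of this simulation process; the specific distributions, effect magnitudes, and clipping are our own simplifications of the procedure described above.

```python
import numpy as np

def simulate_annotations(y_true, n_annotators=40, per_instance=4, seed=0):
    """Simulate crowdsourced labels: each annotator's sensitivity/specificity is a
    superposition of the effects of two binary demographic attributes plus individual
    noise, and every instance is labeled by a few randomly chosen annotators."""
    rng = np.random.default_rng(seed)
    groups = rng.integers(0, 2, size=(n_annotators, 2))       # group membership per attribute
    sens_effect = rng.uniform(-0.2, 0.2, size=(2, 2))         # group effects on sensitivity
    spec_effect = rng.uniform(-0.2, 0.2, size=(2, 2))         # group effects on specificity
    indiv = rng.normal(0.0, 0.05, size=(2, n_annotators))     # individual bias terms
    sens = np.clip(0.7 + sens_effect[0, groups[:, 0]] + sens_effect[1, groups[:, 1]]
                   + indiv[0], 0.05, 0.95)
    spec = np.clip(0.7 + spec_effect[0, groups[:, 0]] + spec_effect[1, groups[:, 1]]
                   + indiv[1], 0.05, 0.95)

    annotations = []
    for y in y_true:
        ann = rng.choice(n_annotators, size=per_instance, replace=False)
        p_pos = sens[ann] if y == 1 else 1.0 - spec[ann]      # P(annotator labels positive)
        annotations.append((ann, (rng.random(per_instance) < p_pos).astype(int)))
    return groups, sens, spec, annotations
```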
Wikipedia Detox Data. We conduct experiments on all three subsets (i.e., Personal Attack, Aggression, and Toxicity) of the public Wikipedia Detox dataset. The details of this dataset are introduced in Section 5.2.1. For the three subsets in the Wikipedia Detox Corpus, we use the training/test split provided by the publisher of the data [120]. Since there is no ground-truth label available in this dataset, we pick a subset of instances in the test set on which more than 80% of the annotations reach an agreement and treat the MV label as the ground-truth label. These instances are less controversial, so we are confident that the MV labels are true labels. We report the performance of the models trained under the various label inference approaches on this set.

Information Detection Data. This dataset consists of text transcribed from conversations recorded in several in-person and virtual meetings. Each text is assigned an information label which groups the text into three categories: give information (G), ask information (A), and other (O). Five different data annotators classified the text into one of the G, A, or O categories. We conducted a survey to collect data on demographic characteristics of the annotators such as gender, race, and whether they are native speakers of English. We convert the three categories into two classes by treating G and A as positive (i.e., information exchange) and O as negative (i.e., other). There are 2,483 instances in total in this dataset. After the annotation, we randomly select 762 instances and ask the annotators to discuss and reach an agreement on their labels. We treat these labels as true labels. We construct the training set with the remaining 1,721 instances without true labels, plus 430 of the instances with true labels. Thus, we have 20% of the training data with true labels, on which we will report the truth inference performance. The remaining 332 instances with true labels make up our test set.

5.4.3 Implementation Details

For the text classification tasks on the Wikipedia Detox data and the Information Detection data, we employ a one-layer recurrent neural network (RNN) with gated recurrent units (GRUs) as the classifier. In the RNN classifier, the word embedding size is set to 128 and the hidden size is set to 256. The classifier is optimized by an Adam optimizer [58] with a learning rate of 0.001. When modeling annotator group bias, we consider the 1-2 demographic categories with the most significant group effects. For the Personal Attack dataset and the Aggression dataset, we consider age and language. For the Toxicity dataset, we consider gender. For the Information Detection dataset, we consider language.

5.4.4 Results on Synthetic Data

Group Bias Estimation. In each of the synthetic datasets, we simulate the annotations based on preset annotator group bias. We simulate two demographic attributes for each annotator, with two groups for each attribute. Thus, there are eight bias parameters to estimate: the sensitivities α^{p,g} and the specificities β^{p,g}, where p = 0, 1 and g = 0, 1. We compare the real values of the annotator group bias with the estimations from GroupAnno. The results are shown in Table 5.4.

Table 5.4: Results of group bias estimation on the synthetic 2-dimensional datasets. "Real" and "Estimation" indicate the real and the estimated values of the annotator group bias parameters.

                   Estimation
Params   Real      Circle   Moon
α^{0,0}  0.700     0.739    0.728
α^{0,1}  0.500     0.482    0.476
β^{0,0}  0.800     0.787    0.778
β^{0,1}  0.300     0.335    0.320
α^{1,0}  0.900     0.927    0.943
α^{1,1}  0.400     0.419    0.428
β^{1,0}  0.300     0.288    0.295
β^{1,1}  0.500     0.458    0.443

We observe that the bias parameters are estimated accurately, within an acceptable error range. The results demonstrate the ability of our extended EM algorithm to estimate the parameters in GroupAnno.

Truth Label Inference. The experimental results of truth label inference on synthetic data are shown in Table 5.5, where we list the performance of the different approaches on truth label inference. We make the following observations. First, MV performs the worst among all the methods. In fact, a majority vote often does not mean the truth. By explicitly modeling the annotation behaviors of the annotators, an algorithm can infer the true labels more accurately than the majority vote. Second, the baselines Minimax and LFC-binary outperform the other baselines. LFC-binary leverages a PGM to model the individual annotator bias for truth label inference, which achieves desirable performance. Third, our framework GroupAnno further improves the accuracy of truth label inference over LFC-binary, since GroupAnno finds and exploits the annotator group bias as additional information. GroupAnno models the annotator group bias as prior information for the individual bias of each annotator, so that the individual bias can be estimated more accurately. As a result, GroupAnno achieves the best performance on truth label inference.
Figure 5.2: Two synthetic datasets ("circle" and "moon") with simulated 2-dimensional data.

Table 5.5: Experimental results on the synthetic 2-dimensional datasets. "Acc" and "F1" indicate the accuracy and the F1 score of true label inference. We report the results averaged over 5 runs with different random seeds.

             Circle            Moon
Methods      Acc     F1        Acc     F1
MV           0.728   0.722     0.748   0.744
ZenCrowd     0.894   0.886     0.904   0.898
Minimax      0.911   0.909     0.916   0.914
LFC-binary   0.911   0.909     0.916   0.914
CATD         0.851   0.844     0.861   0.853
PM-CRH       0.860   0.851     0.875   0.868
KOS          0.891   0.884     0.897   0.891
VI-MF        0.907   0.905     0.914   0.911
GroupAnno    0.921   0.916     0.925   0.920

5.4.5 Results on the Wikipedia Detox Dataset

The experimental results on the Wikipedia Detox datasets are shown in the left section of Table 5.6. For LFC-binary and GroupAnno, where truth label inference and model training are conducted simultaneously, we directly report the performance of the resulting model on the test set. For the other, pure truth label inference approaches, we first infer the truth labels and then train the model on the inferred labels; finally, we report the performances of these models on the test set. The results show that GroupAnno achieves better performances than the state-of-the-art methods, which demonstrates the effectiveness and superiority of our framework in practice.

Table 5.6: Experimental results on the Wikipedia Detox datasets and the Information Detection dataset. For Wikipedia Detox, we report the performances of the learned classifiers on the test data. For Information Detection, we report the performance on truth inference ("Truth Infer") as well as the performance of the learned classifiers on the test data ("Prediction"). We report the results averaged over 5 runs with different random seeds. For the results of Wikipedia Detox, we also show the 95% confidence intervals.

             Wikipedia Detox                                            Information Detection
             Aggression      Personal Attack  Toxicity                  Truth Infer       Prediction
Method       F1              F1               F1                        Acc     F1        Acc     F1
MV           0.953 ± 0.006   0.955 ± 0.005    0.951 ± 0.006             0.786   0.862     0.843   0.899
ZenCrowd     0.954 ± 0.005   0.952 ± 0.005    0.953 ± 0.006             0.786   0.862     0.845   0.900
Minimax      0.957 ± 0.005   0.959 ± 0.004    0.956 ± 0.005             0.823   0.872     0.855   0.898
LFC-binary   0.957 ± 0.006   0.960 ± 0.006    0.957 ± 0.003             0.814   0.872     0.864   0.907
CATD         0.935 ± 0.008   0.949 ± 0.005    0.954 ± 0.004             0.809   0.873     0.849   0.901
PM-CRH       0.949 ± 0.003   0.954 ± 0.006    0.955 ± 0.004             0.809   0.873     0.849   0.901
KOS          0.949 ± 0.006   0.952 ± 0.003    0.948 ± 0.006             0.786   0.862     0.844   0.899
VI-MF        0.955 ± 0.005   0.957 ± 0.004    0.951 ± 0.005             0.823   0.872     0.855   0.898
GroupAnno    0.961 ± 0.004   0.968 ± 0.005    0.962 ± 0.005             0.825   0.883     0.869   0.910

5.4.6 Results on the Information Detection Dataset

The experimental results on the Information Detection dataset are shown in the right section of Table 5.6. Since we have 20% of the training data with available true labels, we first examine the accuracy of truth label inference of the various methods on this part of the data, and then report the performance of the trained classifiers on the test data. We find that our proposed method still outperforms all the baselines on both truth inference and resulting classifier performance, which further verifies the superiority of GroupAnno on real-world data.

5.5 Related Work

Bias and fairness issues are crucial as machine learning systems are being increasingly used in sensitive applications [25].
Bias can be caused by pre-existing societal norms [40], the data source, data labeling, training algorithms, and post-processing models. Data source bias emerges when the source distribution differs from the target distribution where the model will be applied [108]. Training algorithms can also introduce bias. For example, if we train a model on data that contain labels from two populations (a majority and a minority population), minimizing overall error will fit only the majority population while ignoring the minority [25]. Data labeling bias exists when the distribution of the dependent variable in the data source diverges from the ideal distribution [108]. Many of these data labels are generated by human annotators, who can easily skew the distribution of training data [34]. Various factors such as task difficulty, task ambiguity, the amount of contextual information made available, and the expertise of the annotator determine annotation results [52].

Prior literature studies various approaches to ensure the reliability of data annotations. In the works [30, 2], the authors use worker probability to model the ability of an annotator to correctly answer a task, and some other works [117, 67] introduce a similar concept, worker quality, by changing the value range from [0, 1] to (−∞, +∞). The work [116] models the bias and variance of the crowdsourcing workers on numeric annotation tasks. Moreover, in the works [39] and [78], researchers find that annotators show different qualities when answering different tasks, and thereby propose to model the diverse skills of annotators on various tasks. The work [65] observes that annotators perform unevenly on individual annotation instances, so the authors propose a novel method to model instance-level annotator reliability for NLP labeling tasks. The work [43] uses language generated by annotators to identify annotator identity and shows that annotator identity information improves model performance. All these studies have been individual-focused and ignore group effects. Our approach differs in that we study systemic bias associated with annotators of a specific demographic group.

CHAPTER 6
CONCLUSIONS

6.1 Dissertation Summary

In this dissertation, we have presented our efforts devoted to bias detection and mitigation in natural languages. Specifically, we have described our studies on (i) bias detection and mitigation in dialogue generation, (ii) implicit bias detection and mitigation, and (iii) annotator group bias in crowdsourcing.

In Chapter 2, we have investigated the fairness issues in dialogue systems. In particular, we define fairness in dialogue systems formally and further introduce four measurements to evaluate the fairness of a dialogue system quantitatively, including diversity, politeness, sentiment, and attribute words. Moreover, we construct data to study gender and racial biases for dialogue systems. Then, we conduct detailed experiments on two types of dialogue models (i.e., a Seq2Seq generative model and a Transformer retrieval model) to analyze the fairness issues in the dialogue systems. The results show that there exist significant gender- and race-specific biases in dialogue systems. We introduce two debiasing methods to mitigate the biases in dialogue systems. Experiments show that the proposed methods effectively reduce the biases and ensure the fairness of dialogue systems.

In Chapter 3, we focus on the problem of mitigating gender bias in neural dialogue models.
In Chapter 3, we focus on the problem of mitigating gender bias in neural dialogue models. We propose an adversarial training framework, Debiased-Chat, to reduce the bias of a dialogue model during the training process. With the help of a disentanglement model, we design an adversarial learning framework that trains dialogue models to include unbiased gender features and exclude biased gender features in their responses. Experiments on two human conversation datasets demonstrate that our model successfully mitigates gender bias in dialogue models and outperforms baselines by producing more engaging, diverse, and gender-specific responses. In the future, we will investigate debiasing retrieval-based dialogue models and more complicated pipeline-based dialogue systems.

In Chapter 4, we demonstrate that a text classifier with implicit bias makes predictions based on language features correlated with the demographic groups of authors, and we propose a novel learning framework, Debiased-TC, to mitigate such implicit bias. In particular, our preliminary study shows that popular deep text classifiers can learn implicit bias towards the authors of texts. We build a learning-based interpretation model to understand the formation mechanism of implicit bias, and demonstrate that a classifier shows implicit bias when it makes predictions based on language features that are correlated with demographic groups. Accordingly, we propose Debiased-TC to train deep classification models free from implicit bias; it forces the classifier to focus on the right language features when making predictions. We evaluate the proposed framework with two text classification models on three real-world datasets. The experimental results show that Debiased-TC significantly mitigates implicit bias and maintains, or even improves, the text classification performance of the original models.

In Chapter 5, we investigate annotator group bias in crowdsourcing. We first conduct an empirical study on real-world crowdsourcing datasets and show that annotators from the same demographic groups tend to show similar bias in annotation tasks. We then develop a novel framework, GroupAnno, which models the whole annotation process while accounting for the group effect of annotator bias. To solve the optimization problem of the proposed framework, we propose a novel extended EM algorithm. Finally, we empirically verify our approach on two synthetic datasets and four real-world datasets. The experimental results show that our model can accurately estimate the annotator group bias, achieve more accurate truth inference, and train better classifiers that outperform those learned under state-of-the-art true label inference baselines. As future work, we plan to investigate annotator group bias in tasks beyond classification, such as regression tasks and text generation tasks.
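To give a flavor of EM-based truth inference with group-level reliability parameters, the toy loop below alternates between estimating posteriors over true binary labels (E-step) and re-estimating one accuracy parameter per annotator group (M-step). It is a deliberately simplified sketch in the spirit of classical worker-probability models, not the GroupAnno formulation itself; all names and the single-accuracy-per-group parameterization are illustrative assumptions.

```python
import numpy as np

def em_group_reliability(annotations, groups, n_items, n_iters=50):
    """Simplified EM for binary truth inference with one accuracy per annotator group.

    annotations: list of (item_id, annotator_id, label) triples with label in {0, 1}.
    groups: dict mapping annotator_id -> group_id (0 .. n_groups-1).
    Returns the posterior P(true label = 1) per item and the estimated accuracy per group.
    """
    n_groups = len(set(groups.values()))
    acc = np.full(n_groups, 0.7)          # initial guess for each group's accuracy
    post = np.full(n_items, 0.5)          # posterior P(y_i = 1), uniform prior

    for _ in range(n_iters):
        # E-step: update the posterior over true labels given the group accuracies.
        log_odds = np.zeros(n_items)
        for item, annotator, label in annotations:
            p = np.clip(acc[groups[annotator]], 1e-6, 1 - 1e-6)
            if label == 1:
                log_odds[item] += np.log(p) - np.log(1 - p)
            else:
                log_odds[item] += np.log(1 - p) - np.log(p)
        post = 1.0 / (1.0 + np.exp(-log_odds))

        # M-step: re-estimate each group's accuracy from the expected agreements.
        agree = np.zeros(n_groups)
        total = np.zeros(n_groups)
        for item, annotator, label in annotations:
            g = groups[annotator]
            agree[g] += post[item] if label == 1 else 1.0 - post[item]
            total[g] += 1.0
        acc = agree / np.maximum(total, 1.0)

    return post, acc
```

GroupAnno models the annotation process in more detail and, as noted in Section 5.4.5, couples this kind of inference with classifier training rather than running it as a separate stage.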
6.2 Future Works

In addition to the promising findings and achievements from our studies, we believe more dedicated efforts should be devoted to understanding and alleviating bias in natural languages. As future work, we plan to investigate the following directions:

• Bias Mitigation in Comprehensive Dialogue Systems. In this dissertation, we have developed a novel framework for mitigating bias in generative dialogue models. Nevertheless, in industry, a comprehensive dialogue system is typically designed in a pipeline-based architecture, where a generative dialogue model serves as only one component of the entire system [136]. In addition to generative dialogue models, rule-based models, retrieval-based models, and question answering (QA) models are also incorporated. How to debias such models and the entire pipeline of a comprehensive dialogue system remains an open problem that I plan to work on.

• Fairness in Pre-trained Language Models. Language model pre-training is a crucial task in NLP, and it has been verified that such language models can exhibit human-like bias [68]. Although a few works study bias issues in language modeling, they focus only on the bias in the language model itself and overlook the impact of that bias on downstream models. I plan to investigate whether downstream NLP models can inherit the bias of pre-trained language models and how to prevent the spread of such bias.

• Trustworthy NLP Systems. In addition to fairness, other aspects need to be considered to make an NLP system trustworthy, including robustness, privacy, and interpretability [74]. As future directions, I plan to study these aspects toward achieving trustworthy NLP and to explore the relationship between these dimensions and fairness in NLP.

BIBLIOGRAPHY

[1] Alekh Agarwal, Miroslav Dudik, and Zhiwei Steven Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning, pages 120–129. PMLR, 2019.
[2] Bahadir Ismail Aydin, Yavuz Selim Yilmaz, Yaliang Li, Qi Li, Jing Gao, and Murat Demirbas. Crowdsourcing for multiple-choice question answering. In AAAI, pages 2946–2953. Citeseer, 2014.
[3] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 405–413, 2019.
[4] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In International Conference on Machine Learning, pages 405–413. PMLR, 2019.
[5] Rachel KE Bellamy, Kuntal Dey, Michael Hind, Samuel C Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, et al. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943, 2018.
[6] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. A convex framework for fair regression. arXiv preprint arXiv:1706.02409, 2017.
[7] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael J. Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth. A convex framework for fair regression. CoRR, abs/1706.02409, 2017.
[8] Steven Bird. NLTK: the natural language toolkit. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, 2006.
[9] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, 2020.
[10] Su Lin Blodgett, Lisa Green, and Brendan O’Connor. Demographic dialectal variation in social media: A case study of African-American English.
In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1119–1130, 2016. [11] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29:4349–4357, 2016. 85 [12] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4349–4357. Curran Associates, Inc., 2016. [13] Shikha Bordia and Samuel R. Bowman. Identifying and reducing gender bias in word-level language models. CoRR, abs/1904.03035, 2019. [14] Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In Companion Proceedings of The 2019 World Wide Web Conference, pages 491–500, 2019. [15] Avishek Joey Bose and William Hamilton. Compositional fairness constraints for graph embeddings. CoRR, abs/1905.10674, 2019. [16] Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. Amazon’s mechanical turk: A new source of inexpensive, yet high-quality data? 2016. [17] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in com- mercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018. [18] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter, 19(2):25–35, 2017. [19] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. A survey on dialogue systems: Recent advances and new frontiers. CoRR, abs/1711.01731, 2017. [20] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In International Conference on Machine Learning, pages 883–892. PMLR, 2018. [21] Xingyu Chen, Brandon Fain, Liang Lyu, and Kamesh Munagala. Proportionally fair clus- tering. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 1032–1041, 2019. [22] Xingyu Chen, Brandon Fain, Liang Lyu, and Kamesh Munagala. Proportionally fair cluster- ing. In International Conference on Machine Learning, pages 1032–1041. PMLR, 2019. [23] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder- decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. [24] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017. [25] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning. arXiv preprint arXiv:1810.08810, 2018. 86 [26] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evalua- tion of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014. [27] Florian Coulmas. Sociolinguistics: The study of speakers’ choices. Cambridge University Press, 2013. [28] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. 
[29] Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pages 309–315, 2019. [30] Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Zencrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. In Proceedings of the 21st international conference on World Wide Web, pages 469–478, 2012. [31] Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. Queens are powerful too: Mitigating gender bias in dialogue generation. arXiv preprint arXiv:1911.03842, 2019. [32] Emily Dinan, Angela Fan, Ledell Wu, Jason Weston, Douwe Kiela, and Adina Williams. Multi-dimensional gender bias classification. CoRR, abs/2005.00614, 2020. [33] Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. CoRR, abs/1908.06083, 2019. [34] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 67–73, 2018. [35] Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, and Jason Weston. Evaluating prerequisite qualities for learning end-to-end dialog systems. arXiv preprint arXiv:1511.06931, 2015. [36] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012. [37] Alice H Eagly. Sex differences in social behavior: A social-role interpretation. Psychology Press, 2013. [38] Yanai Elazar and Yoav Goldberg. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 11–21, 2018. 87 [39] Ju Fan, Guoliang Li, Beng Chin Ooi, Kian-lee Tan, and Jianhua Feng. icrowd: An adap- tive crowdsourcing framework. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1015–1030, 2015. [40] Batya Friedman and Helen Nissenbaum. Bias in computer systems. ACM Transactions on Information Systems (TOIS), 14(3):330–347, 1996. [41] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015. [42] Jianfeng Gao, Michel Galley, and Lihong Li. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2-3):127–298, 2019. [43] Mor Geva, Yoav Goldberg, and Jonathan Berant. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, 2019. [44] Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, and Joelle Pineau. Ethical challenges in data-driven dialogue systems. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2018, New Orleans, LA, USA, February 02-03, 2018, pages 123–129, 2018. [45] Ayanna Howard and Jason Borenstein. 
The ugly truth about ourselves and our robot creations: the problem of bias and social inequity. Science and engineering ethics, 24(5):1521–1536, 2018. [46] Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and Michael Paul. Multilingual twitter cor- pus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 1440–1448, 2020. [47] Clayton J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014., 2014. [48] Aylin Caliskan Islam, Joanna J. Bryson, and Arvind Narayanan. Semantics derived auto- matically from language corpora necessarily contain human biases. CoRR, abs/1608.07187, 2016. [49] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016. [50] Manas Joglekar, Hector Garcia-Molina, and Aditya Parameswaran. Evaluating the crowd with confidence. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 686–694, 2013. 88 [51] Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. Disentangled representa- tion learning for non-parallel text style transfer. arXiv preprint arXiv:1808.04339, 2018. [52] Kenneth Joseph, Lisa Friedland, William Hobbs, David Lazer, and Oren Tsur. Constance: Modeling annotation contexts to improve stance classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1115–1124, 2017. [53] Dan Jurafsky and James H. Martin. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd Edition. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International, 2009. [54] Faisal Kamiran and Toon Calders. Classifying without discriminating. In 2009 2nd International Conference on Computer, Control and Communication, pages 1–6. IEEE, 2009. [55] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2012. [56] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourc- ing systems. Neural Information Processing Systems, 2011. [57] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics, 2014. [58] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [59] Philipp Koehn. Neural machine translation. Cambridge University Press, 2020. [60] Matt J Kusner and José Miguel Hernández-Lobato. Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016. [61] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in neural information processing systems, pages 4066–4076, 2017. [62] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity- promoting objective function for neural conversation models. 
In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 110–119, 2016. [63] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, 2016. 89 [64] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017. [65] Maolin Li, Arvid Fahlström Myrman, Tingting Mu, and Sophia Ananiadou. Modelling instance-level annotator reliability for natural language labelling tasks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2873–2883, 2019. [66] Qi Li, Yaliang Li, Jing Gao, Lu Su, Bo Zhao, Murat Demirbas, Wei Fan, and Jiawei Han. A confidence-aware approach for truth discovery on long-tail data. Proceedings of the VLDB Endowment, 8(4):425–436, 2014. [67] Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1187–1198, 2014. [68] Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In ICML, 2021. [69] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations, 2018. [70] Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. Does gender matter? towards fairness in dialogue systems. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4403–4416, 2020. [71] Haochen Liu, Tyler Derr, Zitao Liu, and Jiliang Tang. Say what I want: Towards the dark side of neural dialogue models. CoRR, abs/1909.06044, 2019. [72] Haochen Liu, Tyler Derr, Zitao Liu, and Jiliang Tang. Say what i want: Towards the dark side of neural dialogue models. arXiv preprint arXiv:1909.06044, 2019. [73] Haochen Liu, Wentao Wang, Yiqi Wang, Hui Liu, Zitao Liu, and Jiliang Tang. Mitigat- ing gender bias for neural dialogue generation with adversarial learning. arXiv preprint arXiv:2009.13028, 2020. [74] Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil K Jain, and Jiliang Tang. Trustworthy ai: A computational perspective. arXiv preprint arXiv:2107.06641, 2021. [75] Haochen Liu, Zhiwei Wang, Tyler Derr, and Jiliang Tang. Chat as expected: Learning to manipulate black-box neural dialogue models. arXiv preprint arXiv:2005.13170, 2020. [76] Qiang Liu, UC ICS, Jian Peng, and Alexander Ihler. Variational inference for crowdsourcing. sign, 10:j2Mi, 2012. [77] Edward Loper and Steven Bird. Nltk: the natural language toolkit. arXiv preprint cs/0205028, 2002. 90 [78] Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, and Jiawei Han. Faitcrowd: Fine grained truth discovery for crowdsourced data aggregation. In Proceedings of the 21th acm sigkdd international conference on knowledge discovery and data mining, pages 745–754, 2015. 
[79] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008. [80] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015. [81] Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson. End-to-end bias mitigation by modelling biases in corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8706–8716, 2020. [82] Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, and Simone Teufel. It’s all in the name: Mitigating gender bias with name-based counterfactual data substitution. arXiv preprint arXiv:1909.00871, 2019. [83] Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. CoRR, abs/1903.10561, 2019. [84] Walaa Medhat, Ahmed Hassan, and Hoda Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams engineering journal, 5(4):1093–1113, 2014. [85] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. CoRR, abs/1908.09635, 2019. [86] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019. [87] Alexander H. Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. Parlai: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017 - System Demonstrations, pages 79– 84, 2017. [88] Vedant Nanda, Samuel Dooley, Sahil Singla, Soheil Feizi, and John P Dickerson. Fairness through robustness: Investigating robustness disparity in deep learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 466–477, 2021. [89] Tong Niu and Mohit Bansal. Adversarial over-sensitivity and over-stability strategies for dia- logue models. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 486–496, 2018. [90] Ji Ho Park, Jamin Shin, and Pascale Fung. Reducing gender bias in abusive language de- tection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2799–2804, 2018. 91 [91] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014. [92] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018. [93] Marcelo O. R. Prates, Pedro H. C. Avelar, and Luís C. Lamb. Assessing gender bias in machine translation - A case study with google translate. CoRR, abs/1809.02208, 2018. [94] Marcelo OR Prates, Pedro H Avelar, and Luís C Lamb. Assessing gender bias in ma- chine translation: a case study with google translate. Neural Computing and Applications, 32(10):6363–6381, 2020. [95] Daniel Preoţiuc-Pietro and Lyle Ungar. User-level race and ethnicity predictors from twitter text. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1534–1545, 2018. 
[96] Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the 4th author profiling task at pan 2016: cross-genre evaluations. Working Notes Papers of the CLEF, 2016:750–784, 2016. [97] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(4), 2010. [98] Alan Ritter, Colin Cherry, and William B. Dolan. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 583–593, 2011. [99] James A Rodger and Parag C Pendharkar. A field study of the impact of gender and user’s technical experience on the performance of voice-activated medical tracking application. International Journal of Human-Computer Studies, 60(5-6):529–544, 2004. [100] Filipe Rodrigues and Francisco Pereira. Deep learning from crowds. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. [101] Adam Rose. Are face-detection cameras racist? Time Business, 2010. [102] Hee Jung Ryu, Margaret Mitchell, and Hartwig Adam. Improving smiling detection with race and gender diversity. arXiv preprint arXiv:1712.00193, 1(2):7, 2017. [103] Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 705–713, 2018. 92 [104] Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, 2019. [105] Henry Scheffe. The analysis of variance, volume 72. John Wiley & Sons, 1999. [106] Iulian Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, 2017. [107] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 3776–3784, 2016. [108] Deven Shah, H Andrew Schwartz, and Dirk Hovy. Predictive biases in natural language processing models: A conceptual framework and overview. arXiv preprint arXiv:1912.11078, 2019. [109] Deven Santosh Shah, H Andrew Schwartz, and Dirk Hovy. Predictive biases in natural language processing models: A conceptual framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5248–5264, 2020. [110] Lifeng Shang, Zhengdong Lu, and Hang Li. Neural responding machine for short-text con- versation. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1577–1586, 2015. [111] Rion Snow, Brendan O’connor, Dan Jurafsky, and Andrew Y Ng. Cheap and fast–but is it good? evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 conference on empirical methods in natural language processing, pages 254–263, 2008. [112] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014. [113] Songül Tolan, Marius Miron, Emilia Gómez, and Carlos Castillo. Why machine learning may lead to unfairness: Evidence from risk assessment for juvenile justice in catalonia. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, ICAIL 2019, Montreal, QC, Canada, June 17-21, 2019., pages 83–92, 2019. [114] Amirsina Torfi, Rouzbeh A Shirvani, Yaser Keneshloo, Nader Tavaf, and Edward A Fox. Natural language processing advancements by deep learning: A survey. arXiv preprint arXiv:2003.01200, 2020. 93 [115] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010, 2017. [116] Peter Welinder, Steve Branson, Pietro Perona, and Serge Belongie. The multidimensional wisdom of crowds. Advances in neural information processing systems, 23:2424–2432, 2010. [117] Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier Movellan, and Paul Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. Advances in neural information processing systems, 22:2035–2043, 2009. [118] Marty J. Wolf, Keith W. Miller, and Frances S. Grodzinsky. Why we should have seen that coming: comments on microsoft’s tay "experiment, " and wider implications. SIGCAS Computers and Society, 47(3):54–64, 2017. [119] Marty J Wolf, Keith W Miller, and Frances S Grodzinsky. Why we should have seen that coming: comments on microsoft’s tay “experiment,” and wider implications. The ORBIT Journal, 1(2):1–12, 2017. [120] Ellery Wulczyn, Nithum Thain, and Lucas Dixon. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th international conference on world wide web, pages 1391–1399, 2017. [121] Han Xu, Xiaorui Liu, Yaxin Li, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. arXiv preprint arXiv:2010.06121, 2020. [122] Han Xu, Yao Ma, Hao-Chen Liu, Debayan Deb, Hui Liu, Ji-Liang Tang, and Anil K Jain. Adversarial attacks and defenses in images, graphs and text: A review. International Journal of Automation and Computing, 17(2):151–178, 2020. [123] Sirui Yao and Bert Huang. Beyond parity: Fairness objectives for collaborative filtering. In Advances in Neural Information Processing Systems, pages 2921–2930, 2017. [124] Jacob Yerushalmy. Statistical problems in assessing methods of medical diagnosis, with special reference to x-ray techniques. Public Health Reports (1896-1970), pages 1432–1449, 1947. [125] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 
Fairness constraints: Mechanisms for fair classification, 2015. [126] Guanhua Zhang, Bing Bai, Junqi Zhang, Kun Bai, Conghui Zhu, and Tiejun Zhao. Demo- graphics should not be the reason of toxicity: Mitigating discrimination in text classifications with instance weighting. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4134–4145, 2020. [127] Lu Zhang, Yongkai Wu, and Xintao Wu. A causal framework for discovering and removing direct and indirect discrimination. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3929–3935, 2017. 94 [128] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, 2018. [129] Wei Emma Zhang, Quan Z Sheng, Ahoud Alhazmi, and Chenliang Li. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020. [130] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gen- der bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018. [131] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. CoRR, abs/1804.06876, 2018. [132] Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. Learning gender- neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4847–4853, 2018. [133] Yudian Zheng, Guoliang Li, and Reynold Cheng. Docs: a domain-aware crowdsourcing system using knowledge bases. Proceedings of the VLDB Endowment, 10(4):361–372, 2016. [134] Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541– 552, 2017. [135] Denny Zhou, John C Platt, Sumit Basu, and Yi Mao. Learning from the wisdom of crowds by minimax entropy. 2012. [136] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics, 46(1):53–93, 2020. 95