LEARNING FAIR REPRESENTATIONS WITHOUT DEMOGRAPHICS

By

Xiaoxue Wang

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Master of Science

2022

ABSTRACT

LEARNING FAIR REPRESENTATIONS WITHOUT DEMOGRAPHICS

By

Xiaoxue Wang

In real-world deployments of fair representation learning algorithms, the sensitive attributes with respect to which we wish to be fair are often inaccessible, so prior knowledge of them cannot be assumed. To address the challenge of fairness without explicit demographics, our solution is based on the idea of maximally randomizing the representation while keeping it as informative as possible about the target task. We operationalize this goal through the concept of maximizing the entropy of the learned representation. For this purpose, we propose two new avenues for entropy maximization in the absence of demographic information: inter-class and intra-class entropy maximization. 1) Inter-class entropy maximization maximizes the entropy of the non-target class predictions (excluding the probability of the ground-truth class label for classification problems), thus encouraging the model to discard spurious correlations between the different target classes. 2) Intra-class entropy maximization maximizes the entropy of the representation conditioned on the target label, thus encouraging randomization of the samples within each target class label and minimizing the leakage of potential demographic information in the representation. Quantitative and qualitative results of our Maximum Entropy method (MaxEnt) on the COMPAS and UCI Adult datasets show that 1) our method outperforms the state-of-the-art (SOTA) Adversarially Reweighted Learning (ARL) method and makes it harder to extract sensitive demographic information from the representation without prior demographic knowledge, and 2) our method reaches a good trade-off between utility and fairness.

Copyright by XIAOXUE WANG 2022

This thesis is dedicated to my parents, Hui Yang and Yong Wang

ACKNOWLEDGEMENTS

The years at MSU have been a great experience in my life. It is a great chance to remember and to thank my teachers, colleagues, friends, and family whose influence contributed to this thesis. First and foremost, I would like to thank my advisor, Vishnu Naresh Boddeti, for his extraordinary support, patience, and guidance, and for funding my entire research on fairness and privacy. His patient guidance and teaching have strengthened my research ability in the fields of computer vision and deep learning. I deeply appreciate his enthusiasm for novel approaches and genuinely positive attitude towards science and my research progress. I thank all of the members of the HAL lab for participating in my research and providing valuable feedback on my work at all times. I am very grateful for their friendship and support. I thank my entire family for their love and support, especially my parents, Yong Wang and Hui Yang, my grandma, Taoxiu Li, and my boyfriend, Yuanda Wang.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
CHAPTER 2 RELATED WORK
CHAPTER 3 PROBLEM SETTING
CHAPTER 4 APPROACH: MAXIMUM ENTROPY
  4.1 Inter-class MaxEnt: Maximizing Entropy of non-Target Class
  4.2 Intra-class MaxEnt: Maximizing Entropy of Class-Conditional Representations
  4.3 Maximum Entropy (MaxEnt) Method
CHAPTER 5 THEORETICAL ANALYSIS
CHAPTER 6 ABLATION STUDY AND EXPERIMENTAL RESULTS
  6.1 Ablation study
    6.1.1 Ablation study of target loss LT and inter-class entropy loss LE
    6.1.2 Ablation study of target loss LT and intra-class entropy loss LG
  6.2 Experimental Results
    6.2.1 Baselines and Implementation Details
    6.2.2 Datasets and Evaluation Metrics
      6.2.2.1 Datasets
      6.2.2.2 Evaluation Metrics
    6.2.3 Results
CHAPTER 7 CONCLUDING REMARKS
BIBLIOGRAPHY

LIST OF TABLES

Table 4.1: Main results: Inter-class Maximum Entropy branch vs Baseline
Table 6.1: Main results: MaxEnt vs ARL

LIST OF FIGURES

Figure 4.1: Inter-class MaxEnt toy example: making the non-target classes uniformly distributed to maximize entropy can protect the sensitive information while maintaining good target task performance.
Figure 4.2: Intra-class MaxEnt toy example: shaping the class-conditional representations into Gaussian distributions to maximize entropy can protect the sensitive information while maintaining good target task performance.
Figure 4.3: MaxEnt LDA
Figure 4.4: MaxEnt Method Structure
Figure 6.1: Curve of the trade-off between utility and fairness vs α
Figure 6.2: Probability simplex vs α
Figure 6.3: Entropy heat map on the probability simplex
Figure 6.4: Class-conditional representations vs α

LIST OF ALGORITHMS

Algorithm 4.1: Maximum Entropy Method

CHAPTER 1 INTRODUCTION

A common scenario encountered during real-world adoption of fair representation learning algorithms is the a priori lack of knowledge of the sensitive attributes with respect to which we wish to be fair. This may be the case when the sensitive attributes are not readily available or are too sensitive to be accessible. The underlying premises for the feasibility of this learning problem are: (1) by retaining only the minimal information relevant to the target task, it is feasible to be fair with respect to unknown sensitive attributes that are independent of the target attributes y, and (2) all other cases lead to a trade-off between utility and fairness that depends on the degree of dependence between the attributes.
Methods to address this scenario include Distributionally Robust Optimization (DRO) [1], Adversarially Reweighted Learning (ARL) [2], and Blind Pareto Fairness (BPF) [3]. DRO has been shown to overfit to outliers [2], ARL overfits when the adversary is more expressive than a simple linear layer, and BPF assumes convexity of the classifier’s loss and hypothesis class. More importantly, these approaches (1) seek to make the target predictions ŷ fair and do not consider the problem of learning fair representations, and (2) operate on small-scale problems. Another class of approaches seeks to leverage proxies to compensate for the lack of demographic labels [4, 5].

To address this scenario, we propose a Maximum Entropy method with both an intra-class and an inter-class branch that learns a representation balancing the trade-off between utility and fairness without prior demographic knowledge. The inter-class branch aims to make the non-target predictions uniformly distributed to mitigate sensitive information leakage, while the intra-class branch maximizes the entropy of the embedding given each class to realize fairness. We adopt two real-world datasets, COMPAS [6] and UCI Adult [7], to validate the effectiveness of our proposed method. We observe that (1) our method outperforms the SOTA ARL method [2] and makes it harder to extract sensitive demographic information from the representation without prior demographic knowledge, and (2) our method reaches a good trade-off between utility and fairness.

CHAPTER 2 RELATED WORK

There has been an increasing line of work addressing fairness without explicit demographics. Some papers [4, 8, 9] address this issue by using proxy variables to infer the protected population labels. These methods contrast with our assumptions by relying on a preconceived notion of what the protected demographics are (i.e., the protected demographics are known, but unobserved), since prior knowledge is needed to design useful proxy variables. Moreover, proxy methods have been shown to introduce undesired bias that can further exacerbate disparities [10, 11], and proxy information might be hard to obtain for many applications.

Fairness notions can be grouped into three categories: (i) individual fairness [12–16] provides guarantees beyond protected attributes, but requires predefined similarity functions, which may be hard or infeasible to design for real-world tasks; (ii) group fairness [17–19] learns a model that satisfies a certain notion of fairness across groups (e.g., statistical parity, equality of opportunity), but it conflicts with notions of no-harm fairness as in [20], where the benefits of some groups might be purposely harmed; (iii) fairness notions that aim to improve per-group performance, such as Pareto fairness and Rawlsian Max-Min fairness, which are appropriate where quality of service is paramount [1–3, 20]. In this thesis, we focus on (iii), per-group fairness.

Three recent approaches are the most similar to our objective (learning maximal target information while protecting unknown and unobserved sensitive information/demographics). The first is distributionally robust optimization (DRO) [1], where the goal is to achieve minimax fairness for unknown populations of sufficient size.
They minimize the risk of the worst-case group for the worst-case group partition, and use results from distributional robustness that focus the attention of the model exclusively on the high-risk samples (i.e., their model reduces the tail of the risk distribution). However, they do not explicitly account for Pareto efficiency, meaning that their solution may be sub-optimal on the population segment that lies below their high-risk threshold, doing unnecessary harm. The second recent method is adversarially reweighted learning (ARL) [2], where the model is trained to reduce a positive linear combination of the sample errors. These weighting coefficients are proposed by an adversary (implemented as a neural network) whose goal is to maximize the weighted empirical error. The method focuses on computationally identifiable subgroups. However, their adversary model is restricted to linear models, and they provide neither an optimality guarantee on the adversary nor a constraint on the computationally identifiable subgroups. The third recent method is Blind Pareto Fairness (BPF) [3], which does not rely on predefined notions of at-risk groups, neither at train nor at test time. It leverages no-regret dynamics to recover a fair minimax classifier that reduces the worst-case risk of any potential subgroup of sufficient size, and guarantees that the remaining population receives the best possible level of service. However, this method relies on strong convexity of the classifier's loss and of the classifier hypothesis class, and the classifier in their experiments is restricted to a 1-hidden-layer multi-layer perceptron (MLP).

CHAPTER 3 PROBLEM SETTING

This chapter describes the problem setting. In order to achieve fairness, we first state the precise mathematical formulation. In this work, we use Rawlsian Max-Min fairness as our fairness definition.

Definition 1. Rawlsian Max-Min Fairness: Suppose $\mathcal{H}$ is a set of hypotheses, and $U_{D_s}(h)$ is the expected utility of the hypothesis $h$ for the individuals in group $s$. A hypothesis $h^*$ satisfies the Rawlsian Max-Min fairness principle [20] if it maximizes the utility of the worst-off group, i.e., the group with the lowest utility:

$$h^* = \operatorname*{argmax}_{h \in \mathcal{H}} \; \min_{s \in S} \; U_{D_s}(h) \tag{3.1}$$

In our evaluation in Chapter 6, we use AUC as our utility metric, which aligns with [2], and report the minimum utility over the protected groups $S$ as AUC(min).

Most utility metrics in traditional machine learning are not differentiable, so convex loss functions are commonly used instead. The traditional machine learning task is to learn a model $h$ that minimizes the loss over the training data $D$:

$$h^*_{avg} = \operatorname*{argmin}_{h \in \mathcal{H}} \; L_D(h) \tag{3.2}$$

where $L_D(h) = \mathbb{E}_{(x_i, y_i) \sim D}[\ell(h(x_i), y_i)]$ for some loss function $\ell(\cdot)$ (e.g., cross entropy). We take the same perspective in turning the Rawlsian Max-Min fairness of Eq. (3.1) into a learning objective of the form of Eq. (3.2). Replacing the expected utility with an appropriate loss function $L_{D_s}(h)$ over the set of individuals in group $s$, we can formulate the fairness objective as:

$$h^*_{max} = \operatorname*{argmin}_{h \in \mathcal{H}} \; \max_{s \in S} \; L_{D_s}(h) \tag{3.3}$$

where $L_{D_s}(h) = \mathbb{E}_{(x_i, y_i) \sim D_s}[\ell(h(x_i), y_i)]$. Importantly, the demographic information $s$ is not needed during training; we only need the protected group information during evaluation. According to the objective in Eq. (3.3), there are two goals to achieve: learn the target task well and, at the same time, prevent sensitive information from leaking, without prior demographic knowledge.
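To make the evaluation side of this objective concrete, the following is a minimal sketch of the Rawlsian Max-Min computation under the assumption that per-group utility is AUC, as in our experiments. The arrays `y_true`, `y_score`, and `group` are hypothetical placeholders.

```python
# A minimal sketch of the Rawlsian Max-Min evaluation in Definition 1:
# the utility of each protected group is its AUC, and the worst-off
# group's utility is what Eq. (3.1) asks us to maximize.
import numpy as np
from sklearn.metrics import roc_auc_score

def max_min_utility(y_true, y_score, group):
    """Per-group AUC and its minimum; assumes each group has both labels."""
    per_group = {g: roc_auc_score(y_true[group == g], y_score[group == g])
                 for g in np.unique(group)}
    return min(per_group.values()), per_group
```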
CHAPTER 4 APPROACH: MAXIMUM ENTROPY

In this chapter, we introduce our Maximum Entropy method (MaxEnt). In information theory, the entropy of a random variable is a measure of its uncertainty. We maximize the uncertainty over the sensitive information to protect privacy. We implement our Maximum Entropy method on two branches: an inter-class branch and an intra-class branch. For the inter-class branch, we maximize the entropy of the non-target class predictions. For the intra-class branch, we maximize the entropy of the class-conditional representations by shaping them into Gaussian distributions.

4.1 Inter-class MaxEnt: Maximizing Entropy of non-Target Class

Definition 2. The maximum entropy distribution over a finite (discrete) range is the uniform distribution [21].

Based on Definition 2, we make the non-target class predictions uniformly distributed to maximize their entropy. Why maximizing the entropy of the non-target predictions protects sensitive information is illustrated by the following toy example.

Figure 4.1: Inter-class MaxEnt toy example: making the non-target classes uniformly distributed to maximize entropy can protect the sensitive information while maintaining good target task performance.

In Figure 4.1, the target task is to classify the dog image among dog, cat, and car images [22]. At the same time, we also have sensitive labels in mind that we do not use during training: animal and non-animal. From the output of the class prediction, we observe that the dog is predicted correctly, as it has the highest prediction probability of 0.55, compared to 0.4 for cat and 0.05 for car. However, since cat and dog share more similarities than car, their predicted probabilities are also more similar, which reveals that dog and cat likely come from the same sensitive group (animal) while car comes from the other sensitive group (non-animal). In this way, the sensitive information is easily leaked. To mitigate this leakage, we make the non-target classes uniformly distributed: the original probability triplet $P(\text{dog}) = 0.55$, $P(\text{cat}) = 0.4$, $P(\text{car}) = 0.05$ becomes $P(\text{dog}) = 0.55$, $P(\text{cat}) = 0.225$, $P(\text{car}) = 0.225$. We can still correctly predict the input image as a dog, but we can no longer tell the sensitive-group relationship among the three classes. Thus, by making the non-target classes uniformly distributed to maximize entropy, we protect the sensitive information while maintaining good performance on the target task.
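A sketch of how this branch's loss can be implemented is shown below (in PyTorch); it is one way to realize the uniformization above, not the exact training code. It uses the identity $KL(q \,\|\, U) = \log(K-1) - H(q)$ for a distribution $q$ supported on the $K-1$ non-target classes.

```python
# A sketch of the inter-class entropy loss of Eq. (4.2): push the
# renormalized non-target prediction toward the uniform distribution
# over the K-1 non-target classes, using KL(q || U) = log(K-1) - H(q).
import math
import torch
import torch.nn.functional as F

def inter_class_entropy_loss(logits, labels):
    k = logits.size(1)
    probs = F.softmax(logits, dim=1)                      # q_Y(y_hat | z)
    mask = F.one_hot(labels, k).bool()
    q = probs.masked_fill(mask, 0.0)                      # drop the ground-truth class
    q = q / q.sum(dim=1, keepdim=True).clamp_min(1e-12)   # renormalize over K-1 classes
    h = -(q * q.clamp_min(1e-12).log()).sum(dim=1)        # entropy of the non-target part
    return (math.log(k - 1) - h).mean()                   # KL(q || uniform), batch mean
```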
We further validate this idea with an experiment on the real-world COMPAS dataset [6], which has 11 features and 7,215 data samples. The protected features are race and sex, and the target label is recidivism. The baseline is the vanilla group-agnostic Baseline used in [2], which performs standard Empirical Risk Minimization (ERM) with uniform weights. We use a set of AUCs as evaluation metrics: AUC(min) is the minimum AUC over all protected groups; AUC(macro-average) is the macro-average, i.e., the unweighted mean, of the protected-group AUCs; AUC(minority) is the AUC of the smallest protected group in the dataset. We choose AUC (area under the ROC curve) as our evaluation metric because it is robust to class imbalance, it reflects both the False Positive Rate (FPR) and the False Negative Rate (FNR), and it is threshold agnostic.

Table 4.1: Main results: Inter-class Maximum Entropy branch vs Baseline

  Dataset   Method               AUC avg   AUC macro-avg   AUC min   AUC minority
  COMPAS    Baseline             0.748     0.730           0.674     0.774
  COMPAS    Inter-class branch   0.748     0.733           0.693     0.726

From Table 4.1, we observe that training with the target loss and the inter-class entropy loss yields overall AUCs at or above the baseline, and the gaps between the AUC metrics shrink, which indicates a good trade-off between utility and fairness. Although the inter-class MaxEnt branch has a lower AUC(minority) than the baseline on COMPAS, this does not diminish our method, because AUC(minority) does not necessarily convey fairness: it is the AUC of the smallest protected group, and in many cases the smallest group is not the disadvantaged group (consider, e.g., a small rich group versus a large poor group). The other AUC metrics are therefore more convincing measures of fairness. A more detailed discussion is given in Chapter 6.

4.2 Intra-class MaxEnt: Maximizing Entropy of Class-Conditional Representations

Definition 3. The maximum entropy distribution under fixed mean and variance constraints is the Gaussian distribution [21].

Based on Definition 3, we shape the class-conditional representations into Gaussian distributions to maximize their entropy. Why this protects sensitive information is explained by the example below. In this intra-class MaxEnt toy example, the shape (circle or triangle) represents the target classes and the color (blue or red) represents the sensitive classes. The green line is the decision boundary for the target classes, while the blue line is the decision boundary for the sensitive classes.

Figure 4.2: Intra-class MaxEnt toy example ((a) class-conditional representations are not Gaussian; (b) class-conditional representations are Gaussian): shaping the class-conditional representations into Gaussian distributions to maximize entropy can protect the sensitive information while maintaining good target task performance.

In Figure 4.2, before shaping the class-conditional representations into Gaussian distributions, we can perfectly classify both the shape and the color with the green and blue decision boundaries (Figure 4.2a). After shaping the class-conditional representations into Gaussian distributions (Figure 4.2b, where the representations become more round), we can still perfectly classify the shapes with the green decision boundary, but we can no longer perfectly classify the colors with the blue decision boundary. This toy example illustrates that shaping the class-conditional representations into Gaussian distributions to maximize entropy can protect the sensitive information while maintaining good target task performance.
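To make the quantity being maximized concrete, the sketch below estimates the differential entropy of the representations under a Gaussian fit within each class, $H = \frac{d}{2}\ln(2\pi e) + \frac{1}{2}\ln|\Sigma|$. The arrays `z` (representations) and `y` (target labels) are hypothetical placeholders, and the small diagonal term is a stand-in for the singularity regularization introduced in Section 4.3.

```python
# A sketch of the intra-class quantity being maximized: the entropy of a
# Gaussian fit to the representations within each target class,
# H = d/2 * ln(2*pi*e) + 1/2 * ln|Sigma|.
import numpy as np

def class_conditional_entropy(z, y):
    """Gaussian entropy of representations z (n x d) within each class y."""
    out = {}
    for c in np.unique(y):
        zc = z[y == c]
        d = zc.shape[1]
        cov = np.cov(zc, rowvar=False) + 1e-6 * np.eye(d)  # keep |Sigma| > 0, cf. L_S
        _, logdet = np.linalg.slogdet(cov)
        out[c] = 0.5 * d * np.log(2 * np.pi * np.e) + 0.5 * logdet
    return out
```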
Next, we further illustrate the intra-class MaxEnt idea with a linear example, which we call MaxEnt LDA. In traditional Linear Discriminant Analysis (LDA) [23], the goal is to perform dimensionality reduction while preserving as much of the class-discriminatory information as possible. The ideal projection therefore achieves far-apart means between classes and small variance within each class, giving the traditional LDA objective

$$\max_{w} \; \frac{w^T S_1 w}{w^T S_2 w}, \quad \text{s.t. } w^T w = 1,$$

where $S_1$ is the between-class scatter and $S_2$ is the within-class scatter of the samples/feature vectors. From Figure 4.2b, however, the ideal representation for our purpose has far-apart means between classes but large variance within each class. The objective of MaxEnt LDA (the linear example of the MaxEnt method) thus extends traditional LDA to

$$\max_{w} \; \alpha\, w^T S_1 w + (1 - \alpha)\, w^T S_2 w, \quad \text{s.t. } \alpha \in [0, 1], \; w^T w = 1.$$

In this toy example, we use 2-dimensional Gaussian data.

Figure 4.3: MaxEnt LDA

From Figure 4.3, we observe that with a suitable choice of $\alpha$, we can find a projection direction along which the target classes are well separated while the sensitive-class representations overlap, so we can classify the target class perfectly while being unable to classify the sensitive class.
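Since this constrained quadratic objective is maximized by the leading eigenvector of the combined scatter matrix, MaxEnt LDA admits the closed-form sketch below; `S1` and `S2` are the scatter matrices defined above and are assumed to be computed elsewhere.

```python
# A sketch of the MaxEnt LDA objective:
#   max_w  alpha * w^T S1 w + (1 - alpha) * w^T S2 w   s.t.  w^T w = 1,
# whose maximizer is the leading eigenvector of alpha*S1 + (1-alpha)*S2.
import numpy as np

def maxent_lda_direction(S1, S2, alpha):
    A = alpha * S1 + (1.0 - alpha) * S2
    eigvals, eigvecs = np.linalg.eigh(A)   # A is symmetric, so eigh applies
    return eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
```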
4.3 Maximum Entropy (MaxEnt) Method

Figure 4.4: MaxEnt Method Structure

Our goal is to learn a representation that achieves a good trade-off between utility and fairness. To obtain good utility, we learn a representation that supports the target task well; to achieve fairness, we apply the Maximum Entropy method on both the inter-class and the intra-class branch.

To learn the target task well, the encoder $E$, parameterized by $\theta_E$, is trained to maximize the likelihood of the target attribute as measured by the target predictor parameterized by $\theta_Y$:

$$\mathcal{L}_T(\theta_E, \theta_Y) = KL\big(p(y|x) \,\|\, q_Y(\hat{y}\,|\,e(x; \theta_E); \theta_Y)\big) \tag{4.1}$$

To maximize the entropy of the non-target classes and protect the sensitive information, the encoder and predictor are also trained with the inter-class Maximum Entropy loss $\mathcal{L}_E$:

$$\mathcal{L}_E(\theta_E, \theta_Y) = KL\big(q_Y(\hat{y}\,|\,e(x; \theta_E); \theta_Y)_{\hat{y} \neq y} \,\|\, \tilde{U}_y\big), \quad \text{s.t. } \tilde{U}_y = \frac{1 - P(y)}{\#Y - 1} \tag{4.2}$$

where $P(y)$ denotes the probability of the target class and $\#Y$ denotes the total number of target classes.

To shape the class-conditional representations into Gaussian distributions, we train learnable hypothesis parameters $\{\mu_{t|y}, \Sigma_{t|y}\}$. A mean and covariance calculator computes the mean and covariance of the representations given each class, denoted $\{\hat{\mu}_{z|y}, \hat{\Sigma}_{z|y}\}$. The learnable hypothesis parameters are updated iteratively until they converge to the calculated class-conditional means and covariances. They are trained with the intra-class Maximum Entropy loss:

$$\mathcal{L}_G(\theta_E, \mu_{z|y}, \Sigma_{z|y}) = \mathbb{E}\big[\|\mu_{z|y} - \hat{\mu}_{z|y}\|_2^2\big] + \mathbb{E}\big[\|\Sigma_{z|y} - \hat{\Sigma}_{z|y}\|_2^2\big] \tag{4.3}$$

Since the class-conditional representations are learned to be Gaussian distributions, regularization is needed. The regularization losses $\mathcal{L}_R$ and $\mathcal{L}_S$ keep the covariances of the class-conditional Gaussians isotropic and their diagonal elements non-zero:

$$\mathcal{L}_R(\theta_E, \Sigma_{z|y}) = \|\mathrm{diag}(\Sigma_{z|y}) - \mathrm{diag}(\hat{\Sigma}_{z|y})\|_2^2 + \mathrm{off\text{-}diag}(\hat{\Sigma}_{z|y}) \tag{4.4}$$

$$\mathcal{L}_S(\theta_E, \Sigma_{z|y}) = \mathbb{E}\big[\exp(-|\mathrm{diag}(\Sigma_{z|y})|) + \exp(-|\mathrm{diag}(\hat{\Sigma}_{z|y})|)\big] \tag{4.5}$$

In summary, the Maximum Entropy method is given in Algorithm 4.1.

Algorithm 4.1: Maximum Entropy Method
Require: training data $x \in X$ with target labels $y \in Y$; initial encoder parameters $\theta_E$; initial target predictor parameters $\theta_Y$; initial learnable hypothesis parameters $\{\mu_{t|y}, \Sigma_{t|y}\}$ for each class-conditional representation; trade-off parameters $\alpha_1, \alpha_2$ with $0 \leq \alpha_1 + \alpha_2 \leq 1$; small regularization parameters $\epsilon_1, \epsilon_2 < 10^{-6}$; learning rates $\eta_1, \eta_2$.

  t ← 0
  while not converged do
    t ← t + 1
    Compute the target loss L_T(θ_E, θ_Y)                                   // Eq. (4.1)
    Compute the inter-class MaxEnt loss L_E(θ_E, θ_Y)                       // Eq. (4.2)
    Compute the intra-class MaxEnt loss L_G(θ_E, µ_{z|y}, Σ_{z|y})          // Eq. (4.3)
    Compute the Gaussian regularization loss L_R(θ_E, Σ_{z|y})              // Eq. (4.4)
    Compute the Gaussian singularity regularization loss L_S(θ_E, Σ_{z|y})  // Eq. (4.5)
    L^t ← α_1 L_T^t + α_2 L_E^t + (1 − α_1 − α_2) L_G^t + ϵ_1 L_R^t + ϵ_2 L_S^t
    Back-propagate the error ∂L^t/∂x_i^t through each layer i
    Update θ_E: θ_E^{t+1} = θ_E^t − η_1 ∂L^t/∂θ_E^t
    Update {µ_{t|y}, Σ_{t|y}}: µ_{t|y}^{t+1} = µ_{t|y}^t − η_2 ∂L^t/∂µ_{t|y}^t, Σ_{t|y}^{t+1} = Σ_{t|y}^t − η_2 ∂L^t/∂Σ_{t|y}^t
  end while
  return θ_E and {µ_{t|y}, Σ_{t|y}}
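For concreteness, the following is a condensed PyTorch sketch of one iteration of Algorithm 4.1, reusing `inter_class_entropy_loss` from the sketch in Section 4.1. The intra-class terms are simplified stand-ins for Eqs. (4.3)–(4.5), and `encoder`, `predictor`, `mu`, `cov` (the learnable per-class hypothesis parameters), and `opt` are assumed to be set up elsewhere.

```python
# A condensed sketch of one step of Algorithm 4.1; the per-class terms
# below are simplified stand-ins for L_G, L_R, and L_S.
import torch
import torch.nn.functional as F

def maxent_step(x, y, encoder, predictor, mu, cov, opt,
                a1=0.4, a2=0.3, eps1=1e-7, eps2=1e-7):
    z = encoder(x)
    logits = predictor(z)
    l_t = F.cross_entropy(logits, y)                   # L_T, Eq. (4.1)
    l_e = inter_class_entropy_loss(logits, y)          # L_E, Eq. (4.2)
    l_g = l_r = l_s = 0.0
    for c in torch.unique(y):
        zc = z[y == c]
        m_hat = zc.mean(dim=0)                         # batch class mean
        s_hat = torch.cov(zc.T)                        # batch class covariance
        l_g = l_g + ((mu[c] - m_hat) ** 2).sum() + ((cov[c] - s_hat) ** 2).sum()
        l_r = l_r + (s_hat - torch.diag(torch.diagonal(s_hat))).abs().sum()
        l_s = l_s + torch.exp(-torch.diagonal(s_hat).abs()).sum()
    loss = a1 * l_t + a2 * l_e + (1 - a1 - a2) * l_g + eps1 * l_r + eps2 * l_s
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```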
CHAPTER 5 THEORETICAL ANALYSIS

This chapter presents the theoretical analysis of the proposed Maximum Entropy method. To establish the feasibility of the inter-class and intra-class branches, we propose Theorem 1 and Theorem 2, supported by Lemma 1.

Lemma 1. $p(s|z) = f_{ys}(z)\, k_{ys}\, p(y|z)$, where $f_{ys}(z) = \frac{p(z|s)}{p(z|y)}$ and $k_{ys} = \frac{p(s)}{p(y)}$.

Proof. As $Y$ and $S$ are dependent, the joint $P(S, Y)$ exists, and

$$P(S, Y) = P(S|Y)P(Y) = P(Y|S)P(S), \quad \therefore \; P(S = s) = \frac{P(S = s|Y = y)}{P(Y = y|S = s)}\, P(Y = y) = k_{ys}\, P(Y = y) \tag{5.1}$$

Furthermore, since $p(y|z) = \frac{p(y, z)}{p(z)} = \frac{p(z|y)p(y)}{p(z)}$, we have $p(z) = \frac{p(z|y)p(y)}{p(y|z)}$, and therefore

$$p(s|z) = \frac{p(s, z)}{p(z)} = \frac{p(z|s)p(s)}{p(z)} = \frac{p(z|s)p(s)}{p(z|y)p(y)}\, p(y|z) = f_{ys}(z)\, k_{ys}\, p(y|z), \quad \text{with } f_{ys}(z)k_{ys} = \frac{p(z|s)p(s)}{p(z|y)p(y)} = \frac{p(z, s)}{p(z, y)} \tag{5.2}$$

Theorem 1. Maximizing $H(Z|Y)$ either leaves $H(S|Z)$ unchanged or increases it.

Proof.

$$H(S|Z) = H(S, Z) - H(Z) = H(Z|S) + H(S) - H(Z) \tag{5.3}$$

According to Lemma 1, $p(s|z) = f_{ys}(z) k_{ys} p(y|z)$, so

$$\begin{aligned}
H(Z|S) &= -\sum_{s \in S} \int_{z \in Z} p(s) p(z|s) \log \frac{p(z, s)}{p(s)}\, dz
= -\sum_{s \in S} \int_{z \in Z} p(s|z) p(z) \log \frac{p(s|z) p(z)}{p(s)}\, dz \\
&= -\sum_{s \in S} \sum_{y \in Y} \int_{z \in Z} k_{ys} f_{ys}(z) p(z, y) \log \frac{k_{ys} f_{ys}(z) p(z, y)}{p(s)}\, dz \\
&= \underbrace{-\sum_{s} \sum_{y} \int_{z} k_{ys} f_{ys}(z) p(z, y) \log k_{ys} f_{ys}(z)\, dz}_{A_1}
\underbrace{-\sum_{s} \sum_{y} \int_{z} k_{ys} f_{ys}(z) p(z, y) \log p(z, y)\, dz}_{A_2}
+ \underbrace{\sum_{s} \sum_{y} \int_{z} k_{ys} f_{ys}(z) p(z, y) \log p(s)\, dz}_{A_3}
\end{aligned} \tag{5.4}$$

For $A_3$, since $k_{ys} f_{ys}(z) = \frac{p(z, s)}{p(z, y)}$,

$$A_3 = \sum_{s} \sum_{y} \int_{z} \frac{p(z, s)}{p(z, y)}\, p(z, y) \log p(s)\, dz
= \sum_{s} \int_{z} p(z, s)\, dz \log p(s)
= \sum_{s} p(s) \log p(s) = -H(S) \tag{5.5}$$

For $A_2$, since $k_{ys} f_{ys}(z) = \frac{p(z, s)}{p(z, y)}$ is continuous and $-p(z, y) \log p(z, y) \geq 0$, the First Mean Value Theorem [24] (Theorem 3) gives a constant $M$ such that

$$A_2 = M \sum_{y} \int_{z} -p(z, y) \log p(z, y)\, dz = M H(Z, Y) \tag{5.6}$$

For $A_1$, using the chain rule of KL divergence [25] (Theorem 4) and again the First Mean Value Theorem with constant $M_2$,

$$\begin{aligned}
A_1 &= -\sum_{s} \sum_{y} \int_{z} p(z, s) \log \frac{p(z, s)}{p(z, y)}\, dz = -D(p(z, s) \,\|\, p(z, y)) \\
&= -\big(D(p(s) \,\|\, p(y)) + D(p(z|s) \,\|\, p(z|y))\big) = C_1 - D(p(z|s) \,\|\, p(z|y)) \\
&= C_1 - M_2 \sum_{s} \sum_{y} \int_{z} \big(\log p(z|s) - \log p(z|y)\big)\, dz = C_1 - M_2 (K_Y - K_S) = C_2
\end{aligned} \tag{5.7}$$

where $K_Y$ and $K_S$ are the total numbers of classes in $Y$ and $S$, respectively, and $C_2$ denotes a constant. We can then rewrite Eq. (5.3) as

$$H(S|Z) = A_1 + A_2 + A_3 + H(S) - H(Z) = C_2 + M H(Z, Y) - H(Z) \tag{5.8}$$

so that, absorbing the constant $C_2$, $H(S, Z) = H(S|Z) + H(Z) = M H(Z, Y)$. Moreover, $H(Z, Y) = H(Y|Z) + H(Z) = H(Z)$ when $H(Y|Z) = 0$, and since $H(S|Z) \geq 0$, $H(Z, S) = H(S|Z) + H(Z, Y) \geq H(Z, Y)$; combined with $H(Z, S) = M H(Z, Y)$, this gives $M \geq 1$. Hence

$$H(S|Z) = H(S, Z) - H(Z) = M H(Z, Y) - H(Z) = M H(Z|Y) - H(Z) + M H(Y), \quad M \geq 1 \tag{5.9}$$

According to [26], the upper bound of $H(Z)$ is $C_3 + H(Z|Y)$ (Theorem 5), where $C_3$ is a constant. When each class-conditional representation is Gaussian,

$$\begin{aligned}
H(Z|Y) &= -\sum_{k=1}^{K} w_k \int_{z} p(z|y = k) \log p(z|y = k)\, dz = \sum_{k=1}^{K} w_k H_k(Z) \\
&= \sum_{k=1}^{K} w_k \Big(\frac{d}{2} \ln 2\pi e + \frac{1}{2} \ln |\Sigma_k|\Big) = \frac{d}{2} \ln 2\pi e + \frac{1}{2} \sum_{k=1}^{K} w_k \ln |\Sigma_k|
\end{aligned} \tag{5.10}$$

and by Theorem 3 of [26],

$$H(Z) \leq \sum_{k=1}^{K} w_k \Big(-\log w_k + \frac{d}{2} \ln 2\pi e + \frac{1}{2} \ln |\Sigma_k|\Big) = C + H(Z|Y) \tag{5.11}$$

Finally, from Eq. (5.9) and the bound of Eq. (5.11),

$$\frac{\partial H(S|Z)}{\partial |\Sigma_i|} = M \frac{\partial H(Z|Y)}{\partial |\Sigma_i|} - \frac{\partial H(Z)}{\partial |\Sigma_i|} = (M - 1) \frac{\partial H(Z|Y)}{\partial |\Sigma_i|} \geq 0, \quad \text{since } \frac{\partial H(Z|Y)}{\partial |\Sigma_i|} > 0 \text{ and } M \geq 1 \tag{5.12}$$

Theorem 2. Making the non-target classes uniformly distributed is equivalent to maximizing $H(Y|Z)$, and can maximize $H(S|Z)$.

Proof.
Denote by $K$ the number of target classes and by $D_{KL}$ the KL divergence between two probability distributions, and write $p_k(z) = p(y = Y_k|z)$ for the posterior of target class $Y_k$. Then

$$\begin{aligned}
&D_{KL}\Big(P(y \neq Y|Z) \,\Big\|\, \frac{1 - P(y = Y|Z)}{K - 1}\Big)
= \int_{z} p(z) \sum_{y \in Y} \sum_{y' \neq y} p(y'|z) \log \frac{p(y'|z)}{\frac{1 - p(y|z)}{K - 1}}\, dz \\
&= \underbrace{\sum_{y \in Y} \sum_{y' \neq y} \int_{z} p(z) p(y'|z) \log p(y'|z)\, dz}_{B_1}
\underbrace{- \sum_{y \in Y} \sum_{y' \neq y} \int_{z} p(z) p(y'|z) \log (1 - p(y|z))\, dz}_{B_2}
+ \underbrace{\sum_{y \in Y} \sum_{y' \neq y} \int_{z} p(z) p(y'|z) \log (K - 1)\, dz}_{B_3}
\end{aligned} \tag{5.13}$$

For $B_3$, expanding the double sum over the classes $Y_1, \dots, Y_K$, each posterior $p_k(z)$ appears exactly $K - 1$ times, so

$$B_3 = (K - 1) \sum_{k=1}^{K} \int_{z} p(z) p_k(z) \log (K - 1)\, dz = (K - 1) \sum_{k=1}^{K} \int_{z} p(z, y = Y_k)\, dz \log (K - 1) = (K - 1) \log (K - 1) = C_4 \tag{5.14}$$

where $C_4$ is a constant. For $B_2$, grouping the inner sum and using $\sum_{j \neq k} p_j(z) = 1 - p_k(z)$,

$$B_2 = -\sum_{k=1}^{K} \int_{z} p(z) \big(1 - p_k(z)\big) \log \big(1 - p_k(z)\big)\, dz \tag{5.15}$$

For $B_1$, by the same grouping as in Eq. (5.14), and writing $H(Y_k|Z) = -\int_{z} p(z) p_k(z) \log p_k(z)\, dz$,

$$B_1 = (K - 1) \sum_{k=1}^{K} \int_{z} p(z) p_k(z) \log p_k(z)\, dz = -(K - 1) \sum_{k=1}^{K} H(Y_k|Z) \tag{5.16}$$

Combining Eqs. (5.14)–(5.16),

$$\begin{aligned}
D_{KL}\Big(P(y \neq Y|Z) \,\Big\|\, \frac{1 - P(y = Y|Z)}{K - 1}\Big) &= B_1 + B_2 + B_3 \\
&= (K - 1) \sum_{k=1}^{K} \int_{z} p(z) p_k(z) \log p_k(z)\, dz \\
&\quad - \sum_{k=1}^{K} \int_{z} p(z) \big(1 - p_k(z)\big) \log \big(1 - p_k(z)\big)\, dz + C_4 \\
&= -K \sum_{k=1}^{K} H(Y_k|Z) - \sum_{k=1}^{K} \int_{z} p(z) p_k(z) \log p_k(z)\, dz \\
&\quad - \sum_{k=1}^{K} \int_{z} p(z) \big(1 - p_k(z)\big) \log \big(1 - p_k(z)\big)\, dz + C_4 \\
&= -K \sum_{k=1}^{K} H(Y_k|Z) - 2 \sum_{k=1}^{K} \int_{z} p(z) p_k(z) \log p_k(z)\, dz + C_4 \\
&= -K \sum_{k=1}^{K} H(Y_k|Z) + 2 \sum_{k=1}^{K} H(Y_k|Z) + C_4 \\
&= (2 - K) H(Y_1|Z) + \cdots + (2 - K) H(Y_K|Z) + C_4
\end{aligned} \tag{5.17}$$
In this way, we can rewrite the objective of Eq. (5.13) as

$$\begin{aligned}
\min\; D_{KL}\Big(P(y \neq Y|Z) \,\Big\|\, \frac{1 - P(y = Y|Z)}{K - 1}\Big)
&= \min\; \big\{(2 - K)\big(H(Y_1|Z) + \cdots + H(Y_K|Z)\big) + C_4\big\} \\
&= \max\; \big\{(K - 2)\big(H(Y_1|Z) + \cdots + H(Y_K|Z)\big) + C_4\big\} \quad (\because K \geq 2) \\
&\propto \max\; \big\{H(Y_1|Z) + \cdots + H(Y_K|Z)\big\} = \max\; \{H(Y|Z)\}
\end{aligned} \tag{5.18}$$

In this way, we get that minimizing $D_{KL}\big(P(y \neq Y|Z) \,\|\, \frac{1 - P(y = Y|Z)}{K - 1}\big)$ is equivalent to maximizing $H(Y|Z)$. It remains to show why maximizing $H(Y|Z)$ can maximize $H(S|Z)$. Recall

$$H(Y|Z) = -\sum_{y \in Y} \int_{z} p(z) p(y|z) \log p(y|z)\, dz \tag{5.19}$$

According to Lemma 1, $p(s|z) = f_{ys}(z) k_{ys} p(y|z)$, so

$$\begin{aligned}
H(S|Z) &= -\sum_{s \in S} \int_{z} p(z) p(s|z) \log p(s|z)\, dz \\
&= \underbrace{-\sum_{s} \sum_{y} \int_{z} p(z) f_{ys}(z) k_{ys} p(y|z) \log f_{ys}(z) k_{ys}\, dz}_{D_1}
\underbrace{-\sum_{s} \sum_{y} \int_{z} p(z) f_{ys}(z) k_{ys} p(y|z) \log p(y|z)\, dz}_{D_2}
\end{aligned} \tag{5.20}$$

As in the proof of Eq. (5.7), $D_1 = A_1 = C_2$. For $D_2$, since $f_{ys}(z) k_{ys} = \frac{p(z, s)}{p(z, y)}$ is continuous and $-p(z) p(y|z) \log p(y|z) \geq 0$, the First Mean Value Theorem [24] gives a constant $M_1$ such that

$$D_2 = M_1 \sum_{y} \int_{z} -p(z) p(y|z) \log p(y|z)\, dz = M_1 H(Y|Z) \tag{5.21}$$

In this way,

$$H(S|Z) = D_1 + D_2 = C_2 + M_1 H(Y|Z), \quad \text{and since } \frac{p(z, s)}{p(z, y)} \geq 0, \; M_1 \geq 0, \quad \therefore \frac{d H(S|Z)}{d H(Y|Z)} = M_1 \geq 0 \tag{5.22}$$

So increasing $H(Y|Z)$ can only increase $H(S|Z)$, and hence $\max\{H(Y|Z)\}$ can $\max\{H(S|Z)\}$.

Theorem 3. First Mean Value Theorem [24]: Let $f$ and $g$ be two continuous functions on $[a, b]$ and assume $g$ is non-negative. Then there exists some constant $M$ such that $\int_a^b f(x) g(x)\, dx = M \int_a^b g(x)\, dx$.

Theorem 4. Chain rule of KL divergence [25]: $D(p(x, y) \,\|\, q(x, y)) = D(p(x) \,\|\, q(x)) + D(p(y|x) \,\|\, q(y|x))$.

Proof.

$$\begin{aligned}
D(p(x, y) \,\|\, q(x, y)) &= \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{q(x, y)} = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x) p(y|x)}{q(x) q(y|x)} \\
&= \sum_{x} \sum_{y} p(x, y) \log \frac{p(x)}{q(x)} + \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)} \\
&= D(p(x) \,\|\, q(x)) + D(p(y|x) \,\|\, q(y|x))
\end{aligned} \tag{5.23}$$

Theorem 5. Upper bound of $H(Z)$ when $Z$ is a mixture of multivariate Gaussians [26]: $H(Z) \leq \sum_{k=1}^{K} w_k \big(-\log w_k + \frac{d}{2} \ln 2\pi e + \frac{1}{2} \ln |\Sigma_k|\big)$.

Proof. Write $N_k(z) = N(z; \mu_k, \Sigma_k)$ and define $\epsilon_k(z) = \frac{\sum_{j \neq k} w_j N_j(z)}{w_k N_k(z)} \geq 0$, so that $\sum_{j} w_j N_j(z) = w_k N_k(z)\big(1 + \epsilon_k(z)\big)$. Then

$$\begin{aligned}
H(Z) &= -\int_{z} \sum_{k=1}^{K} w_k N_k(z) \log \Big(\sum_{j=1}^{K} w_j N_j(z)\Big)\, dz
= -\sum_{k=1}^{K} w_k \int_{z} N_k(z) \big(\log w_k N_k(z) + \log (1 + \epsilon_k(z))\big)\, dz \\
&\leq -\sum_{k=1}^{K} w_k \int_{z} N_k(z) \log w_k N_k(z)\, dz \quad (\because \log(1 + \epsilon_k(z)) \geq 0) \\
&= \sum_{k=1}^{K} w_k \Big(-\log w_k + \frac{d}{2} \ln 2\pi e + \frac{1}{2} \ln |\Sigma_k|\Big)
\end{aligned} \tag{5.24}$$
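As a quick numerical sanity check of Theorem 5, the sketch below compares the entropy of a 1-D two-component Gaussian mixture, computed by quadrature, against the bound; the mixture weights and parameters are arbitrarily chosen for illustration.

```python
# Sanity check of Theorem 5 for a 1-D two-component Gaussian mixture:
# the true entropy H(Z) should not exceed the bound
# sum_k w_k * (-log w_k + 0.5 * log(2*pi*e*sigma_k^2)).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

w, mu, sigma = [0.3, 0.7], [0.0, 2.0], [1.0, 0.5]

def pdf(z):
    return sum(wk * norm.pdf(z, m, s) for wk, m, s in zip(w, mu, sigma))

H, _ = quad(lambda z: -pdf(z) * np.log(pdf(z) + 1e-300), -15, 15)
bound = sum(wk * (-np.log(wk) + 0.5 * np.log(2 * np.pi * np.e * s**2))
            for wk, s in zip(w, sigma))
assert H <= bound + 1e-6   # the bound holds, with room to spare here
```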
CHAPTER 6 ABLATION STUDY AND EXPERIMENTAL RESULTS

6.1 Ablation study

6.1.1 Ablation study of target loss LT and inter-class entropy loss LE

In this section, we present the ablation study of the target loss $\mathcal{L}_T$ and the inter-class entropy loss $\mathcal{L}_E$. In this study, $\mathcal{L} = \alpha \mathcal{L}_T + (1 - \alpha) \mathcal{L}_E$ with $\alpha \in \{0.0, 0.1, \dots, 1.0\}$, and the data used in this section is Gaussian distributed.

Figure 6.1: Curve of the trade-off between utility and fairness vs α

In Figure 6.1, the x-axis measures fairness, denoted by the testing entropy: the larger the entropy, the better the fairness. The y-axis measures the classification accuracy: the larger the accuracy, the better the target task is learned. Each red dot denotes the trade-off between utility (classification accuracy) and fairness (entropy) for a different value of $\alpha$. The result traces a Pareto curve of utility and fairness, which means that by selecting $\alpha$ carefully, we can achieve a good trade-off between the two. For example, in this scenario, $\alpha = 0.1$ yields high classification accuracy (utility) and large entropy (fairness) at the same time.

We further display our results on a probability simplex to visualize how the target loss $\mathcal{L}_T$ and the inter-class entropy loss $\mathcal{L}_E$ interact in this ablation study.

Definition 4. Probability simplex: a probability simplex is a mathematical space where each point represents a probability distribution over a finite number of mutually exclusive events.

In this experiment, we use 3-class Gaussian data with $\mathcal{L} = \alpha \mathcal{L}_T + (1 - \alpha) \mathcal{L}_E$, $\alpha \in \{0.0, 0.1, \dots, 1.0\}$; as $\alpha$ increases, we focus more on the target prediction task and less on fairness, and vice versa.

Figure 6.2: Probability simplex vs α

We observe the following: when $\alpha = 0$, the probabilities for each class are all the same and uniformly distributed, and the classification accuracy is 33%, which amounts to random guessing. As $\alpha$ increases, the points, which represent the probability triplets, move toward the correct classification regions and become less uniformly distributed. When $\alpha = 1$, all probability points move into the correct classification regions. We also show the entropy heat map on the probability simplex to support the feasibility of our method mathematically.

Figure 6.3: Entropy heat map on the probability simplex

According to Figure 6.3, after uniformizing the non-target classes, the entropy values across the heat map on the probability simplex are larger than before uniformization.
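The sketch below reproduces the quantities behind Figures 6.2 and 6.3 numerically: Shannon entropy on the 3-class simplex, and the entropy increase from uniformizing the non-target classes of the toy triplet from Section 4.1. The helper names are illustrative only.

```python
# Shannon entropy on the 3-class probability simplex, and the effect of
# uniformizing the non-target classes of a prediction triplet (Sec. 4.1).
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def uniformize_non_target(p, target):
    """Keep the target probability, spread the rest uniformly."""
    q = np.full(len(p), (1.0 - p[target]) / (len(p) - 1))
    q[target] = p[target]
    return q

p = [0.55, 0.40, 0.05]                  # dog / cat / car triplet from Sec. 4.1
q = uniformize_non_target(p, target=0)  # -> [0.55, 0.225, 0.225]
print(entropy(p), entropy(q))           # ~0.845 -> ~1.000: entropy increases
```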
6.1.2 Ablation study of target loss LT and intra-class entropy loss LG

In this section, we discuss the ablation study of the target loss $\mathcal{L}_T$ and the intra-class entropy loss $\mathcal{L}_G$. In this study, $\mathcal{L} = \alpha \mathcal{L}_T + (1 - \alpha) \mathcal{L}_G + \epsilon \mathcal{L}_S$ with $\alpha \in \{0.0, 0.1, \dots, 1.0\}$ and $\epsilon \in \{0.00, 0.05\}$. The data used in this section is 3-class Gaussian data.

Figure 6.4: Class-conditional representations vs α

According to Figure 6.4, we observe the following: when $\alpha = 0.0$, we only care about whether the learnable hypothesis parameters and the calculated class-conditional representation parameters agree, so all $N_1$, $N_2$ overlap and the three classes cannot be classified. As $\alpha$ increases, we focus more on the classification target task, so the class-conditional representations start to separate from each other. At $\alpha = 1$, we focus only on the classification loss, so the 3-class representations are well separated, but $N_1$ and $N_2$ no longer overlap with each other. Moreover, $N_2$ returns to its initial parameters, which is why the three $N_2$'s coincide, with $\mu_i = [0, 0]$, $\Sigma_i = [[1, 0], [0, 1]]$, $i \in \{0, 1, 2\}$.

6.2 Experimental Results

6.2.1 Baselines and Implementation Details

Baselines: We consider baselines based on Rawlsian Max-Min fairness that do not use prior demographic information. Specifically, we consider two baselines: (1) the vanilla group-agnostic Baseline, which performs standard ERM with uniform weights, and (2) the ARL method [2].

Implementation Details: We use Optuna [27] to tune the hyperparameters. The target predictor is pre-trained for a binary classification task, and the initial class-conditional hypothesis means and covariances are all set to the same values.

6.2.2 Datasets and Evaluation Metrics

6.2.2.1 Datasets

We demonstrate the effectiveness of our proposed MaxEnt approach through experiments on two real-world datasets widely used in the fairness literature: (i) the UCI Adult dataset [7] (income prediction) and (ii) COMPAS [6] (recidivism prediction).

1. COMPAS dataset [6]: COMPAS is a landmark dataset for studying algorithmic (un)fairness. The data was used to predict recidivism (whether a criminal will reoffend) in the USA. The tool was meant to overcome human biases and offer an algorithmic, fair solution to predict recidivism in a diverse population. However, the algorithm ended up propagating existing social biases and thus offered an unfair algorithmic solution to the problem. In this dataset, a model to predict recidivism has already been fit, and the predicted probabilities and predicted status (yes/no) for recidivism have been concatenated to the original data. It has 11 features and 7,215 data samples. The protected features are race and sex; the target label is recidivism.

2. UCI Adult dataset [7]: The Adult dataset is from the Census Bureau, and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education and hours of work per week. It has 14 features and 48,842 data samples. The protected features are race and sex; the target label is income.

6.2.2.2 Evaluation Metrics

Before applying the evaluation metrics, we first stratify the protected groups for evaluation, using pairwise protected attributes. For example, with sex and race as the sensitive attributes, we form {White, Black} × {Male, Female}, i.e., four protected groups: subgroup 1 (Black Male), subgroup 2 (Black Female), subgroup 3 (White Male), and subgroup 4 (White Female). We use AUC (area under the ROC curve) as our evaluation metric because it is robust to class imbalance, it reflects both FPR and FNR, and it is threshold agnostic. No sensitive information is used during training; it is used only in evaluation. More specifically, we use a set of AUCs: AUC, AUC(min), AUC(macro-average), and AUC(minority). AUC is computed over all data; AUC(min) is the minimum AUC over all protected groups; AUC(macro-average) is the macro-average, i.e., the unweighted mean, of the per-group AUCs; AUC(minority) is the AUC of the smallest protected group in the dataset. Although AUC(minority) is one of our metrics, it does not fully measure fairness, as a higher AUC(minority) does not necessarily indicate better fairness; instead, the gap between AUC(minority) and AUC/AUC(macro-average) is the more informative fairness signal. For example, when $s \in S$ denotes the rich group and the poor group, the rich group will typically be the minority group, yet it is not the disadvantaged group. We still report this metric for consistency with our baseline paper [2].
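A sketch of this evaluation protocol is given below: stratify by the pairwise protected attributes and compute the AUC set reported in Table 6.1. All inputs are hypothetical placeholders, and the sensitive attributes enter only here, at evaluation time.

```python
# A sketch of the evaluation protocol: stratify by pairwise protected
# attributes (e.g. race x sex -> four subgroups) and report the AUC set
# used in Table 6.1. Assumes every subgroup contains both outcome labels.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_report(y_true, y_score, race, sex):
    groups = np.array([f"{r}-{s}" for r, s in zip(race, sex)])
    names, counts = np.unique(groups, return_counts=True)
    per_group = {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
                 for g in names}
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "AUC(macro-avg)": float(np.mean(list(per_group.values()))),
        "AUC(min)": float(min(per_group.values())),
        "AUC(minority)": per_group[names[np.argmin(counts)]],
    }
```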
6.2.3 Results

Table 6.1 reports results based on average performance across runs, with the best average performance highlighted in bold.

Table 6.1: Main results: MaxEnt vs ARL

  Dataset     Method     AUC avg   AUC macro-avg   AUC min   AUC minority
  COMPAS      Baseline   0.748     0.730           0.674     0.774
  COMPAS      ARL        0.743     0.727           0.658     0.785
  COMPAS      MaxEnt     0.751     0.748           0.706     0.730
  UCI Adult   Baseline   0.898     0.891           0.867     0.875
  UCI Adult   ARL        0.907     0.915           0.881     0.942
  UCI Adult   MaxEnt     0.919     0.923           0.906     0.943

According to Table 6.1, we make the following key observations.

MaxEnt improves worst-case performance: On both the COMPAS and UCI Adult datasets, our proposed MaxEnt method outperforms the SOTA ARL method and achieves the best AUC(min). Although MaxEnt has a lower AUC(minority) than ARL on COMPAS, this does not diminish our method, since AUC(minority) does not necessarily measure fairness and is highly dependent on the dataset configuration and properties. AUC(minority) is the AUC of the smallest protected group, and in many cases the smallest group is not the disadvantaged group (e.g., a small rich group versus a large poor group), so the other AUC metrics are more convincing measures of fairness.

MaxEnt narrows the gaps: For both the COMPAS and UCI Adult datasets, the gaps between the AUC metrics are smaller for MaxEnt than for the SOTA ARL method, which shows that our method mitigates the differences between protected groups and achieves better fairness.

Utility-Fairness Trade-off: Furthermore, we observe that Min-Diff incurs a drop in overall AUC on both the UCI Adult and COMPAS datasets, whereas MaxEnt in fact improves the overall AUCs. This shows that the MaxEnt method achieves a better Pareto allocation of overall and subgroup AUC performance, which is expected, since the goal of MaxEnt is to realize the best trade-off between utility and fairness.

CHAPTER 7 CONCLUDING REMARKS

The main contribution of this thesis is an innovative Maximum Entropy method for representation learning that mitigates sensitive information leakage from learned representations without prior demographic knowledge: 1) fair representations can be learned without sensitive demographic information using the Maximum Entropy method, under controlled correlations between target attributes and sensitive attributes; 2) a good trade-off between utility and fairness can be reached using the proposed Maximum Entropy method; and 3) the mathematical proofs verify the feasibility of the Maximum Entropy method. The main limitations are that: 1) more investigation and theoretical proofs are needed for complex sensitive-target attribute correlations; 2) more study of continuous sensitive/target attributes can be conducted; and 3) non-parametric methods for representation learning might be employed instead of limiting the representation distribution to Gaussians.

BIBLIOGRAPHY

[1] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang, “Fairness without demographics in repeated loss minimization,” in International Conference on Machine Learning, pp. 1929–1938, PMLR, 2018.

[2] P. Lahoti, A. Beutel, J. Chen, K. Lee, F. Prost, N. Thain, X. Wang, and E. H. Chi, “Fairness without demographics through adversarially reweighted learning,” arXiv preprint arXiv:2006.13114, 2020.

[3] N. L. Martinez, M. A. Bertran, A. Papadaki, M. Rodrigues, and G. Sapiro, “Blind pareto fairness and subgroup robustness,” in International Conference on Machine Learning, pp. 7492–7501, PMLR, 2021.
[4] Y. Zhang, “Assessing fair lending risks using race/ethnicity proxies,” Management Science, vol. 64, no. 1, pp. 178–197, 2018.

[5] H. Bahng, S. Chun, S. Yun, J. Choo, and S. J. Oh, “Learning de-biased representations with biased representations,” in Proceedings of the 37th International Conference on Machine Learning (H. D. III and A. Singh, eds.), vol. 119 of Proceedings of Machine Learning Research, pp. 528–539, PMLR, 13–18 Jul 2020.

[6] M. Barenstein, “Propublica’s COMPAS data revisited,” 2019.

[7] D. Dua and C. Graff, “UCI machine learning repository,” 2017.

[8] M. N. Elliott, A. Fremont, P. A. Morrison, P. Pantoja, and N. Lurie, “A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity,” Health Services Research, vol. 43, no. 5p1, pp. 1722–1736, 2008.

[9] M. Gupta, A. Cotter, M. M. Fard, and S. Wang, “Proxy fairness,” arXiv preprint arXiv:1806.11212, 2018.

[10] J. Chen, N. Kallus, X. Mao, G. Svacha, and M. Udell, “Fairness under unawareness,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, Jan 2019.

[11] N. Kallus, X. Mao, and A. Zhou, “Assessing algorithmic fairness with unobserved protected class using data combination,” Management Science, 2021.

[12] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

[13] P. Lahoti, K. P. Gummadi, and G. Weikum, “iFair: Learning individually fair data representations for algorithmic decision making,” in 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1334–1345, IEEE, 2019.

[14] P. Lahoti, K. P. Gummadi, and G. Weikum, “Operationalizing individual fairness with pairwise fair representations,” arXiv preprint arXiv:1907.01439, 2019.

[15] H. Wang, N. Grgic-Hlaca, P. Lahoti, K. P. Gummadi, and A. Weller, “An empirical study on learning fairness metrics for COMPAS data with human supervision,” arXiv preprint arXiv:1910.10255, 2019.

[16] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” in International Conference on Machine Learning (ICML), 2013.

[17] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Considerations on fairness-aware data mining,” in 2012 IEEE 12th International Conference on Data Mining Workshops, pp. 378–385, IEEE, 2012.

[18] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2016.

[19] T. Kamishima, S. Akaho, and J. Sakuma, “Fairness-aware learning through regularization approach,” in 2011 IEEE 11th International Conference on Data Mining Workshops, pp. 643–650, IEEE, 2011.

[20] J. Rawls, Justice as Fairness: A Restatement. Harvard University Press, 2001.

[21] K. Conrad, “Probability distributions and maximum entropy,” 2005.

[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.

[23] S. Z. Li and A. Jain, eds., LDA (Linear Discriminant Analysis), pp. 899–899. Boston, MA: Springer US, 2009.

[24] T. Riedel and P. K. Sahoo, Mean Value Theorems and Functional Equations. Singapore: World Scientific Publishing, Oct. 1998.

[25] J. M. Joyce, Kullback-Leibler Divergence, pp. 720–722. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011.
[26] M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, “On entropy approximation for Gaussian mixture random vectors,” in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 181–188, 2008.

[27] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” 2019.