INVARIANT REPRESENTATION LEARNING VIA FUNCTIONS IN REPRODUCING KERNEL HILBERT SPACES

By

Bashir Sadeghi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2023

ABSTRACT

Many applications of representation learning, such as privacy preservation and algorithmic fairness, call for explicit control over some unwanted information being discarded. This goal is formulated as satisfying two objectives: maximizing utility for predicting a target attribute while simultaneously being invariant to (independent of) a known sensitive attribute, such as gender or race. Solutions to invariant representation learning (IRepL) problems lead to a trade-off between utility and invariance when the two objectives compete. Most existing works are empirical and implicitly look for single or multiple points on the utility-invariance trade-off; they do not explicitly seek to characterize the entire trade-off front optimally, nor do they provide invariance and convergence guarantees.

In this thesis, we address the shortcomings mentioned above by starting from simple linear models and building upon them. As a first step, we derive a closed-form solution for the global optima of the underlying linear IRepL optimization problem. In a further development, we consider neural network-based encoders, where we model the utility of the target task and the invariance to the sensitive attribute via kernelized ridge regressors. This setting leads to a stable iterative optimization scheme toward global/local optima. However, such a setting cannot guarantee universal invariance. This drawback motivated us to further study the case where the invariance measure is modeled universally via functions in some reproducing kernel Hilbert spaces (RKHS)s.
By modeling the encoder and target networks via functions in some RKHSs as well, we derive a closed formula for a near-optimal trade-off, the corresponding optimal representation dimensionality, and the associated encoder(s). Our findings have an immediate application to fairness in terms of demographic parity.

To my wife: Katharina
To my parents: Farajollah and Farrokh
To my sister’s family: Atekeh, Karen, and Hanzaleh
To my brother’s family: Farhad and Fatemeh

ACKNOWLEDGMENTS

I am thankful to have Dr. Vishnu Naresh Boddeti as my PhD advisor. His attention, expectations, and encouragement helped me learn more than I could have imagined. With his meticulous attention to detail and critical assessment, and by setting himself as an example, Dr. Boddeti helped me decide what kind of researcher I want to be: a solid researcher who formulates scientific problems in a principled way, who observes carefully, who designs and evaluates experiments efficiently and critically, and who presents the findings understandably and intuitively.

I am honored to have Dr. Arun Ross, Dr. Sijia Liu, and Dr. Hamidreza Modares on my thesis committee. I am grateful that I could attend Dr. Sijia Liu’s ‘Adversarial Machine Learning’ course; his deep insight into the topic made the course enjoyable for me. I am also very thankful to Dr. Matthew Hirn for his interesting ‘Mathematics of Deep Learning’ course, which helped fill some gaps between the application and theory of deep learning for me.

I want to thank Gautam Sreekumar for his generous support and encouragement during the challenging times of my PhD. Likewise, I am grateful for the encouragement I received from my other labmates: Hamed Bolandi, Lan Wang, Xiaoxue Wang, Shihua Huang, Rahul Dey, Sepehr Dehdashtian, and Wei Ao. Their valuable comments during group meetings and research discussions helped me improve on what I could do alone. Last but not least, I am very fortunate and grateful that Dr.
Runyi Yu was my MSc advisor. He set himself as an example of a principled scientist who approaches research problems fundamentally. Without any exaggeration, I could not see myself pursuing a PhD without his encouragement and support during my MSc studies.

TABLE OF CONTENTS

LIST OF ABBREVIATIONS vii

Chapter 1 Introduction To Invariant Representation Learning 1
1.1 Introduction 1
1.2 Prior Work 3
1.2.1 Adversarial Representation Learning 3
1.2.2 Invariant Representation Learning 4
1.2.3 Trade-Offs in Invariant Representation Learning 5
1.2.4 Optimization Theory for Adversarial Learning 6
1.2.5 Dimensionality Reduction 7
1.2.6 Dependence Measure 8
1.3 Overview of the Thesis 8
1.4 Contributions of the Thesis 10

Chapter 2 Mathematical Background and Preliminaries 11
2.1 Notations and Definitions 11
2.2 Preliminaries 14
2.2.1 Kernelization 14
2.2.2 Kernelized Dependence Measures 14

Chapter 3 Adversarial Representation Learning Under Linear Invariance 17
3.1 Introduction
17
3.2 Adversarial Representation Learning 20
3.2.1 Problem Setting 20
3.2.2 The Linear Case 21
3.3 Empirical Solution for Linear Encoder 29
3.3.1 Non-Linear Extension Through Kernelization 31
3.4 Analytical Bounds 33
3.5 Computational Complexity 36
3.6 Numerical Experiments 36
3.6.1 Mixture of Four Gaussians 37
3.6.2 Fair Classification 39
3.6.3 Illumination Invariant Face Classification 40
3.6.4 CIFAR-100 42
3.7 Summary 44

Chapter 4 Adversarial Representation Learning With Closed-Form Solvers 46
4.1 Introduction 46
4.2 Problem Setting 48
4.2.1 Motivating Exact Solvers 49
4.3 Exact Adversary and Target Predictor Solvers 51
4.3.1 Closed-Form Adversary and Target Predictor 51
4.3.2 Optimal Embedding Dimensionality 54
4.3.3 Gradient of Closed-Form Solution 58
4.4 Experiments
59
4.4.1 Fair Classification 60
4.4.2 Mitigating Sensitive Information Leakage 62
4.4.3 Ablation Study on Mixture of Four Gaussians 64
4.5 Summary 66

Chapter 5 Universal Invariant Representation Learning 68
5.1 Introduction 68
5.1.1 Adversarial Representation Learning 71
5.1.2 Trade-Offs in Invariant Representation Learning 71
5.2 Deficiency of Mean-Squared Error as A Measure of Dependence 73
5.3 Problem Setting 75
5.3.1 Problem Setup 75
5.4 Choice of Dependence Measure 76
5.5 Exact Kernelized Trade-Off 78
5.5.1 Numerical Complexity 86
5.5.2 Target Task Performance in K-TOpt 87
5.6 Experiments 90
5.6.1 Baselines 90
5.6.2 Datasets 90
5.6.3 Evaluation Metrics 92
5.6.4 Choice of (Y, S) Pair 93
5.6.5 Implementation Details 93
5.6.6 Results
95
5.6.7 Ablation Study 98
5.7 Summary 100

Chapter 6 Conclusion and Future Work 101
6.1 Limitations 101
6.2 Future Work 102
6.3 Broader Impact 102

BIBLIOGRAPHY 103
APPENDIX 117

LIST OF ABBREVIATIONS

Acronyms / Abbreviations
ARL  Adversarial Representation Learning
COCO  Constrained Covariance
DNN  Deep Neural Network
GAN  Generative Adversarial Network
HSIC  Hilbert-Schmidt Independence Criterion
IRepL  Invariant Representation Learning
iff  If and Only If
KCC  Kernelized Canonical Correlation
MLP  Multi-Layer Perceptron
MSE  Mean Square Error
NN  Neural Network
PCA  Principal Component Analysis
RFF  Random Fourier Features
RKHS  Reproducing Kernel Hilbert Space
RV  Random Variable
SGD  Stochastic Gradient Descent
SGDA  Stochastic Gradient Descent Ascent
w.r.t.  With Respect To

Chapter 1 Introduction To Invariant Representation Learning

1.1 Introduction

Real-world applications of representation learning often have to contend with objectives beyond predictive performance. These include cost functions pertaining to invariance (e.g., to photometric or geometric variations), semantic independence (e.g., to age or race for face recognition systems), privacy (e.g., mitigating leakage of sensitive information [1]), algorithmic fairness (e.g., demographic parity [2]), and generalization across multiple domains [3], to name a few.
At its core, the goal of the aforementioned formulations of representation learning is to satisfy two competing objectives: extracting as much information as necessary to predict a target label Y (e.g., face identity), while intentionally and permanently suppressing information about a given sensitive attribute S (e.g., age or gender). See Figure 1.1 for an illustration. An encoder f produces a representation Z = f(X) from the input data X. A target predictor gY operates on the representation Z to predict the target attribute Y. A parametric or non-parametric dependence measure Dep(Z, S) measures the statistical dependency of the representation Z on the sensitive attribute S. For example, Dep(Z, S) can be measured by the loss of a hypothetical adversary that aims to predict the sensitive attribute S.

Figure 1.1: An encoder f in the form of a Borel function produces a representation Z = f(X) from the input data X. A target predictor gY, in the form of a Borel function, operates on the representation Z to predict the target attribute Y. A parametric or non-parametric dependence measure Dep(Z, S) quantifies the statistical dependency of the representation Z on the sensitive attribute S. Invariant representation learning seeks a representation Z = f(X) that contains as much information as necessary for the downstream target predictor gY while being independent of the sensitive attribute S.

Although randomized encoders and target predictors can also be considered, in this dissertation we assume that both f and gY are deterministic Borel functions. When the statistical dependency between Y and S is not negligible, learning a representation Z that is invariant to the sensitive attribute S (i.e., Z ⊥⊥ S) will necessarily degrade the performance of the target prediction; that is, there exists a trade-off between utility and invariance. The primary application of invariant representation learning (IRepL) is invariant prediction.
This is because if Z is independent of S, then the target prediction Ŷ = gY(Z) is independent of S, regardless of the target predictor gY. As a result, to be robust to the choice of target predictor [4], it is preferable to deploy IRepL for invariant prediction rather than enforcing invariance on Ŷ directly.

The existence of a trade-off between utility and invariance has been well established, both theoretically and empirically, under various contexts of representation learning such as fairness [5, 6, 7, 8], invariance [9], and domain adaptation [10]. However, the central question of IRepL remains challenging: what learning algorithm achieves any given point on the utility-invariance trade-off, optimally or via local optima, and how can it be estimated from training data? The vast majority of existing works are empirical in nature. They implicitly look for single or multiple points on the trade-off between utility and invariance to the sensitive information and do not explicitly seek to characterize the entire trade-off front optimally.

Figure 1.2: Adversarial representation learning consists of three entities: an encoder f that obtains a compact representation Z of the input data X, a predictor gY that predicts a desired target attribute Y, and an adversary gS that seeks to extract a sensitive attribute S, both from the embedding Z.

This dissertation aims to address the mentioned shortcoming of existing IRepL approaches by employing functions in some reproducing kernel Hilbert spaces (RKHS)s to model the target predictor gY and the dependence measure Dep(Z, S). For the case where the encoder f is also modeled via functions in some RKHSs, we are able to find a closed-form solution for the optimal encoder. For encoders modeled by neural networks (NN)s, we are able to provide some stability for the underlying iterative optimization problem.
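The claim that invariance of Z propagates to any downstream prediction (a deterministic function of an independent variable stays independent, while leakage in Z survives any gY) can be illustrated with a small simulation. The distributions and the predictor g below are illustrative choices, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# A binary sensitive attribute S and two candidate representations:
# Z_inv is drawn independently of S, while Z_dep leaks S.
S = rng.binomial(1, 0.5, size=n)
Z_inv = rng.normal(size=n)
Z_dep = rng.normal(size=n) + 2.0 * S

def g(z):
    # An arbitrary non-linear downstream predictor.
    return np.tanh(3.0 * z) + 0.1 * z**2

# Sample correlation between the prediction g(Z) and S:
# near zero for the invariant Z, large for the leaky Z.
corr_inv = abs(np.corrcoef(g(Z_inv), S)[0, 1])
corr_dep = abs(np.corrcoef(g(Z_dep), S)[0, 1])
print(corr_inv, corr_dep)
```

Whatever g is chosen, the invariant representation yields predictions uncorrelated with S, which is exactly why enforcing Z ⊥⊥ S is more robust than constraining Ŷ for one particular predictor.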
1.2 Prior Work

1.2.1 Adversarial Representation Learning

Most practical approaches for learning fair, invariant, domain-adaptive, or privacy-preserving representations discussed above are based on adversarial representation learning (ARL). At the core of ARL is the idea of modeling Dep(Z, S) via a proxy adversary that seeks to extract the sensitive attribute S. See Figure 1.2 for an illustration. In the context of image classification, adversarial learning has been utilized to obtain representations that are invariant across domains [3, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]. Such representations allow classifiers that are trained on a source domain to generalize to a different target domain. In the context of learning fair and unbiased representations, a number of approaches [1, 2, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] have used and argued for explicit adversarial networks to extract sensitive attributes from the encoded data. All these methods are usually set up as a minimax game between the encoder, a target task, and a proxy adversary. The encoder achieves invariance by maximizing the loss of the proxy adversary, i.e., minimizing the negative log-likelihood or mean square error (MSE) of the sensitive variables as measured by the proxy adversary. Roy et al. [1] identify and address the instability of the optimization in the zero-sum minimax formulation of ARL and propose an alternate non-zero-sum solution, demonstrating improved empirical performance. All the above approaches use deep neural networks (DNN)s to represent the ARL entities, optimize their parameters through stochastic gradient descent ascent (SGDA), and rely on empirical validation. However, none of them seek to study the nature of the ARL formulation itself, i.e., to decouple the role of the expressiveness of the models from the convergence/stability properties of the optimization tools used to learn the parameters of the corresponding models.
This shortcoming motivates us to take some steps toward filling this gap by studying simpler forms of ARL from an optimization perspective in Chapter 3 and building upon it in Chapters 4 and 5.

1.2.2 Invariant Representation Learning

The basic idea of representation learning that discards unwanted semantic information has been explored under many contexts like invariant, fair, or privacy-preserving learning. The concept of learning fair representations was first introduced by Zemel et al. [36]. The goal was to learn a representation of the data by “fair clustering” while maintaining the discriminative features of the prediction task. Building upon this work, many techniques have been proposed to learn an unbiased representation of data while retaining its effectiveness for a prediction task. These include the Variational Fair Autoencoder [37] and the more recent information bottleneck-based objective by Moyer et al. [38]. In domain adaptation [11, 12, 39], the goal is to learn features that are independent of the data domain. In fair learning [2, 21, 40, 36, 41, 42, 43, 23, 24, 22, 44, 26, 45, 46, 47, 48, 49, 50], the goal is to discard the demographic information that leads to unfair outcomes. Similarly, there is growing interest in mitigating unintended leakage of private information from representations [51, 52, 53, 54, 55, 45, 56, 57, 58, 59]. The vast majority of this body of work is empirical in nature: it implicitly looks for single or multiple points on the trade-off between utility and sensitive information and does not explicitly seek to characterize the entire trade-off front. Overall, these approaches are not concerned with, or aware of, the inherent utility-invariance trade-off. In contrast, using functions in some RKHSs, we near-optimally characterize the trade-off and propose a practical learning algorithm that achieves it in Chapter 5.
1.2.3 Trade-Offs in Invariant Representation Learning

Prior work has established the existence of trade-offs in IRepL, both empirically and theoretically. In the following, we categorize this work based on the properties of interest.

Restricted Class of Attributes: A majority of existing work considers IRepL trade-offs under restricted settings, i.e., binary and/or categorical attributes Y and S. For instance, [60] uses information-theoretic tools and characterizes the utility-fairness trade-off in terms of lower bounds when both Y and S are binary labels. Later, [55] provided both upper and lower bounds for binary labels. By leveraging the Chernoff bound, [61] proposed a construction method to generate an ideal representation, beyond the input data, that achieves perfect fairness while maintaining the best performance on the target task. In the case of categorical features, a lower bound on the utility-fairness trade-off has been provided by [6] for the total-invariance scenario (i.e., Z ⊥⊥ S). In contrast to this body of work, our trade-off analysis applies to multidimensional continuous/discrete attributes. To the best of our knowledge, the only prior work with a comparably general setting is [9]. However, in [9], both S and Y are restricted to be continuous/discrete or binary at the same time (e.g., it is not possible to have a binary Y while S is continuous).

Characterizing Exact versus Bounds on the Trade-Off: To the best of our knowledge, all existing approaches characterize the trade-off in terms of upper and/or lower bounds. In contrast, we exactly characterize a near-optimal trade-off with closed-form expressions for encoders belonging to some RKHSs.

Optimal Encoder and Representation: Another property of practical interest is the optimal encoder that achieves the desired point on the utility-invariance trade-off, together with the corresponding representation(s). Existing works that only study bounds on the trade-off do not obtain the encoder that achieves those bounds.
The Hilbert-Schmidt independence criterion (HSIC), a universal measure of dependence, has been adopted by prior work (e.g., [62]) to quantify all types of dependencies between Z and S. However, these methods adopt stochastic gradient descent (SGD) for the underlying non-convex optimization problem. As such, they fail to guarantee that the representation learning problem converges to a global optimum. In contrast, in Chapter 5 we obtain a closed-form solution for the optimal encoder and its corresponding representation while detecting all modes of dependence between Z and S.

1.2.4 Optimization Theory for Adversarial Learning

A growing class of learning algorithms, including ARL, generative adversarial networks (GAN)s, etc., involve more than one objective and are trained via games played by cooperating or dueling NNs. An overview of the challenges presented by such algorithms, and a plausible solution in general n-player games, can be found in [63]. In the context of two-player minimax games such as GANs, several solutions [64, 65, 66, 67, 68, 69, 70, 71] have been proposed to improve the optimization dynamics, many of them relying on the idea of taking an extrapolation step [72]. The non-convex nature of the ARL formulation poses unique challenges from an optimization perspective. Practically, the parameters of the models in ARL are optimized through SGD, either jointly [21, 64] or alternately [11], with the former being a generalization of gradient descent known as SGDA. While the convergence properties of gradient descent and its variants are well understood, there is relatively little work on the convergence and stability of SGDA in adversarial minimax problems. Recently, Mescheder et al. [64] and Nagarajan et al. [65] both leveraged tools from non-linear systems theory [73] to analyze the convergence properties of SGDA, in the context of GANs, around a given equilibrium.
They show that, without the introduction of additional regularization terms to the objective of the zero-sum game, SGDA does not converge. However, their analysis is restricted to the two-player GAN setting and is not concerned with its global/local optima. In contrast, by using kernelized ridge regressors for the target and proxy adversary networks, we are able to optimize these networks exactly for any given representation Z = f(X), which turns the unstable SGDA optimization scheme into the stable SGD scheme.

1.2.5 Dimensionality Reduction

The technical machinery of Chapters 3 and 5 of this dissertation is closely related to principal component analysis (PCA) [74] and its kernelized version [75] for dimensionality reduction. PCA can provide a compact disentangled representation (i.e., one in which different elements of the representation vector are uncorrelated with each other) of the input data, which is efficient for downstream classification, regression, and clustering tasks. In particular, we deploy supervised PCA [76] in this thesis. Kernel methods have also been used previously for fair dimensionality reduction by [77], where the Rayleigh quotient is employed to search for only a single point on the utility-invariance trade-off. In contrast to this work, our approaches in Chapters 3 and 5 aim to characterize the entire trade-off front.

1.2.6 Dependence Measure

In 1959, Rényi introduced dependence measures as quantifiers of the statistical dependence between two random variables (RV)s Z and S by a non-negative value, where zero indicates that Z and S are independent and larger values indicate greater degrees of dependence [78]. One such dependence measure can be defined as the maximum Pearson correlation (a.k.a. correlation coefficient) between α(Z) and β(S) over all Borel functions α and β [78]. Such a measure is not computationally tractable if Z and/or S are continuous [76].
To circumvent this difficulty, [79] demonstrated that any universal RKHS is a sufficient search space for α and β to detect all modes of dependence between Z and S (by ‘all modes of dependence’, we mean all types of linear or non-linear relations, in contrast to only linear or monotonic relations). Later, [80] employed the maximum of the covariance as a measure of independence for α and β belonging to unit balls in some universal RKHSs. Further, [81] proposed HSIC, where they demonstrated that considering the covariance for only the elements of any basis set of the involved RKHSs is sufficient for a universal dependence measure.

1.3 Overview of the Thesis

In the second chapter, we introduce the notations, definitions, mathematical background, and machinery required for this dissertation. The mathematical background of this dissertation includes linear algebra, probability theory, and functional analysis. In particular, we sometimes deploy functions in some RKHSs to model the encoder f and/or the target predictor gY. Further, we sometimes model the dependence measure between the representation Z and the sensitive attribute S via kernel measures of independence.

In the third chapter, we study the simplest ARL setting, where all three players, 1) the encoder f, 2) the target predictor gY, and 3) the proxy adversary gS, are modeled linearly. Under this scenario, we obtain a closed-form solution (both empirically and in population) for the optimal encoder in terms of the eigenvectors of the projection of the input data onto the space that is as close as possible to the target label space while lying in the least explanatory space for the sensitive attribute. We then generalize our formulation and closed-form solutions to encoders in RKHSs while the target and proxy adversary networks remain linear. Moreover, we theoretically obtain the optimal embedding dimensionality (i.e., dim(Z)) as a function of the user-defined trade-off parameter.
In the fourth chapter, we let the encoder be a DNN and aim to circumvent the instability and lack of convergence guarantees induced by SGDA optimization in ARL by modeling the adversary and target networks as kernelized ridge regressors. This, in turn, yields a closed-form solution for the optimal adversary and target predictors for any given representation Z = f(X). Therefore, the SGDA optimization strategy reduces to simple SGD for learning the encoder parameters. Moreover, we theoretically obtain an upper bound on the optimal embedding dimensionality.

In the fifth chapter, motivated by the fact that a proxy adversary loss may not account for all modes of dependence, we model the invariance measure via a near-universal dependence measure rather than a proxy adversary loss. Modeling the target loss in the same way, we are able to find a closed-form solution (both empirically and in population) for encoders in RKHSs. The closed-form solution also leads to the determination of the optimal embedding dimensionality. The utility-invariance trade-off induced by the optimal encoder can be interpreted as an inherent trade-off arising from the triplet of the input data X, the target label Y, and the sensitive attribute S.

We conclude the thesis in the sixth chapter, where we discuss limitations, future work, and the broader impact of this thesis.

1.4 Contributions of the Thesis

• We obtain a closed-form solution for the global optima of ARL with a linear/kernelized encoder and linear target and proxy adversary networks under the MSE loss in Chapter 3. This closed-form solution can be interpreted as a generalization of supervised PCA (and its kernelized version) to the case where there is a sensitive attribute to discard. Consequently, we are able to obtain the exact optimal embedding dimensionality as a function of the user-defined utility-invariance trade-off parameter under this scenario.
• We deploy kernelized ridge regressors for modeling the proxy adversary and target networks, while the encoder can be a DNN, in Chapter 4. In turn, the unstable SGDA optimization involved in ARL reduces to SGD, which is a stable optimization scheme. Furthermore, we obtain an upper bound on the optimal embedding dimensionality.

• We introduce a simplified version of HSIC to measure the dependence between the representation Z and the sensitive attribute S for encoders in RKHSs in Chapter 5. We demonstrate that our proposed dependence measure is near-universal and lends itself to a closed-form solution for the IRepL problem, where the optimal embedding dimensionality can be obtained precisely as a function of the trade-off parameter. The resulting utility-invariance trade-off can be interpreted as a near-optimal trade-off induced by the triplet (X, Y, S).

Chapter 2 Mathematical Background and Preliminaries

2.1 Notations and Definitions

Scalars are denoted by regular lowercase letters, e.g., r, λ. Deterministic vectors are denoted by boldface lowercase letters, e.g., x, s. The L2-norm of a vector x is denoted by ‖x‖, and the inner product between vectors x and s of the same size is denoted by ⟨x, s⟩. We denote n-tuple vectors of ones and zeros by 1_n and 0_n, respectively. Finite or infinite sets are denoted by calligraphic letters, e.g., H, A. The indicator function is denoted by 1_B(·), where

\[ \mathbb{1}_B(m) = \begin{cases} 1 & \text{if } m \in B \\ 0 & \text{if } m \notin B \end{cases} \tag{2.1} \]

We denote deterministic matrices by boldface uppercase letters, e.g., M, K. The element at the i-th row and j-th column of a matrix M is denoted by (M)_{ij} or m_{ij}; its transpose is denoted by M^T; its inverse by M^{-1}; and its Moore-Penrose pseudo-inverse by M^†. Centering, i.e., mean subtraction with respect to (w.r.t.) columns, is denoted by “˜”, e.g., M̃, which can be obtained as

\[ \widetilde{M} = M H, \quad \text{where } M \in \mathbb{R}^{m \times n} \text{ and } H := I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T. \tag{2.2} \]

The subspace spanned by the columns of M is denoted by R(M) or simply M; the orthogonal complement of M is denoted by M⊥, and the null space of M is denoted by N(M). The orthogonal projection onto M is denoted by P_M and can be obtained as

\[ P_M = M \left( M^T M \right)^{\dagger} M^T. \tag{2.3} \]

We denote an n × n identity matrix by I_n or simply I. The trace of a square matrix K (i.e., the sum of its diagonal elements) is denoted by Tr[K]. The Frobenius norm of a matrix M is denoted by ‖M‖_F, which is related to the trace as

\[ \|M\|_F^2 = \operatorname{Tr}\left[ M M^T \right] = \operatorname{Tr}\left[ M^T M \right]. \]

We denote both scalar-valued and multidimensional random variables (RV)s by regular uppercase letters, e.g., X, S. The expectation of the RV X is denoted by E[X], and its covariance matrix is denoted by C_X, where

\[ C_X := \mathbb{E}\left[ (X - \mathbb{E}[X])(X - \mathbb{E}[X])^T \right]. \]

Similarly, the cross-covariance between the RVs X and S, denoted by C_{XS}, is defined as

\[ C_{XS} := \mathbb{E}\left[ (X - \mathbb{E}[X])(S - \mathbb{E}[S])^T \right]. \]

For a positive definite matrix C (denoted by C ≻ 0), the Cholesky factorization yields

\[ C \succ 0 \;\Rightarrow\; C = Q Q^T, \quad Q \text{ is full rank}. \tag{2.4} \]

If C is a positive semi-definite matrix (denoted by C ⪰ 0), then its incomplete Cholesky factorization is

\[ C \succeq 0 \;\Rightarrow\; C = L L^T, \quad L \text{ is full column-rank}. \tag{2.5} \]

Consider the probability space (Ω, F, P), where Ω is the sample space, F is a σ-algebra on Ω, and P is a probability measure on F. We assume that the joint RV (X, Y, S), containing the input data X ∈ R^{d_X}, the target label Y ∈ R^{d_Y}, and the sensitive attribute S ∈ R^{d_S}, is an RV on (Ω, F) with joint distribution p_{X,Y,S}. Furthermore, Y and S can also belong to any finite set, such as a categorical set. This setting enables us to work with classification and multidimensional regression tasks, where the sensitive attribute can be either a categorical or a multidimensional continuous/discrete RV. We let D := {(x_1, y_1, s_1), · · · , (x_n, y_n, s_n)} be the training data, containing n i.i.d. samples from the joint distribution p_{X,Y,S}.
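The centering and projection identities of Section 2.1 are straightforward to verify numerically. The following NumPy sketch (dimensions chosen arbitrarily for illustration) checks Eqs. (2.2) and (2.3) together with the Frobenius-trace identity:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 3))  # M in R^{m x n} with m = 5, n = 3
m, n = M.shape

# Centering (Eq. 2.2): M~ = M H with H = I_n - (1/n) 1_n 1_n^T.
H = np.eye(n) - np.ones((n, n)) / n
M_tilde = M @ H
assert np.allclose(M_tilde.sum(axis=1), 0.0)  # each row now has zero mean

# Orthogonal projection onto R(M) (Eq. 2.3): P_M = M (M^T M)^† M^T.
P = M @ np.linalg.pinv(M.T @ M) @ M.T
assert np.allclose(P @ P, P)   # idempotent
assert np.allclose(P, P.T)     # symmetric
assert np.allclose(P @ M, M)   # leaves R(M) fixed

# Frobenius norm identity: ||M||_F^2 = Tr[M M^T] = Tr[M^T M].
assert np.isclose(np.linalg.norm(M, 'fro')**2, np.trace(M @ M.T))
print("ok")
```

The pseudo-inverse in the projection matters precisely when M^T M is rank-deficient; for full column-rank M it coincides with the ordinary inverse.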
We also separately define the input, the label, and the sensitive data matrices, respectively, as follows:

X := [x_1, · · · , x_n] ∈ R^{d_X×n}, Y := [y_1, · · · , y_n] ∈ R^{d_Y×n}, S := [s_1, · · · , s_n] ∈ R^{d_S×n}.

2.2 Preliminaries

2.2.1 Kernelization

Let f ∈ H_X, where H_X is an RKHS of functions from R^{d_X} to R with kernel function k_X(·, ·). Invoking the representer theorem [82], it follows that

f(X) = Σ_{i=1}^n θ_i k_X(x_i, X) = θ^T [k_X(x_1, X), · · · , k_X(x_n, X)]^T,

where θ ∈ R^{n×1} and (θ)_i = θ_i. Moreover, let f(X) := [f_1(X), · · · , f_r(X)]^T. Similarly, we have

f(X) = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T, (2.6)

where Θ ∈ R^{r×n} and (Θ)_ji = θ_ji.

2.2.2 Kernelized Dependence Measures

In principle, two RVs Z and S are independent (denoted by Z ⊥⊥ S) if and only if (iff) [83]

Cov(α(Z), β(S)) := E[α(Z) β(S)] − E[α(Z)] E[β(S)] (2.7)

is zero for all Borel functions α : R^r → R and β : R^{d_S} → R belonging to the universal RKHSs H_Z and H_S, respectively. Note that universality ensures that these RKHSs can approximate any Borel function with arbitrary precision [84]. In the remainder of this thesis, we consider the following assumption unless otherwise stated.

Assumption 2.1. We assume that any RKHS H (from R^d to R) is universal and separable and that the corresponding kernel function k(·, ·) is bounded:

E[k(U, U)] < ∞ for any square-integrable d-dimensional U. (2.8)

Note that a Hilbert space is separable iff it has a countable orthonormal basis set, and U is square-integrable iff E[‖U‖²] < ∞. Now consider the following bi-linear functional:

h : H_Z × H_S → R, (α, β) ↦ Cov(α(Z), β(S)),

where H_Z and H_S are RKHSs. This bi-linear functional is bounded due to Assumption 2.1 [85]. Invoking the Riesz representation theorem [86], there exists a unique and bounded operator Σ_SZ : H_Z → H_S such that

Cov(α(Z), β(S)) = h(α, β) = ⟨Σ_SZ α, β⟩_{H_S} ∀ α ∈ H_Z, β ∈ H_S. (2.9)

Consequently, it follows that

Z ⊥⊥ S ⇐⇒ Σ_SZ = 0. (2.10)
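The criterion Z ⊥⊥ S ⇔ Σ_SZ = 0 is what the kernelized dependence measures discussed next estimate from samples. As an illustration only, the following sketch computes the standard biased empirical HSIC estimate Tr[K_Z H K_S H]/(n−1)² from the HSIC literature; the Gaussian kernel, its bandwidth, and the toy data are our illustrative choices, not part of the thesis's construction:

```python
import numpy as np

def rbf_kernel(U, sigma=1.0):
    # Gram matrix k(u_i, u_j) = exp(-||u_i - u_j||^2 / (2 sigma^2));
    # rows of U are samples.
    sq = np.sum(U**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * U @ U.T
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(Z, S, sigma=1.0):
    # Standard biased empirical HSIC estimate: Tr[K_Z H K_S H] / (n-1)^2.
    n = Z.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_kernel(Z, sigma) @ H @ rbf_kernel(S, sigma) @ H) / (n - 1) ** 2

rng = np.random.default_rng(2)
S = rng.standard_normal((200, 1))
Z_dep = S + 0.1 * rng.standard_normal((200, 1))   # strongly dependent on S
Z_ind = rng.standard_normal((200, 1))             # sampled independently of S

# Dependence registers as a (much) larger HSIC value.
assert hsic(Z_dep, S) > hsic(Z_ind, S)
# HSIC is symmetric in its arguments (cyclicity of the trace).
assert np.isclose(hsic(Z_dep, S), hsic(S, Z_dep))
```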
Notice that Σ_SZ = 0 iff the norm of Σ_SZ is zero for any valid norm, like the spectral norm:

‖Σ_SZ‖_Spectral := sup_{α∈H_Z} ‖Σ_SZ α‖_{H_S} / ‖α‖_{H_Z}, (2.11)

or the Hilbert-Schmidt norm:

‖Σ_SZ‖²_HS := Σ_{α_i∈U_Z, β_j∈U_S} ⟨Σ_SZ α_i, β_j⟩²_{H_S}, (2.12)

where U_Z and U_S are countable orthonormal basis sets for the separable universal RKHSs H_Z and H_S, respectively. These norms have been deployed in constrained covariance (COCO) [80] and HSIC, respectively, which are universal measures of dependence [81]. Moreover, kernelized canonical covariance (KCC), introduced by [79],

KCC(Z, S) := sup_{α∈H_Z, β∈H_S} Cov(α(Z), β(S)) / √(Var(α(Z)) Var(β(S))), (2.13)

has also been widely used as a universal measure of dependence.

Chapter 3
Adversarial Representation Learning Under Linear Invariance

3.1 Introduction

Adversarial representation learning is a promising framework for training image representation models that can control the information encapsulated within the representation. ARL is practically employed to learn representations for a variety of applications, including unsupervised domain adaptation of images [87], censoring sensitive information from images [21], learning fair and unbiased representations [37, 2], learning representations that are controllably invariant to sensitive attributes [24], and mitigating unintended information leakage [1], amongst others. At the core of the ARL formulation is the idea of jointly optimizing three entities: (i) an encoder f that seeks to distill the information from the input data X, retaining the information relevant to the target attribute Y while intentionally and permanently eliminating the information corresponding to the sensitive attribute S; (ii) a predictor g_Y that seeks to predict Y; and (iii) a proxy adversary g_S, playing the role of an unknown adversary, that seeks to extract the sensitive information S. Figure 3.1 shows a pictorial illustration of the ARL problem.
Typical instantiations of ARL represent these entities through non-linear functions in the form of NNs and formulate parameter learning as a minimax optimization problem.

Figure 3.1: Adversarial representation learning consists of three entities: an encoder f that obtains a compact representation Z of the input data X, a predictor g_Y that predicts a desired target attribute Y, and an adversary g_S that seeks to extract a sensitive attribute S, both from the embedding Z.

Practically, optimization is performed through SGDA, wherein small gradient steps are taken simultaneously in the parameter space of the encoder, target predictor, and proxy adversary. The solutions thus obtained have been effective in learning data representations with controlled invariance across applications such as image classification [1], multi-lingual machine translation [24], and domain adaptation [87]. Despite its practical promise, the aforementioned ARL setup suffers from a number of drawbacks:

– Representation learning under adversarial settings is challenging in its own right. The minimax formulation of the problem leads to an optimization problem that is non-convex in the parameter space, both due to the adversarial loss function and due to the non-linear nature of modern NNs. As we show in this chapter, even for simple instances of ARL where each entity is characterized by a linear function, the problem remains non-convex in the parameter space. Similar observations [65] have been made in a different but related context of adversarial learning in generative adversarial networks (GAN)s [88].

– The current paradigm of SGDA to solve the ARL problem provides no provable guarantees while suffering from instability and poor convergence [1, 2].
Again, similar observations [64, 65] have been made in the context of GANs, demonstrating the difficulty posed by the minimax formulation of the optimization and exposing the limitations of standard simultaneous optimization (i.e., SGDA).

– In applications of ARL related to fairness, accountability, and transparency of machine learning models, it is critically important to be able to provide performance bounds in addition to empirical evidence of their efficacy. A major shortcoming of existing works is the difficulty and lack of performance analysis and provable guarantees on unfairness or information leakage.

In this chapter, we take a step back and analytically study the simplest version of the ARL problem from an optimization perspective, with the goal of addressing the aforementioned limitations. Doing so enables us to delineate the contributions of the expressivity of the entities in ARL (i.e., shallow versus DNNs) and the challenges of optimizing the parameters (i.e., local optima through SGDA versus global optima). We first consider the "linear" form of ARL, where the encoder is a linear transformation, the target predictor is a linear regressor, and the proxy adversary is also a linear regressor. We show that this linear ARL leads to an optimization problem that is both non-convex and non-differentiable. Despite this fact, by reducing it to a set of trace problems on a Stiefel manifold, we obtain an exact closed-form solution for the global optima. As part of our solution, we also determine the optimal dimensionality of the embedding space. We then obtain analytical bounds (lower and upper) on the target and adversary objectives and prescribe a procedure to explicitly control the maximal leakage of sensitive information. Finally, we extend the linear-ARL formulation to allow non-linear functions in some RKHSs while still enjoying an exact closed-form solution for the global optima.
Numerical experiments on multiple datasets, both small and large scale, indicate that the global optima solutions for the linear and kernel formulations of ARL are competitive with, and sometimes even outperform, DNN-based ARL trained through SGDA. Practically, we also demonstrate the utility of linear-ARL and kernel-ARL for "imparting" provable invariance to any biased pre-trained data representation. We refer to our proposed algorithm for obtaining the global optima as spectral-ARL and abbreviate it as SARL.

3.2 Adversarial Representation Learning

3.2.1 Problem Setting

The adversarial representation learning problem is formulated with the goal of learning the parameters of an embedding function f(·; Θ) : X ↦ Z with two objectives: (i) aiding a target predictor g_Y(·; Θ_Y) to accurately predict the target attribute Y from Z, and (ii) preventing an adversary g_S(·; Θ_S) from inferring the sensitive attribute S from Z. The ARL problem can be formulated as

min_Θ min_{Θ_Y} E_{X,Y}[L_Y(g_Y(f(X; Θ); Θ_Y), Y)]
s.t. min_{Θ_S} E_{X,S}[L_S(g_S(f(X; Θ); Θ_S), S)] ≥ α, (3.1)

where L_Y and L_S are the loss functions for the target and the adversary predictors, respectively, and α ∈ [0, ∞) is a user-defined value that determines the minimum tolerable loss for the adversary on the sensitive attribute. For example, α = 0 corresponds to ignoring the adversary loss, resulting in standard representation learning, while α → ∞ corresponds to no tolerance for adversary performance. The minimization in the constraint is equivalent to the encoder operating against an optimal adversary. Existing instances of this problem adopt DNNs to represent f, g_Y, and g_S and learn their respective parameters {Θ, Θ_Y, Θ_S} through SGDA. See Figure 3.2 for an illustration.

Figure 3.2: ARL-SGDA: Illustration of training adversarial representation learning through stochastic gradient descent ascent.
i) At first, the target predictor parameters Θ_Y are updated while the encoder and adversary are frozen. ii) Then, the adversary parameters Θ_S are updated while the encoder and Θ_Y are frozen. iii) Finally, the encoder parameters Θ are updated while Θ_Y and Θ_S are frozen. SGDA does not provide any convergence guarantees.

3.2.2 The Linear Case

We first consider the simplest form of the ARL problem and analyze it from an optimization perspective. We model both the adversary and the target predictors as linear regressors,

Ŷ = Θ_Y Z + b_Y, Ŝ = Θ_S Z + b_S, (3.2)

where Z is an encoded version of X, and Ŷ and Ŝ are the predictions corresponding to the target and sensitive attributes, respectively. We also model the encoder through a linear mapping Θ ∈ R^{r×d_X}:

X ↦ Z = ΘX, (3.3)

where r is the dimensionality of the projected space. While existing NN-based solutions select r on an ad-hoc basis, our approach determines r as part of our solution to the ARL problem. See Figure 3.3 for an illustration.

Figure 3.3: Linear-ARL: Illustration of linear adversarial representation learning for learning a fair representation. An encoder f, in the form of a linear mapping, produces a new representation Z = ΘX. A target predictor g_Y and an adversary g_S, in the form of linear regressors, operate on the representation Z. We analytically analyze this ARL setup to obtain a closed-form solution for the globally optimal parameters of the encoder Θ. Provable bounds on the trade-off between utility and fairness of the representation are also derived.

For both the adversary and the target predictors, we adopt the MSE to assess the quality of their respective predictions, i.e.,

L_Y(Y, Ŷ) = E[‖Y − Ŷ‖²], L_S(S, Ŝ) = E[‖S − Ŝ‖²].

3.2.2.1 Optimization Problem

For any given encoder Θ, the following lemma gives the minimum MSE for a linear regressor in terms of covariance matrices and Θ.
The following lemma assumes that the RV X is zero-mean and that the covariance matrix C_X is positive definite. These assumptions are not restrictive, since we can always remove the mean and the dependent features from X.

Lemma 3.1. Let X and U be two RVs with E[X] = 0 and E[U] = b, where C_X ≻ 0. Consider a linear regressor Û = W Z + b, where W ∈ R^{d_U×r} is the parameter matrix and Z ∈ R^r is an encoded version of X for a given Θ: X ↦ Z = ΘX, Θ ∈ R^{r×d_X}. The minimum MSE that can be achieved by designing W is

min_W E[‖U − Û‖²] = Tr[C_U] − ‖P_M Q_X^{-T} C_XU‖²_F,

where M = Q_X Θ^T ∈ R^{d_X×r} and C_X = Q_X^T Q_X (Cholesky factorization).

Proof. See Appendix A.1.

Applying this lemma to the target and adversary regressors, we obtain their minimum MSEs as

J_Y(Θ) := min_{Θ_Y} L_Y(g_Y(f(X; Θ); Θ_Y), Y) = Tr[C_Y] − ‖P_M Q_X^{-T} C_XY‖²_F (3.4)

J_S(Θ) := min_{Θ_S} L_S(g_S(f(X; Θ); Θ_S), S) = Tr[C_S] − ‖P_M Q_X^{-T} C_XS‖²_F. (3.5)

Given the encoder Θ, J_Y(Θ) is related to the performance of the target predictor, whereas J_S(Θ) corresponds to the amount of sensitive information that an adversary is able to extract. Note that the linear models for g_Y and g_S enable us to obtain their respective optimal solutions for a given encoder Θ. On the other hand, when g_Y and g_S are modeled as NNs, doing the same is analytically infeasible and potentially impractical. The orthogonal projector P_M in Lemma 3.1 is a function of two factors: a data-dependent term Q_X and the encoder parameters Θ. While the former is fixed for a given dataset, the latter is our object of interest. Pursuantly, we decompose P_M in order to separably characterize the effect of these two factors. Let the columns of L_X ∈ R^{d_X×d_X} be an orthonormal basis for the column space of Q_X, and let G ∈ R^{d_X×r} be an arbitrary matrix such that L_X G := Q_X Θ^T. Due to the bijection

G = L_X^{-1} Q_X Θ^T ⇔ Θ = G^T L_X^T Q_X^{-T},

determining the encoder parameters Θ is equivalent to determining G.
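Lemma 3.1 can be verified numerically by comparing its closed form against the minimum MSE of a directly fitted linear regressor. In the sketch below, empirical covariances stand in for the population quantities, and all dimensions and the data-generating model are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
dX, dU, r, n = 4, 2, 2, 20000

# Zero-mean data with full-rank covariance; U depends linearly on X plus noise.
X = rng.standard_normal((dX, n))
A = rng.standard_normal((dU, dX))
U = A @ X + 0.5 * rng.standard_normal((dU, n))
Theta = rng.standard_normal((r, dX))           # an arbitrary fixed encoder

# Empirical covariances (used here in place of the population quantities).
C_X = X @ X.T / n
C_U = U @ U.T / n
C_XU = X @ U.T / n

# Direct route: for Z = Theta X, the optimal W = C_UZ C_Z^{-1} gives
# min MSE = Tr[C_U] - Tr[C_UZ C_Z^{-1} C_ZU].
C_Z = Theta @ C_X @ Theta.T
C_UZ = C_XU.T @ Theta.T
mse_direct = np.trace(C_U) - np.trace(C_UZ @ np.linalg.inv(C_Z) @ C_UZ.T)

# Lemma 3.1: with C_X = Q_X^T Q_X (Cholesky) and M = Q_X Theta^T,
# min MSE = Tr[C_U] - ||P_M Q_X^{-T} C_XU||_F^2.
Q_X = np.linalg.cholesky(C_X).T                # upper-triangular factor
M = Q_X @ Theta.T
P_M = M @ np.linalg.pinv(M.T @ M) @ M.T
mse_lemma = np.trace(C_U) - np.linalg.norm(P_M @ np.linalg.inv(Q_X.T) @ C_XU) ** 2

assert np.isclose(mse_direct, mse_lemma)
```

The two expressions agree because Q_X^{-1} M = Θ^T, so the projector route reduces algebraically to the direct regression formula.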
The projector P_M can now be expressed in terms of P_G, which depends only on the free parameter G:

P_M = M (M^T M)^† M^T = L_X P_G L_X^T, (3.6)

where we used the equality M = Q_X Θ^T and the fact that L_X^T L_X = I. Now, we turn back to the ARL setup and see how the above decomposition can be leveraged. The optimization problem in (3.1) reduces to

min_G J_Y(G) s.t. J_S(G) ≥ α, (3.7)

where the minimum MSE measures of (3.4) and (3.5) are now expressed in terms of G instead of Θ. Before solving this optimization problem, we will first interpret it geometrically. Consider a simple example where X is a white RV, i.e., C_X = I. Under this setting, Q_X = L_X = I and G = Θ. As a result, the optimization problem in (3.7) can alternatively be solved in terms of G = Θ, where J_Y(G) = Tr[C_Y] − ‖P_G C_XY‖²_F and J_S(G) = Tr[C_S] − ‖P_G C_XS‖²_F.

The constraint J_S(G) ≥ α implies ‖P_G C_XS‖²_F ≤ (Tr[C_S] − α), which is geometrically equivalent to the subspace G being outside (or tangent to) the cone around C_XS. Similarly, minimizing J_Y(G) implies maximizing ‖P_G C_XY‖²_F, which in turn is equivalent to minimizing the angle between the subspace G and the vector C_XY. Therefore, the global optima of (3.7) are any hyperplane G which is outside the cone around C_XS while subtending the smallest angle to C_XY. An illustration of this setting and its solution is shown in Figure 3.4 for d_X = 3, r = 2, and d_Y = d_S = 1. Constrained optimization problems such as (3.7) are commonly solved through their respective unconstrained scalarization formulations [89], as shown below:

min_{G∈R^{d_X×r}} {(1 − λ) J_Y(G) − λ J_S(G)} (3.8)

Figure 3.4: Geometric interpretation: An illustration of a three-dimensional input space X and one-dimensional target and adversary regressors. Therefore, both C_XS and C_XY are one-dimensional. We locate the y-axis in the same direction as C_XS.
The feasible space for the solution G = Θ imposed by the constraint J_S(Θ) ≥ α corresponds to the region outside the cone (specified by C_S and α) around C_XS. The non-convexity of the problem stems from the non-convexity of this feasible set. The objective min J_Y(Θ) corresponds to minimizing the angle between the line C_XY and the plane G. When C_XY is outside the cone, the line C_XY itself, or any plane that contains the line C_XY and does not intersect the cone, is a valid solution. When C_XY is inside the cone, the solution is either a line or, as we illustrate, a tangent hyperplane to the cone that is closest to C_XY. The non-differentiability stems from the fact that the solution can be either a plane or a line.

The scalarization in (3.8) is solved for some parameter 0 ≤ λ < 1. Such an approach affords two main advantages and one disadvantage: (a) A direct and closed-form solution can be obtained. (b) Framing (3.8) in terms of λ and (1 − λ) allows explicit control between the two extremes of no fairness (λ = 0) and only fairness (λ → 1). As a consequence, it can be shown that for every λ ∈ [0, 1), ∃ α ∈ [α_min, α_max] (see Appendix A.2 for a proof). (c) The converse, on the other hand, does not necessarily hold, i.e., for a given tolerable loss α, there may not be a corresponding λ ∈ [0, 1). This is the theoretical limit of solving a scalarized problem instead of the constrained problem. Before we obtain the solution to the scalarization formulation (3.8), we characterize the nature of the optimization problem in the following theorem.

Theorem 3.2. As a function of G ∈ R^{d_X×r}, the objective function in (3.8) is neither convex nor differentiable.

Proof. See Appendix A.3.

3.2.2.2 Learning

Despite the difficulty associated with the objective in (3.8), we derive a closed-form solution for its global optima. Our key insight lies in partitioning the search space R^{d_X×r} based on the rank of the matrix G (i.e., the number of independent rows or columns of G).
For a given rank i, let S_i be the set containing all matrices G of rank i:

S_i = {G ∈ R^{d_X×r} | rank(G) = i}, i = 0, 1, · · · , r.

Since ∪_{i=0}^r S_i = R^{d_X×r}, the optimization problem in (3.8) can be solved by considering r minimization problems, one for each possible rank of G:

min_{i∈{1,...,r}} { min_{G∈S_i} (1 − λ) J_Y(G) − λ J_S(G) } (3.9)

We observe from (3.4), (3.5), and (3.6) that the optimization problem in (3.8) depends only on a subspace G. As such, the solution G is not unique, since many different matrices can span the same subspace. Hence, it is sufficient to solve for any particular G that spans the optimal subspace G. Without loss of generality, we seek an orthonormal basis spanning the optimal subspace G as our desired solution. We constrain G ∈ R^{d_X×i} to be an orthonormal matrix, i.e., G^T G = I_i, where i is the dimensionality of G. Ignoring the constant terms in J_Y and J_S, for each i = 1, . . . , r, the minimization problem over S_i in (3.9) reduces to

min_{G^T G=I_i} J_λ(G), (3.10)

where

J_λ(G) := λ ‖L_X G G^T L_X^T Q_X^{-T} C_XS‖²_F − (1 − λ) ‖L_X G G^T L_X^T Q_X^{-T} C_XY‖²_F.

From basic properties of the trace, we have J_λ(G) = Tr[G^T B G], where B ∈ R^{d_X×d_X} is a symmetric matrix:

B = L_X^T Q_X^{-T} (λ C_SX^T C_SX − (1 − λ) C_YX^T C_YX) Q_X^{-1} L_X. (3.11)

The optimization problem in (3.10) is a trace minimization on a Stiefel manifold, which has closed-form solution(s) (see [90] and [91]). In view of the above discussion, the solution to the optimization problem in (3.8), or equivalently (3.9), can be stated in the next theorem.

Theorem 3.3. Assume that the number of negative eigenvalues of B in (3.11) is j. Denote γ = min{r, j}. Then, the minimum value in (3.9) is given as

β_1 + β_2 + · · · + β_γ, (3.12)

where β_1 ≤ β_2 ≤ . . . ≤ β_γ < 0 are the γ smallest eigenvalues of B, and the minimum can be attained by G = V, where the columns of V are the eigenvectors corresponding to all γ negative eigenvalues of B.

Proof.
Consider the inner optimization problem of (3.10) in (3.9). Using the trace optimization problems and their solutions in [90], we get

min_{G^T G=I_i} J_λ(G) = min_{G^T G=I_i} Tr[G^T B G] = β_1 + β_2 + · · · + β_i,

where β_1, β_2, . . . , β_i are the i smallest eigenvalues of B, and the minimum value can be achieved by the matrix V whose columns are the corresponding eigenvectors. If the number of negative eigenvalues of B is less than r, then the optimum i in (3.9) is j; otherwise, the optimum i is r.

Note that including the eigenvectors corresponding to zero eigenvalues of B in our solution G in Theorem 3.3 does not change the minimum value in (3.12). However, considering only the eigenvectors corresponding to negative eigenvalues results in a G with the least rank and, thereby, an encoder that is less likely to contain sensitive information for an adversary to exploit. Once G is constructed, we can obtain our desired encoder as Θ = G^T L_X^T Q_X^{-T}. Recall that the solution in Theorem 3.3 is under the assumption that the covariance C_X is a full-rank matrix.

3.3 Empirical Solution for Linear Encoder

In many practical scenarios, we only have access to data samples, not to the population mean vectors and covariance matrices. Therefore, the population solution might not be feasible in such a case. In this section, we provide an approach to solve the optimization problem in (3.8) that relies on empirical moments and is valid even if the covariance matrix C_X is not full-rank. Firstly, for a given Θ, we find

J_Y = min_{W_Y, b_Y} MSE(Ŷ − Y).

Note that the above optimization problem can be separated over W_Y and b_Y. Therefore, for a given W_Y, we first minimize over b_Y:

min_{b_Y} E[‖W_Y ΘX + b_Y − Y‖²] = min_{b_Y} (1/n) Σ_{k=1}^n ‖W_Y Θx_k + b_Y − y_k‖² = (1/n) Σ_{k=1}^n ‖W_Y Θx_k + c − y_k‖²,

where we used the empirical expectation and the minimizer c is

c = (1/n) Σ_{k=1}^n (y_k − W_Y Θx_k) = (1/n) Σ_{k=1}^n y_k − W_Y Θ (1/n) Σ_{k=1}^n x_k = E[Y] − W_Y Θ E[X]. (3.13)

Let all the columns of the matrix C be equal to c.
We now have

J_Y = min_{W_Y, b_Y} MSE(Ŷ − Y)
 = min_{W_Y} (1/n) ‖W_Y ΘX + C − Y‖²_F
 = min_{W_Y} (1/n) ‖W_Y ΘX̃ − Ỹ‖²_F
 = min_{W_Y} (1/n) ‖X̃^T Θ^T W_Y^T − Ỹ^T‖²_F
 = min_{W_Y} (1/n) ‖M W_Y^T − P_M Ỹ^T‖²_F + (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖M M^† P_M Ỹ^T − P_M Ỹ^T‖²_F + (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖P_M⊥ Ỹ^T‖²_F
 = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖P_M Ỹ^T‖²_F,

where in the third step we used (3.13) and M = X̃^T Θ^T, and the fifth step is due to the orthogonal decomposition of Ỹ^T into P_M Ỹ^T + P_M⊥ Ỹ^T. Using the same approach, we get

J_S = (1/n) ‖S̃^T‖²_F − (1/n) ‖P_M S̃^T‖²_F. (3.14)

Now, assume that the columns of L_X form an orthonormal basis for the column space of X̃^T. Therefore, for any M, there exists a G such that L_X G = M. In general, there is no bijection between Θ and G in the equality X̃^T Θ^T = L_X G. However, there is a bijection between G and Θ when we restrict attention to Θ's for which R(Θ^T) ⊆ N(X̃^T)^⊥. This restricted bijection is sufficient to consider, since for any Θ^T with R(Θ^T) ⊆ N(X̃^T), we have M = 0. Once G is determined, Θ^T can be obtained as

Θ^T = (X̃^T)^† L_X G + Θ_0, R(Θ_0) ⊆ N(X̃^T).

However, since

‖Θ‖²_F = ‖Θ^T‖²_F = ‖(X̃^T)^† L_X G‖²_F + ‖Θ_0‖²_F,

choosing Θ_0 = 0 results in the minimum ‖Θ‖_F, which is favorable in terms of robustness to noise. By choosing Θ_0 = 0, determining the encoder Θ is equivalent to determining G. Similar to (3.6), we have P_M = L_X P_G L_X^T. If we assume that the rank of P_G is i, J_λ(G) in (3.10) can be expressed as

J_λ(G) = λ ‖L_X G G^T L_X^T S̃^T‖²_F − (1 − λ) ‖L_X G G^T L_X^T Ỹ^T‖²_F,

where G G^T = P_G for some orthonormal matrix G ∈ R^{d_X×i}. This resembles the optimization problem in (3.9), and therefore it has the same solution as Theorem 3.3 with the modified B given by

B = L_X^T (λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ) L_X. (3.15)

Once G is determined, Θ can be obtained as G^T L_X^T X̃^†.

3.3.1 Non-Linear Extension Through Kernelization

We extend the "linear" version of the ARL problem studied thus far to a "non-linear" version through kernelization.
We model the encoder in the ARL problem as a linear function over a non-linear mapping of the inputs, as illustrated in Figure 3.5. Let the data matrix X be mapped non-linearly by a possibly unknown and infinite-dimensional function φ_X(·), and let the corresponding reproducing kernel function be k_X(·, ·). From (2.6), it follows that the representation Z can be expressed as

Z = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. (3.16)

Figure 3.5: Kernel-ARL: Illustration of kernel adversarial representation learning for learning a fair representation. An encoder f, in the form of a linear mapping on top of the kernelized input, produces a new representation Z = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. A target predictor g_Y and an adversary g_S, in the form of linear regressors, operate on the representation Z.

The scalarization formulation of this kernel-ARL setup and its solution share the same form as those of the linear case (3.8). The objective function remains non-convex and non-differentiable, while the matrix B now depends on the kernel matrix K_X instead of the covariance matrix C_X (see Appendix A.5 for details):

B = L_X^T (λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ) L_X, (3.17)

where the columns of L_X are the orthonormal basis for H_{K_X}. Once G is obtained through the eigendecomposition of B, we can obtain the optimal encoder as Θ = G^T L_X^T K_X^†. This non-linear extension in the form of kernelization serves to study the ARL problem under a setting where the encoder possesses greater representational capacity while still being able to obtain the global optima and bounds on the objectives of the target predictor and the adversary.

3.4 Analytical Bounds

In this section, we introduce bounds on the utility and invariance of the representation learned by SARL. We define four bounds: α_min, α_max, γ_min, and γ_max.
γ_min: A lower bound on the minimum achievable target loss, or equivalently an upper bound on the best achievable target performance. This bound can be expressed as the minimum target MSE across all possible encoders Θ and is attained at λ = 0:

γ_min = min_Θ J_Y(Θ)

α_max: An upper bound on the maximum achievable adversary loss, or equivalently a lower bound on the minimum leakage of the sensitive attribute. This bound can be expressed as the maximum adversary MSE across all possible encoders Θ and is attained at λ = 1:

α_max = max_Θ J_S(Θ)

γ_max: An upper bound on the maximum achievable target loss, or equivalently a lower bound on the minimum achievable target performance. This bound corresponds to the scenario where the encoder is constrained to maximally hinder the adversary. In all other cases, one can obtain higher target performance by choosing a better encoder. This bound is attained in the limit λ → 1 and can be expressed as

γ_max = min_{arg max_Θ J_S(Θ)} J_Y(Θ)

α_min: A lower bound on the minimum achievable adversary loss, or equivalently an upper bound on the maximum leakage of the sensitive attribute. The absolute lower bound is obtained in the scenario where the encoder is neither constrained to aid the target nor to hinder the adversary, i.e.,

α*_min = min_Θ J_S(Θ)

However, this is an unrealistic scenario, since in the ARL problem, by definition, the encoder is explicitly designed to aid the target. Therefore, a more realistic lower bound can be defined under the constraint that the encoder maximally aids the target, i.e.,

ᾱ_min = min_{arg min_Θ J_Y(Θ)} J_S(Θ)

However, even this bound is not realistic, since, among all the encoders that aid the target, one can always choose the encoder that minimizes the leakage of the sensitive attribute. The bound corresponding to such an encoder can be expressed as

α_min = max_{arg min_Θ J_Y(Θ)} J_S(Θ)

This bound is attained in the limit λ → 0.
It is easy to see that these bounds are ordinally related as

α*_min ≤ ᾱ_min ≤ α_min.

To summarize, in each of these cases, there exists an encoder that achieves the respective bound. Therefore, given a choice, the encoder that corresponds to α_min is the most desirable. The following theorem gives closed-form expressions for these bounds as a function of the data.

Theorem 3.4. Let the columns of L_X be the orthonormal basis for H_{K_X}. Further, assume that the columns of V_S are the singular vectors corresponding to the zero singular values of S̃ L_X and the columns of V_Y are the singular vectors corresponding to the non-zero singular values of Ỹ L_X. Then, we have

γ_min := min_Θ J_Y(Θ) = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖Ỹ L_X‖²_F

γ_max := min_{arg max_Θ J_S(Θ)} J_Y(Θ) = (1/n) ‖Ỹ^T‖²_F − (1/n) ‖Ỹ L_X V_S‖²_F

α_min := max_{arg min_Θ J_Y(Θ)} J_S(Θ) = (1/n) ‖S̃^T‖²_F − (1/n) ‖S̃ L_X V_Y‖²_F

α_max := max_Θ J_S(Θ) = (1/n) ‖S̃^T‖²_F

Proof. See Appendix A.6.

Under the special case of one-dimensional data, i.e., when X, Y, and S are scalars, the above bounds can be related to the correlation coefficients (i.e., normalized correlations) of the variables involved. Specifically, the normalized bounds γ_min and α_min can be expressed as

γ_min / σ_Y² = 1 − ρ²(X, Y), α_min / σ_S² = 1 − ρ²(X, S),

where ρ(·, ·) denotes the correlation coefficient between two RVs and σ_Y² := Var[Y] (σ_S² is similarly defined). Similarly, the upper bounds γ_max and α_max can be expressed in terms of the variance of the label space as

γ_max / σ_Y² = α_max / σ_S² = 1.

Therefore, in the one-dimensional setting, the achievable bounds are related to the underlying alignment between the subspace spanned by the data X and the respective subspaces spanned by the labels S and Y.

3.5 Computational Complexity

In the case of linear-SARL, calculating the covariance matrices C_X, C_YX, and C_SX requires O(d_X² n), O(d_Y d_X n), and O(d_S d_X n) multiplications, respectively. Next, the complexity of the Cholesky factorization C_X = Q_X^T Q_X and of calculating its inverse Q_X^{-1} is O(d_X³) each.
Finally, solving the optimization problem has a complexity of O(d_X³) to eigendecompose the d_X × d_X matrix B. In the case of kernel-SARL, the eigendecomposition of B requires O(n³) operations. However, for scalability, i.e., large n (e.g., CIFAR-100), the Nyström method with data sampling [92] can be adopted. To summarize, the complexity of the linear and kernel formulations is O(d_X³) and O(n³), respectively.

3.6 Numerical Experiments

We evaluate the efficacy of the proposed Spectral-ARL (SARL) algorithm in finding the global optima and compare it with other ARL baselines that are based on the standard SGDA optimization. In all experiments, we refer to our solution for "linear" ARL as Linear-SARL and to the solution for the "kernel" version of the encoder with linear classifiers for the predictor and adversary as Kernel-SARL.

Figure 3.6: (a) Samples from a mixture of four Gaussians. Each sample has two attributes, shape and color. (b) The trade-off between target performance and leakage of a sensitive attribute by an adversary.

3.6.1 Mixture of Four Gaussians

We first consider a simple example in order to visualize and compare the learned embeddings from different ARL solutions. We consider a three-dimensional problem where each data sample consists of two attributes, color and shape. Specifically, the input data X is generated from a mixture of four different Gaussian distributions corresponding to the different possible combinations of the attributes, i.e., {•, •, ×, ×}, with means at μ_1 = (1, 1, 0), μ_2 = (2, 2, 0), μ_3 = (2, 2.5, 0), μ_4 = (2.5, 3, 0) and identical covariance matrices C_X = diag(0.3², 0.3², 0.3²). The shape attribute is the target, while color is the sensitive attribute, as illustrated in Figure 3.6 (a).
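On synthetic data of this kind, the empirical recipe of Section 3.3 reduces to a centering step, one QR factorization, and one symmetric eigendecomposition. The following sketch is illustrative, not authoritative: the data-generation details (label assignment, sample count) and the helper name sarl_encoder are ours, while the steps follow (3.15) and the encoder recovery Θ^T = (X̃^T)^† L_X G:

```python
import numpy as np

rng = np.random.default_rng(4)
n, dX = 400, 3

# Synthetic data in the spirit of the four-Gaussian example: a binary
# target label y (shape) and a binary sensitive label s (color).
s = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)
means = np.array([[1.0, 1.0, 0.0], [2.0, 2.0, 0.0],
                  [2.0, 2.5, 0.0], [2.5, 3.0, 0.0]])
X = (means[2 * s + y] + 0.3 * rng.standard_normal((n, dX))).T  # d_X x n
Y = np.eye(2)[y].T                     # one-hot regression targets, d_Y x n
S = np.eye(2)[s].T                     # one-hot sensitive labels, d_S x n

def sarl_encoder(X, Y, S, lam):
    """Empirical linear-SARL sketch (Section 3.3): closed-form encoder
    for a given trade-off lam in [0, 1)."""
    m = X.shape[1]
    H = np.eye(m) - np.ones((m, m)) / m
    Xt, Yt, St = X @ H, Y @ H, S @ H                 # centered data matrices
    L_X, _ = np.linalg.qr(Xt.T)                      # basis for col space of X~^T
    B = L_X.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_X   # cf. (3.15)
    B = (B + B.T) / 2                                # symmetrize numerically
    w, V = np.linalg.eigh(B)
    G = V[:, w < -1e-10]                             # negative-eigenvalue eigenvectors
    Theta = (np.linalg.pinv(Xt.T) @ L_X @ G).T       # encoder, r x d_X
    return Theta

Theta = sarl_encoder(X, Y, S, lam=0.5)
Z = Theta @ X                                        # learned representation
assert Theta.shape[1] == dX
```

Note that the embedding dimensionality r = Theta.shape[0] falls out of the eigendecomposition rather than being chosen by hand, mirroring Theorem 3.3.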
The goal of the ARL problem is to learn an encoder that projects the data such that it remains separable with respect to the shape attribute and non-separable with respect to the color attribute. We sample 4000 points to learn linear and non-linear (Gaussian RBF kernel) encoders across λ ∈ [0, 1]. To train the encoder, the one-hot encodings of the target and sensitive labels are treated as the regression targets. Then, we freeze the encoder and train logistic regressors for the adversary and the target task for each λ. We evaluate their classification performance on a separate set of 1000 samples. The resulting trade-off front between target and adversary performance is shown in Figure 3.6 (b).

Figure 3.7: Gaussian Mixture: The optimal dimensionality of the embedding Z is 1. Visualization of the embedding histograms w.r.t. each attribute for different relative emphasis, λ, on the target (shape) and sensitive (color) attributes. The top row is color and the bottom row is shape. The first three columns show results for a linear encoder at λ = 0, 0.5, and 1. At λ = 0, the weight on the adversary is 0, so the color is still separable. As the value of λ increases, we observe that the colors are less and less separable. The last column shows results for a kernel encoder at λ = 0.5. We observe that the target attribute is quite separable while the sensitive attribute is entangled.

We make the following observations: (1) For λ = 1, all methods achieve an accuracy of 50% for the adversary, which indicates complete removal of the features corresponding to the sensitive attribute via our encoding. (2) At small values of λ, the objective of Linear-ARL is close to being convex; hence the similarity in the trade-off fronts of Linear-SARL and SGDA-ARL in that region.
However, everywhere else, due to the iterative nature of SGDA, it is unable to find the global solution and achieve the same trade-off as Linear-SARL. (3) The non-linear encoder in the Kernel-SARL solution significantly outperforms both Linear-SARL and SGDA-ARL. The non-linear nature of the encoder enables it to strongly entangle the color attribute (50% adversary accuracy) while simultaneously achieving a higher target accuracy than the linear encoder. Figure 3.7 visualizes the learned embedding space Z for different trade-offs between the target and adversary objectives.

Figure 3.8: Gaussian Mixture: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set. The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively.

Figure 3.8 shows the MSE of the adversary as we vary the relative trade-off parameter λ between the target and adversary objectives. The plot illustrates (1) the lower and upper bounds, α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1]. This is, however, not the case on the test set, as the bounds are computed from empirical moments as opposed to the population covariance matrices.

3.6.2 Fair Classification

We consider the task of learning representations that are invariant to a sensitive attribute on two datasets, Adult and German, from the UCI ML repository [93].
For comparison, apart from the raw features X, we consider several baselines that use NNs and are trained through SGDA: LFR [36], VAE [94], VFAE [37], ML-ARL [24], and MaxEnt-ARL [1]. The Adult dataset contains 14 attributes. There are 30,163 and 15,060 instances in the training and test sets, respectively. The target task is the binary classification of annual income, i.e., more or less than 50K, and the sensitive attribute is gender. Similarly, the German dataset contains 1000 instances of individuals with 20 different attributes. The target is to classify the creditworthiness of individuals as good or bad, with the sensitive attribute being age.

Table 3.1: Fair Classification Performance (in %)

                          Adult Dataset               German Dataset
  Method             Target    Sensitive   ∆*     Target    Sensitive   ∆*
                    (income)   (gender)          (credit)    (age)
  Raw Data            85.0       85.0     17.6     80.0       87.0      6.0
  LFR [36]            82.3       67.0      0.4     72.3       80.5      0.5
  VAE [94]            81.9       66.0      1.4     72.5       79.5      1.5
  VFAE [37]           81.3       67.0      0.4     72.7       79.7      1.3
  ML-ARL [24]         84.4       67.7      0.3     74.4       80.2      0.8
  MaxEnt-ARL [1]      84.6       65.5      1.9     72.5       80.0      1.0
  Linear-SARL         84.1       67.4      0.0     76.3       80.9      0.1
  Kernel-SARL         84.1       67.4      0.0     76.3       80.9      0.1

  ∆*: absolute difference between adversary accuracy and random guess.

We learn encoders on the training set, after which, following the baselines, we freeze the encoder and train the target (logistic regression) and adversary (2-layer network with 64 units) classifiers on the training set. Table 3.1 shows the performance of the target and adversary on both datasets. Both Linear-SARL and Kernel-SARL outperform all NN-based baselines. For either of these tasks, Kernel-SARL does not afford any additional benefit over Linear-SARL. For the Adult dataset, the linear encoder maps the 14 input features to just one dimension. The weights assigned to each feature are shown in Figure 3.9. Notice that the encoder assigns almost zero weight to the gender feature in order to be fair with respect to the gender attribute.
3.6.3 Illumination-Invariant Face Classification

This task pertains to face classification under different illumination conditions on the Extended Yale B dataset [95]. It comprises face images of 38 people under five different light source directions, namely, upper right, lower right, lower left, upper left, and front. The target task is to establish the identity of the person in the image, with the direction of the light being the sensitive attribute. Since the direction of lighting is independent of identity, the ideal ARL solution should obtain a representation Z that is devoid of any sensitive information.

Figure 3.9: Adult Dataset: Magnitude of the learned encoder weights Θ for each semantic input feature (age, work class, final weight, education, education num, marital status, occupation, relationship, race, gender, capital gain, capital loss, hours/week, country).

We first followed the experimental setup of Xie et al. [24] in terms of the train/test split strategy, i.e., 190 samples (5 from each class) for training and 1096 images for testing. Our global solution was able to completely remove illumination from the embedding, resulting in an adversary accuracy of 20%, i.e., random chance. To investigate further, we consider different variations of this problem, flipping the target and sensitive attributes and exchanging the training and test sets. The complete set of results, including NN-based baselines, is reported in Table 3.2 ([EX] corresponds to exchanging the training and testing sets). In all these cases, our solution was able to completely remove the sensitive features, resulting in adversary performance that is no better than random chance. Simultaneously, the embedding is also competitive with the baselines on the target task.
Table 3.2: Extended Yale B Performance (in %)

  Method               Adversary       Target     Adversary      Target
                     (illumination)  (identity)  (identity)  (illumination)
  Raw Data                 96            78           -             -
  VFAE [37]                57            85           -             -
  ML-ARL [24]              57            89           -             -
  MaxEnt-ARL [1]           40            89           -             -
  Linear-SARL              21            81           3            94
  Linear-SARL [EX]         20            86           3            97
  Kernel-SARL              20            86           3            96
  Kernel-SARL [EX]         20            88           3            96

3.6.4 CIFAR-100

The CIFAR-100 dataset [96] consists of 50,000 images from 100 classes that are further grouped into 20 superclasses. Each image is therefore associated with two attributes, a "fine" class label and a "coarse" superclass label. We consider a setup where the "coarse" and "fine" labels are the target and sensitive attributes, respectively. For Linear-SARL, Kernel-SARL (degree-five polynomial kernel), and SGDA, we use 64-dimensional features extracted from a pre-trained ResNet-110 model as the input to the encoder instead of raw images. From these features, the encoder is tasked with aiding the target predictor and hindering the adversary. This setup serves as an example of how invariance can be "imparted" to an existing biased pre-trained representation. We also consider two NN baselines, ML-ARL [24] and MaxEnt-ARL [1]. Unlike our scenario, where the pre-trained layers are not adapted, the baselines optimize the entire encoder for the ARL task. For evaluation, once the encoder is learned and frozen, we train a discriminator and adversary as 2-layer networks with 64 neurons each. Therefore, although our approach uses a linear regressor as the adversary at training, we evaluate against stronger adversaries at test time. In contrast, the baselines train and evaluate against adversaries of equal capacity. Figure 3.10 shows the trade-off in accuracy between the target predictor and adversary.
We observe that (1) Kernel-SARL significantly outperforms Linear-SARL. Since the former implicitly maps the data into a higher-dimensional space, the sensitive features are potentially disentangled sufficiently for the linear encoder in that space to discard such information. Therefore, even for large values of λ, Kernel-SARL is able to simultaneously achieve high target accuracy while keeping the adversary performance low. (2) Despite being handicapped by the fact that Kernel-SARL is evaluated against stronger adversaries than it is trained against, its performance is comparable to that of the NN baselines. In fact, it outperforms both ML-ARL and MaxEnt-ARL with respect to the target task. (3) Despite repeated attempts with different hyper-parameters and choices of optimizers, SGDA was highly unstable across most datasets; it often got stuck in a local optimum and failed to find good solutions.

Figure 3.10: CIFAR-100: Trade-off between target performance and leakage of the sensitive attribute to the adversary.

Figure 3.11 shows the MSE of the adversary as we vary the relative trade-off λ between the target and adversary objectives. The plot illustrates (1) the lower and upper bounds, α_min and α_max, respectively, calculated on the training dataset, (2) the achievable adversary MSE computed on the training set, α_train, and (3) the achievable adversary MSE computed on the test set, α_test. Observe that on the training dataset, all values of α ∈ [α_min, α_max] are reachable as we sweep through λ ∈ [0, 1].

Figure 3.11: CIFAR-100: Lower and upper bounds on the adversary loss, α_min and α_max, computed on the training set.
The loss achieved by our solution as we vary λ is shown on the training and testing sets, α_train and α_test, respectively. This is, however, not the case on the test set, as the bounds are computed from empirical moments as opposed to the true covariance matrices.

Figure 3.12 plots the optimal embedding dimensionality provided by SARL as a function of the trade-off parameter λ. At small values of λ, the objective favors the target task, i.e., 20-class prediction. Thus, SARL does indeed determine the optimal dimensionality of 19 for a 20-class linear target regressor. However, at large values of λ, the objective only seeks to hinder the sensitive task, i.e., 100-class prediction. In this case, the ideal embedding dimensionality from the perspective of the linear adversary regressor is at least 99. The SARL-ascertained dimensionality of one is, thus, optimal for maximally mitigating the leakage of the sensitive attribute from the embedding. However, unsurprisingly, the target task also suffers significantly.

Figure 3.12: CIFAR-100: Optimal embedding dimensionality learned by SARL. At small values of λ, the objective favors the target task, which predicts 20 classes; thus, an embedding dimensionality of 19 is optimal for a linear target regressor. At large values of λ, the objective only seeks to hinder the adversary; thus, SARL determines the optimal dimensionality of the embedding as one.

3.7 Summary

We studied the "linear" form of adversarial representation learning (ARL), where all the entities are linear functions. We showed that the optimization problem, even for this simplified version, is both non-convex and non-differentiable. Using tools from spectral learning, we obtained a closed-form expression for the global optima and derived analytical bounds on the achievable utility and invariance.
We also extended these results to non-linear parameterizations through kernelization. Numerical experiments on multiple datasets indicated that the globally optimal solution of the "kernel" form of ARL is able to obtain a trade-off between utility and invariance that is comparable to that of locally optimal solutions of NN-based ARL. At the same time, unlike NN-based solutions, the proposed method can (1) analytically determine the achievable utility and invariance bounds and (2) provide explicit control over the trade-off between utility and invariance.

Admittedly, the results presented in this chapter do not extend directly to NN-based formulations of ARL. However, we believe they shed light on the nature of the ARL optimization problem and aid our understanding of it. They help delineate the roles of the optimization algorithm and the choice of embedding function, highlighting the trade-off between the expressivity of the functions and our ability to obtain the global optima of the adversarial game. We consider our contribution a first step towards controlling the non-convexity that naturally appears in game-theoretic representation learning.

Chapter 4
Adversarial Representation Learning With Closed-Form Solvers

4.1 Introduction

In this chapter, we revisit the ARL problem from the NN-based optimization point of view. The vanilla algorithm for learning the parameters of the encoder, target, and adversary networks is SGDA [24, 1], where the players take gradient steps simultaneously; see Figure 4.1 for an illustration. However, SGDA is not an optimal strategy for ARL and is known to suffer from several drawbacks. Firstly, SGDA has undesirable convergence properties: it can fail to converge to a local minimum, can converge to fixed points that are not local minimax, and is very unstable and slow in practice [67, 68].
Secondly, SGDA exhibits strong rotation around fixed points, which requires using very small learning rates [64, 66] to converge. Numerous solutions [64, 65, 69] have been proposed recently to address the aforementioned computational challenges. These approaches, however, seek to obtain solutions to the minimax optimization problem in the general case, where each player is modeled as a complex neural network.

We take a different perspective and propose an alternative optimization algorithm for ARL. Our key insight is to replace the shallow NNs with other analytically tractable models of similar capacity. We propose to adopt simple learning algorithms that admit closed-form solutions, such as linear or kernel ridge regressors, for the target and adversary, while modeling the encoder as a DNN.

Figure 4.1: ARL-SGDA: Illustration of training adversarial representation learning through stochastic gradient descent ascent. (i) First, the target predictor parameters Θ_Y are updated while the encoder and adversary parameters are frozen. (ii) Then, the adversary parameters Θ_S are updated while the encoder and target parameters are frozen. (iii) Finally, the encoder parameters Θ are updated while the target and adversary parameters are frozen. SGDA does not provide any convergence guarantees.

Crucially, such models are particularly suitable for ARL and afford numerous advantages: (1) the closed-form solution allows the learning problems to be optimized globally and efficiently; (2) it yields an analytical upper bound on the optimal dimensionality of the embedding; (3) its simplicity and differentiability allow us to backpropagate through the closed-form solution; and (4) in practice, it resolves the notorious rotational behavior of iterative minimax gradient dynamics, resulting in a simple optimization that is empirically stable, reliable, converges faster to a local optimum, and ultimately yields a more effective encoder.
We demonstrate the practical effectiveness of our approach, dubbed OptNet-ARL, through numerical experiments on an illustrative toy example, fair classification on the UCI Adult and German datasets, and mitigating information leakage on the CelebA dataset. We consider two scenarios where the target and sensitive attributes are (a) dependent and (b) independent. Our results indicate that, in comparison to existing ARL solutions, OptNet-ARL is more stable and converges faster, while also outperforming them in terms of accuracy, especially in the latter scenario.

A number of recent approaches have integrated differentiable solvers, both iterative and closed-form, within end-to-end learning systems. Structured layers for segmentation and higher-order pooling were introduced by [97]. Similarly, [98] proposed an asymmetric architecture that incorporates a correlation filter as a differentiable layer. Differentiable optimization as a layer in NNs was introduced by [99, 100]. More recently, differentiable solvers have also been adopted for meta-learning [101, 102]. The primary motivation for all the aforementioned approaches is to endow DNNs with differentiable optimization and ultimately achieve faster convergence of the end-to-end systems. In contrast, our inspiration for using differentiable closed-form solvers is to control the non-convexity of the optimization in ARL in terms of stability, reliability, and effectiveness.

4.2 Problem Setting

Recall the ARL optimization problem in (3.1) and denote the global minima of the adversary and target estimator losses as

J_Y(Θ) := min_{Θ_Y} E_{X,Y}[L_Y(g_Y(f(X; Θ); Θ_Y), Y)]
J_S(Θ) := min_{Θ_S} E_{X,S}[L_S(g_S(f(X; Θ); Θ_S), S)].        (4.1)

Similar to Chapter 3, instead of solving the constrained optimization problem in (3.1), we solve its scalarized version:

min_Θ { (1 − λ) J_Y(Θ) − λ J_S(Θ) },   0 ≤ λ ≤ 1,              (4.2)

where λ is the trade-off parameter between utility and the leakage of the sensitive information.
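The notion of one operating point dominating another, formalized in Definition 4.1 below, is straightforward to state in code. A minimal sketch, where the first coordinate is the adversary loss J_S (higher means more invariance) and the second is the target loss J_Y (lower means more utility):

```python
def dominates(b, a):
    # (b1, b2) dominates (a1, a2) iff b1 > a1 and b2 < a2 simultaneously
    # (Definition 4.1): strictly higher adversary loss and strictly lower
    # target loss, i.e., strictly better on both objectives at once.
    return b[0] > a[0] and b[1] < a[1]

assert dominates((0.9, 0.2), (0.5, 0.4))        # better on both objectives
assert not dominates((0.9, 0.4), (0.5, 0.4))    # tie in target loss: no domination
```

A tie in either coordinate is not enough; domination requires strict improvement on both objectives, which is exactly what Lemma 4.2 rules out for sub-optimal solvers.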
4.2.1 Motivating Exact Solvers

Most state-of-the-art ARL algorithms (e.g., SGDA) cannot solve the optimization problems in (4.1) optimally. For any given Θ, denote the loss functions of non-optimal target and adversary predictors by J_Y^approx(Θ) and J_S^approx(Θ), respectively. Clearly, for any given Θ, it holds that J_Y^approx(Θ) ≥ J_Y(Θ) and J_S^approx(Θ) ≥ J_S(Θ). Note that the optimization problem arising from non-optimal adversary and target predictors is

min_Θ { (1 − λ) J_Y^approx(Θ) − λ J_S^approx(Θ) },   0 ≤ λ ≤ 1.   (4.3)

Intuitively, solution(s) of (4.3) do not outperform those of (4.2). We now formulate this intuition more concretely.

Definition 4.1. Let (a_1, a_2) and (b_1, b_2) be two arbitrary points in R². We say (b_1, b_2) dominates (a_1, a_2) iff b_1 > a_1 and b_2 < a_2 hold simultaneously.

The following lemma states that no solution obtained by sub-optimal adversary and target predictors can dominate that of exact solvers.

Lemma 4.2. For any λ_1, λ_2 ∈ [0, 1], consider the following optimization problems:

Θ_exact = arg min_Θ { (1 − λ_1) J_Y(Θ) − λ_1 J_S(Θ) }             (4.4)

and

Θ_approx = arg min_Θ { (1 − λ_2) J_Y^approx(Θ) − λ_2 J_S^approx(Θ) }.

Then, the adversary-target objective trade-off generated by (J_S(Θ_exact), J_Y(Θ_exact)) cannot be dominated by the trade-off generated by (J_S(Θ_approx), J_Y(Θ_approx)).

Proof. It is enough to show that (i) if J_S(Θ_approx) > J_S(Θ_exact) then J_Y(Θ_approx) ≥ J_Y(Θ_exact), and (ii) if J_Y(Θ_approx) < J_Y(Θ_exact) then J_S(Θ_approx) ≤ J_S(Θ_exact).

The key point is to observe from (4.4) that, regardless of λ_2, J_Y^approx, and J_S^approx, we have

(1 − λ_1) J_Y(Θ_exact) − λ_1 J_S(Θ_exact) ≤ (1 − λ_1) J_Y(Θ_approx) − λ_1 J_S(Θ_approx).

Now, consider three possible cases for λ_1:

a) λ_1 = 0: In this case we have J_Y(Θ_exact) ≤ J_Y(Θ_approx), and therefore, regardless of J_S(Θ_exact) and J_S(Θ_approx), (ii) cannot happen and (i) holds under its assumption.
b) λ_1 = 1: In this case we have J_S(Θ_exact) ≥ J_S(Θ_approx), and therefore, regardless of J_Y(Θ_exact) and J_Y(Θ_approx), (i) cannot happen and (ii) holds under its assumption.

c) 0 < λ_1 < 1:

(i) If J_S(Θ_approx) > J_S(Θ_exact), then

0 < λ_1 (J_S(Θ_approx) − J_S(Θ_exact)) ≤ (1 − λ_1)(J_Y(Θ_approx) − J_Y(Θ_exact)).

This implies that J_Y(Θ_approx) ≥ J_Y(Θ_exact).

(ii) If J_Y(Θ_approx) < J_Y(Θ_exact), then

0 < (1 − λ_1)(J_Y(Θ_exact) − J_Y(Θ_approx)) ≤ λ_1 (J_S(Θ_exact) − J_S(Θ_approx)).

This implies that J_S(Θ_approx) < J_S(Θ_exact). ∎

Figure 4.2: ARL with kernelized ridge regressors for the adversary and target predictors. This setting turns the typical SGDA optimization of ARL into simple SGD optimization.

4.3 Exact Adversary and Target Predictor Solvers

Existing instances of ARL adopt NNs to represent f, g_Y, and g_S and learn their respective parameters {Θ, Θ_Y, Θ_S} through SGDA. Consequently, the target and adversary objectives in (4.1) are not solved to optimality, thereby resulting in a sub-optimal encoder.

4.3.1 Closed-Form Adversary and Target Predictors

The machine learning literature offers a wealth of methods with exact solutions appropriate for modeling the adversary and target predictors. In this section, we argue for and adopt simple, fast, and differentiable methods such as kernel ridge regressors, as shown in Figure 4.2. Such modeling allows us to obtain the globally optimal estimators for any given encoder f(·; Θ). Moreover, kernelized ridge regressors can be stronger than the shallow NNs that are used in many ARL-based solutions (e.g., [24, 103, 2, 1]). Although it is not the focus of this dissertation, it is worth noting that even DNNs in the infinite-width limit reduce to linear models with a kernel called the neural tangent kernel [104], and as such can be adopted to increase the capacity of our regressors.
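To make the closed-form solve concrete, the sketch below (NumPy, synthetic data, Gaussian RBF kernel with σ = 1; the dimensions and data are arbitrary illustrative choices) computes the ridge minimizer of (1/n)‖Θ_Y K_Y − Y‖²_F + γ_Y ‖Θ_Y‖²_F analytically and checks it against the projection form derived in Lemma 4.3 below:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, d_y, gamma = 50, 3, 2, 1e-2

Z = rng.normal(size=(r, n))    # columns z_i: a stand-in for encoder outputs
Y = rng.normal(size=(d_y, n))  # regression targets (one-hot labels work the same way)

# Gaussian RBF kernel matrix K[i, j] = exp(-||z_i - z_j||^2 / 2), sigma = 1.
sq_dists = ((Z[:, :, None] - Z[:, None, :]) ** 2).sum(axis=0)
K = np.exp(-sq_dists / 2.0)

# Closed-form minimizer of (1/n)||Theta K - Y||_F^2 + gamma ||Theta||_F^2:
# setting the gradient to zero gives Theta = Y K (K^2 + n gamma I)^{-1}.
Theta = Y @ K @ np.linalg.inv(K @ K + n * gamma * np.eye(n))
J_direct = np.linalg.norm(Theta @ K - Y) ** 2 / n + gamma * np.linalg.norm(Theta) ** 2

# Projection form of the same minimum (cf. Lemma 4.3), with
# M = [K; sqrt(n gamma) I] stacked vertically and P_M = M (M^T M)^{-1} M^T.
M = np.vstack([K, np.sqrt(n * gamma) * np.eye(n)])
P_M = M @ np.linalg.inv(M.T @ M) @ M.T
U = np.vstack([Y.T, np.zeros((n, d_y))])
J_proj = (np.linalg.norm(Y) ** 2 - np.linalg.norm(P_M @ U) ** 2) / n
```

The two quantities agree to machine precision, which is the point of the projection form: the minimum regularized MSE is available without ever running an inner optimization loop.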
Consider two RKHSs of functions, H_S and H_Y, for the adversary and target networks, respectively. Let a corresponding pair of feature maps be φ_S(·) ∈ R^{r_S} and φ_Y(·) ∈ R^{r_Y}, where r_S and r_Y are the dimensionalities of the resulting features and can potentially approach infinity. The respective kernel functions of H_S and H_Y can be represented as k_S(z, z′) = ⟨φ_S(z), φ_S(z′)⟩_{H_S} and k_Y(z, z′) = ⟨φ_Y(z), φ_Y(z′)⟩_{H_Y}. Under this setting, we can relate the target and sensitive attributes to any given embedding Z as

Ŷ = Θ_Y [k_Y(z_1, Z), · · · , k_Y(z_n, Z)]^T
Ŝ = Θ_S [k_S(z_1, Z), · · · , k_S(z_n, Z)]^T,                    (4.5)

where Θ_Y ∈ R^{d_Y × n}, Θ_S ∈ R^{d_S × n}, and n is the number of data samples. Let the entire embedding of the input data be denoted as Z := [z_1, · · · , z_n] ∈ R^{r × n}, where z_i = f(x_i) for i = 1, · · · , n. Consequently, it follows that

Ŷ := [ŷ_1, · · · , ŷ_n] = Θ_Y K_Y
Ŝ := [ŝ_1, · · · , ŝ_n] = Θ_S K_S.                               (4.6)

In a typical ARL setting, once an encoder is learned (i.e., for a given fixed embedding Z), we evaluate against the best possible adversary and target predictors. In the following lemma, we obtain the minimum MSE of the kernelized adversary and target predictors for any given embedding Z.

Lemma 4.3. Let J_Y(Z) and J_S(Z) be the regularized minimum MSEs of the target and adversary:

J_Y(Z) = min_{Θ_Y} E[‖Y − Ŷ‖²] + γ_Y ‖Θ_Y‖²_F,
J_S(Z) = min_{Θ_S} E[‖S − Ŝ‖²] + γ_S ‖Θ_S‖²_F,

where γ_Y and γ_S are the regularization parameters of the target and adversary regressors, respectively. Then, for any given embedding matrix Z, the minimum MSEs of the kernelized target and adversary can be obtained as

J_Y(Z) = (1/n) ‖Y‖²_F − (1/n) ‖P_{M_Y} [Y^T; 0_{n×d_Y}]‖²_F,
J_S(Z) = (1/n) ‖S‖²_F − (1/n) ‖P_{M_S} [S^T; 0_{n×d_S}]‖²_F,     (4.7)

where [A; B] denotes vertical stacking,

M_Y := [K_Y; √(nγ_Y) I_n],   M_S := [K_S; √(nγ_S) I_n]

are both full-column-rank matrices, and the orthogonal projection matrix of any full-column-rank matrix M can be obtained as P_M = M (M^T M)^{−1} M^T.

Proof.
Using the empirical mean, we have

J_Y(Z) = min_{Θ_Y} (1/n) ‖Θ_Y K_Y − Y‖²_F + γ_Y ‖Θ_Y‖²_F
       = min_{Θ_Y} (1/n) ‖ M_Y Θ_Y^T − [Y^T; 0_{n×d_Y}] ‖²_F
       = min_{Θ_Y} (1/n) ‖ M_Y Θ_Y^T − P_{M_Y} [Y^T; 0_{n×d_Y}] ‖²_F + (1/n) ‖ P_{M_Y^⊥} [Y^T; 0_{n×d_Y}] ‖²_F
       = (1/n) ‖ P_{M_Y^⊥} [Y^T; 0_{n×d_Y}] ‖²_F
       = (1/n) ‖Y‖²_F − (1/n) ‖ P_{M_Y} [Y^T; 0_{n×d_Y}] ‖²_F,    (4.8)

where we used the orthogonal decomposition w.r.t. M_Y in the third and last steps, and a possible minimizer in the fourth step is Θ_Y^T = M_Y^† [Y^T; 0_{n×d_Y}]. Using the same approach, we get

J_S(Z) = (1/n) ‖S‖²_F − (1/n) ‖ P_{M_S} [S^T; 0_{n×d_S}] ‖²_F,

where M_S := [K_S; √(nγ_S) I_n]. ∎

It is quite straightforward to generalize this method to the case of multiple target and adversary predictors through equation (4.3). In that case, we would have multiple λs to trade off between utility and the leakage of sensitive information.

4.3.2 Optimal Embedding Dimensionality

The ability to effectively optimize the parameters of the encoder critically depends on the dimensionality of the embedding as well. Higher-dimensional embeddings can inherently absorb unnecessary extraneous information in the data. Existing ARL applications, where the target and adversary are non-linear NNs, select the dimensionality of the embedding on an ad hoc basis. Adopting closed-form solvers for the target and adversary enables us to analytically determine an upper bound on the optimal dimensionality of the embedding for OptNet-ARL. To obtain the upper bound, we rely on the observation that a non-linear target predictor and adversary, by virtue of greater capacity, can learn non-linear decision boundaries.
As such, in the context of ARL, the optimal dimensionality required by non-linear models is no larger than the optimal dimensionality of a linear target predictor and adversary. We therefore analytically determine the optimal dimensionality of the embedding in the following theorem.

Theorem 4.4. Let Z in Figure 4.2 be disconnected from the encoder and be a free vector in R^r. Further, assume that both the adversary and the target predictor are linear regressors, and γ_S = γ_Y = 0. Then, for any 0 ≤ λ ≤ 1, the optimal dimensionality r of the embedding vector is the number of negative eigenvalues of

B = λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ.                                   (4.9)

Proof. Recall that for a linear-regressor adversary and target predictor, we have

Ŷ = Θ_Y Z + b_Y,   Ŝ = Θ_S Z + b_S.                             (4.10)

Following the proof of Lemma 4.3, we have

J_Y(Z) = (1/n) ‖Ỹ‖²_F − (1/n) ‖P_M Ỹ^T‖²_F,   J_S(Z) = (1/n) ‖S̃‖²_F − (1/n) ‖P_M S̃^T‖²_F,

where M is the column space of Z̃^T Z̃ or, equivalently, the column space of Z̃^T. Consequently, it follows that

(1 − λ) J_Y(Z) − λ J_S(Z) = (1/n) [ (1 − λ) ‖Ỹ^T‖²_F − λ ‖S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F + λ ‖P_M S̃^T‖²_F ].   (4.11)

Now, consider ‖P_M Ỹ^T‖²_F:

‖P_M Ỹ^T‖²_F = Tr{ Ỹ P_M P_M Ỹ^T } = Tr{ P_M Ỹ^T Ỹ }.

Similarly,

‖P_M S̃^T‖²_F = Tr{ P_M S̃^T S̃ }.

The terms ‖Ỹ^T‖²_F and ‖S̃^T‖²_F on the right side of (4.11) are constant with respect to Z and hence can be ignored. We then get

λ ‖P_M S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F = Tr{ P_M B },

where B = λ S̃^T S̃ − (1 − λ) Ỹ^T Ỹ. Noting that any projection matrix P_M ∈ R^{n×n} of rank i ≤ n can be decomposed as V V^T for some orthogonal matrix V ∈ R^{n×i}, we get

min_{r∈{1,···,n}} min_{Z∈R^{r×n}} { λ ‖P_M S̃^T‖²_F − (1 − λ) ‖P_M Ỹ^T‖²_F }
= min_{r∈{1,···,n}} min_{dim M ≤ r} Tr{ P_M B }
= min_{r∈{1,···,n}} min_{i∈{1,···,r}} min_{V^T V = I_i} Tr{ V V^T B }
= min_{i=r∈{1,···,n}} min_{V^T V = I_i} Tr{ V V^T B }.
From trace optimization problems and their solutions in [90], we have

min_{r∈{1,···,n}} min_{V^T V = I_r} Tr{ V V^T B } = min_{r∈{1,···,n}} { β_1 + β_2 + · · · + β_r } = β_1 + β_2 + · · · + β_j,

where β_1, · · · , β_r are the r smallest eigenvalues of B and j denotes the number of negative eigenvalues of B. A possible minimizer is a matrix Z in which the column space of Z̃^T (i.e., M) is the span of the eigenvectors corresponding to all negative eigenvalues of B. ∎

Given a dataset with target and sensitive labels Y and S, respectively, the matrix B in (4.9) and its eigenvalues can be computed offline to determine the upper bound on the optimal dimensionality. By virtue of their greater capacity, the optimal dimensionality required by non-linear models is no larger than that of linear predictors; therefore, Theorem 4.4 gives a tight upper bound on the optimal embedding dimensionality. On large datasets, where B ∈ R^{n×n} can be a very large matrix, the Nyström method with data sampling [92] can be adopted.

4.3.3 Gradient of the Closed-Form Solution

To find the gradient of the encoder loss function in (4.3), with J_Y and J_S given in (4.7), we can ignore the constant terms ‖Y‖_F and ‖S‖_F. The optimization problem in (4.3) is then equivalent to

min_Θ { λ ‖P_{M_S} [S^T; 0_{n×d_S}]‖²_F − (1 − λ) ‖P_{M_Y} [Y^T; 0_{n×d_Y}]‖²_F }
= min_Θ { λ Σ_{k=1}^{d_S} ‖P_{M_S} u_S^k‖² − (1 − λ) Σ_{m=1}^{d_Y} ‖P_{M_Y} u_Y^m‖² },   (4.12)

where the vectors u_S^k and u_Y^m are the k-th and m-th columns of [S^T; 0_{n×d_S}] and [Y^T; 0_{n×d_Y}], respectively. Let M be an arbitrary matrix function of Θ, and let θ be an arbitrary scalar element of Θ. Then, from [105], we have

∂‖P_M u‖² / ∂θ = 2 u^T P_M^⊥ (∂M/∂θ) M^† u,                      (4.13)

where

[∂M/∂θ]_ij = ∇_{z_i}^T [M]_ij ∇_θ(z_i) + ∇_{z_j}^T [M]_ij ∇_θ(z_j) for i ≤ n, and 0 otherwise.
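The matrix-calculus identity in (4.13) can be checked numerically. The sketch below (NumPy) uses a toy parameterization M(t) = M₀ + t E, so that ∂M/∂t = E; this is an illustrative assumption rather than the kernel-induced dependence used in OptNet-ARL, but the identity being tested is the same. It compares a central finite difference of ‖P_M u‖² against the closed-form gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 8, 3
M0 = rng.normal(size=(m, k))    # base matrix, full column rank with high probability
E = rng.normal(size=(m, k))     # analytic dM/dt for the toy map M(t) = M0 + t E
u = rng.normal(size=m)

def proj_norm_sq(t):
    M = M0 + t * E
    P = M @ np.linalg.inv(M.T @ M) @ M.T
    return u @ P @ u            # equals ||P_M u||^2, since P is an orthogonal projection

t, h = 0.3, 1e-6
lhs = (proj_norm_sq(t + h) - proj_norm_sq(t - h)) / (2 * h)     # finite difference

M = M0 + t * E
P = M @ np.linalg.inv(M.T @ M) @ M.T
rhs = 2 * u @ (np.eye(m) - P) @ E @ np.linalg.pinv(M) @ u       # right side of (4.13)
```

The two values agree to several digits, confirming that the projection term P_M^⊥ and the pseudo-inverse M^† together capture how the projected norm responds to a perturbation of M.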
Equation (4.13) can be directly used to obtain the gradient of the objective function in (4.12). Directly computing the gradient in (4.13) requires a pseudo-inverse of the matrix M ∈ R^{2n×n}, which has a complexity of O(n³). For large datasets, this computation can get prohibitively expensive. Therefore, we approximate the gradient using a single batch of data as we optimize the encoder end-to-end. Similar approximations [92] are, in fact, commonly employed to scale up kernel methods. Thus, the computational complexity of computing the loss for OptNet-ARL reduces to O(b³), where b is the batch size. Since batch sizes in training NNs are typically on the order of 10 to 1000, computing the gradient is practically feasible. We note that the procedure presented in this section is simple SGD, whose stability can be guaranteed under Lipschitz and smoothness assumptions on the encoder network [106].

4.4 Experiments

In this section, we evaluate the efficacy of our proposed approach, OptNet-ARL, on three different tasks: fair classification on UCI datasets [107], mitigating leakage of private information on the CelebA dataset, and an ablation study on a Gaussian mixture example. We also compare OptNet-ARL with other ARL baselines in terms of the stability of optimization, the achievable trade-off front between the target and adversary objectives, convergence speed, and the effect of embedding dimensionality. We consider three baselines: (1) SGDA-ARL: vanilla stochastic gradient descent ascent, which is employed by multiple ARL approaches, including [24, 103, 2, 1, 108]; (2) ExtraSGDA-ARL: a state-of-the-art variant of stochastic gradient descent ascent that uses an extra gradient step [72] for optimizing minimax games; specifically, we use the ExtraAdam algorithm from [69]; and (3) SARL: a globally optimal solution for a kernelized-regressor encoder and linear target and adversary [49].
We also use the hypervolume (HV) [109], a metric of the stability and goodness of trade-offs for comparing algorithms under multiple objectives. A larger HV indicates a better Pareto front, and the standard deviation of the HV represents stability.

In the training stage, the encoder, a DNN, is optimized end-to-end against kernel (Gaussian RBF) ridge regressors in the case of OptNet-ARL, and against multi-layer perceptrons (MLPs) for the baselines. Table 4.1 summarizes the network architectures of all experiments. We note that the optimal embedding dimensionality, r, for a binary target is equal to one, which is consistent with Fisher's linear discriminant analysis [110]. The embedding is instance normalized (unit norm), so we adopted a fixed value of σ = 1 for the Gaussian kernel in all experiments. We let the regression regularization parameter be 10⁻⁴ for all experiments. The learning rate is 3 × 10⁻⁴ with a weight decay of 2 × 10⁻⁴, and we use Adam as the optimizer for all experiments. At the inference stage, the encoder is frozen, features are extracted, and a new target predictor and adversary are trained. At this stage, for both OptNet-ARL and the baselines, the target and adversary have the same model capacity. Furthermore, each experiment on each dataset is repeated five times with different random seeds (except for SARL, which has a closed-form solution for the encoder) and for several trade-off parameters λ ∈ [0, 1). We report the median and standard deviation across the five repetitions.

4.4.1 Fair Classification

We consider fair classification on two different tasks.

UCI Adult Dataset: It includes 14 features from 45,222 instances. The task is to classify the annual income of each person as high (50K or above) or low (below 50K). The sensitive attribute we wish to be fair with respect to is the gender of each person.

UCI German Dataset: It contains 1000 instances of individuals with 20 different attributes.
The target task is to predict their creditworthiness while being unbiased with respect to age. The correlation between the target and sensitive attributes is 0.03 and 0.02 for the Adult and German datasets, respectively. This indicates that the target attributes are almost orthogonal to the sensitive attributes. Therefore, the sensitive information can be removed entirely with only a negligible sacrifice in the performance of the target task.

Table 4.1: Network Architectures in Experiments.

Method (ARL)          | Encoder   | Embd Dim        | Target (Train) | Adversary (Train) | Target (Test) | Adversary (Test)

Adult
SGDA [24, 2]          | MLP-4-2   | 1               | MLP-4          | MLP-4             | MLP-4-2       | MLP-4-2
ExtraSGDA [72]        | MLP-4-2   | 1               | MLP-4          | MLP-4             | MLP-4-2       | MLP-4-2
SARL [49]             | RBF krnl  | 1               | linear         | linear            | MLP-4-2       | MLP-4-2
OptNet-ARL (ours)     | MLP-4-2   | 1               | RBF krnl       | RBF krnl          | MLP-4-2       | MLP-4-2

German
SGDA [24, 2]          | MLP-4     | 1               | MLP-2          | MLP-2             | logistic      | logistic
ExtraSGDA [72]        | MLP-4     | 1               | MLP-2          | MLP-2             | logistic      | logistic
SARL [49]             | RBF krnl  | 1               | linear         | linear            | logistic      | logistic
OptNet-ARL (ours)     | MLP-4     | 1               | RBF krnl       | RBF krnl          | logistic      | logistic

CelebA
SGDA [24, 2]          | ResNet-18 | 128             | MLP-64         | MLP-64            | MLP-32-16     | MLP-32-16
ExtraSGDA [72]        | ResNet-18 | 128             | MLP-64-32      | MLP-64-32         | MLP-32-16     | MLP-32-16
OptNet-ARL (ours)     | ResNet-18 | [1, 128]        | RBF krnl       | RBF krnl          | MLP-32-16     | MLP-32-16

Gaussian Mixture
SGDA [24, 2]          | MLP-8-4   | 2               | MLP-8-4        | MLP-8-4           | MLP-4-4       | MLP-4-4
ExtraSGDA [72]        | MLP-8-4   | 2               | MLP-8-4        | MLP-8-4           | MLP-4-4       | MLP-4-4
SARL [49]             | RBF krnl  | 2               | linear         | linear            | MLP-4-4       | MLP-4-4
RBF-OptNet-ARL (ours) | MLP-8-4   | 2               | RBF krnl       | RBF krnl          | MLP-4-4       | MLP-4-4
IMQ-OptNet-ARL (ours) | MLP-8-4   | [1, · · · , 512] | IMQ krnl       | IMQ krnl          | MLP-4-4       | MLP-4-4

Stability and Performance: Since there is no trade-off between the two attributes, we compare stability by reporting the median and standard deviation of the target and adversary performance in Table 4.2. Our results indicate that OptNet-ARL achieves a higher accuracy for the target tasks and lower leakage of the sensitive attribute, with less variance.
For instance, on the Adult dataset, our OptNet-ARL method achieves 83.81% target accuracy with almost zero sensitive leakage. For OptNet-ARL, the standard deviation of the sensitive-attribute accuracy is exactly zero, which demonstrates its effectiveness and stability in comparison to the baselines. Similarly, for the German dataset, OptNet-ARL achieves 80.13% sensitive accuracy, which is close to random chance (around 81%), while having the largest target accuracy compared to the baselines.

Table 4.2: Fair classification on UCI Adult and German datasets (in %).

                     | Adult Dataset                                  | German Dataset
Method               | Target (income) | Sensitive (gender) | Diff   | Target (credit) | Sensitive (age)  | Diff
Raw Data             | 85.0            | 85.0               | 17.6   | 80.0            | 87.0             | 6.0
LFR [36]             | 82.3            | 67.0               | 0.4    | 72.3            | 80.5             | 0.5
AEVB [94]            | 81.9            | 66.0               | 1.4    | 72.5            | 79.5             | 1.5
VFAE [37]            | 81.3            | 67.0               | 0.4    | 72.7            | 79.7             | 1.3
SARL [49]            | 84.1            | 67.4               | 0.0    | 76.3            | 80.9             | 0.1
SGDA-ARL [24]        | 83.61 ± 0.38    | 67.08 ± 0.48       | 0.40   | 76.53 ± 1.07    | 87.13 ± 5.70     | 6.13
ExtraSGDA-ARL [69]   | 83.66 ± 0.26    | 66.98 ± 0.49       | 0.4    | 75.60 ± 1.68    | 86.80 ± 4.05     | 5.80
OptNet-ARL (ours)    | 83.81 ± 0.23    | 67.38 ± 0.00       | 0.00   | 76.67 ± 2.21    | 80.13 ± 1.48     | 0.87

4.4.2 Mitigating Sensitive Information Leakage

The CelebA dataset [111] contains 202,599 face images of 10,177 celebrities. Each image is annotated with 40 different binary attributes (e.g., gender, emotion, age, etc.). Images are pre-processed and aligned to a fixed size of 112 × 96, and we use the official train-test splits. The target task is defined as predicting the presence or absence of high cheekbones (binary), with the sensitive attribute being smiling/not smiling (binary). The choice of this attribute pair is motivated by the presence of a trade-off between them. We observe that the correlation between this attribute pair is equal to 0.45, indicating that there is no encoder that can maintain target performance without leaking the sensitive attribute.
For this experiment, we note that SARL [49] cannot be employed since (1) it does not scale to large datasets like CelebA (its complexity is O(n³)), and (2) it cannot be applied directly to raw images and needs features extracted from a pre-trained network. Most other attribute pairs in this dataset either suffer from severe class imbalance or have small correlation, indicating the lack of a trade-off. Network architecture details are shown in Table 4.1.

Stability and Trade-off: Figure 4.3 (a) shows the attainment surface [112] and hypervolume [109] (median and standard deviation) for all methods. SGDA-ARL spans only a small part of the trade-off and, at the same time, exhibits large variance around the median curves. Overall, both baselines are unstable and unreliable when the two attributes are dependent on each other. On the other hand, OptNet-ARL solutions are very stable while also achieving a better trade-off between target and adversary accuracy.

Optimal Embedding Dimensionality: Figure 4.3 (b) compares the utility-bias trade-off of the sub-optimal embedding dimensionality (r = 128) with that of the optimal dimensionality (r = 1). We observe that the optimal embedding dimensionality (r = 1) produces a more stable trade-off between adversary and target accuracies.

Training Time: It takes five runs for SGDA-ARL and ExtraSGDA-ARL, and two runs for OptNet-ARL, to train a reliable encoder for all 11 different values of λ ∈ [0, 1). A summary of the training times is given in Figure 4.3 (c). ExtraSGDA-ARL takes an extra step to update the weights and is therefore slightly slower than SGDA-ARL. OptNet-ARL, on the other hand, is significantly faster in obtaining reliable results. Even for a single run, OptNet-ARL is faster than the baselines. This is because OptNet-ARL uses closed-form solvers for the adversary and target and therefore does not need to train any additional networks downstream of the encoder.
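The hypervolume indicator used in these comparisons can be computed for a two-objective front with a simple sweep. The sketch below is our own illustrative helper (not the implementation used in the experiments); it assumes both objectives are to be minimized relative to a reference point, so the dominated area is accumulated as rectangular slabs.

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Hypervolume (area) dominated by a 2-D front, for minimization of both
    objectives, measured relative to the reference point `ref`."""
    # Keep only points that strictly dominate the reference point.
    pts = np.asarray([p for p in points if p[0] < ref[0] and p[1] < ref[1]])
    if len(pts) == 0:
        return 0.0
    # Sort by the first objective; sweep and accumulate rectangular slabs.
    pts = pts[np.argsort(pts[:, 0])]
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            hv += (ref[0] - x) * (best_y - y)
            best_y = y
    return hv
```

For example, the front {(0, 1), (1, 0)} with reference point (2, 2) dominates an area of 3: two 2 × 1 rectangles overlapping in a unit square.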
Independent Features: We consider the target task to be the binary classification of smiling/not smiling, with the sensitive attribute being gender. In this case, the correlation between gender and the target attribute is 0.02, indicating that the two attributes are almost independent; hence it should be feasible for an encoder to remove the sensitive information without affecting the target task. The results are presented in Figure 4.3 (d). In contrast to the scenario where the two attributes are dependent, we observe that all ARL methods can perfectly hide the sensitive information (gender) from the representation without sacrificing target task performance. Therefore, OptNet-ARL is especially effective in the more practical setting where the target and sensitive attributes are correlated and hence only a trade-off can be attained.

Figure 4.3: CelebA: (a) Trade-off between adversary and target accuracy for the dependent pair (smiling/not-smiling, high cheekbones). (b) Comparison between the trade-offs of the optimal embedding dimensionality r = 1 and that of r = 128. (c) Overall and single-run training time for different ARL methods. (d) Trade-off between adversary and target for the independent pair (smiling/not-smiling, gender).
4.4.3 Ablation Study on Mixture of Four Gaussians

In this experiment, we consider a simple example where the data is generated by a mixture of four different Gaussian distributions. Let {f_i}_{i=1}^{4} be Gaussian distributions with means at (0, 0), (0, 1), (1, 0), and (1, 1), respectively, and covariance matrices all equal to C = 0.2² I_2. Denote by f(X) the distribution of the input data. Then, it follows that

f(X | red) = (1/2) f_1(X) + (1/4) f_2(X) + (1/4) f_3(X),   P{red} = 1/2,
f(X | blue) = (1/2) f_4(X) + (1/4) f_2(X) + (1/4) f_3(X),   P{blue} = 1/2.

The sensitive attribute is assumed to be the color (0 for red and 1 for blue), and the target task is reconstructing the input data. We sample 4000 points for the training set and 1000 points for the testing set independently. For visualization, the testing set is shown in Figure 4.4 (a). In this illustrative dataset, the correlation between the input data and the color is 0.61, and therefore there is no encoder that attains full target performance with no leakage of the sensitive attribute. Network architecture details are shown in Table 4.1.

Stability and Trade-off: Figure 4.4 (b) illustrates the five-run attainment surfaces and median hypervolumes for all methods. Since the dimensionality of both input and output is 2, the optimal embedding dimensionality is equal to 2, which is the value we set in this experiment. We note that SARL achieves a better hypervolume than the SGDA and ExtraSGDA ARLs, which is not surprising given the strong performance of SARL on small datasets. However, SARL is not applicable to large datasets. Among the other baselines, ExtraSGDA-ARL appears to be slightly better. In contrast, the solutions obtained by RBF-OptNet-ARL (Gaussian kernel) outperform all baselines and are highly stable across different runs, which can be observed from both the attainment surfaces and the hypervolumes. In addition to the Gaussian kernel, we also used the inverse multiquadric (IMQ) kernel [113]¹ for OptNet-ARL to examine the effect of the kernel function.
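The two kernels compared here can be written down directly. The helper below is a minimal numpy sketch; the function names and the default bandwidth/offset values are our illustrative assumptions, not values prescribed by the experiments.

```python
import numpy as np

def _sqdist(Z1, Z2):
    """Pairwise squared Euclidean distances between rows of Z1 and Z2."""
    d2 = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.maximum(d2, 0.0)  # guard against tiny negative round-off

def rbf_kernel(Z1, Z2, sigma=1.0):
    """Gaussian (RBF) kernel: k(z, z') = exp(-||z - z'||^2 / (2 sigma^2))."""
    return np.exp(-_sqdist(Z1, Z2) / (2.0 * sigma**2))

def imq_kernel(Z1, Z2, c=1.0):
    """Inverse multiquadric (IMQ) kernel: k(z, z') = 1 / sqrt(||z - z'||^2 + c^2)."""
    return 1.0 / np.sqrt(_sqdist(Z1, Z2) + c**2)
```

Both are positive-definite characteristic kernels; the RBF kernel saturates at 1 on the diagonal, while the IMQ kernel has heavier tails and its diagonal equals 1/c.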
As we observe from Figure 4.4 (b), IMQ-OptNet-ARL performs almost identically to OptNet-ARL with the Gaussian kernel in terms of both trade-off and stability.

Batch Size: In order to examine the effect of batch size on OptNet-ARL (with the Gaussian kernel), we train the encoder with batch sizes between 2 and 4000 (the entire training data). The results are illustrated in Figure 4.4 (c). We observe that the HV is quite insensitive to batch sizes greater than 25, which implies that the mini-batch gradient is an accurate enough estimator of the gradient over the entire data.

Embedding Dimensionality: We also study the effect of the embedding dimensionality (r) by examining different values of r in [1, 512] using RBF-OptNet-ARL. The results are illustrated in Figure 4.4 (d). It is evident that the optimal embedding dimensionality (r = 2) outperforms other values of r. Additionally, the HV for r = 1 suffers severely due to the information loss in the embedding, while for 2 < r ≤ 512 the trade-off performance is comparable to that of the optimal embedding dimensionality, i.e., r = 2.

¹ k(z, z′) = 1/√(‖z − z′‖² + c²)

Figure 4.4: Mixture of Gaussians: (a) Input data. The target task is to learn a representation that is informative enough to reconstruct the input data and, at the same time, hide the color information (red versus blue). (b) The trade-off between the MSEs of the adversary and the target task for different ARL methods. (c) The HVs of OptNet-ARL (Gaussian kernel) versus different batch size values in [2, 4000]. (d) The HV values of OptNet-ARL (Gaussian kernel) versus different values of r in [1, 512].

4.5 Summary

Adversarial representation learning is a minimax game formulation that affords explicit control over unwanted information in learned data representations. Optimization algorithms for ARL, such as SGDA and its variants, are sub-optimal, unstable, and unreliable in practice.
In this chapter, we introduced OptNet-ARL to address this challenge by employing differentiable closed-form solvers, such as kernelized ridge regressors, to model the ARL players that are downstream of the representation. OptNet-ARL reduces iterative SGDA to a simple optimization, leading to a fast, stable, and reliable algorithm that outperforms existing ARL approaches on both small- and large-scale datasets.

Chapter 5

Universal Invariant Representation Learning

5.1 Introduction

Ideally, the utility-invariance trade-off is defined as a bi-objective optimization problem:

inf_{f ∈ H_X, g_Y ∈ H_Y} E_{XY} [L_Y (g_Y (f(X)), Y)]   such that   Dep(f(X), S) ≤ ε,     (5.1)

where f is the encoder that extracts the representation Z = f(X) from X, g_Y predicts Ŷ from the representation Z, H_X and H_Y are the corresponding hypothesis classes, and L_Y is the loss function for predicting the target attribute Y. The function Dep(·, ·) ≥ 0 is a parametric or non-parametric measure of statistical dependence, i.e., Dep(Q, U) = 0 implies Q and U are independent, and Dep(Q, U) > 0 implies Q and U are dependent, with larger values indicating greater degrees of dependence. The scalar ε ≥ 0 is a user-defined parameter that controls the trade-off between the two objectives: ε → ∞ is the standard scenario with no invariance constraints with respect to (w.r.t.) S, while ε → 0 enforces Z ⊥⊥ S (i.e., total invariance). Involving all Borel functions in H_X and H_Y ensures that the best possible trade-off is included within the feasible solution space. For example, when ε → ∞ and L_Y is the MSE loss, the optimal Bayes estimator, g_Y(f(X)) = E[Y | X], is attainable.

Figure 5.1: Invariant representation learning seeks a representation Z = f(X) that contains as much information as necessary for the downstream target predictor g_Y while being independent of the semantic attribute S.
In this chapter, we consider the linear combination of utility and invariance in (5.1) and define the optimal utility-invariance trade-off (denoted by T_Opt) as a single-objective optimization problem:

Definition 5.1.

T_Opt := inf_{f ∈ H_X} { (1 − λ) inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y (f(X)), Y)] + λ Dep(f(X), S) },   0 ≤ λ < 1,     (5.2)

where λ controls the trade-off between utility and invariance (e.g., λ = 0 corresponds to ignoring the invariance and only optimizing the utility, while λ → 1 corresponds to Z ⊥⊥ S).

The motivation for deploying this single-objective IRepL is that any solution to this simplified problem is also a solution to the bi-objective problem in (5.1); moreover, even (5.2) is challenging to solve and has not been fully investigated by existing works. The utility-invariance trade-off is illustrated in Figure 5.2. In this chapter, we restrict H_X to be some RKHSs and Dep(Z, S) to be a simplified version of the Hilbert-Schmidt Independence Criterion (HSIC) [81]. Further, we replace the target loss function in (5.2) by Dep(Z, Y), as presented and justified in Sections 5.3.1 and 5.5.2.

Figure 5.2: The trade-off (denoted by T_Opt) between utility (target task performance) and invariance (measured by the dependence metric Dep(Z, S)) induced by a controlled representation learner in the hypothesis class of all Borel functions.

The basic idea of representation learning that discards unwanted semantic information has been explored in many contexts, such as invariant, fair, or privacy-preserving learning. In domain adaptation [11, 12, 39], the goal is to learn features that are independent of the data domain. In fair learning [41, 42, 43, 40, 36, 21, 23, 24, 22, 44, 2, 26, 45, 46, 47, 48, 49], the goal is to discard the demographic information that leads to unfair outcomes.
Similarly, there is growing interest in mitigating unintended leakage of private information from representations [51, 52, 1, 53, 54]. A vast majority of this body of work is empirical in nature. These methods implicitly look for single or multiple points on the trade-off between utility and semantic information and do not explicitly seek to characterize the whole trade-off front. Overall, these approaches are not concerned with, or aware of, the inherent utility-invariance trade-off. In contrast, at the cost of restricting encoders to lie in some RKHSs, we exactly characterize the trade-off and propose a practical learning algorithm that achieves it.

5.1.1 Adversarial Representation Learning

Most practical approaches for learning the fair, invariant, domain-adaptive, or privacy-preserving representations discussed above are based on adversarial representation learning (ARL). ARL is typically formulated as

inf_{f ∈ H_X} { (1 − λ) inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y (f(X)), Y)] − λ inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S (f(X)), S)] },     (5.3)

where L_S is the loss function of a hypothetical adversary g_S who intends to extract the semantic attribute S through the best estimator within the hypothesis class H_S, and 0 ≤ λ < 1 is the utility-invariance trade-off parameter. ARL is a special case of (5.2) where the negative loss of the adversary, − inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S (f(X)), S)], plays the role of Dep(f(X), S). However, this form of adversarial learning suffers from a drawback. The induced independence measure is not guaranteed to account for all modes of non-linear dependence between S and Z if the adversary loss function L_S is not bounded, unlike MSE or cross-entropy [56, 114]. In the case of MSE loss, even if the loss is maximized at a bounded value, where the corresponding representation Z = f(X) is also bounded, it is still not guaranteed that Z ⊥⊥ S is attainable (see Section 5.2 for more details).
This implies that designing an adversary loss in ARL that accounts for all modes of dependence is challenging, and it can be infeasible for some loss functions.

5.1.2 Trade-Offs in Invariant Representation Learning

Prior work has established the existence of trade-offs in IRepL, both empirically and theoretically. In the following, we categorize prior work based on the properties of interest.

Restricted Class of Attributes: A majority of existing work considers IRepL trade-offs under restricted settings, i.e., binary and/or categorical attributes Y and S. For instance, [60] uses information-theoretic tools and characterizes the utility-fairness trade-off in terms of lower bounds when both Y and S are binary labels. Later, [55] provided both upper and lower bounds for binary labels. By leveraging the Chernoff bound, [61] proposed a construction method to generate an ideal representation, beyond the input data, that achieves perfect fairness while maintaining the best performance on the target task. In the case of categorical features, a lower bound on the utility-fairness trade-off has been provided by [6] for the total-invariance scenario (i.e., Z ⊥⊥ S). In contrast to this body of work, our trade-off analysis applies to multidimensional continuous/discrete attributes. To the best of our knowledge, the only prior works with a general setting are [49] and [9]. However, in [9], both S and Y are restricted to be continuous/discrete or binary at the same time (e.g., it is not possible to have Y binary while S is continuous).

Characterizing Exact versus Bounds on Trade-Off: To the best of our knowledge, all existing approaches except [49], which obtains the trade-off for linear dependence only, characterize the trade-off in terms of upper and/or lower bounds. In contrast, we exactly characterize a near-optimal trade-off with closed-form expressions for encoders belonging to some RKHSs.
Optimal Encoder and Representation: Another property of practical interest is the optimal encoder that achieves the desired point on the utility-invariance trade-off, together with the corresponding representation(s). Existing works that only study bounds on the trade-off do not obtain the encoder that achieves those bounds. [49] do develop a learning algorithm that obtains a globally optimal encoder, but only under a linear dependence measure between Z and S. HSIC, a universal measure of dependence, has been adopted by prior work (e.g., [62]) to quantify all types of dependencies between Z and S. However, these methods adopt stochastic gradient descent for optimizing the underlying non-convex optimization problem. As such, they fail to guarantee that the representation learning problem converges to a global optimum. In contrast, we obtain a closed-form solution for the optimal encoder and its corresponding representation while detecting all modes of dependence between Z and S.

Summary of Contributions: i) We design a dependence measure that accounts for all modes of dependence between Z and S (under a mild assumption) while allowing for analytical tractability. ii) We employ functions in RKHSs and obtain closed-form solutions for the IRepL optimization problem. Consequently, we exactly characterize a near-optimal approximation of T_Opt via encoders restricted to RKHSs. iii) We obtain a closed-form estimator for the encoder that achieves the near-optimal trade-off, and we establish its numerical convergence. iv) Using random Fourier features (RFF) [115], we provide a scalable version (in terms of both memory and computation) of our IRepL algorithm. v) We numerically quantify our T_Opt (denoted by K-T_Opt) on an illustrative problem as well as on the large-scale real-world datasets Folktables [116] and CelebA [111], where we compare K-T_Opt to the trade-offs obtained by existing works.

5.2 Deficiency of Mean-Squared Error as a Measure of Dependence

Theorem 5.2.
Let H_S contain all Borel functions, let S be a d_S-dimensional RV, and let L_S(·, ·) be the MSE loss. Then,

Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }  ⇔  E[S | Z] = E[S].

Proof. Let S_i, (g_S(Z))_i, and (E[S | Z])_i denote the i-th entries of S, g_S(Z), and E[S | Z], respectively. Then, it follows that

inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)]
  = inf_{g_S ∈ H_S} Σ_{i=1}^{d_S} E_{X,S} [ ((g_S(Z))_i − S_i)² ]
  = Σ_{i=1}^{d_S} E_{X,S} [ ((E[S | Z])_i − S_i)² ]
  ≤ Σ_{i=1}^{d_S} E_S [ ((E[S])_i − S_i)² ] = Σ_{i=1}^{d_S} Var[S_i],

where the second step is due to the optimality of the conditional mean (i.e., the Bayes estimator) for MSE [117], and the last step holds because independence between Z and S attains this upper bound on the MSE. Therefore, if Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }, then E[S | Z] = E[S]. On the other hand, if E[S | Z] = E[S], then it follows immediately that Z ∈ arg sup { inf_{g_S ∈ H_S} E_{X,S} [L_S (g_S(Z), S)] }.

This theorem implies that an optimal adversary does not necessarily lead to a representation Z that is statistically independent of S; instead, it leads to S being mean independent of the representation Z.

5.3 Problem Setting

5.3.1 Problem Setup

The representation RV Z can be expressed as

Z = f(X) := [Z_1, · · · , Z_r]^T ∈ R^r,   Z_j = f_j(X),   f_j ∈ H_X   ∀ j = 1, . . . , r,
  = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T,     (5.4)

where r is the dimensionality of the representation and Θ ∈ R^{r×n}. As we will discuss in Corollary 5.1, unlike common practice where r is chosen on an ad-hoc basis, r is itself an object of interest for optimization. We consider a general scenario where both Y and S can be continuous/discrete or categorical, or one of Y or S is continuous/discrete while the other is categorical. To accomplish this, we replace the target loss, inf_{g_Y ∈ H_Y} E_{X,Y} [L_Y (g_Y(Z), Y)] in (5.2), by the negative of a non-parametric measure of dependence, i.e., −Dep(Z, Y).
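A minimal sketch of the representer-theorem parameterization in (5.4): each component f_j of the encoder lies in the span of kernel functions centered at the n training points, so the whole encoder is determined by the coefficient matrix Θ. This is a toy illustration under a Gaussian kernel; the class and function names are hypothetical, and the dissertation's actual implementation details may differ.

```python
import numpy as np

def rbf(x1, x2, sigma=1.0):
    """Gaussian kernel evaluated on a single pair of points."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

class KernelEncoder:
    """Encoder of (5.4): Z = f(X) = Theta [k_X(x_1, X), ..., k_X(x_n, X)]^T."""
    def __init__(self, X_train, Theta, sigma=1.0):
        self.X_train = X_train  # n x d matrix of training inputs
        self.Theta = Theta      # r x n coefficient matrix
        self.sigma = sigma

    def __call__(self, x):
        # Kernel vector of the query point against all training points.
        k_vec = np.array([rbf(xi, x, self.sigma) for xi in self.X_train])
        return self.Theta @ k_vec  # r-dimensional representation Z
```

Optimizing the encoder then amounts to optimizing the finite matrix Θ, which is what makes the closed-form solutions later in this chapter possible.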
The main reason for this replacement is that maximizing the statistical dependence between the representation Z and the target attribute Y allows us to flexibly learn a representation that is applicable to different downstream target tasks, including regression, classification, clustering, etc. [118]. In particular, Theorem 5.8 in Section 5.5.2 indicates that, with an appropriate choice of the RKHS involved in Dep(Z, Y), we can learn a representation that lends itself to an estimator that performs as optimally as the Bayes estimator, i.e., E_X[Y | X]. Furthermore, in an unsupervised setting, where there is no target attribute Y, the target loss can be replaced with Dep(Z, X), which implicitly forces the representation Z to be as dependent as possible on the input data X. This scenario is of practical interest when a data producer aims to provide an invariant representation for an unknown downstream target task.

Figure 5.3: Our IRepL model consists of three components: i) an r-dimensional encoder f belonging to the universal RKHS H_X; ii) a measure of dependence that accounts for all dependence modes between the data representation Z and the semantic attribute S, induced by the covariance between Z = f(X) and β_S(S), where β_S belongs to a universal RKHS H_S; iii) a measure of dependence between Z and the target attribute Y, defined similarly to that for S.

5.4 Choice of Dependence Measure

We only discuss Dep(Z, S), since Dep(Z, Y) follows similarly. Accounting for all possible non-linear relations between RVs is a key desideratum of dependence measures. A well-known example of such measures is MI (e.g., MINE [119]). However, calculating MI for a multidimensional continuous representation is analytically challenging and computationally intractable.
Kernel-based measures are an alternative solution, with the attractive properties of being computationally feasible/efficient and analytically tractable [83]. In principle, Z ⊥⊥ S iff Cov(α(Z), β_S(S)) = 0 for all Borel functions α : R^r → R and β_S : R^{d_S} → R belonging to the universal RKHSs H_Z and H_S, respectively. Alternatively, Z ⊥⊥ S iff HSIC(Z, S) = 0, where HSIC [81] is defined as

HSIC(Z, S) := Σ_{α ∈ U_Z} Σ_{β_S ∈ U_S} Cov² (α(Z), β_S(S)),     (5.5)

where U_Z and U_S are countable orthonormal basis sets for the separable universal RKHSs H_Z and H_S, respectively. However, since Z = f(X), calculating Cov(α(Z), β_S(S)) necessitates the application of a cascade of kernels, which limits the analytical tractability of our solution. Therefore, we adopt a simplified version of HSIC that considers a transformation on S only but affords analytical tractability for solving the IRepL optimization problem. We define this measure as

Dep(Z, S) := Σ_{j=1}^{r} Σ_{β_S ∈ U_S} Cov² (Z_j, β_S(S)),     (5.6)

where Z_j = f_j(X) for the f_j's defined in (5.4). We note that Dep(·, ·), unlike HSIC and other kernelization-based dependence measures, is not symmetric. However, symmetry is not necessary for measuring statistical dependence. The measure Dep(Z, S) in (5.6) captures all modes of non-linear dependence under the assumption that the distribution of a low-dimensional projection of high-dimensional data is approximately normal [120], [121]. To see why this reasoning is relevant, we note from (5.4) that Z can be expressed as Z = ΘV, where V ∈ R^n and Θ ∈ R^{r×n}. This indicates that, for large n and small r (which is the case for most real-world datasets), Z is indeed a low-dimensional projection of high-dimensional data. In other words, (Z, β_S(S)) is approximately a jointly Gaussian RV. In our numerical experiments in Section 5.6, we empirically observe that Dep(Z, S) enjoys a monotonic relation with the underlying invariance measure and captures all modes of dependency in practice, especially as Z ⊥⊥ S.
Nevertheless, if the normality assumption on the distribution of (Z, β_S(S)) fails, Dep(Z, S) reduces to measuring the linear dependence between Z and β_S(S) for all Borel functions β_S. This corresponds to measuring the mean independence of Z from S, i.e., how much information a predictor (linear or non-linear) can infer (in the sense of MSE) about Z from S. See Section 5.2 for more technical details on mean independence.

Lemma 5.3. Let K_X, K_S ∈ R^{n×n} be the Gram matrices corresponding to H_X and H_S, respectively, i.e., (K_X)_{ij} = k_X(x_i, x_j) and (K_S)_{ij} = k_S(s_i, s_j), where the covariance is empirically estimated as

Cov(f_j(X), β_S(S)) ≈ (1/n) Σ_{i=1}^{n} f_j(x_i) β_S(s_i) − (1/n²) Σ_{i=1}^{n} Σ_{k=1}^{n} f_j(x_i) β_S(s_k).

It follows that the corresponding empirical estimator of Dep(Z, S) is

Dep_emp(Z, S) = (1/n²) ‖Θ K_X H L_S‖²_F,     (5.7)

where H = I_n − (1/n) 1_n 1_n^T is the centering matrix and L_S is a full column-rank matrix such that L_S L_S^T = K_S (Cholesky factorization). Furthermore, the empirical estimator in (5.7) has a bias of O(n⁻¹) and a convergence rate of O(n^{−1/2}).

Proof. See Appendix B.2.

Notice that the dependence measure between Z and Y can be defined similarly.

5.5 Exact Kernelized Trade-Off

Consider the optimization problem corresponding to T_Opt in (5.2). Recall that Z = f(X) is an r-dimensional RV, where the embedding dimensionality r is also a variable to be optimized. A common desideratum of learned representations is compactness [122], i.e., avoiding representations with redundant information in which different dimensions are highly correlated with each other. Therefore, going beyond the assumption that each component of f (i.e., the f_j's) belongs to the universal RKHS H_X, we impose additional constraints on the representation. Specifically, we constrain the search space of the encoder f(·) to learn a disentangled representation [122] as follows:

A_r := { (f_1, · · · , f_r) | f_i, f_j ∈ H_X, Cov(f_i(X), f_j(X)) + γ ⟨f_i, f_j⟩_{H_X} = δ_{i,j} }.
(5.8)

In the above set, the Cov(f_i(X), f_j(X)) part enforces the covariance matrix of Z = f(X) to be an identity matrix. This kind of disentanglement is used in principal component analysis (PCA); it encourages the variance of each entry of Z to be one and different entries of Z to be uncorrelated with each other. The regularization part, γ ⟨f_i, f_j⟩_{H_X}, encourages the encoder components to be as orthogonal as possible to each other and to be of unit norm, which aids numerical stability during empirical estimation [85]. As the following theorem states formally, such disentanglement is an invertible transformation, and therefore it does not nullify any information.

Theorem 5.4. Let Z = f(X) be an arbitrary representation of the input data, where f ∈ H_X. Then, there exists an invertible Borel function h such that h ◦ f belongs to A_r.

Proof. See Appendix B.3.

This theorem implies that the disentanglement preserves the performance of the downstream task, since any target network can revert the disentanglement h and access the original representation Z. In addition, any deterministic measurable transformation of Z will not add any information about S that does not already exist in Z. We define our K-T_Opt as

sup_{f ∈ A_r} { J(f, λ) := (1 − λ) Dep(f(X), Y) − λ Dep(f(X), S) },   0 ≤ λ < 1,     (5.9)

where λ is the utility-invariance trade-off parameter. Fortunately, the above optimization problem lends itself to a closed-form solution, as follows.

Theorem 5.5. Let the operator Σ_SX be induced by the bi-linear functional Cov(α(X), β_S(S)) = ⟨β_S, Σ_SX α⟩_{H_S}, and define Σ_YX and Σ_XX similarly.
Then, a global optimum for the optimization problem in (5.9) is given by the eigenfunctions corresponding to the r largest eigenvalues of the following generalized eigenvalue problem:

((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f = τ (Σ_XX + γ I_X) f,   (5.10)

where γ is the disentanglement regularization parameter defined in (5.8), I_X is the identity operator in H_X, and Σ* is the adjoint of Σ.

Proof. Consider Dep(Z, S) in (5.6):

Dep(Z, S) = Σ_{β_S ∈ U_S} Σ_{j=1}^r Cov²(f_j(X), β_S(S))
          = Σ_{j=1}^r Σ_{β_S ∈ U_S} ⟨β_S, Σ_SX f_j⟩²_{H_S}
          = Σ_{j=1}^r ‖Σ_SX f_j‖²_{H_S},

where the last step is due to Parseval's identity for the orthonormal basis set. Similarly, we have Dep(Z, Y) = Σ_{j=1}^r ‖Σ_YX f_j‖²_{H_Y}. Recall that Z = f(X) = [f_1(X), · · · , f_r(X)]; it then follows that

J(f(X)) = (1 − λ) Σ_{j=1}^r ‖Σ_YX f_j‖²_{H_Y} − λ Σ_{j=1}^r ‖Σ_SX f_j‖²_{H_S}
        = (1 − λ) Σ_{j=1}^r ⟨Σ_YX f_j, Σ_YX f_j⟩_{H_Y} − λ Σ_{j=1}^r ⟨Σ_SX f_j, Σ_SX f_j⟩_{H_S}
        = Σ_{j=1}^r ⟨f_j, ((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f_j⟩_{H_X},

where Σ* is the adjoint operator of Σ. Further, note that Cov(f_i(X), f_j(X)) = ⟨f_i, Σ_XX f_j⟩_{H_X}. As a result, the optimization problem in (5.9) can be restated as

sup Σ_{j=1}^r ⟨f_j, ((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f_j⟩_{H_X}   s.t.   ⟨f_i, (Σ_XX + γ I_X) f_k⟩_{H_X} = δ_{i,k},   1 ≤ i, k ≤ r,

where I_X denotes the identity operator from H_X to H_X. This optimization problem is a generalized Rayleigh quotient [123], and a possible solution to it is given by the eigenfunctions corresponding to the r largest eigenvalues of the generalized problem

((1 − λ) Σ*_YX Σ_YX − λ Σ*_SX Σ_SX) f = τ (Σ_XX + γ I_X) f.

Remark. If the trade-off parameter λ = 0 (i.e., no semantic independence constraint is imposed) and γ → 0, the solution in Theorem 5.5 is equivalent to supervised kernel-PCA. On the other hand, if λ → 1 (i.e., utility is ignored and only semantic independence is considered), the solution in Theorem 5.5 consists of the eigenfunctions corresponding to the r smallest eigenvalues of Σ*_SX Σ_SX, which are the directions that are least explanatory of the semantic attribute S.
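Before turning to the empirical problem, the estimator (5.7) can be sketched numerically. This is a minimal illustration, not the author's implementation: all function names are ours, the RBF bandwidth is fixed arbitrarily, and the factor L_S is obtained via an eigendecomposition (a valid full column-rank factor when K_S is rank-deficient) rather than a strict Cholesky factorization.

```python
import numpy as np

def rbf_gram(A, sigma):
    # Gram matrix K[i, j] = exp(-||a_i - a_j||^2 / (2 sigma^2))
    sq = np.sum(A ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def dep_emp(Theta, K_X, K_S):
    # Empirical dependence (5.7): (1/n^2) * ||Theta K_X H L_S||_F^2,
    # where H is the centering matrix and L_S L_S^T = K_S.
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    w, V = np.linalg.eigh(K_S)          # Cholesky-like factor of PSD K_S
    keep = w > 1e-10
    L_S = V[:, keep] * np.sqrt(w[keep])
    M = Theta @ K_X @ H @ L_S
    return np.linalg.norm(M, 'fro') ** 2 / n ** 2
```

As a sanity check, a representation built from S itself yields a much larger value of Dep_emp than an independent one.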
Now, consider the empirical counterpart of the optimization problem (5.9),

sup_{f ∈ A_r} { J_emp(f, λ) := (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) },   0 ≤ λ < 1,   (5.11)

where Dep_emp(f(X), S) is given in (5.7) and Dep_emp(f(X), Y) is defined similarly.

Theorem 5.6. Let the Cholesky factorization of K_X be K_X = L_X L_X^T, where L_X ∈ ℝ^{n×d} (d ≤ n) is a full column-rank matrix. Let r ≤ d; then a solution to (5.11) is

f_Opt(X) = Θ_Opt [k_X(x_1, X), · · · , k_X(x_n, X)]^T,

where Θ_Opt = U^T L_X^† and the columns of U are the eigenvectors corresponding to the r largest eigenvalues of the following generalized eigenvalue problem:

L_X^T ((1 − λ) H K_Y H − λ H K_S H) L_X u = τ ((1/n) L_X^T H L_X + γ I) u.   (5.12)

Further, the objective value of (5.11) is equal to Σ_{j=1}^r β_j, where {β_1, · · · , β_r} are the r largest eigenvalues of (5.12).

Proof. Consider the Cholesky factorization K_X = L_X L_X^T, where L_X is a full column-rank matrix. Using the representer theorem, the disentanglement property in (5.8) can be expressed as

Cov(f_i(X), f_j(X)) + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) Σ_{k=1}^n f_i(x_k) f_j(x_k) − (1/n²) Σ_{k=1}^n f_i(x_k) Σ_{m=1}^n f_j(x_m) + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) Σ_{k=1}^n ( Σ_{t=1}^n K_X(x_k, x_t) θ_it ) ( Σ_{m=1}^n K_X(x_k, x_m) θ_jm ) − (1/n²) θ_i^T K_X 1_n 1_n^T K_X θ_j + γ ⟨f_i, f_j⟩_{H_X}
= (1/n) (K_X θ_i)^T K_X θ_j − (1/n²) θ_i^T K_X 1_n 1_n^T K_X θ_j + γ ⟨ Σ_{k=1}^n θ_ik k_X(·, x_k), Σ_{t=1}^n θ_jt k_X(·, x_t) ⟩_{H_X}
= (1/n) θ_i^T K_X H K_X θ_j + γ θ_i^T K_X θ_j
= (1/n) θ_i^T L_X (L_X^T H L_X + nγ I) L_X^T θ_j
= δ_{i,j}.

As a result, f ∈ A_r is equivalent to

Θ L_X C L_X^T Θ^T = I_r,   with   C := (1/n) L_X^T H L_X + γ I,

where Θ := [θ_1, · · · , θ_r]^T ∈ ℝ^{r×n}.
Let V = L_X^T Θ^T and consider the optimization problem in (5.11):

sup_{f ∈ A_r} { (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) }
= sup_{f ∈ A_r} (1/n²) { (1 − λ) ‖Θ K_X H L_Y‖²_F − λ ‖Θ K_X H L_S‖²_F }
= sup_{f ∈ A_r} (1/n²) { (1 − λ) Tr[Θ K_X H K_Y H K_X Θ^T] − λ Tr[Θ K_X H K_S H K_X Θ^T] }
= max_{V^T C V = I_r} (1/n²) Tr[Θ L_X B L_X^T Θ^T]
= max_{V^T C V = I_r} (1/n²) Tr[V^T B V],   (5.13)

where the second step is due to (5.7) and

B := L_X^T ((1 − λ) H K_Y H − λ H K_S H) L_X.

It is shown in [90] that an optimizer of (5.13) (the optimal V is not unique) is any matrix U whose columns are the eigenvectors corresponding to the r largest eigenvalues of the generalized problem

B u = τ C u,   (5.14)

and the maximum value is the sum of the r largest eigenvalues. Once U is determined, any Θ satisfying L_X^T Θ^T = U is an optimal Θ (denoted by Θ_Opt). Note that Θ_Opt is not unique and has the general form

Θ^T = (L_X^T)^† U + Λ_0,   R(Λ_0) ⊆ N(L_X^T).

However, setting Λ_0 to zero leads to the minimum-norm Θ. Therefore, we set Θ_Opt = U^T L_X^†.

Corollary 5.1 (Embedding Dimensionality). A useful corollary of Theorem 5.6 is a characterization of the optimal embedding dimensionality as a function of the trade-off parameter λ:

r_Opt(λ) := arg sup_{0 ≤ r ≤ d} sup_{f ∈ A_r} { J_emp(f, λ) } = number of non-negative eigenvalues of (5.12).

Proof. From the proof of Theorem 5.6, we know that

sup_{f ∈ A_r} { (1 − λ) Dep_emp(f(X), Y) − λ Dep_emp(f(X), S) } = Σ_{j=1}^r τ_j,

where {τ_1, · · · , τ_d} are the eigenvalues of the generalized problem in (5.12) in decreasing order. It follows immediately that

arg sup_r Σ_{j=1}^r τ_j = number of non-negative elements of {τ_1, · · · , τ_d}.

To examine these results, consider two extreme cases: i) If there is no semantic independence constraint (i.e., λ = 0), all eigenvalues of (5.12) are non-negative, since H K_Y H is a non-negative definite matrix and (1/n) L_X^T H L_X + γ I is a positive definite matrix.
This indicates that r_Opt is equal to the maximum possible value (namely d), and therefore Z is not required to nullify any information in X. ii) If we are only concerned with semantic independence and ignore the target-task utility (i.e., λ → 1), all eigenvalues of (5.12) are non-positive, and therefore r_Opt is the number of zero eigenvalues of (5.12). This indicates that Dep_emp(Z, S) in (5.7) is equal to zero, since Θ_Opt K_X is zero for the zero eigenvalues of (5.12) when λ → 1. In this case, adding more dimensions to Z would necessarily increase Dep_emp(Z, S).

The following theorem characterizes the convergence behavior of the empirical K−T_Opt to its population counterpart.

Theorem 5.7. Assume that k_S and k_Y are bounded by one and f_j²(x_i) ≤ M for any j = 1, . . . , r and i = 1, . . . , n for which f = (f_1, . . . , f_r) ∈ A_r. Then, for any n > 1 and 0 < δ < 1, with probability at least 1 − δ, we have

| sup_{f ∈ A_r} J(f, λ) − sup_{f ∈ A_r} J_emp(f, λ) | ≤ r M √( log(6/δ) / (0.222 n) ) + O(1/n).

Proof. See Appendix B.4.

Note that, for any x in the training set, f_j(x) can be calculated as f_j(x) = Σ_{i=1}^n θ_ji k_X(x_i, x). We can assume that k_X(·, ·) is bounded. For example, in RBF Gaussian and Laplacian RKHSs (which are universal), k_X(·, ·) ≤ 1. This implies that f_j²(x) ≤ n‖θ_j‖², where θ_j is the j-th row of Θ in equation (5.4). One can always normalize f_j(x) by dividing it by the maximum of √n ‖θ_j‖ over j, or by the maximum of |f_j(x_i)| over i and j. Notice that this normalization is only a scalar multiplication and has no effect on the invariance of Z = f(X) to S or on the utility of any downstream target-task predictor g_Y(Z).

5.5.1 Numerical Complexity

Computational Complexity: If L_X in (5.12) is provided with the training dataset, then the computational complexity of obtaining the optimal encoder is O(l³), where l ≤ n is the numerical rank of the Gram matrix K_X.
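Assuming a factor L_X is available, the eigen-solver step in Theorem 5.6 together with the dimensionality rule of Corollary 5.1 can be sketched as follows. This is a hedged illustration, not the author's code: the function name is ours, and the generalized problem (5.12) is reduced to an ordinary symmetric eigenproblem by whitening with the Cholesky factor of C, which is positive definite for γ > 0.

```python
import numpy as np

def solve_ktopt(L_X, K_Y, K_S, lam, gamma):
    """Sketch of solving (5.12): B u = tau * C u, and of Corollary 5.1."""
    n, d = L_X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    B = L_X.T @ ((1 - lam) * H @ K_Y @ H - lam * H @ K_S @ H) @ L_X
    C = L_X.T @ H @ L_X / n + gamma * np.eye(d)
    # Whiten with C = E E^T, then solve E^{-1} B E^{-T} v = tau v.
    E = np.linalg.cholesky(C)
    Einv = np.linalg.inv(E)
    tau, V = np.linalg.eigh(Einv @ B @ Einv.T)
    tau, V = tau[::-1], V[:, ::-1]            # decreasing order
    U = Einv.T @ V                            # generalized eigenvectors, U^T C U = I
    r_opt = int(np.sum(tau >= 0))             # Corollary 5.1: count non-negative
    Theta_opt = U[:, :r_opt].T @ np.linalg.pinv(L_X)   # Theta_Opt = U^T L_X^dagger
    return Theta_opt, tau, r_opt
```

One recovers the disentanglement constraint V^T C V = I_r from the returned encoder weights, since L_X^T Θ_Opt^T = U when L_X has full column rank.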
However, the dominating part of the computational complexity is due to the Cholesky factorization K_X = L_X L_X^T, which is O(n³). Using random Fourier features (RFF) [115], k_X(x, x′) can be approximated by r_X(x)^T r_X(x′), where r_X(x) ∈ ℝ^d. In this situation, the Cholesky factor can be written down directly as

L_X = [r_X(x_1)^T ; · · · ; r_X(x_n)^T] ∈ ℝ^{n×d}.   (5.15)

As a result, the computational complexity of obtaining the optimal encoder becomes O(d³), where the RFF dimension d can be significantly smaller than the sample size n with a negligible sacrifice in the approximation k_X(x, x′) ≈ r_X(x)^T r_X(x′).

Memory Complexity: The memory complexity of (5.12), if computed naively, is O(n²), since K_Y and K_S are n × n matrices. However, using RFF together with the Cholesky factorizations K_Y = L_Y L_Y^T and K_S = L_S L_S^T, the left-hand side of (5.12) can be re-arranged as

(1 − λ) (L_X^T L̃_Y)(L̃_Y^T L_X) − λ (L_X^T L̃_S)(L̃_S^T L_X),   (5.16)

where L̃_Y := H L_Y = L_Y − (1/n) 1_n (1_n^T L_Y); therefore, the required memory complexity is O(nd). Note that L̃_S and H L_X can be calculated similarly.

5.5.2 Target Task Performance in K−T_Opt

Assume that the desired target loss function is MSE. In the following theorem, we show that maximizing Dep(f(X), Y) over f ∈ A_r can learn a representation Z that is informative enough for a target predictor on Z to achieve the optimal estimate, i.e., the Bayes estimator E[Y | X].

Theorem 5.8. Let f* be the encoder obtained by maximizing Dep(f(X), Y), where γ → 0 and H_Y is a linear RKHS. Then, there exist W ∈ ℝ^{d_Y × r} and b ∈ ℝ^{d_Y} such that W f*(X) + b is the Bayes estimator, i.e.,

E_{X,Y}[ ‖W f*(X) + b − Y‖² ] = inf_{h Borel} E_{X,Y}[ ‖h(X) − Y‖² ] = E_{X,Y}[ ‖E[Y | X] − Y‖² ].

Proof. We prove this theorem only for the empirical version, owing to its convergence to the population counterpart. The optimal Bayes estimator can be obtained as the composition of the kernelized encoder Z = f(X) and a linear regressor on top of it.
More specifically, Ŷ = W f(X) + b can approach E[Y | X] if we optimize f, W, and b jointly. This is because f ∈ H_X can approximate any Borel function (due to the universality of H_X) and, since r ≥ d_Y, W can be surjective. Let Z := [z_1, · · · , z_n] ∈ ℝ^{r×n} and Y := [y_1, · · · , y_n] ∈ ℝ^{d_Y × n}. Further, let Z̃ and Ỹ be the centered (i.e., mean-subtracted) versions of Z and Y, respectively. We first optimize b for any given f, r, and W:

b_Opt := arg min_b (1/n) Σ_{i=1}^n ‖W z_i + b − y_i‖² = (1/n) Σ_{i=1}^n y_i − W (1/n) Σ_{i=1}^n z_i.

Then, optimizing over W leads to

min_W (1/n) ‖W Z̃ − Ỹ‖²_F = min_W (1/n) ‖Z̃^T W^T − Ỹ^T‖²_F
= min_W (1/n) ‖Z̃^T W^T − P_Z̃ Ỹ^T‖²_F + (1/n) ‖P_Z̃⊥ Ỹ^T‖²_F
= (1/n) ‖P_Z̃⊥ Ỹ^T‖²_F = (1/n) ‖Ỹ‖²_F − (1/n) ‖P_Z̃ Ỹ^T‖²_F,

where P_Z̃ denotes the orthogonal projector onto the column space of Z̃^T, and a possible minimizer is W_Opt^T = (Z̃^T)^† Ỹ^T, or equivalently W_Opt = Ỹ Z̃^†. Since the MSE loss is a function only of the range (column space) of Z̃^T, we can restrict attention to Z̃^T with orthonormal columns, or equivalently (1/n) Z̃ Z̃^T = I_r. In this setting, it holds that P_Z̃ = (1/n) Z̃^T Z̃. Now, consider optimizing f(X) = Θ [k_X(x_1, X), · · · , k_X(x_n, X)]^T. We have Z̃ = Θ K_X H, where H is the centering matrix. Let V = L_X^T Θ^T and C = (1/n) L_X^T H L_X; it then follows that

min_{Θ K_X H K_X Θ^T = n I_r} (1/n) ‖Ỹ‖²_F − (1/n) ‖P_Z̃ Ỹ^T‖²_F
= (1/n) ‖Ỹ‖²_F − max_{Θ K_X H K_X Θ^T = n I_r} (1/n) ‖P_Z̃ Ỹ^T‖²_F
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[Ỹ H K_X Θ^T Θ K_X H Ỹ^T]
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[Θ K_X H Ỹ^T Ỹ H K_X Θ^T]
= (1/n) ‖Ỹ‖²_F − max_{V^T C V = I_r} (1/n²) Tr[V^T L_X^T Ỹ^T Ỹ L_X V]
= (1/n) ‖Ỹ‖²_F − (1/n²) Σ_{j=1}^r τ_j,

where τ_1, · · · , τ_r are the r largest eigenvalues of the generalized problem

B_0 u = τ C u,   B_0 := L_X^T Ỹ^T Ỹ L_X.

This resembles the eigenvalue problem in (5.14) with λ = 0, H_Y a linear RKHS, and γ → 0.
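The closed-form bias and weights from the proof above can be sketched directly. This is an illustrative least-squares implementation under our own naming; it computes b as the mean offset and W via the pseudoinverse of the centered representation, as in the derivation.

```python
import numpy as np

def linear_head(Z, Y):
    """Closed-form linear target head: W = Y_tilde @ pinv(Z_tilde),
    b = mean(Y) - W @ mean(Z). Z: r x n, Y: d_Y x n (one column per sample)."""
    z_bar = Z.mean(axis=1, keepdims=True)
    y_bar = Y.mean(axis=1, keepdims=True)
    Zt, Yt = Z - z_bar, Y - y_bar        # centered versions
    W = Yt @ np.linalg.pinv(Zt)          # W_Opt = Y_tilde Z_tilde^dagger
    b = y_bar - W @ z_bar                # b_Opt as derived above
    return W, b
```

On data that is exactly linear in Z, this head recovers the generating map, which is the sense in which the representation only needs to be "simple enough for a linear regressor".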
This theorem implies that maximizing Dep(f(X), Y) not only preserves in Z all the information necessary to optimally predict Y, but also yields a representation simple enough for a linear regressor to achieve the optimal performance.

5.6 Experiments

In this section, we numerically evaluate our K−T_Opt through the closed-form solution for the encoder obtained in Section 5.5 on an illustrative toy example and two real-world datasets, Folktables and CelebA.

5.6.1 Baselines

We consider two types of baselines: (1) ARL (the main framework for IRepL) with MSE or cross-entropy as the adversarial loss. Such methods are expected to fail to learn a fully invariant representation [56, 114]. These include [24, 22, 2] and SARL [49]. (2) An HSIC-based adversarial loss, which accounts for all modes of dependence and as such is theoretically expected to learn a fully invariant representation [62]. Among these baselines, all except SARL are optimized via iterative minimax optimization, which is often unstable and not guaranteed to converge. On the other hand, SARL obtains a closed-form solution for the global optima of the minimax optimization under a linear dependence measure between Z and S, which may fail to capture all modes of dependence between Z and S.

5.6.2 Datasets

Gaussian Toy Example: We design an illustrative toy example where X and S are mean independent in some dimensions but not fully independent in those dimensions. Specifically, X and S are 4-dimensional continuous RVs generated as follows:

U = [U_1, U_2, U_3, U_4] ∼ N(0_4, I_4),   N ∼ N(0_4, I_4),   U ⊥⊥ N,
X = cos(π/6 · U) + 0.005 N,   S = [ sin(π/6 · [U_1, U_2]), cos(π/6 · [U_3, U_4]) ],   (5.17)

where sin(·) and cos(·) are applied point-wise. To generate the target attribute, we define four binary RVs as

Y_i = 1_{|U_i| > T}(U_i),   i = 1, 2, 3, 4,

where 1_B(·) is the indicator function, and we set T = 0.6744, so that P[Y_i = 0] = P[Y_i = 1] = 0.5 for i = 1, 2, 3, 4.
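The generative process (5.17) can be sketched as follows; the function name is ours, and the threshold T = 0.6744 is the value from the text (the 0.75 quantile of the standard normal, so that each bit is balanced).

```python
import numpy as np

def sample_toy(n, seed=0):
    """Sample (X, S, Y_bits) from the Gaussian toy example (5.17)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, 4))
    N = rng.normal(size=(n, 4))
    X = np.cos(np.pi / 6 * U) + 0.005 * N
    S = np.concatenate([np.sin(np.pi / 6 * U[:, :2]),
                        np.cos(np.pi / 6 * U[:, 2:])], axis=1)
    Y_bits = (np.abs(U) > 0.6744).astype(int)   # binary targets Y_1..Y_4
    return X, S, Y_bits
```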
Finally, we define Y as a 16-class categorical RV obtained by concatenating the Y_i's. Since S is dependent on X through all the dimensions of X, a wholly invariant Z (i.e., Z ⊥⊥ S) should not contain any information about X. However, since [S_1, S_2] is only mean independent of [X_1, X_2] (i.e., E[S_1, S_2 | X_1, X_2] = E[S_1, S_2]), the ARL baselines with MSE as the adversarial loss, i.e., [24, 22, 2] and SARL, cannot capture the dependency of Z on [S_1, S_2] and produce a representation that remains dependent on [S_1, S_2] (see Section 5.2 for theoretical details). We sample 18,000 instances from p_{X,Y,S} independently and split these samples equally into training, validation, and testing partitions.

Folktables: We consider a fair representation learning task on the Folktables dataset [116] (a derivative of the US census data). In particular, we use the 2018-WA (Washington) and 2018-NY (New York) census data, where the target attribute Y is the employment status (binary for WA and 4 categories for NY) and the semantic attribute S is age (a discrete value between 0 and 95 years). We seek to learn a representation that predicts employment status while being fair in terms of demographic parity (DP) w.r.t. age. DP requires that the prediction Ŷ be independent of S, which can be achieved by enforcing Z ⊥⊥ S. The WA and NY datasets contain 76,225 and 196,967 samples, respectively, each described by 16 different features. We randomly split the data into training (70%), validation (15%), and testing (15%) partitions. Further, we adopt embeddings for categorical features (learned in a supervised fashion via Y) and normalization for continuous/discrete features (by dividing by the maximum value).

CelebA: The CelebA dataset [111] contains 202,599 face images of 10,177 different celebrities with standard training, validation, and testing splits. Each image is annotated with 40 different attributes.
We choose the target attribute Y to be the high-cheekbone attribute (binary) and the semantic attribute S to be the concatenation of gender and age (a 4-class categorical RV). The objective of this experiment is similar to that of Folktables. Since raw image data is not appropriate for kernel methods, we pre-train a ResNet-18 [124] (supervised by Y) on CelebA images and extract features of dimension 256. These features are used as the input data for all methods.

5.6.3 Evaluation Metrics

We use the accuracy of the classification tasks (16-class classification for the Gaussian toy example, employment prediction for Folktables, and high-cheekbone prediction for CelebA) as the utility. For the Folktables and CelebA datasets, we define the DP violation as

DPV(Ŷ, S) := E_Ŷ[ Var_S( P[Ŷ | S] ) ]   (5.18)

and use it as a metric to measure the variance (unfairness) of the prediction Ŷ w.r.t. the semantic attribute S. For the Gaussian toy example, the above metric is challenging to compute because S is a continuous RV. To circumvent this difficulty, we deploy KCC [79],

KCC(Z, S) := sup_{α ∈ H_Z, β ∈ H_S} Cov(α(Z), β(S)) / √( Var(α(Z)) Var(β(S)) ),   (5.19)

as a measure of the invariance of Z to S, where H_Z and H_S are RBF-Gaussian RKHSs. The reason for using KCC instead of HSIC is that, unlike HSIC, KCC is normalized, and therefore it is a more readily interpretable measure for comparing the invariance of representations across different methods.

5.6.4 Choice of (Y, S) Pair

The existence of a utility-invariance trade-off ultimately depends on the statistical dependency between the target and semantic attributes. If Dep(Y, S) is negligible, then there is no trade-off. Keeping this in mind, we first chose the semantic attribute to be a sensitive attribute for the Folktables (i.e., age) and CelebA (i.e., concatenation of age and gender) datasets. Then, we calculated the data imbalance (i.e., |P[Y = 0] − 0.5|) and KCC(Y, S) for all possible Y's.
Finally, we chose Y with a small data imbalance and a moderate KCC(Y, S). For the Folktables dataset, |P[employment = 0] − 0.5| = 0.04 and KCC(employment, age) = 0.4. For the CelebA dataset, |P[high cheekbone = 0] − 0.5| = 0.05 and KCC(high cheekbone, [age, gender]) = 0.1.

5.6.5 Implementation Details

For all methods, we pick different values of λ (100 λ's for the Gaussian toy example and 70 λ's for the Folktables and CelebA datasets) between zero and one to obtain the utility-invariance trade-off. We train the baselines that use a neural network as the encoder five times with different random seeds. We let the random seed also change the training-validation-testing split for the Folktables dataset (the CelebA and Gaussian datasets have fixed splits).

Embedding Dimensionality: None of the baseline methods has a strategy for finding the optimal embedding dimensionality (r); they all keep r constant w.r.t. λ. Therefore, for the baseline methods, we set r = 15 (i.e., the minimum dimensionality required to linearly classify 16 different categories) for the Gaussian toy example and r = 3 (i.e., the minimum dimensionality required to linearly classify 4 different categories) for the Folktables-NY dataset, which is also equal to r_Opt when λ = 0.

Figure 5.4: Plots of r_Opt(λ) versus the dependence trade-off parameter 1 − λ for (a) the Gaussian toy dataset and (b) the Folktables-NY dataset. There is a non-decreasing relation between r_Opt(λ) and 1 − λ.

For K−T_Opt, we use r_Opt(λ) from Corollary 5.1. See Figure 5.4 for the plot of r_Opt versus λ for the toy Gaussian and Folktables-NY datasets. For the Folktables-WA and CelebA datasets, r_Opt(λ = 0) is equal to one, and therefore we let r = 1 for all methods and all 0 ≤ λ < 1.

K−T_Opt: We let H_X, H_S, and H_Y be RBF Gaussian RKHSs, where we compute the corresponding bandwidths (i.e., the σ's) using the median strategy introduced by [125].
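The median bandwidth strategy can be sketched as follows; the function name, subsampling cap, and use of the plain (unsquared) pairwise distance are our assumptions about one common variant of the heuristic.

```python
import numpy as np

def median_bandwidth(A, max_points=1000, seed=0):
    """Median heuristic: set the RBF bandwidth sigma to the median
    pairwise Euclidean distance of (a subsample of) the data A (n x d)."""
    rng = np.random.default_rng(seed)
    if A.shape[0] > max_points:                 # subsample for O(max_points^2) cost
        A = A[rng.choice(A.shape[0], max_points, replace=False)]
    d2 = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    dists = np.sqrt(d2[np.triu_indices_from(d2, k=1)])   # upper-triangle pairs
    return np.median(dists)
```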
We optimize the regularization parameter γ in the disentanglement set (5.8) by minimizing the corresponding target losses over γ ∈ {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1} on the validation sets. RFF (as discussed in Section 5.5.1) is deployed for all datasets. For the RFF dimensionality, we started with a small value and gradually increased it until reaching the maximum possible performance for λ = 0 (i.e., standard unconstrained representation learning) on the corresponding validation sets, which results in 100 for the Gaussian dataset, 5000 for the Folktables dataset, and 1000 for the CelebA dataset.

SARL [49]: The SARL method is similar to our K−T_Opt except that H_Y and H_S are linear RKHSs; we therefore set σ_X and γ as in K−T_Opt.

ARL [24, 22, 2]: The representation Z = f(X) is extracted via the encoder f, which is an MLP (4 hidden layers and 15, 15 neurons for the Gaussian data; 3 hidden layers and 128, 64 neurons for the Folktables and CelebA datasets). These architecture choices were made by starting with a single linear layer and gradually increasing the number of layers and neurons until over-fitting was observed. This results in 735 encoder parameters for the Gaussian toy example, whereas for K−T_Opt the count ranges from 100 = 100 · r_Opt(λ → 1) ≤ 100 · r_Opt(λ) ≤ 100 · r_Opt(λ = 0) = 1500. For Folktables and CelebA, the counts are 41,024 and 15,616, respectively, for ARL, and 5000 and 1000 for K−T_Opt. The representation Z is fed to a target-task predictor g_Y and a proxy adversary g_S, both MLPs (2 hidden layers and 16 neurons for the Gaussian data; 2 hidden layers and 128 neurons for the Folktables and CelebA datasets). All involved networks (f, g_Y, g_S) are trained end-to-end. We use stochastic gradient descent-ascent (SGDA) [24] with AdamW [126] as the optimizer to alternately train the encoder, target predictor, and proxy adversary networks.
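The alternating descent-ascent scheme can be illustrated on a toy quadratic game (not the actual networks, which require automatic differentiation); the objective, step size, and manual gradients below are all our own choices, picked so the iterates spiral into the equilibrium rather than diverge.

```python
def sgda(steps=1000, lr=0.05, x=1.0, y=1.0):
    """Alternating gradient descent-ascent on the toy minimax
    min_x max_y f(x, y) = x*y + 0.1*x**2 - 0.1*y**2 (manual gradients).
    x plays the encoder's role (descent), y the adversary's (ascent);
    the equilibrium is (0, 0)."""
    for _ in range(steps):
        x = x - lr * (y + 0.2 * x)   # descent step on x: df/dx = y + 0.2*x
        y = y + lr * (x - 0.2 * y)   # ascent step on y: df/dy = x - 0.2*y
    return x, y
```

Without the small quadratic regularization terms, the same alternating updates on the purely bilinear game x·y would oscillate or diverge, which is the instability of minimax training referred to throughout this chapter.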
We choose a batch size of 500 for the Gaussian data and 128 for the Folktables and CelebA datasets. The corresponding learning rates are then optimized over {10⁻², 10⁻³, 5 × 10⁻⁴, 10⁻⁴, 10⁻⁵} by minimizing the target loss on the corresponding validation sets.

HSIC-IRepL [62]: This method can be formulated as (5.2) with Dep(Z, S) replaced by HSIC(Z, S). The encoder and target predictor networks have the same architecture as in ARL. Therefore, we follow the same optimization procedure as ARL to train the involved neural networks.

5.6.6 Results

Utility-Invariance Trade-offs: Figures 5.5 and 5.6 show the utility-invariance and Dep(Z, Y)−Dep(Z, S) trade-offs for the toy Gaussian, Folktables-WA, Folktables-NY, and CelebA datasets, respectively. The invariance measure for the Gaussian toy example is KCC (5.19), and the invariance measure for the Folktables and CelebA datasets is the fairness measure DPV (5.18). We make the following observations: 1) K−T_Opt is highly stable and almost spans the entire trade-off front for all datasets except Folktables-NY, which can be due to the inability of the scalarized single-objective formulation in (5.2), in contrast to the constrained optimization in (5.1), to find all Pareto-optimal points. 2) The trade-off between Dep(Z, Y) and Dep(Z, S) (Figure 5.6) follows almost the same trend as the utility-invariance trade-off (Figure 5.5). This is a desired observation, since the Dep(Z, Y)−Dep(Z, S) trade-off is what we actually optimize in (5.9) as a surrogate for the utility-invariance trade-off. 3) The baseline HSIC-IRepL, despite using a universal dependence measure, leads to a sub-optimal trade-off front due to the lack of convergence guarantees to the global optima. 4) The baselines ARL and SARL span only a small portion of the trade-off front in the Gaussian toy example, since some dimensions of the semantic attribute S in (5.17) are mean independent (but not fully independent) of some dimensions of X, and therefore the adversary provides no signal to the encoder to discard [S_1, S_2] from the representation. On this dataset, the ARL and SARL baselines do not approach Z ⊥⊥ S, i.e., KCC(Z, S) = 0 cannot be attained for any value of the trade-off parameter λ. 5) ARL shows high deviation on the Folktables dataset due to the unstable nature of the minimax optimization. 6) SARL performs as well as K−T_Opt on the CelebA dataset.

Figure 5.5: Utility versus invariance trade-offs obtained by K−T_Opt and other baselines for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. K−T_Opt stably spans the entire trade-off front and considerably dominates the other methods on all datasets. In (a), ARL and SARL span a small portion of the trade-off front since S is mean independent (but not fully independent) of X in some dimensions; HSIC-IRepL, despite using a universal dependence measure, performs sub-optimally due to the lack of convergence guarantees to the global optima.

Figure 5.6: Dep(Z, Y) versus Dep(Z, S) in K−T_Opt for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. The Dep(Z, Y)−Dep(Z, S) trade-off follows the same trend as the utility-invariance trade-off in Figure 5.5.
This is because both S and Y are categorical for the CelebA dataset, and therefore a linear RKHS on the one-hot encoded attribute performs just as well as a universal RKHS [127].

Universality of Dep(Z, S): We empirically examine the practical validity of our assumption in Section 5.4 and verify whether our dependence measure Dep(Z, S), defined in (5.6), can capture all modes of dependency between Z and S. Figure 5.7 (a) plots the universal dependence measure KCC(Z, S) versus Dep(Z, S) for the Gaussian dataset, and Figures 5.7 (b, c) illustrate the relation between DPV(Ŷ, S) and Dep(Z, S) for the Folktables and CelebA datasets, respectively.

Figure 5.7: Invariance versus Dep(Z, S) of K−T_Opt for (a) Gaussian, (b) Folktables-WA, (c) Folktables-NY, and (d) CelebA datasets. Dep(Z, S) enjoys a monotonic relation with the underlying invariance measures.

We observe that there is a non-decreasing relation between the corresponding invariance measures and Dep(Z, S). More importantly, as KCC(Z, S) → 0 (or DPV(Ŷ, S) → 0), so does Dep(Z, S). These observations verify that Dep(Z, S) accounts for all modes of dependence between Z and S.

5.6.7 Ablation Study

Effect of Embedding Dimensionality: In this experiment, we examine the significance of the embedding dimensionality, r_Opt(λ), discussed in Corollary 5.1. We obtain the utility-invariance trade-off when the embedding dimensionality is fixed at r = r_Opt(λ = 0) = 15. A comparison between the utility-invariance trade-off induced by r_Opt(λ) and that of the fixed r = 15 is illustrated in Figure 5.8 (a).
We observe that not only is the utility-invariance trade-off for fixed r dominated by that of r_Opt(λ), but fixed r is also unable to achieve the totally invariant representation, i.e., Z ⊥⊥ S. Further, r_Opt(λ) and some of the largest eigenvalues of (5.12) versus the invariance trade-off parameter λ are plotted in Figures 5.4 and 5.8 (b), respectively. We recall from Corollary 5.1 that, for any given λ, r_Opt is the number of non-negative eigenvalues of (5.12).

Figure 5.8: (a) Comparison between the utility-invariance trade-offs induced by the optimal embedding dimensionality r_Opt(λ) and that of fixed r = 15. Fixed r = 15 is significantly dominated by r_Opt(λ) and fails to attain Z ⊥⊥ S. (b) The first, fifth, tenth, and fifteenth largest eigenvalues in (5.12) versus 1 − λ. Given λ, r_Opt is equal to the number of non-negative eigenvalues. As 1 − λ decreases, the largest eigenvalues approach negative values.

Figure 5.9: (a) Utility versus invariance trade-offs for all methods when age (i.e., the sensitive attribute) is discarded from the input data. (b) A comparison between the trade-offs of K−T_Opt when age is present versus discarded from the input data. Removing the age attribute slightly degrades the trade-off due to the discarded information.

5.7 Summary

Invariant representation learning (IRepL) often involves a trade-off between utility and invariance. While the existence of such a trade-off and its bounds have been studied, its exact characterization has not been investigated.
This chapter takes steps toward addressing this problem by i) establishing the exact kernelized trade-off (denoted by K−T_Opt), ii) determining the optimal dimensionality of the data representation necessary to achieve a desired optimal trade-off point, and iii) developing a scalable learning algorithm for encoders in some RKHSs to achieve K−T_Opt. Numerical results on both an illustrative example and two real-world datasets show that commonly used adversarial representation learning techniques are unable to attain the optimal trade-off estimated by our solution.

Chapter 6

Conclusion and Future Work

Invariant representation learning often introduces a trade-off between utility and invariance. However, devising a learning algorithm that can optimally achieve any point on the trade-off front is challenging. To circumvent this difficulty, in this dissertation we propose to model the IRepL players via functions in some RKHSs. Consequently, we found a closed-form solution for the underlying IRepL problem, and the optimal embedding dimensionality is also determined. We also considered the case where the encoder is an NN. In this situation, we model the target network and the invariance measure via kernelized ridge regressors, which yield closed-form solutions for the target network and the invariance measure. As a result, the unstable SGDA optimization scheme turns into SGD, which is a stable optimization scheme. Numerical experiments on real-world datasets such as the US Census and CelebA confirm the optimality and efficiency of our proposed methods compared to baseline IRepL algorithms. Finally, our theoretical results and empirical solutions shed light on the utility-invariance trade-offs in various settings, such as algorithmic fairness and privacy-preserving learning.

6.1 Limitations

This dissertation restricts the target predictor to be modeled via functions in some RKHSs to afford analytical tractability and theoretical guarantees in Chapters 3 and 5.
Our proposed IRepL formulation is a scalarization of the bi-objective trade-off formulation. Even though any solution to the scalarized version is a solution to the original bi-objective IRepL problem, some regions of the utility-invariance trade-off may not be reachable by the scalarized counterpart. In general, IRepL depends on the dependence measure used to quantify the dependence of the learned representations on the sensitive attribute. As such, the trade-off obtained in this dissertation is optimal for HSIC-like dependence measures and may not be optimal for other invariance measures such as KCC or COCO.

6.2 Future Work

As a result of the limitations mentioned above, studying the bi-objective trade-off (rather than its scalarization) and other universal measures are interesting directions for future work. Moreover, the invariance criterion deployed in this dissertation is analogous to demographic parity in fairness, which can be an unsuitable fairness notion in some practical scenarios [128, 129]. Our method can be extended to other notions of fairness, such as equalized odds and equality of opportunity [128], by modifying the invariance measure to capture the conditional dependency of the representation on the sensitive attribute given the target label.

6.3 Broader Impact

IRepL can enable many machine learning systems to prevent the leakage of private (sensitive) information while remaining effective for the desired prediction task(s). In particular, IRepL has an immediate application in fairness, which is a significant societal problem. This dissertation proposes algorithms that can be used to learn fair representations of data. More generally, these approaches can enable machine learning systems to discard specific data before making predictions. Of course, as with any technological or algorithmic solution, such methods can also be employed for harmful purposes.

BIBLIOGRAPHY

[1] P. Roy and V. N.
Boddeti, “Mitigating information leakage in image representations: A max- imum entropy approach,” IEEE Conference on Computer Vision and Pattern Recognition, 2019. [2] D. Madras, E. Creager, T. Pitassi, and R. Zemel, “Learning adversarially fair and transfer- able representations,” arXiv preprint arXiv:1802.06309, 2018. [3] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016. [4] V. Grari, O. E. Hajouji, S. Lamprier, and M. Detyniecki, “Learning unbiased representations via rényi minimization,” in Joint European Conference on Machine Learning and Knowl- edge Discovery in Databases, pp. 749–764, Springer, 2021. [5] A. K. Menon and R. C. Williamson, “The cost of fairness in binary classification,” Confer- ence on Fairness, Accountability and Transparency, pp. 107–118, 2018. [6] H. Zhao and G. J. Gordon, “Inherent tradeoffs in learning fair representations,” arXiv preprint arXiv:1906.08386, 2019. [7] T. L. Gouic, J.-M. Loubes, and P. Rigollet, “Projection to fairness in statistical learning,” arXiv preprint arXiv:2005.11720, 2020. [8] H. Zhao, “Costs and benefits of wasserstein fair regression,” arXiv preprint arXiv:2106.08812, 2021. [9] H. Zhao, C. Dan, B. Aragam, T. S. Jaakkola, G. J. Gordon, and P. Ravikumar, “Fundamental limits and tradeoffs in invariant representation learning,” arXiv preprint arXiv:2012.10713, 2020. [10] H. Zhao, R. T. Des Combes, K. Zhang, and G. Gordon, “On learning invariant represen- tations for domain adaptation,” International Conference on Machine Learning, pp. 7523– 7532, 2019. [11] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” Inter- national Conference on Machine Learning, 2015. [12] E. Tzeng, J. Hoffman, K. Saenko, and T. 
Darrell, “Adversarial discriminative domain adap- tation,” IEEE Conference on Computer Vision and Pattern Recognition, 2017. [13] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, “Domain generalization: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. 104 [14] L. Zhang and X. Gao, “Transfer adaptation learning: A decade survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022. [15] T. Isobe, D. Li, L. Tian, W. Chen, Y. Shan, and S. Wang, “Towards discriminative represen- tation learning for unsupervised person re-identification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8526–8536, 2021. [16] H. Xia, H. Zhao, and Z. Ding, “Adaptive adversarial network for source-free domain adap- tation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9010–9019, 2021. [17] X. Liu, Z. Guo, S. Li, F. Xing, J. You, C.-C. J. Kuo, G. El Fakhri, and J. Woo, “Adversarial unsupervised domain adaptation with conditional and label shift: Infer, align and iterate,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10367– 10376, 2021. [18] W. Zhang, X. Li, H. Ma, Z. Luo, and X. Li, “Universal domain adaptation in fault diag- nostics with hybrid weighted deep adversarial learning,” IEEE Transactions on Industrial Informatics, vol. 17, no. 12, pp. 7957–7967, 2021. [19] X. Fan, Q. Wang, J. Ke, F. Yang, B. Gong, and M. Zhou, “Adversarially adaptive normal- ization for single domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8208–8217, 2021. [20] Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,” arXiv preprint arXiv:2108.13624, 2021. [21] H. Edwards and A. Storkey, “Censoring representations with an adversary,” arXiv preprint arXiv:1511.05897, 2015. [22] B. H. Zhang, B. Lemoine, and M. 
Mitchell, “Mitigating unwanted biases with adversarial learning,” AAAI/ACM Conference on AI, Ethics, and Society, 2018. [23] A. Beutel, J. Chen, Z. Zhao, and E. H. Chi, “Data decisions and theoretical implications when adversarially learning fair representations,” arXiv preprint arXiv:1707.00075, 2017. [24] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig, “Controllable invariance through adversarial feature learning,” Advances in Neural Information Processing Systems, pp. 585–596, 2017. [25] V. Mirjalili, S. Raschka, A. Namboodiri, and A. Ross, “Semi-adversarial networks: Convo- lutional autoencoders for imparting privacy to face images,” in International Conference on Biometrics, 2018. 105 [26] M. Bertran, N. Martinez, A. Papadaki, Q. Qiu, M. Rodrigues, G. Reeves, and G. Sapiro, “Adversarially learned representations for information obfuscation and inference,” Interna- tional Conference on Machine Learning, 2019. [27] A. B. Arrieta, N. Dı́az-Rodrı́guez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcı́a, S. Gil-López, D. Molina, R. Benjamins, et al., “Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai,” Information fusion, vol. 58, pp. 82–115, 2020. [28] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” arXiv preprint arXiv:1908.09635, 2019. [29] S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach, “Language (technology) is power: A critical survey of” bias” in nlp,” arXiv preprint arXiv:2005.14050, 2020. [30] S. Qian, V. H. Pham, T. Lutellier, Z. Hu, J. Kim, L. Tan, Y. Yu, J. Chen, and S. Shah, “Are my deep learning systems fair? an empirical study of fixed-seed training,” Advances in Neural Information Processing Systems, vol. 34, pp. 30211–30227, 2021. [31] S. Ravfogel, M. Twiton, Y. Goldberg, and R. D. Cotterell, “Linear adversarial concept era- sure,” in International Conference on Machine Learning, pp. 
18400–18421, PMLR, 2022. [32] K. Karkkainen and J. Joo, “Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558, 2021. [33] M. Du, S. Mukherjee, G. Wang, R. Tang, A. Awadallah, and X. Hu, “Fairness via rep- resentation neutralization,” Advances in Neural Information Processing Systems, vol. 34, pp. 12091–12103, 2021. [34] A. Wang, A. Liu, R. Zhang, A. Kleiman, L. Kim, D. Zhao, I. Shirai, A. Narayanan, and O. Russakovsky, “Revise: A tool for measuring and mitigating bias in visual datasets,” International Journal of Computer Vision, pp. 1–21, 2022. [35] L. Wu, L. Chen, P. Shao, R. Hong, X. Wang, and M. Wang, “Learning fair representations for recommendation: A graph-based perspective,” in Proceedings of the Web Conference 2021, pp. 2198–2208, 2021. [36] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” International Conference on Machine Learning, pp. 325–333, 2013. [37] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” arXiv preprint arXiv:1511.00830, 2015. 106 [38] D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg, “Invariant representa- tions without adversarial training,” in Advances in Neural Information Processing Systems, pp. 9102–9111, 2018. [39] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon, “Adversarial multiple source domain adaptation,” Advances in Neural Information Processing Systems, vol. 31, pp. 8559–8570, 2018. [40] F. Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney, “Optimized pre-processing for discrimination prevention,” Advances in Neural Information Processing Systems, pp. 3992–4001, 2017. [41] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012. 
[42] S. Ruggieri, “Using t-closeness anonymity to control for non-discrimination.,” Transactions on Data Privacy, vol. 7, no. 2, pp. 99–129, 2014. [43] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certi- fying and removing disparate impact,” ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, pp. 259–268, 2015. [44] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon, “Learning controllable fair represen- tations,” International Conference on Artificial Intelligence and Statistics, 2019. [45] E. Creager, D. Madras, J.-H. Jacobsen, M. A. Weis, K. Swersky, T. Pitassi, and R. Zemel, “Flexibly fair representation learning by disentanglement,” International Conference on Machine Learning, 2019. [46] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem, “On the fair- ness of disentangled representations,” Advances in Neural Information Processing Systems, pp. 14611–14624, 2019. [47] J. Mary, C. Calauzenes, and N. El Karoui, “Fairness-aware learning for continuous attributes and treatments,” International Conference on Machine Learning, pp. 4382–4391, 2019. [48] N. Martinez, M. Bertran, and G. Sapiro, “Minimax pareto fairness: A multi objective per- spective,” International Conference on Machine Learning, pp. 6755–6764, 2020. [49] B. Sadeghi, R. Yu, and V. Boddeti, “On the global optima of kernelized adversarial rep- resentation learning,” IEEE International Conference on Computer Vision, pp. 7971–7979, 2019. 107 [50] H. Zhang, Y.-F. Zhang, W. Liu, A. Weller, B. Schölkopf, and E. P. Xing, “Towards principled disentanglement for domain generalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8024–8034, 2022. [51] J. Hamm, “Minimax filter: Learning to preserve privacy from inference attacks,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 4704–4734, 2017. [52] M. Coavoux, S. Narayan, and S. B. 
Cohen, “Privacy-preserving neural representations of text,” arXiv preprint arXiv:1808.09408, 2018. [53] T. Xiao, Y.-H. Tsai, K. Sohn, M. Chandraker, and M.-H. Yang, “Adversarial learning of privacy-preserving and task-oriented representations,” Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12434–12441, 2020. [54] M. Dusmanu, J. L. Schönberger, S. N. Sinha, and M. Pollefeys, “Privacy-preserving visual feature descriptors through adversarial affine subspace embedding,” IEEE Conference on Computer Vision and Pattern Recognition, 2021. [55] D. McNamara, C. S. Ong, and R. C. Williamson, “Costs and benefits of fair representation learning,” AAAI/ACM Conference on AI, Ethics, and Society, pp. 263–270, 2019. [56] E. Adeli, Q. Zhao, A. Pfefferbaum, E. V. Sullivan, L. Fei-Fei, J. C. Niebles, and K. M. Pohl, “Representation learning with statistical independence to mitigate bias,” IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2513–2523, 2021. [57] C. Agarwal, H. Lakkaraju, and M. Zitnik, “Towards a unified framework for fair and sta- ble graph representation learning,” in Uncertainty in Artificial Intelligence, pp. 2114–2124, PMLR, 2021. [58] K. Yang, J. H. Yau, L. Fei-Fei, J. Deng, and O. Russakovsky, “A study of face obfuscation in imagenet,” in International Conference on Machine Learning, pp. 25313–25330, PMLR, 2022. [59] B. Rodrı́guez-Gálvez, R. Thobaben, and M. Skoglund, “A variational approach to privacy and fairness,” in 2021 IEEE Information Theory Workshop (ITW), pp. 1–6, IEEE, 2021. [60] H. Zhao, J. Chi, Y. Tian, and G. J. Gordon, “Trade-offs and guarantees of adversarial repre- sentation learning for information obfuscation,” arXiv preprint arXiv:1906.07902, 2019. [61] S. Dutta, D. Wei, H. Yueksel, P.-Y. Chen, S. Liu, and K. Varshney, “Is there a trade-off between fairness and accuracy? A perspective using mismatched hypothesis testing,” Inter- national Conference on Machine Learning, pp. 2803–2813, 2020. 108 [62] N. 
Quadrianto, V. Sharmanska, and O. Thomas, “Discovering fair representations in the data domain,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8227–8236, 2019. [63] A. Letcher, D. Balduzzi, S. Racaniere, J. Martens, J. N. Foerster, K. Tuyls, and T. Graepel, “Differentiable game mechanics.,” Journal of Machine Learning Research, vol. 20, no. 84, pp. 1–40, 2019. [64] L. Mescheder, S. Nowozin, and A. Geiger, “The numerics of gans,” in Advances in Neural Information Processing Systems, pp. 1825–1835, 2017. [65] V. Nagarajan and J. Z. Kolter, “Gradient descent gan optimization is locally stable,” in Ad- vances in Neural Information Processing Systems, pp. 5585–5595, 2017. [66] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel, “The mechanics of n-player differentiable games,” in International Conference on Machine Learning, 2018. [67] C. Daskalakis and I. Panageas, “The limit points of (optimistic) gradient descent in min-max optimization,” in Advances in Neural Information Processing Systems, 2018. [68] C. Jin, P. Netrapalli, and M. I. Jordan, “What is local optimality in nonconvex-nonconcave minimax optimization?,” arXiv preprint arXiv:1902.00618, 2019. [69] G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien, “A variational inequal- ity perspective on generative adversarial networks,” International Conference on Learning Representations, 2019. [70] S. Wang, Y. Teng, and P. Perdikaris, “Understanding and mitigating gradient flow patholo- gies in physics-informed neural networks,” SIAM Journal on Scientific Computing, vol. 43, no. 5, pp. A3055–A3081, 2021. [71] N. Loizou, H. Berard, G. Gidel, I. Mitliagkas, and S. Lacoste-Julien, “Stochastic gradient descent-ascent and consensus optimization for smooth games: Convergence analysis un- der expected co-coercivity,” Advances in Neural Information Processing Systems, vol. 34, pp. 19095–19108, 2021. [72] G. 
Korpelevich, “The extragradient method for finding saddle points and other problems,” Matecon, vol. 12, pp. 747–756, 1976. [73] H. K. Khalil, “Nonlinear systems,” Printice-Hall Inc, 1996. [74] I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent develop- ments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, 2016. 109 [75] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998. [76] S. Baharlouei, M. Nouiehed, A. Beirami, and M. Razaviyayn, “R’enyi fair inference,” arXiv preprint arXiv:1906.12005, 2019. [77] A. Pérez-Suay, V. Laparra, G. Mateo-Garcı́a, J. Muñoz-Marı́, L. Gómez-Chova, and G. Camps-Valls, “Fair kernel learning,” Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 339–355, 2017. [78] A. Rényi, “On measures of dependence,” Acta Mathematica Academiae Scientiarum Hun- garica, vol. 10, no. 3-4, pp. 441–451, 1959. [79] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of Machine Learning Research, vol. 3, no. 6, pp. 1–48, 2002. [80] A. Gretton, A. Smola, O. Bousquet, R. Herbrich, B. Schölkopf, and N. Logothetis, “Be- haviour and convergence of the constrained covariance,” Max Planck Institute for Biological Cybernetics, 2004. [81] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” International Conference on Algorithmic Learning Theory, pp. 63–77, 2005. [82] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis. Cambridge Uni- versity Press, 2004. [83] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf, “Kernel methods for measuring independence,” Journal of Machine Learning Research, vol. 6, no. 12, pp. 2075– 2129, 2005. [84] B. K. Sriperumbudur, K. Fukumizu, and G. R. 
Lanckriet, “Universality, characteristic ker- nels and RKHS embedding of measures,” Journal of Machine Learning Research, vol. 12, no. 7, 2011. [85] K. Fukumizu, F. R. Bach, and A. Gretton, “Statistical consistency of kernel canonical cor- relation analysis.,” Journal of Machine Learning Research, vol. 8, no. 2, 2007. [86] E. Kreyszig, Introductory functional analysis with applications, vol. 1. wiley New York, 1978. [87] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014. 110 [88] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014. [89] D. P. Bertsekas, “Nonlinear programming,” Journal of the Operational Research Society, vol. 48, no. 3, pp. 334–334, 1997. [90] E. Kokiopoulou, J. Chen, and Y. Saad, “Trace optimization and eigenproblems in dimension reduction methods,” Numerical Linear Algebra with Applications, vol. 18, no. 3, pp. 565– 602, 2011. [91] A. Edelman, T. A. Arias, and S. T. Smith, “The geometry of algorithms with orthogonality constraints,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303– 353, 1998. [92] S. Kumar, M. Mohri, and A. Talwalkar, “Sampling methods for the Nyström method,” Jour- nal of Machine Learning Research, vol. 13, no. Apr, pp. 981–1006, 2012. [93] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [94] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013. [95] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 643–660, 2001. [96] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009. [97] C. 
Ionescu, O. Vantzos, and C. Sminchisescu, “Training deep networks with structured lay- ers by matrix backpropagation,” arXiv preprint arXiv:1509.07838, 2015. [98] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, “End-to-end representa- tion learning for correlation filter based tracking,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017. [99] B. Amos and J. Z. Kolter, “Optnet: Differentiable optimization as a layer in neural net- works,” in International Conference on Machine Learning, 2017. [100] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter, “Differentiable convex optimization layers,” in Advances in Neural Information Processing Systems, 2019. [101] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi, “Meta-learning with differentiable closed-form solvers,” in International Conference on Learning Representations, 2018. 111 [102] K. Lee, S. Maji, A. Ravichandran, and S. Soatto, “Meta-learning with differentiable convex optimization,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019. [103] Y. Elazar and Y. Goldberg, “Adversarial removal of demographic attributes from text data,” Empirical Methods in Natural Language Processing, 2018. [104] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generaliza- tion in neural networks,” in Advances in Neural Information Processing Systems, 2018. [105] G. H. Golub and V. Pereyra, “The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate,” SIAM Journal on Numerical Analysis, vol. 10, no. 2, pp. 413–432, 1973. [106] M. Hardt, B. Recht, and Y. Singer, “Train faster, generalize better: Stability of stochas- tic gradient descent,” in International Conference on Machine Learning, pp. 1225–1234, PMLR, 2016. [107] D. Dua and C. Graff, “Uci machine learning repository (2017),” URL http://archive. ics. uci. edu/ml, 2017. [108] B. Kim, H. Kim, K. Kim, S. Kim, and J. 
Kim, “Learning not to learn: Training deep neural networks with biased data,” in IEEE Conference on Computer Vision and Pattern Recogni- tion, 2019. [109] E. Zitzler and L. Thiele, “Multiobjective optimization using evolutionary algorithms—a comparative case study,” in International Conference on Parallel Problem Solving from Na- ture, pp. 292–301, 1998. [110] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugen- ics, vol. 7, no. 2, pp. 179–188, 1936. [111] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” IEEE International Conference on Computer Vision, 2015. [112] J. Knowles, “A summary-attainment-surface plotting method for visualizing the perfor- mance of stochastic multiobjective optimizers,” in International Conference on Intelligent Systems Design and Applications (ISDA), 2005. [113] C. R. Souza, “Kernel functions for machine learning applications,” Creative Commons Attribution-Noncommercial-Share Alike, vol. 3, p. 29, 2010. [114] V. Grari, O. E. Hajouji, S. Lamprier, and M. Detyniecki, “Learning unbiased representations via Rényi minimization,” arXiv preprint arXiv:2009.03183, 2020. [115] A. Rahimi, B. Recht, et al., “Random features for large-scale kernel machines.,” Advances in Neural Information Processing Systems, vol. 3, no. 4, p. 5, 2007. 112 [116] F. Ding, M. Hardt, J. Miller, and L. Schmidt, “Retiring adult: New datasets for fair machine learning,” arXiv preprint arXiv:2108.04884, 2021. [117] J. Jacod and P. Protter, Probability essentials. Springer Science & Business Media, 2012. [118] E. Barshan, A. Ghodsi, Z. Azimifar, and M. Z. Jahromi, “Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds,” Pat- tern Recognition, vol. 44, no. 7, pp. 1357–1371, 2011. [119] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. 
Hjelm, “Mutual information neural estimation,” International Conference on Machine Learning, pp. 531–540, 2018. [120] P. Diaconis and D. Freedman, “Asymptotics of graphical projection pursuit,” The Annals of Statistics, pp. 793–815, 1984. [121] P. Hall and K.-C. Li, “On almost linearity of low dimensional projections from high dimen- sional data,” The Annals of Statistics, pp. 867–889, 1993. [122] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new per- spectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. [123] R. L. Strawderman, “The symmetric eigenvalue problem (classics in applied mathematics, number 20),” Journal of the American Statistical Association, vol. 94, no. 446, p. 657, 1999. [124] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770– 778, 2016. [125] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola, “A kernel sta- tistical test of independence,” Advances in Neural Information Processing Systems, vol. 20, pp. 585–592, 2007. [126] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017. [127] Y. Li, R. Pogodin, D. J. Sutherland, and A. Gretton, “Self-supervised learning with kernel dependence maximization,” arXiv preprint arXiv:2106.08320, 2021. [128] M. Hardt, E. Price, N. Srebro, et al., “Equality of opportunity in supervised learning,” in Advances in Neural Information Processing Systems, pp. 3315–3323, 2016. [129] A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. big data 5, 2 (2017), 153–163,” arXiv preprint arXiv:1610.07524, 2017. 113 [130] J. Adebayo and L. 
Kagal, “Iterative orthogonal feature projection for diagnosing bias in black-box models,” Fairness, Accountability, and Transparency in Machine Learning, 2016. [131] P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian, “Auditing black-box models for indirect influence,” Knowledge and Information Systems, vol. 54, no. 1, pp. 95–122, 2018. [132] T. Bolukbasi, K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in Advances in Neural Information Processing Systems, pp. 4349–4357, 2016. [133] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004. [134] B. Fish, J. Kun, and A. D. Lelkes, “Fair boosting: a case study,” in Workshop on Fairness, Accountability, and Transparency in Machine Learning, Citeseer, 2015. [135] G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander, “Satisfying real-world goals with dataset constraints,” in Advances in Neural Information Processing Systems, pp. 2415–2423, 2016. [136] F. Kamiran and T. Calders, “Classification with no discrimination by preferential sampling,” in Proc. 19th Machine Learning Conf. Belgium and The Netherlands, vol. 1, Citeseer, 2010. [137] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Fairness-aware classifier with prejudice remover regularizer,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 35–50, Springer, 2012. [138] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma, “Enhancement of the neutrality in rec- ommendation.,” in Decisions@ RecSys, pp. 8–14, Citeseer, 2012. [139] T. Kamishima, S. Akaho, H. Asoh, and I. Sato, “Model-based approaches for independence- enhanced recommendation,” in International Conference on Data Mining Workshops, pp. 860–867, IEEE, 2016. [140] E. Kazemi, M. Zadimoghaddam, and A. 
Karbasi, “Scalable deletion-robust submodular maximization: Data summarization with privacy and fairness constraints,” in International Conference on Machine Learning, pp. 2549–2558, 2018. [141] D. H. Kim, T. Kong, and S. Jeong, “Finding solutions to generative adversarial privacy,” arXiv preprint arXiv:1810.02069, 2018. [142] J. Komiyama, A. Takeda, J. Honda, and H. Shimao, “Nonconvex optimization for regression with fairness constraints,” in International Conference on Machine Learning, pp. 2742– 2751, 2018. 114 [143] A. J. Laub, Matrix analysis for scientists and engineers, vol. 91. Siam, 2005. [144] G. Meanti, L. Carratino, L. Rosasco, and A. Rudi, “Kernel methods through the roof: han- dling billions of points efficiently,” arXiv preprint arXiv:2006.10350, 2020. [145] G. Ristanoski, W. Liu, and J. Bailey, “Discrimination aware classification for imbalanced datasets,” in ACM International Conference on Information & Knowledge Management, pp. 1529–1532, ACM, 2013. [146] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala, “The price of fair pca: One extra dimension,” in Advances in Neural Information Processing Systems, pp. 10999–11010, 2018. [147] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond dis- parate treatment & disparate impact: Learning classification without disparate mistreat- ment,” in International Conference on World Wide Web, pp. 1171–1180, International World Wide Web Conferences Steering Committee, 2017. [148] I. Žliobaite, F. Kamiran, and T. Calders, “Handling conditional discrimination,” in Interna- tional Conference on Data Mining, pp. 992–1001, IEEE, 2011. [149] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision, pp. 630–645, Springer, 2016. [150] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014. [151] Y. Li, K. Swersky, and R. 
Zemel, “Learning unbiased features,” arXiv preprint arXiv:1412.5244, 2014. [152] B. Schölkopf, R. Herbrich, and A. J. Smola, “A generalized representer theorem,” in Inter- national Conference on Computational Learning Theory, 2001. [153] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2651–2667, 2006. [154] K. Fukumizu, F. R. Bach, and M. I. Jordan, “Dimensionality reduction for supervised learn- ing with reproducing kernel hilbert spaces,” Journal of Machine Learning Research, vol. 5, no. Jan, pp. 73–99, 2004. [155] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, May, vol. 23, p. 2016, 2016. [156] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in com- mercial gender classification,” in Conference on Fairness, Accountability and Transparency, pp. 77–91, 2018. 115 [157] A. Datta, M. C. Tschantz, and A. Datta, “Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination,” Proceedings on Privacy Enhancing Technolo- gies, vol. 2015, no. 1, pp. 92–112, 2015. [158] L. Sweeney, “Discrimination in online ad delivery,” Queue, vol. 11, no. 3, pp. 10–29, 2013. [159] A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automatically from lan- guage corpora contain human-like biases,” Science, vol. 356, no. 6334, pp. 183–186, 2017. [160] S. Verma and J. Rubin, “Fairness definitions explained,” in IEEE/ACM International Work- shop on Software Fairness (FairWare), pp. 1–7, IEEE, 2018. [161] T. Kamishima, S. Akaho, and J. Sakuma, “Fairness-aware learning through regularization approach,” in International Conference on Data Mining Workshops, pp. 643–650, IEEE, 2011. [162] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond dis- parate treatment & disparate impact: Learning classification without disparate mistreat- ment,” in International Conference on World Wide Web, pp. 
1171–1180, 2017. [163] Y. Bechavod and K. Ligett, “Penalizing unfairness in binary classification,” arXiv preprint arXiv:1707.00044, 2017. [164] R. Berk, H. Heidari, S. Jabbari, M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth, “A convex framework for fair regression,” arXiv preprint arXiv:1706.02409, 2017. [165] M. Kearns, S. Neel, A. Roth, and Z. S. Wu, “Preventing fairness gerrymandering: Audit- ing and learning for subgroup fairness,” in International Conference on Machine Learning, pp. 2564–2572, 2018. [166] B. Fish, J. Kun, and Á. D. Lelkes, “A confidence-based approach for balancing fairness and accuracy,” in International Conference on Data Mining, pp. 144–152, SIAM, 2016. [167] D. Alabi, N. Immorlica, and A. T. Kalai, “Unleashing linear optimizers for group-fair learn- ing and optimization,” arXiv preprint arXiv:1804.04503, 2018. [168] L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi, “Classification with fairness con- straints: A meta-algorithm with provable guarantees,” in Conference on Fairness, Account- ability, and Transparency, pp. 319–328, 2019. [169] T. Calders, F. Kamiran, and M. Pechenizkiy, “Building classifiers with independency con- straints,” in IEEE International Conference on Data Mining Workshops, pp. 13–18, 2009. [170] F. Kamiran and T. Calders, “Classifying without discriminating,” in International Confer- ence on Computer, Control and Communication, pp. 1–6, 2009. 116 [171] F. Kamiran and T. Calders, “Data preprocessing techniques for classification without dis- crimination,” Knowledge and Information Systems, vol. 33, no. 1, pp. 1–33, 2012. [172] M. Wick, J.-B. Tristan, et al., “Unlocking fairness: a trade-off revisited,” Advances in neural information processing systems, vol. 32, 2019. [173] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” in The Col- lected Works of Wassily Hoeffding, pp. 409–426, Springer, 1994. [174] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. 
Schölkopf, “Distinguishing cause from effect using observational data: methods and benchmarks,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1103–1204, 2016.
[175] D. Greenfeld and U. Shalit, “Robust learning with the hilbert-schmidt independence criterion,” in International Conference on Machine Learning, pp. 3759–3768, PMLR, 2020.
[176] S. Yu, F. Alesiani, X. Yu, R. Jenssen, and J. C. Principe, “Measuring dependence with matrix-based entropy functional,” arXiv preprint arXiv:2101.10160, 2021.
[177] B. Sadeghi, L. Wang, and V. N. Boddeti, “Adversarial representation learning with closed-form solvers,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 731–748, Springer, 2021.
[178] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio, “Learning with a wasserstein loss,” arXiv preprint arXiv:1506.05439, 2015.
[179] L. G. S. Giraldo, M. Rao, and J. C. Principe, “Measures of entropy from data using infinitely divisible kernels,” IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2014.
[180] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” Advances in Neural Information Processing Systems, vol. 19, 2006.
[181] J. Song, P. Kalluri, A. Grover, S. Zhao, and S. Ermon, “Learning controllable fair representations,” International Conference on Artificial Intelligence and Statistics, 2019.
[182] B. Sadeghi and V. N. Boddeti, “Imparting fairness to pre-trained biased representations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2020.

APPENDIX

A.1 Proof of Lemma 3.1

Lemma. Let $X$ and $U$ be two RVs with $\mathbb{E}[X] = 0$ and $\mathbb{E}[U] = b$, where $C_X \succ 0$. Consider a linear regressor $\hat{U} = W Z + b$, where $W \in \mathbb{R}^{d_U \times r}$ is the parameter matrix and $Z \in \mathbb{R}^{r}$ is an encoded version of $X$ for a given $\Theta \colon X \mapsto Z = \Theta X$, $\Theta \in \mathbb{R}^{r \times d_X}$.
The minimum MSE that can be achieved by designing $W$ is
\[
\min_W \mathbb{E}\big[\|U - \widehat{U}\|_2^2\big] = \mathrm{Tr}[C_U] - \big\|P_M\, Q_X^{-T} C_{XU}\big\|_F^2,
\]
where $M = Q_X \Theta^T \in \mathbb{R}^{d_X \times r}$ and $C_X = Q_X^T Q_X$ (Cholesky factorization).

Proof. Direct calculation yields
\begin{align*}
J_U &:= \mathbb{E}\big[\|U - \widehat{U}\|^2\big] \\
&= \mathrm{Tr}\,\mathbb{E}\big[(U - b - WZ)(U - b - WZ)^T\big] \\
&= \mathrm{Tr}\,\mathbb{E}\big[(U-b)(U-b)^T + (W\Theta X)(W\Theta X)^T - (U-b)(W\Theta X)^T - (W\Theta X)(U-b)^T\big] \\
&= \mathrm{Tr}\big[C_U + (W\Theta)\, C_X (W\Theta)^T - C_{UX}(W\Theta)^T - (W\Theta)\, C_{UX}^T\big] \\
&= \mathrm{Tr}\big[C_U + (W\Theta Q_X^T)(W\Theta Q_X^T)^T - C_{UX}(W\Theta)^T - (W\Theta)\, C_{UX}^T\big] \\
&= \mathrm{Tr}\big[(W\Theta Q_X^T - C_{UX} Q_X^{-1})(W\Theta Q_X^T - C_{UX} Q_X^{-1})^T + C_U - (C_{UX} Q_X^{-1})(C_{UX} Q_X^{-1})^T\big] \\
&= \big\|M W^T - Q_X^{-T} C_{XU}\big\|_F^2 - \big\|Q_X^{-T} C_{XU}\big\|_F^2 + \mathrm{Tr}[C_U],
\end{align*}
where $M = Q_X \Theta^T$. Hence, the minimizer of $J_U$ is obtained by minimizing the first term in the last equation, which is a standard least-squares problem; its minimizer is $W^T = M^{\dagger} Q_X^{-T} C_{XU}$. Finally, using the orthogonal decompositions
\[
\big\|Q_X^{-T} C_{XU}\big\|_F^2 = \big\|P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2
\]
and
\begin{align*}
\big\|M W^T - Q_X^{-T} C_{XU}\big\|_F^2 &= \big\|M W^T - P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2 \\
&= \big\|\underbrace{M M^{\dagger}}_{P_M} Q_X^{-T} C_{XU} - P_M Q_X^{-T} C_{XU}\big\|_F^2 + \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2 \\
&= \big\|P_{M^{\perp}} Q_X^{-T} C_{XU}\big\|_F^2,
\end{align*}
we obtain the minimum value of $J_U$ as $\mathrm{Tr}[C_U] - \big\|P_M Q_X^{-T} C_{XU}\big\|_F^2$.

A.2 Relation Between the Constrained Optimization Problem in (3.7) and Its Scalarization in (3.8)

Consider the optimization problem in (3.7),
\[
G_{\alpha} = \arg\min_G J_Y(G), \quad \text{s.t. } J_S(G) \geq \alpha, \tag{1}
\]
and the optimization problem in (3.8),
\[
G_{\lambda} = \arg\min_G J_{\lambda}(G), \tag{2}
\]
where $J_{\lambda}(G) = (1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$, $\lambda \in [0, 1)$.

Claim. For each $\lambda \in [0, 1)$, the solution $G_{\lambda}$ of (2) is also a solution of (1) with
\[
\alpha = J_S(G_{\lambda}). \tag{3}
\]

Proof. Let us consider (1) while assuming that (2) is satisfied. For each $\lambda$ and corresponding solution $G_{\lambda}$, let $\alpha$ be given as in (3). For an arbitrary $G$ satisfying $J_S(G) \geq \alpha$, we have
\[
(1-\lambda)\, J_Y(G_{\lambda}) - \lambda\, \alpha = (1-\lambda)\, J_Y(G_{\lambda}) - \lambda\, J_S(G_{\lambda}) \leq (1-\lambda)\, J_Y(G) - \lambda\, J_S(G), \tag{4}
\]
where the second step follows from the assumption that (2) is satisfied. Consequently, we have
\[
(1-\lambda)\, \big[J_Y(G) - J_Y(G_{\lambda})\big] \geq \lambda\, \big[J_S(G) - \alpha\big] \geq 0.
\]
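As a quick sanity check of Lemma 3.1, the closed form can be verified numerically by treating a finite, centered sample as the population. The sketch below (all names, dimensions, and the random data are illustrative, not from the thesis) compares a direct least-squares fit of $W$ against the formula $\mathrm{Tr}[C_U] - \|P_M Q_X^{-T} C_{XU}\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
dX, dU, r, n = 5, 3, 2, 400

# Treat a centered empirical distribution as the population.
X = rng.standard_normal((dX, n)); X -= X.mean(axis=1, keepdims=True)
U = rng.standard_normal((dU, dX)) @ X + 0.1 * rng.standard_normal((dU, n))
U -= U.mean(axis=1, keepdims=True)
Theta = rng.standard_normal((r, dX))          # fixed linear encoder, Z = Theta X

C_X  = X @ X.T / n                            # covariance of X (full rank here)
C_XU = X @ U.T / n
C_U  = U @ U.T / n

# Direct route: least-squares fit of W in U ~ W Theta X, then the empirical MSE.
Z = Theta @ X
W = np.linalg.lstsq(Z.T, U.T, rcond=None)[0].T
mse_direct = np.sum((U - W @ Z) ** 2) / n

# Lemma 3.1 route: C_X = Q_X^T Q_X (Cholesky), M = Q_X Theta^T, P_M projection.
L = np.linalg.cholesky(C_X)                   # C_X = L L^T, so Q_X = L^T
Q_X = L.T
M = Q_X @ Theta.T
P_M = M @ np.linalg.pinv(M.T @ M) @ M.T
mse_formula = np.trace(C_U) - np.linalg.norm(P_M @ np.linalg.inv(Q_X.T) @ C_XU, "fro") ** 2

assert np.isclose(mse_direct, mse_formula)    # the two routes agree
```

The two quantities agree to numerical precision, since the projection identity $Q_X^{-1} P_M Q_X^{-T} = \Theta^T (\Theta C_X \Theta^T)^{\dagger} \Theta$ reduces the formula to the usual least-squares residual.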
Since $J_S(G) \geq \alpha$, it follows that $J_Y(G) \geq J_Y(G_{\lambda})$, and consequently $G_{\lambda}$ is a minimizer of problem (1).

A.3 Proof of Theorem 3.2

Theorem. As a function of $G \in \mathbb{R}^{d_X \times r}$, the objective function in equation (3.8) is neither convex nor differentiable.

Proof. Recall that $P_G$ is equal to $G (G^T G)^{\dagger} G^T$. Therefore, due to the involvement of the pseudo-inverse, (3.8) is not differentiable (see [105]). For non-convexity, we use the fact that $f(G)$ is convex in $G \in \mathbb{R}^{d_X \times r}$ if and only if $h(t) = f(t\, G_1 + G_2)$ is convex in $t \in \mathbb{R}$ for any constant matrices $G_1, G_2 \in \mathbb{R}^{d_X \times r}$ (see [133]). In order to use this fact, consider the rank-one matrices
\[
G_1 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}
\quad \text{and} \quad
G_2 = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}.
\]
Define $G = t\, G_1 + G_2$. Then
\[
P_G(t) = G (G^T G)^{\dagger} G^T = \frac{1}{(t+1)^2 + 1} \begin{bmatrix} (t+1)^2 & (t+1) & 0 & \cdots & 0 \\ (t+1) & 1 & 0 & \cdots & 0 \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}.
\]
Using basic properties of the trace together with Lemma 3.1, we get
\[
(1-\lambda)\, J_Y(G) - \lambda\, J_S(G) = \mathrm{Tr}\big[P_G(t)\, B\big],
\]
where the matrix $B$ is given in (3.11). Writing $B$ as the symmetric matrix $[b_{ij}]$ with $b_{ij} = b_{ji}$, we obtain
\[
\mathrm{Tr}\big[P_G(t)\, B\big] = b_{11} + \frac{2 b_{12}(t+1) + b_{22} - b_{11}}{(t+1)^2 + 1}.
\]
It can be shown that this function of $t$ is convex only if $b_{12} = 0$ and $b_{11} = b_{22}$. On the other hand, if these two conditions hold, it can similarly be shown that $(1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$ is non-convex by considering a different pair of matrices $G_1$ and $G_2$. This implies that $(1-\lambda)\, J_Y(G) - \lambda\, J_S(G)$ is not convex.

A.4 Proof of Theorem 3.3

Theorem. Assume that the number of negative eigenvalues of $B$ in (3.11) is $j$, and denote $\gamma = \min\{r, j\}$. Then the minimum value in (3.9) is given as $\beta_1 + \beta_2 + \cdots + \beta_{\gamma}$, where $\beta_1 \leq \beta_2 \leq \cdots \leq \beta_{\gamma} < 0$ are the $\gamma$ smallest eigenvalues of $B$.
The minimum can be attained by $G = V$, where the columns of $V$ are the eigenvectors corresponding to the $\gamma$ smallest (negative) eigenvalues of $B$.

Proof. Consider the inner optimization problem of (3.10) in (3.9). Using the trace optimization problems and their solutions in [90], we get
\[
\min_{G^T G = I_i} J_{\lambda}(G) = \min_{G^T G = I_i} \mathrm{Tr}\big[G^T B G\big] = \beta_1 + \beta_2 + \cdots + \beta_i,
\]
where $\beta_1, \beta_2, \ldots, \beta_i$ are the $i$ smallest eigenvalues of $B$, and the minimum can be achieved by the matrix $V$ whose columns are the corresponding eigenvectors. If the number of negative eigenvalues of $B$ is less than $r$, then the optimal $i$ in (3.9) is $j$; otherwise, the optimal $i$ is $r$.

A.5 Non-Linear Extension Through Kernelization

We assume that $X$ is non-linearly mapped to $\phi_X(X)$ as illustrated in Figure 3.5. Recall from (2.6) that $Z = \Theta\, [k_X(x_1, X), \cdots, k_X(x_n, X)]^T$. For a given fixed $\Theta$, we have
\[
J_Y = \min_{W_Y, b_Y} \mathrm{MSE}(\widehat{Y} - Y).
\]
Note that the above optimization problem can be separated over $W_Y$ and $b_Y$. Therefore, for a given $W_Y$, we first minimize over $b_Y$:
\[
\min_{b_Y} \mathbb{E}\big[\|W_Y Z + b_Y - Y\|^2\big] = \min_{b_Y} \frac{1}{n} \sum_{k=1}^{n} \|W_Y z_k + b_Y - y_k\|^2 = \frac{1}{n} \sum_{k=1}^{n} \|W_Y z_k + c - y_k\|^2,
\]
where the minimizer $c$ is
\[
c = \frac{1}{n} \sum_{k=1}^{n} (y_k - W_Y z_k) = \frac{1}{n} \sum_{k=1}^{n} y_k - W_Y\, \frac{1}{n} \sum_{k=1}^{n} z_k = \mathbb{E}\{Y\} - W_Y\, \mathbb{E}\{Z\}. \tag{6}
\]
Let all the columns of $C$ be equal to $c$. Therefore, we now have
\begin{align*}
\min_{W_Y, b_Y} \mathrm{MSE}(\widehat{Y} - Y) &= \min_{W_Y} \frac{1}{n} \big\|W_Y \Theta K_X + C - Y\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|W_Y \Theta K_X H - \widetilde{Y}\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|H K_X \Theta^T W_Y^T - \widetilde{Y}^T\big\|_F^2 \\
&= \min_{W_Y} \frac{1}{n} \big\|M W_Y^T - P_M \widetilde{Y}^T\big\|_F^2 + \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \tag{7} \\
&= \frac{1}{n} \big\|\underbrace{M M^{\dagger}}_{P_M} \widetilde{Y}^T - P_M \widetilde{Y}^T\big\|_F^2 + \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|P_{M^{\perp}} \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|P_M \widetilde{Y}^T\big\|_F^2,
\end{align*}
where the second step is due to (6), $M = H K_X \Theta^T$, and the fourth step is the orthogonal decomposition w.r.t. $M$. Using the same approach, we get
\[
J_S = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|P_M \widetilde{S}^T\big\|_F^2. \tag{8}
\]
Assume that the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. For any $M$, there exists $G$ such that $L_X G = M$. In general, there is no bijection between $\Theta$ and $G$ in the equality $H K_X \Theta^T = L_X G$.
However, there is a bijection between $G$ and $\Theta$ when we restrict attention to $\Theta$'s for which $\mathcal{R}(\Theta^T) \subseteq \mathcal{N}(H K_X)^{\perp}$. This restricted bijection is sufficient, since for any $\Theta^T$ with columns in $\mathcal{N}(H K_X)$ we have $M = 0$. Once $G$ is determined, $\Theta^T$ can be obtained as
\[
\Theta^T = (H K_X)^{\dagger} L_X G + \Theta_0, \qquad \mathcal{R}(\Theta_0) \subseteq \mathcal{N}(H K_X).
\]
However, since
\[
\|\Theta\|_F^2 = \big\|\Theta^T\big\|_F^2 = \big\|(H K_X)^{\dagger} L_X G\big\|_F^2 + \|\Theta_0\|_F^2,
\]
choosing $\Theta_0 = 0$ results in the minimum $\|\Theta\|_F$, which is favorable in terms of robustness to noise. Similar to (3.6), we have $P_M = L_X P_G L_X^T$. If we assume that the rank of $P_G$ is $i$, then $J_{\lambda}(G)$ in (3.10) can be expressed (up to additive constants) as
\[
J_{\lambda}(G) = \lambda\, \big\|L_X G G^T L_X^T \widetilde{S}^T\big\|_F^2 - (1-\lambda)\, \big\|L_X G G^T L_X^T \widetilde{Y}^T\big\|_F^2,
\]
where $P_G = G G^T$ for some column-orthonormal matrix $G$. This resembles the optimization problem in (3.9) and therefore has the same solution as Theorem 3.3 with the modified
\[
B = L_X^T \big(\lambda\, \widetilde{S}^T \widetilde{S} - (1-\lambda)\, \widetilde{Y}^T \widetilde{Y}\big)\, L_X. \tag{9}
\]
Once $G$ is determined, $\Theta$ can be computed as $\Theta = G^T L_X^T (H K_X)^{\dagger}$.

A.6 Proof of Theorem 3.4

Theorem. Let the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. Further, assume that the columns of $V_S$ are the singular vectors corresponding to the zero singular values of $\widetilde{S} L_X$, and the columns of $V_Y$ are the singular vectors corresponding to the non-zero singular values of $\widetilde{Y} L_X$. Then, we have
\begin{align*}
\gamma_{\min} &:= \min_{\Theta} J_Y(\Theta) = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X\big\|_F^2, \\
\gamma_{\max} &:= \min_{\Theta \in \arg\max J_S(\Theta)} J_Y(\Theta) = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X V_S\big\|_F^2, \\
\alpha_{\min} &:= \max_{\Theta \in \arg\min J_Y(\Theta)} J_S(\Theta) = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{S} L_X V_Y\big\|_F^2, \\
\alpha_{\max} &:= \max_{\Theta} J_S(\Theta) = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2.
\end{align*}

Proof. Firstly, recall from the previous section that instead of $\Theta$ we consider $G$. These two matrices are related by $H K_X \Theta^T = L_X G = M$, where the columns of $L_X$ form an orthonormal basis for the column space of $H K_X$. Therefore, we can express the projection onto $M$ in terms of the projection onto $G$, i.e., $P_M = L_X P_G L_X^T$.
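The kernelized recipe of Section A.5 (an orthonormal basis $L_X$ for $HK_X$, the modified $B$ in (9), eigenvectors of the negative eigenvalues, and the recovery $\Theta = G^T L_X^T (HK_X)^{\dagger}$) can be sketched numerically as follows. This is an illustrative implementation, not the thesis code: the RBF kernel, median bandwidth, data, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, lam = 60, 2, 0.5

X = rng.standard_normal((n, 4))
Y = X @ rng.standard_normal((4, 1)) + 0.1 * rng.standard_normal((n, 1))
S = X @ rng.standard_normal((4, 1)) + 0.1 * rng.standard_normal((n, 1))

# RBF Gram matrix K_X (an illustrative kernel choice).
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K_X = np.exp(-sq / (2 * np.median(sq)))

H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
Yt = (H @ Y).T                                 # centered targets, tilde-Y
St = (H @ S).T                                 # centered sensitive attribute, tilde-S

# Orthonormal basis L_X for the column space of H K_X (via thin SVD).
Um, sv, _ = np.linalg.svd(H @ K_X)
L_X = Um[:, sv > 1e-10 * sv[0]]

# Modified B from (9); keep eigenvectors of its negative eigenvalues.
B = L_X.T @ (lam * St.T @ St - (1 - lam) * Yt.T @ Yt) @ L_X
w, V = np.linalg.eigh(B)                       # eigenvalues in ascending order
G = V[:, w < 0][:, :r]                         # at most r most-negative directions

# Recover the kernelized encoder: Theta = G^T L_X^T (H K_X)^dagger.
Theta = G.T @ L_X.T @ np.linalg.pinv(H @ K_X)
Z = Theta @ K_X                                # train-set embedding, (dim <= r) x n
assert Z.shape == (G.shape[1], n)
```

The pseudo-inverse recovery corresponds to the minimum-norm choice $\Theta_0 = 0$ discussed above, which is the preferred solution for noise robustness.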
Using (7), we get
\begin{align*}
\gamma_{\min} &= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{\Theta} \big\|P_M \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{G} \big\|L_X P_G L_X^T \widetilde{Y}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \max_{i} \Big\{ \max_{G^T G = I_i} \mathrm{Tr}\big[G^T L_X^T \widetilde{Y}^T \widetilde{Y} L_X G\big] \Big\} \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \mathrm{Tr}\big[V_Y^T L_X^T \widetilde{Y}^T \widetilde{Y} L_X V_Y\big] \tag{10} \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \sum_{\sigma_k > 0} \sigma_k^2 \\
&= \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X\big\|_F^2,
\end{align*}
where the fourth step is borrowed from the trace optimization problems studied in [90] and the $\sigma_k$'s are the singular values of $\widetilde{Y} L_X$.

In order to interpret the bounds in more detail, we consider the one-dimensional case where $X, Y \in \mathbb{R}$. In this setting, the correlation coefficient (denoted by $\rho(\cdot, \cdot)$) between $X$ and $Y$ is
\[
\rho(X, Y) = \frac{\widetilde{Y} \widetilde{X}^T}{\sqrt{\widetilde{Y} \widetilde{Y}^T\, \widetilde{X} \widetilde{X}^T}} = \frac{\big\|\widetilde{Y} L_X\big\|_F}{\sqrt{n}\, \sigma_Y} = \sqrt{1 - \frac{\gamma_{\min}}{\sigma_Y^2}}, \tag{11}
\]
where $\sigma_Y^2 = \|\widetilde{Y}\|_F^2 / n$. As a result, the normalized MSE can be expressed as
\[
\frac{\gamma_{\min}}{\sigma_Y^2} = 1 - \rho^2(X, Y). \tag{12}
\]
Therefore, the lower bound of the target's MSE is independent of the encoder and is only related to the alignment between the subspaces spanned by the data and the labels.

Next, we find an encoder that allows the target task to obtain its optimal loss, $\gamma_{\min}$, while leaking as little of the sensitive attribute as possible. Thus, we constrain the domain of the encoder to $\{\arg\min_{\Theta} J_Y(\Theta)\}$. Assume that the columns of the encoder $G$ are the concatenation of the columns of $V_Y$ together with at least one singular vector corresponding to a zero singular value of $\widetilde{Y} L_X$. Then $V_Y \subseteq G$, and consequently $\|L_X P_{V_Y} L_X^T U\|_F^2 \leq \|L_X P_G L_X^T U\|_F^2$ for an arbitrary matrix $U$. As a result, $J_S(G) \leq J_S(V_Y)$, while at the same time $J_Y(G) = J_Y(V_Y)$; hence the maximum of $J_S$ over this constrained domain is attained at $G = V_Y$. The latter equality can be observed from
\begin{align*}
\big\|L_X P_G L_X^T \widetilde{Y}^T\big\|_F^2 &= \big\|\widetilde{Y} L_X G G^T L_X^T\big\|_F^2 \\
&= \big\|\widetilde{Y} L_X V_Y V_Y^T L_X^T\big\|_F^2 \tag{13} \\
&= \big\|L_X P_{V_Y} L_X^T \widetilde{Y}^T\big\|_F^2.
\end{align*}
We then have
\begin{align*}
\alpha_{\min} &= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|L_X P_{V_Y} L_X^T \widetilde{S}^T\big\|_F^2 \\
&= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \mathrm{Tr}\big[V_Y^T L_X^T \widetilde{S}^T \widetilde{S} L_X V_Y\big] \tag{14} \\
&= \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{S} L_X V_Y\big\|_F^2.
\end{align*}
This bound can again be interpreted under the one-dimensional setting $X, S \in \mathbb{R}$ as
\[
\frac{\alpha_{\min}}{\sigma_S^2} = 1 - \rho^2(X, S). \tag{15}
\]
On the other hand, $\alpha_{\max}$ turns out to be
\[
\alpha_{\max} = \frac{1}{n} \big\|\widetilde{S}^T\big\|_F^2 = \sigma_S^2, \tag{16}
\]
which can be achieved via the trivial choice $G = 0$. However, we instead let the columns of $G$ be the singular vectors corresponding to all the zero singular values of $\widetilde{S} L_X$, in order to maximize $\|P_M \widetilde{Y}^T\|_F$ while at the same time ensuring that $J_S(G)$ equals $\alpha_{\max}$. As a result, we have
\[
\gamma_{\max} = \frac{1}{n} \big\|\widetilde{Y}^T\big\|_F^2 - \frac{1}{n} \big\|\widetilde{Y} L_X V_S\big\|_F^2.
\]
For the one-dimensional case, i.e., $X, Y, S \in \mathbb{R}$, we get $V_S = 0$ and consequently
\[
\gamma_{\max} = \sigma_Y^2. \tag{17}
\]

B.1 A Population Expression for the Definition in (5.6)

A population expression for $\mathrm{Dep}(Z, S)$ in (5.6) is given in the following:
\begin{align*}
\mathrm{Dep}(Z, S) = \sum_{j=1}^{r} \Big\{ &\mathbb{E}_{X,S,X',S'}\big[f_j(X)\, f_j(X')\, k_S(S, S')\big] \\
+\; &\mathbb{E}_X\big[f_j(X)\big]\, \mathbb{E}_{X'}\big[f_j(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] \\
-\; 2\, &\mathbb{E}_{X,S}\big[f_j(X)\, \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] \Big\},
\end{align*}
where $(X', S')$ is independent of $(X, S)$ with the same distribution $p_{XS}$.

Proof. We first note that this population expression is inspired by that of HSIC [81]. Consider the operator $\Sigma_{SX}$ induced by the linear functional $\mathrm{Cov}(\alpha(X), \beta_S(S)) = \langle \beta_S, \Sigma_{SX}\, \alpha \rangle_{\mathcal{H}_S}$. Then, it follows that
\begin{align*}
\mathrm{Dep}(Z, S) &= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \mathrm{Cov}^2\big(f_j(X), \beta_S(S)\big) \\
&= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \big\langle \beta_S, \Sigma_{SX} f_j \big\rangle_{\mathcal{H}_S}^2 \\
&\overset{(a)}{=} \sum_{j=1}^{r} \big\|\Sigma_{SX} f_j\big\|_{\mathcal{H}_S}^2 \\
&= \sum_{j=1}^{r} \big\langle \Sigma_{SX} f_j, \Sigma_{SX} f_j \big\rangle_{\mathcal{H}_S} \\
&\overset{(b)}{=} \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), (\Sigma_{SX} f_j)(S)\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), \langle k_S(\cdot, S), \Sigma_{SX} f_j \rangle_{\mathcal{H}_S}\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X), \mathrm{Cov}(f_j(X'), k_S(S', S))\big) \\
&= \sum_{j=1}^{r} \mathrm{Cov}\big(f_j(X),\; \mathbb{E}_{X',S'}[f_j(X')\, k_S(S, S')] - \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big) \\
&= \sum_{j=1}^{r} \Big\{ \mathbb{E}_{X,S,X',S'}\big[f_j(X)\, f_j(X')\, k_S(S, S')\big] + \mathbb{E}_X\big[f_j(X)\big]\, \mathbb{E}_{X'}\big[f_j(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] \\
&\qquad\quad - 2\, \mathbb{E}_{X,S}\big[f_j(X)\, \mathbb{E}_{X'}[f_j(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] \Big\},
\end{align*}
where $(a)$ is due to the Parseval relation for an orthonormal basis and $(b)$ is from the definition of $\Sigma_{SX}$.

B.2 Proof of Lemma 5.3

Lemma.
Let $K_X, K_S \in \mathbb{R}^{n \times n}$ be the Gram matrices corresponding to $\mathcal{H}_X$ and $\mathcal{H}_S$, respectively, i.e., $(K_X)_{ij} = k_X(x_i, x_j)$ and $(K_S)_{ij} = k_S(s_i, s_j)$, where the covariance is empirically estimated as
\[
\mathrm{Cov}\big(f_j(X), \beta_S(S)\big) \approx \frac{1}{n} \sum_{i=1}^{n} f_j(x_i)\, \beta_S(s_i) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{k=1}^{n} f_j(x_i)\, \beta_S(s_k).
\]
It follows that the corresponding empirical estimate of $\mathrm{Dep}(Z, S)$ is
\[
\mathrm{Dep}_{\mathrm{emp}}(Z, S) := \frac{1}{n^2} \big\|\Theta K_X H L_S\big\|_F^2, \tag{18}
\]
where $H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T$ is the centering matrix, and $L_S$ is a full column-rank matrix in which $L_S L_S^T = K_S$ (Cholesky factorization). Furthermore, the empirical estimator in (5.7) has a bias of $O(n^{-1})$ and a convergence rate of $O(n^{-1/2})$.

Proof. Firstly, let us reconstruct the orthonormal set $U_S$ when $n$ i.i.d. observations $\{s_j\}_{j=1}^{n}$ are given. Invoking the representer theorem, for two arbitrary elements $\beta_i$ and $\beta_m$ in $U_S$, we have
\begin{align*}
\langle \beta_i, \beta_m \rangle_{\mathcal{H}_S} &= \Big\langle \sum_{j=1}^{n} \alpha_j\, k_S(s_j, \cdot),\; \sum_{l=1}^{n} \eta_l\, k_S(s_l, \cdot) \Big\rangle_{\mathcal{H}_S} \\
&= \sum_{j=1}^{n} \sum_{l=1}^{n} \alpha_j\, \eta_l\, k_S(s_j, s_l) \\
&= \alpha^T K_S\, \eta \\
&= \big\langle L_S^T \alpha,\; L_S^T \eta \big\rangle_{\mathbb{R}^q},
\end{align*}
where $L_S \in \mathbb{R}^{n \times q}$ is a full column-rank matrix and $K_S = L_S L_S^T$ is the Cholesky factorization of $K_S$. As a result, searching for $\beta_i \in U_S$ is equivalent to searching for $L_S^T \alpha \in U_q$, where $U_q$ is any complete orthonormal set for $\mathbb{R}^q$. Using the empirical expression for the covariance, we get
\begin{align*}
\mathrm{Dep}_{\mathrm{emp}}(Z, S) &:= \sum_{j=1}^{r} \sum_{\beta_S \in U_S} \Big\{ \frac{1}{n} \sum_{i=1}^{n} f_j(x_i)\, \beta_S(s_i) - \frac{1}{n^2} \sum_{i=1}^{n} f_j(x_i) \sum_{k=1}^{n} \beta_S(s_k) \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X K_S\, \alpha - \frac{1}{n^2}\, \theta_j^T K_X \mathbf{1}_n \mathbf{1}_n^T K_S\, \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H K_S\, \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{L_S^T \alpha \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H L_S L_S^T \alpha \Big\}^2 \\
&= \sum_{j=1}^{r} \sum_{\zeta \in U_q} \Big\{ \frac{1}{n}\, \theta_j^T K_X H L_S\, \zeta \Big\}^2 \\
&= \frac{1}{n^2} \sum_{\zeta \in U_q} \big\|\Theta K_X H L_S\, \zeta\big\|_2^2 \\
&= \frac{1}{n^2} \big\|\Theta K_X H L_S\big\|_F^2,
\end{align*}
where $f(X) = \Theta\, [k_X(x_1, X), \cdots, k_X(x_n, X)]^T$ and $\Theta := [\theta_1, \cdots, \theta_r]^T$.

We now show that the bias of $\mathrm{Dep}_{\mathrm{emp}}(Z, S)$ for estimating $\mathrm{Dep}(Z, S)$ in (5.7) is $O\big(\frac{1}{n}\big)$. To
To 127 achieve this, we split Depepm (Z, S) into three terms as, 1 1 n o 2 T kΘK X HL S Fk = Tr ΘK X HK S HK X Θ n2 n2       1 1 T 1 T T = Tr ΘKX I − 11 KS I − 11 KX Θ n2 n n 1 n o 2 n o = Tr K Θ T ΘK K − Tr 1 T K ΘT ΘK K 1 X X S X X S n2 | {z } n | 3 {z } I II 1 n o + 4 Tr 1T KX ΘT ΘKX 11T KS 1 (19) n | {z } III Let cn p denote the set of all p-tuples drawn without replacement from {1, · · · , n}. Moreover, let Θ = [θ1 , · · · , θr ]T ∈ Rr×n and (A)ij denote the element of an arbitrary matrix A at i-th row and j-th column. Then, it follows that (I): h n oi E Tr KX ΘT ΘKX KS    r     X   = T E Tr KX θk θk KX KS    | {z }   k=1 :=αk Xr h n oi = E Tr αk αTk KS k=1   Xr X X  = E  (αk αTk )ii (KS )ii + (αk αTk )ij (KS )ji  k=1 i (i,j)∈cn 2 Xr h i = n EX,S fk2 (X)kS (S, S) k=1 X r n!   + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (n − 2)! k=1 Xr n!   = O(n) + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (20) (n − 2)! k=1 where (X, S) and (X 0 , S 0 ) are independently drawn from the joint distribution pXS . (II): h i E 1T KX ΘT ΘKX KS 1   r X   = E 1T KX θk θkT KX KS 1 | {z } k=1 αk 128 Xr h i = T T E 1 α k α k KS 1 k=1   Xr Xn X n X n = E (αk αTk )mi (KS )mj  k=1 m=1 i=1 j=1   r X X X  = E  (αk αTk )ii (KS )ii + (αk αTk )mm (KS )mj  k=1 i (m,j)∈cn2   Xr  X X  + E (αk αTk )mi (KS )mm + (αk αTk )mj (KS )mj  k=1 (m,i)∈cn2 (m,j)∈cn2   Xr X   + E (αk αTk )mi (KS )mj  k=1 (m,i,j)∈cn3 Xr h i = n EX,S fk2 (X)kS (S, S) k=1 n! X r h i + 2 EX,S,S 0 fk (X)kS (S, S ) 0 (n − 2)! k=1 X r n!   + EX,S,X 0 fk (X)fk (X 0 )kS (S, S) (n − 2)! k=1 X r n!   + EX,S,X 0 ,S 0 fk (X)fk (X 0 )kS (S, S 0 ) (n − 2)! k=1 X r n!   + EX,S fk (X)EX 0 [fk (X 0 )]ES 0 [kS (S, S 0 )] (n − 3)! k=1 X r n!   = EX,S fk (X)EX 0 [fk (X 0 )]ES 0 [kS (S, S 0 )] (n − 3)! k=1 + O(n2 ). 
(III):
\begin{align*}
\mathbb{E}\big[\mathbf{1}^T K_X \Theta^T \Theta K_X \mathbf{1} \mathbf{1}^T K_S \mathbf{1}\big] &= \sum_{k=1}^{r} \mathbb{E}\big[\mathbf{1}^T \alpha_k \alpha_k^T \mathbf{1}\, \mathbf{1}^T K_S \mathbf{1}\big] \\
&= \sum_{k=1}^{r} \mathbb{E}\Big[\sum_{i,j,m,l} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}\Big] \\
&= O(n^3) + \sum_{k=1}^{r} \mathbb{E}\Big[\sum_{(i,j,m,l) \in c^n_4} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}\Big] \\
&= \frac{n!}{(n-4)!} \sum_{k=1}^{r} \mathbb{E}_X[f_k(X)]\, \mathbb{E}_{X'}\big[f_k(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] + O(n^3). \tag{22}
\end{align*}
Using the above calculations together with Lemma 2 leads to
\[
\mathrm{Dep}(Z, S) = \mathbb{E}\big[\mathrm{Dep}_{\mathrm{emp}}(Z, S)\big] + O\Big(\frac{1}{n}\Big).
\]
We now obtain the convergence of $\mathrm{Dep}_{\mathrm{emp}}(Z, S)$. Consider the decomposition in (19) together with (20), (21), and (22). Let $\alpha_k := K_X \theta_k$; then it follows that
\begin{align*}
&\mathbb{P}\big\{\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S) \geq t\big\} \\
&\leq \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_{X,S,X',S'}\big[f_k(X) f_k(X')\, k_S(S, S')\big] - \frac{(n-2)!}{n!} \sum_{k=1}^{r} \sum_{(i,j) \in c^n_2} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ji} + O\Big(\frac{1}{n}\Big) \geq a t\Big\} \\
&\quad + \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_{X,S}\big[f_k(X)\, \mathbb{E}_{X'}[f_k(X')]\, \mathbb{E}_{S'}[k_S(S, S')]\big] - \frac{(n-3)!}{n!} \sum_{k=1}^{r} \sum_{(i,j,m) \in c^n_3} (\alpha_k \alpha_k^T)_{mi} (K_S)_{mj} + O\Big(\frac{1}{n}\Big) \geq b t\Big\} \\
&\quad + \mathbb{P}\Big\{\sum_{k=1}^{r} \mathbb{E}_X[f_k(X)]\, \mathbb{E}_{X'}\big[f_k(X')\big]\, \mathbb{E}_{S,S'}\big[k_S(S, S')\big] - \frac{(n-4)!}{n!} \sum_{k=1}^{r} \sum_{(i,j,m,l) \in c^n_4} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml} + O\Big(\frac{1}{n}\Big) \geq (1 - a - b)\, t\Big\},
\end{align*}
where $a, b > 0$ and $a + b < 1$. For convenience, we omit the $O\big(\frac{1}{n}\big)$ terms and add them back at the last stage. Define $\zeta := (X, S)$ and consider the following U-statistics [173]:
\begin{align*}
u_1(\zeta_i, \zeta_j) &= \frac{(n-2)!}{n!} \sum_{(i,j) \in c^n_2} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ij}, \\
u_2(\zeta_i, \zeta_j, \zeta_m) &= \frac{(n-3)!}{n!} \sum_{(i,j,m) \in c^n_3} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{mi} (K_S)_{mj}, \\
u_3(\zeta_i, \zeta_j, \zeta_m, \zeta_l) &= \frac{(n-4)!}{n!} \sum_{(i,j,m,l) \in c^n_4} \sum_{k=1}^{r} (\alpha_k \alpha_k^T)_{ij} (K_S)_{ml}.
\end{align*}
Then, from Hoeffding's inequality [173], it follows that
\[
\mathbb{P}\big\{\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S) \geq t\big\} \leq e^{\frac{-2 a^2 t^2 n}{2 r^2 M^2}} + e^{\frac{-2 b^2 t^2 n}{3 r^2 M^2}} + e^{\frac{-2 (1-a-b)^2 t^2 n}{4 r^2 M^2}},
\]
where we assumed that $k_S(\cdot, \cdot)$ is bounded by one and $f_k^2(x_i)$ is bounded by $M$ for any $k = 1, \cdots, r$ and $i = 1, \cdots, n$. Further, if $0.22 \leq a < 1$, it holds that
\[
e^{\frac{-2 a^2 t^2 n}{2 r^2 M^2}} + e^{\frac{-2 b^2 t^2 n}{3 r^2 M^2}} + e^{\frac{-2 (1-a-b)^2 t^2 n}{4 r^2 M^2}} \leq 3\, e^{\frac{-a^2 t^2 n}{r^2 M^2}}.
\]
Consequently, we have
\[
\mathbb{P}\big\{\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \geq t\big\} \leq 6\, e^{\frac{-a^2 t^2 n}{r^2 M^2}}.
\]
Therefore, with probability at least $1 - \delta$, it holds that
\[
\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{a^2\, n}} + O\Big(\frac{1}{n}\Big). \tag{23}
\]

B.3 Proof of Theorem 5.4

Theorem. Let $Z = f(X)$ be an arbitrary representation of the input data, where $f \in \mathcal{H}_X$. Then, there exists an invertible Borel function $h$ such that $h \circ f$ belongs to $\mathcal{A}_r$.

Proof. Recall that the space of disentangled representations is
\[
\mathcal{A}_r := \Big\{ (f_1, \cdots, f_r)\; \Big|\; f_i, f_j \in \mathcal{H}_X,\; \mathrm{Cov}\big(f_i(X), f_j(X)\big) + \gamma\, \langle f_i, f_j \rangle_{\mathcal{H}_X} = \delta_{i,j} \Big\},
\]
where $\gamma > 0$. Let $I_X$ denote the identity operator from $\mathcal{H}_X$ to $\mathcal{H}_X$. We claim that $h := [h_1, \cdots, h_r]$, where
\[
G_0 = \begin{bmatrix} \langle f_1, f_1 \rangle_{\mathcal{H}_X} & \cdots & \langle f_1, f_r \rangle_{\mathcal{H}_X} \\ \vdots & \ddots & \vdots \\ \langle f_r, f_1 \rangle_{\mathcal{H}_X} & \cdots & \langle f_r, f_r \rangle_{\mathcal{H}_X} \end{bmatrix}, \qquad G = G_0^{-1/2},
\]
\[
h_j \circ f = \sum_{m=1}^{r} g_{jm}\, (\Sigma_{XX} + \gamma I_X)^{-1/2} f_m, \qquad \forall j = 1, \cdots, r,
\]
is the desired invertible transformation. To see this, construct
\begin{align*}
\mathrm{Cov}\big(h_i(f(X)), h_j(f(X))\big) + \gamma\, \langle h_i \circ f, h_j \circ f \rangle_{\mathcal{H}_X} &= \big\langle h_i \circ f,\; (\Sigma_{XX} + \gamma I_X)\, h_j \circ f \big\rangle_{\mathcal{H}_X} \\
&= \Big\langle \sum_{m=1}^{r} g_{im} (\Sigma_{XX} + \gamma I_X)^{-1/2} f_m,\; \sum_{k=1}^{r} g_{jk} (\Sigma_{XX} + \gamma I_X) (\Sigma_{XX} + \gamma I_X)^{-1/2} f_k \Big\rangle_{\mathcal{H}_X} \\
&= \sum_{m=1}^{r} \sum_{k=1}^{r} g_{im}\, g_{jk}\, \langle f_m, f_k \rangle_{\mathcal{H}_X} = (G\, G_0\, G)_{ij} = \delta_{i,j}.
\end{align*}
The inverse of $h$ is $h' := [h'_1, \cdots, h'_r]$, where $H = G_0^{1/2}$ and
\[
h'_j \circ h = \sum_{m=1}^{r} h_{jm}\, (\Sigma_{XX} + \gamma I_X)^{1/2}\, h_m, \qquad \forall j = 1, \cdots, r.
\]

B.4 Proof of Theorem 5.7

Theorem. Assume that $k_S$ and $k_Y$ are bounded by one and $f_j^2(x_i) \leq M$ for any $j = 1, \ldots, r$ and $i = 1, \ldots, n$, for which $f = (f_1, \ldots, f_r) \in \mathcal{A}_r$. Then, for any $n > 1$ and $0 < \delta < 1$, with probability at least $1 - \delta$, we have
\[
\Big|\sup_{f \in \mathcal{A}_r} J(f, \lambda) - \sup_{f \in \mathcal{A}_r} J_{\mathrm{emp}}(f, \lambda)\Big| \leq r M \sqrt{\frac{\log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]

Proof. Recall that in the proof of Lemma 5.3 we have shown that with probability at least $1 - \delta$, the following inequality holds:
\[
\big|\mathrm{Dep}(Z, S) - \mathrm{Dep}_{\mathrm{emp}}(Z, S)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
Using the same reasoning for $\mathrm{Dep}(Z, Y)$, with probability at least $1 - \delta$, we have
\[
\big|\mathrm{Dep}(Z, Y) - \mathrm{Dep}_{\mathrm{emp}}(Z, Y)\big| \leq \sqrt{\frac{r^2 M^2 \log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
Since $J(f, \lambda) = (1 - \lambda)\, \mathrm{Dep}(Z, Y) - \lambda\, \mathrm{Dep}(Z, S)$ and $J_{\mathrm{emp}}(f, \lambda) := (1 - \lambda)\, \mathrm{Dep}_{\mathrm{emp}}(Z, Y) - \lambda\, \mathrm{Dep}_{\mathrm{emp}}(Z, S)$, it follows that with probability at least $1 - \delta$,
\[
\big|J(f, \lambda) - J_{\mathrm{emp}}(f, \lambda)\big| \leq r M \sqrt{\frac{\log(6/\delta)}{0.22^2\, n}} + O\Big(\frac{1}{n}\Big).
\]
We complete the proof by noting that the following inequality holds for any bounded $J$ and $J_{\mathrm{emp}}$:
\[
\Big|\sup_{f \in \mathcal{A}_r} J(f, \lambda) - \sup_{f \in \mathcal{A}_r} J_{\mathrm{emp}}(f, \lambda)\Big| \leq \sup_{f \in \mathcal{A}_r} \big|J(f, \lambda) - J_{\mathrm{emp}}(f, \lambda)\big|.
\]
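To make the estimator of Lemma 5.3 concrete, the quantity in (18) takes only a few lines to compute. The sketch below is illustrative, not taken from the thesis: it assumes RBF kernels with a median-heuristic bandwidth, uses an eigen-factorization of $K_S$ in place of a Cholesky factor (so that rank-deficient Gram matrices are handled), and evaluates $\mathrm{Dep}_{\mathrm{emp}}$ once with an input that is a function of the sensitive attribute and once with an independent input:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 200, 2

def rbf_gram(A):
    """RBF Gram matrix with a median-heuristic bandwidth (illustrative choice)."""
    sq = ((A[:, None] - A[None]) ** 2).sum(-1)
    return np.exp(-sq / (2 * np.median(sq) + 1e-12))

def dep_emp(Theta, K_X, K_S):
    """Empirical dependence (18): (1/n^2) * ||Theta K_X H L_S||_F^2."""
    n = K_X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # Full column-rank factor L_S with L_S L_S^T = K_S; an eigen-factorization
    # is used here because a Gram matrix can be rank-deficient, which would
    # defeat a plain Cholesky factorization.
    w, V = np.linalg.eigh(K_S)
    keep = w > 1e-10 * w.max()
    L_S = V[:, keep] * np.sqrt(w[keep])
    return np.linalg.norm(Theta @ K_X @ H @ L_S, "fro") ** 2 / n ** 2

S = rng.standard_normal((n, 1))
X_dep = S.copy()                                # input is a function of S
X_ind = rng.standard_normal((n, 1))             # input independent of S

Theta = rng.standard_normal((r, n)) / n         # a fixed (untrained) encoder
K_S = rbf_gram(S)
d_dep = dep_emp(Theta, rbf_gram(X_dep), K_S)
d_ind = dep_emp(Theta, rbf_gram(X_ind), K_S)
assert d_dep > d_ind                            # dependence is detected
```

Consistent with the $O(n^{-1})$ bias and $O(n^{-1/2})$ convergence established above, the estimate in the independent case shrinks toward zero as $n$ grows, while the dependent case stays bounded away from it.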